Introduction to Pandas (tutorial)#

Pandas is a Python package that contains functionality for data analysis and manipulation. It is mostly used for “table-like” data, i.e., data that can be organized in 2D tables, with observations across rows and variables across columns. With Pandas, you can do most of the data processing routines you might know from R. (R contains more extensive statistics-related functionality, though.)

Contents#

  1. Dataframes

  2. Indexing

  3. Dataframe operations

  4. Aggregation (advanced / optional)

Dataframes#

Alright, let’s get started. The Pandas package is usually imported as follows:

import pandas as pd

Although Pandas contains several functions for processing data, the package is very object-oriented. Specifically, most of its functionality is implemented in the DataFrame class, which is similar to R’s dataframe (or tibble).

Now, you can actually create a DataFrame object using different methods. You can create it from scratch (in Python) or you can load it from an external file, like a CSV (comma-separated values) file or an Excel file. Loading from a file is probably the most common approach, so let’s take a look at that. We created some data from a (very small) hypothetical working memory experiment, which we saved in a CSV file (where each column is separated by commas, hence the name). To load this data, we will use the function pd.read_csv:

df = pd.read_csv('example_data.csv')
df.index = [0, 1, 2, 3]  # you may ignore this line
Note: Pandas can read data from more than just CSV files, including Excel files (using pd.read_excel), JSON files (using pd.read_json), or even SQL databases (using pd.read_sql)! In this course, though, we'll mainly work with CSV files.
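To give you an idea of how similar these readers are, here is a small sketch of pd.read_json. It uses a tiny inline JSON string (made up purely for illustration) wrapped in io.StringIO, so it runs without any file on disk:

```python
import io
import pandas as pd

# A tiny JSON dataset, defined inline for illustration purposes
json_data = io.StringIO(
    '[{"participant_id": "sub-01", "age": 25},'
    ' {"participant_id": "sub-02", "age": 21}]'
)

# read_json accepts file-like objects, just like read_csv
df_json = pd.read_json(json_data)
print(df_json)
```

The resulting DataFrame looks just like one loaded from a CSV file: the JSON keys become the column names.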

To view its contents, we can just enter df as the last line in a code cell (so without print) and run the cell. This way, Jupyter notebooks will actually render the dataframes nicely:

df
participant_id age condition wm_score
0 sub-01 25 A 41
1 sub-02 21 B 48
2 sub-03 28 A 34
3 sub-04 25 B 39
ToDo: The pd.read_csv function is in fact very flexible and has a lot of arguments to tweak it to your needs. Check out its documentation to get an overview of its arguments! For example, try to load the same data (from example_data.csv), but include only the "age" and "wm_score" columns, and store this 2-column dataframe in a variable named df_todo.
""" Implement the ToDo here. """
### BEGIN SOLUTION
df_todo = pd.read_csv('example_data.csv', usecols=['age', 'wm_score'])
### END SOLUTION
""" Tests the above ToDo. """
if df_todo.columns.tolist() != ['age', 'wm_score']:
    raise ValueError('Something went wrong ... :-(')
else:
    print("Well done!")
Well done!

As mentioned, Pandas dataframes assume that their rows represent different observations (here: participants) and their columns represent different variables (here: participant ID, age, condition, and working memory score). Apart from the dataframe’s values (e.g., “sub-01”, “sub-02” and 25, 21), the dataframe also contains column names (here: “participant_id”, “age”, “condition”, “wm_score”) and row names (0, 1, 2, 3). The row names are usually referred to as the dataframe’s index. The column names can be accessed using the columns attribute:

# Actually, this is a specific object called an Index, but
# you can forget about that for now
df.columns
Index(['participant_id', 'age', 'condition', 'wm_score'], dtype='object')

… and the row index can be accessed using the index attribute:

# Again, it is a specific object, which you may ignore for now
df.index
Index([0, 1, 2, 3], dtype='int64')
Note: there is no reason that the column names should be strings and the index should be integers. It is completely valid to use, for example, integers as column names and strings as index values!

Creating dataframes from scratch#

In some situations, you may want to create a DataFrame object from scratch. Although there are different ways to do this, arguably the easiest way is to pass a dictionary to the DataFrame class:

data = {'var1': [1, 2, 3], 'var2': ['a', 'b',  'c']}
some_df = pd.DataFrame(data)
some_df
var1 var2
0 1 a
1 2 b
2 3 c

As you can see, the DataFrame class automatically infers the column names from the dictionary keys and creates a default integer index starting at 0.
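Another common construction method, sketched here with made-up values, is to pass a list of dictionaries, one per row; the keys again become the column names:

```python
import pandas as pd

# Each dictionary represents one row; the keys become the column names
rows = [
    {'var1': 1, 'var2': 'a'},
    {'var1': 2, 'var2': 'b'},
    {'var1': 3, 'var2': 'c'},
]
row_df = pd.DataFrame(rows)
print(row_df)
```

This produces exactly the same dataframe as the dictionary-of-lists version above; which form is more convenient depends on whether your data arrives row-wise or column-wise.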

ToDo: You can also specify the index explicitly by passing a list to the index parameter (see also the documentation). Using the variable data again, create a new dataframe (called df_todo) but use "x", "y", "z" for the index.
""" Implement your ToDo here. """
df_todo = pd.DataFrame(data, index=['x', 'y', 'z'])
""" Tests the above ToDo. """
if df_todo.index.tolist() != ['x', 'y', 'z']:
    raise ValueError("Something went wrong ...")
    
print("Yes! Well done!")
Yes! Well done!

Indexing#

One important operation you need to know is how to select (or “extract”) particular rows and/or columns. There are, in fact, multiple ways to do this, but we recommend sticking with loc and iloc.

Indexing with loc#

The loc indexer selects rows and/or columns based on their name (similar to indexing a dictionary). You always need to supply the loc indexer with both your row selection and your column selection. In general, it should be used as follows:

df_subset = df.loc[row_selection, column_selection]

where row_selection and column_selection may be either a single row/column name or a list of row/column names. For example, if you’d want to select the first two rows and the “age” and “wm_score” columns, you could do the following:

df_subset = df.loc[[0, 1], ['age', 'wm_score']]
df_subset
age wm_score
0 25 41
1 21 48

Note that other indexing methods also exist, like indexing columns directly (e.g., df['age']), or indexing columns by accessing them as attributes (e.g., df.age). We strongly suggest always using loc (and iloc), though, for consistency (also because this method is the most versatile).
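To make that equivalence concrete, here is a small sketch (using a made-up dataframe) showing that all three methods select the same column:

```python
import pandas as pd

demo = pd.DataFrame({'age': [25, 21, 28], 'wm_score': [41, 48, 34]})

# All three expressions select the same 'age' column (as a Series)
col_loc = demo.loc[:, 'age']
col_bracket = demo['age']
col_attr = demo.age

print(col_loc.equals(col_bracket) and col_loc.equals(col_attr))  # True
```

Note that the attribute style (demo.age) only works for column names that are valid Python identifiers and don’t clash with existing attributes, which is one more reason to prefer loc.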

ToDo: Select the second and fourth row and the "participant_id" and "condition" columns from the dataframe (df) and store the result in a new variable named df_todo.
""" Implement your ToDo here. """
### BEGIN SOLUTION
df_todo = df.loc[[1, 3], ['participant_id', 'condition']]
### END SOLUTION
""" Tests the above ToDo. """
if df_todo['participant_id'].tolist() != ['sub-02', 'sub-04']:
    raise ValueError("Something went wrong ...")

if df_todo['condition'].tolist() != ['B', 'B']:
    raise ValueError("Something went wrong ...")
    
print("Well done!")
Well done!

Note that you can also select all rows or columns at once using the colon (:). For example, if you’d want to select all rows from the “age” and “wm_score” columns, you can do the following:

df.loc[:, ['age', 'wm_score']]
age wm_score
0 25 41
1 21 48
2 28 34
3 25 39

Intermezzo: Series vs. Dataframes#

Before continuing with position-based indexing using iloc, we need to quickly discuss pandas Series objects. Series are basically single-column (or single-row) dataframes. For example, if we’d select a single column from our example dataframe, we’d have a Series object:

srs = df.loc[:, 'age']  # note the absence of the list
srs
0    25
1    21
2    28
3    25
Name: age, dtype: int64
type(srs)
pandas.core.series.Series

As such, you can think of DataFrame objects as collections of Series objects. As opposed to DataFrame objects, Series objects do not have column names (although the original column name can be accessed through the name attribute), but they do have an index:

srs.index
Index([0, 1, 2, 3], dtype='int64')

We won’t discuss Series further in this notebook, as in practice you’ll mostly work with DataFrame objects (but we wanted to show them to you in case you encounter them).

Tip: Want to select a single column/row, but still want pandas to return a dataframe? Make sure you use a list to index the single column/row!

That is, use:
df.loc[:, ['age']]
and not:
df.loc[:, 'age']
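A quick type check (on a made-up dataframe) illustrates the difference between the two:

```python
import pandas as pd

demo = pd.DataFrame({'age': [25, 21], 'wm_score': [41, 48]})

single = demo.loc[:, 'age']    # a Series
framed = demo.loc[:, ['age']]  # a single-column DataFrame

print(type(single).__name__, type(framed).__name__)  # Series DataFrame
```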

Indexing with iloc#

Instead of selecting subsets from dataframes based on their column names and row names (index) using loc, you can also select them based on their “integer location” using iloc. This method of indexing follows the same rules as indexing lists. For example, you can select specific columns, e.g., the first and fourth column:

df.iloc[:, [0, 3]]
participant_id wm_score
0 sub-01 41
1 sub-02 48
2 sub-03 34
3 sub-04 39

But you can also use negative indices (counting from the end) or slices:

# Take the first and last row and every second column starting at 1
# i.e., all odd columns
df.iloc[[0, -1], 1::2]
age wm_score
0 25 41
3 25 39
ToDo: Using a row and column slice, select the first three rows and the last three columns of the df dataframe and store it in a new variable named df_todo.
""" Implement your ToDo here. """
### BEGIN SOLUTION
df_todo = df.iloc[:3, 1:]
### END SOLUTION
""" Tests the above ToDo. """
if df_todo.index.tolist() != [0, 1, 2]:
    raise ValueError("Row index is incorrect!")
    
if df_todo.columns.tolist() != ['age', 'condition', 'wm_score']:
    raise ValueError("Column index is incorrect!")

Boolean indexing#

A last, slightly trickier, indexing method is boolean indexing. This method is often used to select a specific subset of rows from a dataframe. It essentially boils down to creating a Series object with boolean values (i.e., True and False values) and using that in combination with loc.

For example, suppose that you want to select all participants (i.e., rows) who are older than 21. First, you need to create a boolean series:

b_idx = df.loc[:, 'age'] > 21
b_idx
0     True
1    False
2     True
3     True
Name: age, dtype: bool

… and then, you can use it together with loc:

df.loc[b_idx, :]
participant_id age condition wm_score
0 sub-01 25 A 41
2 sub-03 28 A 34
3 sub-04 25 B 39

You can, of course, also do this within a single line of code:

df.loc[df.loc[:, 'age'] > 21, :]
participant_id age condition wm_score
0 sub-01 25 A 41
2 sub-03 28 A 34
3 sub-04 25 B 39
ToDo: Using boolean indexing, select all participants in condition A and store the result in a new variable named conA_pp.
""" Implement your ToDo here. """
### BEGIN SOLUTION
conA_pp = df.loc[df.loc[:, 'condition'] == 'A', :]
### END SOLUTION
""" Tests the above ToDo. """
if len(conA_pp.columns) != 4:
    raise ValueError("Make sure you only filter the rows, not the columns!")
    
if conA_pp.index.tolist() != [0, 2]:
    raise ValueError("Hmm, the indexing operation didn't go as expected ...")
    
print("WELL DONE!")
WELL DONE!
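Boolean series can also be combined, using & (and) and | (or). This goes slightly beyond the examples above, but here is a sketch (on a made-up dataframe); note that each condition needs its own parentheses, because & binds more tightly than comparison operators:

```python
import pandas as pd

demo = pd.DataFrame({
    'age': [25, 21, 28, 25],
    'condition': ['A', 'B', 'A', 'B'],
})

# Select rows where age > 21 AND condition equals 'B';
# the parentheses around each comparison are required
mask = (demo.loc[:, 'age'] > 21) & (demo.loc[:, 'condition'] == 'B')
print(demo.loc[mask, :])
```

Only the last row (age 25, condition 'B') satisfies both conditions here.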

Adding and removing rows and columns#

When working with dataframes, at some point you may want to add columns and/or rows. You can actually use the loc and iloc indexers for this as well. For example, if you would like to add a new column named “nationality” to the dataframe, you can do the following:

# Note: no list this time
df.loc[:, 'nationality'] = ['Dutch', 'British', 'Dutch', 'German']
df
participant_id age condition wm_score nationality
0 sub-01 25 A 41 Dutch
1 sub-02 21 B 48 British
2 sub-03 28 A 34 Dutch
3 sub-04 25 B 39 German

In the above cell we basically used loc to assign a list of values to a new column named “nationality”. Importantly, this list should have the same number of values as the number of rows of the dataframe.
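New columns don’t have to be typed out by hand; you can also compute them from existing columns. A sketch (on a made-up dataframe, so we leave df untouched here, and with an arbitrary derived variable just for illustration):

```python
import pandas as pd

demo = pd.DataFrame({'age': [25, 21, 28, 25], 'wm_score': [41, 48, 34, 39]})

# Derive a new column element-wise from existing columns
# (score per year of age; an arbitrary example variable)
demo.loc[:, 'score_per_year'] = demo.loc[:, 'wm_score'] / demo.loc[:, 'age']
print(demo)
```

Because the division is applied element-wise, the new column automatically has the same number of rows as the dataframe.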

ToDo: In the same way as shown above, you can add rows to the dataframe. You got some new data from "sub-05" (a 23 year old French participant who was part of condition B and had a working memory score of 51). Add this to the existing dataframe (df) in the code cell below. Make sure not to overwrite the existing data by choosing a new index value (e.g., 4).
""" Implement the ToDo here. """
### BEGIN SOLUTION
df.loc[4, :] = ['sub-05', 23, 'B', 51, 'French']
### END SOLUTION
""" Tests the above ToDo. """
if df.iloc[-1, :].loc['participant_id'] != 'sub-05':
    raise ValueError("Hmm, the participant_id of the new row does not seem to be correct ...")
    
if df.iloc[-1, :].loc['age'] != 23:
    raise ValueError("Hmm, the age of the new row does not seem to be correct ...")

if df.iloc[-1, :].loc['condition'] != 'B':
    raise ValueError("Hmm, the condition of the new row does not seem to be correct ...")

if df.iloc[-1, :].loc['wm_score'] != 51:
    raise ValueError("Hmm, the wm_score of the new row does not seem to be correct ...")

if df.iloc[-1, :].loc['nationality'] != 'French':
    raise ValueError("Hmm, the nationality of the new row does not seem to be correct ...")

print("CORRECT!")
CORRECT!

To delete a row or column, you can use the drop method (see the documentation for more details). You can drop both columns (using the columns argument) and rows (using the index argument) at the same time. For example, if we’d want to drop the data from “sub-05” and the entire “nationality” column, we could do the following:

df.drop(columns='nationality', index=4)
participant_id age condition wm_score
0 sub-01 25.0 A 41.0
1 sub-02 21.0 B 48.0
2 sub-03 28.0 A 34.0
3 sub-04 25.0 B 39.0

Note that the dataframe is not automatically updated! If we want to replace the original dataframe, we need to assign the result of the drop call to the original variable name (df):

df = df.drop(columns='nationality', index=4)
ToDo: Try running the above cell again. You'll get an error! Do you understand why?

Note that, instead of saving the output of the drop method in a new (or the same) variable, you can also modify the dataframe “in place” by setting the inplace argument to True (i.e., drop(inplace=True)). In my opinion, however, this often leads to bugs, so I’d recommend against using this approach.

ToThink: Suppose you assign the output of a call to drop(inplace=True), which doesn't actually return anything, to a new variable — do you know what happens? Try it out below (create a new code cell first).

Dataframe operations#

DataFrame (and Series) objects have a lot of methods that implement common computational and statistical operations. Often, these methods return a new DataFrame or Series object with the result from the operation. For example, a useful method is describe, which returns a set of descriptive statistics for all numeric columns (i.e., it will automatically skip non-numeric columns):

df.describe()
age wm_score
count 4.000000 4.000000
mean 24.750000 40.500000
std 2.872281 5.802298
min 21.000000 34.000000
25% 24.000000 37.750000
50% 25.000000 40.000000
75% 25.750000 42.750000
max 28.000000 48.000000

Separate operations, like computing the mean, are also available:

# Note that it returns a Series, because it consists
# of a single column (with two values)
df.mean(numeric_only=True)
age         24.75
wm_score    40.50
dtype: float64
ToDo: Compute the standard deviation of the "age" column only using the appropriate dataframe method (google this!) and store this single value in a new variable named age_sd.
""" Implement the ToDo here. """
### BEGIN SOLUTION
age_sd = df.loc[:, 'age'].std()
### END SOLUTION
""" Tests the above ToDo. """
if round(age_sd, 3) != 2.872:
    raise ValueError("Something went wrong ...")

print("Well done!")
Well done!

Saving DataFrames#

Just like Pandas is able to read data from different file formats, it can also save data from DataFrames to different file formats, including CSV files, JSON files, and databases. Again, for now, we’ll only focus on CSV files.

Importantly, saving data to disk is usually done using DataFrame methods. For example, to save your DataFrame data to a CSV file, you can use its to_csv method.

ToDo: Read the documentation of the to_csv method.

As you can see, the to_csv method has a lot of arguments you can use to specify exactly how you’d like the data to be saved. Often, you only need to specify the first argument, i.e., the path and name of the file you’d like to save the data to. For example, to save the data to a CSV file with the name “some_data.csv” to the current working directory, you’d run the following:

# We'll use the `df` DataFrame we used earlier
df.to_csv('some_data.csv')
ToDo: Check out the current directory (i.e., the same directory that contains this 3_pandas.ipynb notebook) to see the CSV file we just created!

If you open the CSV file, you’ll probably notice that its first column contains integers and lacks a column name: this is the DataFrame index! You likely don’t need this index. To avoid saving the index when using to_csv, you can set the argument index to False (i.e., index=False).
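A quick round trip shows the effect of index=False. This sketch writes to a temporary file (so we don’t clutter the working directory) and uses a made-up dataframe:

```python
import os
import tempfile
import pandas as pd

demo = pd.DataFrame({'age': [25, 21], 'wm_score': [41, 48]})

# Write without the index, then read the file back in
path = os.path.join(tempfile.gettempdir(), 'demo_no_index.csv')
demo.to_csv(path, index=False)
demo_back = pd.read_csv(path)

print(demo_back.columns.tolist())  # only the original columns survive
os.remove(path)
```

Without index=False, reading the file back would instead produce an extra, unnamed column containing the old index values.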

ToDo: Let's try something else. Save the df variable to a TSV file in the current working directory with the name other_data.tsv which contains tab-separated values instead of comma-separated values. Note: you still need to use the to_csv function! Make sure you don't save the index.
""" Implement the ToDo here. """
### BEGIN SOLUTION
df.to_csv('other_data.tsv', sep='\t', index=False)
### END SOLUTION
""" Tests the above ToDo. """
df_test = pd.read_csv('other_data.tsv', sep='\t')
assert(df_test.shape[1] == 4)
print("Awesome!")
Awesome!

Let’s remove the two files we created to not clutter our current directory.

import os
for f in ['some_data.csv', 'other_data.tsv']:
    if os.path.isfile(f):
        os.remove(f)

Aggregation (advanced / optional)#

In the context of data analysis and statistics, you may want to perform operations separately for different subsets of the data. For example, you may want to compute the mean age separately for the different conditions. You may know this aggregation operation as creating a “pivot table”. With pandas DataFrame objects, this can be easily done using the groupby method in combination with a particular operation.

In the groupby method, you define by which factor you want to group your data (e.g., “condition”) and you may subsequently “chain” it with another method (e.g, mean):

means = df.groupby('condition').mean(numeric_only=True)
means
age wm_score
condition
A 26.5 37.5
B 23.0 43.5

Note that the group factor becomes the index in the resulting dataframe! Again, the mean operation is only performed on the numeric columns. Explaining all of the functionality of the groupby method would probably require a separate tutorial, so let’s just finish with a relatively more difficult (and optional) ToDo.
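If you want several statistics per group at once, the agg method lets you list them; this goes slightly beyond what we covered above, so treat it as a sketch (on a made-up dataframe):

```python
import pandas as pd

demo = pd.DataFrame({
    'condition': ['A', 'B', 'A', 'B'],
    'wm_score': [41, 48, 34, 39],
})

# Compute both the mean and the standard deviation per condition;
# the result has one column per (variable, statistic) pair
stats = demo.groupby('condition').agg(['mean', 'std'])
print(stats)
```

The resulting columns form a so-called MultiIndex, with the variable name on the first level and the statistic on the second, e.g., stats.loc['A', ('wm_score', 'mean')].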

ToDo: Below, we simulate a dataset with reaction time data (in ms) across four groups ('A', 'B', 'C', 'D') of 25 participants each. Suppose that we want to perform, for each group separately, a one-sample t-test of reaction time (against a population mean of 300 ms). Create a dataframe with 3 columns ("mean", "std", and "t_value") and 4 rows ('A', 'B', 'C', 'D', representing the groups), where the "t_value" column should contain the t-values corresponding to the groups.
import pandas as pd
import numpy as np
np.random.seed(42)

data = np.random.normal(300, 70, 100)
cond = ['A', 'B', 'C', 'D'] * 25
df_final = pd.DataFrame({'rt': data, 'condition': cond})
df_final

### BEGIN SOLUTION
# Not necessarily the most elegant solution!
df_stats = df_final.groupby("condition").mean()
df_stats.columns = ['mean']
df_stats['std'] = df_final.groupby("condition").std()
df_stats['t_value'] = (df_stats.loc[:, 'mean'] - 300) / (df_stats.loc[:, 'std'] / np.sqrt(25))
### END SOLUTION
""" Tests the above ToDo. """
if df_stats.columns.tolist() != ['mean', 'std', 't_value']:
    raise ValueError("The dataframe does not contain the right columns!")
    
if df_stats.index.tolist() != ['A', 'B', 'C', 'D']:
    raise ValueError("The dataframe does not contain the right index!")

if df_stats.loc[:, 't_value'].round(4).tolist() != [-1.0839, -1.0983, -0.6441, 0.2139]:
    raise ValueError("T-values are not yet correct!")

print("YOU'RE AMAZING!")
YOU'RE AMAZING!