Introduction to Pandas (tutorial)#
Pandas is a Python package that contains functionality for data analysis and manipulation. It is used mostly for “table-like” data, i.e., data that can be organized in 2D tables with observations organized across rows and variables across columns. With Pandas, you can do most of the data processing routines you might know from R. (R contains more extensive statistics-related functionality, though.)
Contents#
Dataframes
Indexing
Dataframe operations
Aggregation (optional)
Dataframes#
Alright, let’s get started. The Pandas package is usually imported as follows:
import pandas as pd
Although Pandas contains several functions for processing data, the package is very object-oriented. Specifically, most of its functionality is implemented in the `DataFrame` class, which is similar to R’s dataframe (or tibble).
Now, you can actually create a `DataFrame` object using different methods. You can create it from scratch (in Python) or you can load it from an external file, like a CSV (comma-separated values) file or an Excel file. The latter is probably the most common, so let’s take a look at that. We created some data from a (very small) hypothetical working memory experiment, which we saved in a CSV file (where each column is separated by commas, hence the name). To load in this data, we will use the function `pd.read_csv`:
df = pd.read_csv('example_data.csv')
df.index = [0, 1, 2, 3] # you may ignore this line
To view its contents, we can just enter `df` as the last line in a code cell (so without `print`) and run the cell. This way, Jupyter notebooks will actually render the dataframe nicely:
df
|   | participant_id | age | condition | wm_score |
|---|---|---|---|---|
| 0 | sub-01 | 25 | A | 41 |
| 1 | sub-02 | 21 | B | 48 |
| 2 | sub-03 | 28 | A | 34 |
| 3 | sub-04 | 25 | B | 39 |
""" Implement the ToDo here. """
### BEGIN SOLUTION
df_todo = pd.read_csv('example_data.csv', usecols=['age', 'wm_score'])
### END SOLUTION
""" Tests the above ToDo. """
if df_todo.columns.tolist() != ['age', 'wm_score']:
raise ValueError('Something went wrong ... :-(')
else:
print("Well done!")
Well done!
As mentioned, Pandas dataframes assume that their rows represent different observations (here: participants) and their columns represent different variables (here: participant ID, age, condition, and working memory score). Apart from the dataframe’s values (e.g., “sub-01”, “sub-02” and 25, 21), the dataframe also contains column names (here: “participant_id”, “age”, “condition”, “wm_score”) and row names (0, 1, 2, 3). The row names are usually referred to as the dataframe’s index. The column names can be accessed using the `columns` attribute:
# Actually, this is a specific object called an Index, but
# you can forget about that for now
df.columns
Index(['participant_id', 'age', 'condition', 'wm_score'], dtype='object')
… and the row index can be accessed using the `index` attribute:
# Again, it is a specific object, which you may ignore for now
df.index
Index([0, 1, 2, 3], dtype='int64')
Creating dataframes from scratch#
In some situations, you may want to create a `DataFrame` object from scratch. Although there are different ways to do this, arguably the easiest way is to pass a dictionary to the `DataFrame` class:
data = {'var1': [1, 2, 3], 'var2': ['a', 'b', 'c']}
some_df = pd.DataFrame(data)
some_df
|   | var1 | var2 |
|---|---|---|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 3 | c |
As you can see, the `DataFrame` class automatically infers the column names from the dictionary keys and creates a default integer index starting at 0.
""" Implement your ToDo here. """
df_todo = pd.DataFrame(data, index=['x', 'y', 'z'])
""" Tests the above ToDo. """
if df_todo.index.tolist() != ['x', 'y', 'z']:
raise ValueError("Something went wrong ...")
print("Yes! Well done!")
Yes! Well done!
Indexing#
One important operation you need to know is how to select (or “extract”) particular rows and/or columns. There are, in fact, multiple ways to do this, but we recommend sticking with `loc` and `iloc`.
Indexing with `loc`#
The `loc` indexer selects rows and/or columns based on their name (similar to indexing a dictionary). You always need to supply the `loc` indexer both your row selection and your column selection. In general, it should be used as follows:

df_subset = df.loc[row_selection, column_selection]

where `row_selection` and `column_selection` may be either a single row/column name or a list of row/column names. For example, if you’d want to select the first two rows and the “age” and “wm_score” columns, you could do the following:
df_subset = df.loc[[0, 1], ['age', 'wm_score']]
df_subset
|   | age | wm_score |
|---|---|---|
| 0 | 25 | 41 |
| 1 | 21 | 48 |
Note that other indexing methods also exist, like indexing columns directly (e.g., `df['age']`), or indexing columns by accessing them as attributes (e.g., `df.age`). We strongly suggest always using `loc` (and `iloc`), though, for consistency (also because this method is the most versatile).
""" Implement your ToDo here. """
### BEGIN SOLUTION
df_todo = df.loc[[1, 3], ['participant_id', 'condition']]
### END SOLUTION
""" Tests the above ToDo. """
if df_todo['participant_id'].tolist() != ['sub-02', 'sub-04']:
raise ValueError("Something went wrong ...")
if df_todo['condition'].tolist() != ['B', 'B']:
raise ValueError("Something went wrong ...")
print("Well done!")
Well done!
Note that you can also select all rows or columns at once using the colon (`:`). For example, if you’d want to select all rows from the “age” and “wm_score” columns, you can do the following:
df.loc[:, ['age', 'wm_score']]
|   | age | wm_score |
|---|---|---|
| 0 | 25 | 41 |
| 1 | 21 | 48 |
| 2 | 28 | 34 |
| 3 | 25 | 39 |
Intermezzo: Series vs. Dataframes#
Before continuing with position-based indexing using `iloc`, we need to quickly discuss pandas `Series` objects. `Series` are basically single-column (or single-row) dataframes. For example, if we’d select a single column from our example dataframe, we’d get a `Series` object:
srs = df.loc[:, 'age'] # note the absence of the list
srs
0 25
1 21
2 28
3 25
Name: age, dtype: int64
type(srs)
pandas.core.series.Series
As such, you can think of `DataFrame` objects as collections of `Series` objects. As opposed to `DataFrame` objects, `Series` objects do not have column names (although the original column name can be accessed through the `name` attribute), but they do have an index:
srs.index
Index([0, 1, 2, 3], dtype='int64')
We won’t discuss `Series` further in this notebook, as in practice you’ll mostly work with `DataFrame` objects (but we wanted to show them to you in case you encounter them). So, if you want your selection to return a dataframe rather than a series, use `df.loc[:, ['age']]` and not `df.loc[:, 'age']`.
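To make that difference concrete, here is a minimal sketch (with a small made-up dataframe) showing that a bare column name yields a `Series`, while a one-element list yields a single-column `DataFrame`:

```python
import pandas as pd

demo = pd.DataFrame({'age': [25, 21], 'wm_score': [41, 48]})

as_series = demo.loc[:, 'age']    # bare name -> Series
as_frame = demo.loc[:, ['age']]   # list -> DataFrame (one column)

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
```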
Indexing with `iloc`#
Instead of selecting subsets from dataframes based on their column names and row names (index) using `loc`, you can also do so using the “integer location” of the subsets with `iloc`. This method of indexing follows the same rules as indexing lists. For example, you can simply select specific columns, e.g., the first and fourth column:
df.iloc[:, [0, 3]]
|   | participant_id | wm_score |
|---|---|---|
| 0 | sub-01 | 41 |
| 1 | sub-02 | 48 |
| 2 | sub-03 | 34 |
| 3 | sub-04 | 39 |
But you can also use negative indices (counting from the end) or slices:
# Take the first and last row and every second column starting at 1
# i.e., all odd columns
df.iloc[[0, -1], 1::2]
|   | age | wm_score |
|---|---|---|
| 0 | 25 | 41 |
| 3 | 25 | 39 |
""" Implement your ToDo here. """
### BEGIN SOLUTION
df_todo = df.iloc[:3, 1:]
### END SOLUTION
""" Tests the above ToDo. """
if df_todo.index.tolist() != [0, 1, 2]:
raise ValueError("Row index is incorrect!")
if df_todo.columns.tolist() != ['age', 'condition', 'wm_score']:
raise ValueError("Column index is incorrect!")
Boolean indexing#
A last, slightly trickier, indexing method is boolean indexing. This method is often used to select a specific subset of rows from a dataframe. It essentially boils down to creating a `Series` object with boolean values (i.e., `True` and `False`) and using that in combination with `loc`.
For example, suppose that you want to select all participants (i.e., rows) who are older than 21. First, you need to create a boolean series:
b_idx = df.loc[:, 'age'] > 21
b_idx
0 True
1 False
2 True
3 True
Name: age, dtype: bool
… and then, you can use it together with `loc`:
df.loc[b_idx, :]
|   | participant_id | age | condition | wm_score |
|---|---|---|---|---|
| 0 | sub-01 | 25 | A | 41 |
| 2 | sub-03 | 28 | A | 34 |
| 3 | sub-04 | 25 | B | 39 |
You can, of course, also do this within a single line of code:
df.loc[df.loc[:, 'age'] > 21, :]
|   | participant_id | age | condition | wm_score |
|---|---|---|---|---|
| 0 | sub-01 | 25 | A | 41 |
| 2 | sub-03 | 28 | A | 34 |
| 3 | sub-04 | 25 | B | 39 |
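Although the tutorial doesn’t cover it, you can also combine multiple boolean conditions with `&` (and) and `|` (or); note that pandas requires parentheses around each condition. A minimal sketch with made-up data:

```python
import pandas as pd

demo = pd.DataFrame({
    'age': [25, 21, 28, 25],
    'condition': ['A', 'B', 'A', 'B'],
})

# Parentheses around each condition are required, because `&`
# binds more tightly than the comparison operators in Python
older_A = demo.loc[(demo.loc[:, 'age'] > 21) & (demo.loc[:, 'condition'] == 'A'), :]
print(older_A.shape[0])  # 2 rows match
```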
""" Implement your ToDo here. """
### BEGIN SOLUTION
conA_pp = df.loc[df.loc[:, 'condition'] == 'A', :]
### END SOLUTION
""" Tests the above ToDo. """
if len(conA_pp.columns) != 4:
raise ValueError("Make sure you only filter the rows, not the columns!")
if conA_pp.index.tolist() != [0, 2]:
raise ValueError("Hmm, the indexing operation didn't go as expected ...")
print("WELL DONE!")
WELL DONE!
Adding and removing rows and columns#
When working with dataframes, at some point you may want to add columns and/or rows. You can actually use the `loc` and `iloc` indexers for this as well. For example, if you would like to add a new column named “nationality” to the dataframe, you can do the following:
# Note: no list this time
df.loc[:, 'nationality'] = ['Dutch', 'British', 'Dutch', 'German']
df
|   | participant_id | age | condition | wm_score | nationality |
|---|---|---|---|---|---|
| 0 | sub-01 | 25 | A | 41 | Dutch |
| 1 | sub-02 | 21 | B | 48 | British |
| 2 | sub-03 | 28 | A | 34 | Dutch |
| 3 | sub-04 | 25 | B | 39 | German |
In the above cell, we used `loc` to assign a list of values to a new column named “nationality”. Importantly, this list should have the same number of values as the number of rows of the dataframe.
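As an aside (behavior we are confident of, although not shown in the tutorial): if you assign a single scalar instead of a list, pandas repeats (“broadcasts”) it for every row:

```python
import pandas as pd

demo = pd.DataFrame({'participant_id': ['sub-01', 'sub-02', 'sub-03']})

# A scalar is broadcast to every row of the new column
demo.loc[:, 'site'] = 'Amsterdam'
print(demo.loc[:, 'site'].tolist())  # ['Amsterdam', 'Amsterdam', 'Amsterdam']
```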
""" Implement the ToDo here. """
### BEGIN SOLUTION
df.loc[4, :] = ['sub-05', 23, 'B', 51, 'French']
### END SOLUTION
""" Tests the above ToDo. """
if df.iloc[-1, :].loc['participant_id'] != 'sub-05':
raise ValueError("Hmm, the participant_id of the new row does not seem to be correct ...")
if df.iloc[-1, :].loc['age'] != 23:
raise ValueError("Hmm, the age of the new row does not seem to be correct ...")
if df.iloc[-1, :].loc['condition'] != 'B':
raise ValueError("Hmm, the condition of the new row does not seem to be correct ...")
if df.iloc[-1, :].loc['wm_score'] != 51:
raise ValueError("Hmm, the wm_score of the new row does not seem to be correct ...")
if df.iloc[-1, :].loc['nationality'] != 'French':
raise ValueError("Hmm, the nationality of the new row does not seem to be correct ...")
print("CORRECT!")
CORRECT!
To delete a row or column, you can use the `drop` method (see the documentation for more details). You can drop both columns (using the argument `columns`) and rows (using the argument `index`) at the same time. For example, if we’d want to drop the data from “sub-05” and the entire “nationality” column, we could do the following:
df.drop(columns='nationality', index=4)
|   | participant_id | age | condition | wm_score |
|---|---|---|---|---|
| 0 | sub-01 | 25.0 | A | 41.0 |
| 1 | sub-02 | 21.0 | B | 48.0 |
| 2 | sub-03 | 28.0 | A | 34.0 |
| 3 | sub-04 | 25.0 | B | 39.0 |
Note that the dataframe is not automatically updated! If we want to replace the original dataframe, we need to assign the result of the `drop` call to the original variable name (`df`):
df = df.drop(columns='nationality', index=4)
Note that, instead of saving the output of the `drop` method in a new (or the same) variable, you can also modify the dataframe “in place” by setting the `inplace` argument to `True` (i.e., `drop(inplace=True)`). In our opinion, however, this often leads to bugs, so we’d recommend against using this approach.
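The difference between the two approaches can be sketched as follows (with a tiny made-up dataframe):

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Option 1 (recommended): assign the result back to a variable
demo = demo.drop(columns='c')

# Option 2: modify in place; this returns None, so don't assign it!
demo.drop(columns='b', inplace=True)

print(demo.columns.tolist())  # ['a']
```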
Dataframe operations#
`DataFrame` (and `Series`) objects have a lot of methods that implement common computational and statistical operations. Often, these methods return a new `DataFrame` or `Series` object with the result of the operation. For example, a useful method is `describe`, which returns a set of descriptive statistics for all numeric columns (i.e., it will automatically skip non-numeric columns):
df.describe()
|   | age | wm_score |
|---|---|---|
| count | 4.000000 | 4.000000 |
| mean | 24.750000 | 40.500000 |
| std | 2.872281 | 5.802298 |
| min | 21.000000 | 34.000000 |
| 25% | 24.000000 | 37.750000 |
| 50% | 25.000000 | 40.000000 |
| 75% | 25.750000 | 42.750000 |
| max | 28.000000 | 48.000000 |
Separate operations, like computing the mean, are also available:
# Note that it returns a Series, because it consists
# of a single column (with two values)
df.mean(numeric_only=True)
age 24.75
wm_score 40.50
dtype: float64
""" Implement the ToDo here. """
### BEGIN SOLUTION
age_sd = df.loc[:, 'age'].std()
### END SOLUTION
""" Tests the above ToDo. """
if round(age_sd, 3) != 2.872:
raise ValueError("Something went wrong ...")
print("Well done!")
Well done!
Saving DataFrames#
Just like Pandas is able to read data from different file formats, it can also save data from `DataFrame` objects to different file formats, including CSV files, JSON files, and databases. Again, for now, we’ll only focus on CSV files.
Importantly, saving data to disk is usually done using `DataFrame` methods. For example, to save your `DataFrame` data to a CSV file, you can use its `to_csv` method.
The `to_csv` method has a lot of arguments you can use to specify exactly how you’d like the data to be saved. Often, you only need to specify the first argument, i.e., the path and name of the file you’d like to save the data to. For example, to save the data to a CSV file named “some_data.csv” in the current working directory, you’d run the following:
# We'll use the `df` DataFrame we used earlier
df.to_csv('some_data.csv')
If you open the CSV file, you’ll notice that its first column contains integers and lacks a column name: this is the `DataFrame` index! You likely don’t need this index. To avoid saving the index when using `to_csv`, you can set the argument `index` to `False` (i.e., `index=False`).
""" Implement the ToDo here. """
### BEGIN SOLUTION
df.to_csv('other_data.tsv', sep='\t', index=False)
### END SOLUTION
""" Tests the above ToDo. """
df_test = pd.read_csv('other_data.tsv', sep='\t')
assert(df_test.shape[1] == 4)
print("Awesome!")
Awesome!
Let’s remove the two files we created to not clutter our current directory.
import os
for f in ['some_data.csv', 'other_data.tsv']:
if os.path.isfile(f):
os.remove(f)
Aggregation (advanced / optional)#
In the context of data analysis and statistics, you may want to perform operations separately for different subsets of the data. For example, you may want to compute the mean age separately for the different conditions. You may know this aggregation operation as creating a “pivot table”. With pandas `DataFrame` objects, this can be easily done using the `groupby` method in combination with a particular operation.

In the `groupby` method, you define by which factor you want to group your data (e.g., “condition”) and you may subsequently “chain” it with another method (e.g., `mean`):
means = df.groupby('condition').mean(numeric_only=True)
means
| condition | age | wm_score |
|---|---|---|
| A | 26.5 | 37.5 |
| B | 23.0 | 43.5 |
Note that the group factor becomes the index of the resulting dataframe! Again, the mean operation is only performed for the numeric columns. Explaining all of the functionality of the `groupby` method would probably need a completely separate tutorial, so let’s just finish with a relatively more difficult (and optional) ToDo.
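As a small taste of what else `groupby` can do: its `agg` method lets you apply several operations at once, yielding one column per (variable, statistic) pair. A minimal sketch with made-up data:

```python
import pandas as pd

demo = pd.DataFrame({
    'condition': ['A', 'A', 'B', 'B'],
    'rt': [300.0, 320.0, 280.0, 260.0],
})

# Compute both the mean and the standard deviation per condition;
# the result has a two-level column index: ('rt', 'mean'), ('rt', 'std')
stats = demo.groupby('condition').agg(['mean', 'std'])
print(stats.loc['A', ('rt', 'mean')])  # 310.0
```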
import pandas as pd
import numpy as np
np.random.seed(42)
data = np.random.normal(300, 70, 100)
cond = ['A', 'B', 'C', 'D'] * 25
df_final = pd.DataFrame({'rt': data, 'condition': cond})
df_final
### BEGIN SOLUTION
# Not necessarily the most elegant solution!
df_stats = df_final.groupby("condition").mean()
df_stats.columns = ['mean']
df_stats['std'] = df_final.groupby("condition").std()
df_stats['t_value'] = (df_stats.loc[:, 'mean'] - 300) / (df_stats.loc[:, 'std'] / np.sqrt(25))
### END SOLUTION
""" Tests the above ToDo. """
if df_stats.columns.tolist() != ['mean', 'std', 't_value']:
raise ValueError("The dataframe does not contain the right columns!")
if df_stats.index.tolist() != ['A', 'B', 'C', 'D']:
raise ValueError("The dataframe does not contain the right index!")
if df_stats.loc[:, 't_value'].round(4).tolist() != [-1.0839, -1.0983, -0.6441, 0.2139]:
raise ValueError("T-values are not yet correct!")
print("YOU'RE AMAZING!")
YOU'RE AMAZING!