Introduction to Numpy (tutorial)#
This notebook is about Numpy, Python’s core library for numeric computing. Knowing how to use Numpy is essential in domains such as machine learning, neuroimaging analysis, and image processing.
Contents#
The ndarray
Indexing
Array math
Broadcasting
Let’s start by importing numpy, which is commonly done as follows:
import numpy as np
The ndarray#
The most important feature of Numpy is it’s core data structure: the numpy ndarray (which stands for n-dimensional array, referring to the fact that the array may be of any dimension: 1D, 2D, 3D, 180D … nD). This is, just like basic lists/dictionaries/integers/tuples, a data structure with its own syntax, operations, and methods, as we will explain below.
Python lists vs. numpy arrays#
Basically, numpy arrays are a lot like Python lists. The major difference, however, is that numpy arrays may contain only a single data-type, while Python lists may contain different data-types within the same list.
Let check this out:
# Python lists may contain mixed data-types: an integer, a float, a string, a list
python_list = [1, 2.5, "whatever", [3, 4, 5]]
for value in python_list:
print("%s is a: %s" % (str(value), type(value)))
1 is a: <class 'int'>
2.5 is a: <class 'float'>
whatever is a: <class 'str'>
[3, 4, 5] is a: <class 'list'>
Unlike Python lists, numpy only allows entries of the same data-type. This difference between Python lists and numpy arrays is basically the same as R lists (allow multiple data-types) versus R matrices/arrays (only allow one data type), and is also the same as MATLAB cells (allow multiple data-types) versus MATLAB matrices (only allow one data type).
In fact, if you try to make a numpy array with different data-types, numpy will force the entries into the same data-type (in a smart way), as is shown in the example below:
# Importantly, you often specify your arrays as Python lists first, and then convert them to numpy
to_convert_to_numpy = [1, 2, 3.5] # specify python list ...
numpy_array = np.array(to_convert_to_numpy) # ... and convert ('cast') it to numpy
for entry in numpy_array:
print(entry)
print('this is a: %s \n' % type(entry))
1.0
this is a: <class 'numpy.float64'>
2.0
this is a: <class 'numpy.float64'>
3.5
this is a: <class 'numpy.float64'>
As you can see, Numpy converted our original list (to_convert_to_numpy), which contained both integers and floats, to an array with only floats! You might think that such a data structure that only allows one single data type is not ideal. However, the very fact that it only contains a single data-type makes operations on numpy arrays extremely fast. For example, loops over numpy arrays are often way faster than loops over python lists. This is because, internally, Python has to check the data-type of each loop entry before doing something with that entry. Because numpy arrays one allow a single data-type, it only has to check for the entries’ data type once. If you imagine looping over an array or list of length 100,000, you probably understand that the numpy loop is way faster.
Let’s check out the speed difference between Python list operations and numpy array operations:
# %timeit is a cool 'feature' that you can use in Notebooks (no need to understand how it works)
# it basically performs a computation that you specify a couple of times and prints how long it took on average
results_python = %timeit -o [x * 2 for x in range(0, 100000)]
8.61 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And now let’s do the same with numpy:
results_numpy = %timeit -o np.arange(0, 10000) * 2
ratio = results_python.average / results_numpy.average
print("Numpy is %i times faster than base Python!" % ratio)
10.1 µs ± 9.62 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Numpy is 850 times faster than base Python!
You see that Numpy much faster! This really matters when you start doing more complex operations or large arrays!
Creating numpy arrays#
As shown an earlier example, numpy arrays can be created as follows:
Define a Python list, e.g.
my_list = [0, 1, 2]
Convert the list to a numpy array, e.g.
numpy_array = np.array(my_list)
Importantly, a simple Python list will be converted to a 1D numpy array, but a nested Python list will be converted to a 2D (or even higher-dimensional array). Nesting is simply combining different lists, separated by commans, as is shown here:
my_list = [1, 2, 3]
my_array = np.array(my_list)
print("A 1D (or 'flat') array:")
print(my_array, '\n')
my_nested_list = [[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]
my_2D_array = np.array(my_nested_list)
print("A 2D array:")
print(my_2D_array)
A 1D (or 'flat') array:
[1 2 3]
A 2D array:
[[1 2 3]
[4 5 6]
[7 8 9]]
and store it in a variable named vec.
Show code cell content
""" Implement the ToDo here. """
### BEGIN SOLUTION
vec = np.array([5, 3, 2, 8, 9])
### END SOLUTION
""" Tests the above ToDo. """
if 'vec' not in dir():
raise ValueError("You haven't created the variable 'vec' yet!")
if type(vec) != np.ndarray:
raise ValueError("vec is not a numpy array!")
if vec.ndim != 1:
raise ValueError("vec is not a 1D array!")
if vec.tolist() != [5, 3, 2, 8, 9]:
raise ValueError("vec does not have the right contents ...")
print("Yay!")
Yay!
Create the following matrix (2D array):
and store it in a variable named arr_2d. Hint: start by creating a nested python list (like we did with the my_nested_list variable) and then convert it to numpy.
Show code cell content
""" Implement the ToDo here. """
### BEGIN SOLUTION
arr_2d = np.array([
[5, 2, 8],
[8, 2, 1],
[4, 4, 4],
[1, 2, 3]
])
### END SOLUTION
""" Tests the above ToDo. """
if type(arr_2d) != np.ndarray:
raise ValueError("arr_2d is not a numpy array!")
if arr_2d.shape != (4, 3):
raise ValueError("arr_2d does not have the right shape!")
print("Yes! Well done!")
Yes! Well done!
Show code cell content
""" Implement the ToDo here. """
### BEGIN SOLUTION
arr_3d = np.array([
[
[1, 2],
[3, 4]
],
[
[5, 6],
[7, 8]
]
])
### END SOLUTION
""" Tests the above ToDo. """
if arr_3d.shape != (2, 2, 2):
raise ValueError("The array does not have the right dimensions!")
if arr_3d[0, 0, 0] != 1:
raise ValueError("Hmm, something went wrong ...")
if arr_3d[0, 0, 1] != 2:
raise ValueError("Hmm, something went wrong ...")
if arr_3d[1, 1, 1] != 8:
raise ValueError("Hmm, something went wrong ...")
As you can imagine, creating numpy arrays from nested lists becomes cumbersome if you want to create (large) arrays with more than 2 dimensions. There are, fortunately, a lot of other ways to create (‘initialize’) large, high-dimensional numpy arrays. One often-used method is to create an array with zeros using the numpy function np.zeros
. This function takes one (mandatory) argument, which is a tuple with the dimensions of your desired array:
my_desired_dimensions = (2, 5) # suppose I want to create a matrix with zeros of size 2 by 5
my_array = np.zeros(my_desired_dimensions)
print(my_array)
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
Show code cell content
""" Implement the ToDo here. """
### BEGIN SOLUTION
array_5d = np.zeros((2, 3, 5, 3, 7))
### END SOLUTION
""" Tests the above ToDo. """
from tests import test_create_array_with_zeros
test_create_array_with_zeros(array_5d)
Well done!
Using arrays with zeros is often used in what is called ‘pre-allocation’, in which you create an ‘empty’ array with only zeros and for example, ‘fill’ that array in a loop.
Below, we show you an example in which we pre-allocate an array with 5 zeros, and fill that in a for-loop with the squares of 1 - 5. Try to understand how it works! (You have to do something similar in the next ToDo!)
my_array = np.zeros(5)
print('Original zeros-array')
print(my_array)
for i in range(5): # notice the range function here! This loop now iterates over [0, 1, 2, 3, 4]
number_to_calculate_the_square_of = i + 1
my_array[i] = number_to_calculate_the_square_of ** 2
print('\nFilled array')
print(my_array)
Original zeros-array
[0. 0. 0. 0. 0.]
Filled array
[ 1. 4. 9. 16. 25.]
Show code cell content
""" Implement the ToDo here. """
my_array2 = np.zeros(8)
### BEGIN SOLUTION
for i in range(8):
my_array2[i] = 1 / (i + 1)
### END SOLUTION
""" Tests the above ToDo. """
from tests import test_fill_array_with_complement
test_fill_array_with_complement(my_array2)
AWESOME!
In addition to np.zeros
, you can create numpy arrays using other functions, like np.ones
and random
from the np.random
module:
ones = np.ones((5, 10)) # create an array with ones
print(ones, '\n')
rndom = np.random.random((5, 10)) # Create an array filled with random values (0 - 1 uniform)
print(rndom)
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
[[0.26519863 0.61210109 0.07390538 0.23448279 0.91801535 0.09771712
0.59989371 0.95954777 0.77371265 0.96163829]
[0.5918337 0.28657956 0.17395184 0.19858053 0.27070102 0.29039822
0.74757114 0.09582356 0.79228018 0.60746672]
[0.09362956 0.3003594 0.93882937 0.57015628 0.82258401 0.98592789
0.74375117 0.74636537 0.44392873 0.13059863]
[0.69194765 0.1913486 0.16140539 0.20675131 0.67467491 0.49576562
0.49252433 0.37714862 0.05598054 0.30890028]
[0.40887687 0.89940557 0.34818037 0.56836087 0.65853873 0.51483755
0.50701647 0.31754501 0.63856904 0.08532279]]
Numpy indexing#
Indexing (extracting a single value of an array) and slicing (extracting multiple values - a subset - from an array) of numpy arrays is largely the same as with regular Python lists and data structures from other scientific computing languages such as R and MATLAB. Let’s check out a couple of examples of a 1D array:
my_array = np.arange(10, 21) # numpy equivalent of list(range(10, 21))
print('Full array:')
print(my_array, '\n')
Full array:
[10 11 12 13 14 15 16 17 18 19 20]
print("Index the first element:")
print(my_array[0])
Index the first element:
10
print("Index the second-to-last element:")
print(my_array[-2])
Index the second-to-last element:
19
print("Slice from 5 until (not including!) 8")
print(my_array[5:8])
Slice from 5 until (not including!) 8
[15 16 17]
print("Slice from beginning until 4")
print(my_array[:4])
Slice from beginning until 4
[10 11 12 13]
Setting values in numpy arrays works the same way as lists:
my_array = np.arange(10, 21)
my_array[0] = 10000
print(my_array)
my_array[5:7] = 0
print(my_array)
[10000 11 12 13 14 15 16 17 18 19 20]
[10000 11 12 13 14 0 0 17 18 19 20]
Show code cell content
""" Implement the ToDo here. """
my_array = np.arange(3, 25)
### BEGIN SOLUTION
my_array[1::2] = 0.0
### END SOLUTION
''' Tests the above ToDo. '''
from tests import test_set_odd_indices_to_zero
test_set_odd_indices_to_zero(my_array)
Good job!
Multidimensional indexing#
Often, instead of working on and indexing 1D array, we’ll work with multi-dimensional (>1D) arrays. Indexing multi-dimensional arrays is, again, quite similar to other scientific computing languages.
Like indexing Python lists, indexing multidimensional numpy arrays is done with square brackets []
, in which you can put as many comma-delimited numbers as there are dimensions in your array.
For example, suppose you have a 2D array of shape \(3 \times 3\) and you want to index the value in the first row and first column. You would do this as follows:
my_array = np.zeros((3, 3)) # 3 by 3 array with zeros
indexed_value = my_array[0, 0]
print("Value of first row and first column: %.1f" % indexed_value)
Value of first row and first column: 0.0
Show code cell content
""" Implement the ToDo here. """
my_array = np.zeros((3, 3))
print(my_array)
# ToDo: set the value in the lower right corner to 1
### BEGIN SOLUTION
my_array[-1, -1] = 1.0
### END SOLUTION
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
""" Tests the ToDo above. """
from tests import test_set_lower_right_value_to_one
test_set_lower_right_value_to_one(my_array)
Superb!
In addition to setting specific slices to specific values, you can also extract sub-arrays using slicing/indexing. An important construct here is that you can use a single colon :
to select all values from a particular dimension. For example, if you want to select all column-values (second dimension) from only the first row (first dimension), do this:
some_2d_arr[0, :]
We’ll show you some examples below:
my_array = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print(my_array, '\n')
all_column_values_from_first_row = my_array[0, :]
print('First row')
print(all_column_values_from_first_row, '\n')
all_row_values_from_first_col = my_array[:, 0]
print('First column')
print(all_row_values_from_first_col)
[[1 2 3]
[4 5 6]
[7 8 9]]
First row
[1 2 3]
First column
[1 4 7]
So far, we only talked about 2D arrays, which are relatively easy to ‘understand’. In neuroimaging, however, we usually work with 3D, 4D, or even higher-dimensional arrays. These arrays are great for organizing data, but they are somewhat unintuitive. This takes some time to get used to!
To get you used to thinking in more than 2 (or 3) dimensions, consider the following scenario. Suppose that a researcher wants to test the efficacy of a particular medicine against high blood pressure. To do so, she/he measures the (average systolic) blood pressure of a group of twenty subjects every hour for 30 days when they’re not on the medication. The same subjects are then again measured for another 30 days, but then when they’re on medication.
After this period of data collection, the researcher has \(20\ \mathrm{(subjects)}\ \times\ 24\ \mathrm{(hours)}\ \times\ 30\ \mathrm{(days)}\ \times\ 2\ \mathrm{(conditions: off/on)} = 28800\) measurements! We can then organize the data in a 4D array, in which each factor (subjects/hours/days/conditions) represents a separate dimension (also called “axis”) of our array!*
* Note that this type of data is perhaps best represented in a “table-like” structure, such as a Dataframe in R or a Pandas DataFrame, the Python-equivalent of R’s dataframe). But you can ignore that for the sake of this example.
So, let’s generate some random data (from a normal distribution to generate ‘realistic’ blood pressure data) that could represent this blood pressure dataset:
np.random.seed(42) # this is not important for now
bp_data = np.random.normal(loc=100, scale=5, size=(20, 24, 30, 2))
(Note that we’re not printing the bp_data
variable here, because it’s way too much to visualize/interpret at once anyway!)
Now, suppose I would want to extract the blood pressure of subject 5 at 09.00 (AM) in the morning at day 21 when he/she was not on medication. In that case, I would do:
# Note that I count midnight as the first measurement
this_particular_datapoint = bp_data[4, 8, 20, 0]
print("Blood pressure of participant 5 at 09.00 (AM) on day 21 in "
"the no-medication (off) condition is %.3f" % this_particular_datapoint)
# Also, remember that Python is zero-based, so e.g. participant 5 is indexed by index 4!
Blood pressure of participant 5 at 09.00 (AM) on day 21 in the no-medication (off) condition is 100.567
Try to remember this: in multidimensional arrays, each dimension (axis) represents a different “attribute” of your data (e.g. subjects, time, condition, etc.). This concept is very important to understand in, for example, functional MRI analysis (which revolves around 4D data: 3 spatial dimensions and 1 time dimension)!
Show code cell content
""" Implement the ToDo here. """
### BEGIN SOLUTION
day_18_condition_on = bp_data[:, :, 17, 1]
### END SOLUTION
""" Tests the above ToDo. """
from tests import test_bloodpressure_index
test_bloodpressure_index(day_18_condition_on)
You're incredible!
Multidimensional indexing using slices (especially the single colon :
slice, i.e., selecting everything) is very common in scientific programming such as neuroimaging analyses. There is, however, yet another way of (multidimensional) indexing called “boolean indexing”.
In this type of indexing, you index an array with a boolean array (i.e. array with True and False values) of the same shape. Basically, when you’re indexing the array my_array
with boolean array bool_array
, you’re saying: “give me all values in my_array
that are True
at the same location in bool_array
!”
Let’s look at an example:
my_array = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print("The original array:\n")
print(my_array, '\n')
bool_array = np.array([[True, False, True],
[False, True, False],
[True, False, True]])
print("The boolean array:\n")
print(bool_array, '\n')
print('Result of indexing my_array with bool_array:\n')
print(my_array[bool_array])
The original array:
[[1 2 3]
[4 5 6]
[7 8 9]]
The boolean array:
[[ True False True]
[False True False]
[ True False True]]
Result of indexing my_array with bool_array:
[1 3 5 7 9]
Usually, you do not write out the boolean array in full (as we did above), but you base it on the data itself to “filter” it according to some criterion formalized as a logical statement (i.e., using the boolean operators >, <, ==, or !=). For example, suppose I want to extract only the values above 6 from the my_array
variable in the above example.
To do so, I could do the following:
my_array = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print("The original array:\n")
print(my_array, '\n')
bool_array = my_array > 6
print("The boolean array:\n")
print(bool_array, '\n')
print('Result of indexing my_array with bool_array:\n')
print(my_array[bool_array])
The original array:
[[1 2 3]
[4 5 6]
[7 8 9]]
The boolean array:
[[False False False]
[False False False]
[ True True True]]
Result of indexing my_array with bool_array:
[7 8 9]
Easy, right? Now try it yourself in the next ToDo!
Show code cell content
""" Implement the ToDo here. """
my_array = np.array([[0, 1, -1, -2],
[2, -5, 1, 4],
[10, -2, -4, 20]])
# Make a new boolean array below (name it whatever you want) ...
# And use it to index my_array and store it in a variable `square_is_larger_than_4`.
### BEGIN SOLUTION
square_is_larger_than_4 = my_array[my_array ** 2 > 4]
### END SOLUTION
""" Tests the above ToDo. """
from tests import test_boolean_indexing
test_boolean_indexing(square_is_larger_than_4)
EPIC!
Again, it’s very important to understand how to (effectively) index multidimensional numpy arrays using slices and boolean indexing, as we’ll use it a lot in this course!
Data-types#
Every numpy array is a grid of values of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy guesses the datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype.
Here are a couple of examples:
x1 = np.array([1, 2]) # Let numpy choose the datatype (here: int)
x2 = np.array([1.0, 2.0]) # Let numpy choose the datatype (here: float)
x3 = np.array([1, 2], dtype=np.float64) # Force a particular datatype (input: int, but converted to 64bit float)
x4 = np.array([-1, 0, 1, 2], dtype=bool) # Convert ints to booleans! 0 -> False, everthing else -> True
print('%r (%s)' % (x1, type(x1[0])))
print('%r (%s)' % (x2, type(x2[0])))
print('%r (%s)' % (x3, type(x3[0])))
print('%r (%s)' % (x4, type(x4[0])))
array([1, 2]) (<class 'numpy.int64'>)
array([1., 2.]) (<class 'numpy.float64'>)
array([1., 2.]) (<class 'numpy.float64'>)
array([ True, False, True, True]) (<class 'numpy.bool_'>)
Note that, after creating the array, you can still change the datatype by using the numpy method astype()
!
x = np.array([1, 2, 3, 4])
print(type(x[0]))
# Now convert it to floasts!
x = x.astype(float) # or astype(np.float)
print(type(x[0]))
<class 'numpy.int64'>
<class 'numpy.float64'>
Show code cell content
""" Implement the ToDo here. """
bool_array = np.array([True, True, False, False, True])
### BEGIN SOLUTION
bool2int_array = bool_array.astype(int)
print(bool2int_array)
### END SOLUTION
[1 1 0 0 1]
""" Tests the above ToDo (not for points). """
np.testing.assert_array_equal(bool2int_array, np.array([1, 1, 0, 0, 1]))
print("Well done!")
Well done!
Methods vs. functions#
In the previous notebooks, you’ve learned that, in addition to functions, ‘methods’ exist that are like functions of an object. You’ve seen examples of list methods, e.g. my_list.append(1)
, and string methods, e.g. my_string.replace('a', 'b')
.
Like lists and strings, numpy arrays have a lot of convenient methods that you can call (like the astype
method we saw earlier). Again, this is just like a function, but then applied to itself. Often, numpy provides both a function and method for simple operations.
Let’s look at an example:
my_array = np.arange(10) # creates a numpy array from 0 until (excluding!) 10
print(my_array, '\n')
mean_array = np.mean(my_array)
print('The mean of the array is: %f' % mean_array)
mean_array2 = my_array.mean()
print('The mean of the array (computed by its corresponding method) is: %f' % mean_array2)
print('Is the results from the numpy function the same as '
'the corresponding method? Answer: %s' % str(mean_array == mean_array2))
[0 1 2 3 4 5 6 7 8 9]
The mean of the array is: 4.500000
The mean of the array (computed by its corresponding method) is: 4.500000
Is the results from the numpy function the same as the corresponding method? Answer: True
If there is both a function and a method for the operation you want to apply to the array, it really doesn’t matter what you choose! Let’s look at some more (often used) methods of numpy ndarrays:
my_array = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
std_my_array = my_array.std() # same as np.std(array)
print("Standard deviation of my_array: %.3f" % std_my_array, '\n')
transpose_my_array = my_array.T # same as np.transpose(array)
print("Transpose of my_array:\n%r" % transpose_my_array, '\n')
min_my_array = my_array.min() # same as np.min(array)
print("Minimum of my_array: %i" % my_array.min(), '\n')
max_my_array = my_array.max() # same as np.max(array)
print("Maximum of my_array: %i" % max_my_array, '\n')
sum_my_array = my_array.sum() # same as np.sum(array)
print("Sum of my_array: %i" % sum_my_array, '\n')
Standard deviation of my_array: 2.582
Transpose of my_array:
array([[1, 4, 7],
[2, 5, 8],
[3, 6, 9]])
Minimum of my_array: 1
Maximum of my_array: 9
Sum of my_array: 45
Importantly, a method may or may not take arguments (input). If no arguments are given, it just looks like “object.method()”, i.e. two enclosing brackets with nothing in between. However, a method may take one or more arguments (like the my_list.append(1) method)! This argument may be named or unnamed - doesn’t matter. An example:
my_array2 = np.random.random((3, 3))
print('Original array:')
print(my_array2, '\n')
print('Use the round() method with the argument 3:')
print(my_array2.round(3), '\n')
print('Use the round() method with the named argument 5:')
print(my_array2.round(decimals=5), '\n')
Original array:
[[0.32710922 0.30512889 0.84403235]
[0.6239239 0.29343832 0.71361229]
[0.34459281 0.58655561 0.99263403]]
Use the round() method with the argument 3:
[[0.327 0.305 0.844]
[0.624 0.293 0.714]
[0.345 0.587 0.993]]
Use the round() method with the named argument 5:
[[0.32711 0.30513 0.84403]
[0.62392 0.29344 0.71361]
[0.34459 0.58656 0.99263]]
In addition to the methods listed above, you’ll probably see the following methods a lot in the code of others.
Reshaping arrays:
my_array = np.arange(10)
print(my_array.reshape((5, 2))) # reshape to desired shape
[[0 1]
[2 3]
[4 5]
[6 7]
[8 9]]
Ravel (“flatten”) an array:
temporary = my_array.reshape((5, 2))
print("Initial shape: %s" % (temporary.shape,))
print(temporary.ravel()) # unroll multi-dimensional array to single 1D array
print("Shape after ravel(): %s" % (temporary.ravel().shape,))
Initial shape: (5, 2)
[0 1 2 3 4 5 6 7 8 9]
Shape after ravel(): (10,)
# .dot() does matrix multiplication (dot product: https://en.wikipedia.org/wiki/Dot_product)
# This linear algebra operation is used very often in neuroimaging research
# (which depends heavily on the General Linear Model!)
array1 = np.array([0, 1, 2, 3])
array2 = np.array([4, 5, 6, 7])
dot_product = array1.dot(array2)
print(dot_product)
38
You may use the corresponding numpy min/max methods or functions in your custom function, doesn’t matter. Don’t forget to explicitly return the value of the range!
Note: this custom function that implements the mathematical formula for a vector’s range is completely unrelated to the Python function range that is often used in for-loops!
Show code cell content
""" Implement the ToDo here by completing the function. """
def calculate_range(arr):
''' Calculate the range of an array.
Parameters
----------
arr : a 1D numpy array
Returns
-------
The range of the input arr
'''
### BEGIN SOLUTION
return arr.max() - arr.min()
### END SOLUTION
""" Tests the above ToDo. """
outp = calculate_range(np.array([0, 1, 2, 3]))
if outp is None:
raise ValueError("Didn't get any output! Did you explicitly return the range?")
assert(outp == 3)
assert(calculate_range(np.array([-1, 0, 1, 2])) == 3)
assert(calculate_range(np.array([0, 0, 0, 0])) == 0)
print("Great job!")
Great job!
Alright, by now, if you see a variable followed by a word ending with enclosed brackets, e.g. my_array.mean()
, you’ll know that it’s a method! But sometimes you might see something similar, but without the brackets, such as my_array.size
. As you know, this .size
is called an attribute of the variable my_array
. The attribute may be of any data-type, like a string, integer, tuple, an array itself. Let’s look at a numpy-specific example:
my_array = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print(my_array, '\n')
print('The size (number of element) in the array is:')
print(my_array.size, '\n')
print('The .size attribute is of data-type: %s' % type(my_array.size))
[[1 2 3]
[4 5 6]
[7 8 9]]
The size (number of element) in the array is:
9
The .size attribute is of data-type: <class 'int'>
Again, you might not use attributes a lot during this course, but you’ll definitely see them around in the code of the tutorials. Below, some of the common ndarray attributes are listed:
my_array = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print('Size (number of elements) of array:')
print(my_array.size, '\n') # returns an integer
print('Shape of array:')
print(my_array.shape, '\n') # this is a tuple!
print('Number of dimensions:')
print(my_array.ndim) # this is an integer
Size (number of elements) of array:
9
Shape of array:
(3, 3)
Number of dimensions:
2
arr : a 1D numpy array
h0 : a scalar value (single number) that represents the value representing the null-hypothesis
The function should compute the one-sample t-test, which tests whether the mean of an array is significantly different from a given value representing the null-hypothesis. Formally, for any given array \(x\) and null-hypothesis \(h_{0}\):
Here: \(\bar{x}\) represents the mean of \(x\), \(s\) represents the standard deviation of \(x\), and \(N\) represents the length (‘size’) of x. So, in slightly less mathematical notation:
Make sure to return the t-value!
Hint 1: to compute \(N\), you can use the .size attribute of an array …
Hint 2: use the function np.sqrt(some_number) to calculate the square root …
Show code cell content
""" Implement the ToDo here. """
def compute_one_sample_ttest(arr, h0):
''' Computes the one-sample t-test for any array and h0.
Parameters
----------
arr : a 1D numpy array
h0 : an int or float
Returns
-------
A single value representing the t-value
'''
### BEGIN SOLUTION
t_val = (arr.mean() - h0) / (arr.std() / np.sqrt(arr.size - 1))
return t_val
### END SOLUTION
""" Tests the ToDo above. """
from tests import test_tvalue_computation
arr = np.random.randn(100)
outp = compute_one_sample_ttest(arr , 0)
if outp is None:
raise ValueError("Your function didn't return anything! Did you forget the return statement?")
print("Going to do 3 tests ...")
test_tvalue_computation(arr, 0, outp)
outp = compute_one_sample_ttest(arr, 5)
test_tvalue_computation(arr, 5, outp)
outp = compute_one_sample_ttest(arr, -3)
test_tvalue_computation(arr, -3, outp)
Going to do 3 tests ...
Correct! You stats wizard!
Correct! You stats wizard!
Correct! You stats wizard!
Numpy: array math#
Now you know all the numpy basics necessary to do neuroimaging analysis! As you’ll see in the last section (Working with nifti-images), we’ll work with 3D (structural MRI images) or 4D (functional MRI images) numpy arrays a lot. Given that you know how the basics about numpy in general and numpy ndarrays in particular, we can utilize some of numpy’s best features: (very fast) array math.
Basic mathematical functions operate elementwise on arrays, which means that the operation (e.g. addition) is applied onto each element in the array. So, let’s initialize a 1D array with ten zeros and let’s add 1 to it:
x = np.zeros(10)
print(x, '\n')
x += 1 # remember: this the same as x = x + 1
print(x)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Additionally, you can also sum two arrays together in an elementwise manner by simply writing: array_1 + array_2
, given that these two (or more) arrays are of the same shape! Let’s look at an example:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
print("x: \n%r" % x, '\n')
print("y: \n%r" % y, '\n')
print("x+y: \n%r" % (x + y), '\n')
x:
array([[1., 2.],
[3., 4.]])
y:
array([[5., 6.],
[7., 8.]])
x+y:
array([[ 6., 8.],
[10., 12.]])
Often, there exist function-equivalents of the mathematical operators. For example, x + y
is the same as np.add(x, y)
. However, it is recommended to use the operators wherever possible to improve readability of your code. See below for an example:
print(x + y, '\n')
print(np.add(x, y))
[[ 6. 8.]
[10. 12.]]
[[ 6. 8.]
[10. 12.]]
Next to addition, we can also do elementwise subtraction, multiplication, divison, square root, and exponentiation:
# Elementwise difference; both produce the array
print(x - y, '\n')
print(np.subtract(x, y)) # function-equivalent of above
[[-4. -4.]
[-4. -4.]]
[[-4. -4.]
[-4. -4.]]
# Elementwise product; both produce the array
print(x * y, '\n')
print(np.multiply(x, y))
[[ 5. 12.]
[21. 32.]]
[[ 5. 12.]
[21. 32.]]
# Elementwise division; both produce the array
# [[ 0.2 0.33333333]
# [ 0.42857143 0.5 ]]
print(x / y, '\n')
print(np.divide(x, y))
[[0.2 0.33333333]
[0.42857143 0.5 ]]
[[0.2 0.33333333]
[0.42857143 0.5 ]]
# Elementwise square root; there is no operator-equivalent!
print(np.sqrt(x))
[[1. 1.41421356]
[1.73205081 2. ]]
# Elementwise exponentiation
print(x ** y, '\n')
print(np.power(x, y))
[[1.0000e+00 6.4000e+01]
[2.1870e+03 6.5536e+04]]
[[1.0000e+00 6.4000e+01]
[2.1870e+03 6.5536e+04]]
Show code cell content
""" Implement the ToDo here. """
arr_A = np.arange(10).reshape((5, 2))
arr_B = np.arange(10, 20).reshape((5, 2))
### BEGIN SOLUTION
result_product_and_sum = (arr_A * arr_B) + 5
### END SOLUTION
""" Tests the above ToDo. """
from tests import test_array_product_and_sum
test_array_product_and_sum(result_product_and_sum)
Great!
Note that unlike Matlab, *
is elementwise multiplication, not matrix multiplication. We instead use the dot function (or method!) to compute matrix operations like inner products of vectors, multiplication of a vector by a matrix, and multiplication of two matrices.
These matrix operations are quite important in this course, because they are used a lot in neuroimaging methods (such as the General Linear Model, the topic of next week!). You’re not expected to fully understand these matrix operations, but you’ll see them several times in this course, so make sure you’re familiar with its implementation in Python/Numpy.
An example of the inner product (“dot product”) of two vectors in both the function-format and the method-format:
v = np.array([9,10])
w = np.array([11, 12])
# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))
219
219
Additionally, in Python 3.6 (and above), you can use the @
character as an operator for matrix multiplication of numpy arrays!
print(v @ w)
219
Probably the most used functions in numpy are the sum() and mean() fuctions (or equivalent methods!). A nice feature is that they can operate on the entire array (this is the default) or they can be applied per dimension (or, in numpy lingo, per “axis”).
Applying functions along axes is very common in scientific computing! Let’s look at an example in which we apply the sum
function/method across rows and columns:
x = np.array([[1, 2],[3, 4], [5, 6]])
print('Original array:')
print(x, '\n')
print('Sum over ALL elements of x:')
print(np.sum(x), '\n')
print('Sum across rows of x:')
print(np.sum(x, axis=0), '\n')
print('Sum across columns of x:')
print(x.sum(axis=1)) # this is the method form! Is exactly the same as np.sum(x, axis=1)
Original array:
[[1 2]
[3 4]
[5 6]]
Sum over ALL elements of x:
21
Sum across rows of x:
[ 9 12]
Sum across columns of x:
[ 3 7 11]
Importantly, application of functions across axes is much quicker (and more concise) than writing for-loops! Let’s look at the speed difference between a for-loop (implemented as a list comprehension) and the numpy-style application of the sum
function across axes:
arr = np.random.random((1000, 100))
loop_style = %timeit -o [arr[i, :].sum() for i in range(arr.shape[0])]
axis_style = %timeit -o arr.sum(axis=1)
print("Using the axis-argument is %.3f times faster than a for-loop!" % (loop_style.average / axis_style.average))
1.74 ms ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
54.6 µs ± 28 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Using the axis-argument is 31.878 times faster than a for-loop!
This application of functions on entire arrays (in an elementwise manner, like arr_1 + arr_2
) and application of functions across a certain axis (like np.sum(arr_1, axis=1)
) is often called vectorization. This is an incredibly useful concept and something that is used in many data analysis tools! Make sure you understand this.
Let’s practice this a bit.
Complete the function, and make sure to return the result, which should be a 1D vector which values represents the ranges for all columns. So, if the input array would be of shape (100, 10), then the shape of your returned array should be (10,).
Show code cell content
""" Implement the ToDo here (by completing the function below). """
def compute_range_vectorized(arr):
''' Compute the range across rows for each column in a 2D array.
Parameters
----------
arr : a 2D numpy array
Returns
-------
A 1D vector with the ranges for each column
'''
### BEGIN SOLUTION
return arr.max(axis=0) - arr.min(axis=0)
### END SOLUTION
""" Tests the above ToDo. """
from tests import test_compute_range_vectorized
np.random.seed(42)
test_arr = np.random.random((100, 10))
outp = compute_range_vectorized(test_arr)
if outp is None:
raise ValueError("Output is None! Did you forget the Return statement in your function?")
test_compute_range_vectorized(test_arr, outp)
Easy peasy!
Broadcasting#
Broadcasting is a powerful mechanism that allows numpy to work with arrays of different shapes when performing arithmetic operations. Frequently we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger array.
For example, suppose that we want to add a vector to each row of a matrix. We could do it like this:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print('x array is of shape: %r' % (x.shape,))
print(x, '\n')
v = np.array([1, 0, 1])
print('v vector is of shape: %r (different shape than x!)' % (v.shape,))
print(v, '\n')
y = np.zeros(x.shape) # Create an empty (zeros) matrix with the same shape as x
print('Shape of (pre-allocated) y-matrix: %r' % (y.shape,))
# Add the vector v to each row of the matrix x with an explicit loop
for i in range(x.shape[0]): # see how the shape attributes comes in handy in creating loops?
y[i, :] = x[i, :] + v
print('The result of adding v to each row of x, as stored in y:')
print(y)
x array is of shape: (4, 3)
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
v vector is of shape: (3,) (different shape than x!)
[1 0 1]
Shape of (pre-allocated) y-matrix: (4, 3)
The result of adding v to each row of x, as stored in y:
[[ 2. 2. 4.]
[ 5. 5. 7.]
[ 8. 8. 10.]
[11. 11. 13.]]
This works; however when the matrix x
is very large, computing an explicit for-loop in Python could be slow. Note that adding the vector v to each row of the matrix x
is equivalent to forming a matrix vv
by stacking multiple copies of v
vertically, like this [[1 0 1], [1 0 1], [1 0 1], [1 0 1]]
, and subsequently elementwise addition of x + vv
:
vv = np.tile(v, (4, 1)) # i.e. expand vector 'v' 4 times along the row dimension (similar to MATLAB's repmat function)
y = x + vv # Add x and vv elementwise
print(y)
[[ 2 2 4]
[ 5 5 7]
[ 8 8 10]
[11 11 13]]
Numpy broadcasting allows us to perform this computation without actually creating multiple copies of v. Consider this version, using broadcasting:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v # Add v to each row of x using broadcasting
print(y)
[[ 2 2 4]
[ 5 5 7]
[ 8 8 10]
[11 11 13]]
The line y = x + v
works even though x
has shape (4, 3)
and v
has shape (3,)
due to broadcasting; this line works as if v actually had shape (4, 3)
, where each row was a copy of v
, and the sum was performed elementwise.
This broadcasting function is really useful, as it prevents us from writing unnessary and by definition slower explicit for-loops. Additionally, it’s way easier to read and write than explicit for-loops (which need pre-allocation). Functions that support broadcasting are known as universal functions. You can find the list of all universal functions in the documentation.
from IPython.display import YouTubeVideo
YouTubeVideo('K96OoAWbhBE', width=800, height=300)
Here are some applications of broadcasting using different functions:
x = np.array([[1, 2],[3, 4], [5, 6]], dtype=np.float64)
x_sum = x.sum(axis=0)
print(x / x_sum, '\n')
x_mean = x.mean(axis=0)
print(x / x_mean)
[[0.11111111 0.16666667]
[0.33333333 0.33333333]
[0.55555556 0.5 ]]
[[0.33333333 0.5 ]
[1. 1. ]
[1.66666667 1.5 ]]
Standardization, which is also oftend called “z-transformation”, is often done in statistics. So, in the function below, make sure that it is able to standardize each column in any 2D array. Make sure to use vectorized array computations (i.e, the np.mean and np.std functions/methods across rows) and broadcasting, so no for-loops!
Show code cell content
""" Implement the ToDo here (by completing the function below). """
def standardize_columns(arr):
''' Standardize each column of a 2D input-array.
Parameters
----------
arr : a 2D numpy array
Returns
-------
The input-array with standardized columns (should have the same shape as the input-array!)
'''
### BEGIN SOLUTION
return (arr - arr.mean(axis=0)) / arr.std(axis=0)
### END SOLUTION
""" Tests the above ToDo. """
test_arr = np.random.normal(5, 2, size=(100, 10))
outp = standardize_columns(test_arr)
if outp is None:
raise ValueError("The output from your function is None; did you forget the Return statement?")
try:
np.testing.assert_array_almost_equal(outp.mean(axis=0), np.zeros(test_arr.shape[1]))
except AssertionError as e:
print("The mean of the columns of your standardized array are not 0!")
raise(e)
try:
np.testing.assert_array_almost_equal(outp.std(axis=0), np.ones(test_arr.shape[1]))
except AssertionError as e:
print("The std of the columns of your standardized array are not 1!")
raise(e)
print("A-MA-ZING!")
A-MA-ZING!
Now, you know the most important numpy concepts and functionality that is necessary to start using it. Surely, there is a lot more to the numpy package that what we’ve covered here, but that’s for another time 😃