Introduction to Numpy (tutorial)

Introduction to Numpy (tutorial)#

This notebook is about Numpy, Python’s core library for numeric computing. Knowing how to use Numpy is essential in domains such as machine learning, neuroimaging analysis, and image processing.

Contents#

The ndarray
Indexing
Array math
Broadcasting

Let’s start by importing numpy, which is commonly done as follows:

import numpy as np

The ndarray#

The most important feature of Numpy is it’s core data structure: the numpy ndarray (which stands for n-dimensional array, referring to the fact that the array may be of any dimension: 1D, 2D, 3D, 180D … nD). This is, just like basic lists/dictionaries/integers/tuples, a data structure with its own syntax, operations, and methods, as we will explain below.

Python lists vs. numpy arrays#

Basically, numpy arrays are a lot like Python lists. The major difference, however, is that numpy arrays may contain only a single data-type, while Python lists may contain different data-types within the same list.

Let check this out:

# Python lists may contain mixed data-types: an integer, a float, a string, a list
python_list = [1, 2.5, "whatever", [3, 4, 5]] 

for value in python_list:
    
    print("%s is a: %s" % (str(value), type(value)))

1 is a: <class 'int'>
2.5 is a: <class 'float'>
whatever is a: <class 'str'>
[3, 4, 5] is a: <class 'list'>

Unlike Python lists, numpy only allows entries of the same data-type. This difference between Python lists and numpy arrays is basically the same as R lists (allow multiple data-types) versus R matrices/arrays (only allow one data type), and is also the same as MATLAB cells (allow multiple data-types) versus MATLAB matrices (only allow one data type).

In fact, if you try to make a numpy array with different data-types, numpy will force the entries into the same data-type (in a smart way), as is shown in the example below:

# Importantly, you often specify your arrays as Python lists first, and then convert them to numpy
to_convert_to_numpy = [1, 2, 3.5]               # specify python list ...
numpy_array = np.array(to_convert_to_numpy)     # ... and convert ('cast') it to numpy

for entry in numpy_array:
    
    print(entry)
    print('this is a: %s \n' % type(entry))

1.0
this is a: <class 'numpy.float64'> 

2.0
this is a: <class 'numpy.float64'> 

3.5
this is a: <class 'numpy.float64'> 

As you can see, Numpy converted our original list (to_convert_to_numpy), which contained both integers and floats, to an array with only floats! You might think that such a data structure that only allows one single data type is not ideal. However, the very fact that it only contains a single data-type makes operations on numpy arrays extremely fast. For example, loops over numpy arrays are often way faster than loops over python lists. This is because, internally, Python has to check the data-type of each loop entry before doing something with that entry. Because numpy arrays one allow a single data-type, it only has to check for the entries’ data type once. If you imagine looping over an array or list of length 100,000, you probably understand that the numpy loop is way faster.

Let’s check out the speed difference between Python list operations and numpy array operations:

# %timeit is a cool 'feature' that you can use in Notebooks (no need to understand how it works)
# it basically performs a computation that you specify a couple of times and prints how long it took on average
results_python = %timeit -o [x * 2 for x in range(0, 100000)] 

8.61 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And now let’s do the same with numpy:

results_numpy = %timeit -o np.arange(0, 10000) * 2 

ratio = results_python.average / results_numpy.average
print("Numpy is %i times faster than base Python!" % ratio)

10.1 µs ± 9.62 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Numpy is 850 times faster than base Python!

You see that Numpy much faster! This really matters when you start doing more complex operations or large arrays!

Creating numpy arrays#

As shown an earlier example, numpy arrays can be created as follows:

Define a Python list, e.g. my_list = [0, 1, 2]
Convert the list to a numpy array, e.g. numpy_array = np.array(my_list)

Importantly, a simple Python list will be converted to a 1D numpy array, but a nested Python list will be converted to a 2D (or even higher-dimensional array). Nesting is simply combining different lists, separated by commans, as is shown here:

my_list = [1, 2, 3]
my_array = np.array(my_list)

print("A 1D (or 'flat') array:")
print(my_array, '\n')

my_nested_list = [[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]]

my_2D_array = np.array(my_nested_list)
print("A 2D array:")
print(my_2D_array)

A 1D (or 'flat') array:
[1 2 3] 

A 2D array:
[[1 2 3]
 [4 5 6]
 [7 8 9]]

ToDo: Create the following 1D array:

(1)#\[\begin{align} \begin{bmatrix} 5 & 3 & 2 & 8 & 9 \end{bmatrix} \end{align}\]

and store it in a variable named vec.

""" Tests the above ToDo. """
if 'vec' not in dir():
    raise ValueError("You haven't created the variable 'vec' yet!")
    
if type(vec) != np.ndarray:
    raise ValueError("vec is not a numpy array!")
    
if vec.ndim != 1:
    raise ValueError("vec is not a 1D array!")

if vec.tolist() != [5, 3, 2, 8, 9]:
    raise ValueError("vec does not have the right contents ...")
    
print("Yay!")

Yay!

ToDo:

Create the following matrix (2D array):

(2)#\[\begin{align} \begin{bmatrix} 5 & 2 & 8 \\ 8 & 2 & 1 \\ 4 & 4 & 4 \\ 1 & 2 & 3 \end{bmatrix} \end{align}\]

and store it in a variable named arr_2d. Hint: start by creating a nested python list (like we did with the my_nested_list variable) and then convert it to numpy.

""" Tests the above ToDo. """
if type(arr_2d) != np.ndarray:
    raise ValueError("arr_2d is not a numpy array!")
    
if arr_2d.shape != (4, 3):
    raise ValueError("arr_2d does not have the right shape!")
    
print("Yes! Well done!")

Yes! Well done!

ToDo (optional): Let's try something more difficult! Create a 3D array with dimensions $2 \times 2 \times 2$ with consecutive numbers starting at 1 and store this in a variable named arr_3d. So arr_3d[0, 0, 0] should evaluate to 1, arr_3d[0, 0, 1] should evaluate to 2, arr_3d[0, 1, 0] evaluates to 3, etc.

""" Tests the above ToDo. """
if arr_3d.shape != (2, 2, 2):
    raise ValueError("The array does not have the right dimensions!")

if arr_3d[0, 0, 0] != 1:
    raise ValueError("Hmm, something went wrong ...")

if arr_3d[0, 0, 1] != 2:
    raise ValueError("Hmm, something went wrong ...")
    
if arr_3d[1, 1, 1] != 8:
    raise ValueError("Hmm, something went wrong ...")

As you can imagine, creating numpy arrays from nested lists becomes cumbersome if you want to create (large) arrays with more than 2 dimensions. There are, fortunately, a lot of other ways to create (‘initialize’) large, high-dimensional numpy arrays. One often-used method is to create an array with zeros using the numpy function np.zeros. This function takes one (mandatory) argument, which is a tuple with the dimensions of your desired array:

my_desired_dimensions = (2, 5) # suppose I want to create a matrix with zeros of size 2 by 5
my_array = np.zeros(my_desired_dimensions)

print(my_array)

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

ToDo: Using the np.zeros function, create a five dimensional array of shape $2 \times 3 \times 5 \times 3 \times 7$ and store it in a variable named array_5d.

""" Tests the above ToDo. """
from tests import test_create_array_with_zeros
test_create_array_with_zeros(array_5d)

Well done!

Using arrays with zeros is often used in what is called ‘pre-allocation’, in which you create an ‘empty’ array with only zeros and for example, ‘fill’ that array in a loop.

Below, we show you an example in which we pre-allocate an array with 5 zeros, and fill that in a for-loop with the squares of 1 - 5. Try to understand how it works! (You have to do something similar in the next ToDo!)

my_array = np.zeros(5)

print('Original zeros-array')
print(my_array)

for i in range(5):  # notice the range function here! This loop now iterates over [0, 1, 2, 3, 4]
    number_to_calculate_the_square_of = i + 1
    my_array[i] = number_to_calculate_the_square_of ** 2

print('\nFilled array')
print(my_array)

Original zeros-array
[0. 0. 0. 0. 0.]

Filled array
[ 1.  4.  9. 16. 25.]

ToDo: Write a loop in which you fill each value in the variable my_array2 with the value 1 divided by the current index plus one. So, suppose the index (usually i, like the last example) is 3, then you should fill my_array2 at index i with 1 / (3 + 1).

""" Tests the above ToDo. """
from tests import test_fill_array_with_complement
test_fill_array_with_complement(my_array2)

AWESOME!

In addition to np.zeros, you can create numpy arrays using other functions, like np.ones and random from the np.random module:

ones = np.ones((5, 10)) # create an array with ones
print(ones, '\n')

rndom = np.random.random((5, 10)) # Create an array filled with random values (0 - 1 uniform)
print(rndom)

[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]] 

[[0.26519863 0.61210109 0.07390538 0.23448279 0.91801535 0.09771712
  0.59989371 0.95954777 0.77371265 0.96163829]
 [0.5918337  0.28657956 0.17395184 0.19858053 0.27070102 0.29039822
  0.74757114 0.09582356 0.79228018 0.60746672]
 [0.09362956 0.3003594  0.93882937 0.57015628 0.82258401 0.98592789
  0.74375117 0.74636537 0.44392873 0.13059863]
 [0.69194765 0.1913486  0.16140539 0.20675131 0.67467491 0.49576562
  0.49252433 0.37714862 0.05598054 0.30890028]
 [0.40887687 0.89940557 0.34818037 0.56836087 0.65853873 0.51483755
  0.50701647 0.31754501 0.63856904 0.08532279]]

Numpy indexing#

Indexing (extracting a single value of an array) and slicing (extracting multiple values - a subset - from an array) of numpy arrays is largely the same as with regular Python lists and data structures from other scientific computing languages such as R and MATLAB. Let’s check out a couple of examples of a 1D array:

my_array = np.arange(10, 21)  # numpy equivalent of list(range(10, 21))
print('Full array:')
print(my_array, '\n') 

Full array:
[10 11 12 13 14 15 16 17 18 19 20] 

print("Index the first element:")
print(my_array[0])

Index the first element:
10

print("Index the second-to-last element:")
print(my_array[-2])

Index the second-to-last element:
19

print("Slice from 5 until (not including!) 8")
print(my_array[5:8])

Slice from 5 until (not including!) 8
[15 16 17]

print("Slice from beginning until 4")
print(my_array[:4])

Slice from beginning until 4
[10 11 12 13]

Setting values in numpy arrays works the same way as lists:

my_array = np.arange(10, 21)
my_array[0] = 10000
print(my_array)

my_array[5:7] = 0
print(my_array)

[10000    11    12    13    14    15    16    17    18    19    20]
[10000    11    12    13    14     0     0    17    18    19    20]

ToDo (1 point): In the array my_array below, set all odd indices (i.e., the first element, the third element, the fifth element, etc.) to the value 0.0 in a single statement using a slice (i.e., not a for-loop).

''' Tests the above ToDo. '''
from tests import test_set_odd_indices_to_zero
test_set_odd_indices_to_zero(my_array)

Good job!

Multidimensional indexing#

Often, instead of working on and indexing 1D array, we’ll work with multi-dimensional (>1D) arrays. Indexing multi-dimensional arrays is, again, quite similar to other scientific computing languages.

Like indexing Python lists, indexing multidimensional numpy arrays is done with square brackets [], in which you can put as many comma-delimited numbers as there are dimensions in your array.

For example, suppose you have a 2D array of shape $3 \times 3$ and you want to index the value in the first row and first column. You would do this as follows:

my_array = np.zeros((3, 3)) # 3 by 3 array with zeros
indexed_value = my_array[0, 0]
print("Value of first row and first column: %.1f" % indexed_value)

Value of first row and first column: 0.0

ToDo: Using multidimensional indexing, set the value in the last row and last column of the 2D array my_array to 1.0. In other words, set the value in the lower-right "corner" of the 2D array to 1.0.

""" Tests the ToDo above. """
from tests import test_set_lower_right_value_to_one
test_set_lower_right_value_to_one(my_array)

Superb!

In addition to setting specific slices to specific values, you can also extract sub-arrays using slicing/indexing. An important construct here is that you can use a single colon : to select all values from a particular dimension. For example, if you want to select all column-values (second dimension) from only the first row (first dimension), do this:

some_2d_arr[0, :]

We’ll show you some examples below:

my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

print(my_array, '\n')

all_column_values_from_first_row = my_array[0, :]
print('First row')
print(all_column_values_from_first_row, '\n')

all_row_values_from_first_col = my_array[:, 0]
print('First column')
print(all_row_values_from_first_col)

[[1 2 3]
 [4 5 6]
 [7 8 9]] 

First row
[1 2 3] 

First column
[1 4 7]

So far, we only talked about 2D arrays, which are relatively easy to ‘understand’. In neuroimaging, however, we usually work with 3D, 4D, or even higher-dimensional arrays. These arrays are great for organizing data, but they are somewhat unintuitive. This takes some time to get used to!

To get you used to thinking in more than 2 (or 3) dimensions, consider the following scenario. Suppose that a researcher wants to test the efficacy of a particular medicine against high blood pressure. To do so, she/he measures the (average systolic) blood pressure of a group of twenty subjects every hour for 30 days when they’re not on the medication. The same subjects are then again measured for another 30 days, but then when they’re on medication.

After this period of data collection, the researcher has $20\ \mathrm{(subjects)}\ \times\ 24\ \mathrm{(hours)}\ \times\ 30\ \mathrm{(days)}\ \times\ 2\ \mathrm{(conditions: off/on)} = 28800$ measurements! We can then organize the data in a 4D array, in which each factor (subjects/hours/days/conditions) represents a separate dimension (also called “axis”) of our array!*

* Note that this type of data is perhaps best represented in a “table-like” structure, such as a Dataframe in R or a Pandas DataFrame, the Python-equivalent of R’s dataframe). But you can ignore that for the sake of this example.

So, let’s generate some random data (from a normal distribution to generate ‘realistic’ blood pressure data) that could represent this blood pressure dataset:

np.random.seed(42)  # this is not important for now
bp_data = np.random.normal(loc=100, scale=5, size=(20, 24, 30, 2))

(Note that we’re not printing the bp_data variable here, because it’s way too much to visualize/interpret at once anyway!)

Now, suppose I would want to extract the blood pressure of subject 5 at 09.00 (AM) in the morning at day 21 when he/she was not on medication. In that case, I would do:

# Note that I count midnight as the first measurement
this_particular_datapoint = bp_data[4, 8, 20, 0]
print("Blood pressure of participant 5 at 09.00 (AM) on day 21 in "
      "the no-medication (off) condition is %.3f" % this_particular_datapoint)

# Also, remember that Python is zero-based, so e.g. participant 5 is indexed by index 4!

Blood pressure of participant 5 at 09.00 (AM) on day 21 in the no-medication (off) condition is 100.567

Try to remember this: in multidimensional arrays, each dimension (axis) represents a different “attribute” of your data (e.g. subjects, time, condition, etc.). This concept is very important to understand in, for example, functional MRI analysis (which revolves around 4D data: 3 spatial dimensions and 1 time dimension)!

ToDo: Consider the blood-pressure dataset again. In the code-cell below, extract from all participants all data from day 18 in the medication ("on") condition. You'll have to use slices with the single :! Store the results of your index-operation in the variable day_18_condition_on.

""" Tests the above ToDo. """
from tests import test_bloodpressure_index
test_bloodpressure_index(day_18_condition_on)

You're incredible!

Multidimensional indexing using slices (especially the single colon : slice, i.e., selecting everything) is very common in scientific programming such as neuroimaging analyses. There is, however, yet another way of (multidimensional) indexing called “boolean indexing”.

In this type of indexing, you index an array with a boolean array (i.e. array with True and False values) of the same shape. Basically, when you’re indexing the array my_array with boolean array bool_array, you’re saying: “give me all values in my_array that are True at the same location in bool_array!”

Let’s look at an example:

my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

print("The original array:\n")
print(my_array, '\n')

bool_array = np.array([[True, False, True],
                       [False, True, False],
                       [True, False, True]])

print("The boolean array:\n")
print(bool_array, '\n')

print('Result of indexing my_array with bool_array:\n')
print(my_array[bool_array])

The original array:

[[1 2 3]
 [4 5 6]
 [7 8 9]] 

The boolean array:

[[ True False  True]
 [False  True False]
 [ True False  True]] 

Result of indexing my_array with bool_array:

[1 3 5 7 9]

Usually, you do not write out the boolean array in full (as we did above), but you base it on the data itself to “filter” it according to some criterion formalized as a logical statement (i.e., using the boolean operators >, <, ==, or !=). For example, suppose I want to extract only the values above 6 from the my_array variable in the above example.

To do so, I could do the following:

my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

print("The original array:\n")
print(my_array, '\n')

bool_array = my_array > 6

print("The boolean array:\n")
print(bool_array, '\n')

print('Result of indexing my_array with bool_array:\n')
print(my_array[bool_array])

The original array:

[[1 2 3]
 [4 5 6]
 [7 8 9]] 

The boolean array:

[[False False False]
 [False False False]
 [ True  True  True]] 

Result of indexing my_array with bool_array:

[7 8 9]

Easy, right? Now try it yourself in the next ToDo!

ToDo: Use a boolean index to extract all values whose square (i.e. $x^2$) is larger than 4 from the array (my_array) below. Store the results in a variable with the name square_is_larger_than_4.

""" Tests the above ToDo. """
from tests import test_boolean_indexing
test_boolean_indexing(square_is_larger_than_4)

EPIC!

Again, it’s very important to understand how to (effectively) index multidimensional numpy arrays using slices and boolean indexing, as we’ll use it a lot in this course!

Data-types#

Every numpy array is a grid of values of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy guesses the datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype.

Here are a couple of examples:

x1 = np.array([1, 2])  # Let numpy choose the datatype (here: int)
x2 = np.array([1.0, 2.0])  # Let numpy choose the datatype (here: float)
x3 = np.array([1, 2], dtype=np.float64)  # Force a particular datatype (input: int, but converted to 64bit float)
x4 = np.array([-1, 0, 1, 2], dtype=bool)  # Convert ints to booleans! 0 -> False, everthing else -> True 

print('%r (%s)' % (x1, type(x1[0])))
print('%r (%s)' % (x2, type(x2[0])))
print('%r (%s)' % (x3, type(x3[0])))
print('%r (%s)' % (x4, type(x4[0])))

array([1, 2]) (<class 'numpy.int64'>)
array([1., 2.]) (<class 'numpy.float64'>)
array([1., 2.]) (<class 'numpy.float64'>)
array([ True, False,  True,  True]) (<class 'numpy.bool_'>)

Note that, after creating the array, you can still change the datatype by using the numpy method astype()!

x = np.array([1, 2, 3, 4])
print(type(x[0]))

# Now convert it to floasts!
x = x.astype(float)  # or astype(np.float)
print(type(x[0]))

<class 'numpy.int64'>
<class 'numpy.float64'>

ToDo: Below, we've defined a boolean array named bool_array. Convert it to an array with integers and store it in a new array called bool2int_array. Check out the values of the new array!

""" Tests the above ToDo (not for points). """
np.testing.assert_array_equal(bool2int_array, np.array([1, 1, 0, 0, 1]))
print("Well done!")

Well done!

Methods vs. functions#

In the previous notebooks, you’ve learned that, in addition to functions, ‘methods’ exist that are like functions of an object. You’ve seen examples of list methods, e.g. my_list.append(1), and string methods, e.g. my_string.replace('a', 'b').

Like lists and strings, numpy arrays have a lot of convenient methods that you can call (like the astype method we saw earlier). Again, this is just like a function, but then applied to itself. Often, numpy provides both a function and method for simple operations.

Let’s look at an example:

my_array = np.arange(10)  # creates a numpy array from 0 until (excluding!) 10
print(my_array, '\n')

mean_array = np.mean(my_array)
print('The mean of the array is: %f' % mean_array)

mean_array2 = my_array.mean() 
print('The mean of the array (computed by its corresponding method) is: %f' % mean_array2)

print('Is the results from the numpy function the same as '
      'the corresponding method? Answer: %s' % str(mean_array == mean_array2))

[0 1 2 3 4 5 6 7 8 9] 

The mean of the array is: 4.500000
The mean of the array (computed by its corresponding method) is: 4.500000
Is the results from the numpy function the same as the corresponding method? Answer: True

If there is both a function and a method for the operation you want to apply to the array, it really doesn’t matter what you choose! Let’s look at some more (often used) methods of numpy ndarrays:

my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

std_my_array = my_array.std()  # same as np.std(array)
print("Standard deviation of my_array: %.3f" % std_my_array, '\n')

transpose_my_array = my_array.T  # same as np.transpose(array)
print("Transpose of my_array:\n%r" % transpose_my_array, '\n')

min_my_array = my_array.min()  # same as np.min(array)
print("Minimum of my_array: %i" % my_array.min(), '\n')

max_my_array = my_array.max()  # same as np.max(array)
print("Maximum of my_array: %i" % max_my_array, '\n')

sum_my_array = my_array.sum()  # same as np.sum(array)
print("Sum of my_array: %i" % sum_my_array, '\n')

Standard deviation of my_array: 2.582 

Transpose of my_array:
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]]) 

Minimum of my_array: 1 

Maximum of my_array: 9 

Sum of my_array: 45 

Importantly, a method may or may not take arguments (input). If no arguments are given, it just looks like “object.method()”, i.e. two enclosing brackets with nothing in between. However, a method may take one or more arguments (like the my_list.append(1) method)! This argument may be named or unnamed - doesn’t matter. An example:

my_array2 = np.random.random((3, 3))
print('Original array:')
print(my_array2, '\n')

print('Use the round() method with the argument 3:')
print(my_array2.round(3), '\n')

print('Use the round() method with the named argument 5:')
print(my_array2.round(decimals=5), '\n')

Original array:
[[0.32710922 0.30512889 0.84403235]
 [0.6239239  0.29343832 0.71361229]
 [0.34459281 0.58655561 0.99263403]] 

Use the round() method with the argument 3:
[[0.327 0.305 0.844]
 [0.624 0.293 0.714]
 [0.345 0.587 0.993]] 

Use the round() method with the named argument 5:
[[0.32711 0.30513 0.84403]
 [0.62392 0.29344 0.71361]
 [0.34459 0.58656 0.99263]] 

In addition to the methods listed above, you’ll probably see the following methods a lot in the code of others.

Reshaping arrays:

my_array = np.arange(10)
print(my_array.reshape((5, 2))) # reshape to desired shape

[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]

Ravel (“flatten”) an array:

temporary = my_array.reshape((5, 2))
print("Initial shape: %s" % (temporary.shape,))
print(temporary.ravel()) # unroll multi-dimensional array to single 1D array
print("Shape after ravel(): %s" % (temporary.ravel().shape,))

Initial shape: (5, 2)
[0 1 2 3 4 5 6 7 8 9]
Shape after ravel(): (10,)

# .dot() does matrix multiplication (dot product: https://en.wikipedia.org/wiki/Dot_product)
# This linear algebra operation is used very often in neuroimaging research 
# (which depends heavily on the General Linear Model!)
array1 = np.array([0, 1, 2, 3])
array2 = np.array([4, 5, 6, 7])

dot_product = array1.dot(array2)
print(dot_product)

ToDo (1 point): Let's practice writing functions some more. Complete the function below, named calculate_range. This function takes a single input-argument — a 1D numpy array — and subsequently calculates the range of the array. The range is the difference between the maximum and minimum of any given array (vector) $x$:

(3)#\[\begin{align} \mathrm{range}(x) = \max(x) - \min(x) \end{align}\]

You may use the corresponding numpy min/max methods or functions in your custom function, doesn’t matter. Don’t forget to explicitly return the value of the range!

Note: this custom function that implements the mathematical formula for a vector’s range is completely unrelated to the Python function range that is often used in for-loops!

""" Tests the above ToDo. """

outp = calculate_range(np.array([0, 1, 2, 3]))
if outp is None:
    raise ValueError("Didn't get any output! Did you explicitly return the range?")

assert(outp == 3)
assert(calculate_range(np.array([-1, 0, 1, 2])) == 3)
assert(calculate_range(np.array([0, 0, 0, 0])) == 0)

print("Great job!")

Great job!

Alright, by now, if you see a variable followed by a word ending with enclosed brackets, e.g. my_array.mean(), you’ll know that it’s a method! But sometimes you might see something similar, but without the brackets, such as my_array.size. As you know, this .size is called an attribute of the variable my_array. The attribute may be of any data-type, like a string, integer, tuple, an array itself. Let’s look at a numpy-specific example:

my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

print(my_array, '\n')
print('The size (number of element) in the array is:')
print(my_array.size, '\n')
print('The .size attribute is of data-type: %s' % type(my_array.size))

[[1 2 3]
 [4 5 6]
 [7 8 9]] 

The size (number of element) in the array is:
9 

The .size attribute is of data-type: <class 'int'>

Again, you might not use attributes a lot during this course, but you’ll definitely see them around in the code of the tutorials. Below, some of the common ndarray attributes are listed:

my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

print('Size (number of elements) of array:')
print(my_array.size, '\n') # returns an integer

print('Shape of array:')
print(my_array.shape, '\n') # this is a tuple!

print('Number of dimensions:')
print(my_array.ndim) # this is an integer

Size (number of elements) of array:
9 

Shape of array:
(3, 3) 

Number of dimensions:
2

ToDo: Let's try another one. Complete the function below, named compute_one_sample_ttest. This function takes two input arguments:

arr : a 1D numpy array
h0 : a scalar value (single number) that represents the value representing the null-hypothesis

The function should compute the one-sample t-test, which tests whether the mean of an array is significantly different from a given value representing the null-hypothesis. Formally, for any given array $x$ and null-hypothesis $h_{0}$:

(4)#\[\begin{align} t = \frac{\bar{x} - h_{0}}{s\ / \sqrt{N}} \end{align}\]

Here: $\bar{x}$ represents the mean of $x$, $s$ represents the standard deviation of $x$, and $N$ represents the length (‘size’) of x. So, in slightly less mathematical notation:

(5)#\[\begin{align} t = \frac{\mathrm{mean}(x) - h_{0}}{\mathrm{std}(x)\ / \sqrt{\mathrm{length}(x) - 1}} \end{align}\]

Make sure to return the t-value!

Hint 1: to compute $N$, you can use the .size attribute of an array …
Hint 2: use the function np.sqrt(some_number) to calculate the square root …

""" Tests the ToDo above. """
from tests import test_tvalue_computation

arr = np.random.randn(100)
outp = compute_one_sample_ttest(arr , 0)

if outp is None:
    raise ValueError("Your function didn't return anything! Did you forget the return statement?")

print("Going to do 3 tests ...")
test_tvalue_computation(arr, 0, outp)

outp = compute_one_sample_ttest(arr, 5) 
test_tvalue_computation(arr, 5, outp)

outp = compute_one_sample_ttest(arr, -3) 
test_tvalue_computation(arr, -3, outp)

Going to do 3 tests ...
Correct! You stats wizard!
Correct! You stats wizard!
Correct! You stats wizard!

Numpy: array math#

Now you know all the numpy basics necessary to do neuroimaging analysis! As you’ll see in the last section (Working with nifti-images), we’ll work with 3D (structural MRI images) or 4D (functional MRI images) numpy arrays a lot. Given that you know how the basics about numpy in general and numpy ndarrays in particular, we can utilize some of numpy’s best features: (very fast) array math.

Basic mathematical functions operate elementwise on arrays, which means that the operation (e.g. addition) is applied onto each element in the array. So, let’s initialize a 1D array with ten zeros and let’s add 1 to it:

x = np.zeros(10)
print(x, '\n')
x += 1 # remember: this the same as x = x + 1
print(x)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] 

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

Additionally, you can also sum two arrays together in an elementwise manner by simply writing: array_1 + array_2, given that these two (or more) arrays are of the same shape! Let’s look at an example:

x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

print("x: \n%r" % x, '\n')
print("y: \n%r" % y, '\n')
print("x+y: \n%r" % (x + y), '\n')

x: 
array([[1., 2.],
       [3., 4.]]) 

y: 
array([[5., 6.],
       [7., 8.]]) 

x+y: 
array([[ 6.,  8.],
       [10., 12.]]) 

Often, there exist function-equivalents of the mathematical operators. For example, x + y is the same as np.add(x, y). However, it is recommended to use the operators wherever possible to improve readability of your code. See below for an example:

print(x + y, '\n')
print(np.add(x, y))

[[ 6.  8.]
 [10. 12.]] 

[[ 6.  8.]
 [10. 12.]]

Next to addition, we can also do elementwise subtraction, multiplication, divison, square root, and exponentiation:

# Elementwise difference; both produce the array
print(x - y, '\n')
print(np.subtract(x, y))  # function-equivalent of above 

[[-4. -4.]
 [-4. -4.]] 

[[-4. -4.]
 [-4. -4.]]

# Elementwise product; both produce the array
print(x * y, '\n')
print(np.multiply(x, y))

[[ 5. 12.]
 [21. 32.]] 

[[ 5. 12.]
 [21. 32.]]

# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y, '\n')
print(np.divide(x, y))

[[0.2        0.33333333]
 [0.42857143 0.5       ]] 

[[0.2        0.33333333]
 [0.42857143 0.5       ]]

# Elementwise square root; there is no operator-equivalent!
print(np.sqrt(x))

[[1.         1.41421356]
 [1.73205081 2.        ]]

# Elementwise exponentiation
print(x ** y, '\n')
print(np.power(x, y))

[[1.0000e+00 6.4000e+01]
 [2.1870e+03 6.5536e+04]] 

[[1.0000e+00 6.4000e+01]
 [2.1870e+03 6.5536e+04]]

ToDo: Do an elementwise product between the two variables defined below (arr_A and arr_B) and subsequently add 5 to each element; store the result in a new variable called result_product_and_sum.

""" Tests the above ToDo. """
from tests import test_array_product_and_sum
test_array_product_and_sum(result_product_and_sum)    

Great!

Note that unlike Matlab, * is elementwise multiplication, not matrix multiplication. We instead use the dot function (or method!) to compute matrix operations like inner products of vectors, multiplication of a vector by a matrix, and multiplication of two matrices.

These matrix operations are quite important in this course, because they are used a lot in neuroimaging methods (such as the General Linear Model, the topic of next week!). You’re not expected to fully understand these matrix operations, but you’ll see them several times in this course, so make sure you’re familiar with its implementation in Python/Numpy.

An example of the inner product (“dot product”) of two vectors in both the function-format and the method-format:

v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

219
219

Additionally, in Python 3.6 (and above), you can use the @ character as an operator for matrix multiplication of numpy arrays!

print(v @ w)

Probably the most used functions in numpy are the sum() and mean() fuctions (or equivalent methods!). A nice feature is that they can operate on the entire array (this is the default) or they can be applied per dimension (or, in numpy lingo, per “axis”).

Applying functions along axes is very common in scientific computing! Let’s look at an example in which we apply the sum function/method across rows and columns:

x = np.array([[1, 2],[3, 4], [5, 6]])

print('Original array:')
print(x, '\n')

print('Sum over ALL elements of x:')
print(np.sum(x), '\n')

print('Sum across rows of x:')
print(np.sum(x, axis=0), '\n')

print('Sum across columns of x:')
print(x.sum(axis=1)) # this is the method form! Is exactly the same as np.sum(x, axis=1) 

Original array:
[[1 2]
 [3 4]
 [5 6]] 

Sum over ALL elements of x:
21 

Sum across rows of x:
[ 9 12] 

Sum across columns of x:
[ 3  7 11]

Importantly, application of functions across axes is much quicker (and more concise) than writing for-loops! Let’s look at the speed difference between a for-loop (implemented as a list comprehension) and the numpy-style application of the sum function across axes:

arr = np.random.random((1000, 100))
loop_style = %timeit -o [arr[i, :].sum() for i in range(arr.shape[0])]
axis_style = %timeit -o arr.sum(axis=1)
print("Using the axis-argument is %.3f times faster than a for-loop!" % (loop_style.average / axis_style.average))

1.74 ms ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

54.6 µs ± 28 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Using the axis-argument is 31.878 times faster than a for-loop!

This application of functions on entire arrays (in an elementwise manner, like arr_1 + arr_2) and application of functions across a certain axis (like np.sum(arr_1, axis=1)) is often called vectorization. This is an incredibly useful concept and something that is used in many data analysis tools! Make sure you understand this.

Let’s practice this a bit.

ToDo: Remember the "range" function from before? Below, we started writing a function, called calculate_range_vectorized, that calculates, for any 2D array, the range of each column (i.e. calculates range across rows for each column) in a vectorized manner (so without any for loops!).

Complete the function, and make sure to return the result, which should be a 1D vector which values represents the ranges for all columns. So, if the input array would be of shape (100, 10), then the shape of your returned array should be (10,).

""" Tests the above ToDo. """
from tests import test_compute_range_vectorized

np.random.seed(42)
test_arr = np.random.random((100, 10))
outp = compute_range_vectorized(test_arr)

if outp is None:
    raise ValueError("Output is None! Did you forget the Return statement in your function?")

test_compute_range_vectorized(test_arr, outp)

Easy peasy!

Broadcasting#

Broadcasting is a powerful mechanism that allows numpy to work with arrays of different shapes when performing arithmetic operations. Frequently we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger array.

For example, suppose that we want to add a vector to each row of a matrix. We could do it like this:

# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])

print('x array is of shape: %r' % (x.shape,))
print(x, '\n')

v = np.array([1, 0, 1])
print('v vector is of shape: %r (different shape than x!)' % (v.shape,))
print(v, '\n')

y = np.zeros(x.shape)   # Create an empty (zeros) matrix with the same shape as x
print('Shape of (pre-allocated) y-matrix: %r' % (y.shape,))

# Add the vector v to each row of the matrix x with an explicit loop
for i in range(x.shape[0]): # see how the shape attributes comes in handy in creating loops?
    y[i, :] = x[i, :] + v

print('The result of adding v to each row of x, as stored in y:')
print(y)

x array is of shape: (4, 3)
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]] 

v vector is of shape: (3,) (different shape than x!)
[1 0 1] 

Shape of (pre-allocated) y-matrix: (4, 3)
The result of adding v to each row of x, as stored in y:
[[ 2.  2.  4.]
 [ 5.  5.  7.]
 [ 8.  8. 10.]
 [11. 11. 13.]]

This works; however when the matrix x is very large, computing an explicit for-loop in Python could be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a matrix vv by stacking multiple copies of v vertically, like this [[1 0 1], [1 0 1], [1 0 1], [1 0 1]], and subsequently elementwise addition of x + vv:

vv = np.tile(v, (4, 1)) # i.e. expand vector 'v' 4 times along the row dimension (similar to MATLAB's repmat function)
y = x + vv  # Add x and vv elementwise
print(y)

[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]

Numpy broadcasting allows us to perform this computation without actually creating multiple copies of v. Consider this version, using broadcasting:

# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v  # Add v to each row of x using broadcasting
print(y)

[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]

The line y = x + v works even though x has shape (4, 3) and v has shape (3,) due to broadcasting; this line works as if v actually had shape (4, 3), where each row was a copy of v, and the sum was performed elementwise.

This broadcasting function is really useful, as it prevents us from writing unnessary and by definition slower explicit for-loops. Additionally, it’s way easier to read and write than explicit for-loops (which need pre-allocation). Functions that support broadcasting are known as universal functions. You can find the list of all universal functions in the documentation.

Tip: The video below (which renders after running the cell) explains broadcasting very nicely — check it out if you want to see a visual representation of how broadcasting works!

from IPython.display import YouTubeVideo
YouTubeVideo('K96OoAWbhBE', width=800, height=300)

Here are some applications of broadcasting using different functions:

x = np.array([[1, 2],[3, 4], [5, 6]], dtype=np.float64)
x_sum = x.sum(axis=0)

print(x / x_sum, '\n')

x_mean = x.mean(axis=0)
print(x / x_mean)

[[0.11111111 0.16666667]
 [0.33333333 0.33333333]
 [0.55555556 0.5       ]] 

[[0.33333333 0.5       ]
 [1.         1.        ]
 [1.66666667 1.5       ]]

ToDo: Below, we started writing a function called standardize_columns, which takes a single input-argument — a 2D numpy array — and should subsequently standardize each column such that each column has a mean of zero and a standard deviation of 1. In other words, standardization subtracts the mean (across rows) of each column from the values in that column, and then divides those value with the standard deviation of that column. Formally, for each column $j$ in any 2D array, standardization entails:

(6)#\[\begin{align} x_{j}\ \mathrm{standardized} = \frac{(x_{j} - \bar{x_{j}})}{\mathrm{std}(x_{j})} \end{align}\]

Standardization, which is also oftend called “z-transformation”, is often done in statistics. So, in the function below, make sure that it is able to standardize each column in any 2D array. Make sure to use vectorized array computations (i.e, the np.mean and np.std functions/methods across rows) and broadcasting, so no for-loops!

""" Tests the above ToDo. """

test_arr = np.random.normal(5, 2, size=(100, 10))
outp = standardize_columns(test_arr)

if outp is None:
    raise ValueError("The output from your function is None; did you forget the Return statement?")

try:
    np.testing.assert_array_almost_equal(outp.mean(axis=0), np.zeros(test_arr.shape[1]))
except AssertionError as e:
    print("The mean of the columns of your standardized array are not 0!")
    raise(e)

try:
    np.testing.assert_array_almost_equal(outp.std(axis=0), np.ones(test_arr.shape[1]))
except AssertionError as e:
    print("The std of the columns of your standardized array are not 1!")
    raise(e)
    
print("A-MA-ZING!")

A-MA-ZING!

Now, you know the most important numpy concepts and functionality that is necessary to start using it. Surely, there is a lot more to the numpy package that what we’ve covered here, but that’s for another time 😃