1. Introduction to NumPy

Data manipulation in Python is nearly synonymous with NumPy array manipulation: even tools like Pandas are built around the NumPy array. NumPy arrays can be thought of as mathematical vectors and behave accordingly, in contrast to Python lists, which are containers that can store arbitrary objects.

Python list recap

We saw previously that lists allow us to store a series of values in a single variable. For example:

listnumbers = [1, 2, 3]
print(listnumbers)

[1, 2, 3]

Notably, in Python, a list can contain objects that are not of the same data type. For example:

list1 = [1, "string", {'a':1}, [[1, 3], set()]]  # arbitrary objects
print(list1)

[1, 'string', {'a': 1}, [[1, 3], set()]]

Adding two lists concatenates them; no mathematical operation is performed. Adding two containers that hold arbitrary objects simply means “combining” them.

listadd = list1 + listnumbers
print(listadd)

[1, 'string', {'a': 1}, [[1, 3], set()], 1, 2, 3]

Being able to store multiple data types in a single list can be convenient. However, Python lists can be cumbersome to work with when doing mathematical operations, and for more complex and multi-dimensional data. They are also relatively slow to process. For example, say we wanted to add two large lists of values together. We would need to do the following:

# create two large lists of values
a = list(range(1000000))
b = list(range(1000000))
# add them element by element
added_list = [ai+bi for ai, bi in zip(a, b)]

Let’s time how long that last line takes. For that, we can use the %timeit magic command in a Jupyter notebook:

%timeit added_list = [ai+bi for ai, bi in zip(a, b)]

94.3 ms ± 819 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

While this may seem fast in human terms, it’s quite slow computationally. If you had many operations like this, the time would quickly add up. As we will see below, it’s possible to do array operations like this much more quickly using NumPy.
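Outside a Jupyter notebook the %timeit magic is not available. As a rough sketch, assuming the standard-library timeit module is an acceptable substitute, the same measurement could look like the following. The names add_lists, n_repeats, and seconds_per_run are illustrative, and the result will differ from machine to machine.

```python
import timeit

# two large Python lists, as in the example above
a = list(range(1000000))
b = list(range(1000000))

def add_lists():
    # element-wise addition with a list comprehension
    return [ai + bi for ai, bi in zip(a, b)]

n_repeats = 5  # illustrative choice; more repeats give a steadier estimate
total_seconds = timeit.timeit(add_lists, number=n_repeats)
seconds_per_run = total_seconds / n_repeats
print(f"{seconds_per_run * 1000:.1f} ms per run")
```

Unlike %timeit, this does not pick the number of repetitions automatically or report a standard deviation, so treat the number as a rough estimate.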

NumPy arrays

For much faster, easier manipulation of numerical arrays, we use NumPy. What is NumPy? A good summary is provided by Claude AI:

NumPy is a fundamental Python library for scientific computing that provides support for large, multi-dimensional arrays and matrices. It offers a comprehensive collection of mathematical functions to operate on these arrays efficiently, with operations implemented in C for high performance. NumPy serves as the foundation for most other scientific Python libraries like pandas, scikit-learn, and matplotlib, making it essential for data science, machine learning, and numerical analysis workflows.

Let’s see how to do some basic array operations with NumPy. First, if you have not done so, you’ll need to install NumPy into your conda environment. To do so, in a terminal, activate your conda environment, then run either:

pip install numpy

or:

conda install -y -c conda-forge numpy

We can now import numpy:

import numpy as np

NumPy array creation

There are multiple ways to create an array using numpy. Some examples:

# from a Python list:
arrnumbers = np.array(listnumbers)
print("arrnumbers:", arrnumbers)

# manually creating it:
arrnumbers2 = np.array([5, 3, 42])
print("arrnumbers2:", arrnumbers2)

# from a range of values (compare to range above):
arrnumbers3 = np.arange(1000000)
print("arrnumbers3:", arrnumbers3)

# an array of values linearly spaced between two endpoints:
arrlinspace = np.linspace(0, 10, 5)  # 5 values equally spaced between 0 and 10
print("arrlinspace:", arrlinspace)

# an array of zeros:
arrzeros = np.zeros(4)
print("arrzeros:", arrzeros)

# an array of ones:
arrones = np.ones(4)
print("arrones:", arrones)

# an empty array (values will be whatever is in memory at the time):
arrempty = np.empty(4)
print("arrempty:", arrempty)

arrnumbers: [1 2 3]
arrnumbers2: [ 5  3 42]
arrnumbers3: [     0      1      2 ... 999997 999998 999999]
arrlinspace: [ 0.   2.5  5.   7.5 10. ]
arrzeros: [0. 0. 0. 0.]
arrones: [1. 1. 1. 1.]
arrempty: [1. 1. 1. 1.]
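One detail that is easy to miss is how the end value is treated by the range-style constructors. The short sketch below (variable names are illustrative) contrasts np.arange, which takes a step size, with np.linspace, which takes a number of points, and also shows np.full for filling an array with a constant:

```python
import numpy as np

# np.arange takes a step size and stops *before* the end value
stepped = np.arange(0, 10, 2.5)
print("stepped:", stepped)  # four values: 0, 2.5, 5, 7.5

# np.linspace takes a number of points and includes the end value by default
spaced = np.linspace(0, 10, 5)
print("spaced:", spaced)  # five values: 0, 2.5, 5, 7.5, 10

# np.full fills an array of the given size with a constant value
filled = np.full(4, 7.0)
print("filled:", filled)  # four copies of 7.0
```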

NumPy arrays can have multiple dimensions:

arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2d)

# we can get the number of dimensions with .ndim:
print("arr2d.ndim:", arr2d.ndim)

# or the shape with .shape:
print("arr2d.shape:", arr2d.shape) # returns the number of rows and columns

[[1 2 3]
 [4 5 6]]
arr2d.ndim: 2
arr2d.shape: (2, 3)

Many array constructor functions take a shape argument to create a multi-dimensional array:

zeros2d = np.zeros((3, 4))  # 3 rows, 4 columns
print("zeros2d:\n", zeros2d)

ones3d = np.ones((2, 3, 4))  # 2 blocks, 3 rows, 4 columns
print("ones3d:\n", ones3d)

zeros2d:
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
ones3d:
 [[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]

Or, we can reshape an existing array:

arrnumbers2d = arrnumbers3.reshape(1000, 1000)  # 1000 rows, 1000 columns
print("arrnumbers3.shape:", arrnumbers3.shape)
print("arrnumbers3:\n", arrnumbers3)

print("arrnumbers2d.shape:", arrnumbers2d.shape)
print("arrnumbers2d:\n", arrnumbers2d)

arrnumbers3.shape: (1000000,)
arrnumbers3:
 [     0      1      2 ... 999997 999998 999999]
arrnumbers2d.shape: (1000, 1000)
arrnumbers2d:
 [[     0      1      2 ...    997    998    999]
 [  1000   1001   1002 ...   1997   1998   1999]
 [  2000   2001   2002 ...   2997   2998   2999]
 ...
 [997000 997001 997002 ... 997997 997998 997999]
 [998000 998001 998002 ... 998997 998998 998999]
 [999000 999001 999002 ... 999997 999998 999999]]
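Two practical points about reshape are worth a small illustration: one dimension can be given as -1 so that NumPy infers it, and the new shape must contain exactly as many elements as the original. A minimal sketch, using a small illustrative array named flat:

```python
import numpy as np

flat = np.arange(12)  # 12 values: 0..11

# -1 asks NumPy to infer that dimension from the array's size
inferred = flat.reshape(3, -1)
print("inferred.shape:", inferred.shape)  # (3, 4)

# the new shape must hold exactly the same number of elements
try:
    flat.reshape(5, 5)  # 25 slots for 12 values
except ValueError as err:
    print("reshape failed:", err)
```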

Array slicing

Similar to Python lists, we can access elements of an array using square brackets and indices. The syntax is:

arr[start:end:step]

Some examples:

# print a single element in arrnumbers:
print(arrnumbers[0]) # first element
print(arrnumbers[1]) # second element

1
2

# print a range of elements in arrnumbers:
print('arrnumbers[0:2]:', arrnumbers[0:2])

# equivalently:
print('arrnumbers[:2]:', arrnumbers[:2])  # start is 0 by default

# or to print from an index to the end:
print('arrnumbers[1:]:', arrnumbers[1:])  # goes to the end by default

# or to print all numbers:
print('arrnumbers[:]:', arrnumbers[:])  # start and end are default

# print every second element:
print('arrnumbers[::2]:', arrnumbers[::2])

arrnumbers[0:2]: [1 2]
arrnumbers[:2]: [1 2]
arrnumbers[1:]: [2 3]
arrnumbers[:]: [1 2 3]
arrnumbers[::2]: [1 3]

Negative indices can be used to slice starting from the end, and to reverse order. For example:

print('arrnumbers[-2:]', arrnumbers[-2:])  # print the last two elements
print('arrnumbers[::-1]', arrnumbers[::-1])  # print all elements in reverse order

arrnumbers[-2:] [2 3]
arrnumbers[::-1] [3 2 1]

For multi-dimensional arrays, the same rules apply; you just separate the indexing for each dimension with commas. For example:

print("arr2d:\n", arr2d)
print("arr2d[0, 0]:", arr2d[0, 0])  # first row, first column
print("arr2d[:, 0]:", arr2d[:, 0])  # all rows, first column
print("arr2d[0, :]:", arr2d[0, :])  # first row, all columns
print("arr2d[0:2, 1:3]:\n", arr2d[0:2, 1:3])  # first two rows, columns 1 and 2

arr2d:
 [[1 2 3]
 [4 5 6]]
arr2d[0, 0]: 1
arr2d[:, 0]: [1 4]
arr2d[0, :]: [1 2 3]
arr2d[0:2, 1:3]:
 [[2 3]
 [5 6]]
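Steps and negative indices work per dimension as well. The sketch below uses a small illustrative array named grid to show a few combinations:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns, values 0..11

last_value = grid[-1, -1]        # last row, last column
every_other_row = grid[::2, :]   # rows 0 and 2, all columns
reversed_cols = grid[:, ::-1]    # all rows, columns in reverse order

print("last_value:", last_value)
print("every_other_row:\n", every_other_row)
print("reversed_cols:\n", reversed_cols)
```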

Subarrays as no-copy views

One important, and extremely useful, thing to know about array slices is that they return views rather than copies of the array data. This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices are copies. Consider our two-dimensional array from before:

print(arr2d)

[[1 2 3]
 [4 5 6]]

Let’s extract a \(2 \times 2\) subarray from this:

arr2d_sub = arr2d[:2, :2]
print(arr2d_sub)

[[1 2]
 [4 5]]

Now if we modify this subarray, we’ll see that the original array is changed! Observe:

arr2d_sub[0, 0] = 99
print(arr2d_sub)

[[99  2]
 [ 4  5]]

print(arr2d)

[[99  2  3]
 [ 4  5  6]]

This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.
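To make the contrast with lists concrete, here is a small side-by-side sketch (the variable names are illustrative). It also uses np.shares_memory, which reports whether two arrays overlap in memory, as one way to check that a slice is a view:

```python
import numpy as np

py_list = [1, 2, 3, 4]
py_slice = py_list[:2]   # list slicing makes a new list
py_slice[0] = 99
print(py_list)           # the original list is unchanged

arr = np.array([1, 2, 3, 4])
arr_slice = arr[:2]      # array slicing returns a view
arr_slice[0] = 99
print(arr)               # the original array now starts with 99

# True when the two arrays refer to overlapping memory
print(np.shares_memory(arr, arr_slice))
```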

Creating copies of arrays

Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the copy() method:

arr2d_sub_copy = arr2d[:2, :2].copy()
print(arr2d_sub_copy)

[[99  2]
 [ 4  5]]

If we now modify this subarray, the original array is not touched:

arr2d_sub_copy[0, 0] = 42
print(arr2d_sub_copy)

[[42  2]
 [ 4  5]]

print(arr2d)

[[99  2  3]
 [ 4  5  6]]
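A common place where an explicit copy helps is inside a function that should not change the array passed in by its caller. The helper below, zero_first_column, is a hypothetical example written for illustration only:

```python
import numpy as np

def zero_first_column(values):
    """Return a new array with the first column set to zero (hypothetical helper)."""
    result = values.copy()  # work on a copy so the caller's array is untouched
    result[:, 0] = 0
    return result

original = np.array([[1, 2, 3], [4, 5, 6]])
zeroed = zero_first_column(original)

print("original:\n", original)  # unchanged
print("zeroed:\n", zeroed)      # first column is zero
```

Without the call to copy(), the assignment inside the function would modify the caller's array in place.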

Boolean slicing

You can use boolean expressions to retrieve certain values in an array. For example:

# print all the values in arrnumbers2 greater than 4:
print("arrnumbers2:\n", arrnumbers2)
print("arrnumbers2 > 4:\n", arrnumbers2[arrnumbers2 > 4])  # index with a boolean array

arrnumbers2:
 [ 5  3 42]
arrnumbers2 > 4:
 [ 5 42]

What’s actually happening here is that you’re first creating a boolean array. This is an array in which each element is either True or False. In this case, arrnumbers2 > 4 creates an array indicating which elements of arrnumbers2 are greater than 4. Passing the boolean array as an index then pulls out those values. We can see this if we break it into two steps:

mask = arrnumbers2 > 4
print("mask:\n", mask)
print("arrnumbers2[mask]:\n", arrnumbers2[mask])

mask:
 [ True False  True]
arrnumbers2[mask]:
 [ 5 42]
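Masks can be combined and can also appear on the left-hand side of an assignment. The following sketch, with illustrative names, shows both. Note that conditions are combined with & and | rather than and/or, and each condition needs its own parentheses:

```python
import numpy as np

values = np.array([5, 3, 42, 7, 1])

# combine conditions with & (and) or | (or); the parentheses are required
in_range = (values > 4) & (values < 40)
print("in_range:", in_range)
print("values[in_range]:", values[in_range])

# a mask on the left-hand side assigns to the selected elements only
clipped = values.copy()
clipped[clipped > 10] = 10
print("clipped:", clipped)
```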

Data types

A key difference between NumPy arrays and Python lists is that the data in a NumPy array must all be of the same type. You can get the data type of the values in an array using .dtype. For example:

print('arrnumbers:', arrnumbers)
print('arrnumbers.dtype:', arrnumbers.dtype)

arrnumbers: [1 2 3]
arrnumbers.dtype: int64

print('arrones:', arrones)
print('arrones.dtype:', arrones.dtype)

arrones: [1. 1. 1. 1.]
arrones.dtype: float64

print('mask:', mask)
print('mask.dtype:', mask.dtype)

mask: [ True False  True]
mask.dtype: bool

If you try to create an array from values of different data types, NumPy will automatically cast them to all be the same. For example:

mixed = np.array([1, 2.0, 3, 4.8, True, False])  # ints, floats, bools
print("mixed:", mixed)
print("mixed.dtype:", mixed.dtype)  # everything cast to float (note that True -> 1, False -> 0)

mixed: [1.  2.  3.  4.8 1.  0. ]
mixed.dtype: float64

You can cast an array to a different type using .astype. This will create a copy of the array with values cast to the type you specified. For example:

mixedint = mixed.astype(int)  # cast to int
print("mixedint:", mixedint)
print("mixedint.dtype:", mixedint.dtype)  # everything cast to int

mixedint: [1 2 3 4 1 0]
mixedint.dtype: int64
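Notice that 4.8 became 4 above: casting floats to int drops the fractional part rather than rounding. If rounding is what you want, one option is to round before casting. The sketch below, with illustrative names, shows the difference, along with setting the type at creation via the dtype argument:

```python
import numpy as np

floats = np.array([4.8, -4.8, 1.2])

# astype(int) discards the fractional part (truncates toward zero)
truncated = floats.astype(int)
print("truncated:", truncated)

# round to the nearest whole number first, then cast
rounded = np.round(floats).astype(int)
print("rounded:", rounded)

# the type can also be chosen when the array is created
as_float = np.array([1, 2, 3], dtype=float)
print("as_float.dtype:", as_float.dtype)
```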

Array operations

One of the most useful aspects of NumPy arrays is that they allow you to perform mathematical operations on all the elements in the array using the same syntax you would use for single variables. For example, we can add all the values in one array to another by doing:

a = np.arange(1000000)
b = np.arange(1000000)
c = a + b  # add element-wise
print("c:", c)

c: [      0       2       4 ... 1999994 1999996 1999998]

Compare that to the way we had to add two Python lists together above. Note that if a and b were Python lists, a+b would concatenate them (i.e., append the values of b to the end of a), whereas if a and b are NumPy arrays, the values are added together element-wise.
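The same distinction applies to other operators. As a small illustration (names are illustrative), multiplying a list by 2 repeats it, while multiplying an array by 2 doubles each element:

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_arr = np.array(numbers_list)

repeated = numbers_list * 2       # list repetition: [1, 2, 3, 1, 2, 3]
doubled = numbers_arr * 2         # element-wise multiplication
squared_plus_one = numbers_arr ** 2 + 1  # operations can be chained

print("repeated:", repeated)
print("doubled:", doubled)
print("squared_plus_one:", squared_plus_one)
```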

Aside from being easier to write, NumPy array operations are also much faster than the equivalent Python list operations. Let’s time how long it takes to create c:

%timeit c = a + b

381 μs ± 20.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Compare this to what we got when we did the same thing with Python lists above (94.3 ms). It’s roughly 250 times faster!

More advanced math operations

NumPy comes with a large number of built-in math functions, which we can run on NumPy arrays. For example:

# take the sine of every element in a:
print(np.sin(a))

# sum up all the values in a:
print(np.sum(a))

# take the average of all the values in a:
print(np.mean(a))

[ 0.          0.84147098  0.90929743 ...  0.21429647 -0.70613761
 -0.97735203]
499999500000
499999.5

Some operations can also be executed as methods on the array. For example:

# sum up all the values in a:
print(a.sum())  

# take the average of all the values in a:
print(a.mean())

499999500000
499999.5
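For multi-dimensional arrays, these reductions also accept an axis argument that selects the dimension to collapse. This goes slightly beyond the examples above, so take it as a brief sketch with an illustrative array named table:

```python
import numpy as np

table = np.array([[1, 2, 3], [4, 5, 6]])  # 2 rows, 3 columns

total = table.sum()              # sum over every element
column_sums = table.sum(axis=0)  # collapse the rows: one sum per column
row_means = table.mean(axis=1)   # collapse the columns: one mean per row

print("total:", total)
print("column_sums:", column_sums)
print("row_means:", row_means)
```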

Note that this doesn’t work with the sine function, however:

a.sin()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[33], line 1
----> 1 a.sin()

AttributeError: 'numpy.ndarray' object has no attribute 'sin'

In VS Code, you can see all the operations you can call as methods of the array by typing the array name followed by a period (e.g., a.). That will show a drop-down list that you can cycle through.

Code Challenge 1.1

Let’s illustrate the speed and simplicity of NumPy vs native Python lists.

1. Create a Python list that has 100,000 values equally spaced between 0 and 2*pi (pi = 3.141592653589793).
2. Calculate the average of the cosine of every value in the Python list. For the cos function, you will need to import the math module.
3. Repeat steps 1 and 2, but using purely NumPy arrays and functions. You should be able to do step 2 in a single line of code. Note that NumPy has a built-in pi value (np.pi).
4. Time how long it takes the computer to do step 2 with Python lists, then with NumPy. When doing the comparison, just time the math operation step, not the array creation. Hint: for timing the Python version, you’ll need to use %%timeit rather than %timeit, as the Python version will require multiple lines of code. Put all the lines in a single cell in your notebook, and put %%timeit at the top to time the entire cell, rather than just a single line.
5. Which is computationally faster? By what factor?

Solution

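To create the Python list:

import math
xsize = 100000
x_list = [xi*2*math.pi/xsize for xi in range(xsize)]

Evaluation:

y = [math.cos(xi) for xi in x_list]
sum(y)/len(y)

-3.8771274778262895e-17

Same, with NumPy:

import numpy as np
# create the array
x_arr = np.linspace(0, 2*np.pi, xsize)
# take the average of the cosine of the values
np.cos(x_arr).mean()

np.float64(9.99999999994543e-06)

The two averages differ slightly because the list stops one step short of 2*pi, whereas np.linspace includes the endpoint by default. The extra point, where the cosine is 1, raises the average by about 1/xsize. Separate names (x_list and x_arr) are used so that the Python timing below runs on the list rather than on the NumPy array.

Python timing, in a Jupyter cell:

%%timeit
y = [math.cos(xi) for xi in x_list]
sum(y)/len(y)

NumPy:

%timeit np.cos(x_arr).mean()

You should find that the NumPy version is faster by a factor of 10-100.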