NumPy#
NumPy is an open-source library that facilitates efficient numerical processing on multi-dimensional arrays. NumPy is ubiquitous in scientific computing and is used extensively in other common libraries such as SciPy, pandas and scikit-learn. To demonstrate the efficiency of NumPy, let’s import it and look at an example.
import numpy as np
# Create 100 random numbers
random_numbers = np.random.rand(100)
# Let's add 10 to each random number
%timeit [x+10 for x in random_numbers]
# Now let's add 10 to each random number using NumPy
%timeit random_numbers + 10
We’ll go through the details of this code throughout the tutorial. For now, it is enough to notice that the second time is lower than the first, i.e., using NumPy was faster than using native Python.
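The `%timeit` command only works inside IPython or a Jupyter notebook. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used instead. The run count of 1,000 and the variable names below are arbitrary choices for illustration, and the exact timings will vary between machines.

```python
import timeit

import numpy as np

random_numbers = np.random.rand(100)

# Time 1,000 runs of each approach (the run count is an arbitrary choice)
list_time = timeit.timeit(lambda: [x + 10 for x in random_numbers], number=1000)
numpy_time = timeit.timeit(lambda: random_numbers + 10, number=1000)

print(f"List comprehension: {list_time:.4f} s")
print(f"NumPy addition:     {numpy_time:.4f} s")

# Both approaches produce the same values
list_result = [x + 10 for x in random_numbers]
numpy_result = random_numbers + 10
print(np.allclose(list_result, numpy_result))
```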
To achieve this speed, NumPy uses its own data structure: the NumPy array. On the face of it, arrays look a lot like lists, but there are some key differences. The elements of an array are stored sequentially in RAM (memory), whereas a list stores references to objects that can be scattered throughout memory. That might sound a little complicated, so to give an analogy, consider the time saved if you were making a meal and had all the ingredients lined up in order vs having to find the ingredients all around the house. Lists can store any mix of datatypes, but every element of an array must have the same datatype, e.g., all integers or all floats.
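The single-datatype rule can be seen in a small sketch. The names `mixed_list` and `mixed_array` are made up for this illustration; when a list containing both integers and floats is converted, NumPy picks one datatype that can hold every element.

```python
import numpy as np

mixed_list = [1, 2.5, 3]            # a list keeps each element's own type
mixed_array = np.array(mixed_list)  # an array converts everything to one type

print([type(x).__name__ for x in mixed_list])  # ['int', 'float', 'int']
print(mixed_array.dtype)                        # float64
```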
Creating Arrays#
Arrays can be initialised with a list:
np.array([0, 1, 2, 3, 4])
We can also make multidimensional arrays:
np.array([[0, 1, 2, 3, 4], [5, 7, 9, 11, 13]]) # Creates a 2D array
Or use Python ranges:
np.array(range(5))
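Each array records how many dimensions it has and how long each dimension is, which can help when checking that an array was built as intended. A minimal sketch, reusing the 2D array from above (the name `two_d` is only for illustration):

```python
import numpy as np

two_d = np.array([[0, 1, 2, 3, 4], [5, 7, 9, 11, 13]])

print(two_d.ndim)   # 2
print(two_d.shape)  # (2, 5)
```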
Predict what the result will be before running the following code:
np.array([range(5), range(1, 6), range(2, 7)])
Slicing Arrays#
Often, you may want to select a section of an array for processing, or to examine the results. In a one-dimensional array, this is done in the same way as for a list:
my_array = np.array(range(10))
my_array[0] # The first element of the array
Can you write some code to access the second to the fifth elements of the array?
# Your code here
In a multi-dimensional array, we can do something very similar, but we need to specify an index for each dimension:
my_multidimensional_array = np.array([range(10), range(10, 20)])
my_multidimensional_array[0, 5] # gives me the sixth element of the first row
Try printing out the array above to manually verify that the index is doing what you expect.
Slicing more than one dimension can be a little harder to work out, but the format is array[row_index, column_index].
This can be extended to select only ranges:
array[row_index_start:row_index_end, column_index_start:column_index_end]
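As a small sketch of the range notation, the example below uses a separate 3x4 array so that the exercises that follow are left open. The name `grid` is made up for this illustration. As with lists, the end index of each range is excluded.

```python
import numpy as np

# A 3x4 array holding the values 0 to 11
grid = np.array(range(12)).reshape(3, 4)

# Rows 0 and 1, columns 1 and 2
print(grid[0:2, 1:3].tolist())  # [[1, 2], [5, 6]]

# A bare colon selects everything along that dimension: every row, column 0
print(grid[:, 0].tolist())      # [0, 4, 8]
```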
Above, you wrote code to access the second to the fifth elements of an array. Can you rewrite it using the new notation?
# Your code here
Can you do something similar for my_multidimensional_array, accessing the second to the fifth elements of both rows?
# Your code here
Generating Random Numbers#
Just like the random library, NumPy can generate pseudo-random numbers. Unlike the random library, NumPy can generate an array of these values with any shape. For instance, to generate pseudo-random numbers uniformly distributed between 0 and 1 in an array with 4 rows and 5 columns:
np.random.rand(4, 5)
NumPy also offers other distributions, e.g., exponential, Laplace, etc.:
np.random.exponential(1, (4, 5))
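Because the numbers are pseudo-random, the same sequence can be reproduced by fixing a seed. The sketch below assumes NumPy 1.17 or later, where `np.random.default_rng` is available. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.random((4, 5))  # uniform between 0 and 1, shape (4, 5)
sample_b = rng_b.random((4, 5))
print(np.array_equal(sample_a, sample_b))  # True

# Other distributions are available as methods on the generator
exponential_sample = rng_a.exponential(1, (4, 5))
print(exponential_sample.shape)  # (4, 5)
```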
Broadcasting Arrays#
Element-wise numerical processing can be performed on NumPy arrays:
# multiplication
print(np.array([1, 2, 3]) * np.array([1, 2, 3])) # [1*1, 2*2, 3*3]
# addition
print(np.array([1, 2, 3]) + np.array([1, 2, 3])) # [1+1, 2+2, 3+3]
# subtraction
print(np.array([1, 2, 3]) - np.array([1, 2, 3])) # [1-1, 2-2, 3-3]
# division
print(np.array([1, 2, 3]) / np.array([1, 2, 3])) # [1/1, 2/2, 3/3]
This is straightforward when the arrays are the same shape. When they’re not, NumPy does something clever known as broadcasting. For example, when adding a number to an array, behind the scenes the number is broadcast to the same shape as the array, and the two are then added together. So, when we run:
my_array + 2
What’s actually happening is:
broadcasted_2s = np.broadcast_to(2, my_array.shape)
print(broadcasted_2s)
my_array + broadcasted_2s
This is incredibly useful, especially when dealing with multiple dimensions, but there is a caveat: the array shapes must be compatible. Two shapes are compatible when, comparing their dimensions from right to left, each pair of dimensions is either equal or one of them is 1. For example, these array shapes are compatible:
(5, 4) & (4)
(5, 4) & (5, 1)
(5, 4) & (1)
So, in example 1, a 5x4 array can be multiplied by a 4-member array, because each row is multiplied individually by the array with 4 members. In example 2, each column is multiplied by the 5x1 array, and in the final example, the whole array is multiplied by a single value.
print(f"The shape of my_multidimensional_array: {my_multidimensional_array.shape}")
# Compatible shapes are (10), (2, 1), and (1)
print("Multiplying by an array of shape (10)")
print(my_multidimensional_array * np.array(range(10)))
print("Multiplying by an array of shape (2, 1)")
print(my_multidimensional_array * np.array(range(2)).reshape(2, 1))
print("Multiplying by a scalar, which behaves like an array of shape (1)")
print(my_multidimensional_array * 5)
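For contrast, here is a minimal sketch of what happens when shapes are not compatible. The arrays and names are made up for this illustration: a 5x4 array and a 5-member array do not match, because the rightmost dimensions (4 and 5) differ and neither is 1.

```python
import numpy as np

five_by_four = np.ones((5, 4))
five_members = np.ones(5)  # the rightmost dimensions, 4 and 5, differ

broadcast_failed = False
try:
    five_by_four + five_members
except ValueError as error:
    broadcast_failed = True
    print(f"Broadcasting failed: {error}")

# Reshaping the second array to (5, 1) makes the shapes compatible
result = five_by_four + five_members.reshape(5, 1)
print(result.shape)  # (5, 4)
```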
Create a new array with shape (8, 5) and add, subtract, divide, and multiply the array by compatibly shaped arrays. You may want to use np.random with your favourite distribution to do this!
# Your code here
Useful Mathematical Functions#
NumPy contains a number of useful mathematical functions including:
mean
median
min
max
sum
standard deviation (std)
variance (var)
percentiles (percentile)
all following the format:
random_array = np.random.rand(10)
print(f"Mean: {np.mean(random_array)}")
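As a quick sketch of the functions listed above, here they are applied to a small fixed array so that the results are easy to check by hand. The name `values` is made up for this illustration. Note that `np.percentile` takes the percentile to compute as a second argument.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

print(f"Mean: {np.mean(values)}")      # 3.0
print(f"Median: {np.median(values)}")  # 3.0
print(f"Min: {np.min(values)}")        # 1
print(f"Max: {np.max(values)}")        # 5
print(f"Sum: {np.sum(values)}")        # 15
print(f"Standard deviation: {np.std(values):.3f}")     # 1.414
print(f"Variance: {np.var(values)}")                   # 2.0
print(f"50th percentile: {np.percentile(values, 50)}") # 3.0
```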
Create a function that summarises some data. Include any metrics you find useful, but make sure to include the range and the interquartile range (IQR).
# Your code here
Other Useful Functions#
NumPy contains a number of other useful functions:
np.arange(0, 10, 2.5) # Create an array containing values from 0 to 10 in steps of 2.5 (end point exclusive)
arange is similar to the range built-in function we’ve seen before, but allows non-integer steps!
np.linspace(0, 10, 5) # Generate 5 numbers between 0 and 10 (end point inclusive)
linspace returns evenly spaced numbers between the start and end point, but crucially includes the end point.
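Printing the two results side by side makes the difference in end-point handling visible. A small sketch, with variable names chosen only for illustration:

```python
import numpy as np

stepped = np.arange(0, 10, 2.5)  # step of 2.5, end point excluded
spaced = np.linspace(0, 10, 5)   # 5 values, end point included

print(stepped.tolist())  # [0.0, 2.5, 5.0, 7.5]
print(spaced.tolist())   # [0.0, 2.5, 5.0, 7.5, 10.0]
```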
np.zeros((5, 2)) # Creates a 5x2 array filled with 0.
array([[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.]])
zeros is useful when we know how big an array will be and want to update it incrementally. A similar function exists for ones.
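A minimal sketch of the incremental-update pattern described above: the array is created at its final size and then filled in one element at a time. The name `squares` and the values stored are made up for this illustration.

```python
import numpy as np

# Create the array at its final size, filled with zeros
squares = np.zeros(5)

# Update it one element at a time
for i in range(5):
    squares[i] = i ** 2

print(squares.tolist())  # [0.0, 1.0, 4.0, 9.0, 16.0]
```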
Sometimes we have 2 data sources and want to combine them into a single source. We can use np.concatenate or other functions:
rand_arr_1 = np.random.rand(5, 5)
rand_arr_2 = np.random.rand(5, 1)
np.hstack([rand_arr_1, rand_arr_2]) # stacks the arrays together horizontally, so the final shape is 5x6 (5 rows and 6 columns)
hstack concatenates arrays horizontally, i.e., along axis=1 (the column axis). This is equivalent to np.concatenate([rand_arr_1, rand_arr_2], axis=1).
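The equivalence between the two calls can be checked directly. A short sketch, where the names `stacked` and `concatenated` are chosen only for illustration:

```python
import numpy as np

rand_arr_1 = np.random.rand(5, 5)
rand_arr_2 = np.random.rand(5, 1)

stacked = np.hstack([rand_arr_1, rand_arr_2])
concatenated = np.concatenate([rand_arr_1, rand_arr_2], axis=1)

print(stacked.shape)                          # (5, 6)
print(np.array_equal(stacked, concatenated))  # True
```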
Can you concatenate these arrays vertically, i.e., stick the second array underneath the first array?
HINT: You may need to look up a transpose
Why could you not append the arrays in their original shape?
# Your code here
I/O#
I/O stands for Input/Output, which essentially means reading from and writing to files. We don’t tend to use NumPy for this; generally we’ll use the pandas library, but you don’t learn that until day 3!
One common format for storing data is the “csv” (comma-separated values) format:
np.savetxt("random_data.csv", random_array, delimiter=",") # Saves random_array in a file called random_data.csv
array_from_disk = np.genfromtxt("random_data.csv", delimiter=",") # Reads in the csv file "random_data.csv"
Verify that random_array
and array_from_disk
are indeed the same.
# Your code here
Memory Management#
It’s more efficient to process NumPy arrays row by row rather than column by column. This is due to the way the arrays are stored in memory (RAM) by default. If you’re interested, please see advanced numpy.
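The default memory layout can be inspected on any array. In the sketch below (the name `grid` is made up for this illustration), `strides` reports how many bytes separate neighbouring elements along each dimension: moving one column along skips a single 8-byte element, while moving one row down skips a whole row of 4 elements.

```python
import numpy as np

grid = np.zeros((3, 4))  # float64 by default, so each element takes 8 bytes

print(grid.flags["C_CONTIGUOUS"])  # True: rows are laid out one after another
print(grid.strides)                # (32, 8)

# The transpose views the same memory, so its rows are no longer contiguous
print(grid.T.flags["C_CONTIGUOUS"])  # False
```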
Task#
Consider the following scenario: patients have been enrolled on a weight-loss programme and their BMI is tracked each week. We have been tasked with providing information on the data.
Each row represents a different patient, each column represents a different week.
patient_data = np.array(
[
[26.2, 26.1, 25.8, 25.9, 25.7],
[30.3, 30.4, 30.1, 30.2, 30.1],
[27.1, 26.2, 25.6, 24.0, 23.4],
[25.1, 24.3, 24.3, 24.0, 24.1],
[29.4, 30.1, 29.5, 29.5, 29.5],
]
)
Using the function you wrote earlier as a starting point, can you generalise it to provide summary statistics for each patient?
# Your code here
What was the
largest change in BMI?
smallest change in BMI?
average change in BMI?
HINT: Think carefully about these questions and whether the answers make sense
# Your code here