Pandas#
Knowledge of Pandas is a must-have for anyone wishing to use Python for data analysis! Pandas is a powerful open-source library that provides new data structures, such as the Series and the DataFrame, which allow you to efficiently handle relational data. Over the next few cells, you’ll discover the key features of Pandas. This notebook is designed to provide you with the basics of Pandas, before we look at a case study with real data in the next notebook.
import pandas as pd # We normally import pandas with the alias 'pd'
import numpy as np
Recap Task#
Create a function that will take in a tuple containing a shape and returns a random array. E.g., if I pass (5, 5) I should end up with a 5x5 array of random numbers.
Then use your function to create a 10x10 array and save it as a variable named data
# YOUR CODE HERE
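A possible solution sketch (the function name random_array is just an illustrative choice); the data variable is used in the cells that follow:
# One possible answer: unpack the shape tuple into np.random.rand
def random_array(shape):
    """Return an array of random numbers with the given shape (a tuple)."""
    return np.random.rand(*shape)

data = random_array((10, 10))  # 10x10 array of random numbers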
DataFrames#
Now that we have some data, we can create a DataFrame. This is ordinarily done using a pd.read_ command, e.g., pd.read_sql takes in connection information for a database and a SQL query, and returns a DataFrame containing the results. We’ll look more at read_ later today, but for now, we’ll use the random data.
# We can turn the data into a DataFrame
df = pd.DataFrame(data) # df is a standard variable name for a DataFrame
# We can take a look at the dataframe by just typing df
df
# or we can look at the top n rows (default 5)
df.head()
# or the final n rows (default 5)
df.tail()
Notice that the columns are ‘labelled’ as 0 - 9. This is because when we created the DataFrame we didn’t provide any labels for the columns. We could recreate it with column names, or rename the columns:
df.rename({0:"ColumnA", 1:"ColumnB"}, axis=1) # will rename the first 2 columns
NOTE:#
This does not save the changes made to the DataFrame. Many commands in Pandas produce a new DataFrame with the changes applied, rather than making them in place (it’s always worth checking the behaviour though). We could change that by providing the keyword argument inplace=True for some functions, or we could save over the variable df, e.g.,
df = df.rename({0:"ColumnA", 1:"ColumnB"}, axis=1)
but for now, we’ll leave it with the default numeric columns.
Summary Statistics#
We can also get some summary statistics from the DataFrame:
df.describe()
This works well with numeric data, but let’s try with some nominal data.
categories = np.random.choice(["red", "blue", "green"], 10) # create a random column containing red, green and blue.
categories = pd.Series(categories, name="colours") # turn it into a Pandas series with the label 'colours'
print(categories) # take a look at the new column
df = pd.concat([df, categories], axis=1) # Creates a new DataFrame with the new column added on the end
Task#
Take a look at the new DataFrame to check that the new column is there, then generate the summary statistics. What do you notice?
# your code here
Let’s take a look at the datatypes in the DataFrame to try and understand this:
df.dtypes
You should be able to see that the first 10 columns are of type float64 and the final column is an object. Pandas tries to infer the datatype from the data if it’s not explicitly given.
Selecting Data#
Let’s take a closer look at the final 2 columns. There are a few ways to do this:
df.iloc[:, -2:] # iloc allows you to reference [rows, columns] by their index, e.g., here we want all the rows, and from the second to last column onwards.
df.loc[:, [9, "colours"]] # loc allows you to select columns by name [rows, [column_names]]
df[[9, "colours"]] # If you want all rows, you can also just specify the columns [[column_names]]
iloc is the least preferred way to select data; can you think of a reason why?
We can use loc or column selection to add columns too: if we reference a column that does not yet exist and set values to it, it will actually create a new column in the DataFrame!
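For example, a minimal sketch (the column name "tens" is purely illustrative, and we drop it again so the rest of the notebook is unaffected):
# Assigning to a column label that doesn't exist yet creates a new column
df["tens"] = 10                  # a constant is broadcast to every row
df.loc[:, "tens"] = df[0] * 10   # loc can assign to the new column in the same way
df = df.drop(columns="tens")     # remove the illustrative column again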
We can also convert data types; for categorical data, pandas has a special dtype:
df["colours"] = df.loc[:, "colours"].astype("category")
df.dtypes
Behind the scenes, this actually uses a code instead of the categories for processing; we can access these codes via:
df.loc[:, "colours"].cat.codes
Or if we wanted to see a map between the codes and the categories:
# Create a dictionary by enumerating what is returned from df["colours"].cat.categories (which will be the categories)
dict(enumerate(df["colours"].cat.categories))
Using the final 2 columns, let’s see if we can calculate the average value per colour. In pseudocode we want to:
- select the last 2 columns
- group by the colour
- calculate the mean
# Going through the pseudocode:
# - Select the last 2 columns, we know these are named 9 and "colours" and that we want all rows: df.loc[:, [9, "colours"]]
# - group by the colour: pandas has a groupby method for this exact thing!
# - calculate the mean: An object output from groupby needs more information to specify what to do with the groups,
# e.g., we could sum them, in this case, we just want mean().
df.loc[:, [9, "colours"]].groupby("colours").mean()
Task#
Adapt the above approach to calculate the mean values per colour for 2 columns in a single DataFrame.
Check the DataFrame looks how you expect, then save it as the variable grouped_df.
# Your code here
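A possible solution sketch, using columns 8 and 9 as the two example columns (any two numeric columns would work); the variable grouped_df is used in the plotting examples below:
# One possible answer: group two numeric columns (here 8 and 9) by colour and take the mean
grouped_df = df.loc[:, [8, 9, "colours"]].groupby("colours").mean()
grouped_df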
Basic Plotting#
It is possible to generate quick plots using Pandas. The plotting in Pandas is built on top of the common Python plotting library matplotlib, so it is possible to customise these plots, but better approaches to plotting will be covered in a forthcoming session. For now, the purpose of this is to get a feel for the data quickly.
The basic format for plotting is df.plot() for a line graph, or df.plot.OTHERGRAPH, where OTHERGRAPH could be box, bar, kde, etc. (see the documentation for an extensive list).
We’ll use a bar plot to compare our average column values for each colour:
grouped_df.plot.bar()
Task#
Using the same 2 columns you selected above, create a boxplot.
# Your code here
Pivot and Crosstab#
We’ve already seen the groupby method, but there are other ways to manipulate a DataFrame; some of the most common are pivot and crosstab. pivot acts much like a pivot table in Excel, allowing you to choose which columns become the index and the columns, and aggregating the values together.
Task#
Start by adding a new column to df with a random choice of "A" or "B" and call it "alpha/beta".
For bonus marks, can you save the new column with a categorical data type?
# Your code here
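A possible solution sketch (the column is needed for the pivot-table examples below; the categorical conversion covers the bonus):
# One possible answer: a random choice of "A" or "B" for every row
df["alpha/beta"] = np.random.choice(["A", "B"], size=len(df))
df["alpha/beta"] = df["alpha/beta"].astype("category")  # bonus: store it with the categorical dtype
df.dtypes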
# This should display a 12 column DataFrame with 10 random float64 columns, a column labelled "colours" and a column labelled "alpha/beta".
df
We can use our 2 new columns to calculate the mean value in any other column, for instance column 4, for each combination:
pd.pivot_table(df, index = "colours", columns = "alpha/beta", values = 4, aggfunc = "mean")
NOTE#
You may see NaN in our pivot_table; this stands for Not a Number and means there is a missing value. E.g., if green & B is NaN, there are no values where colours is green and alpha/beta is B.
Task#
Make a pivot table out of a different column, and use a different aggregation function.
HINT: If you’re unsure about the options available, check the documentation.
# Your code here
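A possible solution sketch, using column 3 and the standard deviation as the aggregation (both choices are arbitrary):
# e.g., the standard deviation of column 3 for each colours / alpha-beta combination
pd.pivot_table(df, index="colours", columns="alpha/beta", values=3, aggfunc="std")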
crosstabs are a special case of pivot_table that return the count of 2 (or more) variables. You can achieve something very similar with aggfunc='count' in a pivot_table.
Notice any 0s in the DataFrame below; these will be located in the same places as the NaN values above.
pd.crosstab(index = df["colours"], columns = df["alpha/beta"])
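For comparison, a roughly equivalent pivot_table (note that combinations with no rows appear as NaN here rather than 0; the choice of column 0 as the values column is arbitrary):
# Roughly equivalent to the crosstab above: count the values in column 0 for each combination
pd.pivot_table(df, index="colours", columns="alpha/beta", values=0, aggfunc="count")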
NOTE#
There is a difference between NaN and 0: the former is like a blank cell in an Excel spreadsheet and does not indicate that there are no values, whereas the latter states that we know there are no values.
Checkpoint#
Run the cell below; don’t worry too much about the details. This loads a breast cancer dataset as cancer_df with different properties of the cancer and whether it was malignant or benign.
from sklearn.datasets import load_breast_cancer
lbc = load_breast_cancer(as_frame=True)
lbc_map = dict(enumerate(lbc.target_names))
cancer_df = lbc.data
cancer_df["Type"] = lbc.target.map(lbc_map)
cancer_df["radius group"] = pd.cut(cancer_df["mean radius"], bins=[*range(6, 30, 2)])
There are many columns in this dataset, but you’re going to use the final 2, Type and radius group, to create a pivot table with another column of your choice, then plot the results.
You’ll initially want to take a look at the dataset, do a little bit of data exploration, and decide on another column to use in the pivot table.
# Your code here
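One possible approach (the choice of "mean area" as the value column is arbitrary; any other numeric column would do):
# A possible solution sketch: the average "mean area" for each radius group / Type combination, then a bar plot
cancer_pivot = pd.pivot_table(cancer_df, index="radius group", columns="Type", values="mean area", aggfunc="mean")
cancer_pivot.plot.bar()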
Missing Data#
Let’s suppose that some of our data was missing from the original DataFrame, say ~10% of our numeric data:
df_missing_data = df.copy()
indices = np.random.choice([*range(10)], (10, 2))
for x, y in indices:
    df_missing_data.iloc[x, y] = np.nan
df_missing_data
First, let’s calculate how much of our data is missing. This won’t equal exactly 10%, because the total size includes the qualitative columns, which we didn’t remove any data from.
f"Missing data: {df_missing_data.isna().values.sum() / df_missing_data.size * 100 :.2f}%"
Breaking down the above:
- f": Specifies that this is an ‘f-string’, which is a string that can contain code encapsulated in curly brackets ({}).
- df_missing_data.isna(): This returns a DataFrame containing True and False, where the True values are in the place of any missing data.
- .values: Retrieves the underlying NumPy array.
- .sum(): This is a neat little trick in Python, where True and False are the same as 1 and 0 respectively, so by adding up all the True (or 1) values, we get the total number of missing values.
- df_missing_data.size: Returns the number of elements in the DataFrame.
- * 100: Hopefully by now this is self-explanatory!
- :.2f: An f-string formatting trick; we’re stating that we only want to display 2 decimal places for our float.
Have a look at where the data has been removed from your dataframe:
df_missing_data
This can be hard to see sometimes, so we can highlight missing values to help quickly locate them:
df_missing_data.style.highlight_null()
Machine learning algorithms are often very sensitive to missing data, so we need to try and make a ‘best guess’ to fill in the data (imputation); you will learn more robust algorithms for doing this in the future.
We have a few options here; perhaps we want to fill in the missing data with the mean per column:
means = df_missing_data.select_dtypes(np.float64).mean(axis=0)
df_means_filled = df_missing_data.fillna(means, axis=0)
df_means_filled
Perhaps we want to use our qualitative columns to provide some insight:
df_category_filled = df_missing_data.copy()
# Mean value for each (colour, alpha/beta) combination, averaged across the numeric columns
colour_alpha_beta_means = df.groupby(
    ["colours", "alpha/beta"]
).mean().mean(axis=1).to_dict()
# Fill each missing cell with the mean for that row's (colour, alpha/beta) combination
for row_index, col_index in zip(*np.where(df_category_filled.isna())):
    value = colour_alpha_beta_means[
        tuple(
            df_category_filled.loc[row_index, ["colours", "alpha/beta"]].values
        )
    ]
    df_category_filled.iloc[row_index, col_index] = value
For this, we can easily see which imputation method fits better:
def calculate_rmse(imputed_df, original_df, columns):
    """
    Calculate the Root Mean Square Error (RMSE) between imputed and original DataFrame for specified columns.

    Parameters:
    imputed_df (DataFrame): DataFrame with imputed values.
    original_df (DataFrame): Original DataFrame with true values.
    columns (List[str]): List of column names to calculate RMSE for.

    Returns:
    float: Root Mean Square Error (RMSE) value.
    """
    diff = imputed_df.loc[:, columns] - original_df.loc[:, columns]
    rmse = np.sqrt(np.mean(diff**2))
    return rmse
print(f"RMSE when using a category fill imputation: {calculate_rmse(df_category_filled, df, [*range(10)]) :.3f}")
print(f"RMSE when using a means fill imputation: {calculate_rmse(df_means_filled, df, [*range(10)]) :.3f}")
# In this case, it's likely to be very similar! You'll look at better imputation methods in a forthcoming section.
Combining Datasets#
Combining datasets can be done in a variety of ways; we can do a vertical mash together:
# Creating a new dataframe that's similar to the first, feel free to print it out and have a look at it!
new_df = pd.DataFrame(
    zip(
        *np.random.rand(10, 10),
        np.random.choice(["red", "blue", "green"], size=10),
        np.random.choice(["alpha", "beta"], size=10)
    )
).rename({10:"colours", 11:"alpha/beta"}, axis=1)
# concat sticks the new dataframe on the bottom of the first, notice the indices in the following
pd.concat([df, new_df])
We can do a horizontal mash together:
# Creates a new dataframe containing 2 columns
extra_cols = pd.DataFrame(
    zip(
        np.random.choice(["yes", "no"], size=10),
        np.random.choice(["left", "both", "right"], size=10, p=[0.45, 0.1, 0.45])
    ), columns = ["yes/no", "dominant_hand"]
)
# Squashes them together horizontally
pd.concat([df, extra_cols], axis=1)
We can also merge DataFrames together (this is similar to the behaviour of joins in SQL).
# This code does a few things, see if you can work out what's going on, feel free to pull it apart to test each bit!
new_df = df.groupby(
    ["colours", "alpha/beta"]
).mean().reset_index().dropna()[["colours", "alpha/beta", 9]].rename({9:10}, axis=1)
# Look out for the colour purple in the "colours" column!
new_df.loc[len(df.index)] = ("purple", "A", 1)
# Inner joins merge dataframes together with keys that appear in both datasets
pd.merge(left=df, left_on=["colours", "alpha/beta"], right=new_df, right_on=["colours", "alpha/beta"], how="inner") # so no purple value should appear
# We can also specify a left join, which keeps all keys that appear in the left dataframe
pd.merge(left=df, left_on=["colours", "alpha/beta"], right=new_df, right_on=["colours", "alpha/beta"], how="left")
# Similarly, we can also specify a right join, which keeps all keys that appear in the right dataframe
pd.merge(left=df, left_on=["colours", "alpha/beta"], right=new_df, right_on=["colours", "alpha/beta"], how="right") # That purple value should appear now!
TASK#
Read through the types of merges in pandas (https://pandas.pydata.org/docs/reference/api/pandas.merge.html#) and try an outer and a cross merge.
# Your code here
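A possible sketch of the two remaining merge types (the suffixes pandas adds for overlapping column names are left at their defaults):
# Outer join: keeps keys from either dataframe, so the purple row appears with NaNs in df's columns
pd.merge(left=df, right=new_df, on=["colours", "alpha/beta"], how="outer")
# Cross join: pairs every row of df with every row of new_df (no key columns are used)
pd.merge(left=df, right=new_df, how="cross")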
Time series data#
Let’s say the data we had was actually taken on certain days:
number_of_months = 10
years_months_days = {"year": [2023]*number_of_months, "month": [*range(1, number_of_months+1)], "day": [1]*number_of_months}
date_df = pd.DataFrame(years_months_days)
display(date_df)
# we can convert this into a single column:
dates = pd.to_datetime(date_df)
dates
TASK#
Add the new column to your dataframe with the column label of “dates” and verify the datatype is what you’re expecting.
# Your code here
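A possible solution sketch (assumes the dates Series created above is still in scope):
# Add the dates as a new column and check its dtype (it should be datetime64[ns])
df["dates"] = dates
df.dtypes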
Pandas has lots of utility functions for dealing with time series data. We touch upon them here, but this will largely be covered in a forthcoming section. Let’s create some random data:
dates = pd.date_range(start='1/1/2023', end='31/1/2023')
random_numbers = np.random.rand(dates.shape[0])
# notice that we set the dates to the index, this doesn't have to be done at creation,
# but makes things easier if we want to take advantage of pandas time series utilities.
sample_time_df = pd.DataFrame(index = dates, data=random_numbers)
sample_time_df
We’ll go through 4 main functions that make working with time series data really easy in pandas.
shift: we can move our values forwards or backwards using shift:
sample_time_df.shift(3)
resample: the values can be resampled at a different time frequency, e.g., weekly, and aggregated over each period (here we take the sum for each week):
sample_time_df.resample('W').sum()
asfreq: returns the value at the end of the time period.
sample_time_df.asfreq('W')
rolling: returns a rolling window that can be used with an aggregation function (e.g., mean):
sample_time_df.rolling(3).mean()
NOTE#
Most of these functions will only work if the date is set as the index; if the date column is not automatically set as the index, it can be set using df.set_index(NAME_OF_COLUMN).
Often pandas struggles to parse a date string; in some cases you may need to provide a format string to pd.to_datetime to ensure correct parsing (see the table here).
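For example, a minimal sketch (the column name "date", the sample values, and the format string are all illustrative):
# Illustrative only: parse strings with an explicit format, then move the column into the index
example = pd.DataFrame({"date": ["01/02/2023", "08/02/2023", "15/02/2023"], "value": [1.0, 2.0, 3.0]})
example["date"] = pd.to_datetime(example["date"], format="%d/%m/%Y")
example = example.set_index("date")
example.resample("W").sum()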
TASK#
Can you create a plot that contains all of your numeric columns?
Under that plot, can you also plot the rolling total of the numeric values? (TIP: start by writing the pseudocode)
# Your code here
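A possible sketch, interpreting "numeric columns" as the float columns of df and "rolling total" as a rolling sum with a window of 3 rows (other interpretations, e.g., using sample_time_df, are equally valid):
# Plot every numeric column, then the rolling total of the same columns underneath
numeric = df.select_dtypes(float)
numeric.plot()                    # line plot of all numeric columns
numeric.rolling(3).sum().plot()   # rolling total with a window of 3 rows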
Apply and map#
If we want to apply a function to each value in a dataframe column, we could iterate through it:
add_one = lambda row: row + 1
[add_one(row) for row in df[0]]
But pandas also provides the apply method:
df[0].apply(add_one)
We can apply functions to entire dataframes too:
df.select_dtypes(float).apply(add_one)
Similarly, if we want to map a value to something, we can use a dictionary and iterate:
mapping_dict = {"A": "alpha", "B":"beta"}
[mapping_dict[row] for row in df["alpha/beta"]]
Or we can use the map functionality:
df["alpha/beta"].map(mapping_dict)
apply and applymap can also be used for conditional formatting. If a cell’s colour depends only on its own value, we can use applymap:
def color_formatter(val):
    """
    Formats the input value as a CSS color style.

    Parameters:
    val (str): The value representing a color.

    Returns:
    str: A string representing a CSS color style in the format "color: <color>".

    Example:
    >>> color_formatter("red")
    'color: red'
    """
    color = "black"
    if val in {"red", "green", "blue"}:
        color = val
    return f"color: {color}"
df.style.applymap(color_formatter)
If we wanted to highlight the maximum value in a column (or row), we couldn’t use applymap since it acts element-wise, but we could use apply:
def color_max(col):
    """
    Assigns CSS styles to elements in a column based on the maximum value.

    Parameters:
    col (Series): A column of values.

    Returns:
    list: A list of CSS style strings corresponding to each element in the input column.
    The style for the element with the maximum value is "background: green; color: white", while
    other elements are assigned the style "color: black".

    Example:
    >>> color_max(pd.Series([1.0, 2.0, 3.0, 2.0, 1.0]))
    ['color: black', 'color: black', 'background: green; color: white', 'color: black', 'color: black']
    """
    colors = ["color: black"] * len(col)
    if col.dtypes == np.float64:
        colors[np.argmax(col)] = "background: green; color: white"
    return colors
df.style.apply(color_max)
We can also apply colormaps:
import seaborn as sns
cm = sns.color_palette("Blues", as_cmap=True)
df.style.background_gradient(cmap=cm)
TASK#
Update the colours column of the dataframe, df, to use the first letter of each colour in lowercase. Then display the dataframe showing the corresponding colours (e.g., ‘r’ should be shown in red, etc.).
# Your code here
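One possible sketch: map the colours to their first letters, then reuse the colour-formatting idea from above with a small lookup dictionary (the dictionary name is illustrative):
# Map each colour to its first letter, keeping the categorical dtype
df["colours"] = df["colours"].astype(str).str[0].astype("category")
# Style the dataframe so that 'r', 'g' and 'b' are displayed in their corresponding colours
first_letter_to_colour = {"r": "red", "g": "green", "b": "blue"}
df.style.applymap(lambda val: f"color: {first_letter_to_colour.get(val, 'black')}")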
# Once you're finished, run this code. It should be error free if you have done the task successfully.
assert not (set(df["colours"]) - {"r", "g", "b"}), "Looks like you have values other than 'r', 'g', 'b' in your dataframe!"
Congratulations on getting this far; this concludes the pandas introduction! This afternoon you’ll be putting this all into practice on a real dataset.