Pandas & Preprocessing Checkpoint Task
In this notebook you will be applying your knowledge to analyse a real dataset. Take a few minutes to read about the dataset here: https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
def load_dataset() -> pd.DataFrame:
    """
    Load the diabetes dataset and introduce NaN values randomly into the DataFrame.

    Returns:
        pd.DataFrame: DataFrame containing the diabetes dataset with NaN values.
    """
    df = load_diabetes(as_frame=True, scaled=False).frame
    num_nans = int(df.size * 0.05)  # number of values to make NaN (5%)
    nan_indices = np.random.choice(range(df.size), size=num_nans, replace=False)
    for index in nan_indices:
        x, y = divmod(index, df.shape[1])
        df.iloc[x, y] = np.nan
    return df
Read through the function load_dataset and try to understand what it’s doing. When you’re happy, call it below to create your dataset with the variable name df
# Your code here
Examine the first 10 rows of the dataset and answer the following (using code):
How many features and samples are there? What are the names of the features?
# Your code here
From the documentation, the feature names have been abbreviated from:
bmi: body mass index
bp: average blood pressure
s1: tc, total serum cholesterol
s2: ldl, low-density lipoproteins
s3: hdl, high-density lipoproteins
s4: tch, total cholesterol / HDL
s5: ltg, possibly log of serum triglycerides level
s6: glu, blood sugar level
target: qualitative measure of disease progression one year after baseline
Rename the features to something that you’ll find more helpful, e.g. s1 -> total_serum_cholesterol
# Your code here
Identify how much of the data is missing
# Your code here
Using your knowledge of imputation, fill in the missing data. Be prepared to justify why you’ve used your chosen method!
# Your code here
# Run this cell to make sure you have no missing data.
assert df.isna().values.sum() == 0, "Looks like some of your data is still missing!"
Pandas will try to infer the datatype from the data, but it sometimes gets this wrong. Check the datatypes of the data:
# Your code here
If you’re happy with the datatypes of your features, then move on to the next question. If not, convert them here.
# Your code here
Create a function to calculate the mean, median, standard deviation, min, max and range. Pass your dataframe to the function and examine the results.
# Your code here
Depending on your dtype conversions, you may see a mean value for sex. This is not a meaningful statistic, so instead calculate the value counts for sex.
# Your code here
What do you notice? Is this dataset a fair representation?
Can you calculate the value counts for age:
# Your code here
The details are a little hard to see in the raw counts; can you generate a histogram for age instead?
# Your code here
Let’s examine whether there’s any correlation between age and our ‘target’. Group by age, calculate the mean value of the target, and plot the resulting graph.
# Your code here
From this graph it’s hard to see any trends, so try grouping the ages into categories (18-28, 28-38, etc…). Put this into the dataframe as a new feature with the name “age_group”.
HINT: Look at the documentation for pd.cut
# Your code here
Plot a bar chart showing the counts for each of the categories in age_group.
# Your code here
Now let’s compare our new feature with the target. Group by the age group, calculate the mean and plot the resulting graph.
# Your code here
EXTENSION QUESTION
Perform a one-way chi-square test to get an indication for statistical significance of this data.
from scipy.stats import chisquare
# Your code here
Taking an approach like this can help to iron out some outliers in the data.
It is clear from the results above that age does have a correlation with the target. Let’s convert the categorical column age_group to use the categorical codes:
# Your code here
Try to identify any other correlations that exist in the dataframe by calculating the correlation matrix.
# Your code here
This is easier to visualise in a plot. A convenience function, plot_correlation_matrix, has been created for you; you don’t need to worry about its details at this time. Pass your correlation matrix to it and observe the plot.
import seaborn as sns
from matplotlib import pyplot as plt

def plot_correlation_matrix(corr):
    """
    Plots a correlation matrix as a heatmap.

    Parameters:
        corr (DataFrame): The correlation matrix to be plotted.

    Returns:
        None: The function displays the correlation matrix heatmap.
    """
    fig = plt.figure(figsize=(7, 7))
    sns.heatmap(corr, annot=True, fmt='.1f', vmin=-1, vmax=1, cmap='RdBu')
    plt.show()
# Your code here
Pick 2 features with a high correlation with the target. Comment on these correlations using graphs to justify your comments.
# Your code here
Machine learning algorithms tend to behave poorly when there are columns with strong correlation to each other. Can you identify and remove any columns that display such behaviour?
# Your code here
Let’s examine how spread out each feature is by using box plots:
# Your code here
The problem is that each feature is on a different scale. Can you standardise the numerical features so that they’re mean-centred and have a standard deviation of 1?
NOTE: Categorical variables should not be standardised
# Your code here
Now, re-examine the box plot:
# Your code here
Congratulations on reaching this point! Now the data would be ready to input into a machine learning model (or dimensionality reduction). You’ll cover some of these in an upcoming session.