Pandas & Preprocessing Checkpoint Task
In this notebook you will be applying your knowledge to analyse a real dataset. Take a few minutes to read about the dataset here: https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
def load_dataset() -> pd.DataFrame:
    """
    Load the diabetes dataset and introduce NaN values randomly into the DataFrame.

    Returns:
        pd.DataFrame: DataFrame containing the diabetes dataset with NaN values.
    """
    df = load_diabetes(as_frame=True, scaled=False).frame
    num_nans = int(df.size * 0.05)  # number of values to make NaN (5%)
    nan_indices = np.random.choice(range(df.size), size=num_nans, replace=False)
    for index in nan_indices:
        x, y = divmod(index, df.shape[1])
        df.iloc[x, y] = np.nan
    return df
Read through the function load_dataset and try to understand what it’s doing. When you’re happy, call it below to create your dataset with the variable name df
# Your code here
Examine the first 10 rows of the dataset and answer the following (using code):
How many features and samples are there? What are the names of the features?
# Your code here
From the documentation, the feature names have been abbreviated from:
bmi: body mass index
bp: average blood pressure
s1: tc, total serum cholesterol
s2: ldl, low-density lipoproteins
s3: hdl, high-density lipoproteins
s4: tch, total cholesterol / HDL
s5: ltg, possibly log of serum triglycerides level
s6: glu, blood sugar level
target: qualitative measure of disease progression one year after baseline
Rename the features to something that you’ll find more helpful, e.g. s1 -> total_serum_cholesterol
# Your code here
Identify how much of the data is missing
# Your code here
Using your knowledge of imputation, fill in the missing data. Be prepared to justify why you’ve used your chosen method!
# Your code here
# Run this cell to make sure you have no missing data.
assert df.isna().values.sum() == 0, "Looks like some of your data is still missing!"
Pandas will try to infer the datatype from the data, but it sometimes gets this wrong. Check the datatypes of the data:
# Your code here
If you’re happy with the datatypes of your features, then move on to the next question. If not, convert them here.
# Your code here
Create a function to calculate the mean, median, standard deviation, min, max and range. Pass your dataframe to the function and examine the results.
# Your code here
Depending on your dtype conversions, you may see a mean value for sex. This is not a meaningful statistic, so instead calculate the value counts for sex.
# Your code here
What do you notice? Is this dataset a fair representation?
Can you calculate the value counts for age:
# Your code here
The details are a little hard to see in the raw counts; can you generate a histogram for age instead?
# Your code here
Let’s examine whether there’s any correlation between age and our ‘target’. Group by age, calculate the mean value of the target, and plot the resulting graph.
# Your code here
From this graph it’s hard to see any trends, so try grouping the ages into categories (18-28, 28-38, etc…). Put this into the dataframe as a new feature with the name “age_group”.
HINT: Look at the documentation for pd.cut
# Your code here
Plot a bar chart showing the counts for each of the categories in age_group.
# Your code here
Now let’s compare our new feature with the target. Group by the age group, calculate the mean and plot the resulting graph.
# Your code here
EXTENSION QUESTION
Perform a one-way chi-square test to get an indication for statistical significance of this data.
from scipy.stats import chisquare
# Your code here
Taking an approach like this can help to iron out some outliers in the data.
It is clear from the results above that age does have a correlation with the target. Let’s convert the categorical column age_group to use the categorical codes:
# Your code here
Try to identify any other correlations that exist in the dataframe by calculating the correlation matrix.
# Your code here
This is easier to visualise in a plot. A convenience function, plot_correlation_matrix, has been created for you; you don’t need to worry about its details at this time. Pass your correlation matrix to it and observe the plot.
import seaborn as sns
from matplotlib import pyplot as plt

def plot_correlation_matrix(corr):
    """
    Plots a correlation matrix as a heatmap.

    Parameters:
        corr (DataFrame): The correlation matrix to be plotted.

    Returns:
        None: The function displays the correlation matrix heatmap.
    """
    fig = plt.figure(figsize=(7, 7))
    sns.heatmap(corr, annot=True, fmt='.1f', vmin=-1, vmax=1, cmap='RdBu')
    plt.show()
# Your code here
Pick 2 features with a high correlation with the target. Comment on these correlations using graphs to justify your comments.
# Your code here
Machine learning algorithms tend to behave poorly when there are columns with strong correlation to each other. Can you identify and remove any columns that display such behaviour?
# Your code here
Let’s examine how spread out each feature is by using box plots:
# Your code here
The problem is that each feature is on a different scale. Can you standardise the numerical features so that they’re mean-centred and have a standard deviation of 1?
NOTE: Categorical variables should not be standardised
# Your code here
Now, re-examine the box plot:
# Your code here
Congratulations on reaching this point! Now the data would be ready to input into a machine learning model (or dimensionality reduction). You’ll cover some of these in an upcoming session.