Data Preprocessing#

Data preprocessing is an important step in machine learning and is a prerequisite for modelling your data.

The main goal of data preprocessing is to transform your dataset into a form suitable for modelling; doing so will also improve the performance of your model and provide more robust and reliable results.

Cleaning and Preparing Data#

There are many things to consider for preprocessing; not all of them will be relevant to your data, but they should still be checked for. Gathering, cleaning, and preparing your data will take, and should be expected to take, a significant proportion of your time and attention compared with the time spent on model development and model tuning.

Within this chapter the following topics will be covered:

    • Checking and Converting Data Types

    • Handle Missing Values

      • Remove Records with Missing Values

      • Removing Specific Columns Or Rows That Contain Missing Data

      • Impute Missing Values

    • Remove Duplicates

    • Removing Outliers

      • Interquartile Range (IQR) method

      • Z-score (Standard Score)

    • Dealing with Target Imbalance

      • SMOTE

      • Random Over-sampling

      • Random Under-sampling

Import the following libraries for this chapter:

It is good practice to import all the libraries that are used in your notebook at the very start. You may not know from the beginning all the libraries you intend to use, but they should all be added here as and when you need them.

# Main libraries
import pandas as pd
import random

# Removing Outliers section
import numpy as np
import matplotlib.pyplot as plt

# Dealing with Target Imbalance section
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
import seaborn as sns

Checking and Converting Data Types#

Before any preprocessing can begin, we must understand the data types of our features.
Pandas will automatically assign data types for your data when you import a dataset. Sometimes the types assigned will be incorrect, so it is important to check these and convert them to the correct data type.
The most common data types used in pandas are:

  • object: Contains string values, or a mixture of types.

  • int64: Whole numbers, equivalent to the native Python integer type; the 64 refers to the number of bits of memory allocated for storing each value.

  • float64: Decimal numbers.

  • datetime64: Dates and times - this special data type unlocks extra functionality for working with time series data, such as datetime indexing.

We have also seen in the previous module the category data type, which pandas uses specifically for categorical data.
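
As a quick illustration (using a small, made-up dataframe rather than the chapter dataset), columns can be converted to these special types with .astype("category") and pd.to_datetime():

# Hypothetical example: converting columns to 'category' and 'datetime64' types
example = pd.DataFrame({
    "ward": ["A", "B", "A", "C"],
    "admission_date": ["2024-01-03", "2024-01-05", "2024-02-11", "2024-03-20"]
})

example["ward"] = example["ward"].astype("category")                    # categorical data
example["admission_date"] = pd.to_datetime(example["admission_date"])   # datetime64[ns]

example.info()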

Data types can be easily checked in a dataset by using the .info() method.

Generate the below dataset from the function generate_demo_dataset1 and let’s look at the data types.

# generate_demo_dataset1
# import pandas as pd
# import numpy as np

def generate_demo_dataset1():
    """
    Generate a dataset for cleaning data examples.

    Returns:
    pandas.DataFrame: A DataFrame containing the generated dataset with columns:
    'category', 'a', 'b', 'c', 'd'.
    """
    category = ['category 1', 'category 2', 'category 2', 'category 1', 'category 1', 'category 1']
    a_values = (6.0, 2.0, 4.0, 3.0, 7.0, 5.0)
    b_values = [6.0, 3.0, 5.0, 1.0, 10.0, 8.0]
    c_values = [9.0, 4.0, 3.0, 3.0, 7.0, 1.0]
    d_values = [1.0, 2.0, 7.0, 9.0, 6.0, 2.0]

    # Create DataFrame
    data = {'category': category, 'a': a_values, 'b': b_values, 'c': c_values, 'd': d_values}
    df = pd.DataFrame(data)
    
    df['c'] = df['c'].astype("str")
    
    return df


df = generate_demo_dataset1()

print(df)

To inspect the data types use df.info().

# df = generate_demo_dataset1() # original dataframe
# print(df,"\n\n") # print dataframe with two carriage returns.

# inspect the data types of the dataframe
df.info()

The pandas .astype() method can be used to convert a column’s data type to a specified data type.
In order to do this we need to reassign the column to overwrite the original data when converting it.

Before converting a column be extra careful that all the values it contains can be appropriately converted to the new data type.
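
If you are not certain that every value will convert cleanly, one option (not used in the rest of this chapter) is pd.to_numeric with errors="coerce", which turns any unconvertible value into NaN instead of raising an error:

# Safely attempt a numeric conversion without modifying the original column:
# errors='coerce' replaces any value that cannot be parsed as a number with NaN
safe_converted = pd.to_numeric(df["c"], errors="coerce")

print(safe_converted)
print(safe_converted.dtype)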

Let’s check the values in column "c"

# df = generate_demo_dataset1() # original dataframe
# print(df,"\n\n") # print dataframe with two carriage returns.

# Check column "c" for its unique values
unique_values = df["c"].unique()

print(unique_values)

Check that column "c" is in fact data type 'object'.

# df = generate_demo_dataset1() # original dataframe
# print(df,"\n\n") # print dataframe with two carriage returns.

# Check the data type of the column
column_dtype = df['c'].dtype

print(column_dtype)

To change column "c" to data type 'float' we need to overwrite the original column with the new data type.

# df = generate_demo_dataset1() # original dataframe
# print(df,"\n\n") # print dataframe with two carriage returns.

# Change column "c" data type to float
df["c"] = df["c"].astype("float")

# Check the data type of the column
column_dtype = df['c'].dtype

print(column_dtype)

Practical Task 1.1

Inspect the following dataset. Identify and convert two columns that require their data type to be changed.

# generate_practical_dataset1
# import pandas as pd
# import numpy as np

def generate_practical_dataset1():
    """
    Generate a synthetic dataset consisting of patient demographic information
    Returns:
    pandas.DataFrame: A DataFrame containing the synthetic dataset with columns: 
    'Surname', 'Age', 'Gender', 'Weight', 'Blood Type'.
    """
    # List of patient surnames
    patient_names = ["Smith", "Johnson", "Williams", "Jones", "Brown",
                     "Davis", "Miller", "Wilson", "Moore", "Taylor"]

    # List of blood types
    blood_types = ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]


    # Generate random demographic dataset
    demographic_dataset = []

    for n in patient_names:
        name = n
        age = random.randint(20, 85)
        gender = random.choice(["Male", "Female"])
        weight = round(random.uniform(55.5,95.5),2)
        blood_type = random.choice(blood_types)
        record = {
            "Surname": name,
            "Age": age,
            "Gender": gender,
            "Weight": weight,
            "Blood Type": blood_type,
        }
        demographic_dataset.append(record)

    # Convert dataset to DataFrame
    df = pd.DataFrame(demographic_dataset)
    
    # Set all columns to 'object' data types
    df = df.astype('object')
    
    return(df)

df = generate_practical_dataset1()
print(df.head(10))
# Your Code Here - Converting Data Types (Identify and convert two columns)
# df = generate_practical_dataset1()

Handle Missing Values#

Datasets often have missing values or empty records, commonly encoded as blanks or NaN (Not a Number). Handling missing values is one of the most common problems in data science and is usually the first step of data preprocessing, as most machine learning algorithms can’t deal with values that are missing or blank.

Removing ALL records with missing values is a basic strategy that is sometimes used, but it comes at the cost of losing potentially valuable data and the associated information or patterns. A better strategy is to impute the missing values.

In this next section we are going to take a look at:

  • Remove all records with missing values.

  • Removing specific columns or rows.

  • Impute (fill-in) the missing values.

Generate the below dataset from the function generate_demo_dataset2 and let’s look at the missing values.

# generate_demo_dataset2
# import pandas as pd
# import numpy as np

def generate_demo_dataset2():
    """
    Generate a dataset for cleaning data.

    Returns:
    pandas.DataFrame: A DataFrame containing the generated dataset with columns: 
    'category', 'a', 'b', 'c', 'd'.
    """

    category = ['category 1','category 2','category 2',None,'category 1','category 1']
    a_values = (6.0,np.nan,4.0,3.0,7.0,5.0)
    b_values = [np.nan,3.0,5.0,np.nan,10.0,8.0]
    c_values = [9.0,4.0,3.0,3.0,7.0,1.0]
    d_values = [1.0,2.0,7.0,np.nan,6.0,2.0]


    # Create DataFrame
    data = {'category': category, 'a': a_values, 'b': b_values, 'c': c_values, 'd': d_values}
    df = pd.DataFrame(data)
    
    return(df)

df = generate_demo_dataset2()

print(df)

Remove all records with missing values#

In Python, particularly in pandas dataframes or numpy arrays, NaN is commonly used to represent missing or undefined numerical data. However, if you’re working with non-numeric data types, such as objects, None is often used as an alternative to represent missing values.

Within SQL NULL is more broadly used to represent missing values across different data types.
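
As a quick illustration (assuming pandas and numpy are imported as at the start of the chapter), pd.isna() treats both NaN and None as missing:

# pd.isna() recognises both NaN (numeric) and None (object) as missing
print(pd.isna(np.nan))   # True
print(pd.isna(None))     # True
print(pd.isna(5.0))      # False
print(pd.isna("text"))   # False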

We can use the .isna() method to inspect the NaN values.
This will return a boolean result (True or False) for each value:

  • True - the value is NaN

  • False - the value is not NaN

# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# Identify missing values - ie: Is the value missing?
df.isna()

In pandas, isna() and isnull() are essentially aliases of each other, meaning they are two different names for the same function. Both functions are used to detect missing values in a dataframe or series. There is no difference in functionality between them; you can use either one based on your preference.

Similarly, notna() and notnull() are also aliases of each other and serve the same purpose—to detect non-missing values in a dataframe or series.
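
As a quick optional check on our dataframe, the alias pairs really do return identical results:

# isna()/isnull() and notna()/notnull() are aliases and return identical results
print(df.isna().equals(df.isnull()))     # True
print(df.notna().equals(df.notnull()))   # True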

So looking at our dataset we may decide to remove (or drop) all the rows where NaN/None values are present in any of the columns.
For this we can use the .dropna() method.

# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# drop all rows in the dataframe that contain NaN/None values in any of the columns
drop_all_nan_rows = df.dropna()

print(drop_all_nan_rows)

This is a very broad approach to dealing with missing values. There are more targeted approaches we might choose to adopt instead.

Removing specific columns or rows.#

We can drop specific rows by passing index labels to the .drop() function.

The .drop() function does not check for NaN or None values.
By passing index labels to this function, the specified rows will simply be removed.

# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# Drop rows in the dataframe by their specified row index
drop_specified_rows = df.drop([1,2,4])

print(drop_specified_rows)

Alternatively, there may be certain columns that are not required.
These can also be removed using the .drop() function by passing axis=1, which indicates a column.

# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# Drop a specified column from the dataframe
drop_column = df.drop("b",axis=1)

print(drop_column)

What if we want to drop rows only where data is missing in a particular column?
So far we have just looked at dropping rows and columns with .drop(), without checking for missing values. Let’s look at what we can do when specifically considering missing values.
We can use .dropna() in much the same way as .drop(), but it acts only on rows or columns containing NaN or None values.

For this, first take a look at how many missing values we have in each column, using the .isna() method to identify the NaN values and then the .sum() method to count them in each column.

# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# Count the number of NaN values in each column
CountNaN = df.isna().sum()

print(CountNaN)

To remove the rows with missing values in a particular column we can specify a list of labels to the subset argument of .dropna().

# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# Drop rows from a dataframe where NaN values are present in a specified column
SpecifiedColumn = df.dropna(subset=["b"])

print(SpecifiedColumn)

For more options on dropping values where NaN values exist, see the pandas documentation on dropna.
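
Two further arguments worth knowing about (not used elsewhere in this chapter) are how and thresh:

# how='all' drops a row only if ALL of its values are missing
print(df.dropna(how="all"), "\n")

# thresh=4 keeps only rows that have at least 4 non-missing values
print(df.dropna(thresh=4))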

Impute the missing values.#

Rather than removing rows or columns that have missing data we could fill in the missing values using the measures of central tendency, such as mean, median, and mode.

  • The mean can be used to impute a numeric feature.

  • The median can be used to impute an ordinal feature.

  • The mode, or highest occurring value, can be used to impute a categorical feature.

Note: it is important to understand that in some cases missing values will not impact the model, such as in unique identifiers.
For example, unique identifiers such as MRN or NHS Number will not impact the machine learning model because they are just identifiers and shouldn’t be used as features in the model.

Let’s first use the .describe() method to review statistics on the numerical columns within the dataset.

# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# stats on numerical values
df.describe()

Let’s use the mean to fill in the missing values for column "a" and the median for column "b".

# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# stats on numerical values
print(df.describe(),"\n\n")

# fill NaN values in column a with mean and median
df["a"] = df["a"].fillna(df["a"].mean())
df["b"] = df["b"].fillna(df["b"].median())

print(df)

And let’s now use the most frequently occurring value in category to replace the missing values.

# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# fill NaN values in the category column (categorical feature) with the mode
df["category"] = df["category"].fillna(df["category"].mode()[0])

print(df)

We may decide that it is better to replace missing values with a specified value instead.

df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# fill NaN values in the category column (categorical feature) with a specified value
df["category"] = df["category"].fillna("Unknown")

print(df)

Practical Task 1.2

From the following dataset look to carry out the following:

  • Drop at least one column that you feel that wouldn’t be needed in your model.

  • Drop rows where there are nulls in a specific column.

  • Fill in missing values using either mean, median or mode within a column.

Note: If you want to ensure the changes are saved to the original DataFrame without creating a new one, setting inplace=True needs to be added as an argument.
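
For example, here is a minimal sketch using the earlier demo dataframe; in the task you would apply the same pattern to whichever columns you choose:

# A fresh copy of the earlier demo dataframe for illustration
demo = generate_demo_dataset2()

# Without inplace: returns a new dataframe, the original is unchanged
demo_dropped = demo.dropna(subset=["b"])

# With inplace=True: modifies the dataframe directly and returns None
demo.dropna(subset=["b"], inplace=True)

print(demo.equals(demo_dropped))  # True - same result, different style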

# generate_practical_dataset2
# import pandas as pd
# import numpy as np

def generate_practical_dataset2():
    """
    Generate a synthetic demographic dataset.

    Returns:
    pandas.DataFrame: A DataFrame containing the generated dataset with columns: 
    'MRN', 'Surname', 'Age', 'Gender', 'Favourite Colour', 'Weight', 'Blood Type'.
    """
    data = {
    'MRN':    [482754, 194552, 456272, 569149, 152106, 697630, 922086, 801114, 942324, 737040],
    'Surname': ['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis', 'Miller', 'Wilson', 'Moore', 'Taylor'],
    'Age': [26, np.nan, 56, 54, 74, 62, 83, 24, np.nan, 26],
    'Gender': ['Male', 'Female', 'Male', None, 'Male', 'Female', None, 'Female', None, 'Female'],
    'Favourite Colour': ['Red',  np.nan, 'Purple', 'Blue', 'Orange', 'Red',  np.nan, 'Yellow', 'Black', 'Pink'],
    'Weight': [57.87, 66.96, 62.98, 63.83, 87.95, 69.91, np.nan, 61.49, 93.71, np.nan],
    'Blood Type': ['A+', 'B-', 'B-', 'AB+', 'A-', 'O+', 'AB+', 'B+', 'B-', 'O+']
    }
    df = pd.DataFrame(data)
    return (df)

df = generate_practical_dataset2()
print(df)
# Your code here
df = generate_practical_dataset2()

There are other methods of imputing missing values, such as sklearn.impute.IterativeImputer and sklearn.impute.KNNImputer. See the scikit-learn documentation to learn more about these methods.
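
As a brief sketch of what a KNNImputer call might look like on the demo data (an illustration rather than a recommended workflow; the exact setup will depend on your data):

# Sketch: impute the numeric columns of the demo dataset with KNNImputer
from sklearn.impute import KNNImputer

demo = generate_demo_dataset2()
numeric_cols = ["a", "b", "c", "d"]

imputer = KNNImputer(n_neighbors=2)
demo[numeric_cols] = imputer.fit_transform(demo[numeric_cols])

print(demo)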

Remove Duplicates#

Here we are going to look at how to identify duplicate records in your dataset and how to remove these.

Generate the below dataset from the function generate_demo_dataset3 and let’s take a look at the duplicate values.

# generate_demo_dataset3
# import pandas as pd
# import numpy as np

def generate_demo_dataset3():
    """
    Generate a dataset for cleaning data.

    Returns:
    pandas.DataFrame: A DataFrame containing the generated dataset with columns: 
    'category', 'a', 'b', 'c', 'd', 'colour'.
    """

    category = ['category 1','category 2','category 2','category 1','category 1','category 1','category 1','category 1','category 1']
    a_values = (6.0,1.0,1.0,3.0,7.0,5.0,3.0,7.0,5.0)
    b_values = [7.0,3.0,5.0,3.0,10.0,8.0,3.0,10.0,8.0]
    c_values = [9.0,4.0,3.0,3.0,7.0,1.0,3.0,7.0,1.0]
    d_values = [1.0,2.0,7.0,3.0,6.0,2.0,3.0,6.0,2.0]
    colour = ['Red','Blue','Green','Orange','Yellow','Pink','Orange','Yellow','Pink']


    # Create DataFrame
    data = {'category': category, 'a': a_values, 'b': b_values, 'c': c_values, 'd': d_values, 'colour': colour}
    
    df = pd.DataFrame(data)
    
    return(df)

df = generate_demo_dataset3()

print(df)

.duplicated() can be used to identify duplicate records and will return a Boolean value for each record.

# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

df.duplicated()

If we want to view the duplicate records, we can carry out the following:

# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# Identify duplicate records
duplicate_records = df.duplicated()

# Select duplicate records
duplicates = df[duplicate_records]

# Display duplicate records
print(duplicates)

To remove the duplicates completely from the dataset use the .drop_duplicates() method.

# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# Remove duplicates
df_no_duplicates = df.drop_duplicates()

print(df_no_duplicates)

We may wish to refine this further by passing the keep argument, where we can specify whether the first or last duplicate record should be kept.
If keep is not specified, the default is the first.
An example of when you might want to keep the last record would be a dataframe sorted in chronological order where you want to keep the most recent record.

# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# Keep the last occurrence of each duplicated row
df_no_duplicates = df.drop_duplicates(keep='last')

print(df_no_duplicates)

By passing False to this argument all duplicates will be dropped.

# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

# Drop all duplicates
df_no_duplicates = df.drop_duplicates(keep=False)

print(df_no_duplicates)

Instead of identifying duplicates across the whole set of columns, certain specified columns can be used to identify duplicates.
For this we use the subset argument.

# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.

df_no_duplicates = df.drop_duplicates(subset=['category', 'a'])

print(df_no_duplicates)

Practical Task 1.3

From the following dataset:

  • Identify the two duplicates in the data and remove them.

  • Inspect the results and remove any further suspected duplicates based on ‘Surname’ and ‘Age’.

# generate_practical_dataset3
# import pandas as pd
# import numpy as np

def generate_practical_dataset3():
    """
    Generate a synthetic demographic dataset.

    Returns:
    pandas.DataFrame: A DataFrame containing the generated dataset with columns: 
    'MRN', 'Surname', 'Age', 'Gender', 'Weight', 'Blood Type'.
    """

    data = {
        'MRN': [176968, 173798, 851542, 336291, 114317, 737813, 609203, 938757, 661284, 147859,336291,661284,319011],
        'Surname': ['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis', 'Miller', 'Wilson', 'Moore', 'Taylor','Jones','Moore', 'Taylor'],
        'Age': [26, 24, 56, 54, 74, 62, 83, 24, 60, 26,54,60, 26],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female','Female','Female', None],
        'Weight': [57.87, 66.96, 62.98, 63.83, 87.95, 69.91, 62.14, 61.49, 93.71, 94.47,63.83, 93.71, 94.47],
        'Blood Type': ['A+', 'B-', 'B-', 'AB+', 'A-', 'O+', 'AB+', 'B+', 'B-', 'O+','AB+','B-', '0+']
    }
    df = pd.DataFrame(data)
    
    return(df)

df = generate_practical_dataset3()
print(df)
# Your code here
# df = generate_practical_dataset3()

Removing Outliers#

In this section we will briefly look at removing outliers; this is just a small part of the much wider topic of anomaly detection, which will be covered separately in its own module.
Essentially, anomaly detection encompasses two broad practices: ‘outlier detection’ and ‘novelty detection’.

Outliers are abnormal or extreme data points that are only seen in your initial training data, whereas novelties are new or previously unseen instances compared with your original data.

Getting back to looking at outliers, we are now going to take a look at some simple ways of identifying and removing them.

# generate_demo_dataset4
# import pandas as pd
# import random
# import numpy as np
# import matplotlib.pyplot as plt

def generate_demo_dataset4():
    """
    Generate a dataset with outliers.

    Returns:
    numpy.ndarray: An array containing the generated dataset with possible outliers.
    """
    np.random.seed(0)
    data = np.random.normal(loc=8, scale=1, size=100) #loc=10
    outlier_indices = np.random.choice(100, size=10, replace=False)  # Introduce 10 outliers
    data[outlier_indices] = np.random.normal(loc=12, scale=1, size=10)  # Outliers have mean 12 #20,1,10
   
    return(data)

data = generate_demo_dataset4()

print(data)

Two popular methods of outlier detection are:

  • Interquartile range (IQR): The IQR is the range between the first quartile (Q1) and the third quartile (Q3) of a distribution.
    When an instance falls more than some multiple of the IQR below Q1 or above Q3, it is considered an outlier. The most common multiplier is 1.5, giving the outlier range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].

  • Z-score (standard score): The z-score or standard score measures how many standard deviations a data point is away from the mean.
    Generally, instances with an absolute z-score over 3 are chosen as outliers.

Let’s plot our data, which is in a numpy array, and take a look at the data points.
For this we will use a plotting library called matplotlib to generate a plot.

import matplotlib.pyplot as plt
plt.plot(data)

Now let’s plot the data as a box plot. What do you notice?

To do this, swap .plot with .boxplot.

plt.boxplot(data)

It is clear from the box plot that there are points that appear to be outliers in the data.

IQR method#

We are first going to take a look at the IQR method.
The IQR is the range between the first quartile (Q1) and the third quartile (Q3) of a distribution. When an instance falls more than some multiple of the IQR below Q1 or above Q3, it is considered an outlier. The most common multiplier is 1.5, giving the outlier range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].

Generate the function and then use it to calculate the upper and lower bounds based on 1.5 * IQR.

# Define function to identify outliers using IQR method
def identify_outliers_iqr(data, threshold=1.5):
    """
    Identify outliers in a dataset using the interquartile range (IQR) method.

    Args:
    data (numpy.ndarray or pandas.Series): The data for which outliers are to be identified.
    threshold (float, optional): The threshold value to determine outliers. Defaults to 1.5.

    Returns:
    tuple: A tuple containing:
        - outliers (numpy.ndarray): A boolean array indicating outliers in the data.
        - lower_bound (float): The lower bound for outlier detection.
        - upper_bound (float): The upper bound for outlier detection.
    """
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - threshold * iqr
    upper_bound = q3 + threshold * iqr
    outliers = np.logical_or(data < lower_bound, data > upper_bound)
    
    return outliers, lower_bound, upper_bound

Let’s use the identify_outliers_iqr function:

outliers_iqr, lower_bound, upper_bound = identify_outliers_iqr(data)

print(outliers_iqr)
print("lower_bound:", lower_bound)
print("upper_bound:", upper_bound)

and plot the results using the generated function below plot_outliers_iqr:

# import matplotlib.pyplot as plt

def plot_outliers_iqr(data, outliers_iqr, lower_bound, upper_bound):
    """
    Plot the data with identified outliers using the interquartile range (IQR) method.

    Args:
    data (numpy.ndarray or pandas.Series): The original data to be plotted.
    outliers_iqr (numpy.ndarray): A boolean array indicating outliers in the data.
    lower_bound (float): The lower bound for outlier detection.
    upper_bound (float): The upper bound for outlier detection.
    """
    plt.figure(figsize=(10, 6))  # create a blank figure to plot on

    plt.plot(data, label='Data')  # plot the data
    plt.plot(np.where(outliers_iqr)[0], data[outliers_iqr], 'ro', label='Outliers (IQR)')  # highlight the outliers

    plt.axhline(lower_bound, color='gray', linestyle='--', label='Lower Bound')  # add lower bound line
    plt.axhline(upper_bound, color='gray', linestyle='--', label='Upper Bound')  # add upper bound line

    plt.legend()  # add legend
    plt.xlabel('Index')  # add x label
    plt.ylabel('Value')  # add y label
    plt.title('Outlier Detection Example using IQR Method')  # add title
    plt.grid(True)  # show grid

    plt.show()  # show the completed plot

Use the function plot_outliers_iqr to plot the data:

# Plot data with outliers highlighted and bound lines
plot_outliers_iqr(data, outliers_iqr, lower_bound, upper_bound)

Several points have been identified using this method. To remove them from the data we are working with, we can filter the data array using Boolean indexing.
~outliers_iqr negates the Boolean array outliers_iqr, so it selects only the elements of data that are not identified as outliers (i.e. are not True).

# Remove outliers from the dataset
cleaned_data = data[~outliers_iqr]

print("Original data shape:", data.shape)
print("Cleaned data shape:", cleaned_data.shape)

Z-score (standard score) method#

A z-score represents the number of standard deviations a data point is from the mean of a dataset. Mathematically, the z-score of a data point \(x\) in a dataset with mean \(μ\) and standard deviation \(σ\) is calculated as:

\(Z= \frac{x−μ}{σ}\)

A z-score of 0 means the data point is exactly at the mean, a positive z-score means the data point is above the mean, and a negative z-score means the data point is below the mean.

Generally, instances with a z-score over 3 are chosen as outliers. This concept refers to data points that are located at 3 standard deviations from the mean of the dataset. It’s often used as a threshold for identifying outliers, especially in normally distributed datasets, where approximately 99.7% of the data falls within 3 standard deviations of the mean (assuming a normal distribution).

Let’s generate the dataset we are going to use:

data = generate_demo_dataset4() # original data (numpy array)
print(data,"\n\n") # print data with two carriage returns.

And apply the above formula to calculate the z-scores.

# Calculate z-scores
z_scores = (data - np.mean(data)) / np.std(data)
print(z_scores)

Now we have calculated the z-scores, let’s create an array of outliers (True/False) that are greater than the general threshold of 3 standard deviations from the mean.

# Define threshold for outlier detection
threshold = 3

# Identify outliers
outliers = np.abs(z_scores) > threshold
print(outliers)

We can put all this into a function to make the calculation easier:

def identify_outliers_zscore(data, threshold=3):
    """
    Identify outliers in a dataset using z-scores.

    Args:
    data (numpy.ndarray or pandas.Series): The data for which outliers are to be identified.
    threshold (float, optional): The threshold value to determine outliers. Defaults to 3.

    Returns:
    numpy.ndarray: A boolean array indicating outliers in the data.
    """
    # Calculate z-scores
    z_scores = (data - np.mean(data)) / np.std(data)
    
    # Identify outliers
    outliers = np.abs(z_scores) > threshold
    
    return outliers
# Call the function
outliers = identify_outliers_zscore(data)

The following function will plot the results; generate this function…

# import matplotlib.pyplot as plt

def plot_outliers_zscore(data, outliers, threshold=3):
    """
    Plot the data with identified outliers using the z-score method.

    Args:
    data (numpy.ndarray or pandas.Series): The original data to be plotted.
    outliers (numpy.ndarray): A boolean array indicating outliers in the data.
    threshold (float, optional): The threshold value to determine outliers. Defaults to 3.
    """
    plt.figure(figsize=(10, 6))  # create a blank figure to plot on

    plt.plot(data, label='Data')  # plot the data
    plt.plot(np.where(outliers)[0], data[outliers], 'ro', label='Outliers')  # highlight the outliers

    plt.axhline(np.mean(data), color='green', linestyle='-', label='Mean')  # mean
    # plt.axhline(np.median(data), color='purple', linestyle='-', label='Median')  # median
    
    # add lower threshold line
    plt.axhline(np.mean(data) - (threshold * np.std(data)), color='gray', linestyle='--', label='Lower Threshold')  
     # add upper threshold line
    plt.axhline(np.mean(data) + (threshold * np.std(data)), color='gray', linestyle='--', label='Upper Threshold') 

    plt.legend()  # add legend
    plt.xlabel('Index')  # add x label
    plt.ylabel('Value')  # add y label
    plt.title('Outlier Detection Example using z-Score (standard score) Method')  # add title
    plt.grid(True)  # show grid
    plt.show()  # show the completed plot

… and now use plot_outliers_zscore to plot the data.

# Plot data with outliers highlighted
plot_outliers_zscore(data, outliers)

Practical Task 1.4

From the following dataset identify and plot the outliers for cholesterol levels using both the methods we have covered:
Use the supplied functions to identify the outliers and plot the results.

  • IQR Method:
    Functions: identify_outliers_iqr and plot_outliers_iqr

  • Z-Score Method:
    Functions: identify_outliers_zscore and plot_outliers_zscore.

# generate_practical_dataset4
# import numpy as np

def generate_practical_dataset4(num_patients=1000):
    """
    Generate random healthcare data for a specified number of patients.

    Parameters:
    - num_patients (int): Number of patients for which healthcare data is generated. Default is 1000.

    Returns:
    - ndarray: A 2D NumPy array containing healthcare data with the following columns:
               - Age of patients
               - Cholesterol levels in mg/dL
               - Blood pressure in mmHg
               - Body Mass Index (BMI)
    """
    # Generate random healthcare data
    age = np.random.randint(18, 90, num_patients)  # Age of patients
    cholesterol = np.random.normal(200, 30, num_patients)  # Cholesterol levels in mg/dL
    blood_pressure = np.random.randint(90, 180, num_patients)  # Blood pressure in mmHg
    bmi = np.random.normal(25, 4, num_patients)  # Body Mass Index (BMI)

    # Stack arrays horizontally to create a single 2D array
    healthcare_data = np.column_stack((age, cholesterol, blood_pressure, bmi))

    return healthcare_data

# Call the function and print first few rows of the healthcare data
healthcare_data = generate_practical_dataset4()
print("Sample healthcare data (first 5 rows):")
print(healthcare_data[:5])
# print the first 50 rows of the cholesterol column
print(healthcare_data[:50, 1])
# Your code here
# healthcare_data = generate_practical_dataset4()

Dealing with Target Imbalance#

Before diving into this section, we first need to understand the terms ‘Target’ and ‘Features’.

Target(s) - the column(s) you are trying to predict with your machine learning model.
Features - all the other columns in the data.

So, target imbalance is when our target column, also known as the target variable, has far fewer instances of the class we are trying to predict than of the other class(es).

Why do we need to address target imbalance?#

Addressing target imbalance is crucial in many machine learning tasks, particularly in classification problems, because it ensures that the model doesn’t become biased towards the majority class.
When the classes in your dataset are imbalanced, meaning some classes have significantly more samples than others, the model may learn to simply predict the majority class for most instances.

When assessing target imbalance, there isn’t a fixed threshold that universally determines whether there is an imbalance or not.
However, a common rule of thumb is to treat a class imbalance as significant if one class represents less than 10% to 20% of the total dataset.

A real world example of this would be detecting credit card fraud transactions, or in healthcare, “did not attend” (DNA) rates in outpatient appointments.

Let’s take a look at some ways to tackle target imbalance.

Generate the below patient data where the target is the Disease variable.

# generate_demo_dataset5
# import pandas as pd
# import numpy as np

def generate_demo_dataset5():
    """
    Generate a synthetic demographic dataset.

    Returns:
    pandas.DataFrame: A DataFrame containing the generated dataset with columns:
        - 'Age': Age of the individuals.
        - 'Gender': Gender of the individuals (1) male or (2) female.
        - 'Blood Pressure': Blood pressure of the individuals.
        - 'Disease': Target variable indicating the presence (1) or absence (0) of a disease.
    """
    np.random.seed(42)

    # Features
    age = np.random.randint(20, 80, size=1000)
    gender = np.random.choice([1, 2], size=1000)
    blood_pressure = np.random.randint(90, 180, size=1000)

    # Target variable
    disease = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])

    # Create DataFrame
    data = pd.DataFrame({
        'Age': age,
        'Gender': gender,
        'Blood Pressure': blood_pressure,
        'Disease': disease
    })
    df = pd.DataFrame(data)
    
    return(df)

df = generate_demo_dataset5()
print(df)

# Display the shape of dataframe
print(df.shape)

Inspect the ‘Disease’ target variable to identify whether there is a significant imbalance.
Tip: note these numbers down, as they will help you check the methods we are about to cover!

df['Disease'].value_counts()
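
To compare directly against the 10% to 20% rule of thumb mentioned earlier, the counts can also be viewed as proportions (an optional check):

# Proportion of each class in the target variable
df['Disease'].value_counts(normalize=True)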

The value counts can also be plotted to give a quick visual representation.

df['Disease'].value_counts().plot.bar()

To summarise before moving on: most records in this dataset are where the Disease target variable is 0, i.e. no disease. The remaining records represent the minority class, where the Disease variable is 1. As the aim is to predict when Disease = 1, we need to address the imbalance in the data.

SMOTE (Synthetic Minority Over-sampling Technique):#

One way of dealing with target imbalance is a methodology called SMOTE, which stands for Synthetic Minority Over-sampling TEchnique.

The way that SMOTE works is that, for each minority class instance in the data, SMOTE finds its k nearest neighbours (where k is a user-specified number) in the feature space.
It then generates synthetic samples by creating new instances along the line segments connecting the minority class instance to its nearest neighbours.
These synthetic samples are added to the original dataset, which effectively increases the number of minority class instances.

To demonstrate this we are going to use part of the imblearn library.
The imbalanced-learn library (abbreviated as imblearn) is a Python library specifically designed to address the problem of class imbalance in machine learning datasets.

Note: We have already imported parts of the imblearn library at the start of the notebook.

For SMOTE we use the following library:

# import SMOTE from imblearn.over_sampling
from imblearn.over_sampling import SMOTE

First, we need to separate the features and target to X and y respectively - this is a common standard notation for features and targets.

# Separate features and target variable
X = df.drop('Disease', axis=1) # drop target variable leaving the remaining features
y = df['Disease'] # just the target variable

Task:

  • Check the number of rows and columns in X (features) and y (target variable)

  • And then also plot the values of the target variable y:

# Your code here

We now have our features and target separated, so we can start by instantiating SMOTE (creating an instance of the class, in this case an instance of SMOTE). Then we use it to ‘resample’ the dataset for X and y.

To do this we use: SMOTE(random_state=42) which creates an instance of the SMOTE algorithm with a specific random state (here 42). The random state is an arbitrary choice, and you could use any integer value. The important aspect is to keep it consistent across runs if reproducibility is desired.

Geek Alert: You will see 42 often used in notebooks from other data scientists. The use of the number 42 as the random_state parameter in machine learning is actually a reference to the science fiction series “The Hitchhiker’s Guide to the Galaxy” by Douglas Adams.

# Instantiate SMOTE
smote = SMOTE(random_state=42, k_neighbors=5)

Now use this to resample your data.

# Resample the dataset
X_resampled_smote, y_resampled_smote = smote.fit_resample(X, y)

Check the number of rows and columns in X_resampled and y_resampled:

# Your code here

So, where Disease = 0 (the majority class) we had x records, and the minority class (Disease = 1) had y records. SMOTE increases the minority class from y to x records, so that the total number of records for Disease = 1 is balanced with the total for Disease = 0.
Therefore, the total number of resulting records will be double the size of the majority class.

Look at the resampled target variable value counts to confirm this:

y_resampled_smote.value_counts()

Let’s quickly plot the before and after using the below function:

# import matplotlib.pyplot as plt

def plot_before_and_after_resampling(y, y_resampled, label):
    """
    Plot bar plots before and after resampling.

    Args:
        y (pandas.Series): Original target variable.
        y_resampled (pandas.Series): Resampled target variable.
        label (str): Label to be used in the plot title for the resampled data.

    Returns:
        None
    """
    # Create a figure and axis object
    fig, axs = plt.subplots(1, 2, figsize=(8, 4))

    # Plot the first bar plot for y
    y.value_counts().plot(kind='bar', ax=axs[0])
    axs[0].set_title('y')
    axs[0].set_xlabel('Disease')
    axs[0].set_ylabel('Number of Records')

    # Plot the first bar plot for y_resampled
    y_resampled.value_counts().plot(kind='bar', ax=axs[1])
    axs[1].set_title(f'y_resampled using {label}')
    axs[1].set_xlabel('Disease')
    axs[1].set_ylabel('Number of Records')

    # Adjust layout
    plt.tight_layout()

    # Show the plot
    plt.show()
    
    return
plot_before_and_after_resampling(y,y_resampled_smote,"SMOTE")

Random oversampling:#

RandomOverSampler simply duplicates some samples from the minority class to balance the dataset. It randomly selects instances from the minority class and replicates them until the dataset is balanced.

Generate the dataset again if required:

df = generate_demo_dataset5()
print(df)

For random over sampling we use the following library:

# import RandomOverSampler from imblearn.over_sampling
from imblearn.over_sampling import RandomOverSampler

We have previously already separated the features and the target variable, so we don’t need to repeat this. As a reminder the code was:

# Separate features and target variable
X = df.drop('Disease', axis=1)
y = df['Disease']

Now we create an instance of the RandomOverSampler algorithm:

# Instantiate RandomOverSampler
ros = RandomOverSampler(random_state=42)

Now use this to resample your data.

# Resample the dataset
X_resampled_ros, y_resampled_ros = ros.fit_resample(X, y)

And plot the results.

plot_before_and_after_resampling(y,y_resampled_ros,"RandomOverSampler")

What do you notice comparing these results to SMOTE?

The resulting class counts should be the same, but the techniques the two algorithms use are very different; both, however, are over-sampling algorithms.

Use SMOTE when the minority class is densely packed or when there is overlapping with the majority class. SMOTE synthesises new minority class samples along the lines connecting existing minority class samples, effectively creating synthetic examples within the feature space.

Use RandomOverSampler when the minority class is spread out and there is less risk of creating overlapping or synthetic examples that might not represent the true distribution of the minority class. RandomOverSampler simply duplicates minority class samples, maintaining the original distribution.

In both cases, over-sampling is best used on smaller datasets, as potentially a lot of extra records will be created to achieve the balance.

To help with identifying whether the minority class is densely packed, overlapping with the majority class, or spread out, use a seaborn pair plot to quickly visualise patterns.

  • Seaborn is a visualisation library which is imported as import seaborn as sns.

  • A pair plot, also known as a scatterplot matrix, is a type of visualisation that allows you to explore relationships between pairs of variables in a dataset. It’s particularly useful for datasets with multiple variables, enabling you to quickly identify patterns, correlations, and potential insights.

Use the functions original_target_variable_pair_plot and resampled_target_variable_pair_plot to compare the results of

  • The original data

  • Smote resampled data

  • Random oversampling resampled data

# import matplotlib.pyplot as plt

def original_target_variable_pair_plot(df):
    """
    Generate a pair plot to visualise the distribution of features by disease class for original data.

    Parameters:
    df (DataFrame): The DataFrame containing the original data.
    """
    # Visualising the distribution of features by disease class
    sns.pairplot(df, hue='Disease', height=2)
    # Add a title
    plt.suptitle('Pair Plot of Features by Disease Class - Original Data', y=1.05)
    plt.show()

    # Calculating the average distance between minority class samples
    minority_samples = df[df['Disease'] == 1][['Age', 'Blood Pressure']]
    mean_distance = np.mean(np.linalg.norm(minority_samples - minority_samples.mean(axis=0), axis=1))
    print("Average distance between minority class samples:", mean_distance)
    
    return

def resampled_target_variable_pair_plot(X_resampled, y_resampled, label):
    """
    Generate a pair plot to visualise the distribution of features by disease class after resampling.

    Parameters:
    X_resampled (array-like): The resampled features.
    y_resampled (array-like): The resampled target variable.
    label (str): The label indicating the type of resampling performed.
    """
    # Concatenate the resampled features and target variable
    df_resampled = pd.concat([pd.DataFrame(X_resampled, columns=X.columns), 
                              pd.DataFrame(y_resampled, columns=['Disease'])], axis=1)

    # Visualising the distribution of features by disease class after resample
    sns.pairplot(df_resampled, hue='Disease', height=2)
    plt.suptitle(f'Pair Plot of Features by Disease Class - {label} Resampled Data', y=1.05)
    plt.show()

    # Calculating the average distance between minority class samples
    minority_samples = df_resampled[df_resampled['Disease'] == 1][['Age', 'Blood Pressure']]
    mean_distance = np.mean(np.linalg.norm(minority_samples - minority_samples.mean(axis=0), axis=1))
    print("Average distance between minority class samples:", mean_distance)

    return   

Run the functions and review the plots.

# original data
original_target_variable_pair_plot(df)
# smote resampled data
resampled_target_variable_pair_plot(X_resampled_smote, y_resampled_smote, "smote")
# random over sampled resampled data
resampled_target_variable_pair_plot(X_resampled_ros, y_resampled_ros, "random over sampler")

Random Under-sampling:#

Random under-sampling can be effective when the dataset is very large and computational resources are limited. However, the trade-off is that it comes with the risk of losing potentially valuable information from the majority class.

Generate the dataset again if required:

df = generate_demo_dataset5()
print(df)

For random under-sampling we use the following library:

# import RandomUnderSampler from imblearn.under_sampling
from imblearn.under_sampling import RandomUnderSampler

You should be familiar with the steps to carry out resampling, as they are the same as before, just with a new algorithm, RandomUnderSampler.
We will do this in one code cell.

# Separate features and target variable
X = df.drop('Disease', axis=1)
y = df['Disease']

# Instantiate RandomUnderSampler
rus = RandomUnderSampler(random_state=42)

# Resample the dataset
X_resampled_rus, y_resampled_rus = rus.fit_resample(X, y)

And go straight to plotting the results.

plot_before_and_after_resampling(y,y_resampled_rus,"RandomUnderSampler")

This should show, as expected, that the majority class (Disease = 0) has been randomly reduced to the same number of records as the minority class (Disease = 1).
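
As with SMOTE, you can confirm the new class counts directly (an optional check):

y_resampled_rus.value_counts()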

Practical Task 1.5

Using one of the above methods look to address the target imbalance of this new dataset.

  • The target variable is: Diabetes

# generate_practical_dataset5
# import pandas as pd
# import numpy as np

def generate_practical_dataset5():
    """
    Generate a synthetic healthcare-related dataset with imbalanced classes.

    Returns:
    pandas.DataFrame: A DataFrame containing the generated dataset with columns:
        - 'Age': Age of the patients.
        - 'Gender': Gender of the patients (1 for male, 2 for female).
        - 'Blood Pressure': Blood pressure of the patients.
        - 'Cholesterol': Cholesterol level of the patients.
        - 'Diabetes': Target variable indicating the presence (1) or absence (0) of diabetes.
    """
    np.random.seed(42)

    # Features
    age = np.random.randint(20, 80, size=1000)
    gender = np.random.choice([1, 2], size=1000)
    blood_pressure = np.random.randint(90, 180, size=1000)
    cholesterol = np.random.randint(120, 300, size=1000)

    # Target variable
    # Introduce class imbalance (90% negative class, 10% positive class)
    diabetes = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])

    # Create DataFrame
    data = pd.DataFrame({
        'Age': age,
        'Gender': gender,
        'Blood Pressure': blood_pressure,
        'Cholesterol': cholesterol,
        'Diabetes': diabetes
    })
    
    return data

# Generate the synthetic healthcare dataset
healthcare_df = generate_practical_dataset5()

# Display the first few rows of the dataset
print(healthcare_df.head())

# Display the shape of the dataset
print("Shape of the dataset:", healthcare_df.shape)
  1. Check the target variable and confirm the imbalance. You can also plot this if you wish using .plot.bar().

# Your code here
  2. Separate the features and target variable.

# Your code here
  3. Pick a method for dealing with the imbalance and instantiate the algorithm.

# Your code here
  4. Resample the dataset.

# Your code here
  5. Plot the results of before and after the resampling method using function: plot_before_and_after_resampling.

# Your code here

Chapter Summary#

Well done on reaching the end of this chapter!
Just to recap what we have learnt when it comes to cleaning and preparing our data, you should now feel familiar with:

  • Checking and converting data types.

  • Handling missing values by removing specific rows or columns where missing values are present, and filling in missing values using statistical measures from the dataset.

  • Removing duplicate records and duplicates across specific columns.

  • Identifying and handling outliers in the data.

  • Dealing with target imbalance.