Data Preprocessing#
Data preprocessing is an important step in machine learning and is a prerequisite for modelling your data.
The main goal of data preprocessing is to transform your dataset into a form suitable for modelling; doing so will also improve the performance of your model and provide more robust and reliable results.
Cleaning and Preparing Data#
There are many things to consider for preprocessing; not all of them will be relevant to your data, but they should still be checked for. Gathering, cleaning, and preparing your data is expected to take a significant proportion of your time and attention compared with the time spent on model development and model tuning.
Within this chapter the following topics will be covered:
Checking and Converting Data Types
Handle Missing Values
Remove Records with Missing Values
Removing Specific Columns Or Rows That Contain Missing Data
Impute Missing Values
Remove Duplicates
Removing Outliers
Interquartile Range (IQR) method
Z-score (Standard Score)
Dealing with Target Imbalance
SMOTE
Random Over-sampling
Random Under-sampling
Import the following libraries for this chapter:
It is good practice to import all the libraries that are used in your notebook at the very start. You may not know all the libraries you intend on using from the beginning, but they should all be added here as and when you need additional libraries.
# Main libraries
import pandas as pd
import random
# Removing Outliers section
import numpy as np
import matplotlib.pyplot as plt
# Dealing with Target Imbalance section
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
import seaborn as sns
Checking and Converting Data Types#
Before any preprocessing can begin, we must understand the data types of our features.
Pandas
will automatically assign data types for your data when you import a dataset. Sometimes the types assigned will be incorrect, so it is important to check these and convert them to the correct data type.
The most common data types used in pandas
are:
object: Contains string values or a mixture of types.
int64: Whole numbers, equivalent to the native Python integer type; the 64 refers to the number of bits of memory allotted for storing each value.
float64: Decimal numbers.
datetime64: Dates and times - this special data type unlocks extra functionality for working with time series data, such as datetime indexing.
We have also seen in the previous module the data type category, which pandas uses specifically for categorical data.
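As a minimal sketch (assuming a dataframe df with a text column of repeated labels called 'category', like the demo dataset generated below), the conversion looks like this:
# convert a text column to pandas' dedicated categorical dtype
df["category"] = df["category"].astype("category")
print(df["category"].dtype)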
Data types can be easily checked in a dataset by using the .info()
method.
Generate the below dataset from the function generate_demo_dataset1
and let’s look at the data types.
# generate_demo_dataset1
# import pandas as pd
# import numpy as np
def generate_demo_dataset1():
"""
Generate a dataset for cleaning data examples.
Returns:
pandas.DataFrame: A DataFrame containing the generated dataset with columns:
'category', 'a', 'b', 'c', 'd'.
"""
category = ['category 1', 'category 2', 'category 2', 'category 1', 'category 1', 'category 1']
a_values = (6.0, 2.0, 4.0, 3.0, 7.0, 5.0)
b_values = [6.0, 3.0, 5.0, 1.0, 10.0, 8.0]
c_values = [9.0, 4.0, 3.0, 3.0, 7.0, 1.0]
d_values = [1.0, 2.0, 7.0, 9.0, 6.0, 2.0]
# Create DataFrame
data = {'category': category, 'a': a_values, 'b': b_values, 'c': c_values, 'd': d_values}
df = pd.DataFrame(data)
df['c'] = df['c'].astype("str")
return df
df = generate_demo_dataset1()
print(df)
To inspect the data types use df.info()
.
# df = generate_demo_dataset1() # original dataframe
# print(df,"\n\n") # print dataframe with two carriage returns.
# inspect the data types of the dataframe
df.info()
The pandas .astype()
method can be used to convert a column’s data type to a specified data type.
In order to do this we need to reassign the column to overwrite the original data when converting it.
Before converting a column be extra careful that all the values it contains can be appropriately converted to the new data type.
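If you are not sure that every value will convert cleanly, a more defensive option (a brief sketch, not used in the rest of this chapter) is pd.to_numeric, which can replace any unconvertible value with NaN rather than raising an error:
# parse column "c" as numbers, turning any unparseable values into NaN
print(pd.to_numeric(df["c"], errors="coerce"))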
Let’s check the values in column "c".
# df = generate_demo_dataset1() # original dataframe
# print(df,"\n\n") # print dataframe with two carriage returns.
# Check the unique values in column "c"
unique_values = df["c"].unique()
print(unique_values)
Check that column "c"
is in fact data type 'object'
.
# df = generate_demo_dataset1() # original dataframe
# print(df,"\n\n") # print dataframe with two carriage returns.
# Check the data type of the column
column_dtype = df['c'].dtype
print(column_dtype)
To change column "c"
to data type 'float'
we need to overwrite the original column with the new data type.
# df = generate_demo_dataset1() # original dataframe
# print(df,"\n\n") # print dataframe with two carriage returns.
# Change the data type of column "c" to float
df["c"] = df["c"].astype("float")
# Check the data type of the column
column_dtype = df['c'].dtype
print(column_dtype)
Practical Task 1.1
Inspect the following dataset. Identify and convert two columns that require their data type to be changed.
# generate_practical_dataset1
# import pandas as pd
# import numpy as np
def generate_practical_dataset1():
"""
Generate a synthetic dataset consisting of patient demographic information
Returns:
pandas.DataFrame: A DataFrame containing the synthetic dataset with columns:
'Surname', 'Age', 'Gender', 'Weight', 'Blood Type'.
"""
# List of patient surnames
patient_names = ["Smith", "Johnson", "Williams", "Jones", "Brown",
"Davis", "Miller", "Wilson", "Moore", "Taylor"]
# List of blood types
blood_types = ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]
# Generate random demographic dataset
demographic_dataset = []
for n in patient_names:
name = n
age = random.randint(20, 85)
gender = random.choice(["Male", "Female"])
weight = round(random.uniform(55.5,95.5),2)
blood_type = random.choice(blood_types)
record = {
"Surname": name,
"Age": age,
"Gender": gender,
"Weight": weight,
"Blood Type": blood_type,
}
demographic_dataset.append(record)
# Convert dataset to DataFrame
df = pd.DataFrame(demographic_dataset)
# Set all columns to 'object' data types
df = df.astype('object')
return(df)
df = generate_practical_dataset1()
print(df.head(10))
# Your Code Here - Converting Data Types (Identify and convert two columns)
# df = generate_practical_dataset1()
Handle Missing Values#
Datasets often have missing values or empty records, typically encoded as blanks or NaN (Not a Number). Handling missing values is one of the most common problems in data science and is usually the first step of data preprocessing, as most machine learning algorithms can’t deal with values that are missing or blank.
Removing ALL records with missing values is a basic strategy that is sometimes used, but it comes at the cost of losing potentially valuable data and the associated information or patterns. A better strategy is to impute the missing values.
In this next section we are going to take a look at:
Remove all records with missing values.
Removing specific columns or rows.
Impute (fill-in) the missing values.
Generate the below dataset from the function generate_demo_dataset2
and let’s look at the missing values.
# generate_demo_dataset2
# import pandas as pd
# import numpy as np
def generate_demo_dataset2():
"""
Generate a dataset for cleaning data.
Returns:
pandas.DataFrame: A DataFrame containing the generated dataset with columns:
'category', 'a', 'b', 'c', 'd'.
"""
category = ['category 1','category 2','category 2',None,'category 1','category 1']
a_values = (6.0,np.nan,4.0,3.0,7.0,5.0)
b_values = [np.nan,3.0,5.0,np.nan,10.0,8.0]
c_values = [9.0,4.0,3.0,3.0,7.0,1.0]
d_values = [1.0,2.0,7.0,np.nan,6.0,2.0]
# Create DataFrame
data = {'category': category, 'a': a_values, 'b': b_values, 'c': c_values, 'd': d_values}
df = pd.DataFrame(data)
return(df)
df = generate_demo_dataset2()
print(df)
Remove all records with missing values#
In Python, particularly in pandas
dataframes or numpy
arrays, NaN
is commonly used to represent missing or undefined numerical data. However, if you’re working with non-numeric data types, such as objects, None
is often used as an alternative to represent missing values.
Within SQL NULL
is more broadly used to represent missing values across different data types.
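As a quick check, pandas treats both representations the same way when detecting missing values:
# both None and np.nan are identified as missing by pandas
print(pd.isna(None), pd.isna(np.nan)) # True True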
We can use the .isna()
method to inspect the NaN
values.
This will return a boolean result for each value: True if the value is NaN, and False if it is not.
# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# Identify missing values - ie: Is the value missing?
df.isna()
In pandas, isna()
and isnull()
are essentially aliases of each other, meaning they are two different names for the same function. Both functions are used to detect missing values in a dataframe or series. There is no difference in functionality between them; you can use either one based on your preference.
Similarly, notna()
and notnull()
are also aliases of each other and serve the same purpose—to detect non-missing values in a dataframe or series.
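A quick way to confirm this on the current dataframe (a small sketch):
# isna()/isnull() produce identical results, as do notna()/notnull()
print(df.isna().equals(df.isnull()))
print(df.notna().equals(df.notnull()))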
So looking at our dataset we may decide to remove (or drop
) all the rows where NaN
/None
values are present in any of the columns.
For this we can use the .dropna() method.
# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# drop all rows in the dataframe that contain NaN/None values in any of the columns
drop_all_nan_rows = df.dropna()
print(drop_all_nan_rows)
This is a very broad approach to dealing with missing values. There are more specific ways that we might choose to adopt instead.
Removing specific columns or rows.#
We can drop specific rows by passing index labels to the .drop()
function.
The .drop() function does not check for NaN or None values.
By passing the index labels to this function, the rows will simply be removed.
# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# Drop rows in the dataframe by their specified row index
drop_specified_rows = df.drop([1,2,4])
print(drop_specified_rows)
Alternatively, there may be certain columns that are not required.
These can also be removed using the .drop()
function by also passing the axis
value = 1, which indicates a column.
# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# Drop a specified column from the dataframe
drop_column = df.drop("b",axis=1)
print(drop_column)
What if we want to drop rows only where data is missing in a particular column?
So far we have just looked at dropping rows and columns without checking for missing values using .drop()
.
Let’s look at what we can do specifically considering missing values.
We can use .dropna()
in the same way as .drop()
, but it will cater to only rows or columns with NaN
or None
values.
For this, first take a look at how many missing values we have in each column, using the .isna() method to identify the NaN values and then the .sum() method to count them in each column.
# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# Count the number of NaN values in each column
CountNaN = df.isna().sum()
print(CountNaN)
To remove the rows with missing values in a particular column we can specify a list of labels to the subset
argument of .dropna()
.
# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# Drop rows from a dataframe where NaN values are present in a specified column
SpecifiedColumn = df.dropna(subset=["b"])
print(SpecifiedColumn)
For more options on dropping values where NaN values exist, see the pandas documentation on dropna here.
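As a brief sketch of two of the options covered there (the how and thresh arguments of .dropna()):
# drop a row only when every value in it is missing
print(df.dropna(how="all"))
# keep only rows that have at least 4 non-missing values
print(df.dropna(thresh=4))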
Impute the missing values.#
Rather than removing rows or columns that have missing data we could fill in the missing values using the measures of central tendency, such as mean, median, and mode.
The mean can be used to impute a numeric feature.
The median can be used to impute an ordinal feature.
The mode, or highest occurring value, can be used to impute a categorical feature.
Note: it is important to understand that in some cases missing values will not impact the model, such as in unique identifier columns.
For example, unique values such as MRN or NHS Number will not impact the machine learning models because they are just identifiers and shouldn’t be used as features in the model.
Let’s first use the .describe()
method to review statistics on the numerical columns within the dataset.
# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# stats on numerical values
df.describe()
Let’s use the mean
to fill in the missing values for column "a"
and the median
for column "b"
.
# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# stats on numerical values
print(df.describe(),"\n\n")
# fill NaN values in column a with mean and median
df["a"] = df["a"].fillna(df["a"].mean())
df["b"] = df["b"].fillna(df["b"].median())
print(df)
And let’s now use the most frequently occurring value in category
to replace the missing values.
# df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# fill NaN values in the category column (categorical feature) with the mode
df["category"] = df["category"].fillna(df["category"].mode()[0])
print(df)
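Note that .mode() returns a Series rather than a single value, because a column can have more than one joint most frequent value; the [0] selects the first of these.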
We may decide that it is better to replace missing values with a specified value instead.
df = generate_demo_dataset2() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# fill NaN values in the category column (categorical feature) with a specified value
df["category"] = df["category"].fillna("Unknown")
print(df)
Practical Task 1.2
From the following dataset look to carry out the following:
Drop at least one column that you feel wouldn’t be needed in your model.
Drop rows where there are nulls in a specific column.
Fill in missing values using either mean, median or mode within a column.
Note: If you want to ensure the changes are saved to the original DataFrame without creating a new one, setting
inplace=True
needs to be added as an argument.
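For example, the earlier subset example could be written as the following sketch, modifying df directly instead of returning a new dataframe:
# drop rows with missing values in column "b", in place
df.dropna(subset=["b"], inplace=True)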
# generate_practical_dataset2
# import pandas as pd
# import numpy as np
def generate_practical_dataset2():
"""
Generate a synthetic demographic dataset.
Returns:
pandas.DataFrame: A DataFrame containing the generated dataset with columns:
'MRN', 'Surname', 'Age', 'Gender', 'Favourite Colour', 'Weight', 'Blood Type'.
"""
data = {
'MRN': [482754, 194552, 456272, 569149, 152106, 697630, 922086, 801114, 942324, 737040],
'Surname': ['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis', 'Miller', 'Wilson', 'Moore', 'Taylor'],
'Age': [26, np.nan, 56, 54, 74, 62, 83, 24, np.nan, 26],
'Gender': ['Male', 'Female', 'Male', None, 'Male', 'Female', None, 'Female', None, 'Female'],
'Favourite Colour': ['Red', np.nan, 'Purple', 'Blue', 'Orange', 'Red', np.nan, 'Yellow', 'Black', 'Pink'],
'Weight': [57.87, 66.96, 62.98, 63.83, 87.95, 69.91, np.nan, 61.49, 93.71, np.nan],
'Blood Type': ['A+', 'B-', 'B-', 'AB+', 'A-', 'O+', 'AB+', 'B+', 'B-', 'O+']
}
df = pd.DataFrame(data)
return (df)
df = generate_practical_dataset2()
print(df)
# Your code here
df = generate_practical_dataset2()
There are other methods of imputing missing values such as sklearn.impute.IterativeImputer and sklearn.impute.KNNImputer. Use the links to learn more about these methods.
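As a brief, hedged sketch of what using KNNImputer might look like on the numeric columns of the demo dataset (shown only as an illustration, not a full walkthrough of the method):
# impute missing numeric values using the 2 nearest neighbours
from sklearn.impute import KNNImputer
df = generate_demo_dataset2() # regenerate the data with missing values
imputer = KNNImputer(n_neighbors=2)
df[["a", "b", "c", "d"]] = imputer.fit_transform(df[["a", "b", "c", "d"]])
print(df)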
Remove Duplicates#
Here we are going to look at how to identify duplicate records in your dataset and how to remove these.
Generate the below dataset from the function generate_demo_dataset3
and let’s take a look at the duplicate values.
# generate_demo_dataset3
# import pandas as pd
# import numpy as np
def generate_demo_dataset3():
"""
Generate a dataset for cleaning data.
Returns:
pandas.DataFrame: A DataFrame containing the generated dataset with columns:
'category', 'a', 'b', 'c', 'd', 'colour'.
"""
category = ['category 1','category 2','category 2','category 1','category 1','category 1','category 1','category 1','category 1']
a_values = (6.0,1.0,1.0,3.0,7.0,5.0,3.0,7.0,5.0)
b_values = [7.0,3.0,5.0,3.0,10.0,8.0,3.0,10.0,8.0]
c_values = [9.0,4.0,3.0,3.0,7.0,1.0,3.0,7.0,1.0]
d_values = [1.0,2.0,7.0,3.0,6.0,2.0,3.0,6.0,2.0]
colour = ['Red','Blue','Green','Orange','Yellow','Pink','Orange','Yellow','Pink']
# Create DataFrame
data = {'category': category, 'a': a_values, 'b': b_values, 'c': c_values, 'd': d_values, 'colour': colour}
df = pd.DataFrame(data)
return(df)
df = generate_demo_dataset3()
print(df)
.duplicated()
can be used to identify duplicate records and will return a Boolean value for the record.
# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
df.duplicated()
If we want to view the duplicate records, we can carry out the following:
# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# Identify duplicate records
duplicate_records = df.duplicated()
# Select duplicate records
duplicates = df[duplicate_records]
# Display duplicate records
print(duplicates)
To remove the duplicates completely from the dataset use the .drop_duplicates()
method.
# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# Remove duplicates
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
We may wish to refine this further by passing the keep
argument, where we can specify whether the first or last duplicate record should be kept.
If keep is not specified, the default is first.
An example of when you might want to keep the last record would be if you had a sorted dataframe in chronological order where you wanted to keep the most recent record.
# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# Keep the last occurrence of each duplicated row
df_no_duplicates = df.drop_duplicates(keep='last')
print(df_no_duplicates)
By passing False
to this argument all duplicates will be dropped.
# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
# Drop all duplicates
df_no_duplicates = df.drop_duplicates(keep=False)
print(df_no_duplicates)
Instead of identifying duplicates across the whole set of columns, certain specified columns can be used to identify duplicates.
For this we will use the subset
argument.
# df = generate_demo_dataset3() # original dataframe
print(df,"\n\n") # print dataframe with two carriage returns.
df_no_duplicates = df.drop_duplicates(subset=['category', 'a'])
print(df_no_duplicates)
Practical Task 1.3
From the following dataset:
Identify the two duplicates in the data and remove them.
Inspect the results and remove any further suspected duplicates based on ‘Surname’ and ‘Age’.
# generate_practical_dataset3
# import pandas as pd
# import numpy as np
def generate_practical_dataset3():
"""
Generate a synthetic demographic dataset.
Returns:
pandas.DataFrame: A DataFrame containing the generated dataset with columns:
'MRN', 'Surname', 'Age', 'Gender', 'Weight', 'Blood Type'.
"""
data = {
'MRN': [176968, 173798, 851542, 336291, 114317, 737813, 609203, 938757, 661284, 147859,336291,661284,319011],
'Surname': ['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis', 'Miller', 'Wilson', 'Moore', 'Taylor','Jones','Moore', 'Taylor'],
'Age': [26, 24, 56, 54, 74, 62, 83, 24, 60, 26,54,60, 26],
'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female','Female','Female', None],
'Weight': [57.87, 66.96, 62.98, 63.83, 87.95, 69.91, 62.14, 61.49, 93.71, 94.47,63.83, 93.71, 94.47],
'Blood Type': ['A+', 'B-', 'B-', 'AB+', 'A-', 'O+', 'AB+', 'B+', 'B-', 'O+','AB+','B-', '0+']
}
df = pd.DataFrame(data)
return(df)
df = generate_practical_dataset3()
print(df)
# Your code here
# df = generate_practical_dataset3()
Removing Outliers#
In this section we will briefly look at removing outliers, but this is just a small part of the much wider topic of anomaly detection, which will be covered separately in its own module.
Essentially, anomaly detection encompasses two broad practices: ‘outlier detection’ and ‘novelty detection’.
Outliers are abnormal or extreme data points seen in your initial training data, whereas novelties are new or previously unseen instances compared to your original data.
Returning to outliers, we are now going to take a look at some simple ways of identifying and removing them.
# generate_demo_dataset4
# import pandas as pd
# import random
# import numpy as np
# import matplotlib.pyplot as plt
def generate_demo_dataset4():
"""
Generate a dataset with outliers.
Returns:
numpy.ndarray: An array containing the generated dataset with possible outliers.
"""
np.random.seed(0)
data = np.random.normal(loc=8, scale=1, size=100) #loc=10
outlier_indices = np.random.choice(100, size=10, replace=False) # Introduce 10 outliers
data[outlier_indices] = np.random.normal(loc=12, scale=1, size=10) # Outliers have mean 12 #20,1,10
return(data)
data = generate_demo_dataset4()
print(data)
Popular methods of outlier detection that are used:
Interquartile range (IQR): The IQR is the range between the first quartile (Q1) and the third quartile (Q3) of a distribution. When an instance is beyond Q1 or Q3 by some multiplier of the IQR, it is considered an outlier. The most common multiplier is 1.5, making the outlier range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
Z-score (standard score): The z-score or standard score measures how many standard deviations a data point is away from the mean. Generally, instances with a z-score over 3 are chosen as outliers.
Let’s plot our data, which is in a numpy array, and take a look at the data points.
For this we will use a plotting library called matplotlib
to generate a plot.
import matplotlib.pyplot as plt
plt.plot(data)
Now let’s plot the data as a box plot. What do you notice?
To do this, swap .plot with .boxplot.
plt.boxplot(data)
It is clear from the box plot that there are points that appear to be outliers in the data.
IQR method#
We are first going to take a look at the IQR method.
The IQR is the range between the first quartile (Q1) and the third quartile (Q3) of a distribution. When an instance is beyond Q1 or Q3 by some multiplier of the IQR, it is considered an outlier. The most common multiplier is 1.5, making the outlier range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
Generate the function and then use it to calculate the upper and lower bounds based on IQR * 1.5.
# Define function to identify outliers using IQR method
def identify_outliers_iqr(data, threshold=1.5):
"""
Identify outliers in a dataset using the interquartile range (IQR) method.
Args:
data (numpy.ndarray or pandas.Series): The data for which outliers are to be identified.
threshold (float, optional): The threshold value to determine outliers. Defaults to 1.5.
Returns:
tuple: A tuple containing:
- outliers (numpy.ndarray): A boolean array indicating outliers in the data.
- lower_bound (float): The lower bound for outlier detection.
- upper_bound (float): The upper bound for outlier detection.
"""
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower_bound = q1 - threshold * iqr
upper_bound = q3 + threshold * iqr
outliers = np.logical_or(data < lower_bound, data > upper_bound)
return outliers, lower_bound, upper_bound
Let’s use the identify_outliers_iqr
function:
outliers_iqr, lower_bound, upper_bound = identify_outliers_iqr(data)
print(outliers_iqr)
print("lower_bound:", lower_bound)
print("upper_bound:", upper_bound)
and plot the results using the generated function below plot_outliers_iqr
:
# import matplotlib.pyplot as plt
def plot_outliers_iqr(data, outliers_iqr, lower_bound, upper_bound):
"""
Plot the data with identified outliers using the interquartile range (IQR) method.
Args:
data (numpy.ndarray or pandas.Series): The original data to be plotted.
outliers_iqr (numpy.ndarray): A boolean array indicating outliers in the data.
lower_bound (float): The lower bound for outlier detection.
upper_bound (float): The upper bound for outlier detection.
"""
plt.figure(figsize=(10, 6)) # create a blank figure to plot on
plt.plot(data, label='Data') # plot the data
plt.plot(np.where(outliers_iqr)[0], data[outliers_iqr], 'ro', label='Outliers (IQR)') # highlight the outliers
plt.axhline(lower_bound, color='gray', linestyle='--', label='Lower Bound') # add lower bound line
plt.axhline(upper_bound, color='gray', linestyle='--', label='Upper Bound') # add upper bound line
plt.legend() # add legend
plt.xlabel('Index') # add x label
plt.ylabel('Value') # add y label
plt.title('Outlier Detection Example using IQR Method') # add title
plt.grid(True) # show grid
plt.show() # show the completed plot
Using the function plot_outliers_iqr
…
# Plot data with outliers highlighted and bound lines
plot_outliers_iqr(data, outliers_iqr, lower_bound, upper_bound)
Several points have been identified using this method.
To remove them from the data we are working with we can filter the data array using Boolean indexing.
~outliers_iqr
negates the Boolean array outliers_iqr
, so it selects only the elements of data that are not identified as outliers (ie: Are not True).
# Remove outliers from the dataset
cleaned_data = data[~outliers_iqr]
print("Original data shape:", data.shape)
print("Cleaned data shape:", cleaned_data.shape)
Z-score (standard score) method:#
A z-score represents the number of standard deviations a data point is from the mean of a dataset.
Mathematically, the z-score of a data point \(x\) in a dataset with mean \(\mu\) and standard deviation \(\sigma\) is calculated as:
\(Z = \frac{x - \mu}{\sigma}\)
A z-score of 0 means the data point is exactly at the mean, a positive z-score means the data point is above the mean, and a negative z-score means the data point is below the mean.
Generally, instances with a z-score over 3 are chosen as outliers. This concept refers to data points that are located at 3 standard deviations from the mean of the dataset. It’s often used as a threshold for identifying outliers, especially in normally distributed datasets, where approximately 99.7% of the data falls within 3 standard deviations of the mean (assuming a normal distribution).
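As a quick worked example using the approximate parameters of our generated data (mean of roughly 8 and standard deviation of roughly 1): a value of 12 gives \(Z = \frac{12 - 8}{1} = 4\), which is beyond the threshold of 3 and would therefore be flagged as an outlier.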
Let’s generate the dataset we are going to use:
data = generate_demo_dataset4() # original data array
print(data,"\n\n") # print the data with two carriage returns.
And apply the above formula to calculate the z-scores.
# Calculate z-scores
z_scores = (data - np.mean(data)) / np.std(data)
print(z_scores)
Now we have calculated the z-scores, let’s create an array of outliers (True/False) that are greater than the general threshold of 3 standard deviations from the mean.
# Define threshold for outlier detection
threshold = 3
# Identify outliers
outliers = np.abs(z_scores) > threshold
print(outliers)
We can put all this into a function to make the calculation easier:
def identify_outliers_zscore(data, threshold=3):
"""
Identify outliers in a dataset using z-scores.
Args:
data (numpy.ndarray or pandas.Series): The data for which outliers are to be identified.
threshold (float, optional): The threshold value to determine outliers. Defaults to 3.
Returns:
numpy.ndarray: A boolean array indicating outliers in the data.
"""
# Calculate z-scores
z_scores = (data - np.mean(data)) / np.std(data)
# Identify outliers
outliers = np.abs(z_scores) > threshold
return outliers
# Call the function
outliers = identify_outliers_zscore(data)
This function will plot the results; generate this function…
# import matplotlib.pyplot as plt
def plot_outliers_zscore(data, outliers, threshold=3):
"""
Plot the data with identified outliers using the z-score method.
Args:
data (numpy.ndarray or pandas.Series): The original data to be plotted.
outliers (numpy.ndarray): A boolean array indicating outliers in the data.
threshold (float, optional): The threshold value to determine outliers. Defaults to 3.
"""
plt.figure(figsize=(10, 6)) # create a blank figure to plot on
plt.plot(data, label='Data') # plot the data
plt.plot(np.where(outliers)[0], data[outliers], 'ro', label='Outliers') # highlight the outliers
plt.axhline(np.mean(data), color='green', linestyle='-', label='Mean') # mean
# plt.axhline(np.median(data), color='purple', linestyle='-', label='Median') # median
# add lower threshold line
plt.axhline(np.mean(data) - (threshold * np.std(data)), color='gray', linestyle='--', label='Lower Threshold')
# add upper threshold line
plt.axhline(np.mean(data) + (threshold * np.std(data)), color='gray', linestyle='--', label='Upper Threshold')
plt.legend() # add legend
plt.xlabel('Index') # add x label
plt.ylabel('Value') # add y label
plt.title('Outlier Detection Example using z-Score (standard score) Method') # add title
plt.grid(True) # show grid
plt.show() # show the completed plot
… and now use plot_outliers_zscore
to plot the data.
# Plot data with outliers highlighted
plot_outliers_zscore(data, outliers)
Practical Task 1.4
From the following dataset identify and plot the outliers for cholesterol levels using both the methods we have covered:
Use the supplied functions to identify the outliers and plot the results.
IQR Method: functions identify_outliers_iqr and plot_outliers_iqr.
Z-Score Method: functions identify_outliers_zscore and plot_outliers_zscore.
# generate_practical_dataset4
# import numpy as np
def generate_practical_dataset4(num_patients=1000):
"""
Generate random healthcare data for a specified number of patients.
Parameters:
- num_patients (int): Number of patients for which healthcare data is generated. Default is 1000.
Returns:
- ndarray: A 2D NumPy array containing healthcare data with the following columns:
- Age of patients
- Cholesterol levels in mg/dL
- Blood pressure in mmHg
- Body Mass Index (BMI)
"""
# Generate random healthcare data
age = np.random.randint(18, 90, num_patients) # Age of patients
cholesterol = np.random.normal(200, 30, num_patients) # Cholesterol levels in mg/dL
blood_pressure = np.random.randint(90, 180, num_patients) # Blood pressure in mmHg
bmi = np.random.normal(25, 4, num_patients) # Body Mass Index (BMI)
# Stack arrays horizontally to create a single 2D array
healthcare_data = np.column_stack((age, cholesterol, blood_pressure, bmi))
return healthcare_data
# Call the function and print first few rows of the healthcare data
healthcare_data = generate_practical_dataset4()
print("Sample healthcare data (first 5 rows):")
print(healthcare_data[:5])
# print the first 50 rows of the cholesterol column
print(healthcare_data[:50, 1])
# Your code here
# healthcare_data = generate_practical_dataset4()
Dealing with Target Imbalance#
Before diving into this section, we first need to understand the terms ‘Target’ and ‘Features’.
Target(s) - used to describe the column(s) you are trying to predict in your machine learning model.
Features - all the other columns in the data.
So, target imbalance is when our target column, also known as the target variable, has far fewer instances of the class we are trying to predict than of the other class(es).
Why do we need to address target imbalance?#
Addressing target imbalance is crucial in many machine learning tasks, particularly in classification problems, because it ensures that the model doesn’t become biased towards the majority class.
When the classes in your dataset are imbalanced, meaning some classes have significantly more samples than others, the model may learn to simply predict the majority class for most instances.
When assessing target imbalance, there isn’t a fixed threshold that universally determines whether there’s an imbalance or not.
However, a common rule of thumb is that a class imbalance is significant if one class represents less than 10% to 20% of the total dataset.
A real world example of this would be detecting credit card fraud transactions, or in healthcare, “did not attend” (DNA) rates in outpatient appointments.
Let’s take a look at some ways to tackle target imbalance.
Generate the below patient data where the target is the Disease
variable.
# generate_demo_dataset5
# import pandas as pd
# import numpy as np
def generate_demo_dataset5():
"""
Generate a synthetic demographic dataset.
Returns:
pandas.DataFrame: A DataFrame containing the generated dataset with columns:
- 'Age': Age of the individuals.
- 'Gender': Gender of the individuals (1) male or (2) female.
- 'Blood Pressure': Blood pressure of the individuals.
- 'Disease': Target variable indicating the presence (1) or absence (0) of a disease.
"""
np.random.seed(42)
# Features
age = np.random.randint(20, 80, size=1000)
gender = np.random.choice([1, 2], size=1000)
blood_pressure = np.random.randint(90, 180, size=1000)
# Target variable
disease = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])
# Create DataFrame
data = pd.DataFrame({
'Age': age,
'Gender': gender,
'Blood Pressure': blood_pressure,
'Disease': disease
})
df = pd.DataFrame(data)
return(df)
df = generate_demo_dataset5()
print(df)
# Display the shape of dataframe
print(df.shape)
Inspect the ‘Disease’ target variable to identify if there is a significant imbalance.
Tip: Note these numbers down as it will help you with checking the following methods we are about to cover!
df['Disease'].value_counts()
This can be quickly plotted to give a quick visual representation.
df['Disease'].value_counts().plot.bar()
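To express the same counts as proportions (useful for the 10% to 20% rule of thumb mentioned earlier), value_counts also accepts a normalize argument:
# proportion of records in each Disease class
df['Disease'].value_counts(normalize=True)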
To summarise before moving on: most records in this dataset are ones where the Disease target variable is False, i.e. no disease. The remaining records represent the minority class, where the Disease variable is True. As the aim is to predict when Disease = True, we need to address the imbalance in the data.
SMOTE (Synthetic Minority Over-sampling Technique):#
One way of dealing with target imbalance is a methodology called SMOTE, which stands for Synthetic Minority Over-sampling TEchnique.
The way that SMOTE works is that, for each minority class instance in the data, SMOTE will find its k nearest neighbours (where k is a user-specified number) in the feature space.
From this it then generates synthetic samples by creating new instances along the line segments connecting the minority class instance to its nearest neighbours.
These synthetic samples are then added to the original dataset, which effectively increases the number of the minority class instances.
To demonstrate this we are going to use part of the imblearn
library.
The imbalanced-learn library (abbreviated as imblearn) is a Python library specifically designed to address the problem of class imbalance in machine learning datasets.
Note: We have already imported parts of the
imblearn
library at the start of the notebook.
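If imblearn is not available in your environment, it can typically be installed with pip install imbalanced-learn (the package name on PyPI).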
For SMOTE we use the following library:
# import SMOTE from imblearn.over_sampling
from imblearn.over_sampling import SMOTE
First, we need to separate the features and target to X and y respectively - this is a common standard notation for features and targets.
# Separate features and target variable
X = df.drop('Disease', axis=1) # drop target variable leaving the remaining features
y = df['Disease'] # just the target variable
Task:
Check the number of rows and columns in X (features) and y (target variable)
And then also plot the values of the target variable y:
# Your code here
We now have our features and target separated, so we can now start by instantiating SMOTE (creating an instance of the class, in this case an instance of SMOTE), and then use it to ‘resample’ the X and y data.
To do this we use:
SMOTE(random_state=42)
which creates an instance of the SMOTE algorithm with a specific random state (here 42). The random state is an arbitrary choice, and you could use any integer value. The important aspect is to keep it consistent across runs if reproducibility is desired.
Geek Alert: You will see 42 often used in notebooks from other data scientists. The use of the number 42 as the random_state parameter in machine learning is actually a reference to the science fiction series “The Hitchhiker’s Guide to the Galaxy” by Douglas Adams.
# Instantiate SMOTE
smote = SMOTE(random_state=42, k_neighbors=5)
Now use this to resample your data.
# Resample the dataset
X_resampled_smote, y_resampled_smote = smote.fit_resample(X, y)
Check the number of rows and columns in X_resampled
and y_resampled
:
# Your code here
So, where Disease = False we had x records, this being the majority class. SMOTE takes the minority class, Disease = True with y records, and increases it to x records so that the total number of records for Disease = True is balanced with the total number of records for Disease = False.
Therefore, the total number of resulting records will be double the size of the majority class.
Look at the resampled target variable value counts to confirm this:
y_resampled_smote.value_counts()
Let’s quickly plot the before and after using the below function:
# import matplotlib.pyplot as plt
def plot_before_and_after_resampling(y, y_resampled, label):
"""
Plot bar plots before and after resampling.
Args:
y (pandas.Series): Original target variable.
y_resampled (pandas.Series): Resampled target variable.
label (str): Label to be used in the plot title for the resampled data.
Returns:
None
"""
# Create a figure and axis object
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
# Plot the first bar plot for y
y.value_counts().plot(kind='bar', ax=axs[0])
axs[0].set_title('y')
axs[0].set_xlabel('Disease')
axs[0].set_ylabel('Number of Records')
# Plot the first bar plot for y_resampled
y_resampled.value_counts().plot(kind='bar', ax=axs[1])
axs[1].set_title(f'y_resampled using {label}')
axs[1].set_xlabel('Disease')
axs[1].set_ylabel('Number of Records')
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()
return
plot_before_and_after_resampling(y,y_resampled_smote,"SMOTE")
Random oversampling:#
RandomOverSampler simply duplicates some samples from the minority class to balance the dataset. It randomly selects instances from the minority class and replicates them until the dataset is balanced.
Generate the dataset again if required:
df = generate_demo_dataset5()
print(df)
For random over sampling we use the following library:
# import RandomOverSampler from imblearn.over_sampling
from imblearn.over_sampling import RandomOverSampler
We have previously already separated the features and the target variable, so we don’t need to repeat this. As a reminder the code was:
# Separate features and target variable
X = df.drop('Disease', axis=1)
y = df['Disease']
Now we create an instance of the RandomOverSampler algorithm:
# Instantiate RandomOverSampler
ros = RandomOverSampler(random_state=42)
Now use this to resample your data.
# Resample the dataset
X_resampled_ros, y_resampled_ros = ros.fit_resample(X, y)
And plot the results.
plot_before_and_after_resampling(y,y_resampled_ros,"RandomOverSampler")
What do you notice comparing these results to SMOTE?
The results should be the same, but the techniques the two algorithms use are very different; both, however, are over-sampling algorithms.
Use SMOTE when the minority class is densely packed or when there is overlapping with the majority class. SMOTE synthesises new minority class samples along the lines connecting existing minority class samples, effectively creating synthetic examples within the feature space.
Use RandomOverSampler when the minority class is spread out and there is less risk of creating overlapping or synthetic examples that might not represent the true distribution of the minority class. RandomOverSampler simply duplicates minority class samples, maintaining the original distribution.
In both cases, over-sampling is best used on smaller datasets, as potentially a lot of extra records will be created to achieve the balance.
To help with identifying whether the minority class is densely packed, overlapping with the majority class or being spread out, use a seaborn
pair plot to quickly visualise patterns.
Seaborn is a visualisation library which is imported as import seaborn as sns.
A pair plot, also known as a scatterplot matrix, is a type of visualisation that allows you to explore relationships between pairs of variables in a dataset. It’s particularly useful for datasets with multiple variables, enabling you to quickly identify patterns, correlations, and potential insights.
Use the functions original_target_variable_pair_plot
and resampled_target_variable_pair_plot
to compare the results of:
The original data
SMOTE resampled data
Random over-sampling resampled data
# import matplotlib.pyplot as plt
def original_target_variable_pair_plot(df):
"""
Generate a pair plot to visualise the distribution of features by disease class for original data.
Parameters:
df (DataFrame): The DataFrame containing the original data.
"""
# Visualising the distribution of features by disease class
sns.pairplot(df, hue='Disease', height=2)
# Add a title
plt.suptitle('Pair Plot of Features by Disease Class - Original Data', y=1.05)
plt.show()
# Calculating the average distance between minority class samples
minority_samples = df[df['Disease'] == 1][['Age', 'Blood Pressure']]
mean_distance = np.mean(np.linalg.norm(minority_samples - minority_samples.mean(axis=0), axis=1))
print("Average distance between minority class samples:", mean_distance)
return
def resampled_target_variable_pair_plot(X_resampled, y_resampled, label):
"""
Generate a pair plot to visualise the distribution of features by disease class after resampling.
Parameters:
X_resampled (array-like): The resampled features.
y_resampled (array-like): The resampled target variable.
label (str): The label indicating the type of resampling performed.
"""
# Concatenate the resampled features and target variable
df_resampled = pd.concat([pd.DataFrame(X_resampled, columns=X.columns),
pd.DataFrame(y_resampled, columns=['Disease'])], axis=1)
# Visualising the distribution of features by disease class after resample
sns.pairplot(df_resampled, hue='Disease', height=2)
plt.suptitle(f'Pair Plot of Features by Disease Class - {label} Resampled Data', y=1.05)
plt.show()
# Calculating the average distance between minority class samples
minority_samples = df_resampled[df_resampled['Disease'] == 1][['Age', 'Blood Pressure']]
mean_distance = np.mean(np.linalg.norm(minority_samples - minority_samples.mean(axis=0), axis=1))
print("Average distance between minority class samples:", mean_distance)
return
Run the functions and review the plots.
# original data
original_target_variable_pair_plot(df)
# smote resampled data
resampled_target_variable_pair_plot(X_resampled_smote, y_resampled_smote, "smote")
# random over sampled resampled data
resampled_target_variable_pair_plot(X_resampled_ros, y_resampled_ros, "random over sampler")
Random Under-sampling:#
Random under-sampling can be effective when the dataset is very large and computational resources are limited. However, the trade-off is that it comes with the risk of losing potentially valuable information from the majority class.
Generate the dataset again if required:
df = generate_demo_dataset5()
print(df)
For random under-sampling we use the following library:
# import RandomUnderSampler from imblearn.under_sampling
from imblearn.under_sampling import RandomUnderSampler
You should be familiar with the steps to carry out resampling as they are the same as before just with a new algorithm RandomUnderSampler
.
We will do this in one code cell.
# Separate features and target variable
X = df.drop('Disease', axis=1)
y = df['Disease']
# Instantiate RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
# Resample the dataset
X_resampled_rus, y_resampled_rus = rus.fit_resample(X, y)
And go straight to plotting the results.
plot_before_and_after_resampling(y,y_resampled_rus,"RandomUnderSampler")
This should show what you expected: the majority class (Disease = False) has been randomly reduced to the same number of records as the minority class (Disease = True).
Practical Task 1.5
Using one of the above methods look to address the target imbalance of this new dataset.
The target variable is: Diabetes
# generate_practical_dataset5
# import pandas as pd
# import numpy as np
def generate_practical_dataset5():
"""
Generate a synthetic healthcare-related dataset with imbalanced classes.
Returns:
pandas.DataFrame: A DataFrame containing the generated dataset with columns:
- 'Age': Age of the patients.
- 'Gender': Gender of the patients (1 for male, 2 for female).
- 'Blood Pressure': Blood pressure of the patients.
- 'Cholesterol': Cholesterol level of the patients.
- 'Diabetes': Target variable indicating the presence (1) or absence (0) of diabetes.
"""
np.random.seed(42)
# Features
age = np.random.randint(20, 80, size=1000)
gender = np.random.choice([1, 2], size=1000)
blood_pressure = np.random.randint(90, 180, size=1000)
cholesterol = np.random.randint(120, 300, size=1000)
# Target variable
# Introduce class imbalance (90% negative class, 10% positive class)
diabetes = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])
# Create DataFrame
data = pd.DataFrame({
'Age': age,
'Gender': gender,
'Blood Pressure': blood_pressure,
'Cholesterol': cholesterol,
'Diabetes': diabetes
})
return data
# Generate the synthetic healthcare dataset
healthcare_df = generate_practical_dataset5()
# Display the first few rows of the dataset
print(healthcare_df.head())
# Display the shape of the dataset
print("Shape of the dataset:", healthcare_df.shape)
Check the target variable and confirm the imbalance. You can also plot this if you wish using
.plot.bar()
.
# Your code here
Separate the features and target variable.
# Your code here
Pick a method for dealing with the imbalance and instantiate the algorithm.
# Your code here
Resample the dataset.
# Your code here
Plot the results of before and after the resampling method using function:
plot_before_and_after_resampling
.
# Your code here
Chapter Summary#
Well done on reaching the end of this chapter!
Just to recap what we have learnt when it comes to cleaning and preparing our data, you should now feel familiar with:
Checking and converting data types.
Handling missing values, by removing specific rows or columns where missing values are present, and by filling in missing values using statistics from the dataset.
Removing duplicate records and duplicates across specific columns.
Identifying and handling outliers in the data.
Dealing with target imbalance.