Data Preprocessing#

Feature Engineering, Scaling and Transforming#

The next step in data preprocessing is feature engineering, where categorical variables are converted into numerical values, new features are generated, and various data transformations are applied. Scaling and transforming are then carried out on the numerical features, either by adjusting the range of values or by applying mathematical operations or functions to change how the values are distributed.
These steps collectively aim to ensure the data’s robustness and suitability for efficient and reliable model execution.

Within this chapter the following topics will be covered:
Feature Engineering:

  • Creating New Features.

    • Bin Numeric Features.

    • Group Features.

  • Encoding Categorical Variables.

    • Label Encoding.

    • One-Hot Encoder.

    • Ordinal Encoding.

  • Combine Rare Levels / Cardinal Encoding.

  • Removing Multicollinearity.

Scaling and Transforming:

  • Normalising or Scaling the Data.

  • Transformation.

Import the following libraries for this chapter:

import pandas as pd
import random
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Encoding categorical variables
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder

# Normalise data
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Transforming data
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer

Creating New Features#

By creating new features, you can provide additional information to the machine learning model.
New features can capture relationships and patterns in the data that the original features might not fully represent and can also help make the data more interpretable.

As we have already seen, missing data can be a challenge in machine learning. By creating new features based on existing ones, we can help mitigate the impact of missing values, as shown in the sketch below.
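For example (a minimal, hypothetical sketch: it assumes a dataframe df whose bmi column contains missing values, which is not the case in the demo dataset below), a simple binary indicator can record that a value was absent so the model can still learn from that fact:

# Hypothetical sketch: flag rows where 'bmi' is missing so the model
# can learn from the fact that the value was absent
df['bmi_missing'] = df['bmi'].isna().astype(int)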

We are going to look at two types of feature creation in this next section: binning numeric features and grouping features.

Bin Numeric Features#

Binning numeric features can be used to separate continuous numerical variables into a set of predefined bins or intervals.
This process involves partitioning the range of the variable into discrete segments, or bins, and assigning each data point to its corresponding bin.

Run the function generate_fe_demo_dataset1 below to generate the demo dataset.

# Demo Dataset 1
# import numpy as np
# import pandas as pd

def generate_fe_demo_dataset1():
    """
    Generate a synthetic healthcare dataset for feature engineering.

    Returns:
        pandas.DataFrame: DataFrame containing synthetic healthcare data with the following columns:
            - patient_id: Unique identifier for each patient.
            - ward: Ward in which the patient is admitted.
            - age: Age of the patient.
            - gender: Gender of the patient (Male/Female).
            - bmi: Body Mass Index of the patient.
            - systolic_bp: Systolic blood pressure of the patient.
            - diastolic_bp: Diastolic blood pressure of the patient.
            - cholesterol: Cholesterol level of the patient.
            - diabetes: Whether the patient has diabetes (Y/N).
            - smoker: Whether the patient is a smoker (Y/N).
            - cigarettes_per_day: Number of cigarettes smoked per day for smokers (0 for non-smokers).
    """
    data = {
        'patient_id': range(5001, 5101),
        'ward': np.random.choice([
            "Medical Ward 1", "Medical Ward 2", "Medical Ward 3", "Medical Ward 4", "Medical Ward 5", 
            "Medical Ward 6", "Medical Ward 7", "Medical Ward 8", "Medical Ward 9", "Medical Ward 10",
            "Medical Ward 11", "Medical Ward 12", "Medical Ward 13", "Medical Ward 14", "Medical Ward 15", 
            "Surgical Ward 1", "Surgical Ward 2", "Surgical Ward 3", "Surgical Ward 4", "Surgical Ward 5", 
            "Surgical Ward 6", "Surgical Ward 7", "Surgical Ward 8"
        ], size=100),
        'age': np.random.randint(20, 80, size=100),
        'gender': np.random.choice(['Male', 'Female'], size=100),
        'bmi': np.random.uniform(18.5, 35.0, size=100),
        'systolic_bp': [
            172, 135, 86, 147, 123, 99, 142, 129, 158, 123, 115, 100, 120, 107, 124, 101, 111, 107, 94, 129, 
            149, 144, 99, 91, 121, 114, 134, 145, 156, 109, 138, 87, 178, 104, 148, 146, 105, 117, 120, 138, 
            130, 126, 162, 118, 146, 80, 135, 139, 163, 133, 148, 134, 111, 180, 135, 131, 125, 101, 93, 129, 
            149, 117, 155, 214, 124, 114, 132, 139, 122, 118, 162, 128, 128, 129, 99, 131, 130, 102, 108, 114, 
            91, 101, 133, 139, 146, 125, 121, 107, 146, 151, 130, 118, 118, 80, 136, 123, 137, 125, 106, 120
        ],
        'diastolic_bp': [
            81, 102, 41, 79, 68, 58, 61, 65, 140, 58, 79, 64, 63, 82, 83, 64, 71, 81, 66, 71, 
            73, 94, 58, 49, 83, 69, 69, 57, 68, 76, 99, 73, 82, 71, 64, 88, 71, 74, 79, 54, 
            113, 63, 70, 72, 54, 55, 55, 78, 81, 75, 72, 75, 61, 99, 67, 86, 85, 70, 57, 85, 80, 
            68, 111, 84, 60, 67, 89, 68, 68, 70, 95, 85, 76, 56, 58, 70, 66, 73, 60, 66, 66, 65, 
            84, 61, 109, 58, 79, 86, 65, 90, 85, 70, 80, 52, 57, 95, 78, 64, 59, 53
        ],
        'cholesterol': np.random.randint(120, 300, size=100),
        'diabetes': np.random.choice(['Y', 'N'], size=100, p=[0.2, 0.8]),
        'smoker': np.random.choice(['Y', 'N'], size=100, p=[0.4, 0.6]),
    }

    # Conditionally generate cigarettes_per_day based on smoker status
    data['cigarettes_per_day'] = [np.random.randint(1, 30) if smoker == 'Y' else 0 for smoker in data['smoker']]
    
    df = pd.DataFrame(data)
    
    return df

Now use the function we have just defined:

df = generate_fe_demo_dataset1()

df.head()

From this dataset we want to create a new categorical feature which ‘bins’ the age of patients into 10-year bandings.

This is how it is done, but let’s break this down into its components:

df['age_band'] = pd.cut(df['age'], bins=range(20, 90, 10), labels=[f"{i}-{i+9}" for i in range(20, 80, 10)])

  • First off we define our new column name: df['age_band'].

  • We use pd.cut() on the df['age'] column from the dataframe. This function is used to segment and sort data values into bins.
    It takes several arguments:

    • The first argument df['age'] is the data to be segmented, in this case, the ‘age’ column.

    • bins=range(20, 90, 10): This specifies the bin edges into which the data will be divided.
      range(20, 90, 10) produces the edges 20, 30, 40, …, 80, giving six intervals:
      (20-30], (30-40], …, (70-80].

    • labels=[f"{i}-{i+9}" for i in range(20, 80, 10)]: This specifies the labels to assign to each bin, in the format lower bound-upper bound. It generates one label per interval: 20-29, 30-39, 40-49, …, 70-79.

Let’s apply this to our dataframe and look at the results:

# Create age bands
df['age_band'] = pd.cut(df['age'], bins=range(20, 90, 10), labels=[f"{i}-{i+9}" for i in range(20, 80, 10)])

df.head()
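One thing to be aware of: pd.cut uses right-closed intervals by default (right=True), so an age of exactly 20 falls outside the first bin and ends up as NaN in age_band. If you want the lower edge included, you can pass include_lowest=True:

# Include the lowest edge (age 20) in the first bin
df['age_band'] = pd.cut(df['age'], bins=range(20, 90, 10),
                        labels=[f"{i}-{i+9}" for i in range(20, 80, 10)],
                        include_lowest=True)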

Practical Task 2.1

We now want to categorise the cholesterol column based on the values below. Following what we have just learnt, create a new feature named ‘cholesterol_category’ (use the category names exactly as written, as we will ordinal-encode this column later in the chapter).

  • < 200 - Healthy

  • 200 - 239 - At Risk

  • >= 240 - Dangerous

# Your code here

Group Features#

Grouping features involves aggregating or combining related features into a single feature.
This is where multiple related features are combined to create new, more informative features. It can also involve grouping similar features together to represent a broader aspect of the data.

From this we will create a new group feature, combining the systolic_bp and diastolic_bp, to categorise the blood pressure results.

For this a function has been defined. Look at the function and apply it to a new column bp_category.

def categorise_blood_pressure(systolic, diastolic):
    """
    Categorises blood pressure based on systolic and diastolic readings.

    Parameters:
        systolic (int): The systolic blood pressure reading.
        diastolic (int): The diastolic blood pressure reading.

    Returns:
        str: A string indicating the blood pressure category.
            Possible categories:
                - "Low" for systolic <= 90 and diastolic <= 60
                - "Ideal" for systolic <= 120 and diastolic <= 80
                - "Pre-high" for systolic <= 135 and diastolic <= 85
                - "High" for all other cases
    """
    if systolic <= 90 and diastolic <= 60:
        return "Low"
    elif systolic <= 120 and diastolic <= 80:
        return "Ideal"
    elif systolic <= 135 and diastolic <= 85:
        return "Pre-high"
    else:
        return "High"
    
    return

We can apply the function to the dataframe like this:

# Calculate 'blood_pressure_category' column
df['bp_category'] = df.apply(lambda row: categorise_blood_pressure(row['systolic_bp'], row['diastolic_bp']), axis=1)
df.head()

Encoding Categorical Variables#

Categorical data cannot typically be directly handled by machine learning algorithms, as most algorithms are designed to operate with numerical data only. Therefore, before categorical features can be used as inputs to machine learning algorithms, they must be encoded as numerical values.

We are going to look at the various ways in which categorical variables can be encoded and what you need to consider to ensure you select the correct method of encoding.

We are going to continue using the same dataset from the last section, as we are going to be working with the newly created features.

Label Encoding#

Label encoding is a technique used in data preprocessing where categorical variables are converted into numerical values. This process assigns a unique numerical label to each category within a feature.

To carry out label encoding we need to import the following library.

# import LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder

First, we need to instantiate an instance of the LabelEncoder

# instantiate LabelEncoder
label_encoder = LabelEncoder()

Next, we are going to apply the encoder to the gender column.

It is best to overwrite the existing column here, because:

  • Overwriting the original field saves memory since you’re not storing redundant information.

  • There’s less complexity in the dataset since you’re not managing multiple versions of the same information.

  • With only one version of the column, there’s no confusion about which version to use.

To apply the encoder we use .fit_transform on the gender column.

# Label encode the 'gender' column
df['gender'] = label_encoder.fit_transform(df['gender'])

Look at the dataframe and notice what has changed in the gender column.

df.head()

We only have two different values in this column. At some point we may wish to ‘decode’ back to the original values, especially if there are many values that were encoded. To get back to the original values we use .inverse_transform on the gender column.

# Decode the encoded values back to original values
df['gender'] = label_encoder.inverse_transform(df['gender'])
df.head()

To complete this, let’s reapply the encoding.

df['gender'] = label_encoder.fit_transform(df['gender'])
df.head()
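If you want to see which number was assigned to which category, the fitted encoder exposes the mapping through its classes_ attribute; the integer assigned to each category is simply its index in this array:

# Inspect the mapping learnt by the encoder
print(label_encoder.classes_)  # e.g. ['Female' 'Male'], encoded as 0 and 1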

One Hot Encoding#

One hot encoding is a method used to convert categorical variables into a binary format, where each category is represented by a binary vector. In this encoding scheme, each category is assigned a unique index, and then a binary vector is created where only one element is “hot” (set to 1) indicating the presence of that category, while all other elements are “cold” (set to 0).

To carry out one hot encoding we can use a method in pandas called .get_dummies.

You can also use OneHotEncoder from sklearn.preprocessing for this, which follows the same fit/transform steps as the LabelEncoder we have just covered; a brief sketch is shown below.
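Here is a minimal sketch of the equivalent OneHotEncoder call (it assumes a reasonably recent version of scikit-learn, and we will stick with pd.get_dummies for the rest of this section):

# Sketch only: one-hot encode 'smoker' with sklearn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()
smoker_encoded = onehot_encoder.fit_transform(df[['smoker']]).toarray()
print(onehot_encoder.get_feature_names_out())  # e.g. ['smoker_N' 'smoker_Y']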

From our dataset we are going to use One Hot Encoding on the smoker column.

# One-hot encode the 'smoker' column
smoker_dummies = pd.get_dummies(df['smoker'], prefix='smoker')

The encoded results are in their own dataframe.

# inspect the new dataframe
smoker_dummies

To include this in our dataset we just need to concatenate the two dataframes.

df = pd.concat([df, smoker_dummies], axis=1)

We will also remove the original smoker column.

df.drop('smoker', axis=1, inplace=True)
df.head()

Ordinal Encoding#

Ordinal encoding is used to convert categorical variables into numerical values based on their order or rank.
In this method, each unique category is assigned a numerical value according to its position in a predefined order.
Ordinal encoding is commonly used when the categorical variables have a natural order or hierarchy, such as low, medium, and high, so that the numerical values represent a meaningful relationship between categories.

To carry out ordinal encoding we need to import the following library.

# import OrdinalEncoder from sklearn.preprocessing
from sklearn.preprocessing import OrdinalEncoder

From our dataset we are going to use Ordinal Encoding on the cholesterol_category column.

Firstly, we need to define a list of the possible values, in order from lowest to highest.

cholesterol_category_order = ['Healthy', 'At Risk', 'Dangerous']

Now instantiate the encoder with the orders we have just defined.

# Instantiate OrdinalEncoder with specified categories
ordinal_encoder = OrdinalEncoder(categories=[cholesterol_category_order])

And apply the ordinal encoding back to the dataframe.

df['cholesterol_category'] = ordinal_encoder.fit_transform(df[['cholesterol_category']])

df.head()

What do you notice about the encoded values that we have just created?
Use the below cell if you need to inspect the values.

# Your code here

Combining Rare Levels / Cardinal Encoding#

Combining rare levels, also known as cardinal encoding, is a strategy where infrequent categories within a categorical variable are grouped together into a single category.
This approach is beneficial for reducing the dimensionality of the feature space and addressing issues related to overfitting caused by sparse or noisy data.
By combining rare levels, the model can focus on the most common and informative categories while simplifying the representation of the data.

Using our dataset, count the number of values that we have in the ward column.

df['ward'].value_counts()

Unless we were looking for specific patterns in the data across wards, it would be better to group the different types of wards in some way to reduce the dimensionality.

# Create a new column to identify if the ward is medical or surgical
df['medical_ward'] = 1
df.loc[df['ward'].str.contains('Surgical'), 'medical_ward'] = 0
df.head()
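The medical/surgical split above is one way of grouping. A more general pattern for combining rare levels (a sketch only: the threshold of 5 is arbitrary and ward_grouped is a hypothetical new column name) is to lump any level that occurs only a handful of times into a single 'Other' category:

# Sketch: combine wards that appear fewer than 5 times into an 'Other' level
ward_counts = df['ward'].value_counts()
rare_wards = ward_counts[ward_counts < 5].index
df['ward_grouped'] = df['ward'].where(~df['ward'].isin(rare_wards), 'Other')
df['ward_grouped'].value_counts()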

Practical Task 2.2

Using our dataset,

  • Create a new category variable for bmi named bmi_category using the classification levels below.

  • Then use ordinal encoding using the new bmi_category column, but create as a new column bmi_category_encoded so you can compare the results easily.

BMI classification levels:

  • A BMI of 18.4 and below is classed as underweight.

  • A BMI of 18.5 to 24.9 is classed as a healthy weight.

  • A BMI of 25 to 29.9 is classed as overweight.

  • A BMI of 30 or more is classed as obese.

# Your code here

Removing Multicollinearity#

What is Multicollinearity?#

Multicollinearity occurs when two or more predictor variables are highly correlated with each other. In other words, it means that some independent variables are linearly dependent on others. This can cause several issues in regression analysis:

  • Unstable Estimates: Multicollinearity can lead to unstable estimates of the coefficients in the regression model. Small changes in the data can result in large changes in the estimated coefficients.

  • Reduced Precision: Multicollinearity inflates the standard errors of the regression coefficients, which reduces the precision of the estimates. This makes it difficult to identify the true effect of each predictor variable on the target variable.

  • Difficulty in Interpretation: High multicollinearity makes it challenging to interpret the individual effects of predictor variables on the target variable. It becomes unclear which variables are truly driving the variation in the target variable.

So, removing multicollinearity involves identifying and eliminating highly correlated variables from a dataset.

To help identify highly correlated variables we can carry out some quick correlation analysis with a correlation matrix. A correlation matrix will show all the variables and identify pairs of variables with high correlation coefficients. Variables with correlation coefficients above a certain threshold (e.g., 0.7 or 0.8) are considered highly correlated.

To demonstrate this, we are going to look at this new dataset that shows some highly correlated data.

# import numpy as np
# import pandas as pd

def generate_correlated_dataframe():
    """
    Generate a sample DataFrame with highly correlated variables and multicollinearity.

    Returns:
        pandas.DataFrame: DataFrame containing the following columns:
            - X1: Independent variable 1.
            - X2: Independent variable 2, highly correlated with X1.
            - X3: Independent variable 3.
            - X4: Independent variable 4.
    """
    # Create sample data
    data = {
        'X1': np.random.rand(100),  # Independent variable 1
        'X2': np.random.rand(100),  # Independent variable 2
        'X3': np.random.rand(100),  # Independent variable 3
        'X4': np.random.rand(100),  # Independent variable 4
    }

    # Create multicollinearity by adding a new variable that's highly correlated with X1
    data['X2'] = data['X1'] + np.random.normal(0, 0.1, 100)

    # Convert data into a DataFrame
    df = pd.DataFrame(data)
    
    return df

def plot_correlation_matrix(correlation_matrix):
    """
    Plot a heatmap of the correlation matrix.

    Parameters:
        correlation_matrix (pandas.DataFrame): The correlation matrix to plot.

    Returns:
        None
    """
    plt.figure(figsize=(8, 6))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
    plt.title('Correlation Matrix')
    plt.show()

# generate correlated dataframe
df_correlated_data = generate_correlated_dataframe()

# print correlated dataframe
print(df_correlated_data.head())

Using this data df_correlated_data, and a plotting function plot_correlation_matrix already made for you to use, let’s look at the correlation matrix.

Compute the correlation matrix from the dataset, then plot the results with plot_correlation_matrix.

# Compute the correlation matrix
correlation_matrix = df_correlated_data.corr()

# Plot the correlation matrix.
plot_correlation_matrix(correlation_matrix)

  • Red - High or Positive correlation.

  • Blue - Low or Negative correlation.

What two variables are highly correlated here?
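If you would rather find such pairs programmatically than read them off the heatmap, a small sketch like the one below (using the 0.8 threshold mentioned earlier) lists every pair of columns whose absolute correlation exceeds it:

# Sketch: list pairs of columns with absolute correlation above 0.8
corr = df_correlated_data.corr().abs()
upper_triangle = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # each pair once, no diagonal
high_pairs = corr.where(upper_triangle).stack()
print(high_pairs[high_pairs > 0.8])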

Let’s now do the same for the dataframe that we have been working on. Note that our dataframe still contains some non-numeric columns (such as ward and bp_category), so we pass numeric_only=True to include only the numerical columns in the correlation matrix.

# Compute the correlation matrix (numeric columns only)
our_dataset_correlation_matrix = df.corr(numeric_only=True)

# Plot the correlation matrix.
plot_correlation_matrix(our_dataset_correlation_matrix)

What highly correlated variable could we remove here without losing data?

Practical Task 2.3

Using our dataset

  • First remove the highly correlated variable from the dataset.

  • Then using the documentation here on One Hot Encoding, One-Hot encode the diabetes column, using the drop='first' argument.

Note: You will need to import OneHotEncoder from the sklearn library as we have not used this in our notebook.

  • Plot a correlation matrix to check the results of both the previous points. i.e.: have you successfully removed the columns with high multicollinearity!!

# Your code here - remove the highly correlated variable
df.head()
# Your code here - use one hot encoding from sklearn, to encode the diabetes column using the drop='first' argument.
######################################
# Remember the drop='first' argument
######################################
# Your code here - plot a new correlation matrix of dataset using function: plot_correlation_matrix()

Scale & Transform#

So far, we have looked at the categorical variables in our dataset and how to prepare them for machine learning. However, the numerical features also need to be reviewed before creating a model; this is what scaling and transforming are about.

Scaling is the process of adjusting the range of values, either directly or through a mathematical operation or function applied to the data, while transformation is the process of changing the distribution of the data.

Normalising or Scaling the Data#

Normalising data involves scaling numerical features to a standard range, typically between 0 and 1 or -1 and 1, to ensure consistency and comparability across variables.
This process is essential in data preprocessing to prevent features with larger scales from dominating those with smaller scales, which can adversely affect the performance of machine learning algorithms, especially those sensitive to the scale of features (e.g., K-means clustering, gradient descent-based algorithms).

To implement Min-Max Scaling on our numerical features we need to import the following library.

Note: This process can also be applied to target variables.

# import MinMaxScaler from sklearn.preprocessing
from sklearn.preprocessing import MinMaxScaler

We are going to look at a simple dataset to demonstrate this before applying to our dataset.

# Sample Dataframe
data = {'Feature1': [10, 20, 30, 40, 50],
        'Feature2': [100, 200, 300, 400, 500],
        'Feature3': [150, 250, 350, 450, 550]}

df_sample = pd.DataFrame(data)
df_sample

Firstly, instantiate the MinMaxScaler.

# Instantiate MinMaxScaler
minmax_scaler = MinMaxScaler()

Apply to the simple dataframe.

# Normalise the data
df_normalised_minmaxscaler = pd.DataFrame(minmax_scaler.fit_transform(df_sample), columns=df_sample.columns)

And compare the output to the original.

print("Original Dataframe:")
print(df_sample)
print("\nNormalised Dataframe:")
print(df_normalised_minmaxscaler)
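As a quick sanity check (a sketch, not a required step), you can reproduce what MinMaxScaler does by hand: each value is mapped to (x - min) / (max - min), column by column.

# Manual Min-Max scaling, equivalent to MinMaxScaler with the default range (0, 1)
df_manual_minmax = (df_sample - df_sample.min()) / (df_sample.max() - df_sample.min())
print(df_manual_minmax)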

Another method is Z-score standardisation. This will scale the data so that it has a mean of 0 and standard deviation of 1.
To implement Z-score standardisation on our numerical features we need to import the following library.

# import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

Using the same small dataset of df_sample, set-up the StandardScaler, apply to the dataframe and print the results with the original to compare.

# Instantiate StandardScaler
zscore_scaler = StandardScaler()

# Normalise the data
df_normalised_standardscaler = pd.DataFrame(zscore_scaler.fit_transform(df_sample),
                                            columns=df_sample.columns)

print("Original DataFrame:")
print(df_sample)
print("\nNormalised DataFrame:")
print(df_normalised_standardscaler)
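Again as a quick sanity check (a sketch), StandardScaler computes z = (x - mean) / std for each column, where std is the population standard deviation (ddof=0):

# Manual Z-score standardisation, equivalent to StandardScaler
df_manual_zscore = (df_sample - df_sample.mean()) / df_sample.std(ddof=0)
print(df_manual_zscore)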

The choice between Min-Max Scaling and Z-score Standardisation depends on the characteristics of your data and the requirements of your machine learning model.
Min-Max Scaling is more suitable for non-Gaussian (non-normal) distributed data or when you need data within a specific range, while Z-score Standardisation is preferable for Gaussian (normal) distributed data or when preserving the shape of the distribution is important, especially for algorithms that rely on distance metrics.

Practical Task 2.4

Using our dataset apply either the Min-Max scaler or the Z-score/Standard Scaler to the following numerical features.

  • Column: cholesterol.

  • Column: bmi.

  • Column: cigarettes_per_day.

# Your code here

Transformation#

While normalisation rescales the data to new limits to reduce the impact of differences in magnitude, transformation of features and/or target variables is a more radical technique.
Transformation changes the shape of the distribution so that the transformed data follows a normal, or approximately normal, distribution.

Transforming data to approximate a normal distribution is crucial in machine learning for ensuring that the data meets the assumptions of many algorithms, such as linear regression and statistical tests. Normalising the distribution improves model stability, convergence, and performance. By stabilising variance and reducing skewness, transformed data facilitates better understanding of underlying relationships and leads to more accurate predictions.

We will be working on our dataset for this section and will be plotting some distributions to show how we can transform them.
Generate this function, which will plot the distribution of one or two specified columns in the dataframe in two separate plots for comparison.

# plot transformation function
# import seaborn as sns
# import matplotlib.pyplot as plt

def plot_transformation(df, column1, column2=None):
    """
    Plot the distribution of one or two specified columns in the dataframe.

    Parameters:
        df (pandas.DataFrame): DataFrame containing the data.
        column1 (str): Name of the first column to be plotted.
        column2 (str, optional): Name of the second column to be plotted. Defaults to None.
    """
    # Set the style of seaborn
    sns.set(style="whitegrid")

    # Plot the first histogram
    plt.figure(figsize=(8, 4))
    plt.subplot(1, 2, 1)
    sns.histplot(data=df, x=column1, kde=True)
    plt.title(f'Distribution of {column1}')
    plt.xlabel(column1)
    plt.ylabel('Frequency')

    # Plot the second histogram if column2 is provided
    if column2:
        plt.subplot(1, 2, 2)
        sns.histplot(data=df, x=column2, kde=True)
        plt.title(f'Distribution of {column2}')
        plt.xlabel(column2)
        plt.ylabel('Frequency')

    # Show the plot
    plt.tight_layout()
    plt.show()

Use the plot_transformation function by passing in the name of our dataframe and the column name: 'systolic_bp'.

plot_transformation(df,'systolic_bp')

Power Transformer#

We are going to look at the PowerTransformer from sklearn.
The library for this has already been imported:

# import PowerTransformer from sklearn.preprocessing
from sklearn.preprocessing import PowerTransformer

We are going to instantiate the PowerTransformer and apply it to the systolic_bp column, saving the results to a new column. By default, PowerTransformer applies the Yeo-Johnson transform (which can handle zero and negative values) and standardises the output to zero mean and unit variance.

# Apply PowerTransformer to systolic_bp
power_transformer = PowerTransformer()
df['systolic_bp_transformed_power'] = power_transformer.fit_transform(df[['systolic_bp']].values)

Let’s look at some of the results.

# Display the transformed DataFrame
print(df[['systolic_bp', 'systolic_bp_transformed_power']].head())
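If you are curious about what was actually fitted, the transformer exposes the estimated Yeo-Johnson exponent through its lambdas_ attribute (one value per transformed column):

# The fitted Yeo-Johnson lambda for systolic_bp
print(power_transformer.lambdas_)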

And now plot the before and after using the plot_transformation function.

# Plot before and after transformation
plot_transformation(df,'systolic_bp','systolic_bp_transformed_power')

Quantile Transformer#

Another transformer that we will look at now is the Quantile Transformer.
The library for this has also already been imported:

# import QuantileTransformer from sklearn.preprocessing
from sklearn.preprocessing import QuantileTransformer

As before, we instantiate the transformer, QuantileTransformer, and apply it to the systolic_bp column, this time saving the results to another new column.

# Apply QuantileTransformer to systolic_bp
quantile_transformer = QuantileTransformer(output_distribution='normal')
df['systolic_bp_transformed'] = quantile_transformer.fit_transform(df[['systolic_bp']].values)

You will get a warning here:
UserWarning: n_quantiles (1000) is greater than the total number of samples (100). n_quantiles is set to n_samples.
Use this link to the sklearn documentation on QuantileTransformer to see if you can fix the above code.

Some of the results…

# Display the transformed DataFrame
print(df[['systolic_bp', 'systolic_bp_transformed']].head())

Plot the before and after…

# Plot before and after transformation
plot_transformation(df,'systolic_bp','systolic_bp_transformed')

Practical Task 2.5

With our dataset, use the diastolic_bp to:

  • Plot the distribution of ‘diastolic_bp’. You can use the function plot_transformation for this.

  • Apply one of the transform methods to a new column diastolic_bp_transformed.

  • Plot the transformed distribution.

# Your code here

Further reading on types of distribution transforms can be found here.

Chapter Summary#

Well done on reaching the end of this chapter!
You should now be familiar with the following when it comes to feature engineering, scaling, and transforming a dataset:

  • Creating new features in the dataset.

  • Encoding categorical variables of different types and knowing when to use them.

  • Removing multicollinearity.

  • Scaling the features and target variables.

  • Transforming the distribution of features and target variables.