Classification#

# You will need to install the ucimlrepo package using the commented line below

# pip install ucimlrepo
# Imports required for notebook

import numpy as np
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
from plotly import graph_objects as go

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn import metrics

from ucimlrepo import fetch_ucirepo

random_seed = 42

Useful resources for classification concepts:

  • Essential Math for Data Science: Thomas Nield (Chapter 6)

  • StatQuest Illustrated Guide to Machine Learning: Josh Starmer

  • ISLP: https://www.statlearning.com/ (Chapter 4)

Logistic regression#

Motivation for Logistic regression#

Suppose we are collecting some information from patients as they enter A&E. A health metric is recorded for each patient on arrival, and it is thought that this metric will provide a good indication of whether or not the patient will be admitted.

# Creating a fake dataset
data = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=42,
)
X = data[0][:, 1]  # Patient metric
y = data[1]  # Whether or not the patient is admitted
classification_data = pd.DataFrame(X, columns=["patient_metric"])
classification_data["patient_admitted"] = y

X = classification_data[["patient_metric"]].values
y = classification_data["patient_admitted"].values
classification_data.head()
# Plot the relationship in Plotly using a scatter chart

los_fig = px.scatter(classification_data, x="patient_metric", y="patient_admitted")
los_fig.update_layout(
    yaxis_title="Patient admitted (1) or not (0)",
    title="Evaluating the relationship between a metric and patient admissions",
)

We can see the probability of the patient being admitted increases as the metric increases: above around 1 it’s likely they will be admitted, and below -1 it’s likely they won’t. What if the metric is 0?

We could try to fit a linear regression model to this data:

lr_fig = px.scatter(classification_data, x="patient_metric", y="patient_admitted", trendline="ols")
lr_fig.update_layout(
    yaxis_title="Patient admitted (1) or not (0)",
    title="Using linear regression for binary outcomes",
)

There are some problems with this approach:

  1. It’s not possible to derive meaningful estimates for probabilities.

  2. Linear regression is heavily influenced by outliers.

# Add a single extreme outlier
outliers_df = pd.DataFrame({"patient_metric": [50], "patient_admitted": [1]})

lr_fig = px.scatter(pd.concat([classification_data, outliers_df]), x="patient_metric", y="patient_admitted", trendline="ols")
lr_fig.update_layout(
    yaxis_title="Patient admitted (1) or not (0)",
    title="Using linear regression for binary outcomes",
)

Ideally, we would have a function that takes our metric as input and returns a value between 0 and 1 representing the probability of being admitted (linear regression is not bounded between 0 and 1)… this is what logistic regression can do!

Fitting a logistic regression model using sklearn#

The logistic regression model (see ISLP, page 139, for more detail) is:

\(p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}\), where \(p(X)\) represents the probability of the outcome for a given input \(X\), \(\beta_0\) and \(\beta_1\) are the coefficients, and \(0 \leqslant p(X) \leqslant 1\).
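
Rearranging this formula gives the log-odds (logit) form: \(\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X\). In other words, the log-odds are a linear function of \(X\), a connection we will see again below.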

# Split our data into training & testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=random_seed, test_size=0.3
)
# Just as with linear regression, we save an instance of LogisticRegression()
logistic_regression = LogisticRegression()
# Fit our logistic function to training data
logistic_regression.fit(X_train, y_train)
# Coefficient term
beta_1 = logistic_regression.coef_[0][0]
print(beta_1)
# Intercept term
beta_0 = logistic_regression.intercept_[0]
print(beta_0)

We have a coefficient and an intercept, just like linear regression… coincidence? Perhaps not…

Logistic Function background theory#

# Use the linear function from the regression section
def linear_function(b_1, b_0, x_array):
    """
    Linear function.
    Inputs:
        b_1 (float): Coefficient (slope) of the linear function
        b_0 (float): Intercept (bias) term
        x_array (numpy.array): Input x values
    returns
        numpy.array. Outputs from a linear function.
    """

    return b_1 * x_array + b_0


# Create a python function for logistic regression
def logistic_function(linear_output):
    """
    Sigmoid function.
    Inputs:
        linear_output (numpy.array): Outputs from a linear function (the log-odds)
    returns
        numpy.array. Probabilities between 0 and 1.
    """

    return np.exp(linear_output) / (1 + np.exp(linear_output))

Let’s take the coefficient and intercept returned from the logistic_regression model above and plot the line this gives:

linear_output = linear_function(
    logistic_regression.coef_[0][0],
    logistic_regression.intercept_[0],
    np.linspace(-5, 4, 1000),
)
# Print the first 10 values
linear_output[:10]
log_fig = px.line(
    x=np.linspace(-5, 4, 1000),
    y=linear_output,
    title="Plotting line using logistic regression outputs",
)
log_fig.update_layout(yaxis_title="Log odds")
log_fig.show()

We won’t go into too much detail on this, but this line describes the relationship between the input (i.e., the patient health metric) and the log-odds.

To get back to probability from log-odds, we apply the sigmoid function to the linear output.
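
For log-odds \(z = \beta_0 + \beta_1X\), the sigmoid is \(\sigma(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}\), which maps any real number into the interval \((0, 1)\).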

transform_line_using_sigmoid = logistic_function(linear_output)
# Print first 10 transformed values
transform_line_using_sigmoid[:10]

Plot the transformed line

log_fig = px.line(
    x=np.linspace(-5, 4, 1000),
    y=linear_output,
    title="Plotting line using logistic regression outputs",
)

log_fig.add_trace(
    go.Scatter(
        x=np.linspace(-5, 4, 1000),
        y=transform_line_using_sigmoid,
        name="Transformed linear model",
    )
)
log_fig.update_yaxes(range=[-0.5, 1.5])
log_fig.update_layout(yaxis_title="Probability")

The sigmoid function has collapsed the linear regression function to fall within the bounds 0 to 1. Let’s overlay this sigmoid function on our original data.

logistic_model = px.scatter(
    classification_data, x="patient_metric", y="patient_admitted"
)
logistic_model.add_trace(
    go.Scatter(
        x=np.linspace(-5, 4, 1000),
        y=logistic_function(linear_output),
        name="Fitted Logistic model",
    )
)
logistic_model.update_layout(title="Modelling our data using a Sigmoid function")
Logistic regression is backed by a linear function!

If we need to make a prediction on whether or not a patient will be admitted, we input a value into the logistic function and observe whether the output is closer to 0 or 1 (typically, a threshold of 0.5 is used for binary classification). This probability can be extracted using the .predict_proba() method on the trained model.

# Probability of being 0 (first value) or 1 (second value) when a patient has a metric
# value of -1.2
logistic_regression.predict_proba([[-1.2]])

Note, the order of the probabilities corresponds to the order in the .classes_ attribute.

logistic_regression.classes_

You can also make a binary prediction using the .predict() method on the trained model.

logistic_regression.predict([[-1.2]])

We can assess model accuracy, as with linear regression, by using the .score() method on the trained model. Let’s observe the score for the test data.

logistic_regression.score(X_test, y_test)

Here, accuracy refers to the proportion of correct predictions.

# Where model prediction and observed values are the same
correct_predictions = np.sum(
    y_test == logistic_regression.predict(X_test)
)
print(correct_predictions)
# Length of test data
test_len = len(y_test)
print(test_len)
accuracy = correct_predictions / test_len
print(accuracy)
Logistic regression key points
  1. The logistic regression model allows us to model and make predictions when our output is categorical (specifically, a binary outcome).

  2. We have a method to represent the probability of a binary outcome given a set of input values. This could represent our level of confidence in a prediction, and can then be converted to a binary prediction.

  3. We can extend from a single input variable to include multiple input variables (in a similar manner to multiple linear regression); a brief sketch follows below.
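
As a minimal sketch of point 3 (the synthetic dataset here is an assumption, generated purely for illustration), fitting a logistic regression with several input variables works exactly as before:

# Generate a synthetic dataset with four input features
X_multi, y_multi = make_classification(
    n_samples=1000,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    n_classes=2,
    random_state=random_seed,
)
# Fit as before; sklearn handles multiple features automatically
multi_model = LogisticRegression()
multi_model.fit(X_multi, y_multi)
# One coefficient per input feature, plus a single intercept
print(multi_model.coef_)
print(multi_model.intercept_)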

Logistic regression warnings:
  1. Although logistic regression does not require all of the assumptions of linear regression, we still need to be careful about multicollinearity and about independence of the error terms.

Task 1 (15-20 mins)

In this task, we will use a dataset where the target variable indicates whether or not a patient has heart disease. There are 13 independent variables; however, let’s use just 'max-heart-rate' as our single feature.

Target variable

  • 1 = Absence of heart disease

  • 2 = Presence of heart disease

# fetch dataset
statlog_heart = fetch_ucirepo(id=145)

# Extract data
features = statlog_heart.data.features
target = statlog_heart.data.targets
# Store in dataframe
task_1_data = pd.DataFrame(features)[["max-heart-rate"]]
task_1_data["heart-disease"] = target
# Separate into features and target
X = task_1_data[["max-heart-rate"]].values
y = task_1_data["heart-disease"].values
  1. Visualise the relationship between max heart rate and whether or not someone has heart disease using Plotly Express scatter (or another visualisation library of your choice).

# Answer here
  2. Separate the data into training and testing sets (use 30% of the data for testing and make sure to shuffle the data). Ensure random_state is set to 42 (so everyone gets the same answer).

# Answer here
  3. Fit a LogisticRegression() model to the training data.

# Answer here
  4. Compute the model accuracy on the unseen testing data using the score() method.

# Answer here
  5. Predict the probability of someone having heart disease with a 'max-heart-rate' of 178.

# Answer here

Evaluating classification outcomes#

Based on the first task, the four possible outcomes from the classification prediction are:

  • Predicts heart disease, and patient actually has heart disease (true positives).

  • Predicts no heart disease, and patient does *not* have heart disease (true negatives).

  • Predicts heart disease, and patient does *not* have heart disease (false positives).

  • Predicts no heart disease, and patient actually has heart disease (false negatives).

This information can be summarised using a confusion matrix.

Confusion matrix#

Let’s make a new fake dataset, looking at whether or not a patient is admitted based on some calculated metric.

data = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=42,
)
X = data[0][:, 1]  # Patient metric
y = data[1]  # Whether or not the patient is admitted

# save fake data
classification_data = pd.DataFrame(X, columns=["Metric"])
classification_data["patient_admitted"] = y

X = classification_data[["Metric"]]
y = classification_data['patient_admitted']

# train logistic regression model
logistic_regression = LogisticRegression()
logistic_regression.fit(X, y)

Visualise data and fitted logistic model.

log_fig = px.scatter(
    classification_data,
    x="Metric",
    y="patient_admitted",
    title="Trained logistic model",
)

# Generate many predictions to plot the logistic function
proba = logistic_regression.predict_proba(np.linspace(-5, 4, 100).reshape(-1, 1))[:, 1]

# Add trace of logistic function using predictions
log_fig.add_trace(
    go.Scatter(
        x=np.linspace(-5, 4, 100), y=proba, mode="lines", name="Logistic function"
    )
)
log_fig.show()

Using metrics.confusion_matrix from sklearn gives us built-in functionality to calculate the confusion matrix.

confusion_matrix = metrics.confusion_matrix(
    y_true=y, y_pred=logistic_regression.predict(X)
)
print(confusion_matrix)

Note, this is the order of the outputs when the matrix is flattened with .ravel():

tn, fp, fn, tp = metrics.confusion_matrix(y, logistic_regression.predict(X)).ravel()
print(f'True negatives = {tn}')
print(f'False positives = {fp}')
print(f'False negatives = {fn}')
print(f'True positives = {tp}')

Use Plotly Express imshow to display the confusion matrix as a heat map.

fig = px.imshow(
    confusion_matrix,
    text_auto=True,
    labels=dict(x="Predicted outcome", y="Actual outcome"),
    x=["0", "1"],
    y=["0", "1"],
)

fig.update_layout(
    xaxis_title="Predicted outcome",
    yaxis_title="Actual outcome",
    title="Confusion matrix",
)

fig.show()

Depending on the scenario, is this number of missed positives (i.e., false negatives) acceptable?

Would you rather wrongly predict a patient that should be admitted, or wrongly predict a patient that shouldn’t be admitted?

Receiver Operating Characteristic (ROC) curves#

In logistic regression, we have values we can use to represent probabilities. By default (using .predict()), logistic regression will use a 0.5 threshold, i.e., values below 0.5 will go to class 0 (not admitted) and values at or above will go to class 1 (admitted).

Is there an easy way to compare the numbers of true positives and false positives when we change this 0.5 threshold?
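
Before turning to ROC curves, here is a minimal sketch of applying a hand-picked threshold to the fitted model (the 0.3 used here is an arbitrary illustrative value, not a recommendation):

# Class-1 probabilities from the fitted model
admit_proba = logistic_regression.predict_proba(X)[:, 1]
# Apply an arbitrary custom threshold of 0.3 instead of the default 0.5
custom_predictions = (admit_proba >= 0.3).astype(int)
# Confusion matrix at this custom threshold
print(metrics.confusion_matrix(y_true=y, y_pred=custom_predictions))

Repeating this for every possible threshold by hand would be tedious; the ROC curve performs exactly this comparison for us.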

# Get probabilities of assigning to class
y_score = logistic_regression.predict_proba(X)[:, 1]

# Print first 10 prediction probabilities
print(y_score[:10])
# ROC curve points
false_positive_rate, true_positive_rate, threshold = metrics.roc_curve(
    y_true=y, y_score=y_score, pos_label=1
)
roc_fig = go.Figure()
roc_fig.add_trace(
    go.Scatter(
        x=false_positive_rate,
        y=true_positive_rate,
        mode="lines",
        text=["Threshold: " + str(round(x, 3)) for x in threshold],
        hoverinfo="text",
        line=dict(width=3),
        name="ROC curve",
    )
)
roc_fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        name="Random guess",
        line_shape="linear",
        mode="lines",
        line=dict(color="green", width=4, dash="dash"),
    )
)
roc_fig.update_layout(
    xaxis_title="False positive rate",
    yaxis_title="True positive rate",
    title="Reciever Operator Curve (ROC)",
)

roc_fig.add_trace(
    go.Scatter(
        x=[0],
        y=[1],
        mode="markers+text",
        name="Perfect classifier",
        text=["The perfect classifier"],
        textposition="bottom right",
    )
)


roc_fig.show()

Note, False positive rate = \(\frac{FP}{FP + TN}\) & True positive rate = \(\frac{TP}{TP + FN}\)
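
As a quick sanity check of these formulas, a minimal sketch computing both rates at the default 0.5 threshold from the tn, fp, fn, tp counts extracted earlier:

# Rates at the default 0.5 threshold, from the confusion-matrix counts above
print(f"False positive rate = {fp / (fp + tn):.3f}")
print(f"True positive rate = {tp / (tp + fn):.3f}")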

Class imbalance#

Suppose we have been asked to create a model to predict whether someone has a rare medical condition. We use diagnostic imaging measurements as features to predict whether or not the condition is present.

X, y = make_classification(
    # the usual parameters
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_classes=2,
    random_state=42,
    # Assign label 0 to ~99% of observations and label 1 to the remaining ~1%
    weights=[0.99],
)
classification_data = pd.DataFrame(X)
classification_data["patient_admitted"] = y
# Train logistic regression model
logistic_regression = LogisticRegression()
logistic_regression.fit(X, y)
logistic_regression.score(X, y)

Wow, this looks like a pretty good model!… or does it?

confusion_matrix = metrics.confusion_matrix(y, logistic_regression.predict(X))
confusion_matrix
fig = px.imshow(
    confusion_matrix,
    text_auto=True,
    labels=dict(x="Predicted outcome", y="Actual outcome"),
    x=["0", "1"],
    y=["0", "1"],
)

fig.update_layout(
    xaxis_title="Predicted outcome",
    yaxis_title="Actual outcome",
    title="Confusion matrix of predictions",
)

fig.show()

What is the problem with using this accuracy metric for this data? With ~99% of patients in the majority class, a model that always predicts the majority class would achieve ~99% accuracy while never detecting a single positive case.

To make a good model, we need to address the imbalance in the data, typically through resampling or re-weighting; a brief sketch of the latter follows below.
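
As one illustration (not the only remedy), scikit-learn's LogisticRegression accepts a class_weight parameter; a minimal sketch re-fitting with class_weight="balanced", which weights each class inversely to its frequency:

# Re-fit, weighting each class inversely to its frequency in the data
balanced_model = LogisticRegression(class_weight="balanced")
balanced_model.fit(X, y)
# The balanced model trades some overall accuracy for detecting more positives
print(metrics.confusion_matrix(y, balanced_model.predict(X)))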

Task 2 (10-15 mins)

  1. Using the logistic model built in Task 1, derive and plot the confusion matrix for the test data. Make use of metrics.confusion_matrix for this.

# Answer here
  2. How many negative predictions were incorrect (false negatives)? How many missed positives do you think are acceptable?

# Answer here

Using machine learning models in practice#

  • It’s important to ensure the training data is free from bias (i.e., it is representative of the population you are studying).

  • The code should be fully transparent (includes documentation and is freely available).

  • Models aren’t perfect, and therefore should be used in conjunction with expert opinion (human & AI collaboration).

  • Some simple models have been introduced here, but more advanced models (e.g., ensemble techniques) are capable of improved predictions; these can be discussed during full training days.