Classification#

# You will need to install the ucimlrepo package using the commented line below

# pip install ucimlrepo
# Imports required for notebook

import numpy as np
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
from plotly import graph_objects as go

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn import metrics

from ucimlrepo import fetch_ucirepo

random_seed = 42

Useful resources for classification concepts:

  • Essential Math for Data Science: Thomas Nield (Chapter 6)

  • StatQuest Illustrated Guide to Machine Learning: Josh Starmer

  • ISLP: https://www.statlearning.com/ (Chapter 4)

Logistic regression#

Motivation for Logistic regression#

Suppose we are collecting some information from patients as they enter A&E. A health metric is recorded for each patient on arrival, and it is thought that this metric will provide a good indication of whether or not the patient will be admitted.

# Creating a fake dataset
data = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=42,
)
X = data[0][:, 1]  # Patient metric
y = data[1]  # Whether or not the patient is admitted
classification_data = pd.DataFrame(X, columns=["patient_metric"])
classification_data["patient_admitted"] = y

X = classification_data[["patient_metric"]].values
y = classification_data["patient_admitted"].values
classification_data.head()
# Plot the relationship in Plotly using a scatter chart

los_fig = px.scatter(classification_data, x="patient_metric", y="patient_admitted")
los_fig.update_layout(
    yaxis_title="Patient admitted (1) or not (0)",
    title="Evaluating the relationship between a metric and patient admissions",
)

We can see the probability of the patient being admitted increases as the metric increases: above around 1 it’s likely they will be admitted, and below -1 it’s likely they won’t. What if the metric is 0?

We could try to fit a linear regression model to this data:

lr_fig = px.scatter(classification_data, x="patient_metric", y="patient_admitted", trendline="ols")
lr_fig.update_layout(
    yaxis_title="Patient admitted (1) or not (0)",
    title="Using linear regression for binary outcomes",
)

There are some problems with this approach:

  1. It’s not possible to derive meaningful estimates for probabilities.

  2. Linear regression is heavily influenced by outliers.

# Add a single extreme outlier
outliers_df = pd.DataFrame({"patient_metric": [50], "patient_admitted": [1]})

lr_fig = px.scatter(pd.concat([classification_data, outliers_df]), x="patient_metric", y="patient_admitted", trendline="ols")
lr_fig.update_layout(
    yaxis_title="Patient admitted (1) or not (0)",
    title="Using linear regression for binary outcomes",
)

Ideally, we would have a function that takes our metric as input and returns a value between 0 and 1 representing the probability of being admitted (linear regression is not bounded between 0 and 1)… this is what logistic regression can do!

Fitting a logistic regression model using sklearn#

The logistic regression model (see ISLP, page 139, for more detail) is:

\(p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}\), where \(p(X)\) represents the probability of the outcome for a given input \(X\), \(\beta_0\) and \(\beta_1\) are the coefficients, and \(0 \leqslant p(X) \leqslant 1\).
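
Rearranging this formula gives the log-odds (logit) form: \(\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X\). In other words, the log-odds are a linear function of \(X\), a connection we will see again below.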

# Split our data into training & testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=random_seed, test_size=0.3
)
# Just as with linear regression, we save an instance of LogisticRegression()
logistic_regression = LogisticRegression()
# Fit our logistic function to training data
logistic_regression.fit(X_train, y_train)
# Coefficient term
beta_1 = logistic_regression.coef_[0][0]
print(beta_1)
# Intercept term
beta_0 = logistic_regression.intercept_[0]
print(beta_0)

We have a coefficient and an intercept, just like linear regression… coincidence? Perhaps not…

Logistic Function background theory#

# Use the linear function from the regression section
def linear_function(b_1, b_0, x_array):
    """
    Linear function.
    Inputs:
        b_1 (float): Coefficient (slope) of the linear function
        b_0 (float): Intercept (bias) term
        x_array (numpy.array): Input x values
    returns
        numpy.array. Outputs from a linear function.
    """

    return b_1 * x_array + b_0


# Create a python function for logistic regression
def logistic_function(linear_output):
    """
    Sigmoid function.
    Inputs:
        linear_output (numpy.array): Outputs from a linear function (the log-odds)
    returns
        numpy.array. Probabilities between 0 and 1.
    """

    return np.exp(linear_output) / (1 + np.exp(linear_output))

Let’s take the coefficient and intercept returned from the logistic_regression model above and plot the line this gives:

linear_output = linear_function(
    logistic_regression.coef_[0][0],
    logistic_regression.intercept_[0],
    np.linspace(-5, 4, 1000),
)
# Print the first 10 values
linear_output[:10]
log_fig = px.line(
    x=np.linspace(-5, 4, 1000),
    y=linear_output,
    title="Plotting line using logistic regression outputs",
)
log_fig.update_layout(yaxis_title="Log odds")
log_fig.show()

We won’t go into too much detail on this, but this line describes the relationship between the input (i.e., the patient health metric) and the log-odds.

To get back to probability from log-odds, we apply the sigmoid function to the linear output.
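
For log-odds \(z = \beta_0 + \beta_1X\), the sigmoid is \(\sigma(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}\), which maps any real number into the interval \((0, 1)\).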

transform_line_using_sigmoid = logistic_function(linear_output)
# Print first 10 transformed values
transform_line_using_sigmoid[:10]

Plot the transformed line

log_fig = px.line(
    x=np.linspace(-5, 4, 1000),
    y=linear_output,
    title="Plotting line using logistic regression outputs",
)

log_fig.add_trace(
    go.Scatter(
        x=np.linspace(-5, 4, 1000),
        y=transform_line_using_sigmoid,
        name="Transformed linear model",
    )
)
log_fig.update_yaxes(range=[-0.5, 1.5])
log_fig.update_layout(yaxis_title="Probability")

The sigmoid function has collapsed the linear regression function to fall within the bounds 0 to 1. Let’s overlay this sigmoid function on our original data.

logistic_model = px.scatter(
    classification_data, x="patient_metric", y="patient_admitted"
)
logistic_model.add_trace(
    go.Scatter(
        x=np.linspace(-5, 4, 1000),
        y=logistic_function(linear_output),
        name="Fitted Logistic model",
    )
)
logistic_model.update_layout(title="Modelling our data using a Sigmoid function")
Logistic regression is backed by a linear function!

If we need to make a prediction on whether or not a patient will be admitted, we input a value into the logistic function and observe whether the output is closer to 0 or 1 (typically, a threshold of 0.5 is used for binary classification). This probability can be extracted using the .predict_proba() method on the trained model.

# Probability of being 0 (first value) or 1 (second value) when a patient has a metric
# value of -1.2
logistic_regression.predict_proba([[-1.2]])

Note, the order of the probabilities corresponds to the order in the .classes_ attribute.

logistic_regression.classes_

You can also make a binary prediction using the .predict() method on the trained model.

logistic_regression.predict([[-1.2]])

We can assess model accuracy, as with linear regression, by using the .score() method on the trained model. Let’s observe the score for the test data.

logistic_regression.score(X_test, y_test)

Here, accuracy refers to the proportion of correct predictions.

# Where model prediction and observed values are the same
correct_predictions = np.sum(
    y_test == logistic_regression.predict(X_test)
)
print(correct_predictions)
# Length of test data
test_len = len(y_test)
print(test_len)
accuracy = correct_predictions / test_len
print(accuracy)
Logistic regression key points
  1. The logistic regression model allows us to model and make predictions when our output is categorical (specifically, a binary outcome).

  2. We have a method to represent the probability of a binary outcome given a set of input values. This could represent our level of confidence in a prediction, and can then be converted to a binary prediction.

  3. We can extend from a single input variable to include multiple input variables (in a similar manner to multiple linear regression); a brief sketch follows below.
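
As a minimal sketch of point 3 (the synthetic dataset here is an assumption, generated purely for illustration), fitting a logistic regression with several input variables works exactly as before:

# Generate a synthetic dataset with four input features
X_multi, y_multi = make_classification(
    n_samples=1000,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    n_classes=2,
    random_state=random_seed,
)
# Fit as before; sklearn handles multiple features automatically
multi_model = LogisticRegression()
multi_model.fit(X_multi, y_multi)
# One coefficient per input feature, plus a single intercept
print(multi_model.coef_)
print(multi_model.intercept_)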

Logistic regression warnings:
  1. Although logistic regression does not require all of the assumptions of linear regression, we still need to be careful about multicollinearity and about independence of the error terms.

Task 1 (15-20 mins)

In this task, we will use a dataset where the target variable indicates whether or not a patient has heart disease. There are 13 independent variables; however, let’s use just 'max-heart-rate' as our single feature.

Target variable

  • 1 = Absence of heart disease

  • 2 = Presence of heart disease

# fetch dataset
statlog_heart = fetch_ucirepo(id=145)

# Extract data
features = statlog_heart.data.features
target = statlog_heart.data.targets
# Store in dataframe
task_1_data = pd.DataFrame(features)[["max-heart-rate"]]
task_1_data["heart-disease"] = target
# Separate into features and target
X = task_1_data[["max-heart-rate"]].values
y = task_1_data["heart-disease"].values
  1. Visualise the relationship between max heart rate and whether or not someone has heart disease using Plotly Express scatter (or another visualisation library of your choice).

# Answer here
  2. Separate the data into training and testing sets (use 30% of the data for testing and make sure to shuffle the data). Ensure random_state is set to 42 (so everyone gets the same answer).

# Answer here
  3. Fit a LogisticRegression() model to the training data.

# Answer here
  4. Compute the model accuracy on the unseen testing data using the score() method.

# Answer here
  5. Predict the probability of someone having heart disease with a 'max-heart-rate' of 178.

# Answer here

Evaluating classification outcomes#

Based on the first task, the four possible outcomes from the classification prediction are:

  • Predicts heart disease, and patient actually has heart disease (true positives).

  • Predicts no heart disease, and patient does *not* have heart disease (true negatives).

  • Predicts heart disease, and patient does *not* have heart disease (false positives).

  • Predicts no heart disease, and patient actually has heart disease (false negatives).

This information can be summarised using a confusion matrix.

Confusion matrix#

Let’s make a new fake dataset, looking at whether or not a patient is admitted based on some calculated metric.

data = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=42,
)
X = data[0][:, 1]  # Patient metric
y = data[1]  # Whether or not the patient is admitted

# save fake data
classification_data = pd.DataFrame(X, columns=["Metric"])
classification_data["patient_admitted"] = y

X = classification_data[["Metric"]]
y = classification_data['patient_admitted']

# train logistic regression model
logistic_regression = LogisticRegression()
logistic_regression.fit(X, y)

Visualise data and fitted logistic model.

log_fig = px.scatter(
    classification_data,
    x="Metric",
    y="patient_admitted",
    title="Trained logistic model",
)

# Generate many predictions to plot the logistic function
proba = logistic_regression.predict_proba(np.linspace(-5, 4, 100).reshape(-1, 1))[:, 1]

# Add trace of logistic function using predictions
log_fig.add_trace(
    go.Scatter(
        x=np.linspace(-5, 4, 100), y=proba, mode="lines", name="Logistic function"
    )
)
log_fig.show()

Using metrics.confusion_matrix from sklearn gives us built-in functionality to calculate the confusion matrix.

confusion_matrix = metrics.confusion_matrix(
    y_true=y, y_pred=logistic_regression.predict(X)
)
print(confusion_matrix)

Note, this is the order of the outputs when the matrix is flattened with .ravel():

tn, fp, fn, tp = metrics.confusion_matrix(y, logistic_regression.predict(X)).ravel()
print(f'True negatives = {tn}')
print(f'False positives = {fp}')
print(f'False negatives = {fn}')
print(f'True positives = {tp}')

Use Plotly Express imshow to display the confusion matrix as a heat map.

fig = px.imshow(
    confusion_matrix,
    text_auto=True,
    labels=dict(x="Predicted outcome", y="Actual outcome"),
    x=["0", "1"],
    y=["0", "1"],
)

fig.update_layout(
    xaxis_title="Predicted outcome",
    yaxis_title="Actual outcome",
    title="Confusion matrix",
)

fig.show()

Depending on the scenario, is this number of missed positives (i.e., false negatives) acceptable?

Would you rather wrongly predict a patient that should be admitted, or wrongly predict a patient that shouldn’t be admitted?

Receiver Operating Characteristic (ROC) curves#

In logistic regression, we have values we can use to represent probabilities. By default (using .predict()), logistic regression will use a 0.5 threshold, i.e., values below 0.5 will go to class 0 (not admitted) and values at or above will go to class 1 (admitted).

Is there an easy way to compare the numbers of true positives and false positives when we change this 0.5 threshold?
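
Before turning to ROC curves, here is a minimal sketch of applying a hand-picked threshold to the fitted model (the 0.3 used here is an arbitrary illustrative value, not a recommendation):

# Class-1 probabilities from the fitted model
admit_proba = logistic_regression.predict_proba(X)[:, 1]
# Apply an arbitrary custom threshold of 0.3 instead of the default 0.5
custom_predictions = (admit_proba >= 0.3).astype(int)
# Confusion matrix at this custom threshold
print(metrics.confusion_matrix(y_true=y, y_pred=custom_predictions))

Repeating this for every possible threshold by hand would be tedious; the ROC curve performs exactly this comparison for us.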

# Get probabilities of assigning to class
y_score = logistic_regression.predict_proba(X)[:, 1]

# Print first 10 prediction probabilities
print(y_score[:10])
# ROC curve points
false_positive_rate, true_positive_rate, threshold = metrics.roc_curve(
    y_true=y, y_score=y_score, pos_label=1
)
roc_fig = go.Figure()
roc_fig.add_trace(
    go.Scatter(
        x=false_positive_rate,
        y=true_positive_rate,
        mode="lines",
        text=["Threshold: " + str(round(x, 3)) for x in threshold],
        hoverinfo="text",
        line=dict(width=3),
        name="ROC curve",
    )
)
roc_fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        name="Random guess",
        line_shape="linear",
        mode="lines",
        line=dict(color="green", width=4, dash="dash"),
    )
)
roc_fig.update_layout(
    xaxis_title="False positive rate",
    yaxis_title="True positive rate",
    title="Reciever Operator Curve (ROC)",
)

roc_fig.add_trace(
    go.Scatter(
        x=[0],
        y=[1],
        mode="markers+text",
        name="Perfect classifier",
        text=["The perfect classifier"],
        textposition="bottom right",
    )
)


roc_fig.show()

Note, False positive rate = \(\frac{FP}{FP + TN}\) & True positive rate = \(\frac{TP}{TP + FN}\)
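
As a quick sanity check of these formulas, a minimal sketch computing both rates at the default 0.5 threshold from the tn, fp, fn, tp counts extracted earlier:

# Rates at the default 0.5 threshold, from the confusion-matrix counts above
print(f"False positive rate = {fp / (fp + tn):.3f}")
print(f"True positive rate = {tp / (tp + fn):.3f}")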

Class imbalance#

Suppose we have been asked to create a model to predict whether someone has a rare medical condition. We use diagnostic imaging measurements as features to predict whether or not the condition is present.

X, y = make_classification(
    # the usual parameters
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_classes=2,
    random_state=42,
    # Assign label 0 to ~99% of observations and label 1 to the remaining ~1%
    weights=[0.99],
)
classification_data = pd.DataFrame(X)
classification_data["patient_admitted"] = y
# Train logistic regression model
logistic_regression = LogisticRegression()
logistic_regression.fit(X, y)
logistic_regression.score(X, y)

Wow, this looks like a pretty good model!… or does it?

confusion_matrix = metrics.confusion_matrix(y, logistic_regression.predict(X))
confusion_matrix
fig = px.imshow(
    confusion_matrix,
    text_auto=True,
    labels=dict(x="Predicted outcome", y="Actual outcome"),
    x=["0", "1"],
    y=["0", "1"],
)

fig.update_layout(
    xaxis_title="Predicted outcome",
    yaxis_title="Actual outcome",
    title="Confusion matrix of predictions",
)

fig.show()

What is the problem with using this accuracy metric for this data? With ~99% of patients in the majority class, a model that always predicts the majority class would achieve ~99% accuracy while never detecting a single positive case.

To make a good model, we need to address the imbalance in the data, typically through resampling or re-weighting; a brief sketch of the latter follows below.
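
As one illustration (not the only remedy), scikit-learn's LogisticRegression accepts a class_weight parameter; a minimal sketch re-fitting with class_weight="balanced", which weights each class inversely to its frequency:

# Re-fit, weighting each class inversely to its frequency in the data
balanced_model = LogisticRegression(class_weight="balanced")
balanced_model.fit(X, y)
# The balanced model trades some overall accuracy for detecting more positives
print(metrics.confusion_matrix(y, balanced_model.predict(X)))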

Task 2 (10-15 mins)

  1. Using the logistic model built in Task 1, derive and plot the confusion matrix for the test data. Make use of metrics.confusion_matrix for this.

# Answer here
  2. How many negative predictions were incorrect (false negatives)? How many missed positives do you think are acceptable?

# Answer here

Using machine learning models in practice#

  • It’s important to ensure the training data is free from bias (i.e., it is representative of the population you are studying).

  • The code should be fully transparent (includes documentation and is freely available).

  • Models aren’t perfect, and therefore should be used in conjunction with expert opinion (human & AI collaboration).

  • Some simple models have been introduced here, but more advanced models (e.g., ensemble techniques) are capable of improved predictions; these can be discussed during full training days.