Regression#
# Imports required for notebook
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
from plotly import graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn import metrics
random_seed = 42
Useful resources for regression concepts:
Essential Math for Data Science: Thomas Nield (Chapter 5)
The StatQuest Illustrated Guide to Machine Learning: Josh Starmer
ISLP: https://www.statlearning.com/ (Chapter 3)
# Custom functions required for notebook
def linear_function(b_0, b_1, x_array):
    """
    Inputs:
        b_0 (float): Coefficient of linear function
        b_1 (float): Intercept (bias) term
        x_array (numpy.array): Input x values
    Returns:
        numpy.array: Outputs from a linear function.
    """
    return b_0 * x_array + b_1
Simple linear regression#
Motivation for linear regression#
Suppose you want to model and predict patient length of stay (LoS). You believe age has the biggest impact on how long a patient will stay in hospital, so you collect some initial data on 50 patients, recording their age and LoS.
# Ensure reproducible results
np.random.seed(random_seed)
# Fabricated dataset
patient_recorded_age = np.random.normal(70, 10, 50)
patient_recorded_los = linear_function(2, 0, patient_recorded_age) + np.random.normal(
15, 5, 50
)
los_data = pd.DataFrame()
los_data["age"] = patient_recorded_age
los_data["los"] = patient_recorded_los
# Show the first 5 rows
los_data.head()
# Plot the relationship in Plotly using a scatter chart
los_fig = px.scatter(los_data, x="age", y="los")
los_fig.update_layout(
yaxis_title="Length of stay (hours)",
title="Evaluating the relationship between age and length of stay",
)
We can see that as age increases or decreases, LoS increases or decreases by a roughly proportional amount, indicating a linear relationship:
\(\hat{y} = \beta_0 + \beta_1x\), where \(\beta_0\) is the intercept and \(\beta_1\) is the coefficient.
Note, \(\hat{y}\) is the notation for our prediction, i.e., the predicted LoS.
Let's fit a simple linear regression model to the data (a.k.a. the line of best fit) and plot it.
# Plotting the relationship, along with a regression line using the trendline parameter in px.scatter()
los_fig = px.scatter(los_data, x="age", y="los", trendline="ols")
los_fig.update_layout(
yaxis_title="Length of stay (hours)",
title="Evaluating the relationship between age and length of stay",
)
Note, we fit a line to minimise the error between our prediction and our observations:
\(\min\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \min\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1x_i)^2\)
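To make this concrete, here is a minimal sketch using the fabricated los_data and the linear_function helper defined above; the candidate slope and intercept values are arbitrary choices for illustration, showing that a line close to the data-generating one has a much smaller sum of squared errors than an alternative line.
# A minimal sketch of the least squares objective, reusing los_data and
# linear_function from above (candidate values are arbitrary illustrations)
def sum_squared_errors(slope, intercept, x, y):
    """Sum of squared errors for the line y_hat = slope * x + intercept."""
    return np.sum((y - linear_function(slope, intercept, x)) ** 2)

# A line close to the one used to generate the data...
print(sum_squared_errors(2.0, 15.0, los_data["age"], los_data["los"]))
# ...has a much smaller error than an arbitrary alternative line
print(sum_squared_errors(1.0, 50.0, los_data["age"], los_data["los"]))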
This line does a good job at modelling the relationship! Let’s repeat this process using sklearn.
Fitting a linear regression model using sklearn#
Firstly, we need to import the LinearRegression class from sklearn (this has already been done in the first cell):
from sklearn.linear_model import LinearRegression
# Saving an instance of LinearRegression() to the variable linear_model
linear_model = LinearRegression()
# Separating independent (i.e., age) & dependent (i.e., LoS) variables
X, y = los_data[["age"]].values, los_data["los"].values
# Fitting model to los_data data (i.e., finding the line of best fit)
linear_model.fit(X, y)
# Coefficient term
beta_1 = linear_model.coef_[0]
print(beta_1)
# Intercept term
beta_0 = linear_model.intercept_
print(beta_0)
# Feel free to ignore this (showing how OLS estimates beta_0 and beta_1)
def ols_estimate(x, y):
    '''
    Function that estimates parameters of simple linear regression using OLS.
    Check out ISLP page 71 for more info.
    Inputs:
        x (numpy.array): Independent variable observed data.
        y (numpy.array): Dependent variable observed data.
    Outputs:
        tuple: (beta_0, beta_1)
    '''
    # Calculate means
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    # beta_0 and beta_1 formulae
    beta_1_est = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    beta_0_est = y_mean - beta_1_est * x_mean
    return beta_0_est, beta_1_est
# We see the values returned are identical to sklearn (note, X is a 2D array, so we flatten it first)
print(ols_estimate(X.flatten(), y))
We now have a mathematical model for the relationship between LoS and age! Now what?
# Making predictions for a new patient (i.e., expected LoS for someone aged 75)
linear_model.predict([[75]])
# We could equally make this prediction manually, using the fitted parameters and our prediction equation.
prediction = beta_0 + beta_1 * 75
print(prediction)
Assessing model accuracy#
We’ve fit the optimal line, but does a line really capture our process? There are plenty of metrics to evaluate the performance of the regression model. Look here for more information, or run the cell below.
# Look at some of the available metrics in sklearn
# dir(metrics._regression)
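For example, mean absolute error (MAE) and root mean squared error (RMSE) measure the average prediction error in the same units as the target (hours here). Below is a minimal sketch using the linear_model fitted above:
# A minimal sketch of two common error metrics, evaluated on the same
# data that was used to fit linear_model above
predictions = linear_model.predict(X)
print("MAE (hours):", metrics.mean_absolute_error(y, predictions))
print("RMSE (hours):", np.sqrt(metrics.mean_squared_error(y, predictions)))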
\(R^{2}\) is a metric that quantifies the proportion of variation in the dependent variable that is predictable from the independent variable(s). It takes a value between 0 and 1 (often quoted as a percentage from 0-100%). This metric can be obtained by using the .score() method on the trained model.
linear_model.score(X, y)
Note, we can easily calculate \(R^{2}\) from scratch (although we wouldn’t in practice):
mean_sum_squares = np.sum((y - np.mean(y)) ** 2) # How "good" a mean line is
model_sum_squares = np.sum(
(y - linear_model.predict(X)) ** 2
) # How "good" our model is
# R^2 calculation
r_2 = (mean_sum_squares - model_sum_squares) / mean_sum_squares
# Note, if model_sum_squares=0, then r_2 = 1
print(r_2)
\(R^{2}\) returns a value between 0 and 1. A value of 0 would mean our model performs no better than a mean line, and 1 means we perfectly describe all variation in the data.
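Note, sklearn also exposes \(R^{2}\) directly via metrics.r2_score, which should agree with the .score() value above:
# Should match linear_model.score(X, y) and the from-scratch value above
print(metrics.r2_score(y, linear_model.predict(X)))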
Now we have fit a line, we can predict a new patient's LoS based on their age. For example, someone who is 75 will have a LoS of approximately 165 hours.
We can quantify the relationship. For example, every 10 year age increase (decrease) is associated with an increase (decrease) in LoS of 20 hours.
We shouldn’t extrapolate for predictions outside of our available age range (i.e., ages <50 or ages >95).
This is only a sample of 50 patients; the sample could be biased and not reflect the population relationship.
Clearly, not all data points fall exactly on the line, i.e., any prediction is associated with a degree of uncertainty (a rough sketch of this follows below).
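As a rough sketch of this uncertainty (an approximation that only considers the scatter of the residuals, not the uncertainty in the fitted parameters), roughly 95% of observations fall within about two residual standard deviations of the line:
# Rough +/- 2 standard deviation interval around the prediction for age 75
# (an approximation: ignores uncertainty in beta_0 and beta_1 themselves)
residual_std = np.std(y - linear_model.predict(X))
point_prediction = linear_model.predict([[75]])[0]
print(point_prediction - 2 * residual_std, point_prediction + 2 * residual_std)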
Task 1 (15-20 minutes)
Data has been collected on A&E waiting times (in minutes) and bed occupancy (fake data). The department wishes to develop a model to predict patient waiting time based on bed occupancy.
Use linear regression in sklearn to model the relationship between the target variable (wait_times_mins) and the independent variable (beds_occ).
# Task data
np.random.seed(random_seed)
task_1_data = pd.DataFrame()
task_1_data["wait_times_mins"] = np.random.normal(3 * 60, 60, 50)
task_1_data["beds_occ"] = (
linear_function(
np.random.randn() + 1, np.random.randn(), task_1_data["wait_times_mins"]
)
+ np.random.normal(15, 15, 50)
).astype(int)
task_1_data.head()
# Splitting into features and target
X, y = task_1_data[["beds_occ"]].values, task_1_data["wait_times_mins"].values
Use the above A&E data to answer these questions:
Plot the relationship using a scatter plot. Do you agree linear regression is appropriate here? (Remember, what we are trying to predict usually goes along the y-axis).
# Answer here
Fit a simple linear regression model (using all the data) using LinearRegression().
# Answer here
Assess the accuracy of the model using .score() on the trained model (the \(R^2\) value). Use the same data that was used to train the model to assess accuracy.
# Answer here
A medical staff member finds there are currently 115 beds occupied. Using the trained model, can you provide an estimated waiting time?
# Answer here
Assumptions of linear regression#
Assumption 1#
There is a linear relationship between the independent variable(s) and the dependent variable.
Suppose another hospital has looked into variables that impact patient length of stay. They have collected new data on patient age and three metrics they believe impact LoS.
# We'll use make_regression to create a fake dataset.
X, y, coef = make_regression(
n_samples=100,
n_features=4,
n_informative=2,
n_targets=1,
bias=1.5,
effective_rank=None,
tail_strength=0.5,
noise=15,
shuffle=True,
coef=True,
random_state=random_seed,
)
# Store data in pandas.DataFrame
regression_data = pd.DataFrame(X)
regression_data["los"] = y
regression_data.columns = [
"patient_health_metric_a",
"patient_health_metric_b",
"patient_health_metric_c",
"age",
"los",
]
regression_data["age"] += 75
regression_data["los"] += 300
regression_data.head()
# Visualise the relationships between each feature and target, using a scatter plot
scatter_figs = make_subplots(
rows=2,
cols=2,
subplot_titles=("Health metic A", "Health metic B", "Health metic C", "Age"),
)
scatter_figs.add_trace(
go.Scatter(
x=regression_data["patient_health_metric_a"],
y=regression_data.los,
mode="markers",
),
row=1,
col=1,
)
scatter_figs.add_trace(
go.Scatter(
x=regression_data["patient_health_metric_b"],
y=regression_data.los,
mode="markers",
),
row=1,
col=2,
)
scatter_figs.add_trace(
go.Scatter(
x=regression_data["patient_health_metric_c"],
y=regression_data.los,
mode="markers",
),
row=2,
col=1,
)
scatter_figs.add_trace(
go.Scatter(x=regression_data["age"], y=regression_data.los, mode="markers"),
row=2,
col=2,
)
scatter_figs.update_layout(margin=dict(b=5, t=100, l=5, r=5), showlegend=False)
scatter_figs.update_layout(
title="Finding which variables have a linear relationship with LoS"
)
scatter_figs.show()
Is there a way to quantify the strength of the linear relationship between two variables…? Yes!
Pearson’s correlation coefficient#
Pearson’s correlation coefficient is a measure of linear association between two variables:
-1 indicates a perfect negative association between two variables.
0 indicates there is no relationship.
+1 indicates a perfect positive association between two variables.
You can use the .corr() method on a pandas dataframe to calculate the correlation coefficient.
regression_data.corr()
As a rough guide to the strength of a linear association, we can follow these rules for the absolute value of the correlation coefficient:
0-0.19: very weak
0.2-0.39: weak
0.4-0.59: moderate
0.6-0.79: strong
0.8-1: very strong
More information on these rough guidelines from the BMJ here
Covariance is another metric used to quantify the strength of linear association between two variables; however, it isn't bound between -1 and 1 like the correlation coefficient.
\(cov(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
\(corr(x, y) = \frac{cov(x, y)}{\sqrt{var(x)}\sqrt{var(y)}}\)
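As a quick sanity check, here is a minimal sketch computing these two quantities from scratch for age and los in regression_data, and comparing the correlation with the pandas .corr() output:
# Covariance and Pearson correlation from scratch, using regression_data
x_vals = regression_data["age"].values
y_vals = regression_data["los"].values
cov_xy = np.mean((x_vals - x_vals.mean()) * (y_vals - y_vals.mean()))
corr_xy = cov_xy / (np.std(x_vals) * np.std(y_vals))
print(cov_xy, corr_xy)
# The correlation should match the pandas result
print(regression_data["age"].corr(regression_data["los"]))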
Assumptions 2, 3, and 4#
The residuals (or error terms) will:
Be approximately normally distributed (with mean=0)
Have constant variance (homoscedasticity)
Be independent of one another
What are residuals (error terms)? They are the difference between the fitted line and the actual values. They are used to fit the line (we want to minimise our error) and can be studied to provide insight into uncertainty in our predictions.
Model error (residuals)#
We know age has a strong linear relationship with the target variable. Lets fit a LinearRegression()
model and study the residuals.
# Separate data into features and target
X, y = regression_data[["age"]].values, regression_data["los"].values
# Instantiate LinearRegression model.
linear_model = LinearRegression()
# Fit model to data
linear_model.fit(X, y)
Visualise the fitted model
los_mode_fig = px.scatter(x=X.flatten(), y=y.flatten(), trendline="ols")
los_mode_fig.update_layout(
xaxis_title="Age",
yaxis_title="Length of stay",
title="Fitted linear regression model for predicting patient los from their age",
)
# Let's make predictions based on our training input
fitted_values = linear_model.predict(X)
# The residuals are simply the error terms (i.e., actual values - fitted values)
residuals = y.flatten() - fitted_values.flatten()
# Show first 10 residuals
residuals[:10]
Diagnostic plots#
resid_figs = make_subplots(
rows=1,
cols=2,
subplot_titles=("Histogram of residuals", "Residuals vs fitted values"),
)
resid_figs.add_trace(go.Histogram(x=residuals), row=1, col=1)
resid_figs.add_trace(
go.Scatter(x=fitted_values.flatten(), y=residuals, mode="markers"), row=1, col=2
)
resid_figs.update_layout(showlegend=False)
We can use these plots to check the assumptions:
Linearity
Independence
Normality
Equality of variance
Great! We now know how to fit a linear regression model using sklearn, assess its performance, and verify the model assumptions.
Train-Test split#
Let's use a train-test split to get a more reliable indication of how the model performs on unseen data and to help guard against overfitting. Looking again at age predicting los, rather than training a model using all available data, let's split the data into training and testing sets.
X, y = regression_data[["age"]].values, regression_data["los"].values
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=random_seed, test_size=0.3, shuffle=True
)
# Fitting model to the training data
linear_model.fit(X_train, y_train)
# Get coefficient and intercept from fitted line
print(linear_model.coef_)
print(linear_model.intercept_)
Plot the fitted model using the training data and the unseen testing data.
train_test_fig = px.scatter(x=X_train.flatten(), y=y_train.flatten(), trendline="ols")
train_test_fig.add_trace(
go.Scatter(
x=X_test.flatten(),
y=y_test.flatten(),
mode="markers",
marker=dict(color="red"),
name="Unseen test data",
)
)
train_test_fig.update_layout(
xaxis_title="Patient age",
yaxis_title="Length of stay",
title="Fitting linear regression model",
)
train_test_fig.show()
linear_model.score(X_train, y_train)
The score above is on the training data. Scoring on the held-out test data below gives a much better indication of how the model will perform on unseen incoming data in the future!
linear_model.score(X_test, y_test)
Note, we can extend this concept and perform multiple train-test splits, giving a more robust estimate of model performance. This concept is known as cross-validation.
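Below is a minimal sketch of 5-fold cross-validation using sklearn's cross_val_score (not imported in the first cell), applied to the age/los data split above:
# A minimal sketch of 5-fold cross-validation: the model is trained and
# scored 5 times, each time holding out a different fold as the test set
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(cv_scores)
print(cv_scores.mean())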
Multiple linear regression#
Previously, we only considered a single independent variable (i.e., age) to predict some outcome (i.e., LoS). What if we wanted to predict LoS by using multiple independent variables (i.e., age and health metric C)?
We could fit separate simple linear regression models for each of our independent variables against the target... but this isn't the best option:
How would we obtain a single prediction?
Each separate equation ignores the other independent variables.
A better approach is to extend linear regression to include \(n\) variables:
\(\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n\)
All the concepts described above can be used when using more than one independent variable!
We now interpret the coefficient for each variable as the impact of changing that variable while holding all other independent variables constant. Let's fit a linear regression model using both age and patient_health_metric_c to predict los.
X, y = (
regression_data[["age", "patient_health_metric_c"]].values,
regression_data["los"].values,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=random_seed, test_size=0.3
)
# Fitting model to training data
linear_model.fit(X_train, y_train)
beta_0 = linear_model.intercept_
beta_1 = linear_model.coef_[0]
beta_2 = linear_model.coef_[1]
print(f"Intercept = {beta_0}")
print(f"Age coef. = {beta_1}")
print(f"patient_health_metric_c coef. = {beta_2}")
# Make a prediction on los using the .predict() method
linear_model.predict([[75, 0.5]])
# We could again manually make this prediction (using the linear equation)
los_prediction = beta_0 + beta_1 * 75 + beta_2 * 0.5
print(los_prediction)
# Score model on test data
linear_model.score(X_test, y_test)
This model performs much better compared with using only age as a predictor!
We can generalise simple linear regression to include \(n\) independent variables.
You can interpret each coefficient as the linear relationship between the independent variable and the dependent variable, while holding all other variables constant.
Multicollinearity (where 2 or more independent variables have high correlation) may impact the interpretation of coefficients in the model.
Variables with different scales may also impact the coefficient interpretation. This can be avoided by using standardising techniques (covered in the pre-processing session).
Be cautious when using \(R^2\): it does not take model complexity into account and may encourage overfitting if it's the only metric observed.
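One common alternative (a minimal sketch, not used elsewhere in this notebook) is adjusted \(R^2\), which penalises additional predictors: \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), where \(n\) is the number of observations and \(p\) the number of predictors.
# Adjusted R^2 for the two-feature model fitted above, using the training data
r_2_train = linear_model.score(X_train, y_train)
n_obs, n_predictors = X_train.shape
adjusted_r_2 = 1 - (1 - r_2_train) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(adjusted_r_2)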
Task 2 (15-20 mins)
Additional data from an A&E department was collected. Use multiple linear regression to model the relationship between A&E waiting times and five independent variables (multicollinearity is not present in the data).
# We'll use make_regression to create a fake dataset on waiting times.
X, y, coef = make_regression(
n_samples=100,
n_features=5,
n_informative=2,
n_targets=1,
bias=1.5,
effective_rank=None,
tail_strength=0.5,
noise=15,
shuffle=True,
coef=True,
random_state=random_seed,
)
# Store data in pandas.DataFrame
task_2_data = pd.DataFrame(X)
task_2_data.columns = [
"hospital_metric",
"beds_metric",
"arrivals_metric",
"staffing_metric",
"ambulance_metric",
]
task_2_data["wait_times_mins"] = y + 250
Calculate the correlation coefficients of the task_2_data dataframe to find the two features most correlated with the target variable (wait_times_mins).
# Answer here
Plot the chosen feature(s) using a scatter plot to verify the linear relationship.
# Answer here
Store the selected features in a list called selected_features, then uncomment and run the cell below.
# Answer here
# X, y = task_2_data[selected_features].values, task_2_data["wait_times_mins"].values
Split the data into training and testing sets (30% test size with random_state set to 42) and train a LinearRegression() model using the training data. Note, in sklearn multiple linear regression is the exact same process as simple linear regression!
# Answer here
Score the model on the testing dataset.
# Answer here
Calculate residuals and plot them in a histogram (if you have time).
# Answer here