Time Series Forecasting#

# Imports required for notebook

import warnings

warnings.simplefilter("ignore", category=FutureWarning)

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots

from plotly import graph_objects as go

import statsmodels
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import acf

random_seed = 42

An excellent, freely available resource on time series forecasting:

https://otexts.com/fpp3/

What is time series data?#

A time series is a series of data points indexed in time order. Typically, a time series is a sequence of points taken at successive, equally spaced points in time. More information on time series data is available here.

In healthcare, we may record time series data for:

  • Hospital admissions and discharges (e.g., daily admissions).

  • Ward beds (e.g., bed occupancy recorded at midnight).

  • Patient vital signs (e.g., hourly temperature recordings).

# Creating some fake example time series data
np.random.seed(random_seed)

time_index = pd.date_range(start="2022-01-01", end="2024-05-01", freq="W")

time_series_example = pd.DataFrame(index=time_index)
time_series_example["Referrals"] = [np.random.poisson(160) for x in time_index]
# Show first 5 rows
time_series_example.head()
# Line chart for example time series
px.line(
    time_series_example,
    x=time_series_example.index,
    y="Referrals",
    title="Example weekly time series dataset for referrals",
)

Working with time series in Pandas#

Let's look at monthly A&E attendances across England (data available here).

# Data used in notebook
raw_data = pd.read_excel(
    "https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2024/02/Monthly-AE-Time-Series-January-2024.xls",
    skiprows=13,
    sheet_name="Activity",
)

raw_data["Period"] = pd.to_datetime(raw_data["Period"], format="%Y-%m-%d")

raw_data = raw_data[["Period", "Total Attendances"]].set_index("Period")
# Print first few rows of data
raw_data.head()
# Line chart for A&E monthly admissions
px.line(
    raw_data,
    x=raw_data.index,
    y="Total Attendances",
    title="NHS England A&E attendances monthly time series",
)

Post-COVID data gets quite messy! For ease, let's only analyse pre-COVID data. Of course, we can't just ignore the impact of COVID on the data, but this is something we will address in the optional time series session.

# Remove data post-COVID
raw_data = raw_data.loc[:"2020-02-01"]
raw_data.tail()
# Visualise pre-COVID attendance data
px.line(
    raw_data,
    x=raw_data.index,
    y="Total Attendances",
    title="NHS England A&E attendances monthly time series",
)

We are seeing a pattern of repeated variation in this data (i.e., monthly seasonality): variation that occurs in the data at specific intervals (e.g., daily, monthly or yearly).

One factor that could be influencing seasonal variation when looking at total monthly attendances is simply the number of days in each month. We should remove this calendar variation before starting the analysis; we can instead calculate average daily attendances per month.

Calendar adjustments#

We can extract useful information from a pandas DatetimeIndex. A few examples are shown below, but for a full list of attributes, look here.

# Get day of week
raw_data.index.dayofweek
# Get month
raw_data.index.month
# Get days in month
raw_data.index.daysinmonth
# Create a new column with the number of days for a given month
raw_data["days_per_month"] = raw_data.index.daysinmonth

# Show the new column in table
raw_data.head()
# Make calendar adjustment
raw_data["avg_daily_attendances"] = (
    raw_data["Total Attendances"] / raw_data["days_per_month"]
)

# Show adjusted data
raw_data.head()
# Plot adjusted data using a Plotly line chart
px.line(
    raw_data,
    x=raw_data.index,
    y="avg_daily_attendances",
    title="Calendar adjusted data",
)

Adjustments should also be made for population variation. For example, if plotting the number of hospital beds over time, we should account for growth in the population. Even though beds may have increased, beds per 1,000 population may have decreased (this depends on the question you are asking of the data).
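Adjusting for population can flip the story the raw numbers tell. A minimal sketch with made-up (hypothetical, not real NHS) figures:

```python
import pandas as pd

# Hypothetical figures for illustration only (not real NHS data)
beds = pd.Series({2010: 140_000, 2015: 145_000, 2020: 148_000}, name="beds")
population = pd.Series({2010: 60e6, 2015: 65e6, 2020: 70e6}, name="population")

# Raw bed numbers rise, yet beds per 1,000 population falls,
# because the population grows faster than bed capacity
beds_per_1000 = 1000 * beds / population
print(beds_per_1000.round(2))
```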

Time series decomposition#

Time series data can be separated into three components:

  • Trend - Long term increase/decrease in data (not necessarily linear).

  • Seasonal - The impact on the data of seasonal factors, such as month of year or day of week.

  • Remainder - Everything else (i.e., random variation that can’t be modelled or predicted).

Therefore, each point in our data can be described with the following equation (additive model):
\(y_t = T_t + S_t + R_t\), where:

  • \(y_t\) is a data point in our dataset,

  • \(T_t\) is the trend component,

  • \(S_t\) the seasonality component,

  • and \(R_t\) the remainder component, at time \(t\).

Note, the trend component here includes both trend and cyclic behaviour. The cyclic component represents non-periodic fluctuation in the data (e.g., the impact of flu season, whose timing and severity vary from year to year).

Note, we could have multiplicative decomposition, where we multiply each component:
\(y_t = T_t \times S_t \times R_t\)

If the magnitude of seasonal fluctuations is independent of the trend, use an additive model; otherwise, use a multiplicative model. Most healthcare applications assume an additive model; multiplicative models are more commonly found in finance data.
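One practical consequence: taking logs turns a multiplicative model into an additive one, since \(\log(T_t \times S_t \times R_t) = \log T_t + \log S_t + \log R_t\). A minimal sketch with made-up components:

```python
import numpy as np

t = np.arange(120)
trend = 100 + t                                  # made-up linear trend
seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / 12)  # multiplicative seasonal factor
y = trend * seasonal                             # multiplicative series (no noise, for clarity)

# On the log scale the components simply add, so additive methods apply
log_y = np.log(y)
assert np.allclose(log_y, np.log(trend) + np.log(seasonal))
```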

# Example dataset for multiplicative seasonality
url = (
    "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
)
airline_df = pd.read_csv(url)
airline_df["Month"] = pd.to_datetime(airline_df["Month"], format="%Y-%m")
# Plot example of multiplicative time series data
px.line(
    airline_df,
    x="Month",
    y="Passengers",
    title="Example airline dataset to show multiplicative model",
)

Trend component#

Rolling averages are a sequence of averages taken from a sliding window within your data. For example, with a window size of 12, we would take the first 12 measurements and calculate their mean to give the first value. We then slide the window along one step and calculate the mean of the next 12 measurements, and so on.

# This can be plotted
rolling_fig = px.scatter(
    raw_data,
    y="avg_daily_attendances",
    trendline="rolling",
    trendline_options=dict(window=12),
    title="12-month moving average",
)
rolling_fig.show()

How to calculate this in pandas? By using the .rolling() method!

raw_data["avg_daily_attendances"].rolling(window=12).mean().head(24)

We can also center the average, so that it evenly represents the average over the 12 months (the method above assigns each point the average of the previous 12 months). This is done using the center=True argument.

raw_data["avg_daily_attendances"].rolling(window=12, center=True).mean().head(12)

This may be better for estimating the trend component, since there is less lag between when a change in trend occurs in the data, and when it is detected by the rolling mean.
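A quick sketch of this lag effect, using a made-up series with a level shift:

```python
import pandas as pd

# Made-up series: a level shift from 10 to 20 at position 20
s = pd.Series([10.0] * 20 + [20.0] * 20)

trailing = s.rolling(window=12).mean()
centred = s.rolling(window=12, center=True).mean()

# The centred mean responds to the shift several points earlier,
# because its window also looks ahead of the current point
first_move_trailing = (trailing > 10).idxmax()
first_move_centred = (centred > 10).idxmax()
assert first_move_centred < first_move_trailing
```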

Let's use a centered rolling average with a 25-month window (two 12-month seasonal periods plus one, so the window is symmetric about each point) to estimate the trend component.

seasonal_period = 12
# Rolling mean to smooth out our data
trend_component = (
    raw_data["avg_daily_attendances"]
    .rolling(seasonal_period * 2 + 1, center=True)
    .mean()
)

Let's create a dataframe to store our time series components.

decomposed_data = pd.DataFrame()
decomposed_data.index = raw_data.index
decomposed_data["original"] = raw_data["avg_daily_attendances"]
decomposed_data["trend_component"] = trend_component.values

# Remove missing values
decomposed_data = decomposed_data.dropna(how="any")
decomposed_data.head()

Seasonal component#

First (assuming an additive series) we remove the trend from our data. Note, if it were a multiplicative series, we would instead divide by the trend component.

# Subtract trend from our data
seasonal_and_remainder = raw_data["avg_daily_attendances"] - trend_component
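For reference, a sketch of the multiplicative case on made-up data: dividing by the trend leaves \(S_t \times R_t\), which is centred around 1 rather than 0:

```python
import numpy as np
import pandas as pd

# Made-up multiplicative series: linear trend times a seasonal factor
idx = pd.date_range("2022-01-01", periods=48, freq="MS")
trend = pd.Series(np.linspace(100, 200, 48), index=idx)
seasonal_factor = pd.Series(1 + 0.1 * np.sin(2 * np.pi * np.arange(48) / 12), index=idx)
y = trend * seasonal_factor

est_trend = y.rolling(window=13, center=True).mean()

# Divide (not subtract) by the trend; the result hovers around 1
detrended = (y / est_trend).dropna()
assert abs(detrended.mean() - 1) < 0.05
```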

Then group by each month and take the mean.

month_arr = seasonal_and_remainder.index.month
month_arr
seasonal_component = seasonal_and_remainder.groupby(by=month_arr).mean()
seasonal_component
# Plotting seasonal variation in attendance data
seasonal_fig = px.bar(seasonal_component, title="Seasonal component in attendance data")
seasonal_fig.update_layout(yaxis_title="Seasonal variation", xaxis_title="Month")

Let's add the seasonal component to the decomposed_data table.

decomposed_data["seasonal_component"] = decomposed_data.index.month.map(
    seasonal_component
)
decomposed_data.head()

Remainder component#

The remainder is simply the difference between the original series and the trend and seasonal components.

decomposed_data["remainder_component"] = (
    decomposed_data["original"]
    - decomposed_data["trend_component"]
    - decomposed_data["seasonal_component"]
)

Note, the remainder terms should ideally be uncorrelated and have mean 0.

fig = make_subplots(rows=4, cols=1)

fig.append_trace(
    go.Scatter(
        x=decomposed_data.index, y=decomposed_data["trend_component"], name="Trend"
    ),
    row=1,
    col=1,
)


fig.append_trace(
    go.Scatter(
        x=decomposed_data.index,
        y=decomposed_data["seasonal_component"],
        name="Seasonal",
    ),
    row=2,
    col=1,
)

fig.append_trace(
    go.Scatter(
        x=decomposed_data.index,
        y=decomposed_data["remainder_component"],
        name="Remainder",
    ),
    row=3,
    col=1,
)

fig.append_trace(
    go.Scatter(x=decomposed_data.index, y=decomposed_data["original"], name="Original"),
    row=4,
    col=1,
)


fig.update_layout(height=900, width=950, title_text="Decomposed time series")
fig.show()

A little long-winded! However, we can use the statsmodels library to decompose the time series for us in only a few lines of code.

statsmodels seasonal_decompose#

# All our steps above are now condensed into one cell with 3 lines of code!
decomposed_ts_update = seasonal_decompose(
    raw_data["avg_daily_attendances"].values, model="additive", period=12
)
fig = make_subplots(rows=3, cols=1)

fig.append_trace(
    go.Scatter(
        x=raw_data.index, y=decomposed_ts_update.trend, name="statsmodels trend"
    ),
    row=1,
    col=1,
)

fig.append_trace(
    go.Scatter(
        x=decomposed_data.index,
        y=decomposed_data["trend_component"],
        name="manual trend",
    ),
    row=1,
    col=1,
)


fig.append_trace(
    go.Scatter(
        x=raw_data.index, y=decomposed_ts_update.seasonal, name="statsmodels seasonal"
    ),
    row=2,
    col=1,
)
fig.append_trace(
    go.Scatter(
        x=decomposed_data.index,
        y=decomposed_data["seasonal_component"],
        name="manual seasonal",
    ),
    row=2,
    col=1,
)


fig.append_trace(
    go.Scatter(
        x=raw_data.index, y=decomposed_ts_update.resid, name="statsmodels remainder"
    ),
    row=3,
    col=1,
)
fig.append_trace(
    go.Scatter(
        x=decomposed_data.index,
        y=decomposed_data["remainder_component"],
        name="manual remainder",
    ),
    row=3,
    col=1,
)

fig.update_layout(
    height=600,
    width=950,
    title_text="Comparing statsmodels seasonal_decompose to our manual calculations",
)
fig.show()

Very similar outputs, although the statsmodels approach requires much less work! There are other methods to calculate the decomposed series beyond this ‘classical’ approach, such as STL (Seasonal-Trend decomposition using LOESS), also available in statsmodels.

Time series decomposition key points
  • Time series decomposition allows you to understand underlying patterns and structures within the data. For example, what is the seasonal variation? Which months see the highest demand? What is the underlying trend for demand?

  • Although not typically used for forecasting, it helps identify which components may be more predictable and which model may be appropriate (more on this in the next session).

  • Decomposition can complement other time series techniques, such as SPC analysis or causal impact.

Task (15-20 mins)

A hospital manager has collected daily data on the number of admissions to the A&E department. They want to use a more data-driven approach to calculate the weekly staffing rotas and to quantify how demand has changed during the data collection period.

# Creating some fake time series data
np.random.seed(random_seed)

time_index = pd.date_range(start="2022-01-01", end="2024-05-01", freq="D")

seasonal_component = (10 * np.sin((2 * np.pi / 7) * np.arange(len(time_index)))).astype(
    int
)
remainder_component = np.random.normal(loc=0, scale=1.5, size=len(time_index)).astype(
    int
)
trend_component = np.round(np.arange(len(time_index)) / 35).astype(int)

time_series = 50 + seasonal_component + trend_component + remainder_component

# Format data in a pandas dataframe
task_data = pd.DataFrame({"date": time_index, "admissions_daily": time_series})
task_data.head()
  1. Make a time series plot of daily A&E admissions.

# Answer here
  2. Using statsmodels seasonal_decompose, decompose the data into its individual components (use an additive model and think about the seasonal period for daily data). Plot each component.

# Answer here
  3. As we’ve made fake data, we can extract the exact underlying trend and seasonal components (saved as actual_seasonal_component & actual_trend_component in the cell below). How good was seasonal_decompose at isolating the trend and seasonal components from the added Gaussian noise? Plot the actual trend and seasonal components against the modelled components.

actual_seasonal_component = (
    10 * np.sin((2 * np.pi / 7) * np.arange(len(time_index)))
).astype(int)
actual_trend_component = np.round(np.arange(len(time_index)) / 35).astype(int) + 50
# Answer here

Hint: You can use the .trend and .seasonal attributes to extract trend and seasonal data from the fitted model.

Autocorrelation#

Within the regression notebook, we discussed the correlation coefficient (a measure of linear association between two variables).

The autocorrelation function (ACF) measures the strength of the linear relationship between a time series and a lagged version of the same time series.

Let's look at our NHS A&E attendances dataset to understand this better.

First lag:

raw_data["avg_daily_attendances_lag1"] = raw_data["avg_daily_attendances"].shift(1)
raw_data.head()
# Inspect linear relationship using a scatter plot
px.scatter(
    raw_data,
    x="avg_daily_attendances_lag1",
    y="avg_daily_attendances",
    title="Relationship between avg_daily_attendances and the first lag",
)
raw_data[["avg_daily_attendances", "avg_daily_attendances_lag1"]].corr()

Second lag:

raw_data["avg_daily_attendances_lag2"] = raw_data["avg_daily_attendances"].shift(2)
raw_data[["avg_daily_attendances", "avg_daily_attendances_lag2"]].corr()

Rather than manually calculating each lag and then the coefficient, we typically use an ACF plot, which summarises this information. We have focused primarily on Plotly, but here we’ll use matplotlib.

fig, ax = plt.subplots(figsize=(15, 5))
_ = statsmodels.graphics.tsaplots.plot_acf(
    raw_data["avg_daily_attendances"],
    ax=ax,
    lags=12 * 2,
    title="Autocorrelation for A&E average daily attendances",
)
plt.show()

The blue shaded region is used to infer whether the correlation is significantly different to zero (at the 5% significance level).

ACF key points
  • Trending time series tend to have large positive correlations for small lags, since nearby observations have similar values.

  • Seasonal data will have correlations that are larger at the seasonal lags (e.g., lag 12 for monthly data).

Task 2 (15-20 mins)

  1. Using task_data, calculate the first 2 lags as new columns in the pandas dataframe.

# Answer here
  2. Use the pandas .corr() method to get the correlation coefficients between these two lags and the target column (admissions_daily).

# Answer here
  3. Plot the autocorrelation function, using statsmodels, for the previous 4 weeks of lags. Reading the plot, do you agree with the decomposed graphs derived in the first task?

# Answer here