Mathematics and Statistics#

import numpy as np
import math

The math library provides mathematical functions and constants. Read the documentation here or run help(math) in a cell to find out more about the library.

Maths Basics and Calculus review#

Introduction to number sets#

  • Natural Numbers: The counting numbers (not including zero or negative numbers), i.e., \(1, 2, 3...\), e.g., the number of wards in a hospital.

natural_number_example_list = [2, 3, 5, 7, 11, 13]

  • Whole Numbers: If we include the number 0 in the set of natural numbers, we have the set of whole numbers, e.g., the number of patient cancellations.

whole_number_example_list = [0, 2, 3, 4, 5, 600]

  • Integers: This set includes all whole numbers and also negative numbers, e.g., the change in arrivals at A&E compared to yesterday.

integer_number_example_list = [-100, -1, 0, 2, 3]

  • Rational numbers: All numbers that can be expressed as fractions, e.g., the fraction of patients seen within 4 hours at A&E.

rational_number_example_list = [-1/3, 0/4, 1/1, 500/3]

  • Irrational numbers: All numbers that cannot be expressed as fractions. Some mathematical formulae include these numbers, such as the normal distribution function (we can access some of these irrational numbers in python using the math library).

irrational_number_example_list = [math.pi, 2**0.5]

  • Real numbers: Real numbers include rational and irrational numbers, e.g., a continuous measurable process such as patient temperature or height. All the number systems above are a subset of the real numbers.

real_number_example_list = [math.pi, 1/3, -500, 2.1]

  • Complex numbers: Numbers with an imaginary part, defined using the square root of a negative number (we will very rarely encounter complex numbers in data science tasks).
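For completeness, here is a small illustrative list of complex numbers in python (the j suffix denotes the imaginary part); this mirrors the example lists above and is purely for reference.

complex_number_example_list = [2 + 3j, complex(0, 1)]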

In python, there are three distinct numeric types:

  • integers: int (INT, SMALLINT, BIGINT in SQL or Integer in R).

    • Natural numbers

    • Whole numbers

    • Integers

  • floating point numbers: float (SMALLFLOAT, FLOAT, DECIMAL in SQL or Numeric in R).

    • Rational numbers

    • Irrational numbers

    • Real numbers

  • complex numbers (don’t worry too much about this number type): complex (not supported in SQL and Complex in R).

type() is a built-in python function that returns the type of an object. For example, type(1) returns int and type(math.pi) returns float.
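As a quick illustrative sketch, we can check the type of one value from each numeric type (the values here are arbitrary examples):

# Inspect the type of a few example values
print(type(7))        # int
print(type(math.pi))  # float
print(type(2 + 3j))   # complex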

Task 1 (5-10 minutes)

Let’s apply the concept of loops covered yesterday to inspect the different number types from a list.

Choose a list from the example number systems above (copy and paste the list into the cell below). Loop through each element in your list and print the data type for each member.

# Your code here

Arithmetic operations in python#

There are 7 important mathematical operations available in python (described in Day 1). As a refresher, these operations are:

1. Addition

# Define two variables x and y
x = 5
y = 2
# using the addition operator
output = x + y
print(output)

2. Subtraction

output = x - y
print(output)

3. Multiplication

output = x * y
print(output)

4. Division

output = x / y
print(output)

5. Modulus

output = x % y
print(output)

6. Exponentiation

output = x**y
print(output)

7. Floor division

output = x // y
print(output)

Compound operations

If you want to update a value of a variable, it’s good practice to make use of compound assignment operations.

# Normal arithmetic operation
x = x * 5

# Equivalent compound assignment operation
x *= 5

See more on compound assignment operators here
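As an illustrative sketch, each of the arithmetic operators above has a compound assignment form (the values below are arbitrary):

# Start with an example value
x = 10
x += 2   # addition: x is now 12
x -= 4   # subtraction: x is now 8
x /= 2   # division: x is now 4.0 (division always returns a float)
x //= 3  # floor division: x is now 1.0
print(x)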

Order of operations (BIDMAS)#

The order of operations is the set of rules that tells us the sequence in which to evaluate expressions with multiple operations. BIDMAS is an acronym used to remember the order of operations in arithmetic.

BIDMAS rules (the order of operations)#

  1. Brackets: Operations involving brackets are performed first (work from the innermost set outwards).

  2. Indices: Next, perform operations involving exponents (indices).

  3. Division & Multiplication (equal priority): Next, perform operations involving division and multiplication (prioritise from left to right).

  4. Addition & Subtraction (equal priority): Finally, perform operations involving addition and subtraction (prioritise from left to right).

For the most part, you won’t need to memorise or know how to apply these rules, as python follows BIDMAS when evaluating expressions with multiple operations.
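For example, the short sketch below shows python applying BIDMAS: the exponent is evaluated first, then multiplication, then addition, and adding brackets changes the result.

# Without extra brackets: 4 ** 2 first, then 3 * 16, then 2 + 48
print(2 + 3 * 4 ** 2)    # 50

# Brackets are evaluated first, changing the result
print((2 + 3) * 4 ** 2)  # 80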

Task 2 (5-10 minutes)

Solve \(5 + (3 \times 2) - 4^2 \div 2\) using the prioritisation rules from BIDMAS. Then check your answer by evaluating the expression in python.

# Your working here

Functions#

The concept of a function in python was discussed in Day 1 of the course. Put simply, a function can be broken down into 3 components:

  • Input: The independent variable. A set of items/numbers (e.g., the real numbers).

  • Relationship: The input values modified by mathematical operations (e.g., addition/multiplication).

  • Output: The dependent variable (depends on the value of the input).

Note, a mathematical function requires that each input be related to exactly one output.

Function of a straight line#

The straight line function is given by \(f(x) = mx + c\), where m and c are constant numbers (any real number). The linear function is used in various data science tasks, for example, linear regression & time series forecasting (topics covered in days 4 & 5).

def straight_line(input_number, gradient, intercept):
    """
    Equation for a straight line.

    Args:
        input_number (float): The x value for the straight line (input for function).
        gradient (float): The gradient of the straight line.
        intercept (float): The y-intercept for the straight line.

    Returns:
        float: The y-value of the straight line (function output).
    """
    return input_number * gradient + intercept
# Let's feed in an input value and see the result:
output = straight_line(input_number=5.2, gradient=5, intercept=3)
print(output)
"""
!!Feel free to ignore this cell, data visualisations will be covered on Day 3!!
"""

# import Plotly graphing library
import plotly.graph_objects as go


def plot_straight_line(gradient, intercept):
    """
    Equation of a straight line figure

    Args:
        gradient (float): The gradient of the straight line.
        intercept (float): The y-intercept for the straight line.

    Returns:
        plotly.graph_objs._figure.Figure: A Plotly figure showing
        equation of a straight line.

    """

    # Set axis limits
    axis_range = [-20, 20]

    # Specify a set of input x values
    x_vals = np.linspace(-50, 50, 200)

    # Use the straight_line function to calculate outputs for each input x value
    # using specified gradient and intercept
    y_vals = [straight_line(i, gradient=gradient, intercept=intercept) for i in x_vals]

    # Initialise figure
    fig = go.Figure()

    # Format x-axis
    fig.update_xaxes(
        range=axis_range,
        title="x",
        tickmode="linear",
        showticklabels=False,
        side="bottom",
        gridcolor="rgba(224,224,224, 0.9)",
    )

    # Format y-axis
    fig.update_yaxes(
        range=axis_range,
        title="y",
        tickmode="linear",
        showticklabels=False,
        side="left",
        gridcolor="rgba(224,224,224, 0.9)",
    )

    # Use f-strings to update figure title
    if intercept >= 0:
        fig.update_layout(title=f"$f(x) = {gradient}x + {intercept}$")
    else:
        fig.update_layout(title=f"$f(x) = {gradient}x {intercept}$")

    # Add x and y axis lines
    fig.add_vline(x=0, line_width=2)
    fig.add_hline(y=0, line_width=2)

    # Add linear function
    fig.add_trace(
        go.Scatter(x=x_vals, y=y_vals, mode="lines", line=dict(color="#00789c"))
    )

    # Update figure layout
    fig.update_layout(
        plot_bgcolor="rgba(255,255,255, 0.5)",
        height=500,
        width=500,
        margin=dict(l=50, r=0, b=0, t=50, pad=0),
    )

    return fig

Make use of the custom function plot_straight_line(), which outputs a graph of the defined straight line function. For the task below, use this function and alter the values of the gradient and intercept (you can use any real numbers).

plot_straight_line(gradient=2, intercept=1)

Task 3 (5 minutes)

Modify parameters of the straight line equation in plot_straight_line and consider these questions about the gradient (m):

  1. What happens to the line when the gradient (m) increases/decreases?

  2. What happens when the gradient is zero?

  3. What happens when the gradient goes from positive to negative (or vice-versa)?

# Alter parameters of plot_straight_line function here.

Straight lines are useful as they tell us the rate at which y increases (or decreases) for every unit change in x. For example, a straight line may be used for showing a trend over time, e.g., “Arrivals at a hospital’s A&E department are increasing by X% per year”.

How do we quantify rates of change when a process does not have a linear relationship?

Let’s examine how drug concentration varies in the bloodstream following an initial dose (using methods from this study).

def C_t(t, A=6.2, b=0.5):
    """
    Surge function for studying medicine concentration in bloodstream over time.

    Args:
        t (float): The time (in hours) from ingesting drug.
        A (float): Constant used to fit function for specific medications.
        b (float): Constant used to fit function for specific medications.

    Returns:
        float: Concentration in bloodstream at time t.
    """
    return A * (t**4) * np.exp(-b * t)

We will look at medication concentration in the bloodstream over a 40 hour period.

# Store time points between 0 and 40 hours.
hours = np.linspace(0, 40, 4000)

# Store medicine concentration in patients bloodstream at each point.
concentration = [C_t(time) for time in hours]
"""
!!Feel free to ignore this cell, data visualisations will be covered on Day 3!!
"""

# Import Plotly express library
import plotly.express as px

# Create line chart
fig = px.line(x=hours, y=concentration)

# Update layout
fig.update_layout(
    xaxis_title="Hour (following drug consumption)",
    yaxis_title="Drug concentration",
    template="seaborn",
    title={
        "text": f"Concentration of medication in the bloodstream over time",
        "x": 0.07,
        "xanchor": "left",
    },
)

# Change line style
fig.update_traces(line_color="#00789c", line_width=3)

# Display plot
fig.show()

Tangent lines#

If we want to know the rate of change of drug concentration at a specific time point, we can calculate the tangent line at that point. The tangent line is defined as a line that just touches a curve at a point, matching the curve’s slope (i.e., gradient) there.

def derivative(t):
    """
    Function for the derivative of the surge function.

    Args:
        t (float): The time point at which to calculate the derivative.
    Returns:
        float: Derivative of the function at time t.
    """
    return -3.1 * t**4 * np.exp(-0.5 * t) + 24.8 * t**3 * np.exp(-0.5 * t)
"""
!!Feel free to ignore this cell, data visualisations will be covered on Day 3!!
"""


def plot_c_t(tangent_point):
    """
    Plot tangent line for points along the surge function.

    Args:
        tangent_point (float, >0): The time point at which to calculate the tangent.
    Returns:
        plotly.graph_objs._figure.Figure: Plotly figure of surge function with tangent line.
    """

    # Define points along surge function to calculate tangent.
    y1 = C_t(tangent_point)
    x1 = tangent_point

    # Initialise figure
    fig = go.Figure()
    # Add traces
    fig.add_trace(
        go.Scatter(
            x=hours,
            y=concentration,
            name="Drug concentration",
            line=dict(color="#00789c", width=3),
        )
    )
    xrange = np.linspace(x1 - 3, x1 + 3, 10)  # Define range of graph

    # Define points along the tangent line
    line_eq = lambda hours, x1, y1: derivative(x1) * (hours - x1) + y1
    fig.add_trace(
        go.Scatter(
            x=xrange,
            y=line_eq(xrange, x1, y1),
            name="Tangent line",
            mode="lines",
            line=dict(shape="linear", width=4),
        )
    )

    # Calculate gradient using derivative function
    gradient = round(derivative(tangent_point), 2)

    # Update figure layout
    fig.update_layout(
        title={
            "text": f"At hour {tangent_point} the tangent line has gradient {gradient}<br><sup>Graph showing drug concentration in bloodstream changes over a 40 hour period</sup>",
            "x": 0.07,
            "xanchor": "left",
        }
    )
    fig.update_layout(margin=dict(r=5))
    fig.update_layout({"paper_bgcolor": "rgba(0, 0, 0, 0)"}, template="seaborn")
    fig.update_layout(
        xaxis_title="Hour (following drug consumption)",
        yaxis_title="Drug concentration",
    )

    return fig
# Using the custom function plot_c_t to plot function tangent at hour 7.
plot_c_t(tangent_point=7)

Task 4 (5-10 minutes)

Modify the point where the tangent is calculated by changing the tangent_point argument of the plot_c_t function. Then have a think about the following questions:

  1. Over what time range is the drug concentration increasing? Does the gradient of the tangent line change over this period?

  2. At approximately which point do you think concentration reaches a maximum? What is the behaviour of the gradient of the tangent there?

  3. What happens to the gradient of the tangent after reaching a maximum?

# Modify parameters of the plot_c_t function here

Differential calculus is used to find rates of change (i.e., gradients of the tangent lines) at points along a function.
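As a rough sanity check (a sketch, not part of the course material), we can approximate the gradient of the tangent at hour 7 with a small finite difference and compare it with the derivative function defined above:

# Approximate the rate of change at hour 7 using a small step
h = 1e-6
numerical_gradient = (C_t(7 + h) - C_t(7)) / h

print(numerical_gradient)  # finite-difference approximation
print(derivative(7))       # analytical derivative defined above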

Area below functions#

Integration can be used to find the area below functions between certain ranges (for example, the area below the surge function between hours 15 and 30).

# Function to calculate integral between two time periods
def integral(t1, t2):
    """
    Integral for the surge function between two time points t1 & t2.

    Args:
        t1 (float): First time point
        t2 (float): Second time point

    Returns:
        float: Area under curve between t1 and t2
    """

    return (
        1.0
        * (-12.4 * t2**4 - 99.2 * t2**3 - 595.2 * t2**2 - 2380.8 * t2 - 4761.6)
        * np.exp(-0.5 * t2)
    ) - (
        1.0
        * (-12.4 * t1**4 - 99.2 * t1**3 - 595.2 * t1**2 - 2380.8 * t1 - 4761.6)
        * np.exp(-0.5 * t1)
    )
"""
!!Feel free to ignore this cell, data visualisations will be covered on Day 3!!
"""


def plot_integral(t1, t2):
    """
    Plot shaded region below surge function between two points.

    Args:
        t1 (float, >0): Lower limit of integration.
        t2 (float, >t1): Upper limit of integration.
    Returns:
        plotly.graph_objs._figure.Figure: Plotly figure of the surge function with the shaded area between t1 and t2.
    """

    # Initialise figure
    fig = go.Figure()

    # Define points between t1 and t2 with a step of 0.01
    xx = np.arange(t1, t2, 0.01)
    # Find output from surge function for each xx
    yy = [C_t(x) for x in xx]

    # Add filled region from surge function between t1 and t2.
    fig.add_trace(
        go.Scatter(
            x=xx,
            y=yy,
            fill="tozeroy",
            fillcolor="#edae49",
            line=dict(color="#edae49"),
            name="Area under curve",
        )
    )

    # Add surge function to figure.
    fig.add_trace(
        go.Scatter(
            x=hours,
            y=concentration,
            name="Drug concentration",
            fillcolor="#00789c",
            line=dict(color="#00789c", width=3),
        )
    )

    # Update layout
    fig.update_layout(
        {"paper_bgcolor": "rgba(0, 0, 0, 0)"},
        template="seaborn",
        xaxis_title="Hour (following drug consumption)",
        yaxis_title="Drug concentration in bloodstream",
        title={
            "text": f"Finding area below function between two time points during medication process",
            "x": 0.08,
            "y": 0.86,
            "xanchor": "left",
        },
    )

    return fig
# Show area below surge function between two points
plot_integral(15, 30)

The Area Under the Curve (AUC) is an important metric in this example, as it tells us the total exposure to the drug within a certain window of time.

Task 5 (5 minutes)

Estimate the area (by eye) between hours 5 and 10 (use the square grid lines in the plot). Then run integral(5, 10) to see whether you agree with the integral function.

# Your working here

Integral calculus is used to determine the area below a function between two points.
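As a sketch (assuming the cells above have been run), we can sanity-check the integral function with a simple numerical approximation, summing thin rectangles under the surge function between hours 15 and 30:

# Approximate the area between hours 15 and 30 with thin rectangles
step = 0.001
times = np.arange(15, 30, step)
approx_area = sum(C_t(t) * step for t in times)

print(approx_area)       # numerical approximation
print(integral(15, 30))  # result from the integral function above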

Calculus summary#

The two fundamental operations of calculus are differentiation and integration. In fact, differentiation and integration are inverse operations.

Why is calculus important in data science?

  • Gradient descent: An optimisation technique used in many machine learning models to minimise some ‘cost’ function (i.e., finding where the derivative is zero).

  • Probability analysis: Calculating areas under probability density functions is an essential statistical skill.

  • Model evaluation: Integrals are used to assess model performance, by computing areas under curves, as seen in ROC analysis.

Below are some suggested resources/reading should you wish to learn more about calculus.

Suggested reading:

  1. Essential math for data science (chapter 1).

  2. Thomas Calculus (the gold standard textbook for calculus).

Further application of calculus in healthcare: SIR modelling


Probability theory#

Probability is how strongly we believe an event will happen (usually expressed as a number between 0 and 1, or as a percentage). These concepts lay the foundation and provide the language for statistics.

For example:

  • What is the probability patient X arriving to A&E will require a bed?

  • What are the chances patient Y will cancel their appointment?

  • What is the likelihood patient Z has O negative blood type?

Probability theory can be a difficult and unintuitive area of mathematics!

What is the probability of someone having a certain blood type?

# Reference: https://www.blood.co.uk/why-give-blood/blood-types/
blood_type_data = {
    "O positive": 0.35,
    "O negative": 0.13,
    "A positive": 0.3,
    "A negative": 0.08,
    "B positive": 0.08,
    "B negative": 0.02,
    "AB positive": 0.02,
    "AB negative": 0.01,
}

Let’s pick someone at random and see their blood type (a random experiment).

The set of all possible outcomes for this experiment is called the sample space.

sample_space = {
    "O positive",
    "O negative",
    "A positive",
    "A negative",
    "B positive",
    "B negative",
    "AB positive",
    "AB negative",
}

The primary goal of probability theory is to assign likelihood to events of interest.

For example, consider randomly selecting someone and finding they are O positive. This can be written as \(P(X)\), where \(X\) is the event of interest (the person selected being O positive).

\(P(X)\) = 0.35

Basic rules of probability#

  • Probability must be between 0 and 1 (0 is an impossible event and 1 is a certain event).

  • The complement rule: The probability of an event not occurring, \(P\)(not event) = 1 - \(P\)(event).

  • The multiplication, or the AND rule: Multiply probabilities to get the overall probability in a sequence of independent events.

    • Independence is satisfied when the outcome of one event does not affect the other.

  • The addition, or OR rule: Add probabilities for mutually exclusive events to get the total probability (see the short example after this list).

    • Events are mutually exclusive if they can’t occur at the same time.
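As a quick sketch of the addition rule using the blood type data defined above (blood types are mutually exclusive, so the probabilities simply add):

# Probability a randomly selected person is O positive OR O negative (i.e., blood group O)
prob_group_O = blood_type_data["O positive"] + blood_type_data["O negative"]
print(prob_group_O)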

Experiment 1: Randomly selecting someone and finding they are not O positive#

# Solution (using the complement rule)
1 - blood_type_data["O positive"]

Experiment 2: Randomly selecting two individuals and finding they are both O positive#

# Solution (using the multiplication rule)
blood_type_data["O positive"] * blood_type_data["O positive"]

Using simulation to demonstrate the multiplication rule

import random


def select_patients(n):
    """
    Simulation for selecting n patients and determining whether or not they are O positive.

    Args:
        n (int): Number of patients to simulate.

    Returns:
        list: List of the selected patients' blood types.
    """
    # Initialise empty list to store patient blood type.
    outcome = []
    for x in range(n):

        # Randomly pick a number between 1 and 100
        blood_type = random.randint(1, 100)

        # This ensures 'O positive' will be selected 35% of the time
        if blood_type <= 35:
            # Add result to list
            outcome.append("O positive")

        # 65% of the time 'NOT O positive' will be selected
        else:
            # Add result to list
            outcome.append("NOT O positive")
    return outcome
# Running the select_patients() function
select_patients(2)
COUNT = 0
RUNS = 50_000
for x in range(RUNS):
    if select_patients(2) == ["O positive", "O positive"]:
        COUNT += 1
print(COUNT / RUNS)

Task 1 (15 minutes)

Using both the definition (multiplication rule) and the simulation approach, find the probability of selecting 3 people at random and getting the following sequence of blood types: ['NOT O positive', 'O positive', 'O positive']. Are both methods in agreement?

Hint: You can make use of the complement rule to find \(P\)(NOT O positive).

# Your code here

Experiment 3: Randomly selecting two individuals and finding that at least 1 is O positive#

Using the complement rule, this can be thought of as:

prob_not_O_pos = 1 - blood_type_data["O positive"]
print(1 - prob_not_O_pos * prob_not_O_pos)

Using simulation to justify the rule

COUNT = 0
RUNS = 50_000
for x in range(RUNS):
    outcome = select_patients(2)
    if outcome == ["NOT O positive", "NOT O positive"]:
        COUNT += 1
print(1 - (COUNT / RUNS))

Conditional probability and Bayes’ Theorem#

Conditional probability is the probability of an event occurring given another event has occurred. For example, the probability of a patient being admitted to hospital given a set of symptoms.

Bayes’ theorem is a framework for updating probabilities based on new information. For example, you can incorporate medical history, test results, etc., to determine whether a patient needs to be admitted.
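As a minimal sketch of Bayes’ theorem with made-up numbers (the prevalence and test accuracy below are purely hypothetical, not real clinical figures), the probability of having a condition given a positive test result is \(P(D \mid +) = \frac{P(+ \mid D) P(D)}{P(+)}\):

# Hypothetical values for illustration only
p_disease = 0.01               # P(D): prevalence of the condition
p_pos_given_disease = 0.95     # P(+|D): probability of a positive test if the condition is present
p_pos_given_no_disease = 0.05  # P(+|not D): probability of a positive test if the condition is absent

# P(+): total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_no_disease * (1 - p_disease)

# Bayes' theorem: P(D|+)
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_pos
print(p_disease_given_pos)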

Suggested reading:

  1. Essential math for data science (chapter 2).

  2. Excellent free online book on probability: https://www.probabilitycourse.com/

  3. Another great freely available book: https://projects.iq.harvard.edu/stat110/home


Descriptive & inferential statistics#

What is statistics? Statistics is the branch of mathematics which deals with collecting, analysing, interpreting and presenting data. Broadly, statistics can be divided into two main branches: descriptive and inferential statistics.

  • Descriptive Statistics: Techniques which aim to summarise a given dataset. Broadly, we summarise data in terms of its central tendency (mean, median, mode) and its variation (standard deviation, variance, skew, kurtosis).

  • Inferential Statistics: Techniques which aim to infer characteristics of a population based on a sample. The population refers to all members you are interested in. For example, if you wanted to know the height of everyone in Somerset, the population would be every single person within Somerset, while a sample might be 100 people from Somerset.

Descriptive Statistics#

Measures of central tendency#

Mean: The average (or central number) for a set of values. To calculate the mean, we sum the values and divide by the number of values. In mathematical notation this calculation is: \(\bar{x} = \frac{x_1+x_2+x_3+\dots+x_n}{n}\)

# Let's analyse some fake data. Suppose we have data on weekly attendances at A&E stored in a list
emerg_attendances = [
    477,
    513,
    495,
    518,
    476,
    470,
    476,
    508,
    499,
    528,
    528,
    504,
    486,
    554,
    491,
]
# We could calculate the mean from the above definition:
mean_attendances = sum(emerg_attendances) / len(emerg_attendances)
print(mean_attendances)
# Or we could use numpy functionality
np.mean(emerg_attendances)

If the data contains outliers, the mean may not provide the most useful summary.

# List of integers with clear outlier value of 100_000
emerg_attendances_with_outlier = [
    477,
    513,
    495,
    518,
    476,
    470,
    476,
    508,
    499,
    528,
    528,
    504,
    486,
    554,
    491,
    100_000,
]
# The mean is heavily influenced by the outlier
np.mean(emerg_attendances_with_outlier)

Median: the middle value in a set of ordered values.

# Let's find the position of the middle value of a list.
position_of_median = np.floor((len(emerg_attendances) + 1) / 2)
position_of_median
# Find the median using the position_of_median
sorted(emerg_attendances)[int(position_of_median) - 1]
# or using built-in python functionality
np.median(emerg_attendances)

The median handles outliers better.

np.median(emerg_attendances_with_outlier)

Mode: the most common number in a set of data (not used often in practice).

# To get a function to find the mode we will import stats from scipy. Use help(stats) to learn more.
from scipy import stats
stats.mode(emerg_attendances, keepdims=False)

Variance and standard deviation#

Variance and standard deviation tell us how far, on average, the data points are from the mean.

variance = \(\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2\)

standard deviation = \(\sigma = \sqrt{variance}\)

# Take the mean of the attendances and store it in the variable mean_attendances
mean_attendances = np.mean(emerg_attendances)

# Calculate the difference between the mean and each value using list comprehensions
mean_diff = [(x - mean_attendances) for x in emerg_attendances]
mean_diff

We don’t want the negative and positive differences to cancel out, as this would underestimate variation. Therefore we square each difference.

# Using list comprehension to square each element of mean_diff
diff_squared = [x**2 for x in mean_diff]
diff_squared

Finally, we take the mean of this squared difference to arrive at the variance for A&E attendances.

variance = np.mean(diff_squared)
print(variance)
# To make a value more meaningful, we can "undo" the squaring to arrive at the standard deviation
std = variance**0.5  # Note, raising to the power of 0.5 is equivalent to taking the square root
print(std)
# Or we could use built-in python functionality to arrive at the same answer in a single line of code.
np.var(emerg_attendances) ** 0.5
# Or calculate the standard deviation directly
np.std(emerg_attendances)

Task 1 (10 minutes)

A clinician has collected data on the ages of patients using their service. They have provided some data and would like some statistical KPIs.
ages_data = [71, 81, 69, 78, 69, 63, 67, 66, 62, 69, 70, 62, 62, 70, 61, 55, 63, 78, 57, 66]
Using this data calculate:

  1. The mean age

  2. The median age

  3. The standard deviation of ages

# Mean calculation here
# Median calculation here
# Standard deviation calculation here

While the mean and standard deviation provide useful summaries, they don’t tell you the whole story about the distribution of the data. There are many different probability distributions (e.g., depending on whether the measured process is continuous or discrete), but one of the most commonly encountered is the normal distribution.

The normal (Gaussian) distribution#

To define a normal distribution, you need to specify:

  • Mean

  • Standard deviation

There are a number of real-life processes which follow a normal distribution, e.g., people’s blood pressure. Systolic blood pressure in healthy adults has a normal distribution with mean 128.4 mmHg and standard deviation 19.6 mmHg (reference).

Mathematically, this is written as \(X \sim \mathcal{N}(128.4,\, 19.6^{2})\)

  • In plain English, this is saying that X (our variable of interest, blood pressure) is normally distributed with mean 128.4 and standard deviation 19.6.

"""
Feel free to ignore this cell, data visualisations will be covered in Day 3.
"""

import matplotlib.pyplot as plt
from scipy.stats import norm
import scipy


def plot_normal_pdf(mean, std, std_num=0):
    """
    Return normal probability distribution with shaded region showing area covered by std_num standard deviations.

    Args:
        mean (float): Mean value.
        std (float): Standard deviation.
        std_num (float): Number of standard deviations above/below the mean to visualise.

    Returns:
        plotly.graph_objs._figure.Figure: Plotly figure of normal distribution and area within std_num standard deviations.
    """

    # Define x values for CDF
    x = np.arange(mean - std * 5, mean + std * 5, 0.001)

    # Using CDF to calculate area on lower tail
    t1 = scipy.stats.norm(mean, std).cdf(mean - std_num * std)

    # Use symmetry of normal distribution to get upper tail
    t2 = 1 - t1

    # Get points along PDF for lower/upper limits
    lower = norm.ppf(t1, mean, std)
    upper = norm.ppf(t2, mean, std)

    # Area to return
    area = 100 - round(t1 * 2 * 100, 4)

    # Initialise figure
    fig = go.Figure()

    # Define x values for PDF
    xx = np.arange(lower, upper, 0.01)

    # Get y values for PDF
    yy = [norm.pdf(x, mean, std) for x in xx]

    # Add area between specified number of standard deviations around mean
    fig.add_trace(
        go.Scatter(
            x=xx,
            y=yy,
            fill="tozeroy",
            fillcolor="#edae49",
            line=dict(color="#edae49"),
            name=f"{area}% area under curve",
        )
    )

    # Add PDF function to plot
    fig.add_trace(
        go.Scatter(
            x=x,
            y=norm.pdf(x, mean, std),
            name="Normal distribution",
            line=dict(color="#00789c"),
            fillcolor="#edae49",
        )
    )

    fig.update_layout(
        {"paper_bgcolor": "rgba(0, 0, 0, 0)"},
        xaxis_title="Blood pressure",
        yaxis_title="Probability density",
        title={
            "text": f"Normal distribution for blood pressure",
            "x": 0.07,
            "xanchor": "left",
        },
    )

    # Add vertical line to show mean
    fig.add_vline(
        mean,
        line_width=3,
        line_dash="dash",
        line_color="#526D82",
        annotation_text="Mean",
    )

    return fig
# using the plot_normal_pdf function
plot_normal_pdf(mean=128.4, std=19.6)

Probability density function (PDF)#

Typically, to observe whether a process is normally distributed, we plot a histogram (covered in Day 3). In order to extract probabilities for ranges of values from the distribution, we need to use something called a probability density function (PDF). The total area below the PDF is 1.

  • Notice how this links integral calculus and probability theory!
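As a quick sketch of this link (using scipy, which is also used in the hidden plotting cell above), we can numerically approximate the total area under the blood pressure PDF and confirm it is very close to 1:

from scipy.stats import norm

# Approximate the total area under the normal PDF with thin rectangles
step = 0.01
x = np.arange(128.4 - 6 * 19.6, 128.4 + 6 * 19.6, step)
total_area = np.sum(norm.pdf(x, 128.4, 19.6) * step)

print(total_area)  # approximately 1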

The shape of this distribution makes it very easy to spot whether a patient’s blood pressure is within expected ranges. This is where the power of the normal distribution comes into play. We know that:

  • 68% of all patients’ blood pressures will be within 1 standard deviation of the mean.

  • 95% will be within 2 standard deviations of the mean.

  • 99.7% will be within 3 standard deviations of the mean.

This is often referred to as the 68-95-99.7 rule (or the Empirical rule).

# Finding the bounds that contain 99.7% of patients' blood pressure.
mean = 128.4
std = 19.6
lower_bound = mean - (3 * std)
upper_bound = mean + (3 * std)
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")
# Visualise area between bounds
plot_normal_pdf(mean=128.4, std=19.6, std_num=3)

This is often how control limits for SPC charts are calculated! i.e., we expect (assuming a normal distribution) that 99.7% of data points will fall within the control limits!
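As a quick check of the 99.7% figure (a sketch using scipy’s normal distribution functions), the proportion of a normal distribution lying within 3 standard deviations of the mean can be computed directly from the cumulative distribution function:

from scipy.stats import norm

# Probability of falling within 3 standard deviations of the mean (standard normal)
within_3_sd = norm.cdf(3) - norm.cdf(-3)
print(within_3_sd)  # approximately 0.997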

Task 2 (10-15 minutes)

Calculate the mean and standard deviation of the data below.

data = [
    55.65641072,
    48.01818315,
    49.82176254,
    50.45425647,
    55.54134189,
    62.63371532,
    54.15207348,
    45.71053599,
    43.00529417,
    50.30120328,
    55.58464243,
    44.6356539,
    48.59925079,
    50.88434543,
    44.38601424,
    47.56891955,
    62.2533763,
    47.6099892,
    48.78565952,
    51.6846901,
    53.36930845,
    41.01163271,
    52.45308797,
    58.66317657,
    46.98819526,
    54.05535604,
    54.26000618,
    55.329168,
    47.02687401,
    52.36229964,
    47.11341099,
    45.98130137,
    46.18547165,
    44.72855291,
    56.92354436,
    48.02234971,
    55.51503258,
    47.74337786,
    49.52773084,
    44.21196948,
    50.50957096,
    44.08070911,
    48.49120016,
    44.72622151,
    52.62607227,
    50.30850148,
    52.12608983,
    52.57118175,
    51.41407421,
    50.49117488,
]
# Mean and standard deviation calculations here

Use the plot_normal_pdf function and modify the mean and std arguments to plot the PDF.

Note, the argument std_num only relates to the shaded region in the plot and is purely for visualising the area within a certain number of standard deviations from the mean (you can ignore this parameter if you like).

# Plot distribution here

Inferential Statistics#

Inferential statistics involves inferring some population metric from a sample. For example, suppose we are asking patients for feedback on a recent change made at a hospital.

Sample - We select 50 patients at random and ask for their feedback.
Population - We ask every single patient for their feedback (unrealistic!).

Inferential statistics deals with inferring the feedback from all patients at the hospital, based on a sample of 50.

Central Limit Theorem#

Now, for the purpose of this example, we will assume that feedback can be recorded as a score from 1-10.

Additionally, we will assume a uniform distribution of responses, that is, each value from 1-10 is equally likely (this is of course not realistic). In fact, under certain circumstances, it typically doesn’t make a difference what distribution the responses follow (more on this later).

# Lets generate some survey responses:
survey_outcomes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Randomly select (with replacement) 50 values from survey_outcomes
generated_responses = random.choices(survey_outcomes, k=50)
# Lets print each of the 50 responses
print(generated_responses)

Let’s take the mean of these responses to see the average score.

mean_score = np.mean(generated_responses)
print(mean_score)
# Let's repeat this process 1000 times and record the mean for each sample
mean_scores = []
for _ in range(1000):
    simulated_responses = random.choices(survey_outcomes, k=50)
    mean_scores.append(np.mean(simulated_responses))
# list of 1000 sample means (all of sample size = 50)
mean_scores

It’s not very easy to see how these values are distributed from observing a list with 1000 values!

To easily see how the scores are distributed, we can use a histogram (more information on histograms will be covered in Day 3 of the course).

The histogram will plot the frequency of obtaining mean_scores within specific ranges.

"""
!!Feel free to ignore this cell, data visualisations will be covered on Day 3!!
"""

# Create histogram plot
fig = px.histogram(
    x=mean_scores,
    color_discrete_sequence=["#00789c"],  # color of histogram bars
    nbins=30,
)

# Update layout
fig.update_layout(
    xaxis_title="Mean scores",
    yaxis_title="Frequency",
    template="seaborn",
    title={
        "text": f"Distribution of average sample scores with sample size = 50",
        "x": 0.07,
        "xanchor": "left",
    },
)

# Display plot
fig.show()

The above histogram is similar in shape to the normal distribution. Let’s repeat this process, but now we will repeat the sample collection 10,000 times and plot the distribution for 4 different sample sizes: 1, 10, 50 (same as before) and 100.

"""
!!Feel free to ignore this cell, data visualisations will be covered on Day 3!!
"""

from plotly.subplots import make_subplots

# Initialise subplot with 2 rows and 2 columns
fig = make_subplots(rows=2, cols=2)

# Number of experiments to perform
num_experiments = 10_000

# List of sample numbers to consider
num_samples = [1, 10, 50, 100]

# Initialise empty dictionary
samples = {}

for i, n in enumerate(num_samples):
    # List to store the averages of the results of each experiment
    means_experiments = []

    # Simulation of experiments
    for _ in range(num_experiments):

        # Our sample
        results = random.choices(survey_outcomes, k=n)
        #         results = np.random.poisson(5, n)
        #         results = np.random.normal(5, 2, n)
        #         survey_outcomes_extreme_skew = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 10]
        mean_experiments = np.mean(results)  # Calculation of the mean
        means_experiments.append(mean_experiments)

    samples[n] = means_experiments

# Add traces to figure.
fig.append_trace(
    go.Histogram(x=samples[1], nbinsx=10, name="Sample size = 1"), row=1, col=1
)
fig.append_trace(
    go.Histogram(x=samples[10], nbinsx=30, name="Sample size = 10"), row=2, col=1
)
fig.append_trace(
    go.Histogram(x=samples[50], nbinsx=30, name="Sample size = 50"), row=1, col=2
)
fig.append_trace(
    go.Histogram(x=samples[100], nbinsx=30, name="Sample size = 100"), row=2, col=2
)

# Updating figure layout
fig.update_layout(
    {"paper_bgcolor": "rgba(0, 0, 0, 0)"},
    template="seaborn",
    title={
        "text": f"Distribution of average sample scores (with varying sample sizes)<br><sup>The population mean is 5.5 (this is based on the expected value formula for a uniform distribution).</sup>",
        "x": 0.07,
        "xanchor": "left",
    },
)

fig.show()

Some important takeaways from the central limit theorem (CLT):

  1. The mean of the sample means is equal to the population mean.

  2. If the population distribution is normal, the sample means will be normally distributed.

  3. If the population distribution is not normal, but the sample size is greater than 30, the sample means will (usually) approximate a normal distribution.

  4. The standard deviation of the sample means will equal the population standard deviation divided by the square root of the sample size (this is known as the standard error). Notice the relationship between standard error and sample size in the plots above.

For the sampling distribution of the mean, the standard error = \(\frac{{\sigma}}{\sqrt{n}}\)
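As a sketch (assuming the simulation cell above has been run, so the samples dictionary is available), we can check point 4: the survey scores follow a discrete uniform distribution over 1-10, whose standard deviation is roughly 2.87, so for a sample size of 50 the standard error should be roughly \(2.87 / \sqrt{50} \approx 0.41\).

# Population standard deviation of the 1-10 survey scores
population_std = np.std(survey_outcomes)
print(population_std)  # roughly 2.87

# Standard deviation of the simulated sample means (sample size = 50)
print(np.std(samples[50]))

# Standard error predicted by the CLT
print(population_std / np.sqrt(50))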

Why is the CLT important in data science?

The CLT allows us to infer useful things about populations based on samples, even for non-normal distributions or (even more importantly) if we don’t know the underlying distribution. For example, we can apply the Empirical rule to our sample to quantify confidence in our estimate.

Confidence intervals#

In reality, we are not going to collect thousands of samples and plot their distribution. Instead, we will collect 1 sample and infer a population parameter (such as the mean) from it. A confidence interval provides a measure of uncertainty associated with our sample estimate.

For example, a 95% confidence interval means that, if we were to repeat the sampling process many times, about 95% of the resulting confidence intervals would contain the true population parameter.

If our sample size is greater than 30, we can usually assume the sampling distribution approximates a normal distribution (by the CLT).

We can therefore apply exactly the same Empirical rule discussed in the normal distribution section to calculate confidence intervals, except now we use the sample mean as the mean and the standard error as the standard deviation (standard error = \(\frac{{\sigma}}{\sqrt{n}}\)).
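Putting this together, an (approximate) 95% confidence interval for the mean can be written as \(\bar{x} \pm 2 \times \frac{\sigma}{\sqrt{n}}\), where the 2 comes from the Empirical rule (1.96 is the more precise multiplier for 95%).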

Task 3 (15 minutes)

We’re now going to see the power of the CLT in action!

A random sample of 50 patients was collected and, based on their feedback, we will infer the feedback score for all patients using the hospital.

# Again, we gather a sample on 50 patients at a hospital.
sample_data = [
    7,
    6,
    6,
    8,
    7,
    5,
    6,
    5,
    5,
    6,
    6,
    7,
    6,
    6,
    6,
    6,
    7,
    5,
    6,
    5,
    3,
    6,
    6,
    5,
    8,
    4,
    6,
    5,
    7,
    7,
    6,
    6,
    5,
    4,
    5,
    6,
    7,
    7,
    5,
    5,
    4,
    4,
    4,
    7,
    5,
    5,
    4,
    6,
    4,
    5,
]

We know the standard deviation (\(\sigma\)) is 1. Calculate the mean of this sample and use the standard error equation to compute a 95% confidence interval for the mean response score.

# Your code here

Suggested reading:

  1. Essential math for data science (chapter 3).

  2. https://www.statlearning.com/