In the previous lab, we saw that if we have the standard deviation for the population, we can use use
If, on the other hand, the standard deviation of the population is not known (which is usually the case), you have to use the standard deviation of your sample as a stand-in when creating confidence intervals. Since the sample standard deviation is often different than that of the population, further potential errors are introduced to our confidence intervals. To account for this error, we use what's known as a t-critical value instead of the
The t-critical value is drawn from what's known as a t-distribution.
A t-distribution closely resembles the normal distribution but gets wider and wider as the sample size falls.
The t-distribution is available in scipy.stats
with the nickname "t" so we can get t-critical values with stats.t.ppf()
.
You will be able to:
- Calculate confidence intervals
- Interpret confidence intervals in relation to true population parameters
# Import the necessary libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import random
import math
Let's investigate point estimates by generating a population of random age data collected at two different locations and then drawing a sample from it to estimate the mean:
np.random.seed(20)
population_ages1 = np.random.normal(20, 4, 10000)
population_ages2 = np.random.normal(22, 3, 10000)
population_ages = np.concatenate((population_ages1, population_ages2))
pop_ages = pd.DataFrame(population_ages)
pop_ages.hist(bins=100,range=(5,33),figsize=(9,9))
pop_ages.describe()
Let's take a new, smaller sample (of size smaller than 30) and calculate how much the sample mean differs from the population mean.
np.random.seed(23)
sample_size = 25
sample = None # Take a random sample of size 25 from above population
sample_mean = None # Calculate sample mean
# Print sample mean and difference of sample and population mean
# Sample Mean: 19.870788629471857
# Mean Difference: 1.1377888781920937
We can see that the sample mean differs from the population mean by 1.13 years. We can calculate a confidence interval without the population standard deviation, using the t-distribution using stats.t.ppf(q, df)
function. This function takes in a value for the confidence level required (q) with "degrees of freedom" (df).
In this case, the number of degrees of freedom,
df
, is equal to the sample size minus 1, ordf = sample_size - 1
.
# Calculate the t-critical value for 95% confidence level for sample taken above.
t_critical = None # Get the t-critical value by using 95% confidence level and degree of freedom
print("t-critical value:") # Check the t-critical value
#print(t_critical)
# t-critical value:
# 2.0638985616280205
Calculate the confidence interval of the sample by sigma and calculating the margin of error as:
sigma = sample_std/โn
Margin of Error = t-critical-value * sigma
and finally the confidence interval can be calculated as :
Confidence interval = (sample_mean - margin of error, sample_mean + margin of error)
# Calculate the sample standard deviation
sample_stdev = None # Get the sample standard deviation
# Calculate sigma using the formula described above to get population standard deviation estimate
sigma =None
# Calculate margin of error using t_critical and sigma
margin_of_error = None
# Calculate the confidence intervals using calculated margin of error
confidence_interval = None
# print("Confidence interval:")
#print(confidence_interval)
# Confidence interval:
# (18.4609156900928, 21.280661568850913)
We can verify our calculations by using the Python function stats.t.interval()
:
stats.t.interval(alpha = 0.95, # Confidence level
df= 24, # Degrees of freedom
loc = sample_mean, # Sample mean
scale = sigma) # Standard deviation estimate
# (18.4609156900928, 21.280661568850913)
We can see that the calculated confidence interval includes the population mean calculated above.
Let's run the code multiple times to see how often our estimated confidence interval covers the population mean value:
Write a function using the code above that takes in sample data and returns confidence intervals
# Function to take in sample data and calculate the confidence interval
def conf_interval(sample):
'''
Input: sample
Output: Confidence interval
'''
n = len(sample)
x_hat = sample.mean()
# Calculate the z-critical value using stats.norm.ppf()
# Note that we use stats.t.ppf with q = 0.975 to get the desired t-critical value
# instead of q = 0.95 because the distribution has two tails.
t = None # t-critical value for 95% confidence
sigma = None # Sample standard deviation
# Calculate the margin of error using formula given above
moe = None
# Calculate the confidence interval by applying margin of error to sample mean
# (mean - margin of error, mean+ margin of error)
conf = None
return conf
Call the function 25 times taking different samples at each iteration and calculating the sample mean and confidence intervals
set random seed for reproducability
np.random.seed(12)
# Select the sample size
sample_size = 25
# Initialize lists to store interval and mean values
intervals = []
sample_means = []
# Run a for loop for sampling 25 times and calculate + store confidence interval and sample mean values in lists initialised above
for sample in range(25):
# Take a random sample of chosen size
# Calculate sample mean and confidence_interval
# Calculate and append sample means and conf intervals for each iteration
Plot the confidence intervals along with the sample means and population mean
# Plot the confidence intervals with sample and population means
# Draw the mean and confidence interval for each sample
# Draw the population mean
# Draw the mean and confidence interval for each sample
Just like the last lab, all but one of the 95% confidence intervals overlap the red line marking the true mean. This is to be expected: since a 95% confidence interval captures the true mean 95% of the time, we'd expect our interval to miss the true mean 5% of the time.
In this lab, we learned how to use confidence intervals when the population standard deviation is not known, and the sample size is small (<30). We also saw how to construct them from random samples. We also learned the differences between the use cases for the