Another section down! Let's pull together the statistical measures learned so far to analyze a dataset and produce some business recommendations.
You will be able to:
- Recall the concepts and applications of measures of central tendency and dispersion
- Practice applying and interpreting measures of central tendency and dispersion
- Recall the concepts and applications of covariance and correlation
- Practice applying and interpreting covariance and correlation
Photo by Scott Graham on Unsplash
Imagine you work for a company that sells widgets1 and your boss has asked you to look into the sales data across your media markets for this year. She wants to know:
- What sales volume do we have in a typical market?
- How variable are sales across markets?
- If we have 25k more dollars to spend in advertising per market, should we spend it on TV, radio, or newspaper ads?
1Here we are using the second definition of widget: "an unnamed article considered for purposes of hypothetical example"
For this lab we will be using a popular dataset known as the "Advertising Dataset". It comes from An Introduction to Statistical Learning with Applications in R by G. James, D. Witten, T. Hastie and R. Tibshirani. We have downloaded this dataset for you and stored it in this repository.
This dataset contains four lists. Each number in each list represents the value for that list in a given market. The four lists are:
sales
: the number of widgets sold (in thousands)tv
: the amount of money (in thousands of dollars) spent on TV adsradio
: the amount of money (in thousands of dollars) spent on radio adsnewspaper
: the amount of money (in thousands of dollars) spent on newspaper ads
So, for example:
- the third number from each list represents the value of
sales
,tv
,radio
, andnewspaper
in one market, - the fourth number from each list represents the value of
sales
,tv
,radio
, andnewspaper
in another market,
and so on.
Write code that describes the number of markets a given list has records for, as well as the sales numbers for the markets with the minimum and maximum sales.
Use a measure of central tendency to describe a "typical" market's sales.
Use a measure of dispersion to describe how variable sales are across markets.
Calculate the correlation between TV, radio, and newspaper ad spending and widget sales.
Use the findings from step 4 to make a recommendation.
In the cell below, we've opened up the dataset and loaded it into lists named sales
, tv
, radio
, and newspaper
.
# Run this cell without changes
import pandas as pd
data = pd.read_csv("advertising.csv", index_col=0)
sales = list(data["sales"])
tv = list(data["TV"])
radio = list(data["radio"])
newspaper = list(data["newspaper"])
# display the first 10 sales amounts
sales[:10]
Replace None
with appropriate code so that this cell prints out the correct information. For this part, you only need to use the sales
variable.
Reminder: Replace None
with code that calculates the answer. Don't calculate the answer by hand and then replace None
with the number of your answer!
# Replace None with appropriate code
num_markets = None
min_sales = None
max_sales = None
print(f"""
This dataset contains records for {num_markets} markets
The fewest sales for any market was {min_sales} thousand widgets
The most sales for any market was {max_sales} thousand widgets
""")
Run this code to create a histogram of all sales data:
# Run this cell without changes
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(sales, bins=15)
ax.set_xlabel("Sales (thousands of widgets)")
ax.set_ylabel("Count")
ax.set_title("Distribution of Sales across Markets");
Now we should be able to address the first business question: What sales volume do we have in a typical market?
That sounds like a question to be answered by a measure of central tendency.
Reminder: the three measures of central tendency we've introduced are:
- Mean
- Median
- Mode
Choose the measure that seems most reasonable to you (it's a judgment call โ there isn't always a single correct answer!) and complete the cell below, using NumPy or SciPy to compute the measure.
# Replace None or <None> with appropriate code
import numpy as np
from scipy import stats
# Assign measure_central_tendency to the mean, median, or mode of the sales data
measure_central_tendency = None
print(f"""
Typical sales volume is {measure_central_tendency} thousand widgets
I chose <None> as the relevant measure because <None>
""")
Now that we have a number to represent the typical sales volume, let's answer: How variable are sales across markets?
That sounds like a question to be answered by a measure of dispersion (also known as a measure of spread).
Reminder: the measures of dispersion we've introduced are:
- (Average) absolute deviation
- Variance
- Standard deviation
- Interquartile range
Choose the measure that seems the most reasonable to you, and write up your answer in the cell below, following the format from the previous question (first calculating the measure, then explaining your answer).
# Your answer here
Now that we have a general understanding of the distribution of the sales data, we can start to answer: If we have 25k more dollars to spend in advertising per market, should we spend it on TV, radio, or newspaper ads?
(Eventually we will learn more sophisticated multivariate modeling techniques that will allow us to simulate the impacts of different choices here such as given TV spending of $x_1$ and radio spending of $x_2$, how would increasing newspaper spending by 25k impact $y$?, but for now we will just use the tools we have learned so far.)
In order to make this recommendation, let's find the correlation between each advertising medium and the associated sales.
(Recall that covariance is the numerator of the correlation formula, and that we typically use correlation rather than just covariance because its magnitude is more interpretable.)
In the following cell, compute the correlation between sales
and tv
, radio
, and newspaper
using NumPy.
# Replace None with appropriate code
tv_corr = None
radio_corr = None
newspaper_corr = None
print("Correlation of Sales and TV Ad Spending:", tv_corr)
print("Correlation of Sales and Radio Ad Spending:", radio_corr)
print("Correlation of Sales and Newspaper Ad Spending:", newspaper_corr)
Which type of ad spending has the highest correlation?
# Replace <None> with TV, radio, or newspaper
print("<None> has the highest correlation with sales")
Let's also plot out each of the ad types vs. sales:
# Run this cell without changes
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(10,3), sharey=True)
ax1.scatter(tv, sales)
ax2.scatter(radio, sales)
ax3.scatter(newspaper, sales)
ax1.set_xlabel("TV")
ax2.set_xlabel("Radio")
ax3.set_xlabel("Newspaper")
ax1.set_ylabel("Sales")
fig.suptitle("Advertising Expenditure vs. Sales Across Media");
Based on the correlation numbers and a visual inspection of those plots, make a recommendation to your boss about where to spend 25k extra dollars per market and why.
# Your answer here
In this cumulative lab, you practiced analyzing sales and advertising data in order to make a business recommendation. Unlike some other labs, there was more ambiguity and we asked you to make some judgment calls in order to use data science concepts for a business audience. In the rest of the course, you will continue building your technical skills and technical communication skills!