For this project, you'll be working with the Northwind database--a free, open-source dataset created by Microsoft containing data from a fictional company. You probably remember the Northwind database from our section on Advanced SQL.
The goal of this project is to test your ability to gather information from a real-world database and use your knowledge of statistical analysis and hypothesis testing to generate analytical insights that can be of value to the company.
To do that, you'll query the database for the data you need and then run one or more hypothesis tests to answer the following questions:
- Question 1: Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?
- Question 2: Is there a statistically significant difference in performance between employees from the USA and the UK?
- Question 3: Is there a statistically significant difference in discounts given by USA and UK employees?
- Question 4: Is there a statistically significant difference in demand of produce each month?
- Question 5: Is there a statistically significant difference in discount between categories?
- Question 6: Is there a statistically significant difference in performance of shipping companies?
import sqlite3 # for database
import pandas as pd # for dataframe
import matplotlib.pyplot as plt # plotting
import seaborn as sns # plotting
import numpy as np # analysis
from scipy import stats # significance levels, normality
import itertools # for combinations
import statsmodels.api as sm # anova
from statsmodels.formula.api import ols
import warnings
warnings.filterwarnings('ignore') # hide matplotlib warnings
# Connecting to database
conn = sqlite3.connect('Northwind_small.sqlite')
c = conn.cursor()
# List of all tables
tables = c.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
tables = [i[0] for i in tables]
# Loop to put all tables into pandas dataframes
dfs = []
for i in tables:
    table = c.execute('select * from "'+i+'"').fetchall()
    columns = c.execute('PRAGMA table_info("'+i+'")').fetchall()
    df = pd.DataFrame(table, columns=[col[1] for col in columns])
    # Build a variable name from the table name
    foo = i+"_df"
    exec(foo + " = df") # => TableName_df
    # Keep all dataframe names in the list to remember what we have
    dfs.append(foo)
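As an alternative to creating variables with `exec`, the tables could be kept in a dictionary keyed by table name. The sketch below is only an illustration: the helper name `load_tables` and the in-memory `Demo` table are assumptions made so the example runs on its own, and they are not part of the original notebook.

```python
import sqlite3
import pandas as pd

def load_tables(conn):
    """Return {table_name: DataFrame} for every table in the database."""
    names = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type='table';", conn
    )["name"]
    # Quote table names because some (e.g. Order) are SQL keywords
    return {name: pd.read_sql_query(f'SELECT * FROM "{name}"', conn) for name in names}

# Stand-in database with hypothetical data so the sketch is self-contained
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute('CREATE TABLE "Demo" (Id INTEGER, Quantity INTEGER)')
demo_conn.executemany('INSERT INTO "Demo" VALUES (?, ?)', [(1, 10), (2, 20)])
demo_conn.commit()

tables_by_name = load_tables(demo_conn)
print(tables_by_name["Demo"])
```

With the real database, `load_tables(conn)["OrderDetail"]` would play the role of `OrderDetail_df`.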
The Product table has 77 entries; each entry is a unique product.
First, we can check visually whether a discount really makes a difference in order quantity.
discount = OrderDetail_df[OrderDetail_df['Discount']!=0].groupby('ProductId')['Quantity'].mean()
no_discount = OrderDetail_df[OrderDetail_df['Discount']==0].groupby('ProductId')['Quantity'].mean()
plt.figure(figsize=(16,5))
plt.bar(discount.index, discount.values, alpha=1, label='Discount', color='#a0b0f0')
plt.bar(no_discount.index, no_discount.values, alpha=0.8, label='No Discount', color='#c9f9a0')
plt.legend()
plt.title('Order quantities with/without discount')
plt.xlabel('Product ID')
plt.ylabel('Average quantities')
plt.show()
print('Conclusion')
# Share of products with a larger average quantity when discounted, rounded to a whole percent
print("On average {:.1f}% of discounted products were sold in larger quantities".format(round(sum(discount.values > no_discount.values)/len(discount.values)*100)))
print("Average order quantity with discount - {} items, without - {} items".format(round(discount.values.mean(),2), round(no_discount.values.mean(),2)))
Conclusion
On average 70.0% of discounted products were sold in larger quantities
Average order quantity with discount - 26.43 items, without - 21.81 items
This suggests that customers tend to buy more of a product when it is discounted.
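The element-wise comparison above relies on both Series listing the same products in the same order. A minimal sketch of a comparison that aligns on `ProductId` first is shown here; the `orders` frame is hypothetical data invented for the illustration.

```python
import pandas as pd

# Hypothetical order lines; product 3 was never sold at a discount
orders = pd.DataFrame({
    "ProductId": [1, 1, 2, 2, 3],
    "Discount":  [0.0, 0.1, 0.0, 0.2, 0.0],
    "Quantity":  [10, 15, 20, 12, 7],
})

discounted = orders["Discount"] != 0
# Building a DataFrame from two Series aligns them on the ProductId index
avg_qty = pd.DataFrame({
    "discount": orders[discounted].groupby("ProductId")["Quantity"].mean(),
    "no_discount": orders[~discounted].groupby("ProductId")["Quantity"].mean(),
}).dropna()  # keep only products present in both groups

share_larger = (avg_qty["discount"] > avg_qty["no_discount"]).mean() * 100
print(f"{share_larger:.1f}% of products sold in larger quantities when discounted")
```

If every product appears in both groups, as the original comparison assumes, both approaches give the same percentage.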
To test this hypothesis, we will run an experiment.
First, let's see which discount levels we have, how many orders each level has, and the average order quantity at each level.
# Let's get all discount levels
discounts = OrderDetail_df['Discount'].unique()
discounts.sort()
print('Discount levels')
print(discounts)
Discount levels
[0. 0.01 0.02 0.03 0.04 0.05 0.06 0.1 0.15 0.2 0.25]
# Group orders by discount amounts
# Each group is a DataFrame containing orders with certain discount level
groups = {}
for i in discounts:
    groups[i] = OrderDetail_df[OrderDetail_df['Discount']==i]
# Create new DataFrame with Discounts and Order quantities
# (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
rows = []
for i in groups.keys():
    rows.append({'Discount %':i*100,'Orders':len(groups[i]),'Avg. Order Quantity':groups[i]['Quantity'].mean()})
discounts_df = pd.DataFrame(rows, columns=['Discount %','Orders','Avg. Order Quantity'])
discounts_df
| | Discount % | Orders | Avg. Order Quantity |
---|---|---|---|
0 | 0.0 | 1317.0 | 21.715262 |
1 | 1.0 | 1.0 | 2.000000 |
2 | 2.0 | 2.0 | 2.000000 |
3 | 3.0 | 3.0 | 1.666667 |
4 | 4.0 | 1.0 | 1.000000 |
5 | 5.0 | 185.0 | 28.010811 |
6 | 6.0 | 1.0 | 2.000000 |
7 | 10.0 | 173.0 | 25.236994 |
8 | 15.0 | 157.0 | 28.382166 |
9 | 20.0 | 161.0 | 27.024845 |
10 | 25.0 | 154.0 | 28.240260 |
The table above shows that the 1%, 2%, 3%, 4% and 6% discount levels have only one to three orders each, which is too few to draw any conclusion. The quantities in those orders are also very small (about 1 to 2 items on average), so these discounts look as if they had a negative impact, possibly because they were part of a promotion for a new product or for some other reason.
Let's drop these discount levels from our experiment.
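One way to drop sparsely populated discount levels is sketched below. The `orders` frame and the `MIN_ORDERS` cut-off are assumptions chosen only for this illustration; the notebook itself simply hard-codes the five remaining levels later on.

```python
import pandas as pd

# Hypothetical order lines: the 1% discount occurs only once
orders = pd.DataFrame({
    "Discount": [0.0, 0.0, 0.0, 0.05, 0.05, 0.05, 0.01],
    "Quantity": [10, 12, 8, 20, 25, 30, 2],
})

MIN_ORDERS = 3  # assumed cut-off, chosen only for this illustration

# Count orders per discount level and keep the levels with enough data
counts = orders["Discount"].value_counts()
kept_levels = counts[counts >= MIN_ORDERS].index
filtered = orders[orders["Discount"].isin(kept_levels)]

print(sorted(filtered["Discount"].unique()))
```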
Bootstrapping is a type of resampling where large numbers of smaller samples of the same size are repeatedly drawn, with replacement, from a single original sample.
def bootstrap(sample, n):
    bootstrap_sampling_dist = []
    for i in range(n):
        bootstrap_sampling_dist.append(np.random.choice(sample, size=len(sample), replace=True).mean())
    return np.array(bootstrap_sampling_dist)
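As a hedged illustration of how a bootstrap sampling distribution can be used, the sketch below draws all resamples at once with NumPy and reads a percentile interval for the mean from them. The sample is randomly generated stand-in data, and the vectorized indexing is a substitute for the loop in `bootstrap`, not the notebook's own code.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible

# Hypothetical sample of order quantities
sample = rng.integers(1, 60, size=200)

# Draw all resamples at once: one row of indices per bootstrap resample
n_resamples = 2000
idx = rng.integers(0, len(sample), size=(n_resamples, len(sample)))
boot_means = sample[idx].mean(axis=1)

# Percentile interval for the mean
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```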
Cohen's d is an effect size used to indicate the standardised difference between two means. It can be used, for example, to accompany reporting of t-test and ANOVA results. It is also widely used in meta-analysis. Cohen's d is an appropriate effect size for the comparison between two means.
def Cohen_d(group1, group2):
    diff = group1.mean() - group2.mean()
    n1, n2 = len(group1), len(group2)
    var1 = group1.var()
    var2 = group2.var()
    # Calculate the pooled variance, weighting each group's variance by its size
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    # Calculate Cohen's d statistic
    d = diff / np.sqrt(pooled_var)
    return abs(d)
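The `Cohen_d` function above weights each group's variance by its size `n`. A commonly used textbook variant weights by `n - 1` instead; a minimal sketch follows, where the function name `cohen_d_pooled` and the two small groups are made up for the example. For groups as large as the ones in this project the two versions should give nearly the same value.

```python
import numpy as np

def cohen_d_pooled(group1, group2):
    """Cohen's d using the (n - 1)-weighted pooled standard deviation."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    var1, var2 = g1.var(ddof=1), g2.var(ddof=1)  # sample variances
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return abs(g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Hypothetical groups
a = [2, 4, 6, 8]
b = [1, 3, 5, 7]
print(cohen_d_pooled(a, b))
```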
def visualization(control, experimental):
    plt.figure(figsize=(10,6))
    sns.distplot(experimental, bins=50, label='Experimental')
    sns.distplot(control, bins=50, label='Control')
    plt.axvline(x=control.mean(), color='k', linestyle='--')
    plt.axvline(x=experimental.mean(), color='k', linestyle='--')
    plt.title('Control and Experimental Sampling Distributions', fontsize=14)
    plt.xlabel('Distributions')
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()
Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?
- $H_0$ : there is no difference in order quantity due to discount
- $H_\alpha$ : there is an increase in order quantity due to discount
A discount usually increases order quantity, so it is reasonable to perform a one-tailed test. The code below rejects $H_0$ when the p-value is below 0.025.
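In statistics, Welch's t-test, or unequal variances t-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means. Note that `stats.ttest_ind` runs Student's t-test (equal variances assumed) by default; the Welch variant requires `equal_var=False`.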
First, I created two distributions (control and experimental). The control distribution includes only order quantities without a discount, and the experimental distribution includes order quantities with a discount at any level.
This experiment answers the question of whether there is any difference in purchase quantity.
control = OrderDetail_df[OrderDetail_df['Discount']==0]['Quantity']
experimental = OrderDetail_df[OrderDetail_df['Discount']!=0]['Quantity']
t_stat, p = stats.ttest_ind(control, experimental)
d = Cohen_d(experimental, control)
print('Reject Null Hypothesis') if p < 0.025 else print('Failed to reject Null Hypothesis')
print("Cohen's d:", d)
visualization(control, experimental)
Reject Null Hypothesis
Cohen's d: 0.2862724481729283
The result of the experiment shows a statistically significant difference in order quantities, hence we reject the null hypothesis.
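For reference, here is a minimal sketch of how a Welch test and its one-tailed version can be requested from SciPy. The two samples are randomly generated stand-ins, not the Northwind quantities, and the `alternative` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical order quantities with unequal spread and group size
control = rng.normal(loc=22, scale=17, size=1300)
experimental = rng.normal(loc=27, scale=21, size=800)

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_two, p_two = stats.ttest_ind(experimental, control, equal_var=False)

# One-tailed version: is the experimental mean greater than the control mean?
t_one, p_one = stats.ttest_ind(
    experimental, control, equal_var=False, alternative="greater"
)

print(f"two-sided p = {p_two:.3g}, one-sided p = {p_one:.3g}")
```

When the t statistic is positive, the one-sided p-value is half of the two-sided one, which is why a one-tailed test at a given significance level is easier to pass than a two-tailed test at the same level.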
The question also asks whether order quantity differs between discount levels.
The next step is to find out at which discount levels we see a statistically significant difference in order quantities.
We will follow the same process as in the previous experiment, but this time we'll break the experimental group up by discount amount.
discounts = [0.05, 0.1, 0.15, 0.2, 0.25]
control = OrderDetail_df[OrderDetail_df['Discount']==0]['Quantity']
# Collect one row per discount level (DataFrame.append was removed in pandas 2.0)
rows = []
for i in discounts:
    experimental = OrderDetail_df[OrderDetail_df['Discount']==i]['Quantity']
    st, p = stats.ttest_ind(control, experimental)
    d = Cohen_d(experimental, control)
    rows.append({ 'Discount %' : str(i*100)+'%' , 'Null Hypothesis' : 'Reject' if p < 0.025 else 'Failed', 'Cohens d' : d })
discounts_significance_df = pd.DataFrame(rows, columns=['Discount %','Null Hypothesis','Cohens d'])
discounts_significance_df
| | Discount % | Null Hypothesis | Cohens d |
---|---|---|---|
0 | 5.0% | Reject | 0.346877 |
1 | 10.0% | Reject | 0.195942 |
2 | 15.0% | Reject | 0.372404 |
3 | 20.0% | Reject | 0.300712 |
4 | 25.0% | Reject | 0.366593 |
The result of the test shows a statistically significant difference in quantity between orders with no discount and orders with a discount of 5%, 10%, 15%, 20% or 25%. Hence we reject the null hypothesis at every level.
- $H_0$ : there is no difference in order quantity between discounts
- $H_\alpha$ : there is a difference in order quantity between discounts
discounts = np.array([0.05, 0.1, 0.15, 0.2, 0.25])
comb = itertools.combinations(discounts, 2)
# Collect one row per pair of discount levels (DataFrame.append was removed in pandas 2.0)
rows = []
for i in comb:
    control = OrderDetail_df[OrderDetail_df['Discount']==i[0]]['Quantity']
    experimental = OrderDetail_df[OrderDetail_df['Discount']==i[1]]['Quantity']
    st, p = stats.ttest_ind(experimental, control)
    d = Cohen_d(experimental, control)
    rows.append({ 'Discount %' : str(i[0]*100)+'% - '+str(i[1]*100)+'%', 'Null Hypothesis' : 'Reject' if p < 0.05 else 'Failed', 'Cohens d' : d })
discount_levels_df = pd.DataFrame(rows, columns=['Discount %','Null Hypothesis','Cohens d'])
discount_levels_df.sort_values('Cohens d', ascending=False)
| | Discount % | Null Hypothesis | Cohens d |
---|---|---|---|
4 | 10.0% - 15.0% | Failed | 0.149332 |
6 | 10.0% - 25.0% | Failed | 0.145146 |
0 | 5.0% - 10.0% | Failed | 0.127769 |
5 | 10.0% - 20.0% | Failed | 0.089008 |
7 | 15.0% - 20.0% | Failed | 0.068234 |
9 | 20.0% - 25.0% | Failed | 0.062415 |
2 | 5.0% - 20.0% | Failed | 0.047644 |
1 | 5.0% - 15.0% | Failed | 0.017179 |
3 | 5.0% - 25.0% | Failed | 0.010786 |
8 | 15.0% - 25.0% | Failed | 0.006912 |
Result of the test shows that there is no statistically significant difference in order quantity between discounts of 5%, 10%, 15%, 20% and 25%.
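Running many pairwise tests raises the chance of a false positive somewhere in the set. Since none of the comparisons above rejected the null hypothesis, a correction would not change the conclusion here, but as a sketch of how one could be applied, the example below uses the Bonferroni method from statsmodels on randomly generated stand-in groups (the data and group sizes are assumptions, not the Northwind values).

```python
import itertools
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
# Hypothetical quantities for three discount levels
groups = {
    0.05: rng.normal(28, 20, size=180),
    0.10: rng.normal(25, 20, size=170),
    0.15: rng.normal(28, 20, size=160),
}

pairs = list(itertools.combinations(groups, 2))
p_values = [
    stats.ttest_ind(groups[a], groups[b], equal_var=False).pvalue
    for a, b in pairs
]

# Bonferroni: each p-value is multiplied by the number of tests (capped at 1)
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for (a, b), p_raw, p_adj, rej in zip(pairs, p_values, p_adjusted, reject):
    print(f"{a:.0%} vs {b:.0%}: p = {p_raw:.3f}, adjusted p = {p_adj:.3f}, reject = {rej}")
```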
- $H_0$ : There is no difference in performance between US and UK employees
- $H_\alpha$ : There is a difference in performance between US and UK employees
How do we measure the performance of an employee? It could be done in different ways, such as:
- survey of the customers
- number of orders they were able to process
- time it took them to process the orders
- etc...
To find out whether performance differs, we will perform two tests.
employees_orders = pd.read_sql_query( '''
SELECT O.EmployeeId, E.Country, COUNT(O.Id) AS Total_Orders
FROM [Order] AS O
JOIN Employee as E
ON O.EmployeeId = E.Id
GROUP BY O.EmployeeId
''' ,conn)
employees_orders
| | EmployeeId | Country | Total_Orders |
---|---|---|---|
0 | 1 | USA | 123 |
1 | 2 | USA | 96 |
2 | 3 | USA | 127 |
3 | 4 | USA | 156 |
4 | 5 | UK | 42 |
5 | 6 | UK | 67 |
6 | 7 | UK | 72 |
7 | 8 | USA | 104 |
8 | 9 | UK | 43 |
Even without a significance test, we can tell there is a big difference in the total number of orders the two groups were able to process in two years.
# ANOVA Test
formula = 'Total_Orders ~ C(Country)'
lm = ols(formula, employees_orders).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
sum_sq df F PR(>F)
C(Country) 9446.755556 1.0 22.640129 0.002064
Residual 2920.800000 7.0 NaN NaN
The result of the ANOVA test shows a statistically significant difference in the number of orders between the two groups of employees from the USA and the UK.
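With only two groups, a one-way ANOVA is equivalent to a pooled-variance t-test: the F statistic equals the squared t statistic and the p-values coincide. The sketch below checks this with SciPy, using the order counts copied from the employee table above.

```python
from scipy import stats

# Total orders per employee, copied from the table above
usa_orders = [123, 96, 127, 156, 104]
uk_orders = [42, 67, 72, 43]

f_stat, p_anova = stats.f_oneway(usa_orders, uk_orders)
t_stat, p_ttest = stats.ttest_ind(usa_orders, uk_orders)  # pooled-variance t-test

print(f"F = {f_stat:.4f}, p = {p_anova:.6f}")
print(f"t^2 = {t_stat**2:.4f}, p = {p_ttest:.6f}")
```

Note that each group contains only a handful of employees, so the result rests on very small samples.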
My suspicion was that the group from the USA covers a larger territory, but that turned out not to be the case.
I want to investigate employee performance further and compare order processing times; maybe that explains such a big difference in the number of orders.
In this test I want to find out whether the number of orders is affected by how fast employees can process them.
usa_uk = pd.read_sql_query('''
SELECT O.Id, O.OrderDate, O.ShippedDate, E.Country FROM [Order] AS O
JOIN Employee AS E
ON O.EmployeeId = E.Id
''',conn)
usa_uk.OrderDate = pd.to_datetime(usa_uk.OrderDate)
usa_uk.ShippedDate = pd.to_datetime(usa_uk.ShippedDate)
usa_uk['ProcessingTime'] = usa_uk.ShippedDate - usa_uk.OrderDate
usa_uk.ProcessingTime = usa_uk.ProcessingTime.dt.days
usa_uk.dropna(inplace=True)
usa = usa_uk[usa_uk.Country == 'USA']['ProcessingTime']
uk = usa_uk[usa_uk.Country == 'UK']['ProcessingTime']
print(usa.mean(), uk.mean())
t_stat, p = stats.ttest_ind(usa, uk) # keep the test result; it is not printed in the output below
print(Cohen_d(usa, uk))
8.375634517766498 8.807339449541285
0.06310985453186797
The result of the test shows no statistically significant difference in processing time, hence we fail to reject the null hypothesis.
- $H_0$ : There is no difference in discounts given by USA and UK employees
- $H_\alpha$ : There is a difference in discounts given by USA and UK employees
usa_uk_discount = pd.read_sql_query('''
SELECT OD.Discount, E.Country FROM [Order] AS O
JOIN OrderDetail AS OD ON O.Id = OD.OrderId
JOIN Employee AS E ON O.EmployeeId = E.Id
''', conn)
formula = 'Discount ~ C(Country)'
lm = ols(formula, usa_uk_discount).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
sum_sq df F PR(>F)
C(Country) 0.067081 1.0 9.671415 0.001896
Residual 14.933259 2153.0 NaN NaN
The result of the test shows a statistically significant difference in discount amount between employees from the USA and the UK, hence we reject the null hypothesis.
Employees from the USA tend to give smaller discounts to their clients.
- $H_0$ : There is no difference in demand of produce each month
- $H_\alpha$ : There is a difference in demand of produce each month
produce = pd.read_sql_query('''
SELECT O.OrderDate, OD.Quantity, OD.Discount, CategoryId FROM [Order] AS O
JOIN OrderDetail AS OD
ON O.Id = OD.OrderId
JOIN Product
ON Product.Id = OD.ProductId
WHERE Product.CategoryId = 7
''',conn)
produce.OrderDate = pd.to_datetime(produce.OrderDate)
produce['Month'] = produce.OrderDate.dt.month
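# Average only the numeric columns; OrderDate is a datetime column
produce.groupby('Month')[['Quantity', 'Discount', 'CategoryId']].mean()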
Month | Quantity | Discount | CategoryId |
---|---|---|---|
1 | 16.545455 | 0.050000 | 7.0 |
2 | 15.555556 | 0.011111 | 7.0 |
3 | 21.500000 | 0.004545 | 7.0 |
4 | 29.105263 | 0.028947 | 7.0 |
5 | 12.888889 | 0.075556 | 7.0 |
6 | 21.285714 | 0.085714 | 7.0 |
7 | 26.375000 | 0.050000 | 7.0 |
8 | 15.666667 | 0.038889 | 7.0 |
9 | 17.500000 | 0.025000 | 7.0 |
10 | 33.250000 | 0.037500 | 7.0 |
11 | 16.000000 | 0.055556 | 7.0 |
12 | 26.842105 | 0.100000 | 7.0 |
formula = 'Quantity ~ C(Month)'
lm = ols(formula, produce).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
sum_sq df F PR(>F)
C(Month) 4834.012843 11.0 1.318794 0.221691
Residual 41319.957745 124.0 NaN NaN
There is no statistically significant difference in order quantity between months, hence we fail to reject the null hypothesis.
- $H_0$ : There is no difference in discount level between categories
- $H_\alpha$ : There is a difference in discount level between categories
category_discount = pd.read_sql_query('''
SELECT OrderDetail.UnitPrice, Discount, CategoryId FROM OrderDetail
JOIN Product
ON OrderDetail.ProductId = Product.Id
''',conn)
formula = 'Discount ~ C(CategoryId)'
lm = ols(formula, category_discount).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
sum_sq df F PR(>F)
C(CategoryId) 0.074918 7.0 1.539545 0.149326
Residual 14.925422 2147.0 NaN NaN
The result of the test shows no statistically significant difference in discount level between categories, hence we fail to reject the null hypothesis.
- $H_0$ : There is no difference in performance between shipping companies
- $H_\alpha$ : There is a difference in performance between shipping companies
Order_df.OrderDate = pd.to_datetime(Order_df.OrderDate)
Order_df.ShippedDate = pd.to_datetime(Order_df.ShippedDate)
Order_df.RequiredDate = pd.to_datetime(Order_df.RequiredDate)
Order_df['ProcessingTime'] = Order_df.ShippedDate - Order_df.OrderDate
Order_df['ShippingTime'] = Order_df.RequiredDate - Order_df.ShippedDate
Order_df.ShippingTime = Order_df.ShippingTime.dt.days
Order_df.ProcessingTime = Order_df.ProcessingTime.dt.days
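# numeric_only=True skips the text and datetime columns, which cannot be averaged here
Order_df.groupby('ShipVia').mean(numeric_only=True)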
ShipVia | Id | EmployeeId | Freight | ProcessingTime | ShippingTime |
---|---|---|---|---|---|
1 | 10667.594378 | 4.232932 | 65.001325 | 8.571429 | 19.485714 |
2 | 10674.963190 | 4.536810 | 86.640644 | 9.234921 | 18.765079 |
3 | 10641.592157 | 4.400000 | 80.441216 | 7.473896 | 19.963855 |
formula = 'ProcessingTime ~ C(ShipVia)'
lm = ols(formula, Order_df).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
sum_sq df F PR(>F)
C(ShipVia) 433.501581 2.0 4.676819 0.009563
Residual 37354.696194 806.0 NaN NaN
Shipper_df
| | Id | CompanyName | Phone |
---|---|---|---|
0 | 1 | Speedy Express | (503) 555-9831 |
1 | 2 | United Package | (503) 555-3199 |
2 | 3 | Federal Shipping | (503) 555-9931 |
The result of the test shows a statistically significant difference in the performance of shipping companies, measured here as order processing time, hence we reject the null hypothesis.
- Every discount level tested (5%, 10%, 15%, 20% and 25%) increases order quantity relative to no discount, and the levels have approximately the same effect on order quantity
- Employees from the US processed more orders and gave lower discounts than employees from the UK, while processing time (from order date to shipping date) was approximately the same
- There are differences in monthly demand for produce, but they are not significant enough to reject the null hypothesis
- Discounts were given at roughly the same level across categories
- Find out why employees from the US had many more orders than employees from the UK
- Research further which clients responded better to discounts
- Find the optimal discount level for products according to their price and possible seasonal demand
- Find a way to improve logistics