arseniyturin / northwind-database-statistical-testing

This project forked from learn-co-students/dsc-2-final-project-online-ds-sp-000

License: Other

Jupyter Notebook 100.00%

northwind-database-statistical-testing's Introduction

Student Project #2

Northwind Database

For this project, you'll be working with the Northwind database--a free, open-source dataset created by Microsoft containing data from a fictional company. You probably remember the Northwind database from our section on Advanced SQL.

The goal of this project is to test your ability to gather information from a real-world database and use your knowledge of statistical analysis and hypothesis testing to generate analytical insights that can be of value to the company.



Database Schema

The Goal

The goal of your project is to query the database to get the data needed to perform a statistical analysis. In this statistical analysis, we'll need to perform a hypothesis test (or perhaps several) to answer the following questions:

  • Question 1: Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?
  • Question 2: Is there a statistically significant difference in performance between employees from US and UK?
  • Question 3: Is there a statistically significant difference in discounts given by USA and UK employees?
  • Question 4: Is there a statistically significant difference in demand of produce each month?
  • Question 5: Is there a statistically significant difference in discount between categories?
  • Question 6: Is there a statistically significant difference in performance of shipping companies?

Importing libraries

import sqlite3 # for database
import pandas as pd # for dataframe
import matplotlib.pyplot as plt # plotting
import seaborn as sns # plotting
import numpy as np # analysis
from scipy import stats # significance levels, normality
import itertools # for combinations
import statsmodels.api as sm # anova
from statsmodels.formula.api import ols

import warnings
warnings.filterwarnings('ignore') # hide all warnings, including matplotlib and seaborn notices

Connecting to database

# Connecting to database
conn = sqlite3.connect('Northwind_small.sqlite')
c = conn.cursor()
# List of all tables
tables = c.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
tables = [i[0] for i in tables]

Converting all tables into dataframes

# Load every table into its own pandas DataFrame
dfs = []
for name in tables:
    rows = c.execute('SELECT * FROM "' + name + '"').fetchall()
    columns = c.execute('PRAGMA table_info("' + name + '")').fetchall()
    df = pd.DataFrame(rows, columns=[col[1] for col in columns])
    # Bind the DataFrame to a variable named after the table, e.g. Order_df
    df_name = name + "_df"
    globals()[df_name] = df
    # Keep all DataFrame names in a list to remember what we have
    dfs.append(df_name)
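
Creating variables by name works in a notebook, but it makes the available names hard to track. As a minimal sketch of an alternative, the tables can be kept in a dictionary keyed by table name. The example uses only the standard library and a hypothetical in-memory database with two toy tables, not the Northwind file; `load_tables` is an illustrative helper name.

```python
import sqlite3

# Hypothetical in-memory database standing in for Northwind_small.sqlite
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Order" (Id INTEGER, EmployeeId INTEGER)')
conn.execute("CREATE TABLE Shipper (Id INTEGER, CompanyName TEXT)")
conn.execute('INSERT INTO "Order" VALUES (1, 5), (2, 6)')
conn.execute("INSERT INTO Shipper VALUES (1, 'Speedy Express')")

def load_tables(connection):
    """Return {table_name: (column_names, rows)} for every table."""
    names = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    loaded = {}
    for name in names:
        # Quote the name because "Order" is a reserved word in SQL
        cursor = connection.execute(f'SELECT * FROM "{name}"')
        columns = [col[0] for col in cursor.description]
        loaded[name] = (columns, cursor.fetchall())
    return loaded

tables = load_tables(conn)
print(sorted(tables))
print(tables["Shipper"])
```

With pandas, the dictionary values could instead be the DataFrames returned by `pd.read_sql_query`, so that `tables["Order"]` plays the role of `Order_df`.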

Exploratory Data Analysis

Order quantities of discounted and non-discounted products

The Product table has 77 entries, each a unique product.

First we can check visually whether a discount really makes a difference in order quantity.

discount = OrderDetail_df[OrderDetail_df['Discount']!=0].groupby('ProductId')['Quantity'].mean()
no_discount = OrderDetail_df[OrderDetail_df['Discount']==0].groupby('ProductId')['Quantity'].mean()
plt.figure(figsize=(16,5))
plt.bar(discount.index, discount.values, alpha=1, label='Discount', color='#a0b0f0')
plt.bar(no_discount.index, no_discount.values, alpha=0.8, label='No Discount', color='#c9f9a0')
plt.legend()
plt.title('Order quantities with/without discount')
plt.xlabel('Product ID')
plt.ylabel('Average quantities')
plt.show()

print('Conclusion')
print("On average {}% of discounted products were sold in larger quantities".format(round(sum(discount.values > no_discount.values)/len(discount.values)*100),2))
print("Average order quantity with discount - {} items, without - {} items".format(round(discount.values.mean(),2), round(no_discount.values.mean(),2)))

[Figure: bar chart "Order quantities with/without discount", average quantities per Product ID]

Conclusion
On average 70.0% of discounted products were sold in larger quantities
Average order quantity with discount - 26.43 items, without - 21.81 items

This suggests that customers tend to buy more of a product when it is discounted.

To test this hypothesis we will run an experiment.

Orders grouped by discount level

First let's see how many discount levels we have, and how many orders and what average order quantity each discount level has.

# Let's get all discount levels
discounts = OrderDetail_df['Discount'].unique()
discounts.sort()
print('Discount levels')
print(discounts)
Discount levels
[0.   0.01 0.02 0.03 0.04 0.05 0.06 0.1  0.15 0.2  0.25]
# Group order lines by discount level
# Each group is a DataFrame containing the order lines with that discount
groups = {}
for i in discounts:
    groups[i] = OrderDetail_df[OrderDetail_df['Discount']==i]
# Build a summary DataFrame of discount levels and order quantities
# (DataFrame.append was removed in pandas 2.0, so rows are collected in a list first)
rows = []
for i in groups.keys():
    rows.append({'Discount %':i*100,'Orders':len(groups[i]),'Avg. Order Quantity':groups[i]['Quantity'].mean()})
discounts_df = pd.DataFrame(rows, columns=['Discount %','Orders','Avg. Order Quantity'])

discounts_df
    Discount %  Orders  Avg. Order Quantity
0          0.0  1317.0            21.715262
1          1.0     1.0             2.000000
2          2.0     2.0             2.000000
3          3.0     3.0             1.666667
4          4.0     1.0             1.000000
5          5.0   185.0            28.010811
6          6.0     1.0             2.000000
7         10.0   173.0            25.236994
8         15.0   157.0            28.382166
9         20.0   161.0            27.024845
10        25.0   154.0            28.240260

The table above shows that the 1%, 2%, 3%, 4% and 6% discount levels each have only one to three orders, too few to draw a conclusion from. The quantities ordered at these levels are also very small (1 to 2 items on average), so at first glance these discounts look like they had a negative impact, possibly because they were part of a new product promotion or for some other reason.

Let's drop these discount levels from our experiment.
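
The decision to drop the sparse levels can be written as a simple filter on group size. The sketch below uses the order counts from the table above; the cut-off `MIN_ORDERS = 30` is a hypothetical value chosen for illustration, not one used in the original analysis.

```python
# (discount %, number of order lines), copied from the table above
order_counts = [(0.0, 1317), (1.0, 1), (2.0, 2), (3.0, 3), (4.0, 1),
                (5.0, 185), (6.0, 1), (10.0, 173), (15.0, 157),
                (20.0, 161), (25.0, 154)]

MIN_ORDERS = 30  # hypothetical cut-off, not taken from the original analysis

# Keep only the levels with enough observations to compare means
kept = [level for level, n in order_counts if n >= MIN_ORDERS]
dropped = [level for level, n in order_counts if n < MIN_ORDERS]
print("kept:", kept)
print("dropped:", dropped)
```

Any threshold between 4 and 154 gives the same split here, because the group sizes fall into two clearly separated clusters.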

Bootstrap

Bootstrapping is a type of resampling where large numbers of smaller samples of the same size are repeatedly drawn, with replacement, from a single original sample.

def bootstrap(sample, n):
    bootstrap_sampling_dist = []
    for i in range(n):
        bootstrap_sampling_dist.append(np.random.choice(sample, size=len(sample), replace=True).mean())
    return np.array(bootstrap_sampling_dist)
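
As a usage sketch, the resampled means can be turned into a percentile interval for the mean. This version uses only the standard library instead of NumPy; the sample values, the seed and the helper name `bootstrap_means` are made up for illustration.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples, seed=0):
    """Means of n_resamples resamples drawn with replacement from sample."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(sample, k=len(sample)))
            for _ in range(n_resamples)]

# Made-up order quantities, for illustration only
quantities = [12, 10, 5, 40, 21, 30, 15, 6, 25, 18]
means = sorted(bootstrap_means(quantities, 2000))

# Percentile interval: cut 2.5% of the resampled means off each tail
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]
print(len(means), lower, upper)
```

Each resample has the same size as the original sample, which is what `size=len(sample)` does in the NumPy version above.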

Cohen's d

Cohen's d is an effect size used to indicate the standardised difference between two means. It can be used, for example, to accompany reporting of t-test and ANOVA results. It is also widely used in meta-analysis. Cohen's d is an appropriate effect size for the comparison between two means.

def Cohen_d(group1, group2):
    # Difference between the two group means
    diff = group1.mean() - group2.mean()
    n1, n2 = len(group1), len(group2)
    var1 = group1.var()
    var2 = group2.var()
    # Pooled variance, weighting each group's variance by its sample size
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    # Cohen's d: mean difference in units of the pooled standard deviation
    d = diff / np.sqrt(pooled_var)
    return abs(d)
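
The pooled variance above weights each sample variance by n, while the common textbook form weights by n - 1. A minimal standard-library sketch comparing the two, with made-up samples and a hypothetical `unbiased` switch, shows that the difference is small even for tiny groups:

```python
import math
import statistics

def cohen_d(group1, group2, unbiased=True):
    """Absolute standardized difference between two sample means.

    unbiased=True pools with (n - 1) weights, the common textbook form;
    unbiased=False weights by n, as the Cohen_d function above does.
    """
    n1, n2 = len(group1), len(group2)
    # statistics.variance is the sample variance (divides by n - 1)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    if unbiased:
        pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    else:
        pooled = (n1 * var1 + n2 * var2) / (n1 + n2)
    diff = statistics.fmean(group1) - statistics.fmean(group2)
    return abs(diff) / math.sqrt(pooled)

# Made-up samples, for illustration only
a = [2, 4, 6, 8, 10]
b = [1, 2, 3, 4, 5, 6, 7]
print(cohen_d(a, b), cohen_d(a, b, unbiased=False))
```

With group sizes in the hundreds, as in the order data, the two weightings give practically the same value.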

Visualization

def visualization(control, experimental):
    plt.figure(figsize=(10,6))
    sns.distplot(experimental, bins=50,  label='Experimental')
    sns.distplot(control, bins=50,  label='Control')

    plt.axvline(x=control.mean(), color='k', linestyle='--')
    plt.axvline(x=experimental.mean(), color='k', linestyle='--')

    plt.title('Control and Experimental Sampling Distributions', fontsize=14)
    plt.xlabel('Distributions')
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()

Question #1

Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?

  • $H_0$: there is no difference in order quantity due to discount
  • $H_\alpha$: there is an increase in order quantity due to discount

A discount is expected to increase order quantity, so the alternative hypothesis is one-sided and a one-tailed test is reasonable. The tests below use $\alpha$ = 0.025: if $p$ < $\alpha$, we reject the null hypothesis.
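
`stats.ttest_ind` reports a two-sided p-value by default. The sketch below shows how a two-sided p-value relates to a one-sided one for a symmetric statistic such as t. The helper `one_sided_p` and the numbers passed to it are hypothetical and not part of the original analysis.

```python
def one_sided_p(t_stat, p_two_sided, expect_positive=True):
    """Convert a two-sided t-test p-value to a one-sided one.

    expect_positive states the direction of the alternative hypothesis:
    True means the alternative predicts t_stat > 0.
    """
    in_expected_direction = (t_stat > 0) == expect_positive
    half = p_two_sided / 2
    return half if in_expected_direction else 1 - half

# Made-up test results, for illustration only
print(one_sided_p(2.1, 0.04))   # effect in the predicted direction
print(one_sided_p(-2.1, 0.04))  # effect in the opposite direction
```

The sign of the t statistic depends on argument order: with `ttest_ind(control, experimental)`, a larger experimental mean gives a negative t, so `expect_positive=False` would be the matching choice. Recent SciPy versions also accept an `alternative` argument on `ttest_ind` that performs the one-sided test directly.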

Welch's T-test

In statistics, Welch's t-test, or unequal variances t-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means.
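
To make the definition concrete, here is a minimal standard-library sketch of Welch's t statistic and the Welch-Satterthwaite degrees of freedom. The two samples are made up and `welch_t` is an illustrative helper name.

```python
import math
import statistics

def welch_t(sample1, sample2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    # Squared standard error of each mean, without pooling the variances
    se1 = statistics.variance(sample1) / n1
    se2 = statistics.variance(sample2) / n2
    diff = statistics.fmean(sample1) - statistics.fmean(sample2)
    t = diff / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Made-up order quantities, for illustration only
with_discount = [30, 25, 40, 35, 20]
without_discount = [20, 22, 18, 25, 15, 21, 19]
t, df = welch_t(with_discount, without_discount)
print(t, df)
```

In SciPy the same test is requested with `stats.ttest_ind(a, b, equal_var=False)`. The default, `equal_var=True`, performs Student's t-test with a pooled variance instead.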

First I created two distributions, control and experimental. The control distribution includes only order quantities without a discount, and the experimental distribution includes order quantities with a discount at any level.

This experiment answers the question of whether there is any difference in purchase quantity.

control = OrderDetail_df[OrderDetail_df['Discount']==0]['Quantity']
experimental = OrderDetail_df[OrderDetail_df['Discount']!=0]['Quantity']

# ttest_ind defaults to equal_var=True (Student's t-test) and returns a two-sided p-value;
# pass equal_var=False to run Welch's t-test
t_stat, p = stats.ttest_ind(control, experimental)
d = Cohen_d(experimental, control)

print('Reject Null Hypothesis' if p < 0.025 else 'Failed to reject Null Hypothesis')
print("Cohen's d:", d)
visualization(control, experimental)
Reject Null Hypothesis
Cohen's d: 0.2862724481729283

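[Figure: "Control and Experimental Sampling Distributions", distributions of control and experimental order quantities with dashed lines at the two means]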

The result of the experiment shows a statistically significant difference in order quantities, hence we reject the null hypothesis.

The question, however, also asks whether order quantity differs at different discount levels.

The next step is to find out at which discount levels we see a statistically significant difference in order quantities.

We will follow the same process as in the previous experiment, but this time we'll break the experimental group down by discount amount.

# Compare each discount level against the no-discount control group
discounts = [0.05, 0.1, 0.15, 0.2, 0.25]
control = OrderDetail_df[OrderDetail_df['Discount']==0]['Quantity']
rows = []
for i in discounts:
    experimental = OrderDetail_df[OrderDetail_df['Discount']==i]['Quantity']
    st, p = stats.ttest_ind(control, experimental)
    d = Cohen_d(experimental, control)
    rows.append({'Discount %': str(i*100)+'%', 'Null Hypothesis': 'Reject' if p < 0.025 else 'Failed', 'Cohens d': d})

# DataFrame.append was removed in pandas 2.0, so the rows are collected first
discounts_significance_df = pd.DataFrame(rows, columns=['Discount %','Null Hypothesis','Cohens d'])
discounts_significance_df
   Discount %  Null Hypothesis  Cohens d
0        5.0%           Reject  0.346877
1       10.0%           Reject  0.195942
2       15.0%           Reject  0.372404
3       20.0%           Reject  0.300712
4       25.0%           Reject  0.366593

The result of the test shows a statistically significant difference in quantity between orders with no discount and orders with a discount of 5%, 10%, 15%, 20% or 25%. Hence we reject the null hypothesis at each of these levels.

Statistically significant difference between discount levels

  • $H_0$: there is no difference in order quantity between discounts
  • $H_\alpha$: there is a difference in order quantity between discounts
discounts = np.array([0.05, 0.1, 0.15, 0.2, 0.25])
# Every unordered pair of discount levels
comb = itertools.combinations(discounts, 2)
rows = []

for i in comb:

    control =      OrderDetail_df[OrderDetail_df['Discount']==i[0]]['Quantity']
    experimental = OrderDetail_df[OrderDetail_df['Discount']==i[1]]['Quantity']

    st, p = stats.ttest_ind(experimental, control)
    d = Cohen_d(experimental, control)

    rows.append({'Discount %': str(i[0]*100)+'% - '+str(i[1]*100)+'%', 'Null Hypothesis': 'Reject' if p < 0.05 else 'Failed', 'Cohens d': d})

# DataFrame.append was removed in pandas 2.0, so the rows are collected first
discount_levels_df = pd.DataFrame(rows, columns=['Discount %','Null Hypothesis','Cohens d'])
discount_levels_df.sort_values('Cohens d', ascending=False)
      Discount %  Null Hypothesis  Cohens d
4  10.0% - 15.0%           Failed  0.149332
6  10.0% - 25.0%           Failed  0.145146
0   5.0% - 10.0%           Failed  0.127769
5  10.0% - 20.0%           Failed  0.089008
7  15.0% - 20.0%           Failed  0.068234
9  20.0% - 25.0%           Failed  0.062415
2   5.0% - 20.0%           Failed  0.047644
1   5.0% - 15.0%           Failed  0.017179
3   5.0% - 25.0%           Failed  0.010786
8  15.0% - 25.0%           Failed  0.006912

Result of the test shows that there is no statistically significant difference in order quantity between discounts of 5%, 10%, 15%, 20% and 25%.
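
The ten rows in the table come from comparing every unordered pair of the five discount levels. A small sketch of how `itertools.combinations` enumerates them, using plain Python floats rather than the NumPy array above:

```python
import itertools

discount_levels = [0.05, 0.1, 0.15, 0.2, 0.25]

# Every unordered pair of distinct levels: C(5, 2) = 10 comparisons
pairs = list(itertools.combinations(discount_levels, 2))
labels = [f"{a * 100}% - {b * 100}%" for a, b in pairs]
print(len(pairs))
print(labels[0], "|", labels[-1])
```

Since none of the ten tests rejects at 0.05, a stricter multiple-comparison threshold such as Bonferroni's 0.05 / 10 would not change the conclusion.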

Question 2

Is there a statistically significant difference in performance between employees from US and UK?

  • $H_0$: There is no difference in performance between US and UK employees
  • $H_\alpha$: there is a difference in performance between US and UK employees

How do we measure the performance of an employee? It could be done in different ways, such as:

  • a survey of the customers
  • the number of orders they were able to process
  • the time it took them to process the orders
  • etc.

To find out the difference in performance we will perform two tests.

employees_orders = pd.read_sql_query( '''
                                    
                                SELECT O.EmployeeId, E.Country, COUNT(O.Id) AS Total_Orders  
                                FROM [Order] AS O
                                JOIN Employee as E
                                ON O.EmployeeId = E.Id
                                GROUP BY O.EmployeeId
                                
                                ''' ,conn)
employees_orders
   EmployeeId  Country  Total_Orders
0           1      USA           123
1           2      USA            96
2           3      USA           127
3           4      USA           156
4           5       UK            42
5           6       UK            67
6           7       UK            72
7           8      USA           104
8           9       UK            43

Even without a significance test we can tell there is a big difference in the total number of orders the two groups processed in two years.

2.1 Amount of orders processed by US and UK employees

# ANOVA Test
formula = 'Total_Orders ~ C(Country)'
lm = ols(formula, employees_orders).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
                 sum_sq   df          F    PR(>F)
C(Country)  9446.755556  1.0  22.640129  0.002064
Residual    2920.800000  7.0        NaN       NaN

Result of ANOVA Test shows that there is statistically significant difference in orders quantity between two groups of employees from USA and UK.
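
With only nine employees, the ANOVA table can be reproduced by hand. The following standard-library sketch recomputes the between-group and residual sums of squares and the F statistic from the `Total_Orders` values listed in the query result above. It does not compute the p-value, which needs the F distribution.

```python
import statistics

# Total_Orders per employee, copied from the query result above
groups = {
    "USA": [123, 96, 127, 156, 104],
    "UK": [42, 67, 72, 43],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = statistics.fmean(all_values)

# Between-group sum of squares: group size times the squared distance of
# the group mean from the grand mean
ss_between = sum(len(v) * (statistics.fmean(v) - grand_mean) ** 2
                 for v in groups.values())
# Within-group (residual) sum of squares
ss_within = sum((x - statistics.fmean(v)) ** 2
                for v in groups.values() for x in v)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(ss_between, 6), round(ss_within, 6), round(f_stat, 6))
```

The three printed values correspond to the `sum_sq` entries for `C(Country)` and `Residual` and to the `F` column of the `anova_lm` output.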

My suspicion was that the group from the USA covers a larger territory, but that turned out not to be the case.

I want to look further into employee performance and compare order processing times; maybe that is the reason for such a big difference in the number of orders.

2.2 Order Processing Time by US vs UK Employees

In this test I want to find out whether the number of orders is affected by how fast employees can process them.

usa_uk = pd.read_sql_query('''
                    
                    SELECT O.Id, O.OrderDate, O.ShippedDate, E.Country FROM [Order] AS O
                    JOIN Employee AS E
                    ON O.EmployeeId = E.Id

''',conn)
usa_uk.OrderDate = pd.to_datetime(usa_uk.OrderDate)
usa_uk.ShippedDate = pd.to_datetime(usa_uk.ShippedDate)
usa_uk['ProcessingTime'] = usa_uk.ShippedDate - usa_uk.OrderDate
usa_uk.ProcessingTime = usa_uk.ProcessingTime.dt.days
usa_uk.dropna(inplace=True)
usa = usa_uk[usa_uk.Country == 'USA']['ProcessingTime']
uk  = usa_uk[usa_uk.Country == 'UK']['ProcessingTime']

print(usa.mean(), uk.mean())
# Two-sided t-test on processing time; p is compared against alpha
t_stat, p = stats.ttest_ind(usa, uk)
print(Cohen_d(usa, uk))
8.375634517766498 8.807339449541285
0.06310985453186797

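The mean processing times are close (about 8.4 days for USA and 8.8 days for UK) and the effect size is negligible (Cohen's d ≈ 0.06). The result of the test shows no statistically significant difference in processing time, hence we fail to reject the null hypothesis.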
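
Processing time here is the number of whole days between `OrderDate` and `ShippedDate`, and orders that have not shipped yet have to be left out before averaging. A minimal standard-library sketch of that calculation, with made-up dates and a hypothetical helper `processing_days`:

```python
from datetime import date

# Made-up (OrderDate, ShippedDate) pairs; None marks an unshipped order
orders = [
    ("2013-07-04", "2013-07-16"),
    ("2013-07-05", "2013-07-10"),
    ("2013-07-08", None),
]

def processing_days(order_date, shipped_date):
    """Whole days between order and shipment, or None if not shipped."""
    if shipped_date is None:
        return None
    delta = date.fromisoformat(shipped_date) - date.fromisoformat(order_date)
    return delta.days

days = [processing_days(o, s) for o, s in orders]
# Equivalent of dropna(): ignore orders without a shipping date
shipped_only = [d for d in days if d is not None]
print(days, sum(shipped_only) / len(shipped_only))
```

In the pandas version, the missing `ShippedDate` values become `NaT`, the subtraction yields missing values, and `dropna` removes those rows.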

Question 3

Is there a statistically significant difference in discounts given by USA and UK employees?

  • $H_0$: There is no difference in discounts given by USA and UK employees
  • $H_\alpha$: There is a difference in discounts given by USA and UK employees

Read Database

usa_uk_discount = pd.read_sql_query('''

                    SELECT OD.Discount, E.Country FROM [Order] AS O
                    JOIN OrderDetail AS OD ON O.Id = OD.OrderId
                    JOIN Employee AS E ON O.EmployeeId = E.Id

''', conn)
formula = 'Discount ~ C(Country)'
lm = ols(formula, usa_uk_discount).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
               sum_sq      df         F    PR(>F)
C(Country)   0.067081     1.0  9.671415  0.001896
Residual    14.933259  2153.0       NaN       NaN

Result

Result of the test shows that there is statistically significant difference in discount amount between employees from USA and UK, hence we reject null hypothesis

Employees from the USA tend to give smaller discounts to their clients.

Question 4

Is there a statistically significant difference in demand of produce each month?

  • $H_0$: There is no difference in demand of produce each month
  • $H_\alpha$: There is a difference in demand of produce each month

Read Database

produce = pd.read_sql_query('''

                                SELECT O.OrderDate, OD.Quantity, OD.Discount, CategoryId FROM [Order] AS O
                                JOIN OrderDetail AS OD
                                ON O.Id = OD.OrderId
                                JOIN Product
                                ON Product.Id = OD.ProductId
                                WHERE Product.CategoryId = 7

''',conn)   

Group data by month

produce.OrderDate = pd.to_datetime(produce.OrderDate)
produce['Month'] = produce.OrderDate.dt.month
produce.groupby('Month').mean(numeric_only=True)  # average only the numeric columns
Month   Quantity  Discount  CategoryId
1      16.545455  0.050000         7.0
2      15.555556  0.011111         7.0
3      21.500000  0.004545         7.0
4      29.105263  0.028947         7.0
5      12.888889  0.075556         7.0
6      21.285714  0.085714         7.0
7      26.375000  0.050000         7.0
8      15.666667  0.038889         7.0
9      17.500000  0.025000         7.0
10     33.250000  0.037500         7.0
11     16.000000  0.055556         7.0
12     26.842105  0.100000         7.0
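
Grouping by calendar month pools the same month from different years into one group. The monthly averages could also be computed in SQL; the sketch below shows this with SQLite's `strftime` on a hypothetical miniature table (the table `Sales` and its rows are made up, only the column names mirror the query above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature table; column names mirror the query above
conn.execute("CREATE TABLE Sales (OrderDate TEXT, Quantity INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?)", [
    ("2013-01-05", 10), ("2013-01-20", 20),
    ("2013-02-11", 30), ("2014-02-02", 50),
])

# strftime('%m', ...) extracts the month, so the same month from different
# years lands in one group, as with OrderDate.dt.month in pandas
rows = conn.execute("""
    SELECT CAST(strftime('%m', OrderDate) AS INTEGER) AS Month,
           AVG(Quantity) AS AvgQuantity
    FROM Sales
    GROUP BY Month
    ORDER BY Month
""").fetchall()
print(rows)
```

Here February 2013 and February 2014 are averaged together, which is the same pooling that happens in the monthly table above.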

ANOVA Test

formula = 'Quantity ~ C(Month)'
lm = ols(formula, produce).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
                sum_sq     df         F    PR(>F)
C(Month)   4834.012843   11.0  1.318794  0.221691
Residual  41319.957745  124.0       NaN       NaN

Result

There is no statistically significant difference in order quantity between months, hence we failed to reject null hypothesis

Question 5

Is there a statistically significant difference in discount between categories?

  • $H_0$: There is no difference in discount level between categories
  • $H_\alpha$: There is a difference in discount level between categories

Read Database

category_discount = pd.read_sql_query('''

                        SELECT OrderDetail.UnitPrice, Discount, CategoryId FROM OrderDetail
                        JOIN Product
                        ON OrderDetail.ProductId = Product.Id

''',conn)

ANOVA Test

formula = 'Discount ~ C(CategoryId)'
lm = ols(formula, category_discount).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
                  sum_sq      df         F    PR(>F)
C(CategoryId)   0.074918     7.0  1.539545  0.149326
Residual       14.925422  2147.0       NaN       NaN

Result

Result of the test shows that there is no statistically significant difference in discount level between categories, hence we failed to reject null hypothesis

Question 6

Is there a statistically significant difference in performance of shipping companies?

  • $H_0$: There is no difference in processing time between shipping companies
  • $H_\alpha$: There is a difference in processing time between shipping companies

Order_df.OrderDate = pd.to_datetime(Order_df.OrderDate)
Order_df.ShippedDate = pd.to_datetime(Order_df.ShippedDate)
Order_df.RequiredDate = pd.to_datetime(Order_df.RequiredDate)

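# Days from order placement to shipment
Order_df['ProcessingTime'] = Order_df.ShippedDate - Order_df.OrderDate
# Days between shipment and the required date
Order_df['ShippingTime'] = Order_df.RequiredDate - Order_df.ShippedDate

Order_df.ShippingTime = Order_df.ShippingTime.dt.days
Order_df.ProcessingTime = Order_df.ProcessingTime.dt.days
# numeric_only=True skips text and date columns, which pandas 2.x no longer drops silently
Order_df.groupby('ShipVia').mean(numeric_only=True)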
ShipVia            Id  EmployeeId    Freight  ProcessingTime  ShippingTime
1        10667.594378    4.232932  65.001325        8.571429     19.485714
2        10674.963190    4.536810  86.640644        9.234921     18.765079
3        10641.592157    4.400000  80.441216        7.473896     19.963855
formula = 'ProcessingTime ~ C(ShipVia)'
lm = ols(formula, Order_df).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
                  sum_sq     df         F    PR(>F)
C(ShipVia)    433.501581    2.0  4.676819  0.009563
Residual    37354.696194  806.0       NaN       NaN
Shipper_df
   Id       CompanyName           Phone
0   1    Speedy Express  (503) 555-9831
1   2    United Package  (503) 555-3199
2   3  Federal Shipping  (503) 555-9931

Result

The result of the test shows a statistically significant difference in performance (measured as processing time) between shipping companies, hence we reject the null hypothesis.
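
The ANOVA only says that at least one group mean differs; it does not say which shipper is faster. As a small sketch, the mean processing times can be labelled with the shipper names, assuming that `ShipVia` refers to the `Id` column of the Shipper table (an assumption, the README does not state it):

```python
# Shipper names and mean ProcessingTime per ShipVia id, copied from the
# tables above; the id-to-name mapping is an assumption
shippers = {1: "Speedy Express", 2: "United Package", 3: "Federal Shipping"}
mean_processing_days = {1: 8.571429, 2: 9.234921, 3: 7.473896}

# Sort the ShipVia ids from shortest to longest mean processing time
ranking = sorted(mean_processing_days, key=mean_processing_days.get)
for ship_via in ranking:
    print(f"{shippers[ship_via]}: {mean_processing_days[ship_via]:.2f} days")
```

Whether the gap between any two specific shippers is significant would need pairwise tests, similar to the pairwise discount comparison in Question 1.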

Conclusion

  • Discounts of 5%, 10%, 15%, 20% and 25% all show a statistically significant increase in order quantity; discounts of 5%, 15%, 20% and 25% have approximately the same effect size (Cohen's d of roughly 0.30 to 0.37)
  • Employees from the US sold more product at a lower discount, though order quantity was the same as for employees from the UK and processing time (from the order being placed to shipping) was approximately the same.
  • There is a difference in monthly demand for produce, but it is not large enough to reject the null hypothesis
  • Discounts were given across categories at roughly the same level

Further Steps

  • Find out why employees from the US had many more orders than employees from the UK
  • Research further which clients responded better to discounts
  • Find out the optimal level of discount for products according to their price and possible seasonal demand
  • Find a way to improve logistics

northwind-database-statistical-testing's People

Contributors

arseniyturin, mike-kane, peterbell
