
ROSSMANN: A SALES FORECAST PROJECT

[Rossmann logo]


About the Project

This project uses public data released by Dirk Rossmann GmbH for a 2015 Kaggle competition. From this data, a business context was created in which the company's CFO needs a forecast of the next six weeks of sales for each store. The project was developed to come as close as possible to solving a real business demand, executing all the necessary steps, generating insights, and building a genuinely usable data product that creates value for the stakeholder and the company.

The methodology used was CRISP-DM, which works in development cycles with the goal of delivering value quickly. This project covers the first CRISP-DM cycle.

The CRISP-DM Cycle

[CRISP-DM cycle diagram]

For the execution of this project the following tools were used:

Python

pandas

NumPy

Matplotlib

seaborn

scikit-learn

XGBoost

Flask

Heroku

Access the project development notebook here: https://github.com/lbVictor/Sales_Prediction/blob/main/notebooks/store_sales_prediction_cycle01.ipynb

About the Company

Dirk Rossmann GmbH is one of the largest drug store chains in Europe, with around 56,300 employees and a total of 4,244 stores, including 2,233 in Germany. In 2020, the Rossmann Group achieved a turnover of €10.35 billion across Germany, Poland, Hungary, the Czech Republic, Turkey, Albania, Kosovo and Spain.

The company was founded by Dirk Rossmann and is headquartered in Burgwedel, near Hanover, Germany. The Rossmann family owns 60% of the company and the Hong Kong-based A.S. Watson Group owns 40%.

The product range includes up to 21,700 items and can vary depending on the size of the shop and the location.

Project Structure

Business Understanding

  1. Business Problem: Rossmann's CFO recently asked store managers for a projection of the next six weeks of sales, since he intends to use this information to define the investment in renovating each store. Given the lack of an accurate answer from the store managers, the data science team proposed developing a solution to this problem, prioritizing the quality of the information and its ease of access for the stakeholder.

    For this project, a dataset with sales information from 2013 to mid-2015 was made available, covering 1,115 Rossmann stores.

    Features available in the dataset (a data-loading sketch follows the table):

    | Feature | Description |
    | ------- | ----------- |
    | Store | A unique id for each store |
    | DayOfWeek | Day of the week (1 = Monday, 7 = Sunday) |
    | Date | Date of each sale |
    | Sales | The turnover for any given day (this is what you are predicting) |
    | Customers | The number of customers on a given day |
    | Open | An indicator for whether the store was open: 0 = closed, 1 = open |
    | StateHoliday | Indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = none |
    | SchoolHoliday | Indicates if the (Store, Date) was affected by the closure of public schools |
    | StoreType | Differentiates between 4 different store models: a, b, c, d |
    | Assortment | Describes an assortment level: a = basic, b = extra, c = extended |
    | CompetitionDistance | Distance in meters to the nearest competitor store |
    | CompetitionOpenSinceMonth | Gives the approximate month in which the nearest competitor opened |
    | CompetitionOpenSinceYear | Gives the approximate year in which the nearest competitor opened |
    | Promo | Indicates whether a store is running a promo on that day |
    | Promo2 | A continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating |
    | Promo2Since[Year/Week] | Describes the year and calendar week when the store started participating in Promo2 |
    | PromoInterval | Describes the consecutive intervals in which Promo2 starts anew, naming the months the promotion restarts. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August and November of any given year for that store |
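As a reference, the Kaggle competition ships the daily sales and the store metadata as separate files that join on the store id. A minimal loading sketch (file paths are illustrative):

```python
import pandas as pd

# Kaggle's "Rossmann Store Sales" data: daily sales plus store metadata.
sales = pd.read_csv("data/train.csv", low_memory=False)  # one row per (store, day)
stores = pd.read_csv("data/store.csv")                   # one row per store

# Left join keeps every sales record and attaches the store attributes.
df_raw = sales.merge(stores, how="left", on="Store")
print(df_raw.shape)
```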
  2. Solution Plan: After understanding why the CFO requested the accumulated revenue per store for the next six weeks, an end-to-end sales forecast project was agreed upon. The plan: understand the behavior of the available features, predict each store's future sales with a regression machine learning model, and expose the results through a messaging application, where the stakeholder sends a store id and receives the store's estimated sales for the next six weeks.

Data Understanding & Data Preparation

  1. Data Cleaning & Data Description: At this stage, the dimensions of the data were verified and the data was cleaned: columns were renamed, types were converted to the correct formats, and missing values were analyzed and replaced. A descriptive statistical analysis was also carried out to get an initial feel for the data and identify possible errors. A sketch of this kind of cleaning is shown below.
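For illustration, the cleaning described above might look like the sketch below (the snake_case renaming follows common practice for this dataset; the fill value for the competition distance is an assumption, standing in for whatever imputation the notebook actually uses):

```python
import re
import pandas as pd

df = df_raw.copy()

# Rename CamelCase headers (e.g. CompetitionDistance -> competition_distance).
df.columns = [re.sub(r"(?<!^)(?=[A-Z])", "_", c).lower() for c in df.columns]

# Convert types to the correct format.
df["date"] = pd.to_datetime(df["date"])

# A missing competition_distance likely means "no competitor nearby", so one
# common choice is a value far beyond the observed maximum (assumption).
df["competition_distance"] = df["competition_distance"].fillna(200000.0)
```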

  2. Feature Engineering: At this stage, a mind map was created to model the phenomenon (sales), and 12 business hypotheses were generated for later validation. Based on these hypotheses, new features were derived from the original dataset to support the analyses and the model training carried out in the next phases; a sketch follows.
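A hypothetical reconstruction of two kinds of derived features: calendar attributes and the time since the nearest competitor opened (column names follow the cleaned dataset; the notebook's exact derivations may differ):

```python
# Calendar features derived from the sale date.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)

# Months since the nearest competitor opened (NaT-safe via errors="coerce").
competition_open = pd.to_datetime(
    dict(year=df["competition_open_since_year"],
         month=df["competition_open_since_month"],
         day=1),
    errors="coerce",
)
df["competition_time_month"] = (df["date"] - competition_open).dt.days / 30
```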

  3. Variable Filtering: At this stage, rows for days when stores were closed or had no sales were removed. Columns made redundant by the derived features were also dropped, along with the 'customers' column, since that information will not be available in the production environment.

  4. Exploratory Data Analysis (EDA): At this stage, three types of analysis were performed in order to better understand the available data.

    • Univariate analysis: carried out to understand the individual behavior of each variable.
    • Bivariate analysis: carried out to understand the relationship between each independent variable and the dependent variable ('sales'), and to validate the business hypotheses raised in the previous step.
    • Multivariate analysis: carried out to understand the correlation between all dataset variables (a short sketch of these analyses follows this item).
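Two representative cuts from the EDA, sketched with seaborn (the chosen variables are examples, not the full set analyzed in the notebook):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Bivariate: does the assortment level influence average sales?
sns.barplot(data=df, x="assortment", y="sales", estimator=np.mean)
plt.show()

# Multivariate: Pearson correlation between the numerical attributes.
num_attrs = df.select_dtypes(include=["int64", "float64"])
sns.heatmap(num_attrs.corr(method="pearson"), cmap="coolwarm")
plt.show()
```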
  5. Data Preprocessing: At this stage, the data was prepared for the machine learning algorithms. The objective is to adjust the data without losing information content, making it easier for the algorithms to learn from.

    • Numerical variables:

      • Transformation: a logarithmic transformation was applied to the response variable (sales) to bring its distribution closer to a Gaussian.
      • Rescaling: MinMax Scaler and Robust Scaler were applied to time-related variables that do not follow a cyclical nature and to the variable that contains a distance in meters.
      • Nature Transformation: the sine and cosine of the time-related cyclical variables were extracted, so that their cyclical nature can be better captured by the model.
    • Categorical variables:

      • Encoding: Ordinal Encoding was applied to variables that follow an order; Label Encoding to variables that do not; and One Hot Encoding to variables that represent a state and affect all other variables in the dataset. A sketch of these transformations follows.
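A minimal sketch of these transformations, continuing from the earlier snippets (which variable gets which scaler or encoder is assumed from the descriptions above, not copied from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, RobustScaler

# Transformation: log1p on the response variable.
df["sales"] = np.log1p(df["sales"])

# Rescaling: robust scaling for the long-tailed distance, min-max elsewhere.
df[["competition_distance"]] = RobustScaler().fit_transform(df[["competition_distance"]])
df[["year"]] = MinMaxScaler().fit_transform(df[["year"]])

# Nature transformation: sine/cosine encoding of cyclical time variables.
df["month_sin"] = np.sin(df["month"] * (2.0 * np.pi / 12))
df["month_cos"] = np.cos(df["month"] * (2.0 * np.pi / 12))
df["day_of_week_sin"] = np.sin(df["day_of_week"] * (2.0 * np.pi / 7))
df["day_of_week_cos"] = np.cos(df["day_of_week"] * (2.0 * np.pi / 7))

# Encoding: ordinal for the ordered assortment, label for store_type,
# one-hot for the state-like state_holiday.
df["assortment"] = df["assortment"].map({"a": 1, "b": 2, "c": 3})
df["store_type"] = LabelEncoder().fit_transform(df["store_type"])
df = pd.get_dummies(df, columns=["state_holiday"])
```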
  6. Feature Selection: In this step, the Boruta feature selection algorithm was applied on top of a random forest. Boruta is a wrapper method: it repeatedly compares the importance of each real feature against randomized "shadow" copies of the features, and keeps only the features that consistently beat their shadows. A usage sketch is shown below.
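A usage sketch with the BorutaPy implementation (X_train and y_train are hypothetical names for the preprocessed training split):

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

# The random forest supplies the feature importances that Boruta
# compares against the shuffled "shadow" features.
rf = RandomForestRegressor(n_jobs=-1)
selector = BorutaPy(rf, n_estimators="auto", random_state=42)
selector.fit(X_train.values, y_train.values.ravel())  # BorutaPy wants ndarrays

selected_features = X_train.columns[selector.support_].tolist()
print(selected_features)
```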

Modeling

  1. Machine Learning Algorithms: In this step, five machine learning models were trained and compared. The first was an average-based model used as a baseline; the second and third were a linear regression and a regularized linear regression (Lasso), used to gauge the complexity of the phenomenon being modeled (the worse their results, the more complex the phenomenon); the fourth and fifth were a Random Forest Regressor and an XGBoost Regressor, more sophisticated models that tend to obtain better results on complex phenomena.

    ML Results:

    [Model results]

    • Cross-Validation: applied to the training dataset to check whether the initial results were real, or whether the validation data was positively or negatively biased. The technique splits the dataset into several training/validation folds (following the chronological order of the data) and repeats the prediction on each fold; the average of the fold metrics gives a more reliable estimate of model performance (see the sketch after this item). Cross-Validation Results:

      [Cross-validation results]

      Although the random forest initially presented a better result, XGBoost was chosen because it performed very similarly after the hyperparameter fine-tuning step, while training faster and producing a much smaller final model than the random forest.
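A sketch of chronological cross-validation, using scikit-learn's TimeSeriesSplit as a stand-in for the notebook's own chronological split (X and y are hypothetical names for the selected features and the log-transformed target, sorted by date):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_percentage_error  # sklearn >= 0.24

# Each fold trains on the past and validates on the following period,
# mirroring the six-week forecasting setup.
tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, valid_idx in tscv.split(X):
    model = xgb.XGBRegressor(n_estimators=500)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[valid_idx])
    scores.append(mean_absolute_percentage_error(y.iloc[valid_idx], pred))

print(f"MAPE: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```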

  2. Hyperparameter Fine Tuning: In this step, we look for the parameters that extract even better performance from the model chosen in the previous step. A Random Search was applied: it randomly samples combinations from a given set of parameter values and returns the combination with the best result (a sketch follows).

    Final Results:

    [Final results after fine tuning]
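One way to run the random search with scikit-learn (the candidate values below are illustrative placeholders, not the grid used in the notebook):

```python
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Candidate values are illustrative placeholders.
param_distributions = {
    "n_estimators": [1500, 2500, 3500],
    "max_depth": [3, 5, 9],
    "learning_rate": [0.01, 0.03],
    "subsample": [0.1, 0.5, 0.7],
    "colsample_bytree": [0.3, 0.7, 0.9],
    "min_child_weight": [3, 8, 15],
}

search = RandomizedSearchCV(
    xgb.XGBRegressor(objective="reg:squarederror"),
    param_distributions,
    n_iter=10,                       # number of random combinations tried
    scoring="neg_mean_absolute_percentage_error",
    cv=TimeSeriesSplit(n_splits=5),  # keep the chronological validation
)
search.fit(X, y)
print(search.best_params_)
```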

Evaluation

  1. Performance:
    • Financial Results: At this stage, the model's output was translated into a business result. For a better view of the financial results, the predictions of all stores were summed, and best and worst scenarios were created from the mean absolute percentage error (MAPE) of 10% reported in the Hyperparameter Fine Tuning step (the arithmetic is sketched after this list).

      [Prediction scenarios per store]

    • Machine Learning Performance: As the image below shows, the model's predictions track the real data closely, indicating an accurate model. A deeper analysis of the model's performance is available in the notebook, in the Translation and Error Interpretation section.

      [Model performance plots]
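The scenario arithmetic is simple: the MAPE is used as a symmetric error band around the point forecast. A self-contained sketch (the aggregate value is hypothetical):

```python
def revenue_scenarios(total_prediction: float, mape: float = 0.10) -> dict:
    """Best/worst revenue scenarios around the point forecast,
    treating the model's MAPE as a symmetric error band."""
    return {
        "prediction": total_prediction,
        "worst_scenario": total_prediction * (1 - mape),
        "best_scenario": total_prediction * (1 + mape),
    }

# Hypothetical aggregate six-week forecast summed over all stores:
print(revenue_scenarios(280_000_000.0))
```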

Deploy

  1. Deploy Model to Production: To publish the model in production while keeping the initial goal of easy, quick access, a Telegram bot was developed. The user sends the ID of the store whose forecast they want; an application hosted on Heroku receives the message through its APIs, loads the data needed for that store, applies the machine learning model, and returns the result to the user. Below is a flowchart of this process, followed by a sketch of the prediction API:

    [Deployment architecture flowchart]
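A minimal sketch of the Flask prediction API at the center of that flow (the route, pickle path and request format are assumptions, not the repository's exact code):

```python
import pickle

import pandas as pd
from flask import Flask, Response, request

app = Flask(__name__)

# Trained model serialized during the modeling step (path assumed).
with open("model/model_rossmann.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/rossmann/predict", methods=["POST"])
def predict() -> Response:
    data = request.get_json()
    if not data:
        return Response("{}", status=200, mimetype="application/json")

    df = pd.DataFrame(data if isinstance(data, list) else [data])
    # In the real service, the raw store data would first pass through the
    # same cleaning/encoding pipeline used in training.
    df["prediction"] = model.predict(df)
    return Response(df.to_json(orient="records"),
                    status=200, mimetype="application/json")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```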

Rossmann Bot

[Telegram bot demo]

To access a store's sales forecast, start a Telegram conversation with RossmanBot @rossmann_sales_prediction_bot and enter the store id.

  • Next Steps: For the next CRISP cycle, we can make several improvements to the project:

    • Develop a secondary project to forecast the number of customers in the store in the next six weeks and use the model output as input in this sales forecast project.
    • Map more features that impact the business through a more accurate mind map, and raise and validate more business hypotheses to generate actionable insights for the business team.
    • Test different machine learning algorithms to get more accurate results.
    • Use a more sophisticated hyperparameter fine tuning technique.
    • Optimize the Telegram bot to provide more information to the user.
  • Lessons Learned: This was my first data science project, and the lessons learned were immense. Developing every stage of the project in an organized and structured way, from the initial understanding of the problem through model training, interpretation of the results and publication to production, added a lot to my knowledge of the area and to my ability to develop projects in a structured manner. All the analyses carried out contributed greatly to my understanding of the sales phenomenon in a chain of stores, which can be very useful in future analyses in several areas.
