floridene / bigdataproject

BigData-Project: Supermarket Basket Analysis with Markov Chain, Apriori, XGBoost and RNN; M. Sc. Business Intelligence and Process Management, BSEL Berlin, Germany

Home Page: http://www.master-bipm.de/

R 100.00%
xgboost rnn apriori markov-chain

BigData-Project

Supermarket Basket Analysis with Markov Chain, Apriori, XGBoost and RNN

by Max Philipp, Ceyda Ugur and Vera Weidmann

M. Sc. Business Intelligence and Process Management, BSEL Berlin, Germany

Project Description

This repository covers a competition posted on Kaggle.com. The project kicked off in May 2017 and ran for three months, so the deadline was the end of July 2017. The data tables are provided by Instacart.

Instacart, a grocery ordering and delivery app, aims to make it easy for customers to fill their refrigerator and pantry with their personal favorites and staples when they need them. After customers select products through the Instacart app, personal shoppers review the order and do the in-store shopping and delivery.

The purpose of this project is to predict the products in each user's next order, based on that customer's orders over time.

Predictive Analytics

The R code can be found in the corresponding folder of this repository. The scripts also include explanations of our approach and the commands used.

Data Analysis

A comprehensive data analysis was carried out on Databricks Community Edition with Spark. Databricks provides a Unified Analytics Platform that aims to unify data science, engineering and business. It is based on Apache Spark and supports SQL data analysis as well as Python and R. The data tables are easy to access and code executes quickly. In addition, query results can be visualized with a single click, which makes them easier to understand and interpret.

Comprehensive Data Analysis and Visualizations

Some specific visualization results are presented in the following:

The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 users. For each user, Kaggle provides between 4 and 100 of their orders, with the sequence of products purchased in each order. The day of the week and hour of day each order was placed are also provided, along with a relative measure of time between orders.

To simplify and better understand the content of the csv files provided by Kaggle and the relationships between them, we visualized a schema of the data tables and their connections in the style of an SQL architecture diagram. All data types are specified, and the schema shows the origin of each column.

As the schema shows, there are 6 csv files, which together describe a relational set of tables for customers' orders over time. Each entity (customer, product, order, aisle, etc.) has an associated unique id. The most important tables for the project task are order_products_prior and order_products_train. These two tables are linked to the orders and products tables; in other words, the products and orders tables feed the order_products_prior and order_products_train tables. These two tables matter most because they contain the “reordered” column, which is the basis for predicting customers' next orders.

[Figure: schema of the csv files and their relationships]
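
To illustrate how the “reordered” column can feed a prediction, here is a minimal sketch that computes a per-product reorder rate. It is written in Python with only the standard library, whereas the repository's own scripts are in R. The sample rows, the tuple layout and the function name `reorder_rate` are assumptions made for illustration and are not taken from the project code.

```python
from collections import defaultdict

# Hypothetical sample rows in the spirit of order_products_prior:
# (order_id, product_id, reordered), where reordered is 0 or 1.
rows = [
    (1, 101, 0),
    (1, 102, 1),
    (2, 101, 1),
    (2, 103, 0),
    (3, 101, 1),
    (3, 102, 0),
]


def reorder_rate(rows):
    """Return {product_id: share of its purchases flagged as reorders}."""
    bought = defaultdict(int)
    reordered = defaultdict(int)
    for _order_id, product_id, flag in rows:
        bought[product_id] += 1
        reordered[product_id] += flag
    return {p: reordered[p] / bought[p] for p in bought}


rates = reorder_rate(rows)
for product_id, rate in sorted(rates.items()):
    print(product_id, round(rate, 2))
```

A rate like this could serve as one input feature among many; it is not meant as the complete model used in the project.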

There are 134 aisles in total. Each aisle groups products of a similar type. Our visualization, generated with Tableau, clearly shows that a very large number of products belongs to the fresh fruits aisle. We assume that customers have many options to choose from in this aisle, which might in turn increase its reorder rates.

[Figure: number of products per aisle (Tableau)]
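
The count behind such a visualization can be sketched as a join of products to aisles followed by a group count. The snippet below is a Python illustration with made-up data. The field names (`product_id`, `aisle_id`) and the function `products_per_aisle` are assumptions for this example, not the project's Tableau or R logic.

```python
from collections import Counter

# Made-up lookup table: aisle_id -> aisle name.
aisles = {1: "fresh fruits", 2: "yogurt"}

# Made-up products: (product_id, product_name, aisle_id).
products = [
    (101, "Banana", 1),
    (102, "Organic Strawberries", 1),
    (103, "Greek Yogurt", 2),
]


def products_per_aisle(products, aisles):
    """Count products in each aisle, largest aisle first."""
    counts = Counter(aisles[aisle_id] for _pid, _name, aisle_id in products)
    return counts.most_common()


ranking = products_per_aisle(products, aisles)
print(ranking)
```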

The bar chart below shows the products that are contained in more than 64,000 baskets. These products are ordered very often and are therefore considered “frequent”. The chart shows many fruits and vegetables, which supports our assumption that the fresh fruits aisle is probably the aisle customers buy from most. This is also related to the large number of products in that aisle. Banana stands out as the product in highest demand.

[Figure: bar chart of frequent products]
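
Selecting “frequent” products amounts to counting, for each product, the number of baskets that contain it and keeping those above a threshold. Counting single-item support like this is also the first pass of Apriori. The sketch below is a Python illustration under assumptions: the baskets are invented, and the threshold is 2 instead of 64,000 so that the toy data produces a result.

```python
from collections import Counter


def frequent_products(baskets, min_baskets):
    """Return products that appear in more than min_baskets baskets."""
    counts = Counter()
    for items in baskets.values():
        # A product counts once per basket, even if listed twice.
        counts.update(set(items))
    return {p: n for p, n in counts.items() if n > min_baskets}


# Invented baskets: order_id -> list of product names.
baskets = {
    1: ["Banana", "Milk"],
    2: ["Banana", "Bread"],
    3: ["Banana", "Milk", "Milk"],
    4: ["Bread"],
}

frequent = frequent_products(baskets, min_baskets=2)
print(frequent)
```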

In our exploratory analysis, we found that 262464 users have reordered products whose names contain the word “Organic”, and that there are 5035 organic products in total. This result does not surprise us, as interest in organic food has increased in the last few years.

[Figure: organic products]
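
Both figures can be derived with a name filter followed by a distinct count. The Python sketch below shows the idea on invented data. The case-insensitive match, the tuple layout and the function name `organic_stats` are assumptions for this example; the project's own query may differ.

```python
def organic_stats(products, purchases):
    """Return (number of organic products, number of users who reordered one)."""
    organic_ids = {
        pid for pid, name in products.items() if "organic" in name.lower()
    }
    users = {
        user
        for user, pid, reordered in purchases
        if reordered and pid in organic_ids
    }
    return len(organic_ids), len(users)


# Invented data: product_id -> name, and (user_id, product_id, reordered).
products = {101: "Organic Banana", 102: "Whole Milk", 103: "Organic Baby Spinach"}
purchases = [
    ("u1", 101, 1),
    ("u1", 103, 0),
    ("u2", 102, 1),
    ("u3", 103, 1),
    ("u3", 101, 1),
]

n_products, n_users = organic_stats(products, purchases)
print(n_products, n_users)
```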

There are 21 departments in total. Each department contains a different number of aisles, depending on the type of product. The “produce” department has the largest number of products. We can make the same assumption as for the aisle visualization: people will likely buy more products from the produce department than from other departments, because a larger product range makes a purchase from it more likely, so the dataset will contain more observations for produce. From a business perspective, the likely reason this department has more products is that these kinds of products are in high demand.

[Figure: number of products per department]

References

Steffen Rendle, Christoph Freudenthaler, Lars Schmidt-Thieme (2010): Factorizing Personalized Markov Chains for Next-Basket Recommendation

Shengxian Wan, Yanyan Lan, Pengfei Wang, Jiafeng Guo, Jun Xu, Xueqi Cheng (2015): Next Basket Recommendation with Neural Networks

Jakob Aungiers (2016): LSTM Neural Network for Time Series Prediction. URL: http://www.jakob-aungiers.com/articles/a/LSTM-Neural-Network-for-Time-Series-Prediction
