DS-Sparkify

Capstone Project for the Data Scientist Nanodegree @ Udacity. There is a blog post associated with it, which you can visit here.

Table of Contents

  1. Project Motivation

  2. File Descriptions

  3. Results and Conclusion

  4. Acknowledgements

Project Motivation

Customer churn has become one of the biggest concerns for digital companies, since every customer who drops the service is a direct loss of revenue. For that reason, there has been growing interest in applying statistical methods to predict which customers are at risk of churning, so that companies can take preemptive action.

The project presented here resulted from my enrollment in the Data Scientist Nanodegree at Udacity. In this project, we were provided with customer behavior data for a fictional company named Sparkify, meant to emulate real services such as Spotify or Pandora. The data consists of observations of user interactions with the music streaming application (songs listened to, likes, downgrades, thumbs up, thumbs down) together with metadata such as gender and location. With all this, we were asked to build a model to predict churn.

The following is my attempt at this. Since the data is relatively big, we used Spark to analyze it, relying on a cloud service (IBM Watson) that already provides an environment in which to run Jupyter notebooks and an installation of both Python and Spark. From a statistical perspective, we trained several models on these data, but ultimately used Gradient Boosting Machines, since it is the one we found to perform best on this specific dataset.
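As a sketch of the churn-labelling step (the event and column names below are hypothetical, assuming a "Cancellation Confirmation" page visit marks a churned user; the real pipeline does this with a Spark DataFrame aggregation):

```python
# Sketch: derive a per-user churn label from an event log.
# Assumes each event is a dict with "userId" and "page" keys, and that a
# "Cancellation Confirmation" page visit marks a churned user (hypothetical
# names; the actual project performs this as a Spark aggregation).

def label_churn(events):
    """Return {userId: 1 if churned else 0} from a list of event dicts."""
    users = {e["userId"] for e in events}
    churned = {e["userId"] for e in events
               if e["page"] == "Cancellation Confirmation"}
    return {u: int(u in churned) for u in users}

events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "Thumbs Up"},
]
print(sorted(label_churn(events).items()))  # [('1', 1), ('2', 0)]
```

The resulting 0/1 label per user is what the classifiers are trained to predict.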

Requirements and File Descriptions

NumPy, pandas and PySpark are required in order to run this project. You should also have an installation of Spark, or use a cloud service that provides one; we used the IBM Watson free tier for this purpose.

The data file is not included in the repository given its relatively big size. The repository therefore consists of two notebooks: one containing most of the code used during the exploratory analysis and early modelling attempts, and Sparkify.ipynb containing the final version with the results reported here.

Results and Conclusion

The main findings of the project, and some technical deep-dives, can be found in the post available here.

As stated, Gradient Boosting Machines was the best-performing algorithm on this dataset and the feature space we explored. The following are the best cross-validated F1 scores on the training set. The rationale behind this choice of metric is that the dataset is imbalanced, and we care more about precision and recall than about plain accuracy.

  • Logistic Regression: 0.69
  • SVM: 0.71
  • GBT: 0.76
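To illustrate why F1 is preferred here over plain accuracy (a toy sketch with made-up numbers, not the project's actual data): on an imbalanced set, a baseline that always predicts "no churn" scores high accuracy yet a useless F1 for the churn class.

```python
# Toy illustration (invented numbers): on an imbalanced dataset, a classifier
# that always predicts the majority class ("no churn") gets high accuracy
# but a zero F1 score for the churn class.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1] * 10 + [0] * 90      # 10% churners
y_pred = [0] * 100                # majority-class baseline
print(accuracy(y_true, y_pred))   # 0.9
print(f1(y_true, y_pred))         # 0.0
```

An F1 of 0.76 therefore says considerably more about the churn class than a high raw accuracy would.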

After confirming that GBT is the best-performing model, we did a final evaluation on the test set.

  • Area under ROC (test): 0.7996
  • Accuracy: 0.8592
  • F1 score: 0.8423

These are, surprisingly, even better than the cross-validated results on the training set. Taking into account that many more features could have been engineered, the results are very encouraging.
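For intuition on the reported area under the ROC curve (a sketch with invented scores, not the project's predictions): AUC equals the probability that a randomly chosen churner receives a higher model score than a randomly chosen non-churner.

```python
# Sketch: ROC AUC as the probability that a random positive outranks a
# random negative (invented labels and scores; Spark's
# BinaryClassificationEvaluator computes the same quantity at scale).

def roc_auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Count pairwise wins; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.1]
# 5 of the 6 positive/negative pairs are ranked correctly:
print(roc_auc(labels, scores))  # 0.8333333333333334
```

By this reading, the model ranks a true churner above a non-churner roughly 80% of the time.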

In conclusion, churn problems are present across different industries and businesses, and they can take different forms, such as employee attrition. Knowing how to tackle them therefore seems like a good skill to have these days.

On the other hand, working with data that is more than a few hundred megabytes, even if it can be stored in memory, can still benefit from distributed computing frameworks such as Spark. We have also learnt that Spark DataFrames offer a highly intuitive API, close to pandas, and come with implementations of many machine learning algorithms. Among them, typically high-scoring algorithms such as gradient boosting machines (hello to the Kaggle community out there) are available and ready to use, which is certainly a good fit for the problem at hand.

When working with big datasets, exploiting Spark's lazy evaluation is paramount to writing code that is as efficient as possible. In this case, most of the feature engineering only happens once the final aggregated DataFrame (the one with one row per user and the calculated features) is evaluated.
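Spark itself may not be installed where this README is read, but the idea can be mimicked with plain Python generators (an analogy only, not Spark code): the pipeline describes work without doing it, and nothing runs until a terminal "action" consumes it.

```python
# Analogy only: like Spark transformations, a generator pipeline describes
# work without performing it; computation happens only when an "action"
# (here, sum()) consumes the pipeline.

calls = []

def expensive_feature(x):
    calls.append(x)          # record when work actually happens
    return x * x

data = range(5)
pipeline = (expensive_feature(x) for x in data)  # "transformation": lazy
assert calls == []            # nothing computed yet

total = sum(pipeline)         # "action": triggers the whole pipeline
assert calls == [0, 1, 2, 3, 4]
print(total)                  # 30
```

In Spark, chained `withColumn`/`groupBy` calls behave like the generator line, and actions such as `count()` or `show()` play the role of `sum()` here.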

Acknowledgements

I would like to thank the Udacity Data Science Nanodegree team; they provided me with an excellent set of tools for this project, and the idea to tackle it is theirs, which proved to be an excellent way to learn to use Spark. Also, a huge clap to the parties collaborating with them who provided the data we used.
