License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)


Predicting Spotify Song Popularity: A Refactoring Journey

This lesson is part of the Beyond Jupyter series, in which we address the specifics of software design in machine learning contexts. We have put forth a set of guiding principles that can critically inform the decision-making process during development.

In the case study at hand, we will show how a machine learning use case that is implemented as a Jupyter notebook (which was taken from Kaggle) can be successively refactored in order to

  • improve the software design in general, achieving a high degree of clarity and maintainability,
  • gain flexibility for experimentation,
  • appropriately track results,
  • arrive at a solution that can straightforwardly be deployed for production.

The use case considers a dataset from Kaggle containing metadata on approximately one million songs (see download instructions below). The goal is to learn a model that predicts a song's popularity from its other attributes, such as the tempo, the release year, the key, the musical mode, etc.
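
To make the task concrete, here is a minimal, self-contained sketch of the supervised learning framing with made-up data (not the actual Kaggle CSV): song attributes as inputs, popularity as the target, and a trivial mean-predicting baseline.

```python
# Minimal sketch of the prediction task with made-up data
# (the real dataset is spotify_data.csv with ~1M songs).

# Each row: input attributes plus the popularity target.
songs = [
    {"tempo": 118.0, "year": 2012, "key": 5, "mode": 1, "popularity": 64},
    {"tempo": 95.5,  "year": 1998, "key": 0, "mode": 0, "popularity": 41},
    {"tempo": 140.2, "year": 2020, "key": 7, "mode": 1, "popularity": 78},
]

X = [{k: v for k, v in s.items() if k != "popularity"} for s in songs]
y = [s["popularity"] for s in songs]

# Trivial baseline: always predict the mean popularity seen in training.
mean_popularity = sum(y) / len(y)

def predict(song_attributes):
    return mean_popularity

print(round(predict(X[0]), 1))  # → 61.0
```

Every model considered in the journey is an attempt to beat such a baseline by actually exploiting the input attributes.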

Preliminaries

In order for the code of each step to be runnable, set up a Python virtual environment and download the Spotify song data.

Python Environment

Use conda to create an environment based on environment.yml in the root folder of this repository:

conda env create -f environment.yml

This will create a conda environment named pop.

Configure Your IDE's Runtime Environment

Configure your IDE to use the pop environment created in the previous step.

Downloading the Data

You can download the data in two ways:

  • Manually download it from the Kaggle website. Place the CSV file spotify_data.csv in the data folder (in the root of this repository).


  • Alternatively, use the script load_data.py to automatically download the raw data CSV file to the subfolder data on the top level of the repository. Note that a Kaggle API key, which must be configured in kaggle.json, is required for this (see instructions).
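
For reference, the kaggle.json file containing your API credentials has the following format (placeholder values shown); the Kaggle API looks for it at ~/.kaggle/kaggle.json by default:

```json
{
  "username": "your-kaggle-username",
  "key": "your-api-key"
}
```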

How to Use This Package

This package is organised as follows:

  • There is one folder per step in the refactoring process with a dedicated README file explaining the key aspects of the respective step.
  • There is an independent Python implementation of the use case in each folder, which you should inspect alongside the README file.

The intended way of exploring this package is to clone the repository and open it in your IDE of choice, such that you can browse it with familiar tools and navigate the code efficiently.

Diffing

To see the concrete changes from one step to the next more clearly, you can make use of a diff tool. To support this, run the Python script generate_repository.py in the folder refactoring-journey/; it creates a git repository in the folder refactoring-repo in which the state of each step is referenced by a separate tag. In that folder, you can then run, for example,

    git difftool step04-refactoring step05-sensai

Steps in the Journey

These are the steps of the journey:

  1. Monolithic Notebook

    This is the starting point, a Jupyter notebook which is largely unstructured.

  2. Python Script

    This step extracts the code that is strictly concerned with the training and evaluation of models.

  3. Dataset Representation

    This step introduces an explicit representation for the dataset, making transformations explicit as well as optional.

  4. Model-Specific Pipelines

    This step refactors the pipeline to move all transforming operations into the models, enabling different models to use entirely different pipelines.

  5. Refactoring

    This step improves the code structure by adding function-specific Python modules.

  6. sensAI

    This step introduces the high-level library sensAI, which will enable more flexible, declarative model specifications down the line. It furthermore facilitates logging, model evaluation and helps with other minor details.

  7. Feature Representation

    This step separates representations of features and their properties from the models that use them, allowing model input pipelines to be flexibly composed.

  8. Feature Engineering

    This step adds an engineered feature to the mix.

  9. Tracking Experiments

    This step adds tracking functionality via sensAI's mlflow integration and by logging directly to the file system.

  10. Regression

    This step considers the perhaps more natural formulation of the prediction problem as a regression problem.

  11. Hyperparameter Optimisation

    This step adds hyperparameter optimisation for the XGBoost regression model.

  12. Cross-Validation

    This step adds the option to use cross-validation.

  13. Deployment

    This step adds a web service for inference, which is packaged in a docker container.
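
The design that the steps above converge on can be illustrated by a minimal sketch: an explicit dataset representation, models that own their input handling, and a shared evaluation routine. The names here are purely hypothetical illustrations, not the project's actual API (which builds on sensAI's abstractions).

```python
# Hypothetical sketch of the target design (illustrative names only):
# explicit dataset representation + model + shared evaluation.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Dataset:
    """Explicit dataset representation: rows of attributes plus a target column."""
    rows: list
    target: str = "popularity"

    def inputs(self):
        return [{k: v for k, v in r.items() if k != self.target} for r in self.rows]

    def targets(self):
        return [r[self.target] for r in self.rows]

class MeanRegressor:
    """A trivial model; real models would own their full input pipeline."""
    def fit(self, X, y):
        self._mean = mean(y)
        return self

    def predict(self, X):
        return [self._mean for _ in X]

def evaluate(model, dataset):
    """Fit the model and report mean absolute error (on the training data, for brevity)."""
    X, y = dataset.inputs(), dataset.targets()
    preds = model.fit(X, y).predict(X)
    return mean(abs(p - t) for p, t in zip(preds, y))

ds = Dataset(rows=[
    {"tempo": 118.0, "year": 2012, "popularity": 64},
    {"tempo": 95.5,  "year": 1998, "popularity": 41},
    {"tempo": 140.2, "year": 2020, "popularity": 78},
])
print(evaluate(MeanRegressor(), ds))
```

Because the dataset and models are decoupled in this way, swapping models, adding features, or wrapping the whole thing in a web service (as in the deployment step) requires no changes to the evaluation code.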

Contributors

opcode81, mdbenito, schroedk
