aam2017 / mlops_dvc_in_jupyter

This project forked from scott-wenzel/jupyter-notebook-dvc

A minimalistic integration of DVC with a simple Jupyter Notebook. Using this guideline, you can keep working in a notebook while enjoying most of the benefits of data and model versioning. For more information, see the README in the example project.

License: Apache License 2.0

Jupyter Notebook 100.00%

mlops_dvc_in_jupyter's Introduction

Jupyter-Notebook-DVC

Source: https://dagshub.com/DAGsHub-Official/Jupyter-Notebook-DVC

A minimalistic integration of DVC with a simple Jupyter Notebook. Using this guideline, you can keep working in a notebook while enjoying most of the benefits of data and model versioning.

Instructions

  1. Clone the repo.
  2. (Recommended) Create and activate a virtualenv under the env/ directory. Git is already configured to ignore it.
  3. Install the very minimal requirements using pip install -r requirements.txt
  4. Run Jupyter in whatever way works for you. The simplest would be to run pip install jupyter && jupyter notebook.
  5. All relevant code and instructions are in Example.ipynb.

Explanation

This project structure is an example of how to work with DVC from inside a Jupyter Notebook.

This workflow should enable you to enjoy the full benefits of working with Jupyter Notebooks, while getting most of the benefit out of DVC - namely, reproducible and versioned data science.

The project takes a toy problem as an example - the California housing dataset, which comes packaged with scikit-learn. You can just replace the relevant parts in the notebook with your own data and code. Significantly different project structures might require deeper intervention.

The idea is to leverage DVC in order to create immutable snapshots of your data and models as part of your git commits. To enable this, we created the following DVC stages:

  1. Raw data - kept in data/raw/, versioned in data/raw.dvc
  2. Processed data - kept in data/processed/, versioned in process_data.dvc
  3. Trained models - kept in models/, versioned in models.dvc
  4. Metrics - kept in metrics/metrics.json, versioned as part of the git commit and referenced in models.dvc
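As a rough illustration of the layout above, the sketch below records the stage paths in a dictionary and writes a metrics file to metrics/metrics.json. The names STAGES, METRICS_FILE and write_metrics are hypothetical, introduced only for this example, and are not taken from Example.ipynb.

```python
import json
from pathlib import Path

# Paths copied from the stage list above. The dictionary itself and the
# helper below are hypothetical names used only for this sketch.
STAGES = {
    "raw": {"path": "data/raw", "dvc_file": "data/raw.dvc"},
    "processed": {"path": "data/processed", "dvc_file": "process_data.dvc"},
    "models": {"path": "models", "dvc_file": "models.dvc"},
}
METRICS_FILE = "metrics/metrics.json"


def write_metrics(metrics, root="."):
    """Write a dict of metrics to metrics/metrics.json under `root`."""
    target = Path(root) / METRICS_FILE
    # Create the metrics/ directory if the notebook has not made it yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys keep the file stable, so git diffs stay readable.
    target.write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return target
```

Because the metrics file is small plain text, it can be committed to git directly, which is what makes it easy to compare metrics between commits.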

Unlike a typical DVC project, which requires you to refactor your code into modules that are runnable from the command line, this project aims to let you stay in your comfortable notebook home territory.

So, instead of using dvc repro or dvc run commands, just run your code as you normally would in Example.ipynb. We prepared special cells (marked with green headers) inside this notebook that let you run dvc commit commands on the relevant DVC stages defined above, immediately after you create the relevant data files from your notebook code.
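A minimal sketch of what such a commit cell could look like if written in Python rather than as a shell escape. The helper names are hypothetical, and passing --force (to skip the interactive confirmation prompt) is an assumption about what suits a notebook, not something confirmed by the example project.

```python
import subprocess


def dvc_commit_command(dvc_file, force=True):
    """Build the argument list for `dvc commit` on one .dvc file."""
    cmd = ["dvc", "commit"]
    if force:
        # Assumption: --force avoids a confirmation prompt, which a
        # notebook cell cannot answer interactively.
        cmd.append("--force")
    cmd.append(dvc_file)
    return cmd


def commit_stage(dvc_file, dry_run=True):
    """Return the command, and run it only when dry_run is False."""
    cmd = dvc_commit_command(dvc_file)
    if not dry_run:
        # Requires DVC to be installed and the working directory to be
        # inside an initialized DVC repository.
        subprocess.run(cmd, check=True)
    return cmd
```

For example, calling commit_stage("process_data.dvc", dry_run=False) right after the cell that writes data/processed/ would record the new state of the processed data.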

dvc commit computes the hash of the versioned data and saves that hash as text inside the relevant .dvc file. The data itself is ignored and not versioned by git, instead being versioned with DVC. However, the .dvc files, being plain text files, ARE checked into git.
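To make the hashing idea concrete, here is a small sketch that computes a content hash for a single file. It assumes MD5, which DVC uses by default for individual files. How DVC hashes whole directories and lays out the .dvc file is more involved and is not shown. The function name md5_of_file is hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for large data files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Any change to the file contents yields a different digest, so a .dvc file that stores the digest pins one exact version of the data.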

So, to summarize, this workflow should enable you to create a git commit which contains all relevant code, together with references to the relevant data and the resulting models and metrics. Painless reproducible data science!

It's intended as a guideline - definitely feel free to play around with its structure to suit your own needs.


To create a project like this, just go to https://dagshub.com/repo/create and select the Jupyter Notebook + DVC project template.

Made with 🐶 by DAGsHub.

Contributors: scott-wenzel
