avianaglobal / docker-cookiecutter-data-science

This project forked from manifoldai/docker-cookiecutter-data-science

A fork of the cookiecutter-data-science leveraging Docker for local development.

Home Page: http://drivendata.github.io/cookiecutter-data-science/

License: MIT License

Docker Cookiecutter Data Science

Helping Data Science teams easily move to a Docker-first development workflow to iterate and deliver projects faster and more reliably.

New to Docker? Check out this writeup on containers vs virtual machines and how Docker fits in:

https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b

Cookiecutter is a command-line utility that automatically scaffolds new projects from templates (referred to as cookiecutters):

http://cookiecutter.readthedocs.io/en/latest/readme.html

This cookiecutter is used in conjunction with a base development image available on Docker Hub to provide an environment that is ready out of the box for many Data Science and Machine Learning use cases. After running this cookiecutter and the provided start script, a developer will have a local development setup that looks like this:

[Diagram: Docker local development setup]

By scaffolding your data science projects using this cookiecutter you will get:

  • Project Docker image built with your own Dockerfile for project specific requirements
  • Docker Compose configuration that dynamically binds to a free host port and forwards it to the Jupyter server's listening port inside the container
  • Shared volume configuration for accessing and executing all your project code inside of the controlled container environment
  • Ability to edit code using your favorite IDE on your host machine and see changes reflected in the runtime environment in real time
  • Jupyter notebook fully configured with nb-extensions ready for development and feature engineering
  • Common data science and plotting libraries pre-installed in the container environment to start working immediately
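
The free-port binding mentioned in the list above can be illustrated with a minimal Python sketch. It shows only the general idea of letting the operating system pick an unused port; Docker has its own allocation logic, which this does not reproduce, and `find_free_port` is a hypothetical helper that is not part of this template.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the operating system for a currently unused TCP port on `host`.

    Hypothetical helper for illustration only; not part of the template.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 tells the OS to choose any free ephemeral port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(f"A free host port right now: {find_free_port()}")
```

Because the port is chosen at startup, it can differ from run to run, which is why the host port has to be looked up after the container starts.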

Moving to a container-first workflow has several downstream benefits for model and inference engine deployment and delivery. By using containers early in the development cycle, you can remove many of the configuration management issues that waste developer time and ultimately lower the quality of deliverables.

Getting Started

  1. Install Docker:
  2. Install the Python Cookiecutter package, version 1.4.0 or later (http://cookiecutter.readthedocs.org/en/latest/installation.html):
    $ pip install cookiecutter
    It is recommended to set up a central virtualenv or condaenv for cookiecutter and any other "system" wide Python packages you may need.
  3. Run the cookiecutter docker data science template to scaffold your new project:
    $ cookiecutter https://github.com/manifoldai/docker-cookiecutter-data-science.git
  4. Answer all of the cookiecutter prompts for project name, description, license, etc.
  5. Run the start script from the root of your new project directory:
    $ ./start.sh
  6. After the project image builds, check which host port is being forwarded to the Jupyter notebook server inside the running container:
    $ docker ps 
  7. Using any browser, access your notebook at localhost:{port}
  8. Start working!
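
The port lookup in steps 6 and 7 can also be scripted. The sketch below parses a PORTS value of the kind `docker ps` prints (for example `0.0.0.0:32768->8888/tcp`) and returns the host port. It assumes the Jupyter server listens on port 8888 inside the container, which is Jupyter's default but is not stated in this README; `host_port_for` is a hypothetical helper, and the sample string is made up rather than copied from real output.

```python
import re
from typing import Optional

# Matches one published-port entry such as "0.0.0.0:32768->8888/tcp".
_PORT_MAPPING = re.compile(r":(?P<host>\d+)->(?P<container>\d+)/(?P<proto>\w+)")


def host_port_for(ports_field: str, container_port: int = 8888) -> Optional[int]:
    """Return the host port forwarded to `container_port`, or None if absent.

    `container_port=8888` is an assumption (Jupyter's default port).
    """
    for match in _PORT_MAPPING.finditer(ports_field):
        if int(match.group("container")) == container_port:
            return int(match.group("host"))
    return None


if __name__ == "__main__":
    example = "0.0.0.0:32768->8888/tcp"  # made-up sample value, not real output
    port = host_port_for(example)
    print(f"Open http://localhost:{port}")
```

If your project's Jupyter server uses a different container port, pass it as `container_port`.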

For details on the packages pre-installed in the base image, see the manifoldai/orbyter-ml-dev repository page on Docker Hub.

Project Structure

The directory structure of your new project looks like this:

├── LICENSE
├── Dockerfile            <- New project Dockerfile that sources from base ML dev image
├── docker-compose.yml    <- Docker Compose configuration file
├── docker_clean_all.sh   <- Helper script to remove all containers and images from your system
├── start.sh              <- Script to run docker compose and any other project specific initialization steps 
├── Makefile              <- Makefile with commands like `make data` or `make train`
├── README.md             <- The top-level README for developers using this project.
├── data
│   ├── external          <- Data from third party sources.
│   ├── interim           <- Intermediate data that has been transformed.
│   ├── processed         <- The final, canonical data sets for modeling.
│   └── raw               <- The original, immutable data dump.
│
├── docs                  <- A default Sphinx project; see sphinx-doc.org for details
│
├── models                <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks             <- Jupyter notebooks. Naming convention is a number (for ordering),
│                            the creator's initials, and a short `-` delimited description, e.g.
│                            `1.0-jqp-initial-data-exploration`.
│
├── references            <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports               <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures           <- Generated graphics and figures to be used in reporting
│
├── requirements.txt      <- The requirements file for reproducing the analysis environment, e.g.
│                            generated with `pip freeze > requirements.txt`
│
├── src                   <- Source code for use in this project.
│   ├── __init__.py       <- Makes src a Python module
│   │
│   ├── data              <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features          <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models            <- Scripts to train models and then use trained models to make
│   │   │                    predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization     <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini               <- tox file with settings for running tox; see tox.testrun.org
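
The notebook naming convention described in the tree above (a number for ordering, the creator's initials, and a `-` delimited description) can be checked with a short script. This is a minimal sketch under stated assumptions: the README gives only the single example `1.0-jqp-initial-data-exploration`, so the lowercase-only initials, the optional `.ipynb` suffix, and the helper name `parse_notebook_name` are all illustrative choices rather than rules defined by the template.

```python
import re

# <number>-<initials>-<hyphen-delimited description>, optionally ending in .ipynb.
# Lowercase initials and the optional suffix are assumptions, not template rules.
_NOTEBOOK_NAME = re.compile(
    r"^(?P<order>\d+(?:\.\d+)*)"
    r"-(?P<initials>[a-z]+)"
    r"-(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)"
    r"(?:\.ipynb)?$"
)


def parse_notebook_name(name: str):
    """Split a notebook name into (order, initials, description), or return None."""
    match = _NOTEBOOK_NAME.match(name)
    if match is None:
        return None
    return match.group("order"), match.group("initials"), match.group("description")


if __name__ == "__main__":
    print(parse_notebook_name("1.0-jqp-initial-data-exploration"))
    print(parse_notebook_name("Untitled.ipynb"))
```

A check like this could run over the `notebooks` directory before committing, so that notebooks stay ordered and attributable.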

Video Demo

Torus Demo Youtube

Helpful Resources

Why Did We Build This?

We are trying to bridge the gap that exists between data science and dev/operations teams today. We wrote about it here: https://medium.com/manifold-ai/torus-a-toolkit-for-docker-first-data-science-bddcb4c97b52

Contributing

PRs and feature requests are very welcome!
