GithubHelp home page GithubHelp logo

course-ds-base's Introduction

course-ds-base

Preparation

1. Fork / Clone this repository

git clone https://github.com/iterative/course-ds-base.git
cd course-ds-base

2. Create and activate virtual environment

Create virtual environment named dvc-venv (you may use other name)

python3 -m venv dvc-venv
echo "export PYTHONPATH=$PWD" >> dvc-venv/bin/activate
source dvc-venv/bin/activate

Install python libraries

pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

Add Virtual Environment to Jupyter Notebook

python -m ipykernel install --user --name=dvc-venv

Configure ToC for jupyter notebook (optional)

jupyter contrib nbextension install --user
jupyter nbextension enable toc2/main

3. Run Jupyter Notebook

jupyter notebook

4. synopsis

  • don't use WOTs in jupyter NBs
  • start from a reusable / concise / installable project layout
  • modularize the code into a reusable, package(-able) structure and format
  • use config files rather than hard-coded config & config management tools
  • structurize the POC / model training code into manageable pipeline stages
  • use dvc to track and version trainig data/assets/pipelines
  • use dvc to run and reproduce the pipeline

5. Thoughts going through my head

  • no linter / import cleanup tools?
  • config management by diverse, folder-dispersed yaml-files? -> hydra?!
  • sensitive information in jp-notebooks? -> better use nbstripout instead
  • usage of namespace packages: https://docs.python.org/3/reference/import.html#namespace-packages
  • relative imports on Windows systems requires careful handling of env vars / thoughts on structure -> better use packaging (as done here)
  • why special repo structure and not usage of a widely accepted cookiecutter approach?
  • basically only refers to PEP420, but is used here in context simply via grouping code under ./src
  • arg parse in stages instead of hydra / other config management packages and/or typer
  • DS blueprint is also mentioned: https://data-science-blueprint.readthedocs.io/en/latest/presentation/schema.html

5.1 topic of the envvars + environment setup

  • setup of the environment is centered on linux systems only
  • tips and hints are linux-specific
  • the PYTHONPATH bake-in at environment generation helps to include code seamlessly
  • however, for other platforms, this does not work as intended or is not possible
  • a more platform-independent solution for managing environment variables is not included
  • a better solution, using a real packaging approach for code-submodules is not included
  • open: what is the connection between PYTHONPATH and python / jupyter on linux?
  • setting windows system - wide environment variable: https://www3.ntu.edu.sg/home/ehchua/programming/howto/Environment_Variables.html
  • setting windows system - wide environment variable w. powershell: https://www.sharepointdiary.com/2021/05/powershell-set-environment-variable.html
  • using PYTHONPATH requires manual starts of jupyter w.o. using the VS Code integrated NB viewer / runner

5.2. DVC as a versioning and ml workflow management tool

  • pip install dvc
  • pip install "dvc[s3]" (or [all] to support all remote stages)
  • init: dvc init (-> creates .dvc)
  • then after the init git add . / git commit -m "init dvc"
  • adding changes to dvc works just like with git
  • dvc run -n "name" -d "dependency" -o "outputs" -p "params"
  • programmtically: dvc.yaml
    • specifies cmds, params, names, stages and the like
  • dvc run is outdated -> use dvc stage add instead
  • dvc repro to reproduce the stages
  • dvc repro -f to force complete rerun, -s for specific changes
  • automated versioning of dvc pipeline outputs

5.3. DVC data management under the hood

  • DVC creates md5 hashes of files and checks if they already are tracked and their metadata already exists
  • the tracked files are saved into the DVC (project) cache directory
  • removes the original file from the working directory
  • creates a reflink / symlink to the working directory
  • saves metadata information to file.csv.dvc
  • works similar with directories
  • dvc status, dvc checkout available
  • manual checkout also the associated git commits
  • dvc checkout:
    • helps in checking out local remote files
    • but no version / branch naming?! -> bc. it should be associated with specific git branches and associated .lock - files
    • adds additional source of errors if people miss out on running checkout after git checkout
  • dvc get to download files from a remote only
  • dvc import to add remote storages into the data versioning & enable tracking of new data

5.4. Comparing experiments, plotting and metrics tracking with dvc

  • need specific lines in the dvc.yaml, not as straightforward as mlflow, for example
  • allows versioning differences in metrics / tracking
  • plotting functions can handle csv files and also be used to generate the diff between different runs
  • handy, but need a hand-specified underlying template for specific data sources
  • everything beyond base templates comes with extra work
  • dvc then takes the template and the source & generates html pages from that
  • these can be stored as png/jpg files & compared
  • dvc can also help tracking model training checkpoints, integrated code and config changes
  • using the --live flag enables dvc-live tracking of the model training progress (especially helpful in deep learning scenarios)
  • this also needs new lines importing dvclive, for example in the training scripts, as well as epoch start/stop tasks
  • details for model training progress tracking (esp. training time, etc.) -> not on the branche of the og training repo?!
  • DVC Studio -> setup howto?

5.5. Experiments management and collaboration

  • dvc also allows cml-style -S --set-params to edit the params.yaml file with dvc exp ("experiment")
  • users can trigger different runs with that, instead of using dv repro to only run specific sections under changes
  • hence, dvc also supports hydra-style overrides, including jobs, queuing, etc.
  • also allows parallel, sequential job working
  • dvc exp show --include-params=featurize
  • dvc exp branch + ID + name -> creates a new branch with git
  • dvc exp apply + ID -> gets a particular experiment into the current workspace

5.6. Data registry and additional topics with DVC: CML

  • middleware on top of storage systems -> i.e. git repo + dvc tracking is the registry
  • persist data sets, feature stores, models
  • call / load from cli or via python dvc package
  • CML ~= continuous machine learning, with the main use being, after iterative:

"We can use GitHub Actions or GitLab CI/CD, and Bitbucket Pipelines to provision >cloud instances from our vendor of choice. CML will then run our pipeline on the provisioned instance, create a pull request for the experiment, and generate a report that shows the experiment's results. We can define multiple triggers to run our pipeline in CML. Typically we will run our pipeline whenever we create a new pull request for an experiment, but we can also imagine automating periodic retraining. Whatever our use case, we can tailor CML to our needs."

OT:

course-ds-base's People

Contributors

mxli417 avatar jendefig avatar mnrozhkov avatar dberenbaum avatar pared avatar moynihan18 avatar omesser avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.