helmholtzai-consultants-munich / quicksetup-ai

A flexible template, as a quick setup for deep learning projects in pytorch-lightning

Home Page: https://quicksetup-ai.readthedocs.io/

License: MIT License

Languages: Python 96.52%, Shell 2.05%, Makefile 1.43%
Topics: deep-learning, pytorch-lightning

quicksetup-ai's Introduction

Quicksetup-ai: A flexible template as a quick setup for deep learning projects in research


Docs | Quickstart | Tutorials

Description

This template combines pyscaffold datascience and lightning-hydra. It provides a general baseline for deep learning projects, including:

  • A predefined structure which simplifies the development of the project.
  • A set of tools for experiment tracking, hyperparameter search and rapid experimentation using configuration files. More details in lightning-hydra.
  • Pre-commit hooks and automatic documentation generation.

⚠️ Package compatibility: This template relies on PyTorch Lightning, whose API may change; we pin a fixed version of the package to ensure the template doesn't break.

Installation

Using Cookiecutter

  1. Create and activate your environment:

    conda create -y -n venv_cookie python=3.9 && conda activate venv_cookie
  2. Install cookiecutter in your environment:

    pip install cookiecutter dvc
  3. Create your own project using this template via cookiecutter:

    cookiecutter https://github.com/HelmholtzAI-Consultants-Munich/Quicksetup-ai.git

Quickstart

Create the pipeline environment and install the ml-pipeline-template package

Before using the template, one needs to install the project as a package.

  • First, create a virtual environment.

You can either do it with conda (preferred) or venv.

  • Then, activate the environment
  • Finally, install the project as a package. Run:
pip install -e .
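
Taken together, the steps above look like this (a sketch using conda; the environment name venv_quicksetup is arbitrary):

```shell
# create and activate an environment, then install the project as a package
conda create -y -n venv_quicksetup python=3.9
conda activate venv_quicksetup
pip install -e .    # run from the project root
```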

Run the MNIST example

This pipeline comes with a toy example (MNIST dataset with a simple feedforward neural network). To run the training (resp. testing) pipeline, simply run:

python scripts/train.py
# or python scripts/test.py

Or, if you want to submit the training job to a batch (resp. interactive) cluster node via Slurm, run:

sbatch job_submission.sbatch
# or sbatch job_submission_interactive.sbatch
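
Because the template uses Hydra, configuration values can also be overridden directly on the command line, for example (the option paths below follow the lightning-hydra convention and may differ in your generated project):

```shell
# override trainer and datamodule settings for a quick run (names assumed)
python scripts/train.py trainer.max_epochs=5 datamodule.batch_size=64
```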
  • The experiments, evaluations, etc., are stored under the logs directory.
  • The default experiment tracking system is mlflow. The mlruns directory is contained in logs. For a user-friendly view of the experiments, run:
# make sure you are inside logs (where mlruns is located)
mlflow ui --host 0.0.0.0
  • When evaluating (running test.py), make sure you set the correct checkpoint path in configs/test.yaml
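
For example, the checkpoint entry in configs/test.yaml looks roughly like this (the key name is assumed from the lightning-hydra convention; check your generated config):

```yaml
# configs/test.yaml (fragment)
ckpt_path: /path/to/your/checkpoint.ckpt
```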

Project Organization

├── configs                              <- Hydra configuration files
│   ├── callbacks                               <- Callbacks configs
│   ├── datamodule                              <- Datamodule configs
│   ├── debug                                   <- Debugging configs
│   ├── experiment                              <- Experiment configs
│   ├── hparams_search                          <- Hyperparameter search configs
│   ├── local                                   <- Local configs
│   ├── log_dir                                 <- Logging directory configs
│   ├── logger                                  <- Logger configs
│   ├── model                                   <- Model configs
│   ├── trainer                                 <- Trainer configs
│   │
│   ├── test.yaml                               <- Main config for testing
│   └── train.yaml                              <- Main config for training
│
├── data                                 <- Project data
│   ├── processed                               <- Processed data
│   └── raw                                     <- Raw data
│
├── docs                                 <- Directory for Sphinx documentation in rst or md.
├── models                               <- Trained and serialized models, model predictions
├── notebooks                            <- Jupyter notebooks.
├── reports                              <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures                                 <- Generated plots and figures for reports.
├── scripts                              <- Scripts used in project
│   ├── job_submission.sbatch               <- Submit training job to slurm
│   ├── job_submission_interactive.sbatch   <- Submit training job to slurm (interactive node)
│   ├── test.py                             <- Run testing
│   └── train.py                            <- Run training
│
├── src/<your_project_name>              <- Source code
│   ├── datamodules                             <- Lightning datamodules
│   ├── models                                  <- Lightning models
│   ├── utils                                   <- Utility scripts
│   │
│   ├── testing_pipeline.py
│   └── training_pipeline.py
│
├── tests                                <- Tests of any kind
│   ├── helpers                                 <- A couple of testing utilities
│   ├── shell                                   <- Shell/command based tests
│   └── unit                                    <- Unit tests
│
├── .coveragerc                          <- Configuration for coverage reports of unit tests.
├── .gitignore                           <- List of files/folders ignored by git
├── .pre-commit-config.yaml              <- Configuration of pre-commit hooks for code formatting
├── LICENSE.txt                          <- License as chosen on the command-line.
├── pyproject.toml                       <- Build configuration. Don't change! Use `pip install -e .`
│                                           to install for development, or `tox -e build` to build.
├── setup.cfg                            <- Declarative configuration of your project; also configures linters and pytest.
├── setup.py                             <- [DEPRECATED] Use `python setup.py develop` to install for
│                                           development or `python setup.py bdist_wheel` to build.
└── README.md

How to cite

@misc{mekki2022quicksetupai,
  author       = {Mekki, Isra and Vivar, Gerome and Subramanian, Harshavardhan and Merdivan, Erinc},
  title        = {Quicksetup-ai},
  year         = {2022},
  doi          = {10.5281/zenodo.10044608},
  url          = {https://github.com/HelmholtzAI-Consultants-Munich/Quicksetup-ai},
}

quicksetup-ai's People

Contributors: crlna16, gerome-v, isramekki0, janebert, merdivane

quicksetup-ai's Issues

Error in post_gen_project script

I get this error while trying to execute cookiecutter.

Traceback (most recent call last):
  File "C:\Users\aaa\AppData\Local\Temp\tmpg8lvl1vc.py", line 28, in <module>
    delete_license_dir()
  File "C:\Users\aaa\AppData\Local\Temp\tmpg8lvl1vc.py", line 24, in delete_license_dir
    subprocess.run(["rm", "-r", "licenses/"])
  File "C:\Users\aaa\AppData\Local\mambaforge\envs\venv_cookie\lib\subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Users\aaa\AppData\Local\mambaforge\envs\venv_cookie\lib\subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\aaa\AppData\Local\mambaforge\envs\venv_cookie\lib\subprocess.py", line 1420, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
ERROR: Stopping generation because post_gen_project hook script didn't exit successfully
Hook script failed (exit status: 1)

issue while running `pip install -e .`

I tried to follow the guidelines, but I always fail at the pip install -e . step.

I am doing this in VS Code, connected via SSH to Levante.

Obtaining file:///work/mh0033/m300883/Training_while_Runing_box/Training_while_Running
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... error
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> [14 lines of output]
/tmp/pip-build-env-mvw2ixwf/overlay/lib/python3.9/site-packages/setuptools_scm/_integration/setuptools.py:82: UserWarning: version of None already set
warnings.warn(f"version of {dist_name} already set")
running egg_info
creating src/training_while_running.egg-info
writing src/training_while_running.egg-info/PKG-INFO
writing dependency_links to src/training_while_running.egg-info/dependency_links.txt

  An error occurred while building the project, please ensure you have the most updated version of setuptools, setuptools_scm and wheel with:
     pip install -U setuptools setuptools_scm wheel
  error: Problems to parse EntryPoint(name='save_initial_data', value='training while running.utils.dvc_utils:save_initial_data', group='console_scripts').
  Please ensure entry-point follows the spec: https://packaging.python.org/en/latest/specifications/entry-points/
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Reproducibility issue

Hi,

I am not able to get reproducible results when using the pipeline with a ResNet on a small dataset.

Fixed by adding the following in training_pipeline under if config.get("seed"):

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.enabled = False

save_dir error in logger/wandb.yaml

Hi,

I would suggest to change
save_dir: ${original_work_dir}/logs/wandb to
save_dir: ${original_work_dir}/logs
Otherwise wandb tries to create ${original_work_dir}/logs/wandb/wandb and, the first time, complains if this folder does not exist already.

Python 3.11 on macOS not working due to missing torch==1.11.0 distribution

The setup fails on macOS together with Python 3.11 due to a missing torch==1.11.0 distribution for Mac.

This is unfortunate, as Python 3.11 is the default Python version on macOS. Supporting torch>=2.0.0 would resolve the issue, but that is incompatible with the pinned pytorch-lightning==1.5.10.

Is there any plan to support a more recent PyTorch in the future?

PS: Python 3.10 works without any problems.

Slurm scripts expect Conda environment

The Slurm sbatch scripts in scripts expect a global Conda environment called ml_template_env, which is not mentioned in the README.
Creating this Conda environment should be documented. Local Conda environments probably make sense here to avoid name clashes, e.g. with conda create -p ./ml_template_env and conda activate ./ml_template_env.

Memory allocation grows at every training step

Hello colleagues,

when setting up quicksetup in a recent project, we encountered a CUDA out-of-memory error. We traced it back and noticed that GPU memory usage was growing linearly after each training step. In the PyTorch Lightning training_step, an output dictionary is returned that contains predictions and targets as well as the loss:

        # we can return here a dict with any tensors
        # and then read it in some callback or in `training_epoch_end()` below
        # remember to always return loss from `training_step()` or else backpropagation will fail!
        return {"loss": loss, "preds": preds, "targets": targets}

I think the predictions and targets continue to live on the GPU here. I am not sure if that is intended, but if not, I would propose moving them off the GPU to avoid memory clogging up.
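
A minimal sketch of the proposed change, assuming the standard Tensor.detach()/Tensor.cpu() calls (a tiny stand-in class replaces torch.Tensor here so the sketch runs without PyTorch; all names are illustrative):

```python
class FakeTensor:
    """Stand-in for torch.Tensor so this sketch runs without PyTorch."""

    def __init__(self, device="cuda"):
        self.device = device

    def detach(self):
        # with torch, this would drop the autograd graph reference
        return FakeTensor(self.device)

    def cpu(self):
        # with torch, this would copy the tensor to host memory
        return FakeTensor("cpu")


def training_step_output(loss, preds, targets):
    # Keep `loss` as-is (Lightning needs its graph for backpropagation),
    # but detach predictions/targets and move them off the GPU so the
    # outputs cached for `training_epoch_end()` do not pin GPU memory.
    return {
        "loss": loss,
        "preds": preds.detach().cpu(),
        "targets": targets.detach().cpu(),
    }
```

Whether Lightning's caching of step outputs is the root cause here is the hypothesis above; the sketch only changes what the cached dictionary holds.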
