GithubHelp home page GithubHelp logo

template's Introduction

Description

This is a template repository, which can be used to initialize new repositories for deep learning fast prototyping. The template allows training for single-node single-GPU models, single-node multi-GPU models, or multi-node multi-GPU models with SLURM. Checkpointing and auto-requeuing are supported on slurm as well.

Usage

  • Create a new repository on GitHub with name: [new-repo-name]
  • Follow these steps:
$ git clone https://github.com/ramyamounir/Template.git
$ mv Template [new-repo-name]
$ cd [new-repo-name]
$ git remote set-url origin https://github.com/ramyamounir/[new-repo-name].git
$ git add .
$ git commit -m "Template copied"
$ git push

Layout

This template follows a modular approach where the main components of the code (architecture, loss, scheduler, trainer, etc.) are organized into subdirectories.

  • The train.py script contains all the arguments (parsed by argparse) and nodes/GPUS initializer (slurm or local). It also contains code for importing the dataset, model, loss function and passing them to the trainer function.
  • The lib/trainer/trainer.py script defines the details of the training procedure.
  • The lib/dataset/[args.dataset].py imports data and defines the dataset function. Creating a data directory with a soft link to the dataset is recommended, especially for testing on multiple datasets.
  • The lib/core/ directory contains definitions for loss, optimizer, scheduler functions.
  • The lib/util/ directory contains helper functions organized by file name. (i.e., helper functions for distributed training are placed in the lib/util/distributed.py file).

Run

For single node, single GPU training:

Try the following example

python train.py -gpus 0

For single node, multi GPU training:

Try the following example

python train.py -gpus 0,1,2

For single node, multi GPU training on SLURM:

Try the following example

python train.py -slurm -slurm_nnodes 1 -slurm_ngpus 4 -slurm_partition general

For multi node, multi GPU training on SLURM:

Try the following example

python train.py -slurm -slurm_nnodes 2 -slurm_ngpus 8  -slurm_partition general

Tips

  • To get more information about available arguments, run: python train.py -h
  • To automatically start Tensorboard server as a different thread, add the argument: -tb
  • To overwrite model log files and start from scratch, add the argument: -reset; otherwise, it will use the last weights as a checkpoint and continue writing to the same Tensorboard log files - if the same model name is used.
  • To choose specific node names on SLURM, use the argument: -slurm_nodelist GPU17,GPU18 as an example.
  • If running on a GPU with Tensor cores, using mixed precision models can speed up your training. Add the argument -fp16 to try it out. If it makes training unstable due to the loss of precision, don't use it :)
  • The template allows you to switch architectures, datasets and trainers easily by passing different arguments. For example, different architectures can be added to the lib/arch/[arch-name].py directory and passing the arguments as -arch [arch-name] or -trainer [trainer-name] or -dataset [dataset-name]
  • The stdout and stderr will be printed in the shared directory. We only print the first GPU output. Make sure to change the shared directory in lib/utils/distributed.py depending on the cluster you are using.
  • if you find a bug in this template, open a new issue or a pull request. Any collaboration is more than welcome!

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

template's People

Contributors

ramyamounir avatar dah512 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.