
Slurm for Machine Learning

Many labs have converged on Slurm for managing their shared compute resources. It is fairly easy to get going with Slurm, but it quickly becomes unintuitive once you want to run a hyper-parameter search. In this repo, I provide some scripts that make starting many jobs painless and easy to control.

Starting a single job

This is easy, but it's here for completeness:

sbatch generic.sh train_script.py --output_folder CIFAR10/baseline --dataset CIFAR10 --learning_rate 1e-4 --method baseline

You simply pass the arguments you would normally pass to python to generic.sh, and it starts the job on Slurm with a GPU and a conda environment set up.

Note: the --output_folder argument is required here. It is needed down the line to skip jobs that have already finished!
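
For orientation, here is a hedged sketch of what such a wrapper can look like; the #SBATCH values, conda path, and environment name below are placeholders, not the repo's actual settings:

#!/bin/bash
#SBATCH --gres=gpu:1      # request one GPU
#SBATCH --time=24:00:00   # placeholder time limit; adjust for your cluster

# Placeholder conda activation; point this at your own installation.
source "$HOME/miniconda3/etc/profile.d/conda.sh"
conda activate my-env

# Forward the training script and all its arguments straight to python.
python "$@"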

Starting many jobs

For this use case, Slurm has introduced Job Arrays. Slurm assigns each task in an array a simple job array id: an integer that counts up from 1. This does not map well onto typical machine learning workloads, which require running over a grid of hyperparameters.

For this use case, I present an easy flow that:

  1. defines a grid and walks through all the resulting jobs (and skips jobs that already finished if you later extend the grid and rerun);
  2. is robust against failures (e.g. a server crashing, jobs killed mid-run);
  3. makes it easy to limit parallelism: simply set the maximum number of GPUs to use.

The solution is to create a file listing all the jobs you want to run (this file can itself be generated by a Python or Bash script). We then iterate through the file, using the Slurm job array id to index a line. For each line, we check whether that job already finished (a results.json exists in the output folder) and skip it if so.
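
To make this concrete, here is a minimal sketch of that indexing logic (the repo's actual scripts do more; the awk-based flag extraction is my illustration, not the repo's code). Note that $SLURM_ARRAY_TASK_ID is 1-based, which matches sed's line numbering:

#!/bin/bash
# Sketch: runs inside each array task; $1 is the job file.
JOB_FILE="$1"
JOB=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$JOB_FILE")

# Pull the value following --output_folder so we can look for results.json.
OUTPUT_FOLDER=$(echo "$JOB" | awk '{for (i = 1; i < NF; i++) if ($i == "--output_folder") print $(i + 1)}')

if [ -f "${OUTPUT_FOLDER}/results.json" ]; then
    echo "Already finished, skipping: ${JOB}"
    exit 0
fi

python $JOB  # deliberately unquoted so the line splits into arguments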

Step 1:

./create_jobs.sh

This command creates a job list with all the jobs we want to run. There are two requirements:

  1. your command has an --output_folder flag that is respected by the training code;
  2. a successful job writes a results.json into that folder.

If we later want to add jobs, we can simply update this script and generate a new job list; Slurm will skip the jobs that already ran successfully before!
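
As an illustration, a small grid could be generated like this; the flags and values are placeholders, not the repo's actual grid:

#!/bin/bash
# Sketch: write one full command per line; run_file.sh later indexes these lines.
JOB_FILE=MNIST_jobs.txt
> "$JOB_FILE"  # truncate any previous job list

for lr in 1e-3 1e-4 1e-5; do
    for method in baseline improved; do
        echo "train_script.py --output_folder MNIST/${method}_lr${lr} --dataset MNIST --learning_rate ${lr} --method ${method}" >> "$JOB_FILE"
    done
done

wc -l "$JOB_FILE"  # report how many jobs were generated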

Step 2:

./run_file.sh MNIST_jobs.txt

This starts 8 jobs in parallel using Slurm job arrays. You can change the number of jobs run in parallel by editing the top of run_file.sh. Each task checks (in generic.sh) whether its job succeeded before, and skips it if that's the case.
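
Under the hood this maps onto Slurm's --array syntax, where the % suffix caps how many tasks run simultaneously. A sketch of the core call (the real run_file.sh may differ in its details):

# Sketch: one array task per line of the job file, at most 8 running at once.
N_JOBS=$(wc -l < "$1")
sbatch --array=1-"${N_JOBS}"%8 generic.sh "$1"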

Setup requirements summarised

  1. conda - by default installed in the folder miniconda3 alongside these scripts. Change the paths in generic.sh to match your setup.
  2. Within Python, save your final results to a file called results.json in the folder specified by --output_folder, so the script can check whether the job was successful. You can also adapt this check to your particular setup (e.g. check for a final saved model instead).

If you update your conda environment as part of the script, then run_locked.sh is necessary: conda is not safe to run concurrently, and calling update from multiple processes at once leads to incorrect behaviour. If you don't update your conda environment as part of the script, you can skip this line.
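
The behaviour of run_locked.sh can be sketched with flock(1), which lets only one process hold the lock (and thus run the update) at a time; the lock file path below is a placeholder:

#!/bin/bash
# Sketch: serialise the wrapped command across concurrently starting jobs.
LOCK_FILE=/tmp/conda_update.lock
flock "$LOCK_FILE" -c "$*"

Used as, for example: ./run_locked.sh conda env update -f environment.yml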

I have attempted to comment generic.sh as much as possible, so it's easy to see what to change for your Slurm setup!

Happy Slurming!

Let me know if you have any issues with the scripts, or if you see room for improvement. I am happy to accept PRs.

Useful Commands

Count all GPUs available in partition normal:

sinfo -N --partition normal -o %G | awk -F ':' '{sum += $3} END {print sum}'

Count all GPUs that are part of running jobs in all partitions:

squeue -t R --format='%b' --all  | awk -F ':' '{sum += $NF} END {print sum}'

Depending on your Slurm setup, you may want to tweak the partition (perhaps add a reservation), and perhaps drop --all from the squeue call.

Other resources

Check out my other help scripts:

  1. Train a ResNet to 94% accuracy on CIFAR-10 with only 150 lines of PyTorch
  2. FastMNIST - a drop-in replacement for PyTorch's MNIST that avoids unnecessary processing, leading to a 2-3x speed-up on a GPU
  3. Obtain a dataset with a subset of ImageNet classes in PyTorch with minimal changes

