Comments (8)
Glad you found it useful! I'll put a SLURM tutorial together tomorrow.
In the meantime, check out the multi-node example:
from pytorch-lightning.
Thanks. Looking forward to seeing the SLURM tutorial.
Put the tutorial together here.
Try it out and lmk what you think.
https://link.medium.com/Bs6NxJyjRY
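For reference, a minimal sketch of what a SLURM submission script for a multi-node job can look like (the script name `train.py` and all resource counts here are hypothetical placeholders, not taken from the tutorial): `#SBATCH` directives request resources, and `srun` launches one copy of the training script per task.

```shell
#!/bin/bash
#SBATCH --job-name=ptl-multinode   # name shown in the queue
#SBATCH --nodes=2                  # number of machines
#SBATCH --ntasks-per-node=4        # typically one task per GPU
#SBATCH --gres=gpu:4               # GPUs requested per node
#SBATCH --time=02:00:00            # wall-clock limit

# srun starts one process per task across all nodes; SLURM exports
# the rank/world-size environment variables each process reads.
srun python train.py
```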
Closing this because the tutorial has been created :)
But, feel free to add a comment if the tutorial is missing something or needs clarification.
In the blog, there is a comment:
Disclaimer: This tutorial assumes your cluster is managed by SLURM.
In fact, I have no idea how to manage a cluster using SLURM...
Once my cluster is managed by SLURM, I can try PTL on it.
SLURM is installed by your cluster administrator. Here I'm using "cluster" to refer to an academic or corporate cluster.
If it's literally your own cluster with your own machines, then PTL may not support that case, because it uses NCCL to communicate and relies on a few environment flags that SLURM sets.
But if it’s an academic or corporate cluster, it most likely has SLURM installed.
If you use srun or sbatch to run jobs, then you’re using SLURM.
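As a quick way to check this from inside a script (a minimal sketch, not from the thread): SLURM exports `SLURM_*` environment variables such as `SLURM_JOB_ID` and `SLURM_PROCID` into every job step, so a process can tell whether it was launched via `srun`/`sbatch` and read its rank:

```python
import os

def running_under_slurm() -> bool:
    """Return True if this process was launched by SLURM (srun/sbatch).

    SLURM exports SLURM_* variables into every job step; outside a
    job none of these are set.
    """
    return "SLURM_JOB_ID" in os.environ

def slurm_rank_info() -> dict:
    """Read the per-process rank/world-size variables SLURM sets,
    falling back to single-process defaults when not under SLURM."""
    return {
        "global_rank": int(os.environ.get("SLURM_PROCID", 0)),
        "local_rank": int(os.environ.get("SLURM_LOCALID", 0)),
        "world_size": int(os.environ.get("SLURM_NTASKS", 1)),
    }

if __name__ == "__main__":
    print(running_under_slurm(), slurm_rank_info())
```

These are the same kinds of variables a DDP launcher reads to set up NCCL process groups.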
Yes, now I want to try SLURM myself...
I did not find SLURM more convenient than Horovod.