Comments (4)
@rustamzh This is a very good idea and should be implemented.
However, I want to understand your exact setup, as I'm trying to implement this myself. You specify a ModelCheckpoint with dirpath equal to some artifact_directory, I assume? And then you also set the Trainer's default_root_dir to that same artifact_directory?
My issue is that if I am training two models using the same artifact_directory, won't auto-resume get them mixed up when it uses ckpt_path="last"?
from pytorch-lightning.
Hi. My dirpath is empty in the ModelCheckpoint callback; this way checkpoints are created in their own version-X subfolder (where X is the SLURM job ID) and do not mix, since the auto-requeued job retains its job ID.
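The setup described above can be sketched roughly as follows. This is a minimal sketch, not code from the thread: the "artifact_directory" path and the save_last flag are illustrative assumptions, and it relies on pytorch-lightning's default behaviour of placing checkpoints under default_root_dir when no dirpath is given.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# No dirpath given: checkpoints are written under the Trainer's
# default_root_dir, inside the logger's version_<SLURM job ID> subfolder.
checkpoint_cb = ModelCheckpoint(save_last=True)

trainer = Trainer(
    default_root_dir="artifact_directory",  # shared root; per-job subfolders keep runs apart
    callbacks=[checkpoint_cb],
)

# On auto-requeue the job keeps its SLURM job ID, so resuming with
# trainer.fit(model, ckpt_path="last") finds the matching version folder.
```

Because the version subfolder is keyed to the SLURM job ID, two jobs sharing one artifact directory never write "last.ckpt" into the same folder.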
This makes a lot of sense, although I'm missing how your default root dir is configured. Would it be possible to share the code snippet most relevant to this configuration? It would be much appreciated!
default_root_dir is just set to a non-empty path.
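To make the "empty dirpath, non-empty default_root_dir" behaviour concrete, here is a small stdlib-only sketch of the resulting directory layout. The checkpoint_dir helper and the "lightning_logs" path segment are assumptions mirroring the default logger layout, not code from the thread:

```python
import os

def checkpoint_dir(default_root_dir: str, slurm_job_id: str) -> str:
    # Hypothetical helper mirroring Lightning's layout when ModelCheckpoint
    # has no dirpath: checkpoints land under
    # <default_root_dir>/lightning_logs/version_<job id>/checkpoints
    return os.path.join(
        default_root_dir, "lightning_logs", f"version_{slurm_job_id}", "checkpoints"
    )

# Two different training jobs sharing one artifact directory...
run_a = checkpoint_dir("artifacts", "12345")
run_b = checkpoint_dir("artifacts", "67890")
assert run_a != run_b  # ...still write to separate subfolders.

# An auto-requeued job keeps its SLURM job ID, so resuming with
# ckpt_path="last" looks in the same folder it was writing to.
requeued_a = checkpoint_dir("artifacts", "12345")
assert run_a == requeued_a
```

This is why the two runs in the original question cannot pick up each other's "last" checkpoint: each job ID maps to its own version folder.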
Related Issues (20)
- The packages such as libraries and models are not loading from files
- Please make it simple!
- LOG issue
- Multi-gpu training is much lower than single gpu (due to additional processes?)
- Missing documentation for the `log_weight_decay` argument in `lightning.pytorch.callbacks.LearningRateMonitor`
- parsing issue with `save_last` parameter of `ModelCheckpoint`
- Construct objects from yaml by classmethod
- FSDP Strategy checkpoint loading
- Current FSDPPrecision does not support custom scaler for 16-mixed precision
- Differentiate testing multiple sets/models when logging
- Issue in Manual optimisation, during self.manual_backward call
- Existing metric keys not moved to device after LearningRateFinder
- Checkpoint every_n_steps reruns epoch on restore
- Metrics logged by self.log and metric.compute() are different
- Multi-node Training with DDP stuck at "Initialize distributed..." on SLURM cluster
- Full validation after first microbatch when training after LearningRateFinder
- Add a warning when some of the modules are in eval mode before the training stage
- why pytorch-lightning doc say "Model-parallel training (FSDP and DeepSpeed)". I think there is something wrong.
- AWS Trainium fails number of device validation when using more than 1 accelerator on the instances
- OnExceptionCheckpoint: training resumes if ckpt found, even if no ckpt_path provided