Comments (5)
Good! Thanks for getting back to us @kokamido. We fixed some DDP / checkpointing issues in the develop branch, and we plan to merge it into the main branch very soon. Since this issue is solved, I will go ahead and close it. Feel free to reopen it if you need more in-depth help.
Thanks again for opening the issue! :)
from speechbrain.
Hey @kokamido, thanks for letting us know! Could you please show us your save directory?
Ping @pplantinga, I think this issue is for you ;)
I ran the repro with a clean save directory. After the crash it looks like this:
root@sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5:~/speechbraindebugexample# ls experiments/ddp_crash_repro/save/
CKPT+2024-02-08+12-30-07+01 CKPT+2024-02-08+12-30-08+01 CKPT+2024-02-08+12-30-09+01 CKPT+2024-02-08+12-30-10+01 CKPT+2024-02-08+12-30-11+00
CKPT+2024-02-08+12-30-07+02 CKPT+2024-02-08+12-30-08+02 CKPT+2024-02-08+12-30-09+02 CKPT+2024-02-08+12-30-10+02
And the error message for this run is:
Root Cause (first observed failure):
[0]:
  time      : 2024-02-08_12:30:11
  host      : sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 143902)
  error_file: /tmp/torchelastic_9faem1ym/none_hfidn2p7/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "/root/speechbraindebugexample/repro.py", line 48, in fit
      super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
      self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
      self._save_intra_epoch_ckpt()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
      self.checkpointer.save_and_keep_only(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
      self.delete_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
      self.find_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
      ckpts = self.list_checkpoints()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
      return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
      with open(ckpt_dir / METAFNAME) as fi:
  FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+12-30-10+00/CKPT.yaml'
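The directory named in the error (`CKPT+2024-02-08+12-30-10+00`) does not appear in the listing above, which would be consistent with one rank reading the save directory while another rank was creating or deleting a checkpoint. That is only a guess from the output, not a confirmed diagnosis. As a rough illustration of that kind of race, here is a minimal stdlib-only sketch of a listing routine that tolerates a missing meta file. The function name `read_checkpoint_metas` is hypothetical, and this is not SpeechBrain's code or its actual fix; only the `CKPT.yaml` file name and the `CKPT+` prefix are taken from the output above.

```python
import tempfile
from pathlib import Path

META_NAME = "CKPT.yaml"  # meta file name as it appears in the traceback


def read_checkpoint_metas(save_dir, meta_name=META_NAME):
    """Return {dir_name: meta_text} for checkpoint dirs whose meta file is readable.

    Hypothetical helper for illustration only; not part of SpeechBrain.
    """
    metas = {}
    for ckpt_dir in sorted(Path(save_dir).glob("CKPT+*")):
        try:
            metas[ckpt_dir.name] = (ckpt_dir / meta_name).read_text()
        except (FileNotFoundError, NotADirectoryError):
            # Another process may still be writing this checkpoint, or may
            # have deleted it after we listed the directory: skip it.
            continue
    return metas


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        save = Path(tmp)
        (save / "CKPT+a").mkdir()
        (save / "CKPT+a" / META_NAME).write_text("end-of-epoch: false\n")
        (save / "CKPT+b").mkdir()  # simulates a half-written checkpoint
        print(list(read_checkpoint_metas(save)))
```

Skipping incomplete directories only hides the symptom. Under DDP the more common approach is to let a single process touch the checkpoint directory and have the other ranks wait at a barrier, but whether that matches what the develop branch does is not something this thread states.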
Hey,
could you please fetch the latest speechbrain version (via git clone) and let us know if the issue is still there? Thanks.
Best,
Adel
It seems to be fixed as of b8a3ee3.