Describe the bug Hi! I found out something that looks like a race

Train loop may crash during checkpointing about speechbrain HOT 5 CLOSED

kokamido commented on June 19, 2024

Train loop may crash during checkpointing

from speechbrain.

Comments (5)

Adel-Moumen commented on June 19, 2024 1

Good! Thanks for getting back to us @kokamido. We solved some DDP / checkpointing issues in the develop branch, and we are planning to merge it into the main branch very soon. Since this issue is solved, I will go ahead and close it. Feel free to reopen it if you need more in-depth help.

Thanks again for opening the issue! :)

Adel-Moumen commented on June 19, 2024

Hey @kokamido, thanks for letting us know! Could you please show us your save directory?

Ping @pplantinga I think this issue is for you ;)

kokamido commented on June 19, 2024

I ran the repro with a clean save directory. After the crash it looks like this:

root@sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5:~/speechbraindebugexample# ls experiments/ddp_crash_repro/save/
CKPT+2024-02-08+12-30-07+01  CKPT+2024-02-08+12-30-08+01  CKPT+2024-02-08+12-30-09+01  CKPT+2024-02-08+12-30-10+01  CKPT+2024-02-08+12-30-11+00
CKPT+2024-02-08+12-30-07+02  CKPT+2024-02-08+12-30-08+02  CKPT+2024-02-08+12-30-09+02  CKPT+2024-02-08+12-30-10+02

And the error message for this run is:

Root Cause (first observed failure):
[0]:
  time      : 2024-02-08_12:30:11
  host      : sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 143902)
  error_file: /tmp/torchelastic_9faem1ym/none_hfidn2p7/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "/root/speechbraindebugexample/repro.py", line 48, in fit
      super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
      self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
      self._save_intra_epoch_ckpt()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
      self.checkpointer.save_and_keep_only(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
      self.delete_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
      self.find_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
      ckpts = self.list_checkpoints()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
      return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
      with open(ckpt_dir / METAFNAME) as fi:
  FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+12-30-10+00/CKPT.yaml'
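The traceback shows rank 1 failing in `_construct_checkpoint_objects` because `CKPT+2024-02-08+12-30-10+00/CKPT.yaml` was gone by the time it was opened, and that directory is also absent from the listing above. A minimal sketch of one way a listing step could tolerate this, assuming the directory naming and metadata file name seen in the paths above. The function `list_complete_checkpoints` is hypothetical and is not SpeechBrain's API or the actual fix:

```python
from pathlib import Path

# Metadata file name, taken from the path in the traceback above.
METAFNAME = "CKPT.yaml"


def list_complete_checkpoints(save_dir):
    """Return (dir name, metadata text) for checkpoints that can still be read.

    Hypothetical helper for illustration only. A checkpoint directory that
    another process deletes (or has not finished writing) between listing
    and opening is skipped instead of raising FileNotFoundError.
    """
    found = []
    for ckpt_dir in sorted(Path(save_dir).glob("CKPT+*")):
        try:
            with open(ckpt_dir / METAFNAME) as fi:
                meta_text = fi.read()
        except FileNotFoundError:
            # Directory vanished or has no metadata file yet: ignore it.
            continue
        found.append((ckpt_dir.name, meta_text))
    return found


if __name__ == "__main__":
    for name, _ in list_complete_checkpoints("experiments/ddp_crash_repro/save"):
        print(name)
```

Skipping unreadable directories only hides the symptom. Restricting checkpoint writes and deletions to a single process, with the other ranks waiting for it, would address the race itself.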

Adel-Moumen commented on June 19, 2024

Hey,

Could you please fetch the latest SpeechBrain version available through git clone and let us know if the issue is still there? Thanks.

Best,
Adel

kokamido commented on June 19, 2024

It seems to be fixed as of commit b8a3ee3.
