
Comments (5)

Adel-Moumen commented on June 11, 2024

Good! Thanks for getting back to us, @kokamido. We fixed some DDP / checkpointing issues in the develop branch, and we plan to merge it into the main branch very soon. Since this issue is solved, I will close it. Feel free to reopen it if you need more in-depth help.

Thanks again for opening the issue! :)

from speechbrain.

Adel-Moumen commented on June 11, 2024

Hey @kokamido, thanks for letting us know! Could you please show us the contents of your save directory?

Ping @pplantinga, I think this issue is for you ;)


kokamido commented on June 11, 2024

I ran the repro with a clean save directory. After the crash, it looks like this:

root@sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5:~/speechbraindebugexample# ls experiments/ddp_crash_repro/save/
CKPT+2024-02-08+12-30-07+01  CKPT+2024-02-08+12-30-08+01  CKPT+2024-02-08+12-30-09+01  CKPT+2024-02-08+12-30-10+01  CKPT+2024-02-08+12-30-11+00
CKPT+2024-02-08+12-30-07+02  CKPT+2024-02-08+12-30-08+02  CKPT+2024-02-08+12-30-09+02  CKPT+2024-02-08+12-30-10+02

And the error message for this run is:

Root Cause (first observed failure):
[0]:
  time      : 2024-02-08_12:30:11
  host      : sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 143902)
  error_file: /tmp/torchelastic_9faem1ym/none_hfidn2p7/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "/root/speechbraindebugexample/repro.py", line 48, in fit
      super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
      self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
      self._save_intra_epoch_ckpt()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
      self.checkpointer.save_and_keep_only(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
      self.delete_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
      self.find_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
      ckpts = self.list_checkpoints()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
      return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
      with open(ckpt_dir / METAFNAME) as fi:
  FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+12-30-10+00/CKPT.yaml'
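
For anyone hitting the same traceback: the directory named in the error (CKPT+2024-02-08+12-30-10+00) does not appear in the listing above, which is consistent with one process removing a checkpoint while another was still enumerating the save directory. Below is a minimal sketch of a listing helper that tolerates this, assuming only the `CKPT+*` directory naming and the `CKPT.yaml` meta file seen in this thread. The function name `list_complete_checkpoint_dirs` is hypothetical; this is not SpeechBrain's API and not necessarily how the fix in the develop branch works.

```python
import pathlib
import tempfile

# Meta file name as it appears in the traceback above.
METAFNAME = "CKPT.yaml"


def list_complete_checkpoint_dirs(save_dir):
    """Return checkpoint directories whose meta file can be opened right now.

    Hypothetical helper: a directory that is missing its meta file, or that
    was deleted by another process after being listed, is skipped instead of
    raising FileNotFoundError.
    """
    complete = []
    for ckpt_dir in sorted(pathlib.Path(save_dir).glob("CKPT+*")):
        try:
            with open(ckpt_dir / METAFNAME) as fi:
                fi.read()
        except FileNotFoundError:
            # Partially written or concurrently deleted checkpoint: skip it.
            continue
        complete.append(ckpt_dir)
    return complete


# Small self-contained demo with made-up directory names.
with tempfile.TemporaryDirectory() as tmp:
    save_dir = pathlib.Path(tmp)
    (save_dir / "CKPT+demo-complete").mkdir()
    (save_dir / "CKPT+demo-complete" / METAFNAME).write_text("end-of-epoch: false\n")
    (save_dir / "CKPT+demo-incomplete").mkdir()  # no meta file here
    demo_names = [p.name for p in list_complete_checkpoint_dirs(save_dir)]
    print(demo_names)
```

Skipping incomplete directories only hides the symptom. In a DDP run, the more robust approach is usually to have a single rank create and delete checkpoints while the others wait at a barrier.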


Adel-Moumen commented on June 11, 2024

Hey,

Could you please fetch the latest speechbrain version available through git clone and let us know whether the issue is still there? Thanks.

Best,
Adel


kokamido commented on June 11, 2024

It seems to be fixed as of b8a3ee3.

