Describe the bug when trying to finetune WavLM and using DDP. ther

Unused parameters when using WavLM caused a crash when using DDP (speechbrain issue, closed)

avishaiElmakies commented on June 5, 2024

Unused parameters when using WavLM caused a crash when using DDP.

from speechbrain.

Comments (9)

BenoitWang commented on June 5, 2024

Hi @avishaiElmakies, I haven't used DDP for training and haven't seen this error before. I did some research and found an explanation for this error, along with some possible causes and solutions, here. My guess is that there may be a misplaced torch.no_grad() in your code. However, I suggest first launching DDP with the original recipe; if that works, compare it with your own modifications to debug.


avishaiElmakies commented on June 5, 2024

Hi @BenoitWang, I just ran your original recipe and I am getting the same error. (I pulled the latest develop branch as well.)

Traceback:

[2024-01-18 13:11:13,288] torch.distributed.run: [WARNING] 
[2024-01-18 13:11:13,288] torch.distributed.run: [WARNING] *****************************************
[2024-01-18 13:11:13,288] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-01-18 13:11:13,288] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:29500 (errno: 97 - Address family not supported by protocol).
Some weights of the model checkpoint at microsoft/wavlm-large were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-large and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at microsoft/wavlm-large were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-large and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
speechbrain.lobes.models.huggingface_transformers.wav2vec2 - wav2vec 2.0 feature extractor is frozen.
speechbrain.lobes.models.huggingface_transformers.wav2vec2 - wav2vec 2.0 feature extractor is frozen.
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: results/zed_wavlm_large/78
speechbrain.dataio.encoder - Load called, but CategoricalEncoder is not empty. Loaded data will overwrite everything. This is normal if there is e.g. an unk label defined at init.
speechbrain.core - Gradscaler enabled: False. Using precision: fp32.
speechbrain.core - 311.3M trainable parameters in EmoDiaBrain
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
speechbrain.utils.epoch_loop - Going into epoch 1
  0%|          | 0/203 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/cs/labs/oabend/avishai.elma/speechbrain/recipes/ZaionEmotionDataset/emotion_diarization/train.py", line 342, in <module>
    emo_id_brain.fit(
  File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1488, in fit
    self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
  File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1316, in _fit_train
    loss = self.fit_batch(batch)
  File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1130, in fit_batch
    outputs = self.compute_forward(batch, sb.Stage.TRAIN)
  File "/cs/labs/oabend/avishai.elma/speechbrain/recipes/ZaionEmotionDataset/emotion_diarization/train.py", line 27, in compute_forward
    outputs = self.modules.wav2vec2(wavs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 1: model.encoder.layers.23.final_layer_norm.bias, model.encoder.layers.23.final_layer_norm.weight, model.encoder.layers.23.feed_forward.output_dense.bias, model.encoder.layers.23.feed_forward.output_dense.weight, model.encoder.layers.23.feed_forward.intermediate_dense.bias, model.encoder.layers.23.feed_forward.intermediate_dense.weight, model.encoder.layers.23.layer_norm.bias, model.encoder.layers.23.layer_norm.weight, model.encoder.layers.23.attention.gru_rel_pos_linear.bias, model.encoder.layers.23.attention.gru_rel_pos_linear.weight, model.encoder.layers.23.attention.out_proj.bias, model.encoder.layers.23.attention.out_proj.weight, model.encoder.layers.23.attention.q_proj.bias, model.encoder.layers.23.attention.q_proj.weight, model.encoder.layers.23.attention.v_proj.bias, model.encoder.layers.23.attention.v_proj.weight, model.encoder.layers.23.attention.k_proj.bias, model.encoder.layers.23.attention.k_proj.weight, model.encoder.layers.23.attention.gru_rel_pos_const, model.encoder.layers.5.final_layer_norm.bias, model.encoder.layers.5.final_layer_norm.weight, model.encoder.layers.5.feed_forward.output_dense.bias, model.encoder.layers.5.feed_forward.output_dense.weight, model.encoder.layers.5.feed_forward.intermediate_dense.bias, model.encoder.layers.5.feed_forward.intermediate_dense.weight, model.encoder.layers.5.layer_norm.bias, model.encoder.layers.5.layer_norm.weight, model.encoder.layers.5.attention.gru_rel_pos_linear.bias, model.encoder.layers.5.attention.gru_rel_pos_linear.weight, model.encoder.layers.5.attention.out_proj.bias, model.encoder.layers.5.attention.out_proj.weight, model.encoder.layers.5.attention.q_proj.bias, model.encoder.layers.5.attention.q_proj.weight, model.encoder.layers.5.attention.v_proj.bias, model.encoder.layers.5.attention.v_proj.weight, model.encoder.layers.5.attention.k_proj.bias, model.encoder.layers.5.attention.k_proj.weight, model.encoder.layers.5.attention.gru_rel_pos_const, 
model.masked_spec_embed
Parameter indices which did not receive grad for rank 1: 0 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466
  0%|█                                                                                                                                                                                                                           | 1/203 [00:02<07:19,  2.18s/it, train_loss=1.48]
speechbrain.core - Exception:
Traceback (most recent call last):
  File "/cs/labs/oabend/avishai.elma/speechbrain/recipes/ZaionEmotionDataset/emotion_diarization/train.py", line 342, in <module>
    emo_id_brain.fit(
  File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1488, in fit
    self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
  File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1316, in _fit_train
    loss = self.fit_batch(batch)
  File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1130, in fit_batch
    outputs = self.compute_forward(batch, sb.Stage.TRAIN)
  File "/cs/labs/oabend/avishai.elma/speechbrain/recipes/ZaionEmotionDataset/emotion_diarization/train.py", line 27, in compute_forward
    outputs = self.modules.wav2vec2(wavs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 0: model.encoder.layers.23.final_layer_norm.bias, model.encoder.layers.23.final_layer_norm.weight, model.encoder.layers.23.feed_forward.output_dense.bias, model.encoder.layers.23.feed_forward.output_dense.weight, model.encoder.layers.23.feed_forward.intermediate_dense.bias, model.encoder.layers.23.feed_forward.intermediate_dense.weight, model.encoder.layers.23.layer_norm.bias, model.encoder.layers.23.layer_norm.weight, model.encoder.layers.23.attention.gru_rel_pos_linear.bias, model.encoder.layers.23.attention.gru_rel_pos_linear.weight, model.encoder.layers.23.attention.out_proj.bias, model.encoder.layers.23.attention.out_proj.weight, model.encoder.layers.23.attention.q_proj.bias, model.encoder.layers.23.attention.q_proj.weight, model.encoder.layers.23.attention.v_proj.bias, model.encoder.layers.23.attention.v_proj.weight, model.encoder.layers.23.attention.k_proj.bias, model.encoder.layers.23.attention.k_proj.weight, model.encoder.layers.23.attention.gru_rel_pos_const, model.encoder.layers.5.final_layer_norm.bias, model.encoder.layers.5.final_layer_norm.weight, model.encoder.layers.5.feed_forward.output_dense.bias, model.encoder.layers.5.feed_forward.output_dense.weight, model.encoder.layers.5.feed_forward.intermediate_dense.bias, model.encoder.layers.5.feed_forward.intermediate_dense.weight, model.encoder.layers.5.layer_norm.bias, model.encoder.layers.5.layer_norm.weight, model.encoder.layers.5.attention.gru_rel_pos_linear.bias, model.encoder.layers.5.attention.gru_rel_pos_linear.weight, model.encoder.layers.5.attention.out_proj.bias, model.encoder.layers.5.attention.out_proj.weight, model.encoder.layers.5.attention.q_proj.bias, model.encoder.layers.5.attention.q_proj.weight, model.encoder.layers.5.attention.v_proj.bias, model.encoder.layers.5.attention.v_proj.weight, model.encoder.layers.5.attention.k_proj.bias, model.encoder.layers.5.attention.k_proj.weight, model.encoder.layers.5.attention.gru_rel_pos_const, 
model.masked_spec_embed
Parameter indices which did not receive grad for rank 0: 0 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466
[2024-01-18 13:11:28,338] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2492183) of binary: /cs/labs/oabend/avishai.elma/lab_env_v2/bin/python
Traceback (most recent call last):
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-18_13:11:28
  host      : arion-02.cs.huji.ac.il
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2492184)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-18_13:11:28
  host      : arion-02.cs.huji.ac.il
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2492183)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================


BenoitWang commented on June 5, 2024

Hi @avishaiElmakies, I was able to reproduce the issue. I did some research, and this seems to have been an HF issue for quite a while; not sure why we still have it... See here and here.

So the problem is that wav2vec2-series models randomly skip layers through LayerDrop... If you add the following code in speechbrain.core after scaled_loss.backward(), you'll see that some random layers are dropped at each iteration, and the GPUs get out of sync when they try to continue.

# Print every wav2vec2 parameter that received no gradient in this backward pass.
for name, param in self.modules.wav2vec2.named_parameters():
    if param.grad is None:
        print(name)
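To see the effect in isolation, here is a minimal single-process sketch. `ToyLayerDropEncoder` and `params_without_grad` are hypothetical names for illustration, not the Hugging Face or SpeechBrain code; the toy model only mimics the LayerDrop idea of skipping a whole layer with some probability during training, and the helper applies the same `param.grad is None` check as the snippet above.

```python
from typing import List

import torch
from torch import nn


class ToyLayerDropEncoder(nn.Module):
    """Toy stand-in for an encoder with LayerDrop (an assumption, not the HF code)."""

    def __init__(self, num_layers: int = 4, dim: int = 8, layerdrop: float = 0.5):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.head = nn.Linear(dim, 1)
        self.layerdrop = layerdrop

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            # In training mode, skip the whole layer with probability `layerdrop`.
            if self.training and torch.rand(1).item() < self.layerdrop:
                continue
            x = torch.relu(layer(x))
        return self.head(x)


def params_without_grad(model: nn.Module, x: torch.Tensor) -> List[str]:
    """Run one forward/backward pass and list parameters that got no gradient."""
    model.zero_grad(set_to_none=True)
    model(x).sum().backward()
    return [name for name, p in model.named_parameters() if p.grad is None]


if __name__ == "__main__":
    model = ToyLayerDropEncoder().train()
    batch = torch.randn(2, 8)
    for step in range(3):
        # The printed names usually differ from step to step.
        print(step, params_without_grad(model, batch))
```

Skipped layers never enter the autograd graph, so their parameters keep `grad is None`, which is the situation DDP complains about in the traceback.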

The solution is simply to set emo_id_brain.find_unused_parameters=True before training. This may not be the neatest fix, but it is the easiest for now.


avishaiElmakies commented on June 5, 2024

@BenoitWang thanks for looking into it.
Is there maybe a better way to handle this? As far as I can see, a simple config change or a boolean in a constructor might fix this (at least for the time being).
The best solution would be to sync the processes, but that sounds hard and probably needs to be done on the HF side.
I might be able to do the small fix myself, but I would love it if someone could look at it.


Adel-Moumen commented on June 5, 2024

This is weird, since using --find_unused_parameters=True does the job when fine-tuning SSL models (at least on ASR).

Could you please retry with --find_unused_parameters=True enabled and let me know what your results are?


avishaiElmakies commented on June 5, 2024

@Adel-Moumen
It does work with that flag. I would love a better solution, though, since that flag increases the training run time.

P.S. sorry for the delay


Adel-Moumen commented on June 5, 2024

Hmm, I don't think there's a better solution than that. As explained by @BenoitWang, w2v2 from HF uses a layer dropout strategy, which causes issues with DDP (and that is why you really need to specify --find_unused_parameters=True). If you want some speedup, you can use --precision=fp16 and/or increase --grad_accumulation_factor.

@TParcollet, do you know if there are ways to remove this layer drop?
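For reference, the flag discussed here has the same name as the `find_unused_parameters` keyword of `torch.nn.parallel.DistributedDataParallel`, which is the argument named in the error message; SpeechBrain presumably forwards it when it wraps modules. Below is a hedged sketch, assuming a CPU-only, single-process `gloo` group so that it can run without `torchrun`. `SkippedLayerModel` and `train_two_steps` are made-up names, and the always-skipped layer stands in for a layer dropped by LayerDrop.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


class SkippedLayerModel(nn.Module):
    """Toy model whose `skipped` layer never takes part in the forward pass."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.skipped = nn.Linear(dim, dim)  # stands in for a dropped layer
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x)


def train_two_steps(find_unused_parameters: bool) -> bool:
    """Return True if two DDP training steps finish, False on RuntimeError."""
    # A fresh file-based rendezvous, so no free TCP port is needed.
    store = os.path.join(tempfile.mkdtemp(), "store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store}", rank=0, world_size=1
    )
    try:
        model = DDP(
            SkippedLayerModel(), find_unused_parameters=find_unused_parameters
        )
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        for _ in range(2):
            optimizer.zero_grad()
            model(torch.randn(2, 8)).sum().backward()
            optimizer.step()
        return True
    except RuntimeError:
        return False
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print("with flag:", train_two_steps(True))
    print("without flag:", train_two_steps(False))
```

Without the flag, the second step is expected to fail with the same `RuntimeError` as in the traceback. With it, DDP searches the autograd graph on every iteration to mark unused parameters, which is where the extra run time comes from.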


TParcollet commented on June 5, 2024

I've always used --find_unused_parameters. Tons of people want to use HuggingFace models, so they need to play by HF's rules and the way they implement their code. This is out of SpeechBrain's control, as it lies in how they implement their transformers library. @avishaiElmakies The only two solutions are: deactivate layer drop (the layerdrop setting) in the config.json of the HuggingFace model ;) (you can also pass it to the HF class in the yaml, I believe), which may or may not degrade downstream performance; or use --find_unused_parameters.
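As a rough illustration of the first option, the sketch below edits a local copy of the model's `config.json` using only the standard library. It assumes the checkpoint's config uses the `layerdrop` key, as the Wav2Vec2-family configs in transformers do; the function name `disable_layerdrop` and the example path are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def disable_layerdrop(config_path: Path) -> Optional[float]:
    """Set `layerdrop` to 0.0 in a config.json and return the previous value.

    Returns None if the key was absent, in which case the library default
    had been in effect.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    previous = config.get("layerdrop")
    config["layerdrop"] = 0.0
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


if __name__ == "__main__":
    # Hypothetical location of a locally downloaded copy of the model files.
    path = Path("wavlm-large/config.json")
    if path.exists():
        print("previous layerdrop:", disable_layerdrop(path))
```

With `layerdrop` at 0.0 every encoder layer runs on every step, so all parameters receive gradients; as noted above, this changes the fine-tuning regime and may affect downstream performance.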


avishaiElmakies commented on June 5, 2024

@TParcollet OK thanks!

