Comments (9)
Hi @avishaiElmakies, I didn't use DDP for training, so I haven't seen this before. I did some research and found an explanation for this error, along with some possible causes and solutions, here. My guess is that there's an incorrect torch.no_grad() in your code. However, I suggest first launching DDP with the original recipe; if that works, compare it with your own modifications to debug.
from speechbrain.
Hi @BenoitWang, I just ran your original recipe and I am getting the same thing (I pulled the new develop branch as well).
Traceback:
[2024-01-18 13:11:13,288] torch.distributed.run: [WARNING]
[2024-01-18 13:11:13,288] torch.distributed.run: [WARNING] *****************************************
[2024-01-18 13:11:13,288] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-01-18 13:11:13,288] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:29500 (errno: 97 - Address family not supported by protocol).
Some weights of the model checkpoint at microsoft/wavlm-large were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-large and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at microsoft/wavlm-large were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-large and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
speechbrain.lobes.models.huggingface_transformers.wav2vec2 - wav2vec 2.0 feature extractor is frozen.
speechbrain.lobes.models.huggingface_transformers.wav2vec2 - wav2vec 2.0 feature extractor is frozen.
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: results/zed_wavlm_large/78
speechbrain.dataio.encoder - Load called, but CategoricalEncoder is not empty. Loaded data will overwrite everything. This is normal if there is e.g. an unk label defined at init.
speechbrain.core - Gradscaler enabled: False. Using precision: fp32.
speechbrain.core - 311.3M trainable parameters in EmoDiaBrain
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
speechbrain.utils.epoch_loop - Going into epoch 1
0%| | 0/203 [00:00<?, ?it/s]Traceback (most recent call last):
File "/cs/labs/oabend/avishai.elma/speechbrain/recipes/ZaionEmotionDataset/emotion_diarization/train.py", line 342, in <module>
emo_id_brain.fit(
File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1488, in fit
self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1316, in _fit_train
loss = self.fit_batch(batch)
File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1130, in fit_batch
outputs = self.compute_forward(batch, sb.Stage.TRAIN)
File "/cs/labs/oabend/avishai.elma/speechbrain/recipes/ZaionEmotionDataset/emotion_diarization/train.py", line 27, in compute_forward
outputs = self.modules.wav2vec2(wavs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
inputs, kwargs = self._pre_forward(*inputs, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 1: model.encoder.layers.23.final_layer_norm.bias, model.encoder.layers.23.final_layer_norm.weight, model.encoder.layers.23.feed_forward.output_dense.bias, model.encoder.layers.23.feed_forward.output_dense.weight, model.encoder.layers.23.feed_forward.intermediate_dense.bias, model.encoder.layers.23.feed_forward.intermediate_dense.weight, model.encoder.layers.23.layer_norm.bias, model.encoder.layers.23.layer_norm.weight, model.encoder.layers.23.attention.gru_rel_pos_linear.bias, model.encoder.layers.23.attention.gru_rel_pos_linear.weight, model.encoder.layers.23.attention.out_proj.bias, model.encoder.layers.23.attention.out_proj.weight, model.encoder.layers.23.attention.q_proj.bias, model.encoder.layers.23.attention.q_proj.weight, model.encoder.layers.23.attention.v_proj.bias, model.encoder.layers.23.attention.v_proj.weight, model.encoder.layers.23.attention.k_proj.bias, model.encoder.layers.23.attention.k_proj.weight, model.encoder.layers.23.attention.gru_rel_pos_const, model.encoder.layers.5.final_layer_norm.bias, model.encoder.layers.5.final_layer_norm.weight, model.encoder.layers.5.feed_forward.output_dense.bias, model.encoder.layers.5.feed_forward.output_dense.weight, model.encoder.layers.5.feed_forward.intermediate_dense.bias, model.encoder.layers.5.feed_forward.intermediate_dense.weight, model.encoder.layers.5.layer_norm.bias, model.encoder.layers.5.layer_norm.weight, model.encoder.layers.5.attention.gru_rel_pos_linear.bias, model.encoder.layers.5.attention.gru_rel_pos_linear.weight, model.encoder.layers.5.attention.out_proj.bias, model.encoder.layers.5.attention.out_proj.weight, model.encoder.layers.5.attention.q_proj.bias, model.encoder.layers.5.attention.q_proj.weight, model.encoder.layers.5.attention.v_proj.bias, model.encoder.layers.5.attention.v_proj.weight, model.encoder.layers.5.attention.k_proj.bias, model.encoder.layers.5.attention.k_proj.weight, model.encoder.layers.5.attention.gru_rel_pos_const, 
model.masked_spec_embed
Parameter indices which did not receive grad for rank 1: 0 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466
0%|█ | 1/203 [00:02<07:19, 2.18s/it, train_loss=1.48]
speechbrain.core - Exception:
Traceback (most recent call last):
File "/cs/labs/oabend/avishai.elma/speechbrain/recipes/ZaionEmotionDataset/emotion_diarization/train.py", line 342, in <module>
emo_id_brain.fit(
File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1488, in fit
self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1316, in _fit_train
loss = self.fit_batch(batch)
File "/cs/labs/oabend/avishai.elma/speechbrain/speechbrain/core.py", line 1130, in fit_batch
outputs = self.compute_forward(batch, sb.Stage.TRAIN)
File "/cs/labs/oabend/avishai.elma/speechbrain/recipes/ZaionEmotionDataset/emotion_diarization/train.py", line 27, in compute_forward
outputs = self.modules.wav2vec2(wavs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
inputs, kwargs = self._pre_forward(*inputs, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 0: model.encoder.layers.23.final_layer_norm.bias, model.encoder.layers.23.final_layer_norm.weight, model.encoder.layers.23.feed_forward.output_dense.bias, model.encoder.layers.23.feed_forward.output_dense.weight, model.encoder.layers.23.feed_forward.intermediate_dense.bias, model.encoder.layers.23.feed_forward.intermediate_dense.weight, model.encoder.layers.23.layer_norm.bias, model.encoder.layers.23.layer_norm.weight, model.encoder.layers.23.attention.gru_rel_pos_linear.bias, model.encoder.layers.23.attention.gru_rel_pos_linear.weight, model.encoder.layers.23.attention.out_proj.bias, model.encoder.layers.23.attention.out_proj.weight, model.encoder.layers.23.attention.q_proj.bias, model.encoder.layers.23.attention.q_proj.weight, model.encoder.layers.23.attention.v_proj.bias, model.encoder.layers.23.attention.v_proj.weight, model.encoder.layers.23.attention.k_proj.bias, model.encoder.layers.23.attention.k_proj.weight, model.encoder.layers.23.attention.gru_rel_pos_const, model.encoder.layers.5.final_layer_norm.bias, model.encoder.layers.5.final_layer_norm.weight, model.encoder.layers.5.feed_forward.output_dense.bias, model.encoder.layers.5.feed_forward.output_dense.weight, model.encoder.layers.5.feed_forward.intermediate_dense.bias, model.encoder.layers.5.feed_forward.intermediate_dense.weight, model.encoder.layers.5.layer_norm.bias, model.encoder.layers.5.layer_norm.weight, model.encoder.layers.5.attention.gru_rel_pos_linear.bias, model.encoder.layers.5.attention.gru_rel_pos_linear.weight, model.encoder.layers.5.attention.out_proj.bias, model.encoder.layers.5.attention.out_proj.weight, model.encoder.layers.5.attention.q_proj.bias, model.encoder.layers.5.attention.q_proj.weight, model.encoder.layers.5.attention.v_proj.bias, model.encoder.layers.5.attention.v_proj.weight, model.encoder.layers.5.attention.k_proj.bias, model.encoder.layers.5.attention.k_proj.weight, model.encoder.layers.5.attention.gru_rel_pos_const, 
model.masked_spec_embed
Parameter indices which did not receive grad for rank 0: 0 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466
[2024-01-18 13:11:28,338] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2492183) of binary: /cs/labs/oabend/avishai.elma/lab_env_v2/bin/python
Traceback (most recent call last):
File "/cs/labs/oabend/avishai.elma/lab_env_v2/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-01-18_13:11:28
host : arion-02.cs.huji.ac.il
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2492184)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-18_13:11:28
host : arion-02.cs.huji.ac.il
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2492183)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Hi @avishaiElmakies, I was able to reproduce the issue. I did some research, and this seems to have been an HF issue for quite a while; I'm not sure why we still have it... See here and here.
So the problem is that wav2vec2-series models skip layers through their LayerDrop... If you add the following code in speechbrain.core after scaled_loss.backward(), you'll see that in each iteration some random layers are dropped, and when the other GPUs try to continue they get out of sync.
for name, param in self.modules.wav2vec2.named_parameters():
    if param.grad is None:
        print(name)
The solution is simply to set emo_id_brain.find_unused_parameters=True before training. This may not be the neatest fix, but it's the easiest for now.
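To make the LayerDrop behaviour described above more concrete, here is a minimal pure-Python sketch (no PyTorch involved) of the idea: in training mode, each encoder layer is skipped independently with some probability on every forward pass. The function names, the 24-layer count (chosen to match the layers.23 index in the log) and the 0.1 probability are assumptions for illustration only; this is not the HF implementation.

```python
import random


def skipped_layers(num_layers, layerdrop, rng, training=True):
    """Return the set of layer indices skipped in one forward pass.

    Each layer is skipped independently with probability `layerdrop`,
    and only in training mode (the LayerDrop idea).
    """
    if not training:
        return set()
    return {i for i in range(num_layers) if rng.random() < layerdrop}


def steps_with_unused_layers(num_steps, num_layers, layerdrop, seed=0):
    """Map step index -> skipped layers, keeping only steps that skip any."""
    rng = random.Random(seed)
    result = {}
    for step in range(num_steps):
        skipped = skipped_layers(num_layers, layerdrop, rng)
        if skipped:
            # Parameters of these layers take no part in the loss, so a
            # backward pass would leave them without a gradient this step.
            result[step] = sorted(skipped)
    return result


if __name__ == "__main__":
    # Hypothetical settings: 24 layers, 10% drop probability, 10 steps.
    for step, layers in steps_with_unused_layers(10, 24, 0.1).items():
        print(f"step {step}: skipped layers {layers}")
```

With 24 layers and a drop probability of 0.1, the chance that a given step skips at least one layer is 1 - 0.9^24, roughly 92%, so under these assumed settings almost every iteration would leave some parameters without a gradient.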
@BenoitWang thanks for looking into it.
Is there maybe a better way to handle this? As far as I can see, a simple config change or a boolean in a constructor might fix this (at least for the time being).
The best solution would be to sync the processes, but that sounds hard and probably needs to be done on the HF side.
I might be able to do the small fix myself, but I would love it if someone could look at it.
This is weird, since using --find_unused_parameters=True does the job when fine-tuning SSL models (at least on ASR).
Could you please retry with --find_unused_parameters=True enabled and let me know what your results are?
@Adel-Moumen
This does work with that flag. I would love a better solution, since that flag increases the run time of the model.
P.S. Sorry for the delay.
Hmm, I don't think there's a better solution than that. As explained by @BenoitWang, w2v2 from HF uses a layer dropout strategy that causes issues with DDP (which is why you really need to specify --find_unused_parameters=True). If you want some speedup, you can use --precision=fp16 and/or increase the --grad_accumulation_factor.
@TParcollet, do you know if there are ways to remove this layer drop?
I've always used --find_unused_parameters. Tons of people want to use HuggingFace models, so they need to play by HF's rules and the way they implement their code. This is out of SpeechBrain's control, as it lies in how they implement their transformers library. @avishaiElmakies The only two solutions are to deactivate layer_drop in the config.json of the HuggingFace model ;) (you can also pass it to the HF class in the yaml, I believe, though this may or may not degrade downstream performance), or to use --find_unused_parameters.
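As a rough illustration of the first option, here is a small sketch that sets the layer-drop probability to 0.0 in a local config.json using only the standard library. It assumes the key is spelled layerdrop, which is how HF wav2vec2-style configs name it as far as I know; check your model's actual config.json before relying on it. The helper name disable_layerdrop and the demo file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def disable_layerdrop(config_path):
    """Set `layerdrop` to 0.0 in a Hugging Face style config.json.

    Returns the previous value (None if the key was absent).
    The key name `layerdrop` is an assumption; verify it in your file.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("layerdrop")
    config["layerdrop"] = 0.0
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


if __name__ == "__main__":
    # Demo on a throwaway file; a real run would point at the local copy
    # of the model's config.json instead.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "config.json"
        demo.write_text(json.dumps({"model_type": "wavlm", "layerdrop": 0.1}))
        print("previous layerdrop:", disable_layerdrop(demo))
        print(demo.read_text())
```

As noted above, turning layer drop off changes how the model is fine-tuned, so downstream performance may or may not degrade; it is worth comparing against a run with --find_unused_parameters.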
@TParcollet OK thanks!