Comments (16)
[pip3] flake8==3.7.9
[pip3] numpy==1.26.3
[pip3] torch==2.1.2
[pip3] torchaudio==2.1.2
[pip3] torchvision==0.16.2
[pip3] triton==2.1.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.1.2 pypi_0 pypi
[conda] torchaudio 2.1.2 pypi_0 pypi
[conda] torchvision 0.16.2 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
This is my software version.
Hi, the traceback of the error is not complete. Could you please post the full error?
I am having a similar issue with DDP.
I think that DDP isn't able to save all of the recoverables in time before it reaches a point where it needs to load them (e.g. if you train for 1 epoch and then evaluate the brain right after).
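If that is the cause, one workaround is to synchronize the ranks between training and evaluation. A minimal sketch, assuming a brain object and datasets with the names below (these names are not from the original code):

    import speechbrain as sb

    brain.fit(brain.hparams.epoch_counter, train_set, valid_set)
    # Make every DDP rank wait until rank 0 has finished writing the
    # checkpoint before evaluate() tries to load it; ddp_barrier() is a
    # no-op when DDP is not initialized.
    sb.utils.distributed.ddp_barrier()
    brain.evaluate(test_set, max_key="accuracy")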
This is my YAML:

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 12
__set_seed: !apply:torch.manual_seed [12]
output_folder: finetune_e_wavlm_large
# eder_file: finetune_emo/eder.txt
save_folder: &id008 !ref <output_folder>/save
train_log: &id009 !ref <output_folder>/train_log.txt
local-rank: 0
distributed_launch: false

## Some important constants
sample_rate: 16000
download_base_path: "../../models"
dataset_json: ../../data_jsons/emotion_finetune_dataset.json
split_ratio: [0.8, 0.1, 0.1]
window_length: 1 # win_len = 0.02 * 1 = 0.02s
stride: 1 # stride = 0.02 * 1 = 0.02s
encoder_dim: 1024

# Outputs
out_n_neurons: 4 # BPE size, index(blank/eos/bos) = 0

# Dataloader options
# With data_parallel batch_size is split into N jobs
# With DDP batch_size is multiplied by N jobs
dataloader_options:
    batch_size: 2
    shuffle: true
    num_workers: 2 # 2 on linux but 0 works on windows
    drop_last: false
    pin_memory: true
    collate_fn: !name:speechbrain.dataio.batch.PaddedBatch

test_dataloader_opts:
    batch_size: 2
    collate_fn: !name:speechbrain.dataio.batch.PaddedBatch

epoch_counter: &id007 !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: 1

add_noise_aug: &id020 !new:speechbrain.processing.speech_augmentation.AddNoise
    snr_low: 10
    snr_high: 20
    mix_prob: 0.5

drop_freq_aug: &id021 !new:speechbrain.processing.speech_augmentation.DropFreq
    drop_freq_high: 0.5
    drop_count_low: 2
    drop_count_high: 4
    drop_prob: 0.5

drop_chunk_aug: &id022 !new:speechbrain.processing.speech_augmentation.DropChunk
    drop_length_low: 1600
    drop_length_high: 16000
    drop_count_low: 1
    drop_count_high: 2

augmentation: !new:speechbrain.processing.augmentation.Augmenter
    parallel_augment: false
    concat_original: false
    min_augmentations: 0
    max_augmentations: 3
    repeat_augment: 1
    noise_aug: *id020
    freq_aug: *id021
    chunk_aug: *id022

input_norm: &id001 !new:speechbrain.processing.features.InputNormalization
    norm_type: sentence
    std_norm: false

wav2vec2: &id002 !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
    source: microsoft/wavlm-large
    output_norm: true
    freeze: false
    freeze_feature_extractor: true
    save_path: !ref <download_base_path>/wavlm-large

avg_pool: !new:speechbrain.nnet.pooling.Pooling1d
    pool_type: avg
    kernel_size: 1
    stride: 1
    ceil_mode: true

output_mlp: &id003 !new:speechbrain.nnet.linear.Linear
    input_size: !ref <encoder_dim>
    n_neurons: !ref <out_n_neurons>
    bias: false

log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: true

compute_cost: !name:emotion.finetune.utils.weighted_nll_loss

# Can be used with compute_cost; probably better for DDP to work like this.
# Should also know the labels in advance; can work with different labels
# and label encoders.
weights:
    "s": 8.0
    "h": 20.0
    "n": 1.0
    "a": 20.0

modules:
    input_norm: *id001
    wav2vec2: *id002
    output_mlp: *id003

opt_class: !name:torch.optim.Adam
    lr: 0.0001

wav2vec2_opt_class: !name:torch.optim.Adam
    lr: 0.00001

lr_annealing: &id005 !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: 0.0001
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0

lr_annealing_wav2vec2: &id006 !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: 0.00001
    improvement_threshold: 0.0025
    annealing_factor: 0.9
    patient: 0

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: *id008
    recoverables:
        scheduler_model: *id005
        scheduler_wav2vec: *id006
        counter: *id007

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: *id009

error_stats: !name:speechbrain.utils.metric_stats.ClassificationStats
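For context on that weights block, here is a plausible sketch (an assumption; the poster's actual emotion.finetune.utils.weighted_nll_loss is not shown) of how such a mapping could drive a class-weighted NLL loss:

    import torch
    import torch.nn.functional as F

    def weighted_nll_loss(log_probs, targets, weights,
                          label_order=("s", "h", "n", "a")):
        # Build a per-class weight vector; label_order is an assumed mapping
        # from class index to label string (a real implementation would ask
        # the label encoder for this ordering).
        weight_vec = torch.tensor(
            [weights[lab] for lab in label_order],
            device=log_probs.device,
            dtype=log_probs.dtype,
        )
        # log_probs: (batch, n_classes) log-probabilities; targets: (batch,)
        # integer class ids.
        return F.nll_loss(log_probs, targets, weight=weight_vec)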
This is what I add as recoverables in my code (during init_optimizers):
if self.checkpointer is not None:
    self.checkpointer.add_recoverable("wav2vec2_opt", self.wav2vec2_optimizer)
    self.checkpointer.add_recoverable("optimizer", self.optimizer)
    self.checkpointer.add_recoverable("input_norm", self.modules.input_norm)
    self.checkpointer.add_recoverable("output_mlp", self.modules.output_mlp)
    self.checkpointer.add_recoverable("wav2vec2", self.modules.wav2vec2)
Traceback:
Traceback (most recent call last):
File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 63, in <module>
main()
File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 52, in main
brain.evaluate(
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 1531, in evaluate
self.on_evaluate_start(max_key=max_key, min_key=min_key)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 994, in on_evaluate_start
self.checkpointer.recover_if_possible(
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 891, in recover_if_possible
self.load_checkpoint(chosen_ckpt, device)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 904, in load_checkpoint
self._call_load_hooks(checkpoint, device)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 1039, in _call_load_hooks
default_hook(obj, loadpath, end_of_epoch, device)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 96, in torch_recovery
obj.load_state_dict(torch.load(path, map_location=device), strict=True)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/serialization.py", line 1028, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/serialization.py", line 1246, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
[2024-01-16 18:02:11,451] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2696972 closing signal SIGTERM
[2024-01-16 18:02:29,572] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 2696973) of binary: /cs/labs/oabend/avishai.elma/lab_env_v2/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
emotion_finetune.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-16_18:02:11
host : drape-02.cs.huji.ac.il
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2696973)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
As we can see, it isn't able to save all of the files.
Sometimes the error is "EOFError: Ran out of input", and sometimes it is a KeyError.
When not using DDP, or even when using DDP with just one process, everything works fine.
env:
SpeechBrain system description
==============================
Python version:
3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110]
==============================
Installed Python packages:
aiohttp==3.9.1
aiosignal==1.3.1
alembic==1.13.0
AMFM-decompy==1.0.11
annotated-types==0.6.0
antlr4-python3-runtime==4.8
asteroid-filterbanks==0.4.0
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.1.0
audioread==3.0.1
bitarray==2.9.0
blis==0.7.11
catalogue==2.0.10
certifi==2022.12.7
cffi==1.16.0
charset-normalizer==2.1.1
click==8.1.7
cloudpathlib==0.16.0
colorama==0.4.6
coloredlogs==15.0.1
colorlog==6.8.0
comm==0.2.0
confection==0.1.4
contourpy==1.2.0
cycler==0.12.1
cymem==2.0.8
Cython==3.0.6
datasets==2.15.0
debugpy==1.8.0
decorator==5.1.1
dill==0.3.7
docopt==0.6.2
einops==0.7.0
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl#sha256=86cc141f63942d4b2c5fcee06630fd6f904788d2f0ab005cce45aadb8fb73889
exceptiongroup==1.2.0
executing==2.0.1
fairseq @ git+https://github.com/pytorch/fairseq@da8fb630880d529ab47e53381c30ddc8ad235216
filelock==3.9.0
flatbuffers==23.5.26
fonttools==4.46.0
frozenlist==1.4.0
fsspec==2023.10.0
greenlet==3.0.2
huggingface-hub==0.19.4
humanfriendly==10.0
hydra-core==1.0.7
HyperPyYAML==1.2.2
idna==3.4
importlib-metadata==7.0.0
importlib-resources==6.1.1
inflect==7.0.0
iniconfig==2.0.0
ipykernel==6.27.1
ipython==8.18.1
jedi==0.19.1
Jinja2==3.1.2
joblib==1.3.2
julius==0.2.7
jupyter_client==8.6.0
jupyter_core==5.5.1
kiwisolver==1.4.5
langcodes==3.3.0
lazy_loader==0.3
librosa==0.10.0
lightning==2.1.2
lightning-utilities==0.10.0
llvmlite==0.36.0
lxml==4.9.3
Mako==1.3.0
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.8.2
matplotlib-inline==0.1.6
mdurl==0.1.2
mpmath==1.3.0
msgpack==1.0.7
multidict==6.0.4
multiprocess==0.70.15
murmurhash==1.0.10
nest-asyncio==1.5.8
networkx==3.0
npy-append-array==0.9.16
numba==0.53.0
numpy==1.22.0
omegaconf==2.0.6
onnxruntime==1.16.3
optuna==3.5.0
packaging==23.2
pandas==2.1.4
parso==0.8.3
pexpect==4.9.0
Pillow==9.3.0
platformdirs==4.0.0
pluggy==1.3.0
pooch==1.8.0
portalocker==2.8.2
preshed==3.0.9
primePy==1.3
prompt-toolkit==3.0.43
protobuf==4.25.1
psutil==5.9.7
ptyprocess==0.7.0
pure-eval==0.2.2
pyannote.audio==3.1.1
pyannote.core==5.0.0
pyannote.database==5.0.1
pyannote.metrics==3.2.1
pyannote.pipeline==3.0.1
pyarrow==14.0.1
pyarrow-hotfix==0.6
pycparser==2.21
pydantic==2.5.2
pydantic_core==2.14.5
Pygments==2.17.2
pyparsing==3.1.1
pytest==7.4.3
python-dateutil==2.8.2
pytorch-lightning==2.1.2
pytorch-metric-learning==2.3.0
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.2
regex==2023.10.3
requests==2.28.1
resampy==0.4.2
rich==13.7.0
ruamel.yaml==0.18.5
ruamel.yaml.clib==0.2.8
sacrebleu==2.4.0
safetensors==0.4.1
scikit-learn==1.3.2
scipy==1.11.4
semver==3.0.2
sentencepiece==0.1.99
shellingham==1.5.4
simplejson==3.19.2
six==1.16.0
smart-open==6.4.0
sortedcontainers==2.4.0
soundfile==0.12.1
soxr==0.3.7
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
speechbrain==0.5.16
SQLAlchemy==2.0.23
srsly==2.4.8
stack-data==0.6.3
sympy==1.12
tabulate==0.9.0
tensorboardX==2.6.2.2
-e git+https://github.com/facebookresearch/textlesslib.git@ba33d669d8284b4f7bfe81e7384e83ab799fe384#egg=textless
tgt==1.5
thinc==8.2.1
threadpoolctl==3.0.0
tokenizers==0.15.0
tomli==2.0.1
torch==2.1.1+cu118
torch-audiomentations==0.11.0
torch-pitch-shift==1.2.4
torchaudio==2.1.1+cu118
torchcrepe==0.0.22
torchmetrics==1.2.1
torchvision==0.16.1+cu118
tornado==6.4
tqdm==4.66.1
traitlets==5.14.0
transformers==4.36.1
triton==2.1.0
typer==0.9.0
typing_extensions==4.9.0
tzdata==2023.3
Unidecode==1.3.7
urllib3==1.26.13
wasabi==1.1.2
wcwidth==0.2.12
weasel==0.3.4
wget==3.2
xxhash==3.4.1
yarl==1.9.4
zipp==3.17.0
==============================
Could not get git revision
==============================
CUDA version:
11.8
CC @TParcollet and @pplantinga
Hi @avishaiElmakies, this way of adding so many recoverables, especially at that spot in the code (the init_optimizers function), is a bit unconventional. It's better to declare them on the checkpointer in the YAML. Can we see the train.py associated with this error? Many thanks.
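For instance, a sketch of what that could look like with the objects already defined in your YAML (the optimizers would still need to be added in code, since they only exist after init_optimizers runs):

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        input_norm: !ref <input_norm>
        wav2vec2: !ref <wav2vec2>
        output_mlp: !ref <output_mlp>
        scheduler_model: !ref <lr_annealing>
        scheduler_wav2vec: !ref <lr_annealing_wav2vec2>
        counter: !ref <epoch_counter>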
@TParcollet Thanks for the fast reply!
I will admit I used someone else's code that added the optimizers there, so I thought it was fine. Is there a better place to add them?
My code doesn't always have the modules in the YAML (I want my code to be able to take a huggingface.co model and use it, specifically the model used for emotion diarization, so I would like to keep that flexibility).
I will try to add the files shortly.
These are the files I use. emotion_finetune.py is the main file; emotion is a package I created that holds most of the fine-tuning logic.
Side note while waiting for the .py: the new version of SpeechBrain (v1) on the develop branch can handle any model originating from HF nicely from the lobes.
@TParcollet I added the code, in the code.zip file (in my previous message).
I will look at v1.0. I used pip to install the library, but if v1 is more appropriate for me I will try to use it.
Alright, I won't be able to do a detailed review of that code right now because it is highly customised. I will, however, offer one piece of advice: there is no reason to use such a compositional approach. You could build a simple, standard SpeechBrain recipe doing what this emotion_finetune.py script is doing. It would be much, much cleaner and easier to debug. But it is most likely not a DDP problem.
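For reference, a bare-bones sketch of the standard recipe layout being suggested here; all names are placeholders, not the poster's actual files:

    import sys
    import speechbrain as sb
    from hyperpyyaml import load_hyperpyyaml

    class EmoBrain(sb.Brain):
        def compute_forward(self, batch, stage):
            wavs, lens = batch.sig
            feats = self.modules.wav2vec2(wavs)  # (batch, time, encoder_dim)
            pooled = feats.mean(dim=1)           # simple time-average pooling
            return self.hparams.log_softmax(self.modules.output_mlp(pooled))

        def compute_objectives(self, predictions, batch, stage):
            # batch.label_encoded is a placeholder name for the target ids.
            return sb.nnet.losses.nll_loss(predictions, batch.label_encoded.data)

    if __name__ == "__main__":
        hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
        with open(hparams_file) as f:
            hparams = load_hyperpyyaml(f, overrides)
        brain = EmoBrain(
            modules=hparams["modules"],
            opt_class=hparams["opt_class"],
            hparams=hparams,
            run_opts=run_opts,
            checkpointer=hparams["checkpointer"],
        )
        # Dataset preparation and the brain.fit()/brain.evaluate() calls are
        # omitted; any recipe's dataio_prepare shows the usual pattern.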
@TParcollet
Do I need to use speechbrain v1 for that?
Do you have an example?
You don't need v1 for that, but updating to the latest version is always the best solution, especially since this one is a big change. I am not sure I fully understand what you are trying to do here, but if it's fine-tuning a model through one of our interfaces, that's not the best possible way (you can do it, as you are trying, but not doing it right might lead to errors). Interfaces are built for inference, not for fine-tuning (in theory). If you want to fine-tune, it's better to get the checkpoint and build an actual training recipe with a .yaml and a .py using the Pretrainer class.
Now, for your use case, you could try forcing the checkpointer to save a checkpoint at some point in your training recipe. Do not forget to call it within the run_on_main environment, or somewhere that is only executed in the main process, if you want to use DDP. Unfortunately, your code is a bit too complex in its structure for me to help. Maybe others may want to have a try if they have time. @Adel-Moumen
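A hedged sketch of the Pretrainer route mentioned above; the checkpoint filename ("wav2vec2.ckpt") and the loadables key are assumptions about the HF repo layout, not verified against it:

    from speechbrain.utils.parameter_transfer import Pretrainer

    pretrainer = Pretrainer(
        collect_in="pretrained_ckpts",  # local cache dir (arbitrary choice)
        loadables={"wav2vec2": hparams["wav2vec2"]},
        paths={
            "wav2vec2": "speechbrain/emotion-diarization-wavlm-large/wav2vec2.ckpt"
        },
    )
    pretrainer.collect_files()   # fetches the files (from HF if needed)
    pretrainer.load_collected()  # loads parameters into the live module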
@TParcollet
I will try to explain what I want to do.
I need to fine-tune a WavLM model on emotion data. My main problem is that I want the ability to fine-tune two models: WavLM from scratch on the data, and an already fine-tuned model, and then see which is better. I hoped my code would let me do that by only changing the YAML.
The fine-tuned model is this one:
https://huggingface.co/speechbrain/emotion-diarization-wavlm-large
And I used the code there as inspiration for my own (on how to fine-tune a model):
https://www.dropbox.com/sh/woudm1v31a7vyp5/AADAMxpQOXaxf8E_1hX202GJa?dl=0
I also looked at the instructions on how to fine-tune a model.
I will look into what you said. If you have any advice I would appreciate it very much.
EDIT: I should probably add that when fine-tuning the already fine-tuned model, I would like to change the output MLP.
Maybe @BenoitWang can provide some tips here? :-)
I tried making the save happen on the main process (via run_on_main), but it is still not working.
if stage == sb.Stage.VALID:
    old_lr, new_lr = self.hparams.lr_annealing(1 - stats["accuracy"])
    sb.nnet.schedulers.update_learning_rate(self.optimizer, new_lr)
    old_lr_wav2vec2, new_lr_wav2vec2 = self.hparams.lr_annealing_wav2vec2(
        1 - stats["accuracy"]
    )
    sb.nnet.schedulers.update_learning_rate(
        self.wav2vec2_optimizer, new_lr_wav2vec2
    )
    meta["lr"] = old_lr
    meta["lr_wav2vec2"] = old_lr_wav2vec2
    self.hparams.train_logger.log_stats(stats_meta=meta, valid_stats=stats)
    if epoch % 1 == 0:  # TODO: change this to 3
        checkpointer_meta = stats.copy()
        checkpointer_meta.pop("confusion_matrix", None)
        sb.utils.distributed.run_on_main(
            self.checkpointer.save_and_keep_only,
            kwargs={
                "meta": checkpointer_meta,
                "max_keys": ["accuracy"],
                "num_to_keep": 1,
            },
        )
        print("Saved checkpoint")
I am getting an error:
Traceback (most recent call last):
File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 63, in <module>
main()
File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 46, in main
brain = finetune(modules, hparams, datasets,class_weights=class_weights, run_opts=run_opts,
File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion/finetune/utils.py", line 136, in finetune
emo_brain.fit(
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 1367, in fit
self._fit_valid(valid_set=valid_set, epoch=epoch, enable=enable)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 1281, in _fit_valid
self.on_stage_end(Stage.VALID, avg_valid_loss, epoch)
File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion/finetune/finetune_brain.py", line 119, in on_stage_end
sb.utils.distributed.run_on_main(
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/distributed.py", line 65, in run_on_main
ddp_barrier()
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/distributed.py", line 118, in ddp_barrier
torch.distributed.barrier()
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=6858OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=0OpType=GATHER).Collectives differ in the following aspects: Sequence number: 6858vs 0 Op type: BARRIERvs GATHER
[E ProcessGroupGloo.cpp:2810] [Rank 0]: Rank 1 failed to pass monitoredBarrier in 7200000 ms
[E ProcessGroupGloo.cpp:138] [Rank 0]: Ranks 1 failed to pass monitoredBarrier in 7200000 ms
speechbrain.core - Exception:
Traceback (most recent call last):
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/distributed.py", line 60, in run_on_main
func(*args, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 679, in save_and_keep_only
self.save_checkpoint(
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 586, in save_checkpoint
torch.distributed.broadcast_object_list(communication_list, src=0)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2603, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=6858, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=0OpType=REDUCE).Collectives differ in the following aspects: Sequence number: 6858vs 0 Op type: BROADCASTvs REDUCE Tensor Tensor shapes: 1vs Tensor Tensor dtypes: Longvs Tensor Tensor devices: TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))vs
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 63, in <module>
main()
File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 46, in main
brain = finetune(modules, hparams, datasets,class_weights=class_weights, run_opts=run_opts,
File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion/finetune/utils.py", line 136, in finetune
emo_brain.fit(
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 1367, in fit
self._fit_valid(valid_set=valid_set, epoch=epoch, enable=enable)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 1281, in _fit_valid
self.on_stage_end(Stage.VALID, avg_valid_loss, epoch)
File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion/finetune/finetune_brain.py", line 119, in on_stage_end
sb.utils.distributed.run_on_main(
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/distributed.py", line 62, in run_on_main
ddp_barrier()
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/distributed.py", line 118, in ddp_barrier
torch.distributed.barrier()
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: [Rank 0]: Ranks 1 failed to pass monitoredBarrier in 7200000 ms
[2024-01-17 21:02:13,945] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2012414) of binary: /cs/labs/oabend/avishai.elma/lab_env_v2/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
emotion_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-01-17_21:02:13
host : firth-02.cs.huji.ac.il
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2012415)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-17_21:02:13
host : firth-02.cs.huji.ac.il
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2012414)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
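Reading the two stacks together: rank 0 is inside save_checkpoint() calling torch.distributed.broadcast_object_list(), while rank 1 is stuck at run_on_main()'s barrier. Since save_checkpoint() itself performs collectives on every rank, a sketch of the presumed fix is to call the checkpointer directly on all ranks rather than wrapping it in run_on_main:

    if epoch % 1 == 0:  # TODO: change this to 3
        checkpointer_meta = stats.copy()
        checkpointer_meta.pop("confusion_matrix", None)
        # Called on every rank: save_checkpoint() broadcasts internally
        # (see the traceback above), so all ranks must enter it together.
        self.checkpointer.save_and_keep_only(
            meta=checkpointer_meta,
            max_keys=["accuracy"],
            num_to_keep=1,
        )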
I will probably try version 1.0 from develop tomorrow, but from what I saw in the code I am not sure it will fix this.
I am also having a problem with some unused parameters, which I will open a separate bug for.
Hi,
I updated my code to v1, and after the refactoring this seems to work fine again!
I would love it if someone could look at issue #2340.