microsoft / unispeech

UniSpeech - Large Scale Self-Supervised Learning for Speech

License: Other

Python 96.14% Cuda 1.13% Shell 0.90% Cython 0.78% C++ 1.05%
pytorch speech-recognition speech-processing speech diarization speech-separation speech-diarization speaker-verification

unispeech's Introduction

UniSpeech

The family of UniSpeech:

WavLM (arXiv): WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing

UniSpeech (ICML 2021): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

UniSpeech-SAT (ICASSP 2022 Submission): Universal Speech Representation Learning with Speaker Aware Pre-Training

ILS-SSL (ICASSP 2022 Submission): Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision

Model introductions, evaluation results, and model inference instructions are located in their corresponding folders. The source code is available at https://github.com/microsoft/UniSpeech/tree/main/src.

Update

Pre-trained models

We strongly recommend the UniSpeech-SAT model for speaker-related tasks, as it achieves strong performance on a range of speaker-related benchmarks.

Model | Pretraining Dataset | Finetuning Dataset | Download
UniSpeech Large EN | Labeled: 1350 hrs en | - | download
UniSpeech Large Multilingual | Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it | - | download
UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 353 hrs fr | - | download
UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 168 hrs es | - | download
UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 90 hrs it | - | download
UniSpeech Large Multilingual | Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it, Unlabeled: 17 hrs ky | - | download
UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 353 hrs fr | 1 hr fr | download
UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 168 hrs es | 1 hr es | download
UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 90 hrs it | 1 hr it | download
UniSpeech Large Multilingual | Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it, Unlabeled: 17 hrs ky | 1 hr ky | download
UniSpeech-SAT Base | 960 hrs LibriSpeech | - | download
UniSpeech-SAT Base+ | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | download
UniSpeech-SAT Large | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | download
WavLM Base | 960 hrs LibriSpeech | - | download
WavLM Base+ | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | download
WavLM Large | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | download
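
For a quick sanity check of a downloaded checkpoint, the WavLM models can be loaded and used to extract frame-level representations. The snippet below is a minimal sketch following the usage described in the WavLM folder; the checkpoint filename and the dummy input are placeholders.

# Minimal sketch: load a downloaded WavLM checkpoint and extract features.
# Assumes the WavLM folder of this repo is on PYTHONPATH; the filename is an example.
import torch
from WavLM import WavLM, WavLMConfig

checkpoint = torch.load('WavLM-Large.pt', map_location='cpu')
cfg = WavLMConfig(checkpoint['cfg'])
model = WavLM(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

wav_input_16khz = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
with torch.no_grad():
    rep = model.extract_features(wav_input_16khz)[0]  # (batch, frames, hidden_dim)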

Universal Representation Evaluation on SUPERB

[Figure: SUPERB universal representation evaluation results]

Downstream Task Performance

We also evaluate our models on typical speaker-related benchmarks.

Speaker Verification

Fine-tune the model with the VoxCeleb2 dev data and evaluate it on the VoxCeleb1 trial lists (Vox1-O, Vox1-E, Vox1-H).

Model | Fix pre-train | Vox1-O | Vox1-E | Vox1-H
ECAPA-TDNN | - | 0.87 | 1.12 | 2.12
HuBERT large | Yes | 0.888 | 0.912 | 1.853
Wav2Vec2.0 (XLSR) | Yes | 0.915 | 0.945 | 1.895
UniSpeech-SAT large | Yes | 0.771 | 0.781 | 1.669
WavLM large | Yes | 0.59 | 0.65 | 1.328
WavLM large | No | 0.505 | 0.579 | 1.176
+ Large Margin Finetune and Score Calibration
HuBERT large | No | 0.585 | 0.654 | 1.342
Wav2Vec2.0 (XLSR) | No | 0.564 | 0.605 | 1.23
UniSpeech-SAT large | No | 0.564 | 0.561 | 1.23
WavLM large (New) | No | 0.33 | 0.477 | 0.984

Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification
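
For context, a verification trial is scored by comparing two utterance-level speaker embeddings with cosine similarity, which is why verification.py reports a score in [-1, 1]. The sketch below is illustrative only; the embed() helper is hypothetical.

# Illustrative scoring of one verification trial; embed() is a hypothetical
# helper that maps a wav file to a fixed-size speaker embedding.
import torch
import torch.nn.functional as F

def score_trial(emb1: torch.Tensor, emb2: torch.Tensor) -> float:
    # Cosine similarity in [-1, 1]; higher means "same speaker" is more likely.
    return F.cosine_similarity(emb1, emb2, dim=-1).item()

# score = score_trial(embed('enroll.wav'), embed('test.wav'))  # hypothetical usage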

Speech Separation

Evaluation on LibriCSS

Model | 0S | 0L | OV10 | OV20 | OV30 | OV40
Conformer (SOTA) | 4.5 | 4.4 | 6.2 | 8.5 | 11 | 12.6
UniSpeech-SAT base | 4.4 | 4.4 | 5.4 | 7.2 | 9.2 | 10.5
UniSpeech-SAT large | 4.3 | 4.2 | 5.0 | 6.3 | 8.2 | 8.8
WavLM base+ | 4.5 | 4.4 | 5.6 | 7.5 | 9.4 | 10.9
WavLM large | 4.2 | 4.1 | 4.8 | 5.8 | 7.4 | 8.5

Speaker Diarization

Evaluation on CALLHOME

Model | spk_2 | spk_3 | spk_4 | spk_5 | spk_6 | spk_all
EEND-vector clustering | 7.96 | 11.93 | 16.38 | 21.21 | 23.1 | 12.49
EEND-EDA clustering (SOTA) | 7.11 | 11.88 | 14.37 | 25.95 | 21.95 | 11.84
UniSpeech-SAT large | 5.93 | 10.66 | 12.9 | 16.48 | 23.25 | 10.92
WavLM Base | 6.99 | 11.12 | 15.20 | 16.48 | 21.61 | 11.75
WavLM large | 6.46 | 10.69 | 11.84 | 12.89 | 20.70 | 10.35

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.

Microsoft Open Source Code of Conduct

Reference

If you find our work useful in your research, please cite the following papers:

@inproceedings{Wang2021UniSpeech,
  author    = {Chengyi Wang and Yu Wu and Yao Qian and Kenichi Kumatani and Shujie Liu and Furu Wei and Michael Zeng and Xuedong Huang},
  editor    = {Marina Meila and Tong Zhang},
  title     = {UniSpeech: Unified Speech Representation Learning with Labeled and
               Unlabeled Data},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning,
               {ICML} 2021, 18-24 July 2021, Virtual Event},
  series    = {Proceedings of Machine Learning Research},
  volume    = {139},
  pages     = {10937--10947},
  publisher = {{PMLR}},
  year      = {2021},
  url       = {http://proceedings.mlr.press/v139/wang21y.html},
  timestamp = {Thu, 21 Oct 2021 16:06:12 +0200},
  biburl    = {https://dblp.org/rec/conf/icml/0002WQK0WZ021.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{Chen2021WavLM,
  title   = {WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing},
  author  = {Sanyuan Chen and Chengyi Wang and Zhengyang Chen and Yu Wu and Shujie Liu and Zhuo Chen and Jinyu Li and Naoyuki Kanda and Takuya Yoshioka and Xiong Xiao and Jian Wu and Long Zhou and Shuo Ren and Yanmin Qian and Yao Qian and Jian Wu and Michael Zeng and Furu Wei},
  eprint={2110.13900},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2021}
}
@article{Chen2021UniSpeechSAT,
  title   = {UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training},
  author  = {Sanyuan Chen and Yu Wu and Chengyi Wang and Zhengyang Chen and Zhuo Chen and Shujie Liu and Jian Wu and Yao Qian and Furu Wei and Jinyu Li and Xiangzhan Yu},
  eprint={2110.05752},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2021}
}

Contact Information

For help or issues using UniSpeech models, please submit a GitHub issue.

For other communications related to UniSpeech, please contact Yu Wu ([email protected]).

unispeech's People

Contributors

cywang97, czy97, j4ckl1u, markwunlp, microsoft-github-policy-service[bot], patrickvonplaten, sanyuan-chen, valle-demo


unispeech's Issues

Formula 6 in paper

Hi there!

Great repo and paper. I have a question that may just be a misunderstanding of the paper/code on my side. After reading through both, my understanding is:

You first do CTC + contrastive training on the labeled data "L" and then optional pre-training on "M". However, from your paper I understand that these should be solved as a single task with joint multi-task training (formula 6 in the paper). This is not reflected in the code.

Would be glad if you could please help. Thank You!

More details about the output

When I try to run the example in the UniSpeech-SAT directory of this repo, I get 'f' as a tensor of size torch.Size([1, 512, 31]).
What exactly does the variable f represent?

Huggingface sat model missing tokenizer

I tried to use the pretrained model from Hugging Face, but it seems no tokenizer was uploaded there.

>>> processor = Wav2Vec2Processor.from_pretrained('microsoft/unispeech-sat-base-plus')
OSError: Can't load tokenizer for 'microsoft/unispeech-sat-base-plus'. Make sure that:

- 'microsoft/unispeech-sat-base-plus' is a correct model identifier listed on 'https://huggingface.co/models'
  (make sure 'microsoft/unispeech-sat-base-plus' is not a path to a local directory with something else, in that case)

- or 'microsoft/unispeech-sat-base-plus' is the correct path to a directory containing relevant tokenizer files

(1) Is there any workaround?
(2) Also, since I don't need a tokenizer (I use the model for audio classification), is there an option to skip loading it?

cc @patrickvonplaten
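
A possible workaround for (1) and (2), shown as a hedged sketch: for audio-only use there is no need for a tokenizer, so one can load only the feature extractor and the bare encoder from transformers instead of Wav2Vec2Processor (which bundles a tokenizer).

# Hedged workaround sketch: skip the tokenizer and use only the feature
# extractor plus the bare UniSpeech-SAT encoder from transformers.
import torch
from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatModel

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-sat-base-plus')
model = UniSpeechSatModel.from_pretrained('microsoft/unispeech-sat-base-plus')

wav = torch.randn(16000).numpy()  # one second of dummy 16 kHz audio
inputs = feature_extractor(wav, sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (1, frames, hidden_dim)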

Speaker verification result

Hello,

Thank you for your work on WavLM.
I am trying to reproduce the results but have some difficulties.

First of all, I don't understand exactly the difference between the scores displayed in different places. For instance, on Vox1-O:

Moreover, I tried to reproduce the result from the fine-tuned checkpoint available in this repository (https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view?usp=sharing).

I get the following result on vox1-O:

  • Without normalisation, I get EER = 0.558%
  • With s-norm, I get EER = 0.542%
  • with as-norm (cohort size = 600), I get EER = 0.505%

Do you have any more details to provide?

Thank you

Getting speaker embeddings

The UniSpeech-SAT directory in this repo contains an example. The example takes a .wav file as input and produces a tensor 'f' as output. Can I get the speaker embeddings from 'f'?
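
One common way to turn such a frame-level tensor of shape (batch, channels, frames) into a single utterance-level embedding is temporal pooling; whether this matches the intended use of 'f' here is an assumption, not an official answer.

# Hedged sketch: mean-pool the time axis of a (1, 512, 31) feature tensor to
# obtain a fixed-size utterance embedding, then L2-normalize it.
import torch
import torch.nn.functional as F

f = torch.randn(1, 512, 31)        # stand-in for the model output
embedding = f.mean(dim=-1)         # (1, 512)
embedding = F.normalize(embedding, dim=-1)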

Unispeech-SAT fairseq code

Hi!

From UniSpeech/downstreams/speaker_diarization/README.md:
For UniSpeech-SAT large, we should install the Unispeech-SAT fairseq code.

Where can I find the Unispeech-SAT fairseq code?

Thanks in advance.

UniSpeech Model Download

Hello, the download link for UniSpeech Large EN is invalid, can you update it? Also, can the UniSpeech Base EN model be shared as well? Thank you very much!

The code for computing the relative positional embedding

Hi, I have been using this project for months, and thank you for the excellent work!

Recently I dove into the code and found something that looks wrong with the multi-head attention:

  1. It seems that the code for computing the positional embedding here is wrong: https://github.com/microsoft/UniSpeech/blob/main/WavLM/modules.py#L730
    It updates the positional embedding with the GRU and passes it to the next layer. However, I think only the original embedding should be passed to the next layer, not the GRU-processed one.

  2. In the other branch, the positional embedding is correct, but the input query is used to compute the attention mask. According to the paper, the projected query should be used rather than the original one. Is there anything I missed?

  3. The way the relative positional embedding is computed with the GRU seems different from the formula in the paper.

Could anyone please explain these questions?

How to use finetune wavlm large model?

If I want to use the fine-tuned WavLM Large model for the speaker verification task to get the best performance, I need to pass config_path in this code:

elif model_name == 'wavlm_large':
    config_path = None
    model = ECAPA_TDNN_SMALL(feat_dim=1024, feat_type='wavlm_large', config_path=config_path)

Is that right? But what should I pass as config_path? Is anything OK?

KeyError: "'_name'" Speaker Diarization

  • I downloaded the models from the link inside the speaker_diarization directory

  • After adding the model inside the path I ran the command

python diarization.py --wav_path tmp/mix_0000496.wav --model_init models/avg-4.pth

  • I am getting the following error in the console:
  File "/UniSpeech/downstreams/speaker_diarization/models/utils.py", line 32, in load_model
    model = task.build_model(cfg.model)
  File "//UniSpeech/downstreams/speaker_diarization/fairseq/fairseq/tasks/fairseq_task.py", line 335, in build_model
    model = models.build_model(cfg, self, from_checkpoint)
  File "/UniSpeech/downstreams/speaker_diarization/fairseq/fairseq/models/__init__.py", line 102, in build_model
    f"Could not infer model type from {cfg}. "
KeyError: "'_name'"

Why are my reproduced WavLM results on Vox1-O 30% worse?

Model | EER (mine) | EER (official)
wavlm_large_nofinetune.pth | 0.965 | 0.75
wavlm_large_finetune.pth | 0.631 | 0.431

The above results are from validating your shared WavLM models on the original Vox1-O data without changing any code.
What might be the reason for this gap? Wrong settings?
Here is more background about my setting:

  1. Create a conda env:
conda create -n UniSpeech_py3p8 python=3.8
  2. Follow your guidance under https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification:
pip install --require-hashes -r requirements.txt

The following error will appear:

Collecting numpy<1.23.0,>=1.16.5
ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
    numpy<1.23.0,>=1.16.5 from https://files.pythonhosted.org/packages/2f/14/abc14a3f3663739e5d3c8fd980201d10788d75fea5b0685734227052c4f0/numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=64f56fc53a2d18b1924abd15745e30d82a5782b2cab3429aceecc6875bd5add0 (from scipy==1.7.1->-r requirements.txt (line 1))

Then I installed the environment manually (installing around 30~40 packages), just as in #26.

  3. Here are some related details:
    pip list | grep fairseq
    fairseq 0.12.1 /home/user1/tools/fairseq
    pip list | grep s3prl
    s3prl 0.3.1
    torch.version: 1.9.0+cu102
    python -V: 3.8.13

Thanks for your wonderful work; looking forward to your help.

Problem computing cosine similarity for speaker verification

python verification.py --model_name ecapa_tdnn --wav1 vox1_data/David_Faustino/hn8GyCJIfLM_0000012.wav --wav2 vox1_data/Josh_Gad/HXUqYaOwrxA_0000015.wav
2023-08-03 18:48:40 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
The similarity score between two audios is 0.9880 (-1.0, 1.0).

When I tried the script provided in the README, I got a similarity of 0.9880, which is far from the 0.2053 in the README and not useful for the speaker verification task. How can I reproduce the result in the README?

Time stamps in out.rttm file for speaker_diarization

I have a 5-second audio file with two different speakers, one after the other. It looks like it is able to recognize the two speakers, but the timestamps must be wrong because it says that speaker 2 starts at 6.40 (seconds?) and lasts for 3.88 (seconds?).

Here is the output of the actual out.rttm file:
SPEAKER Anonymous 1 0.52 5.08 <NA> <NA> Anonymous_0 <NA>
SPEAKER Anonymous 1 6.40 3.88 <NA> <NA> Anonymous_1 <NA>

And attached is the .wav file I used to evaluate.
an4_diarize_test.zip

Does anyone know why the times seem proportionally off, as if it is counting in half-seconds?

pre-training detail for Unispeech-SAT

Hi there!

Excellent paper on UniSpeech-SAT. I have one question regarding pre-training, as I see the pre-training code isn't available (I would be happy to know if it is available anywhere). I wanted to know whether any kind of normalization was applied to the model embeddings for the utterance-wise contrastive loss (e.g., L2 normalization or instance normalization).

Would be very helpful if you could help me with that!

pip install --require-hashes -r requirements.txt

Hello, when I use speaker_verification it needs the pip requirements, so I run pip install --require-hashes -r requirements.txt, but there are some problems:

ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
numpy<1.23.0,>=1.16.5 from https://pypi.tuna.tsinghua.edu.cn/packages/2f/14/abc14a3f3663739e5d3c8fd980201d10788d75fea5b0685734227052c4f0/numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=64f56fc53a2d18b1924abd15745e30d82a5782b2cab3429aceecc6875bd5add0 (from scipy==1.7.1->-r requirements.txt (line 1))

How can I fix it?

diarization - KeyError: 'embed.weight'

I got the error running python diarization.py --config_path config/infer_est_nspk1.yaml --wav_path 0.wav --model_init WavLM-Large.pt
Traceback (most recent call last):
File "diarization.py", line 321, in
main(args)
File "diarization.py", line 272, in main
model_all_n_speakers = model_parameter_dict["embed.weight"].shape[0]
KeyError: 'embed.weight'

Thanks in advance.

about updated file "unispeech_sat.th"

  1. Run the command: python verification.py --model_name unispeech_sat --wav1 vox1_data/David_Faustino/hn8GyCJIfLM_0000012.wav --wav2 vox1_data/Josh_Gad/HXUqYaOwrxA_0000015.wav --checkpoint UniSpeecg-SAT-Large.pt
  2. An error is reported:
    [error screenshot]

Is there a problem with the file "unispeech_sat.th"? Please help me, thank you!

Error during pip install, and cannot import name 'Wav2VecModel' from 'fairseq.models.wav2vec'

The base conda env is: conda create -n unispeech python=3.8
When I run pip install --require-hashes -r requirements.txt under speaker_verification, the following error appears:

Collecting numpy<1.23.0,>=1.16.5
ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
    numpy<1.23.0,>=1.16.5 from https://files.pythonhosted.org/packages/2f/14/abc14a3f3663739e5d3c8fd980201d10788d75fea5b0685734227052c4f0/numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=64f56fc53a2d18b1924abd15745e30d82a5782b2cab3429aceecc6875bd5add0 (from scipy==1.7.1->-r requirements.txt (line 1))

Then I installed the environment manually.

WavLM Inference Error

I loaded the WavLM Large model from the link provided.

When trying to follow the code for loading the pretrained model for inference I get the following error:
cfg = WavLMConfig(checkpoint['cfg'])
KeyError: 'cfg'

It looks like this model does not have any 'cfg' key or 'model' key for that matter.

UniSpeech-SAT/speaker_verification: torchaudio version

Hello,
In the UniSpeech-SAT/speaker_verification/verification.py file, the resample function of torchaudio.functional seems to be supported only from torchaudio version 0.9.0 onwards.
For versions below 0.9.0, there is the Resample transform in torchaudio.transforms, as sketched below.
Thanks.
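
For reference, a version-agnostic resampling sketch; both APIs are standard torchaudio, but whether the repo script should adopt this is for the maintainers to decide.

# Hedged sketch: resample 48 kHz audio to 16 kHz in a way that works both
# before and after torchaudio 0.9.0.
import torch
import torchaudio

waveform = torch.randn(1, 48000)  # dummy 1-second 48 kHz signal
try:
    resampled = torchaudio.functional.resample(waveform, orig_freq=48000, new_freq=16000)
except AttributeError:  # torchaudio < 0.9.0 has no functional.resample
    resampled = torchaudio.transforms.Resample(orig_freq=48000, new_freq=16000)(waveform)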

Missing recipe/code for speaker verification

A few days back, I was able to access UniSpeech-SAT/speaker_verification/ for the ECAPA-TDNN recipe, but now it has been removed from the GitHub repo. Do you have any plans to add it back, or is it not going to be added in the future?

Thanks
Ajinkya Kulkarni

Where are the training/test/inference recipes?

Are the recipes in s3prl?

I see unispeech_sat and wavlm in s3prl, but I do not find any train/test/infer code in this repository.

WavLM SV demo in s3prl:

  • Training demo in the s3prl repository:

python3 run_downstream.py -n wavlm -m train -u fbank -d sv_voxceleb1

  • Test demo in s3prl repository:

python3 run_downstream.py -m evaluate -e result/downstream/wavlm/wavlm_large_finetune.pth

  • Infer demo in UniSpeech repository:

https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification#example

Pre-processed data missing

Hi, thank you for sharing your work.

I noticed that the pre-processed data are missing.

From the Unispeech/README.md file:
All our pre-processed data as well as the dictionaries can be downloaded from [here].

Can you provide the link?
Thank you!

How to calculate the EER or DER?

Sorry to bother you, but I am new to this: I want to know how to calculate the EER or DER as the authors present in the paper, but I cannot seem to find the code for it.
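
For what it's worth, EER can be computed generically from verification scores and 0/1 trial labels; the sketch below is a standard recipe, not necessarily the exact script the authors used.

# Hedged sketch: equal error rate (EER) from similarity scores and 0/1 labels.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FAR ≈ FRR
    return float((fpr[idx] + fnr[idx]) / 2)

# Example: compute_eer(np.array([0.9, 0.2, 0.7]), np.array([1, 0, 1]))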

Release vocab.json for Common Voice

Hey UniSpeech team!

Thanks a lot for making the pre-trained checkpoints available for everyone. Would you mind also open-sourcing the dictionaries /datablob/users/v-chengw/data/commonvoice_20200622/common_voices_splits/nl/phonesMatches_reduced.json for UniSpeech base & large so that the model can be used out of the box for inference?

Is "unispeech_sat.th" wrong ?

Hello,

I think the "unispeech_sat.th" is wrong.
I have just cloned the repository and tried speaker verification with UniSpeech-SAT, and when I launch the example:

python verification.py --model_name unispeech_sat --wav1 vox1_data/David_Faustino/hn8GyCJIfLM_0000012.wav --wav2 vox1_data/Josh_Gad/HXUqYaOwrxA_0000015.wav --checkpoint UniSpeech-SAT-Large.pt

I have an error (end of the traceback):
File "/data/coros1/ddallon/workspace/UniSpeech/UniSpeech-SAT/fairseq/models/__init__.py", line 88, in build_model assert model is not None, ( AssertionError: Could not infer model type from {'_name': 'bc_m_hubert', 'label_rate': 50, 'extractor_mode': 'layer_norm', 'structure_type': 'transformer', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'final_dim': 768, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 1.0, 'boundary_mask': False, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'relative_position_embedding': False, 'num_buckets': 320, 'max_distance': 1280, 'gru_rel_pos': False, 'expand_attention_head_size': -1, 'streaming': False, 'chunk_size': 0, 'left_chunk': 0, 'num_negatives': 0, 'negatives_from_everywhere': False, 'cross_sample_negatives': 100, 'codebook_negatives': 0, 'quantize_targets': True, 'latent_vars': 320, 'latent_groups': 2, 'latent_dim': 0, 'spk_layer': 12, 'mixing_max_len': -1, 'mixing_prob': 0.5, 'mixing_num': 1, 'pretrained_path': ''}. Available models: dict_keys(['wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'wav2vec_transducer', 'hubert', 'hubert_ctc', 'transformer_lm', 'unispeech_sat']) Requested model type: bc_m_hubert

And I notice that "bc_m_hubert" appears only in "unispeech_sat.th".

Could you check it or help me? :-)

wavlm model for downstream test?

Hello, could you please upload the WavLM model config (like the unispeech_sat.th) for the speaker diarization downstream demo? Thank you!

Fine-tuned Unispeech models cannot be downloaded

Hey,

I think the following links are broken:

UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 353 hrs fr | 1 hr fr | download
UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 168 hrs es | 1 hr es | download
UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 90 hrs it | 1 hr it | download

When clicking on "download" they all give:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>AuthenticationFailed</Code>
<Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:882ea340-c01e-0156-2462-c783db000000 Time:2021-10-22T16:32:28.6558972Z</Message>
<AuthenticationErrorDetail>Signature did not match. String to sign used was rl 2021-10-18T12:05:48Z 2023-10-19T12:05:00Z /blob/tsstd01scus/models/users/v-chengw/commonvoice_model/ft_fr-pt_fr353.large.one2one_baselinelr5e-4/checkpoint_best.pt 2018-03-28 </AuthenticationErrorDetail>
</Error>

It would be great if you could update them :-)

Bug in WavLM?

padding_mask = padding_mask.all(-1)

I think this should be padding_mask = padding_mask.any(-1)


The argument is as follows:

Suppose I have a padded input of size (4, 90799), which consists of 4 wavs. Their lengths are 90799, 75108, 60146, 60087, respectively.

feature, mask = wavlm.extract_features(y, padding_mask=y_mask)
print((1 - mask.int()).sum(-1))

Running the above code will have [283, 235, 188, 188] printed out. But [283, 234, 187, 187] is expected, because 90799 // 320 = 283, 75108 // 320 = 234, 60146 // 320 = 187, 60087 // 320 = 187.

    def forward_padding_mask(
            self, features: torch.Tensor, padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # padding_mask.size() = (4, 90799)
        extra = padding_mask.size(1) % features.size(1)
        if extra > 0:
            padding_mask = padding_mask[:, :-extra]
        # padding_mask.size() = (4, 90560)
        padding_mask = padding_mask.view(
            padding_mask.size(0), features.size(1), -1
        )
        # padding_mask.size() = (4, 283, 320)
        # padding_mask[1] =
        # [[0, 0, 0, 0, ..., 0, 0, 0],
        #  [0, 0, 0, 0, ..., 0, 0, 0],
        #  ...
        #  [0, 0, 0, 0, ..., 1, 1, 1],
        #  ...
        #  [1, 1, 1, 1, ..., 1, 1, 1],
        #  [1, 1, 1, 1, ..., 1, 1, 1],
        padding_mask = padding_mask.all(-1)
        # padding_mask.size() = (4, 283)
        # padding_mask[1] =
        # [False,
        #  False,
        #  ...
        #  False,  # this should be 'True'
        #  ...
        #  True,
        #  True],
        return padding_mask

Please correct me if I am wrong.
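
To make the argument concrete, a small self-contained sketch (assuming 320-sample frames and the 75108-sample wav from above) showing how .all(-1) and .any(-1) disagree on the partially padded frame:

# Frame 234 of the 75108-sample wav mixes real samples and padding, so
# .all(-1) says "not padded" (235 frames kept) while .any(-1) says "padded"
# (234 frames kept), matching the expected 75108 // 320 = 234.
import torch

total_len, real_len, frame = 283 * 320, 75108, 320   # 90560 samples after trimming
padding_mask = torch.arange(total_len) >= real_len    # True where padded
frames = padding_mask.view(-1, frame)                 # (283, 320)
print((~frames.all(-1)).sum().item())   # 235 frames counted as non-padded
print((~frames.any(-1)).sum().item())   # 234 frames counted as non-padded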
