Comments (9)
Hello @FrancescoBonzi,
Sorry for the issue! There's an ongoing refactoring of Whisper fine-tuning here: #2450
I fixed some issues, and you can now train a Whisper model and obtain competitive results (in this case, WER went from 2.07% to 1.72%).
I am currently working on this and there will be some slight changes, but if you want, you can use this pull request as the basis of your work. Sorry again.
from speechbrain.
Amazing, super fast answer!! Thank you very much, I'll go on with this PR. Does it also support fine-tuning with timestamps?
Unfortunately, not yet, but I plan to add it soon. Basically, I'm improving our Whisper interface a lot so that we support everything (flash attention, KV cache, prompting, etc.). I might also add this feature if there is a strong request from the community.
NOTE: as I said, this PR is subject to change. I am still working heavily on it, but I got some good numbers. I haven't cleaned everything up, so you might have to change some paths in the YAML files, etc. Sorry for the mess; since it is a draft PR, there are still some ongoing things to change.
I see that there is a lack of material on fine-tuning Whisper with timestamps. There is this repo, but it no longer seems to be maintained. In general, I think timestamp support is an essential feature for any strong and reliable new version of Whisper, and SpeechBrain could be the right place to find it. I'm really interested in it: we're trying to fine-tune Whisper on song lyrics! If you need a hand, I can try to help you.
It would be a lovely feature, and it would be a great help if you could contribute to it! I was also looking at this repo, but I don't know whether the implementation gives good results. Maybe you could explore this and let me know? I haven't spent enough time understanding how it works, and TBH I still have trouble understanding how the alignment is performed through tokens (e.g. how can you say that the word "hello" is at frames 2 to 5 using only the textual representation? I suspect that if this is the case, then maybe Whisper is not that good at alignment, but I need to explore a bit more).
Here I see that the authors trained the model with a timestamp precision of 0.02 seconds (1501 special tokens from 0.0s to 30.0s) and treated these tokens like all the others, using one-hot labels. I think that at inference time Whisper predicts sentence-level timestamps and uses DTW to predict word-level timestamps.
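To make the DTW idea concrete, here is a minimal sketch of dynamic time warping between text tokens and audio frames. It is only an illustration: the cost matrix is a hand-written toy, and the function names `dtw_path` and `token_frame_spans` are made up for this example. As far as I understand, Whisper's reference implementation derives the cost from cross-attention weights, but that part is not shown here.

```python
import math


def dtw_path(cost):
    """Return the cheapest monotonic alignment through a cost matrix.

    cost[i][j] is the (hypothetical) mismatch between token i and audio
    frame j. The result is a list of (token_index, frame_index) pairs.
    """
    n, m = len(cost), len(cost[0])
    # acc[i][j] = cheapest cost of aligning the first i tokens to the first j frames.
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev

    # Backtrack from the bottom-right corner to recover the path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # advance token and frame
            (acc[i - 1][j], i - 1, j),          # advance token only
            (acc[i][j - 1], i, j - 1),          # advance frame only
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


def token_frame_spans(path):
    """Collapse an alignment path into {token: (first_frame, last_frame)}."""
    spans = {}
    for token, frame in path:
        first, last = spans.get(token, (frame, frame))
        spans[token] = (min(first, frame), max(last, frame))
    return spans


# Toy example: token 0 matches frames 0-1, token 1 matches frames 2-3.
toy_cost = [
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]
toy_path = dtw_path(toy_cost)
toy_spans = token_frame_spans(toy_path)
```

Multiplying the frame indices by the frame duration would turn each span into start and end times. This also speaks to the question above about placing "hello" at frames 2 to 5: the alignment needs some audio-dependent cost, not the text alone.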
The authors also explain here how they prepared the dataset for training with timestamps.
I'll check this repo in the next few days.
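As a rough illustration of the timestamp scheme described above (0.02 s precision, 1501 tokens covering 0.0s to 30.0s), here is a small sketch of how a time could be mapped to a timestamp token and how a training target could be assembled. The helper names and the `(start, end, text)` segment format are assumptions for this example, not the API of any of the repos mentioned; the `<|0.00|>` style token text follows the format I believe Whisper's tokenizer uses.

```python
TIME_PRECISION = 0.02  # seconds per timestamp step, as described above
MAX_TIME = 30.0        # length of one Whisper audio window in seconds

# 0.0s to 30.0s inclusive in 0.02 s steps gives 1501 distinct timestamps.
NUM_TIMESTAMP_TOKENS = round(MAX_TIME / TIME_PRECISION) + 1


def time_to_index(seconds):
    """Map a time in seconds to the index of the nearest timestamp token."""
    if not 0.0 <= seconds <= MAX_TIME:
        raise ValueError(f"time {seconds} is outside the 0-{MAX_TIME}s window")
    return round(seconds / TIME_PRECISION)


def index_to_token(index):
    """Render a timestamp index as token text such as '<|2.50|>'."""
    return f"<|{index * TIME_PRECISION:.2f}|>"


def build_target(segments):
    """Wrap each (start, end, text) segment in its start and end timestamp tokens.

    The segment format is a hypothetical one chosen for this sketch.
    """
    parts = []
    for start, end, text in segments:
        parts.append(index_to_token(time_to_index(start)))
        parts.append(text)
        parts.append(index_to_token(time_to_index(end)))
    return "".join(parts)


example_target = build_target([(0.0, 2.5, " hello world")])
```

Once the timestamp tokens are part of the vocabulary, they can be trained with the same cross-entropy loss as ordinary text tokens, which is what "treated these tokens like all the others" amounts to.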
Okay! I'm currently training some CommonVoice Whisper models (large and small, on French and Italian; I may try English as well). I will keep you posted on the results, but so far I got some good numbers. I don't know yet whether I will add timestamp support in this PR. I think I will focus on having strong baselines and adding support for long-form ASR and prompting. TBH, I don't know whether adding timestamps would require a crazy amount of time. If you want, you could open a PR on that? I will review it of course, and it could be a nice thing to add to SpeechBrain :)
Okay, I think the code here is a good starting point for training with timestamps, but it needs some improvements to work with multi-GPU and to be more flexible. Once your code is complete, I may build upon it to add timestamp support as well.