Describe the bug I reproduced the recipe for Whisper fine-tuning u


LibriSpeech Whisper Finetuning - WER 98% after 3 epochs about speechbrain HOT 9 CLOSED

FrancescoBonzi commented on July 1, 2024

LibriSpeech Whisper Finetuning - WER 98% after 3 epochs

from speechbrain.

Comments (9)

Adel-Moumen commented on July 1, 2024

Hello, we've merged the new Whisper PR, which fixes a bunch of issues :)

Feel free to git clone the latest SpeechBrain version in order to use it :)

Thanks again for reporting this issue.

from speechbrain.

Adel-Moumen commented on July 1, 2024

Hello @FrancescoBonzi,

Sorry for the issue! There's an ongoing refactoring of Whisper fine tuning here: #2450

I fixed some issues, and you can now train a Whisper model and obtain competitive results (in this case, the WER went from 2.07% to 1.72%).

I am currently working on this and there will be some slight changes but if you want you can use this pull request as the basis of your work... sorry again.
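
For readers comparing the numbers in this thread (a WER of 98% in the report versus 2.07% and 1.72% above), here is a minimal sketch of how word error rate is usually computed: word-level edit distance divided by the number of reference words. The function names are illustrative and are not SpeechBrain's API; the recipe's own scoring may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement between two error rates (same unit for both)."""
    return (before - after) / before


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the dog sat"))
    print(relative_reduction(2.07, 1.72))
```

Under this definition, going from 2.07% to 1.72% is a relative reduction of roughly 17%. Note also that WER can exceed 100% when the hypothesis contains many insertions, which is one way a broken fine-tuning run ends up with very high values.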

from speechbrain.

FrancescoBonzi commented on July 1, 2024

Amazing, super fast answer!! Thank you very much, I'll go on with this PR. Does it also support fine-tuning with timestamps?

from speechbrain.

Adel-Moumen commented on July 1, 2024

> Amazing, super fast answer!! Thank you very much, I'll go on with this PR. Does it also support fine-tuning with timestamps?

Unfortunately, not yet, but I plan to add it soon. Basically, I'm improving our Whisper interface a lot so that we support everything (flash attention, KV cache, prompting, etc.). I might also add this feature if there is a strong request from the community.

NOTE: as I said, this PR is subject to change. I'm still working heavily on it, but I got some good numbers. I haven't cleaned everything up, so you might have to change some paths in the YAML files, etc. Sorry for the mess; since it's a draft PR, there are still some things left to change.

from speechbrain.

FrancescoBonzi commented on July 1, 2024

I see that there is a lack of material on fine-tuning Whisper with timestamps; there is maybe this repo, but it seems to be no longer maintained. In general, I think timestamps are an essential feature for any strong and reliable new version of Whisper, and SpeechBrain could be the right place to find it. I'm really interested in this: we're trying to fine-tune Whisper on song lyrics! If you need a hand, I can try to help.

from speechbrain.

Adel-Moumen commented on July 1, 2024

> I see that there is a lack of material on fine-tuning Whisper with timestamps; there is maybe this repo, but it seems to be no longer maintained. In general, I think timestamps are an essential feature for any strong and reliable new version of Whisper, and SpeechBrain could be the right place to find it. I'm really interested in this: we're trying to fine-tune Whisper on song lyrics! If you need a hand, I can try to help.

I'd say it would be a lovely feature, and it would be a great help if you could contribute it! I was also looking at this repo, but I dunno whether the implementation gives good results. Maybe you could explore this and lemme know? I haven't spent enough time understanding how it works, and TBH I still have trouble understanding how the alignment is performed through tokens (e.g. how can you say that the word "hello" spans frames 2 to 5 using only a textual representation? I suspect that if this is the case, then maybe Whisper is not that good at alignment, but I need to explore a bit more).
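
On the alignment question, one common answer is that the text alone is not enough: some token-by-frame score is needed, and dynamic time warping (DTW) then finds a monotonic path through it. The sketch below assumes such a cost matrix is already available (for example, derived from attention weights between text tokens and audio frames); the matrix values and function names are made up for illustration and do not come from Whisper or SpeechBrain.

```python
from typing import Dict, List, Tuple


def dtw_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the minimum-cost monotonic path through a tokens x frames matrix."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = best accumulated cost aligning the first i tokens to the first j frames.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )

    # Backtrack from the last cell to the first one.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def token_frame_spans(path: List[Tuple[int, int]]) -> Dict[int, Tuple[int, int]]:
    """Collapse a DTW path into (first_frame, last_frame) per token index."""
    spans: Dict[int, Tuple[int, int]] = {}
    for token, frame in path:
        first, last = spans.get(token, (frame, frame))
        spans[token] = (min(first, frame), max(last, frame))
    return spans


if __name__ == "__main__":
    # Made-up costs: 3 tokens x 6 frames, 0 = good match, 1 = poor match.
    cost = [
        [0, 0, 1, 1, 1, 1],
        [1, 1, 0, 0, 0, 1],
        [1, 1, 1, 1, 1, 0],
    ]
    print(token_frame_spans(dtw_path(cost)))  # {0: (0, 1), 1: (2, 4), 2: (5, 5)}
```

With this toy matrix, token 1 is assigned frames 2 to 4. The quality of the alignment depends entirely on how informative the cost matrix is, which is the part that still needs to be explored for Whisper.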

from speechbrain.

FrancescoBonzi commented on July 1, 2024

Here (https://github.com/openai/whisper/discussions/838) I see that the authors trained the model with a timestamp precision of 0.02 seconds (1501 special tokens from 0.0s to 30.0s) and treated these tokens like all the others, using one-hot labels. I think that at inference time Whisper predicts sentence-level timestamps and uses DTW to predict word-level timestamps.
While here, the authors explain how they prepared the dataset for training with timestamps.
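
To make the 0.02 s / 1501-token arithmetic concrete, here is a minimal sketch of mapping times to timestamp-token ids and back, plus one possible way of wrapping each segment's text tokens between a start and an end timestamp token. The value of `TIMESTAMP_BEGIN`, the segment layout, and all function names are assumptions for illustration; the real id of the first timestamp token depends on the tokenizer.

```python
from typing import List, Tuple

TIME_PRECISION = 0.02  # seconds per timestamp step, as stated above
MAX_SECONDS = 30.0     # length of one audio window
NUM_TIMESTAMP_TOKENS = round(MAX_SECONDS / TIME_PRECISION) + 1  # 1501

# Hypothetical id of the token for 0.00 s; NOT the real tokenizer value.
TIMESTAMP_BEGIN = 50_000


def time_to_token(seconds: float, timestamp_begin: int = TIMESTAMP_BEGIN) -> int:
    """Quantize a time in seconds to the nearest timestamp-token id."""
    if not 0.0 <= seconds <= MAX_SECONDS:
        raise ValueError(f"time {seconds} is outside [0, {MAX_SECONDS}]")
    return timestamp_begin + round(seconds / TIME_PRECISION)


def token_to_time(token_id: int, timestamp_begin: int = TIMESTAMP_BEGIN) -> float:
    """Convert a timestamp-token id back to seconds."""
    index = token_id - timestamp_begin
    if not 0 <= index < NUM_TIMESTAMP_TOKENS:
        raise ValueError(f"token {token_id} is not a timestamp token")
    return index * TIME_PRECISION


def build_labels(segments: List[Tuple[float, float, List[int]]]) -> List[int]:
    """Assumed layout: start token, text tokens, end token, for each segment."""
    labels: List[int] = []
    for start, end, text_ids in segments:
        labels.append(time_to_token(start))
        labels.extend(text_ids)
        labels.append(time_to_token(end))
    return labels


if __name__ == "__main__":
    print(NUM_TIMESTAMP_TOKENS)
    print(build_labels([(0.0, 1.24, [11, 12]), (1.24, 2.5, [13])]))
```

Because the timestamp tokens are ordinary vocabulary entries in this scheme, the resulting label sequence can be trained with the same cross-entropy loss as the text tokens, which matches the one-hot labels described above.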

I'll check this repo the next few days.

from speechbrain.

Adel-Moumen commented on July 1, 2024

> Here I see that the authors trained the model with a timestamp precision of 0.02 seconds (1501 special tokens from 0.0s to 30.0s) and treated these tokens like all the others, using one-hot labels. I think that at inference time Whisper predicts sentence-level timestamps and uses DTW to predict word-level timestamps. While here, the authors explain how they prepared the dataset for training with timestamps.
>
> I'll check this repo the next few days.

Okay! I'm currently training some CommonVoice Whisper models (large and small, on French and Italian; I'll maybe try English as well). I'll keep you posted on the results, but so far I've got some good numbers. I don't know yet whether I'll add timestamp support in this PR. I think I'll focus on having strong baselines and adding support for long-form ASR and prompting. TBH I don't know whether adding timestamps would require a crazy amount of time. If you want, you could open a PR for that? I'll review it of course, and it could be a nice thing to add to SpeechBrain :)

from speechbrain.

FrancescoBonzi commented on July 1, 2024

Okay, I think the code here (https://github.com/jumon/whisper-finetuning/tree/main) is a good starting point for training with timestamps, but it needs some improvements to work with multi-GPU and to be more flexible. I may build on your code once it is completed to also support timestamps.

from speechbrain.
