Comments (9)

Adel-Moumen commented on June 12, 2024

Hello @FrancescoBonzi,

Sorry for the issue! There's an ongoing refactoring of Whisper fine-tuning here: #2450

I fixed some issues, and now you can train a Whisper model and obtain competitive results (in this case, the WER went from 2.07% to 1.72%).

I am still working on this and there will be some slight changes, but if you want, you can use this pull request as the basis for your work. Sorry again.

from speechbrain.

FrancescoBonzi commented on June 12, 2024

Amazing, super fast answer! Thank you very much, I'll go ahead with this PR. Does it also support fine-tuning with timestamps?

Adel-Moumen commented on June 12, 2024

Amazing, super fast answer! Thank you very much, I'll go ahead with this PR. Does it also support fine-tuning with timestamps?

Unfortunately, not yet, but I plan to add it soon. Basically, I'm improving our Whisper interface a lot so that we support everything (flash attention, KV cache, prompting, etc.). I might also add this feature if there is a strong request from the community.

NOTE: as I said, this PR is subject to change. I am still working heavily on it, but I got some good numbers. I haven't cleaned everything up, so you might have to change some paths in the YAML files, etc. Sorry for the mess; as it is a draft PR, there are still some ongoing things to change.

FrancescoBonzi commented on June 12, 2024

I see that there is a lack of material on fine-tuning Whisper with timestamps; there is maybe this repo, but it seems to be no longer maintained. In general, I think timestamp support is an essential feature for any strong and reliable new version of Whisper, and SpeechBrain could be the right place to find it. I'm really interested in it: we're trying to fine-tune Whisper on song lyrics! If you need a hand, I can try to help.

Adel-Moumen commented on June 12, 2024

I see that there is a lack of material on fine-tuning Whisper with timestamps; there is maybe this repo, but it seems to be no longer maintained. In general, I think timestamp support is an essential feature for any strong and reliable new version of Whisper, and SpeechBrain could be the right place to find it. I'm really interested in it: we're trying to fine-tune Whisper on song lyrics! If you need a hand, I can try to help.

I would say it would be a lovely feature, and it would be a great help if you could contribute to it! I was also looking at this repo, but I don't know whether the implementation gives good results. Maybe you could explore this and let me know? I haven't spent enough time understanding how it works, and I still have some trouble understanding how the alignment is performed through tokens, TBH (e.g. how could you say that the word "hello" is at frames 2 to 5 using only a textual representation? I suspect that if this is the case, then maybe Whisper is not that good at alignment, but I need to explore a bit more).

FrancescoBonzi commented on June 12, 2024

Here I see that the authors trained the model using a precision of 0.02 seconds (1501 special tokens from 0.0s to 30.0s) and treated these tokens like all the others, using one-hot labels. I think that at inference time Whisper predicts sentence timestamps and uses DTW to predict word timestamps.
While here, the authors explain how they prepared the dataset for training with timestamps.
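
To make the numbers concrete, here is a minimal Python sketch of how I understand the timestamp vocabulary: times are quantized to 0.02 s steps, which gives 1501 tokens over a 30 s window, and each segment of text is wrapped in a start and an end token. The helper names are hypothetical and the `<|t.tt|>` token spelling is an assumption on my side; this is not code from either of the linked repos or from SpeechBrain.

```python
# Sketch only: quantize times to Whisper-style timestamp tokens.
# Assumptions: 0.02 s precision and a 30 s window (from the comment above);
# the "<|t.tt|>" spelling and all function names are illustrative.

TIME_PRECISION = 0.02  # seconds per timestamp step
MAX_TIME = 30.0        # length of one audio window in seconds
NUM_TIMESTAMP_TOKENS = round(MAX_TIME / TIME_PRECISION) + 1  # 0.00 .. 30.00


def time_to_index(seconds: float) -> int:
    """Map a time in seconds to the index of the nearest timestamp token."""
    if not 0.0 <= seconds <= MAX_TIME:
        raise ValueError(f"time {seconds} is outside [0, {MAX_TIME}]")
    return round(seconds / TIME_PRECISION)


def index_to_token(index: int) -> str:
    """Render a timestamp index as a special-token string."""
    if not 0 <= index < NUM_TIMESTAMP_TOKENS:
        raise ValueError(f"index {index} is out of range")
    return f"<|{index * TIME_PRECISION:.2f}|>"


def build_target(segments):
    """Build a training target from (start, end, text) segments.

    Each segment becomes: start token, text, end token.
    """
    parts = []
    for start, end, text in segments:
        parts.append(index_to_token(time_to_index(start)))
        parts.append(text)
        parts.append(index_to_token(time_to_index(end)))
    return "".join(parts)


if __name__ == "__main__":
    print(NUM_TIMESTAMP_TOKENS)
    print(build_target([(0.0, 2.5, " hello world")]))
```

Since the timestamp tokens are ordinary vocabulary entries in this scheme, the resulting target string can go through the usual cross-entropy loss like any other token sequence, which is how I read "treated these tokens like all the others".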

I'll check this repo in the next few days.

Adel-Moumen commented on June 12, 2024

Here I see that the authors trained the model using a precision of 0.02 seconds (1501 special tokens from 0.0s to 30.0s) and treated these tokens like all the others, using one-hot labels. I think that at inference time Whisper predicts sentence timestamps and uses DTW to predict word timestamps. While here, the authors explain how they prepared the dataset for training with timestamps.

I'll check this repo in the next few days.

Okay! I'm currently training some CommonVoice Whisper models (large and small, on French and Italian; I'll maybe try English as well). I will keep you posted on the results, but so far I got some good numbers. I don't know yet whether I will add timestamp support in this PR. I think I will focus on having strong baselines and adding support for long-form ASR and prompting. I don't know if it would require a crazy amount of time to add timestamps, TBH. If you want, you could open a PR for that? I will review it, of course, and it could be a nice thing to add to SpeechBrain :)

FrancescoBonzi commented on June 12, 2024

Okay, I think the code here is a good starting point for training with timestamps, but it needs some improvements to work with multi-GPU and to be more flexible. I may build upon your code, once it is complete, to also support timestamps.
