Comments (11)

sayakpaul commented on May 27, 2024

I think to keep it tidy we could use this repo, and once we have settled on something we could incorporate it into the GSoC repo. WDYT?

I will check the results tomorrow and share my comments.

from gsoc-wav2vec2.

sayakpaul commented on May 27, 2024

Gotcha. Thank you.

sayakpaul commented on May 27, 2024

@vasudevgupta7 I get a 404 after clicking on the above-mentioned link.

I think we need to design an augmentation pipeline to regularize the student training so that it is able to match the underlying teacher. The FunMatch paper circumvents this by using an aggressive form of MixUp and a much longer training schedule to compensate.

Translating that to speech is difficult, I agree, and this is where I believe we have opportunities. It might be worth taking a look at AugLy, an open-source framework providing augmentation transformations for different data modalities, including audio. It might help us curate an augmentation pipeline for our purpose.
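As a rough illustration of what such a pipeline could look like (this is a minimal numpy sketch, not AugLy's actual API — the function names and parameters here are made up for illustration):

```python
import numpy as np

def add_noise(wav: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Mix in white noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wav.shape)
    return wav + noise

def random_gain(wav: np.ndarray, rng: np.random.Generator,
                low_db: float = -6.0, high_db: float = 6.0) -> np.ndarray:
    """Scale the waveform by a random gain drawn in dB."""
    gain_db = rng.uniform(low_db, high_db)
    return wav * (10 ** (gain_db / 20))

def time_mask(wav: np.ndarray, rng: np.random.Generator,
              max_frac: float = 0.05) -> np.ndarray:
    """Zero out a random contiguous chunk (SpecAugment-style, but on the raw waveform)."""
    n = len(wav)
    width = int(rng.integers(0, int(max_frac * n) + 1))
    start = int(rng.integers(0, n - width + 1))
    out = wav.copy()
    out[start:start + width] = 0.0
    return out

def augment(wav: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Compose the individual transforms into one pipeline."""
    wav = add_noise(wav, snr_db=rng.uniform(10, 30), rng=rng)
    wav = random_gain(wav, rng=rng)
    wav = time_mask(wav, rng=rng)
    return wav

rng = np.random.default_rng(0)
# One second of a 440 Hz tone at 16 kHz as a stand-in for a real utterance.
wav = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000).astype(np.float32)
aug = augment(wav, rng)
```

In practice AugLy (or any similar library) would replace these hand-rolled transforms; the point is just that each transform keeps the waveform shape intact so the pipeline can be dropped in front of the student's feature extractor.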

On the other hand, your last thought on this comment also seems like a pretty good direction. If we do try to figure out that mapping (two conv blocks from the teacher = one conv block in the student, for example), I think we could introduce another bottleneck layer to help make that transfer learnable.
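To sketch what that bottleneck idea could mean concretely (a numpy toy, with hypothetical dimensions — not the actual wav2vec2 sizes or training code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: teacher hidden size 768, student hidden size 384.
teacher_dim, student_dim, seq_len = 768, 384, 50

# Teacher features, e.g. the output of two stacked conv blocks, and the
# student features from the single conv block meant to mimic them.
teacher_feats = rng.normal(size=(seq_len, teacher_dim))
student_feats = rng.normal(size=(seq_len, student_dim))

# Learnable bottleneck: a linear projection that maps teacher features into
# the student's (smaller) feature space so the two become comparable.
W = rng.normal(scale=0.02, size=(teacher_dim, student_dim))
projected = teacher_feats @ W  # shape: (seq_len, student_dim)

# Feature-matching loss that training would minimize w.r.t. both the
# student parameters and the projection W.
feat_loss = np.mean((projected - student_feats) ** 2)
```

The projection is discarded after distillation; it only exists so the mismatched block outputs can be compared during training.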

thevasudevgupta commented on May 27, 2024

> I think to keep it tidy we could use this repo, and once we have settled on something we could incorporate it into the GSoC repo. WDYT?

Yeah! That would be good.

sayakpaul commented on May 27, 2024

@vasudevgupta7 seems like the training is now done? The training progress (loss-wise) looks good to me.

Also, just for my own reference: this is with regard to distilling the wav2vec2 model fine-tuned for speech recognition, correct?

Wanted to know a bit more about the student architectures. Could you provide brief overviews?

thevasudevgupta commented on May 27, 2024

@sayakpaul,

The above experiments are just normal fine-tuning of wav2vec2 on 100h of LibriSpeech data. Since training on 960h takes a lot of time, I want to establish some kind of baseline on a small amount of data so that further experiments can be started on small data. (We will definitely train on the 960h data in the end; this is just to cut experimentation time for now, as the 100h model is also giving reasonable WER.)
Further, since the experiments involve two-stage training, I wanted to check whether we can follow only stage 1 for further experimentation.

I will post brief overviews for every training experiment (in the table) by tonight!

I am going to do distillation training today.

sayakpaul commented on May 27, 2024

Got it. But didn't we have models fine-tuned on the LibriSpeech dataset (100h) already?

> Further, since the experiments involve two-stage training, I wanted to check whether we can follow only stage 1 for further experimentation.

By two-stage, do you mean training of both student and teacher models? In any case, I think when it's applicable we should be able to use the pre-trained (fine-tuned) models as teachers.

> I want to establish some kind of baseline on a small amount of data so that further experiments can be started on small data.

Perfectly fine.

thevasudevgupta commented on May 27, 2024

> Got it. But didn't we have models fine-tuned on the LibriSpeech dataset (100h) already?

No, I directly trained on 960h earlier.

> By two-stage, do you mean training of both student and teacher models? In any case, I think when it's applicable we should be able to use the pre-trained (fine-tuned) models as teachers.

By two stages, I mean this: #17 (comment)

thevasudevgupta commented on May 27, 2024

Hello @sayakpaul, I trained the first distillation model yesterday. Unfortunately, it didn't perform well, though it is trying to learn (not all predicted tokens are random). I am trying to change the initialization strategy and some hyperparameters to get it working.

- teacher: https://tfhub.dev/vasudevgupta7/wav2vec2-960h/1
- student: a smaller version of the same architecture
- loss: `alpha * KL-divergence loss + (1 - alpha) * CTC loss`
- script: https://github.com/vasudevgupta7/compressed-wav2vec2/blob/part_2/src/train_distilled.py
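For reference, the combined loss above can be sketched as follows (a minimal numpy illustration of the `alpha`-weighted objective; the CTC term is assumed to come from the framework's CTC implementation, and the temperature parameter is an assumption, not necessarily what the training script uses):

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distill_loss(student_logits: np.ndarray, teacher_logits: np.ndarray,
                 ctc_loss: float, alpha: float = 0.5,
                 temperature: float = 1.0) -> float:
    """alpha * KL(teacher || student) + (1 - alpha) * ctc_loss.

    `ctc_loss` is assumed to be computed elsewhere against the
    ground-truth transcripts; only the KL term is shown here.
    """
    t_logp = log_softmax(teacher_logits / temperature)
    s_logp = log_softmax(student_logits / temperature)
    # Per-frame KL divergence over the vocabulary, averaged over batch/time.
    kl = np.mean(np.sum(np.exp(t_logp) * (t_logp - s_logp), axis=-1))
    return alpha * kl + (1.0 - alpha) * ctc_loss

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 50, 32))  # (batch, time, vocab)
# When the student matches the teacher exactly, the KL term vanishes,
# leaving only the weighted CTC term.
same = distill_loss(logits, logits, ctc_loss=1.0, alpha=0.5)
```

Setting `alpha=1.0` drops the labeled signal entirely and trains on the teacher's soft targets alone, which is the variant discussed below.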

sayakpaul commented on May 27, 2024

Are you training the student for longer? How's the training progress?

What happens if we only use KL-divergence and completely get rid of the labeled signal?

thevasudevgupta commented on May 27, 2024

Currently only for 10 epochs (logs: https://wandb.ai/7vasudevgupta/wav2vec2-distillation/runs/2h82mhgc?workspace=user-7vasudevgupta). I need to play around with alpha. Will do these experiments today.
