huggingface / community-events
Place where folks can contribute to 🤗 community events
Hi!
I have been trying for a long time to run inference on my fine-tuned model, but it keeps throwing an error saying that the tokenizer is missing.
Steps to reproduce:
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="nodlehs/whisper_finetune")  # change to "your-username/the-name-you-picked"

def transcribe(audio):
    text = pipe(audio)["text"]
    return text
It seems a tokenizer file is missing, but no such file was uploaded during the Whisper fine-tuning run.
Could someone please help me out?
P.S. This is my model on HF: https://huggingface.co/nodlehs/whisper_finetune/tree/main
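In case it helps, one common cause is that only the model weights were pushed, not the processor/tokenizer files. A minimal sketch of uploading them separately (the base checkpoint, language, and task below are assumptions; use whatever you fine-tuned from):

from transformers import WhisperProcessor

# Recreate the processor from the base checkpoint used for fine-tuning,
# then push it to the same repo as the model weights.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="english", task="transcribe"
)
processor.push_to_hub("nodlehs/whisper_finetune")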
The fine-tuning script run_speech_recognition_seq2seq_streaming.py uses the interleave_datasets function to combine the train and validation splits. But I think what we really want is concatenate_datasets, because according to the docs, interleave_datasets stops when one of the source datasets runs out of examples (in the default mode).
For example, if the train split has 100 entries and the validation split has 10 entries, the result would contain only 10 entries from the validation split and 10 from the train split. That means we waste most of the existing train split.
As an illustration:
>>> from datasets import Dataset, interleave_datasets, concatenate_datasets
>>> d1 = Dataset.from_dict({"a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
>>> print(interleave_datasets([d1, d2])['a'])
[0, 10, 1, 11, 2, 12]
>>> print(concatenate_datasets([d1, d2])['a'])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Hi,
In file "interleave_streaming_datasets.ipynb" the rename_column and remove_column methods are used and it will throw an error with this line of code:
dataset = dataset.remove_columns(set(dataset.features.keys()) - set(["audio", "sentence"]))
The error occurs because dataset.features becomes None. This is a bug mentioned here.
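Until the upstream bug is fixed, one possible workaround (a sketch, assuming a streaming IterableDataset) is to infer the column names from the first example instead of from dataset.features:

# dataset.features can be None for streaming datasets, so derive the
# column names from an actual example instead.
first_example = next(iter(dataset))
columns_to_remove = set(first_example.keys()) - {"audio", "sentence"}
dataset = dataset.remove_columns(list(columns_to_remove))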
Just dumping ideas here related to this repo cc @NielsRogge @nateraw
PyTorchModelHubMixin has not been widely tested, so I'm a bit worried people will face issues.
Hi, I tried to upload my dataset to the Hub but I am getting this error message:
HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create - You don't have the rights to create a dataset under this namespace
I tried changing the permissions of the token (write and read), but it didn't work. Any help would be appreciated.
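For reference, a minimal sketch of what usually resolves this: authenticate with a write-scoped token and create the dataset repo under your own username rather than another organization's namespace (the repo id below is a placeholder):

from huggingface_hub import login, create_repo

login()  # paste a token with *write* access when prompted
create_repo("your-username/your-dataset-name", repo_type="dataset")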
Hi,
I have custom text data for plant disease names and plant names like this:
uuid, context
1er1hhaj13, The Rhododendron, a popular ornamental plant, often suffers from Phytophthora ramorum, a challenging disease to manage and pronounce. This pathogen causes Sudden Oak Death, which can lead to extensive damage and mortality in infected plants.
I used text-to-speech APIs to convert this context into audio WAV files, choosing 10 speakers with mostly American and British accents. This gave me around 5k samples for training and 2k samples for testing.
I followed the same steps from "Fast whisper finetuning" to fine-tune the PEFT version of Whisper large-v2. The training and validation losses look good:
Step | Training Loss | Validation Loss
250 | 0.413000 | 0.102663
500 | 0.109900 | 0.130888
750 | 0.116500 | 0.102719
1000 | 0.092800 | 0.099153
1250 | 0.068800 | 0.075613
1500 | 0.042500 | 0.085680
1750 | 0.047500 | 0.076951
2000 | 0.027500 | 0.065127
2250 | 0.023700 | 0.061832
2500 | 0.012500 | 0.062658
2750 | 0.011500 | 0.061922
3000 | 0.008500 | 0.061463
3250 | 0.005300 | 0.060227
3500 | 0.003800 | 0.060712
3750 | 0.002700 | 0.060332
4000 | 0.002300 | 0.060496
When I calculated the WER on the test data, it also looked good. However, during real-time testing with an Indian English-speaking audience, the accuracy for plant names and disease names was not satisfactory. What strategies could we employ to improve accuracy in real-time settings?
Any guidance or suggestions on this matter would be greatly appreciated. Thank you!
In the mixin code (https://github.com/huggingface/huggingface_hub/blob/9e0ac58813df4e0414d6fd494040953f053dbe0d/src/huggingface_hub/hub_mixin.py#L93) from_pretrained calls _from_pretrained but doesn't pass in a use_auth_token argument.
In the LightweightGAN code this argument is required: https://github.com/huggingface/community-events/blob/main/huggan/pytorch/lightweight_gan/lightweight_gan.py#L854
Notebook showing how this manifests for a user: https://colab.research.google.com/drive/1Lc42pRp0-ZxFKbhfU420ZrpXfA8Q-k-e?usp=sharing (includes how I worked around this for now). Currently if you follow the example usage at e.g. https://huggingface.co/ceyda/butterfly_cropped_uniq1K_512 you'll get an error.
The suggested fix is to just add a default of use_auth_token=None in the LightweightGAN _from_pretrained method, but I'm creating an issue in case someone wants to do a more thorough fix. This code is very rarely used, but I've had at least one keen learner stuck on this.
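For illustration, the suggested fix would look roughly like this (the parameter list is abridged from the linked LightweightGAN code and may not match it exactly; treat this as a sketch, not the actual diff):

@classmethod
def _from_pretrained(cls, model_id, revision, cache_dir, force_download,
                     proxies, resume_download, local_files_only,
                     use_auth_token=None, **model_kwargs):
    # Defaulting use_auth_token to None lets the mixin's from_pretrained
    # call _from_pretrained without passing the argument explicitly.
    ...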
After fine-tuning, how can I deploy / use the checkpoints?
Or how do I export the checkpoints into a model that I can load and deploy, as with whisper?
import whisper
model = whisper.load_model("base")
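Hugging Face checkpoints are loaded differently from openai-whisper ones. A minimal sketch (the checkpoint path / repo id is a placeholder for your fine-tuning output_dir or Hub repo):

from transformers import pipeline

# Point at the output_dir of your fine-tuning run, or the repo you pushed to the Hub.
pipe = pipeline("automatic-speech-recognition", model="your-username/your-finetuned-whisper")
print(pipe("sample.wav")["text"])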
Hi, I'm trying to fine-tune Whisper with multiple GPUs, and I don't know what RANK to set.
I just set WORLD_SIZE to the number of GPUs, MASTER_ADDR to localhost, and MASTER_PORT to an idle port.
When WORLD_SIZE is more than 2 and RANK is set to 0, training hangs. It probably hangs in the torch.distributed.TCPStore() setup.
Has anyone solved this problem? Please let me know any hints.
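If it helps: each process must be launched with its own distinct RANK (0 through WORLD_SIZE-1). If only a RANK=0 process is running, the TCPStore rendezvous waits forever for the missing ranks, which matches the hang described above. A minimal sketch of the per-process environment (values are illustrative, not taken from the repo):

import os

# Process 0 of 2; a second process with RANK=1 must also be started,
# otherwise torch.distributed's rendezvous blocks waiting for it.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
os.environ["WORLD_SIZE"] = "2"
os.environ["RANK"] = "0"

In practice, a launcher such as torchrun --nproc_per_node=<num_gpus> sets all of these variables for each process automatically.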
As discussed in #10, package up the huggan/ dir so it's pip-installable and the components within it are easier to share.
Can I use whisper parameters like beam_size and temperature while running my fine-tuned HF model?
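For what it's worth, a hedged sketch of passing equivalent generation parameters to a transformers pipeline (num_beams roughly corresponds to openai-whisper's beam_size; the model id is a placeholder):

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="your-username/your-finetuned-model")
# Generation parameters are forwarded via generate_kwargs;
# temperature takes effect when sampling is enabled.
result = pipe("sample.wav", generate_kwargs={"num_beams": 5})
print(result["text"])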
Hello there
I participated in the Whisper fine-tuning event held last December. As a result, I trained some models for Catalan, fine-tuned on Common Voice 11. Here are the models that we trained:
They score well in the WER evaluation produced by the script provided by HuggingFace.
However, when I evaluate these fine-tuned models on real audio, they perform worse than the original OpenAI models. The real audio consists of 4 recordings, 1 to 5 minutes long, transcribed by humans.
More details:
I tested quickly with the Spanish models, and the fine-tuned models also perform worse than the original OpenAI models.
From what I observed in the case of the Catalan models, the fine-tuned models seem to overfit quickly.
Additionally, I do not know if you have also seen this article by Nickolay Shmyrev: https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html
My questions are:
Let me know if you need more details. Thanks in advance!
@sanchit-gandhi I have seen that the Python script has been changed to support non-streaming mode. Could you please add instructions to the README (which parameters to use) for non-streaming mode?
Hi,
Great tutorial! I had a question regarding the data-processing step for this tutorial, where the label tokens are padded with -100 before being passed to the model. Running the debugger, I see that the model makes correct predictions but predicts tokenizer.pad_token_id (which corresponds to 50256 for Whisper), and this leads to different losses depending on what value the padding is done with.
Should the padding not correspond to 50256, rather than -100?
One of the comments says # replace padding with -100 to ignore loss correctly, but doing so actually yields a higher loss for a prediction that is correct (before fine-tuning has even begun) but has pad-token-ids at the end instead of -100, as expected in the output tensor.
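For context, -100 is the default ignore_index of PyTorch's cross-entropy loss, so positions labelled -100 contribute nothing to the loss, while positions labelled with the real pad token id (50256) are scored like any other target. A small self-contained illustration with toy values:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, 51865)  # (batch, seq_len, vocab), random toy values
labels_ignored = torch.tensor([[7, 9, -100, -100]])   # padding masked out of the loss
labels_padded = torch.tensor([[7, 9, 50256, 50256]])  # padding scored as real targets

# cross_entropy expects (batch, vocab, seq_len); ignore_index=-100 is the default.
print(F.cross_entropy(logits.transpose(1, 2), labels_ignored))
print(F.cross_entropy(logits.transpose(1, 2), labels_padded))  # generally a different, often higher, loss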
I have been trying to fine-tune Whisper base on an English-only dataset to improve the model's transcription, as the WER seemed high without fine-tuning. I have a total of 560 audio files for fine-tuning.
The picture above shows the base model's performance: there was a very large gap between the training and validation loss, so I used LoRA. But after using it, I ran into the issue you can see in the image above.
When I used the large-v2 model, I faced the same issue shown in the image above; there too I used LoRA.
I have been stuck on this. What is the issue, and how can I solve it?
Thank you in advance!
@Vaibhavs10 @sanchit-gandhi
That way it's visible in GitHub and people can fix issues/typos/etc.
Blank license/dataset tags will block pushing to the Hub via git. We should maybe remove them, or see if we can comment them out so folks can fill them in as needed.
Hi, we have 12M names and we would like to fine-tune Whisper on them. Also, I am happy to share the results with you.
The question: is it better to fine-tune Whisper using the entire spoken name, or using individual names with a recorded snippet of each name spoken?
Thanks for providing the code for fine-tuning!
Issue:
I am running into an issue when I call trainer.train(): I get a super large number of epochs, as shown below.
I have tried specifying num_train_epochs, which didn't work.
Training context:
I am fine-tuning in Colab, referencing this script: the tiny model for Chinese (zh-TW), with the code modified only where necessary. My script is here: colab link
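A hedged guess at the cause: when training streams data (an IterableDataset with no known length) and max_steps is set, the Trainer cannot derive an epoch count from the dataset size, so num_train_epochs is ignored and the displayed epoch number can be a huge sentinel value; training still stops at max_steps. A sketch with illustrative placeholder values:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny-zh-TW",  # placeholder
    max_steps=4000,  # with streamed data, this bounds training, not num_train_epochs
    per_device_train_batch_size=16,
)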
I get an error when I try to use the Lightweight GAN implementation. Here is the traceback:
Traceback (most recent call last):
File "cli.py", line 166, in <module>
main()
File "cli.py", line 163, in main
fire.Fire(train_from_folder)
File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "cli.py", line 160, in train_from_folder
run_training(model_args, data, load_from, new, num_train_steps, name, seed)
File "cli.py", line 53, in run_training
model.train(G, D, D_aug)
File "/notebooks/community-events/huggan/pytorch/lightweight_gan/lightweight_gan.py", line 1074, in train
real_output, real_output_32x32, real_aux_loss = D_aug(image_batch, calc_aux_loss = True, **aug_kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/notebooks/community-events/huggan/pytorch/lightweight_gan/lightweight_gan.py", line 285, in forward
return self.D(images, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/notebooks/community-events/huggan/pytorch/lightweight_gan/lightweight_gan.py", line 648, in forward
x = net(x)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/notebooks/community-events/huggan/pytorch/lightweight_gan/lightweight_gan.py", line 171, in forward
return sum(map(lambda fn: fn(x), self.branches))
File "/notebooks/community-events/huggan/pytorch/lightweight_gan/lightweight_gan.py", line 171, in <lambda>
return sum(map(lambda fn: fn(x), self.branches))
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 447, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
I tried multiple datasets, even the default one. Looking at the code for the problem, it seems that the data and the model are not on the same device in the Discriminator's forward method.
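A possible workaround, using the variable names from the traceback (a sketch of a local patch at that step, not a verified upstream fix): move the image batch onto the discriminator's device before the forward pass.

# In the training loop, before the discriminator call that fails:
device = next(D_aug.parameters()).device  # device the model weights live on
image_batch = image_batch.to(device)
real_output, real_output_32x32, real_aux_loss = D_aug(image_batch, calc_aux_loss=True, **aug_kwargs)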
Here is my 🤗 accelerate config:
In the README of whisper-fine-tuning-event, the link to A Complete Guide To Audio Datasets is broken. Kindly update the README accordingly.
The code reaches the eval step, prints num_example: unknown, and gets stuck.
I didn't change anything in the example code, and I tried both the Google Colab and the Python variant on a Google VM.
I tried:
1. Changing the split to the same one as train.
2. Disabling predict_with_generate and do_normalize_eval.
I am trying to prepare a dataset for Whisper fine-tuning, and I have a lot of small segment clips, most of them less than 6 seconds. I read the paper, but I didn't understand this paragraph:
“ When a final transcript segment is only partially included in the current 30- second audio chunk, we predict only its start time token for the segment when in timestamp mode, to indicate that the subsequent decoding should be performed on an audio window aligned with that time, otherwise we truncate the audio to not include the segment”
So when should I include the final segment if it is only partially contained in the current 30-second chunk, when should I truncate the chunk without it, and if I include it, how do I extract only the relevant transcription?
To make it clear:
| window | window |
|segment|-----segment---|--segment--|
assume that every window is 30 seconds, how to get the correct relevant transcription of the partially included segments?
Could anyone help?
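For what it's worth, here is a sketch of one reading of that paragraph (my interpretation, not official Whisper preprocessing code): a segment that only partially fits in the window either contributes just its start-time token (in timestamp mode) or causes the audio to be truncated just before it starts.

WINDOW = 30.0  # seconds

def window_labels(segments, window_start, timestamp_mode=True):
    """segments: list of {"start": s, "end": e, "text": t}, sorted by start time."""
    window_end = window_start + WINDOW
    labels = []
    for seg in segments:
        if seg["start"] >= window_end:
            break
        if seg["end"] <= window_end:
            labels.append(seg)  # fully inside the window: keep the full text
        elif timestamp_mode:
            # Partially included: emit only the start-time token, no text;
            # the next window should then begin at seg["start"].
            labels.append({"start": seg["start"], "end": None, "text": ""})
        # else: truncate the audio at seg["start"] and drop this segment
    return labels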
Hi,
I've recently created a dataset using speech-to-text APIs on custom documents. The dataset consists of 1,000 audio samples, with 700 designated for training and 300 for testing. In total, this equates to about 4 hours of audio, where each clip is approximately 30 seconds long.
I'm attempting to fine-tune the Whisper small model with the help of HuggingFace's script, following the tutorial they've provided Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers.
Before diving into the fine-tuning, I evaluated the WER on OpenAI's pre-trained model, which stood at WER = 23.078%.
However, as my fine-tuning progresses, I'm observing some unexpected behavior:
As shown, the validation loss and WER are both rising during fine-tuning. I'm at a bit of a loss here: why might this be happening? Any insights or recommendations would be greatly appreciated.
Thank you in advance!
@Vaibhavs10 @sanchit-gandhi
Hi there,
I'm trying to fine-tune a Whisper model, but there is a problem: the decoder positional embedding size (in the small model's case, [448, 768]) means the target sequence must not exceed 448 (the first dim).
I have two questions:
Q1) When I use a WAV file 10 seconds or longer, this problem stops training. Is it related to file size?
The problematic code is below:
# embed positions
positions = self.embed_positions(input_ids, past_key_values_length=past_key_values_length)
hidden_states = inputs_embeds + positions
The line where it stops is transformers/models/whisper/modeling_whisper.py:872.
If I change max_target_positions, I end up with a randomly initialized embedding layer instead of the existing Whisper embedding layer.
Q2) Does anyone know a solution?
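One hedged workaround is to keep target sequences within the decoder's limit by truncating the labels (or by splitting long audio into shorter clips with correspondingly shorter transcripts). A sketch, assuming a standard tokenizer/dataset setup where the hypothetical prepare_labels is applied with dataset.map:

# Whisper's decoder accepts at most model.config.max_target_positions (448) label tokens.
max_label_length = model.config.max_target_positions

def prepare_labels(batch):
    batch["labels"] = tokenizer(
        batch["sentence"], truncation=True, max_length=max_label_length
    ).input_ids
    return batch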
I tried running the following script and it deleted all the files and folders in my current working directory.
https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#python-script
There is an --overwrite_output_dir flag, but I'd expect its behaviour to be to delete a folder inside the current working directory, not all the files in that directory.
This should probably be fixed, since deleting folders and subfolders on someone's computer is dangerous; I trusted the code and let it run on my machine.
Add optional support for visualization of ControlNet results using wandb.Table in train_controlnet_flax.py. This can be used for summarizing results during training and inference.
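A minimal sketch of what this could look like (the column names and the validation_results iterable are hypothetical placeholders; it assumes wandb.init() has already been called by the script):

import wandb

table = wandb.Table(columns=["step", "prompt", "conditioning", "generated"])
for step, prompt, cond_image, gen_image in validation_results:  # assumed iterable
    table.add_data(step, prompt, wandb.Image(cond_image), wandb.Image(gen_image))
wandb.log({"controlnet_samples": table})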