Comments (9)
cc @Vaibhavs10 @sanchit-gandhi
Hi @SuperKogito,
The fine-tuned checkpoints can be used for inference in multiple ways; the simplest is perhaps to use them as part of the ASR pipeline, as shown in the snippet below:
from transformers import pipeline

whisper_asr = pipeline(
    "automatic-speech-recognition",
    model="MODEL_CHECKPOINT_NAME_HERE"
)
whisper_asr("AUDIO_FILE_NAME.mp3")
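If your audio runs longer than Whisper's 30-second context window, the pipeline can also chunk it for you. A minimal sketch, where the chunk_length_s value is an assumption you may want to tune:

from transformers import pipeline

# chunk long audio into 30 s windows before transcription
# (30 s is Whisper's receptive field; chunk_length_s here is an assumed setting)
whisper_asr = pipeline(
    "automatic-speech-recognition",
    model="MODEL_CHECKPOINT_NAME_HERE",
    chunk_length_s=30,
)
print(whisper_asr("AUDIO_FILE_NAME.mp3")["text"])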
If you want more fine-grained control over generation, you can also use the processor + the model directly. For that, you can do something like this:
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

torch.cuda.empty_cache()
device = "cuda" if torch.cuda.is_available() else "cpu"

model = WhisperForConditionalGeneration.from_pretrained("MODEL_CHECKPOINT_NAME_HERE").to(device)
processor = WhisperProcessor.from_pretrained("MODEL_CHECKPOINT_NAME_HERE")

# extract log-Mel features from a 16 kHz audio array (here: one example from a streamed dataset)
inputs = processor.feature_extractor(next(iter(common_voice_es))["audio"]["array"], return_tensors="pt", sampling_rate=16_000).input_features.to(device)

# force the language/task tokens for generation
forced_decoder_ids = processor.get_decoder_prompt_ids(language=LANGUAGE_HERE, task="transcribe")
predicted_ids = model.generate(inputs, max_length=448, forced_decoder_ids=forced_decoder_ids)
processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)[0]
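As an example of that finer control, the same generate call accepts the standard generation parameters; a sketch continuing the snippet above (the num_beams value is an assumption):

# decode with beam search instead of greedy decoding
predicted_ids = model.generate(
    inputs,
    max_length=448,
    num_beams=5,
    forced_decoder_ids=forced_decoder_ids,
)
transcription = processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)[0]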
I created a notebook earlier as part of the event to showcase these inference methods; you can find it here: https://github.com/Vaibhavs10/notebooks/blob/main/Infer_Whisper_🤗transformers_edition.ipynb
To answer your last question about converting Transformers checkpoints to the OpenAI Whisper format: we don't have an officially supported utility for it, but there are some community scripts that can help you do that: https://github.com/bayartsogt-ya/whisper-multiple-hf-datasets
# install multiple_datasets
!pip install git+https://github.com/bayartsogt-ya/whisper-multiple-hf-datasets.git

from multiple_datasets.hub_default_utils import convert_hf_whisper

model_name_or_path = 'openai/whisper-tiny'
whisper_checkpoint_path = './whisper-tiny-checkpoint.pt'

# convert the Hugging Face checkpoint to an OpenAI-format .pt file
convert_hf_whisper(model_name_or_path, whisper_checkpoint_path)

# now transcribe
import whisper

model = whisper.load_model(whisper_checkpoint_path)
result = model.transcribe('loooong_audio_path.wav')  # probably longer than 10 min? hour?
print(result['text'])
Let me know if you have any other questions, happy transcribing! 🤗
Hey @SuperKogito,
I think the problem is in the way you are defining the path to your model. You should be able to infer from the checkpoint directly via the below code:
from transformers import pipeline

whisper_asr = pipeline(
    "automatic-speech-recognition",
    model="whisper-finetuned/checkpoint-40000"
)
whisper_asr("AUDIO_FILE_NAME.mp3")
Just make sure to pass the path to the specific checkpoint so that the pipeline picks up all the required files. This way you also won't need to convert your checkpoint to the OpenAI Whisper format.
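If the pipeline still can't find the tokenizer or feature-extractor files (they are typically written to the parent output_dir, not to the checkpoint folders themselves), one workaround, as a sketch assuming that directory layout, is to load the processor separately and hand its parts to the pipeline:

from transformers import pipeline, WhisperProcessor

# tokenizer/feature-extractor files live in the parent output dir (assumed layout)
processor = WhisperProcessor.from_pretrained("whisper-finetuned")
whisper_asr = pipeline(
    "automatic-speech-recognition",
    model="whisper-finetuned/checkpoint-40000",  # holds config.json + pytorch_model.bin
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
whisper_asr("AUDIO_FILE_NAME.mp3")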
Do let me know if it doesn't work.
Hey @SuperKogito!
When we use from_pretrained with a model name or path, we load the weights from this path into our model. So we need to make sure that our model path contains:
- Model weights (pytorch_model.bin)
- Config (config.json)
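A quick way to check whether a given path satisfies this, as a sketch (the checkpoint path below is hypothetical, adjust it to your own):

import os

# hypothetical checkpoint directory
ckpt = "whisper-small-finetuned-de-2023-01-03/checkpoint-40000"
for fname in ("config.json", "pytorch_model.bin"):
    path = os.path.join(ckpt, fname)
    print(fname, "found" if os.path.isfile(path) else "MISSING")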
We'd expect to see the final model weights saved under your output_dir (whisper-small-finetuned-de-2023-01-03) at the end of training. We can see that the weights are saved every save_steps (4000 steps) during training, but the final weights are missing from your output_dir (whisper-small-finetuned-de-2023-01-03).
This could be because trainer.save_model() is only under the control flow for when we resume training from a checkpoint:
# start training
print("start training")
if checkpoint is None:
    train_result = trainer.train()
else:
    print("-> Training from checkpoint")
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    trainer.save_model()
This means that we only save the final model if we're resuming training from a checkpoint.
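A minimal fix, as a sketch, is to de-indent trainer.save_model() out of the else branch so it runs in both cases:

# start training
print("start training")
if checkpoint is None:
    train_result = trainer.train()
else:
    print("-> Training from checkpoint")
    train_result = trainer.train(resume_from_checkpoint=checkpoint)

# always save the final weights (+ config.json) under output_dir
trainer.save_model()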
In terms of the other files in our directory, there is one file related to the feature extractor:
├── preprocessor_config.json
And several files related to the tokenizer (no need for tokenizer.pt):
├── added_tokens.json
├── merges.txt
├── normalizer.json
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.json
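Once the final weights and config sit alongside those files, loading should work end-to-end; a quick sanity check, as a sketch using the output_dir above:

from transformers import WhisperForConditionalGeneration, WhisperProcessor

# both calls read from the same directory: weights + config for the model,
# preprocessor/tokenizer files for the processor
model = WhisperForConditionalGeneration.from_pretrained("whisper-small-finetuned-de-2023-01-03")
processor = WhisperProcessor.from_pretrained("whisper-small-finetuned-de-2023-01-03")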
Amazing - happy to hear that @SuperKogito! Enjoy using your fine-tuned model 🤗
Thank you for your response!
Unfortunately, none of these worked with my resulting checkpoints :(
The first two snippets result in the following error:
whisper-small-finetuned-de-2023-01-03 does not appear to have a file named config.json. Checkout 'https://huggingface.co/whisper-small-finetuned-de-2023-01-03/None' for available files.
As for the second, it causes the following:
Whisper(...
...)' is the correct path to a directory containing a config.json file
My checkpoint structure looks as follows:
whisper-small-finetuned-de-2023-01-03/
├── added_tokens.json
├── all_results.json
├── checkpoint-12000
│   ├── config.json
│   ├── optimizer.pt
│   ├── preprocessor_config.json
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
├── checkpoint-16000
│   ├── config.json
│   ├── optimizer.pt
│   ├── preprocessor_config.json
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
├── checkpoint-20000
│   ├── config.json
│   ├── optimizer.pt
│   ├── preprocessor_config.json
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
├── checkpoint-24000
│   ├── config.json
│   ├── optimizer.pt
│   ├── preprocessor_config.json
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
├── checkpoint-28000
│   ├── config.json
│   ├── optimizer.pt
│   ├── preprocessor_config.json
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
├── checkpoint-32000
│   ├── config.json
│   ├── optimizer.pt
│   ├── preprocessor_config.json
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
├── checkpoint-36000
│   ├── config.json
│   ├── optimizer.pt
│   ├── preprocessor_config.json
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
├── checkpoint-4000
│   ├── config.json
│   ├── optimizer.pt
│   ├── preprocessor_config.json
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
├── checkpoint-40000
│   ├── config.json
│   ├── optimizer.pt
│   ├── preprocessor_config.json
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
├── checkpoint-8000
│   ├── config.json
│   ├── optimizer.pt
│   ├── preprocessor_config.json
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
├── eval_results.json
├── merges.txt
├── normalizer.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── train_results.json
└── vocab.json
I am not sure why it is not recognizing any of my config.json files. I am also missing a tokenizer.pt; is that normal? Am I doing something wrong when training?
My fine-tuning code is the following:
import os

# redirect the Hugging Face cache
os.environ["HF_HOME"] = "/trainingdata/chris/.cache/huggingface"
os.environ["TRANSFORMERS_CACHE"] = "/trainingdata/chris/.cache/huggingface/hub"

import torch

# specify the gpu to use
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
torch.cuda.device_count()  # prints 1

import evaluate
from transformers import WhisperTokenizer
from transformers import WhisperProcessor
from transformers import WhisperFeatureExtractor
from transformers import WhisperForConditionalGeneration
from datasets import Dataset, load_dataset, DatasetDict, Audio, Features, Value

# prepare feature extractor and tokenizer
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="de", task="transcribe")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="de", task="transcribe")


def verify_tokenizer(common_voice):
    input_str = common_voice["train"][0]["sentence"]
    labels = tokenizer(input_str).input_ids
    decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
    decoded_str = tokenizer.decode(labels, skip_special_tokens=True)
    print(f"Input: {input_str}")
    print(f"Decoded w/ special: {decoded_with_special}")
    print(f"Decoded w/out special: {decoded_str}")
    print(f"Are equal: {input_str == decoded_str}")


def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]
    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    print("WER: ", wer)
    return {"wer": wer}


from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch


# read data
features = Features(
    {
        "audio": Audio(sampling_rate=16000),
        "sentence": Value("string")
    }
)
common_voice = load_dataset(
    'csv', data_files={
        'train': '100k_parsed_eml_train_data.csv',
        'test': '30k_parsed_eml_test_data.csv'
    }
)
print("Loaded data: ", common_voice)

# read audio
common_voice["train"] = common_voice["train"].cast_column("audio", Audio(sampling_rate=16000))
common_voice["test"] = common_voice["test"].cast_column("audio", Audio(sampling_rate=16000))
print("Formatted train data: ", common_voice["train"][0])
print("Formatted test data: ", common_voice["test"][0])

# verify tokenizer
verify_tokenizer(common_voice)

# extract features
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=8)
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

# config metrics
metric = evaluate.load("wer")

# import model
print("load model")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
#model.config.forced_decoder_ids = None
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")
model.config.suppress_tokens = []

# define training config
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

output_dir = "./whisper-small-finetuned-de-2023-01-03"
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,  # change to a repo name of your choice
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=5000,
    max_steps=40000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=4,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=4000,
    eval_steps=4000,
    logging_steps=250,
    logging_dir="logs",
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)

# config trainer
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

# save processor
processor.save_pretrained(training_args.output_dir)

# load checkpoints
from transformers.trainer_utils import get_last_checkpoint

last_checkpoint = get_last_checkpoint(training_args.output_dir)
print("checkpoints: ", last_checkpoint)
checkpoint = last_checkpoint

# start training
print("start training")
if checkpoint is None:
    train_result = trainer.train()
else:
    print("-> Training from checkpoint")
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    trainer.save_model()

# evaluate
small_train_dataset = common_voice["train"]
small_eval_dataset = common_voice["test"]

# compute train results
metrics = train_result.metrics
max_train_samples = len(small_train_dataset)
metrics["train_samples"] = min(max_train_samples, len(small_train_dataset))

# save train results
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)

# compute evaluation results
metrics = trainer.evaluate()
max_val_samples = len(small_eval_dataset)
metrics["eval_samples"] = min(max_val_samples, len(small_eval_dataset))

# save evaluation results
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
I still cannot test the checkpoints directly, but I figured out the conversion issue and rewrote the code to be more user-friendly:
"""
The following code is based on:
- https://github.com/bayartsogt-ya/whisper-multiple-hf-datasets
"""
import re
import sys
import torch
import argparse
from transformers import WhisperForConditionalGeneration
whisper_mappings = {
"layers": "blocks",
"fc1": "mlp.0",
"fc2": "mlp.2",
"final_layer_norm": "mlp_ln",
".self_attn.q_proj": ".attn.query",
".self_attn.k_proj": ".attn.key",
".self_attn.v_proj": ".attn.value",
".self_attn_layer_norm": ".attn_ln",
".self_attn.out_proj": ".attn.out",
".encoder_attn.q_proj": ".cross_attn.query",
".encoder_attn.k_proj": ".cross_attn.key",
".encoder_attn.v_proj": ".cross_attn.value",
".encoder_attn_layer_norm": ".cross_attn_ln",
".encoder_attn.out_proj": ".cross_attn.out",
"decoder.layer_norm.": "decoder.ln.",
"encoder.layer_norm.": "encoder.ln_post.",
"embed_tokens": "token_embedding",
"encoder.embed_positions.weight": "encoder.positional_embedding",
"decoder.embed_positions.weight": "decoder.positional_embedding",
"layer_norm": "ln_post",
}
def format_key(key, verbose=False):
# format replacements
rep_sorted = sorted(whisper_mappings, key=len, reverse=True)
rep_escaped = map(re.escape, rep_sorted)
# Create a big OR regex that matches any of the substrings to replace
pattern = re.compile("|".join(rep_escaped))
# For each match, look up the new string in the replacements, being the key the normalized old string
new_key = pattern.sub(lambda m: whisper_mappings[m.group(0)], key)
# debug
if verbose:
print(f"{key} -> {new_key}")
return new_key
def convert_hf_checkpoints_to_whisper(checkpoints_path, generated_whisper_model_path, verbose):
try:
# load checkpoints
transformer_model = WhisperForConditionalGeneration.from_pretrained(checkpoints_path)
config = transformer_model.config
# build dims
dims = {
"n_mels": config.num_mel_bins,
"n_vocab": config.vocab_size,
"n_audio_ctx": config.max_source_positions,
"n_audio_state": config.d_model,
"n_audio_head": config.encoder_attention_heads,
"n_audio_layer": config.encoder_layers,
"n_text_ctx": config.max_target_positions,
"n_text_state": config.d_model,
"n_text_head": config.decoder_attention_heads,
"n_text_layer": config.decoder_layers,
}
# convert
hf_state_dict = transformer_model.model.state_dict()
whisper_state_dict = { format_key(hf_key, verbose): hf_value for hf_key, hf_value in hf_state_dict.items() }
# save model
torch.save({"dims": dims, "model_state_dict": whisper_state_dict}, generated_whisper_model_path)
print("-> whisper-like model is exported under ", generated_whisper_model_path)
except Exception as e:
print(str(e))
print("ConversionError: could not convert checkpoints.")
def main():
# init parser
parser = argparse.ArgumentParser()
parser.add_argument(
"--hf_checkpoints_path",
type=str,
default=None,
help="csv file with data to use for testing.",
)
parser.add_argument(
"--exported_model_path",
type=str,
default=None,
help="path to whisper model.",
)
parser.add_argument(
"--verbose",
default=False,
help="logs verbosity (if True).",
action="store_true"
)
# check args
args = parser.parse_args()
if not args.hf_checkpoints_path:
print('You need to specify the checkpoints path via "the --hf_checkpoints_path flag."')
sys.exit(1)
if not args.exported_model_path:
print('You need to specify a path for the generated model via "the --exported_model_path flag."')
sys.exit(1)
# export model
convert_hf_checkpoints_to_whisper(args.hf_checkpoints_path, args.exported_model_path, args.verbose)
if __name__ == "__main__":
main()
This can be used as follows:
python export.py --hf_checkpoints_path whisper-finetuned/checkpoint-40000 --exported_model_path finetuned_model.pt
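The exported .pt file can then be loaded with the reference openai-whisper package; a sketch, assuming pip install openai-whisper and a German fine-tune as discussed above:

import whisper

# load_model accepts a local path to a converted checkpoint
model = whisper.load_model("finetuned_model.pt")
result = model.transcribe("audio.wav", language="de")  # the language hint is an assumption
print(result["text"])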
@Vaibhavs10 and @sanchit-gandhi, thank you both for your time and help ❤️
@sanchit-gandhi was right about my code. Fixing that made my checkpoints load correctly :))
Hi @sanchit-gandhi @SuperKogito, can you please share how you fixed the issue? It's still confusing to me.