
vistaar's Introduction

Vistaar: Diverse Benchmarks and Training sets for Indian Language ASR

Vistaar is a set of 59 benchmarks and training datasets across various language and domain combinations such as news, education, literature, and tourism. The training datasets are available for 12 Indian languages, amounting to over 10,700 hours of labelled audio data. We also train IndicWhisper models by fine-tuning the Whisper models on the Vistaar training dataset and observe that they achieve the lowest WER on 39 of the 59 Vistaar benchmarks.

Benchmarks

Vistaar consists of benchmarks from several public datasets (Kathbath, FLEURS, CommonVoice, IndicTTS, MUCS, and GramVaani) across 12 languages. We evaluate IndicWhisper on these benchmarks along with 3 publicly available ASR systems and 2 commercially available systems. The results are shown below.

| Datasets | bn | gu | hi | kn | ml | mr | or | pa | sa | ta | te | ur | avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kathbath | 16.6 | 17.8 | 10.3 | 19.3 | 34.8 | 19.9 | 24.7 | 16.9 | 45.6 | 24.2 | 25.0 | 11.9 | 22.3 |
| Kathbath Hard | 19.4 | 20.6 | 12.0 | 22.2 | 38.4 | 22.1 | 29.1 | 19.7 | 50.5 | 27.5 | 27.8 | 14.7 | 25.3 |
| CommonVoice | 24.7 | - | 11.4 | - | 44.5 | 22.8 | 35.2 | 22.4 | - | 29.2 | - | 31.7 | 27.8 |
| FLEURS | 20.9 | 23.5 | 15.0 | 18.6 | 22.6 | 20.5 | 32.9 | 23.1 | - | 25.2 | 25.4 | 19.2 | 22.5 |
| IndicTTS | 18.8 | 19.1 | 7.6 | 13.2 | 21.4 | 11.4 | 15.0 | - | - | 17.2 | 33.8 | - | 17.5 |
| MUCS | - | 33.2 | 12.0 | - | - | 12.8 | 27.5 | - | - | 28.3 | 32.1 | - | 24.3 |
| GramVaani | - | - | 26.8 | - | - | - | - | - | - | - | - | - | 26.8 |
| Average | 20.1 | 22.8 | 13.6 | 18.3 | 32.3 | 18.2 | 27.4 | 20.5 | 48.0 | 25.3 | 28.8 | 19.4 | 24.6 |

Word error rates (%) of IndicWhisper on the Vistaar benchmark. Values in bold indicate benchmarks where IndicWhisper has the lowest WER.

| Model | Kathbath | Kathbath-Hard | FLEURS | CommonVoice | IndicTTS | MUCS | GramVaani | Average |
|---|---|---|---|---|---|---|---|---|
| Google STT | 14.3 | 16.7 | 19.4 | 20.8 | 18.3 | 17.8 | 59.9 | 23.9 |
| IndicWav2vec | 12.2 | 16.2 | 18.3 | 20.2 | 15.0 | 22.9 | 42.1 | 21.0 |
| Azure STT | 13.6 | 15.1 | 24.3 | 14.6 | 15.2 | 15.1 | 42.3 | 20.0 |
| Nvidia-medium | 14.0 | 15.6 | 19.4 | 20.4 | 12.3 | 12.4 | 41.3 | 19.4 |
| Nvidia-large | 12.7 | 14.2 | 15.7 | 21.2 | 12.2 | 11.8 | 42.6 | 18.6 |
| IndicWhisper | 10.3 | 12.0 | 11.4 | 15.0 | 7.6 | 12.0 | 26.8 | 13.6 |

Comparison of ASR systems on the Hindi subset of the Vistaar benchmark.


Resources

Download Training Datasets and Benchmarks

| Datasets | Training Datasets | Benchmarks |
|---|---|---|
| Kathbath | link | link |
| Kathbath Hard | link | link |
| CommonVoice | link | link |
| FLEURS | link | link |
| IndicTTS | link | link |
| MUCS | link | link |
| GramVaani | link | link |

Download Models

| Language | Model Checkpoint |
|---|---|
| Bengali | hf |
| Gujarati | hf |
| Hindi | hf |
| Kannada | hf |
| Malayalam | hf |
| Marathi | hf |
| Odia | hf |
| Punjabi | hf |
| Sanskrit | hf |
| Tamil | hf |
| Telugu | hf |
| Urdu | hf |

Tutorials

Setting up your environment

  • Setting up a Python virtual environment
    python -m venv <env_name>
    source <env_name>/bin/activate
    
  • Installing/Updating Libraries
    sudo apt install ffmpeg
    pip install -r requirements.txt
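
  • Optionally, verifying the installation (an illustrative one-line check; it assumes torch and transformers are pulled in by requirements.txt)
    python -c "import shutil, torch, transformers; print('transformers', transformers.__version__); print('CUDA available:', torch.cuda.is_available()); print('ffmpeg on PATH:', shutil.which('ffmpeg') is not None)"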
    

Evaluating ASR models

  • Manifest creation
    • For each dataset, download and extract the benchmark data into a directory. The data should be extracted so that each folder inside contains the data for a particular language, i.e. each language-specific folder should contain train, valid and test folders, each holding the audio files along with a transcript.txt
    • Sample structure of folder tree:
      kathbath
          ├── bengali
          │   ├── audio
          │   └── transcript.txt
          │
          ├── gujarati
          │   ├── audio
          │   └── transcript.txt
          │
          └── ...
    
    • The manifest needs to be a JSON Lines file, where each line is a dictionary with the audio file path, the duration of the audio (in seconds), and the text transcript. A minimal script for generating such a manifest is sketched at the end of this section.
     {"audio_filepath": <path to audio file 1>, "duration": <duration of audio file 1>, "text": <transcript of audio file 1>}
     {"audio_filepath": <path to audio file 2>, "duration": <duration of audio file 2>, "text": <transcript of audio file 2>}
     .
     .
     .
    
  • Running evaluation
    python evaluation.py --model_path=<model path> \
    --manifest_path=<manifest path in vistaar> \
    --manifest_name=<dataset name in vistaar> \
    --device=<gpu to use> \
    --batch_size=<batch size> \
    --language=<2 letter language code>
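
A minimal sketch of how such a manifest could be generated is shown below. It is illustrative and not part of this repository: it assumes the folder layout shown above, that each line of transcript.txt has the form <audio filename><TAB><transcript>, and that the soundfile package is installed; adapt it to the actual layout of each dataset.

    # build_manifest.py - illustrative sketch (assumed layout, not part of the repo)
    import json
    import os

    import soundfile as sf  # pip install soundfile

    def build_manifest(lang_dir, out_path):
        audio_dir = os.path.join(lang_dir, "audio")
        transcript_path = os.path.join(lang_dir, "transcript.txt")
        with open(transcript_path, encoding="utf-8") as f, \
             open(out_path, "w", encoding="utf-8") as out:
            for line in f:
                # assumed format: "<audio filename>\t<transcript>"
                filename, text = line.rstrip("\n").split("\t", 1)
                filepath = os.path.join(audio_dir, filename)
                entry = {
                    "audio_filepath": filepath,
                    "duration": sf.info(filepath).duration,  # duration in seconds
                    "text": text,
                }
                out.write(json.dumps(entry, ensure_ascii=False) + "\n")

    build_manifest("kathbath/bengali", "kathbath_bengali_manifest.json")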
    

Inference using IndicWhisper

  • Sample structure of manifest file
{"audio_filepath":<path to audio file 1>}
{"audio_filepath":<path to audio file 2>}
.
.
.
  • Running batch inference
deepspeed --include localhost:<gpus to include> \
transcribe.py <manifest path> \
<model path> \
<current language> \
<batch size> \
<output path>
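
For example, a filled-in command might look like this (the manifest, model and output paths below are illustrative placeholders; "hindi" is the language argument as used elsewhere in this repository):

deepspeed --include localhost:0 \
transcribe.py manifests/test_manifest.json \
hindi_models/whisper-medium-hi_alldata_multigpu \
hindi \
16 \
outputs/transcriptions.json
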
  • Running inference for a single audio file
from transformers import pipeline

model_path = "hindi_models/whisper-medium-hi_alldata_multigpu"
device = "cuda"
lang_code = "hi"

whisper_asr = pipeline(
    "automatic-speech-recognition", model=model_path, device=device,
)

# Special case to handle Odia ('or'), since Odia is not in Whisper's original set of supported languages
if lang_code == 'or':
    whisper_asr.model.config.forced_decoder_ids = (
        whisper_asr.tokenizer.get_decoder_prompt_ids(
            language=None, task="transcribe"
        )
    )
else:
    whisper_asr.model.config.forced_decoder_ids = (
        whisper_asr.tokenizer.get_decoder_prompt_ids(
            language=lang_code, task="transcribe"
        )
    )

result = whisper_asr("audio.mp3")
print(result["text"])

Training on Vistaar Train Datasets

  • Manifest creation (use the same JSON Lines manifest format described under Evaluating ASR models above, with separate train and validation manifests)
  • Running training
deepspeed --include localhost:<gpus to include> training.py \
--deepspeed=<path to deepspeed config file> \
--model_name_or_path=<model path> \
--dataset_name=<dataset language directory path> \
--language=<language> \
--train_split_name=<train manifest name> \
--eval_split_name=<validation manifest name> \
--max_steps="5000" \
--output_dir=<output directory path> \
--cache_dir=<cache directory for downloaded models> \
--per_device_train_batch_size="64" \
--per_device_eval_batch_size="32" \
--gradient_accumulation_steps="1" \
--logging_steps="10" \
--learning_rate="1e-5" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--eval_steps="500" \
--save_strategy="steps" \
--save_steps="500" \
--generation_max_length="225" \
--length_column_name="input_length" \
--max_duration_in_seconds="30" \
--text_column_name="sentence" \
--freeze_feature_encoder="False" \
--report_to="tensorboard" \
--metric_for_best_model="wer" \
--greater_is_better="False" \
--load_best_model_at_end \
--gradient_checkpointing \
--fp16 \
--do_train \
--do_eval \
--predict_with_generate \
--do_normalize_eval="False" \
--streaming="True" \
--use_auth_token="True" \
--push_to_hub="True"
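
The --deepspeed flag expects a DeepSpeed JSON config file. Below is a minimal ZeRO stage 2 sketch in the style of the HuggingFace Trainer integration; the "auto" values are filled in from the command-line arguments at runtime, and the choice of stage and flags here is an assumption to be tuned to your hardware.

{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}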

License

Vistaar is MIT-licensed. The license applies to all the fine-tuned language models.

Contributors

  • Kaushal Bhogale, (AI4Bharat)
  • Sai Narayan Sundaresan, (IITKGP, AI4Bharat)
  • Abhigyan Raman, (AI4Bharat)
  • Tahir Javed, (IITM, AI4Bharat)
  • Mitesh Khapra, (IITM, AI4Bharat, RBCDSAI)
  • Pratyush Kumar, (Microsoft, AI4Bharat)

Contact

Acknowledgements

We would like to thank the EkStep Foundation for their generous grant, which helped in setting up the Centre for AI4Bharat at IIT Madras to support our students, research staff, data and computational requirements. We would like to thank the Ministry of Electronics and Information Technology (NLTM) for its grant to support the creation of datasets and models for Indian languages under its ambitious Bhashini project. We would also like to thank the Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training our models. Lastly, we would like to thank Microsoft for its grant to create datasets, tools and resources for Indian languages.


vistaar's Issues

Incorrect Hindi Transcription

Hi, I am trying to transcribe YouTube audio in Hindi using the IndicWhisper Hindi model. However, I am getting incorrect Hindi transcriptions.
For example:
YouTube transcription: यह अभ्यास तुम्हें उसी क्षेत्र में कम कर रहे अन्य लोगों से बहुत आगे लाकर खड़ा कर देगा सुबह के 5 घंटे
IndicWhisper transcription: हर दस किया विषाषा क्या एक है वह मैं विषा दिल ए आने के लिए एक अच्छा विषा आपकी रक्षा

audio.mp4

Can anyone help me with this?

Regarding inference

Hello,
How can I get the timestamps?

I even used the transformers pipeline:

from transformers import pipeline
import torch

model = pipeline(
    task="automatic-speech-recognition",
    model="vistar/hindi", 
    device=torch.device("cuda:0"),
    chunk_length_s=30, # if not specified, only generates as much as `max_new_tokens`
    generate_kwargs={"num_beams": 5} # same setting as `openai-whisper` default
)

result = model("audio.mp3", return_timestamps=True)

print(result["text"])
print(result["chunks"])

The transcription is good, but it is not returning timestamps.
Any suggestions here?

Thank you

Using IndicWhisper for Kannada transcription: documentation

First of all, thank you for the IndicWhisper model! I have been waiting for a while for a Whisper model finetuned to Indian languages; the openai/Whisper models are good but have not been satisfactory for my needs.

I think it would be very useful if you could provide step-by-step instructions in the documentation on how the community can load and run the IndicWhisper model(s) in a Python environment. I am specifically interested in Kannada, but of course I imagine the steps will be the same for all languages.

I am a techie but am not particularly skilled in AI/ML. I have a fair understanding of using ML models, and I routinely work with the openai/Whisper models, integrating them into my Python app. If the IndicWhisper models are fine-tuned versions of Whisper, I am unable to follow what I have to do differently (and why) compared to what I do with the English openai/Whisper model. At this point, after reading the 'Inference' section, I am not able to understand where to begin to use IndicWhisper.

While you update the documentation, any sample code or pointers to demos would be gratefully appreciated!

Batch Inference is not working properly

Hi @GokulNC @rahular @divkakwani @gowtham1997 @1392001sai @kaushal-py

I am using this CLI command template for batch inference:

deepspeed --include localhost:<gpus to include> \
transcribe.py <manifest path> \
<model path> \
<current language> \
<batch size> \
<output path>

I am using this CLI command for batch inference on my local machine:

deepspeed --include localhost:0 \ 
transcribe.py indic_whisper_vistaar/vistaar/manifest.json \
indic_whisper_vistaar/vistaar/models/hindi_models/whisper-medium-hi_alldata_multigpu \ 
hindi \
16 \ 
indic_whisper_vistaar/vistaar/output.txt

where the parameters are:

"localhost:0" : the GPU device id
"indic_whisper_vistaar/vistaar/manifest.json" : the path to the manifest file
"indic_whisper_vistaar/vistaar/models/hindi_models/whisper-medium-hi_alldata_multigpu" : the path to the model
"hindi" : the language
16 : the batch size
"indic_whisper_vistaar/vistaar/output.txt" : the path to the output file where transcriptions are to be saved

Issue - The transcript obtained from batch inference is incomplete for each audio file (I am getting the transcript of only the first few seconds of each audio file).

Please respond ASAP
