whisper-timestamped's Introduction

whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence.

Description

Whisper is a set of multilingual, robust speech recognition models trained by OpenAI that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot natively predict word-level timestamps. This repository provides an implementation that predicts word timestamps and gives a more accurate estimation of speech segments when transcribing with Whisper models. In addition, a confidence score is assigned to each word and each segment.

The approach is based on Dynamic Time Warping (DTW) applied to cross-attention weights, as demonstrated by this notebook by Jong Wook Kim. This implementation adds several improvements over that notebook:

  • The start/end estimation is more accurate.
  • Confidence scores are assigned to each word.
  • When possible (i.e., when decoding without beam search), no additional inference steps are required to predict word timestamps: word alignment is done on the fly after each speech segment is decoded.
  • Special care has been taken regarding memory usage: whisper-timestamped is able to process long files with little additional memory compared to the regular use of the Whisper model.
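
To make the idea concrete, here is a minimal, illustrative sketch of DTW on attention weights (the toy attention matrix below is random; the real implementation additionally applies filtering and scaling, and groups sub-word tokens into words):

import numpy as np
from dtw import dtw  # dtw-python, the DTW package acknowledged at the end of this README

# Toy cross-attention weights: one row per predicted (sub)word token,
# one column per audio frame.
attention = np.random.rand(12, 100)

# dtw() accepts a pre-computed cost matrix when called with a single argument;
# negating the weights makes frames with high attention cheap to traverse,
# so the optimal monotonic path follows the attention "ridge".
alignment = dtw(-attention)

# alignment.index1[k] (token index) is aligned with alignment.index2[k]
# (frame index); a word's start/end times are then read off the first/last
# frames of its tokens.
print(list(zip(alignment.index1, alignment.index2)))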

whisper-timestamped is an extension of the openai-whisper Python package and is meant to be compatible with any version of openai-whisper. It provides more efficient and accurate word timestamps, along with these additional features:

  • Voice Activity Detection (VAD) can be run before applying the Whisper model, to avoid hallucinations due to errors in the training data (for instance, predicting "Thanks for watching!" on pure silence). Several VAD methods are available: silero (default), silero:v3.1, auditok.
  • When the language is not specified, the language probabilities are provided among the outputs.

Notes on other approaches

An alternative relevant approach to recovering word-level timestamps involves using wav2vec models that predict characters, as successfully implemented in whisperX. However, these approaches have several drawbacks that are not present in approaches based on cross-attention weights, such as whisper-timestamped. These drawbacks include:

  • The need to find one wav2vec model per language to support, which does not scale well with the multi-lingual capabilities of Whisper.
  • The need to handle (at least) one additional neural network (wav2vec model), which consumes memory.
  • The need to normalize characters in Whisper transcription to match the character set of the wav2vec model. This involves awkward language-dependent conversions, such as converting numbers to words ("2" -> "two"), symbols to words ("%" -> "percent", "€" -> "euro(s)")...
  • The lack of robustness around speech disfluencies (fillers, hesitations, repeated words...) that are usually removed by Whisper.

An alternative approach that does not require an additional model is to look at the probabilities of timestamp tokens estimated by the Whisper model after each (sub)word token is predicted. This was implemented, for instance, in whisper.cpp and stable-ts. However, this approach lacks robustness, because Whisper models have not been trained to output meaningful timestamps after each word. Whisper models tend to predict timestamps only after a certain number of words have been predicted (typically at the end of a sentence), and the probability distribution of timestamps outside this condition may be inaccurate. In practice, these methods can produce results that are completely out of sync during some periods (we observed this especially when there is jingle music). Also, the timestamp precision of Whisper models tends to be rounded to 1 second (as in many video subtitles), which is too coarse for words, and reaching better accuracy is tricky.

Installation

First installation

Requirements:

  • python3 (version 3.7 or higher; version 3.9 or higher is recommended)
  • ffmpeg (see instructions for installation on the whisper repository)

You can install whisper-timestamped either by using pip:

pip3 install whisper-timestamped

or by cloning this repository and running the installation:

git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped/
python3 setup.py install

Additional packages that might be needed

If you want to plot alignment between audio timestamps and words (as in this section), you also need matplotlib:

pip3 install matplotlib

If you want to use the VAD option (Voice Activity Detection before running the Whisper model), you also need torchaudio and onnxruntime:

pip3 install onnxruntime torchaudio

If you want to use finetuned Whisper models from the Hugging Face Hub, you also need transformers:

pip3 install transformers

Docker

A Docker image of about 9 GB can be built using:

git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped/
docker build -t whisper_timestamped:latest .

Light installation for CPU

If you don't have a GPU (or don't want to use it), then you don't need to install the CUDA dependencies. You should then just install a light version of torch before installing whisper-timestamped, for instance as follows:

pip3 install \
     torch==1.13.1+cpu \
     torchaudio==0.13.1+cpu \
     -f https://download.pytorch.org/whl/torch_stable.html

A specific Docker image of about 3.5 GB can also be built using:

git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped/
docker build -t whisper_timestamped_cpu:latest -f Dockerfile.cpu .

Upgrade to the latest version

When using pip, the library can be updated to the latest version using:

pip3 install --upgrade --no-deps --force-reinstall git+https://github.com/linto-ai/whisper-timestamped

A specific version of openai-whisper can be used by running, for example:

pip3 install openai-whisper==20230124

Usage

Python

In Python, you can use the function whisper_timestamped.transcribe(), which is similar to the function whisper.transcribe():

import whisper_timestamped
help(whisper_timestamped.transcribe)

The main difference with whisper.transcribe() is that the output will include a key "words" for all segments, with the start and end position of each word. Note that words include attached punctuation. See the example below.

In addition, the default decoding options are different, to favour efficient decoding (greedy decoding instead of beam search, and no temperature sampling fallback). To get the same defaults as in whisper, use beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0).

There are also additional options related to word alignment.

In general, if you import whisper_timestamped instead of whisper in your Python script and use transcribe(model, ...) instead of model.transcribe(...), it should do the job:

import whisper_timestamped as whisper

audio = whisper.load_audio("AUDIO.wav")

model = whisper.load_model("tiny", device="cpu")

result = whisper.transcribe(model, audio, language="fr")

import json
print(json.dumps(result, indent = 2, ensure_ascii = False))

Note that you can use a finetuned Whisper model from HuggingFace or a local folder by using the load_model method of whisper_timestamped. For instance, if you want to use whisper-large-v2-nob, you can simply do the following:

import whisper_timestamped as whisper

model = whisper.load_model("NbAiLab/whisper-large-v2-nob", device="cpu")

# ...

Command line

You can also use whisper_timestamped on the command line, similarly to whisper. See help with:

whisper_timestamped --help

The main differences with whisper CLI are:

  • Output files:
    • The output JSON contains word timestamps and confidence scores. See example below.
    • There is an additional CSV output format.
    • For SRT, VTT, TSV formats, there will be additional files saved with word timestamps.
  • Some default options are different:
    • By default, no output folder is set: use --output_dir . for the Whisper default.
    • By default, verbose is disabled: use --verbose True for the Whisper default.
    • By default, beam search decoding and temperature sampling fallback are disabled, to favour efficient decoding. To get the same behaviour as the Whisper default, you can use --accurate (which is an alias for --beam_size 5 --temperature_increment_on_fallback 0.2 --best_of 5).
  • There are some additional specific options:
    • --compute_confidence to enable/disable the computation of confidence scores for each word.
    • --punctuations_with_words to decide whether punctuation marks should be included or not with preceding words.

An example command to process several files using the tiny model and output the results in the current folder, as would be done by default with whisper, is as follows:

whisper_timestamped audio1.flac audio2.mp3 audio3.wav --model tiny --output_dir .

Note that you can use a fine-tuned Whisper model from HuggingFace or a local folder. For instance, if you want to use the whisper-large-v2-nob model, you can simply do the following:

whisper_timestamped --model NbAiLab/whisper-large-v2-nob <...>

Plot of word alignment

Note that you can use the plot_word_alignment option of the whisper_timestamped.transcribe() Python function or the --plot option of the whisper_timestamped CLI to see the word alignment for each segment.

[Example alignment plot]

  • The upper plot represents the transformation of cross-attention weights used for alignment with Dynamic Time Warping. The abscissa represents time, and the ordinate represents the predicted tokens, with special timestamp tokens at the beginning and end, and (sub)words and punctuation in the middle.
  • The lower plot is an MFCC representation of the input signal (features used by Whisper, based on Mel-frequency cepstrum).
  • The vertical dotted red lines show where the word boundaries are found (with punctuation marks "glued" to the previous word).

Example output

The output of the whisper_timestamped.transcribe() function is a Python dictionary, which can be viewed in JSON format using the CLI.

The JSON schema can be seen in tests/json_schema.json.

Here is an example output:

whisper_timestamped AUDIO_FILE.wav --model tiny --language fr
{
  "text": " Bonjour! Est-ce que vous allez bien?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.5,
      "end": 1.2,
      "text": " Bonjour!",
      "tokens": [ 25431, 2298 ],
      "temperature": 0.0,
      "avg_logprob": -0.6674491882324218,
      "compression_ratio": 0.8181818181818182,
      "no_speech_prob": 0.10241222381591797,
      "confidence": 0.51,
      "words": [
        {
          "text": "Bonjour!",
          "start": 0.5,
          "end": 1.2,
          "confidence": 0.51
        }
      ]
    },
    {
      "id": 1,
      "seek": 200,
      "start": 2.02,
      "end": 4.48,
      "text": " Est-ce que vous allez bien?",
      "tokens": [ 50364, 4410, 12, 384, 631, 2630, 18146, 3610, 2506, 50464 ],
      "temperature": 0.0,
      "avg_logprob": -0.43492694334550336,
      "compression_ratio": 0.7714285714285715,
      "no_speech_prob": 0.06502953916788101,
      "confidence": 0.595,
      "words": [
        {
          "text": "Est-ce",
          "start": 2.02,
          "end": 3.78,
          "confidence": 0.441
        },
        {
          "text": "que",
          "start": 3.78,
          "end": 3.84,
          "confidence": 0.948
        },
        {
          "text": "vous",
          "start": 3.84,
          "end": 4.0,
          "confidence": 0.935
        },
        {
          "text": "allez",
          "start": 4.0,
          "end": 4.14,
          "confidence": 0.347
        },
        {
          "text": "bien?",
          "start": 4.14,
          "end": 4.48,
          "confidence": 0.998
        }
      ]
    }
  ],
  "language": "fr"
}
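
For reference, here is a minimal sketch that flattens the word-level information from such an output (result being the dictionary returned by whisper_timestamped.transcribe()):

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['start']:6.2f} {word['end']:6.2f} {word['confidence']:.2f} {word['text']}")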

If the language is not specified (e.g., without the option --language fr in the CLI), you will find an additional key with the language probabilities:

{
  ...
  "language": "fr",
  "language_probs": {
    "en": 0.027954353019595146,
    "zh": 0.02743500843644142,
    ...
    "fr": 0.9196318984031677,
    ...
    "su": 3.0119704064190955e-08,
    "yue": 2.2565967810805887e-05
  }
}
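
For instance, a minimal sketch to retrieve the most probable language from this output:

language_probs = result["language_probs"]
best_language = max(language_probs, key=language_probs.get)
print(best_language, language_probs[best_language])  # here: "fr" 0.9196...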

Options that may improve results

Here are some options that are not enabled by default but might improve results.

Accurate Whisper transcription

As mentioned earlier, some decoding options are disabled by default in favour of efficiency. However, this can impact the quality of the transcription. To run with the options that have the best chance of providing a good transcription, use the following options.

  • In Python:
results = whisper_timestamped.transcribe(model, audio, beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), ...)
  • On the command line:
whisper_timestamped --accurate ...

Running Voice Activity Detection (VAD) before sending to Whisper

Whisper models can "hallucinate" text when given a segment without speech. This can be avoided by running VAD and gluing speech segments together before transcribing with the Whisper model. This is possible with whisper-timestamped.

  • In Python:
results = whisper_timestamped.transcribe(model, audio, vad=True, ...)
  • On the command line:
whisper_timestamped --vad True ...

By default, the VAD method used is silero. Other methods are available, such as earlier versions of silero, or auditok. These were introduced because the latest versions of silero VAD can produce many false alarms on some audio files (speech detected on silence).

  • In Python:
results = whisper_timestamped.transcribe(model, audio, vad="silero:v3.1", ...)
results = whisper_timestamped.transcribe(model, audio, vad="auditok", ...)
  • On the command line:
whisper_timestamped --vad silero:v3.1 ...
whisper_timestamped --vad auditok ...

To inspect the VAD results, you can use the --plot option of the whisper_timestamped CLI, or the plot_word_alignment option of the whisper_timestamped.transcribe() Python function. It will show the VAD results on the input audio signal as follows (the x-axis is time in seconds):

vad="silero:v4.0" vad="silero:v3.1" vad="auditok"
Example VAD Example VAD Example VAD

Detecting disfluencies

Whisper models tend to remove speech disfluencies (filler words, hesitations, repetitions, etc.). Without precautions, the disfluencies that are not transcribed will affect the timestamp of the following word: the timestamp of the beginning of the word will actually be the timestamp of the beginning of the disfluency. whisper-timestamped applies some heuristics to avoid this.

  • In Python:
results = whisper_timestamped.transcribe(model, audio, detect_disfluencies=True, ...)
  • On the command line:
whisper_timestamped --detect_disfluencies True ...

Important: Note that when using this option, possible disfluencies will appear in the transcription as a special "[*]" word.
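
If you prefer to drop these markers in post-processing, here is a minimal sketch (assuming result is the dictionary returned by transcribe() with detect_disfluencies=True):

for segment in result["segments"]:
    segment["words"] = [w for w in segment["words"] if w["text"] != "[*]"]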

Acknowledgment

  • whisper: Whisper speech recognition (License MIT).
  • dtw-python: Dynamic Time Warping (License GPL v3).

Citations

If you use this in your research, please cite the repo:

@misc{lintoai2023whispertimestamped,
  title={whisper-timestamped},
  author={Louradour, J{\'e}r{\^o}me},
  journal={GitHub repository},
  year={2023},
  publisher={GitHub},
  howpublished = {\url{https://github.com/linto-ai/whisper-timestamped}}
}

as well as the OpenAI Whisper paper:

@article{radford2022robust,
  title={Robust speech recognition via large-scale weak supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

and this paper for Dynamic-Time-Warping:

@article{JSSv031i07,
  title={Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package},
  author={Giorgino, Toni},
  journal={Journal of Statistical Software},
  year={2009},
  volume={31},
  number={7},
  doi={10.18637/jss.v031.i07}
}


whisper-timestamped's Issues

AssertionError "assert len(segment_tokens_check) < len(segment["tokens"])" with option --accurate

I'm getting some errors and bad results with the latest version. Running the file through the latest whisper gives good results.

If I use --suppress_tokens="" the process completes, but I get a bunch of repeated lines. Using the same command with --suppress_tokens=-1, I see the following errors. Here are the text results to give an idea of the repeated lines.

Let me know what other information would be helpful.

whisper_timestamped.txt
whisper.txt

whisper command used
whisper --verbose=False --suppress_tokens=-1 --model=tiny -o out video.mov

whisper_timestamped command used
whisper_timestamped --accurate --verbose=False --suppress_tokens=-1 --model=tiny -o out video.mov

Detected language: English
 98%|██████████████████████████████████████████████████████████████████████████████████████████████████▍| 1521171/1557171 [1:04:51<01:32, 390.90frames/s]
Traceback (most recent call last):
  File "/usr/local/bin/whisper_timestamped", line 33, in <module>
    sys.exit(load_entry_point('whisper-timestamped==1.12.1', 'console_scripts', 'whisper_timestamped')())
  File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 2113, in cli
    result = transcribe_timestamped(
  File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 254, in transcribe_timestamped
    (transcription, words) = _transcribe_timestamped_naive(model, audio,
  File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 1150, in _transcribe_timestamped_naive
    assert len(segment_tokens_check) < len(segment["tokens"]) and segment_tokens_check[:-1] == segment["tokens"][:len(segment_tokens_check)-1]
AssertionError
Package             Version
------------------- ----------
certifi             2022.12.7
charset-normalizer  3.1.0
cmake               3.25.2
coloredlogs         15.0.1
Cython              0.29.33
dtw-python          1.3.0
ffmpeg-python       0.2.0
filelock            3.9.0
flatbuffers         23.3.3
future              0.18.3
huggingface-hub     0.13.1
humanfriendly       10.0
idna                3.4
lit                 15.0.7
llvmlite            0.39.1
more-itertools      9.1.0
mpmath              1.3.0
numba               0.56.4
numpy               1.23.5
onnxruntime         1.14.1
openai-whisper      20230308
packaging           23.0
pip                 23.0.1
protobuf            4.22.1
PyYAML              6.0
regex               2022.10.31
requests            2.28.2
scipy               1.10.1
setuptools          67.6.0
sympy               1.11.1
tokenizers          0.13.2
torch               1.13.1+cpu
torchaudio          0.13.1
tqdm                4.65.0
transformers        4.26.1
triton              2.0.0
typing_extensions   4.5.0
urllib3             1.26.15
whisper-timestamped 1.12.1

Add a max_line_length parameter to subtitle files

Hello, first of all thanks for your work, I'm here to give a suggestion.

With word-level timestamps, I think it would be possible to add a character limit per line/time in SRT and VTT subtitle files, instead of relying on simple line breaks.

Depending on the audio, lines can exceed 200+ characters, and I believe this problem could be fixed with such an implementation.

If it's not possible to add this parameter, could you, when you have time, provide me with some code that would make this idea work? (I'm not from the programming area and I have a little difficulty.)

Here's a discussion on the subject on Whisper so you can understand a little better: Improve default line lengths in subtitle files

Thanks.
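
A minimal sketch of what this could look like (this is not a built-in feature; result is the dictionary returned by whisper_timestamped.transcribe(), and the 42-character limit is an arbitrary choice):

def format_timestamp(seconds):
    # SRT timestamps look like 00:01:02,345
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(result, max_chars=42):
    # Group consecutive words into cues of at most max_chars characters.
    cues, current, length = [], [], 0
    for segment in result["segments"]:
        for word in segment["words"]:
            if current and length + 1 + len(word["text"]) > max_chars:
                cues.append(current)
                current, length = [], 0
            current.append(word)
            length += len(word["text"]) + (1 if length > 0 else 0)
    if current:
        cues.append(current)
    blocks = []
    for i, cue in enumerate(cues, 1):
        start = format_timestamp(cue[0]["start"])
        end = format_timestamp(cue[-1]["end"])
        text = " ".join(w["text"] for w in cue)
        blocks.append(f"{i}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)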

Verbose option does not support None (no output)

In the original Whisper transcribe implementation, providing None to the verbose option suppresses everything, as explained here. I believe it is this line here which overrides the None behavior, so there is no option to have no output whatsoever (providing None shows the progress bar and language, providing False also shows the progress bar, and providing True does not show the progress bar but displays the list of word-level timestamps).

It would be ideal if there was an option to suppress all output from the transcribe method.
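
Until that is fixed, a possible workaround (a sketch, not an official option) is to redirect stdout and stderr around the call, which also silences the tqdm progress bar:

import io
import contextlib
import whisper_timestamped as whisper

model = whisper.load_model("tiny", device="cpu")
audio = whisper.load_audio("AUDIO.wav")

# Both streams are redirected: the progress bar goes to stderr,
# the verbose prints go to stdout.
with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
    result = whisper.transcribe(model, audio)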

Cannot run multiple transcriptions without reloading the model

The transcription cannot be run again on the same loaded model; it gives an assertion error in the perform_word_alignment function:

    for i, w in enumerate(attention_weights):
        assert w.shape[-2] == len(tokens), f"Attention weights have wrong shape: {w.shape[-2]} (expected {len(tokens)})."

Weird repetition on transcript

Thanks for this repo! I found the timestamp for each word is very accurate. However, I encountered some weird repetitions in the transcript, just like with the original Whisper. I used stable_whisper, which solves all those repetitions and gives a very stable output. I am wondering if there are some arguments I have to change in the transcribe function, or if there is any way to combine stable_whisper with it to remove those repetitions?
Here is the wav demo.
friends_01.wav.zip

[Bug] Unable to use multiple output formats directly (without "all")

Hi,

I was trying to define multiple output formats directly with format names (without using the "all" option).
I couldn't figure out how to do it, and it eventually dawned on me that this might be a parsing bug. I think the problem is that if you give it a string like "json,srt", it will not match any of the pre-defined strings in choices[].

The diff below fixes the issue, but the commas were lost, so you need to use something like "-f json srt" (without the quotes).

output_format_diff.gz

Fatal Error: Got inconsistent text for segment 10

Hi,

I was trying this out for the first time and ran into the following error when using CUDA:

Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36000/36000 [00:27<00:00, 1287.63frames/s]
Traceback (most recent call last):
  File "E:\test.py", line 5, in <module>
    result = whisper.transcribe(model, audio)
  File "C:\Users\Stephen\AppData\Local\Programs\Python\Python310\lib\site-packages\whisper_timestamped\transcribe.py", line 226, in transcribe_timestamped
    (transcription, words) = _transcribe_timestamped_efficient(model, audio, trust_whisper_timestamps=trust_whisper_timestamps, **alignment_options, **whisper_options, **other_options)
  File "C:\Users\Stephen\AppData\Local\Programs\Python\Python310\lib\site-packages\whisper_timestamped\transcribe.py", line 797, in _transcribe_timestamped_efficient
    assert len(timestamped_tokens) < len(whisper_tokens) and timestamped_tokens == whisper_tokens[:len(timestamped_tokens)], \
AssertionError: Fatal Error: Got inconsistent text for segment 10:

My file is

import whisper_timestamped as whisper
audio = whisper.load_audio("o.mp3")

model = whisper.load_model("tiny", device="cuda") 
result = whisper.transcribe(model, audio)

This works correctly when the device is specified as "cpu".

Any hints as to how I can solve this?

Clarifying whisperX limitations

The need to perform twice the inference (once with Whisper, once with wav2vec), which has an impact on the Real Time Factor.

wav2vec inference time is <10% of whisper, so minimal overhead.

The need to handle (at least) one additional neural network, which consumes memory.

These can be run separately with cuda cache cleared.

The need to find one wav2vec model per language to support.

Wav2vec models are available for most languages on https://huggingface.co/models

There is a major limitation I currently see with pure attention-based DTW word timestamps, which seems to give qualitatively worse results: Whisper sentence timestamps are often incorrect by up to 15 seconds or more, so the DTW window/alignment fails, cannot produce valid timestamps, and then causes severe drifting,

whereas WhisperX can use VAD timestamps to window the alignment, removing this dependency on Whisper timestamps entirely.

WARNING:whisper_timestamped:Inconsistent number of segments:

I am loving the word timings feature, but I have noticed that it sometimes shows an error after processing audio.

The traceback is included below. This issue is not a problem with short tracks as I can simply start over, but with large files it takes about half an hour to process and it becomes very painful to see errors.

Are there any suggestions on how to fix this issue or could you potentially add a try-catch to skip some words but still show me the others?

100%|██████████| 475781/475781 [31:38<00:00, 250.57frames/s]
WARNING:whisper_timestamped:Inconsistent number of segments: whisper_segments (959) != timestamped_word_segments (951)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
[<ipython-input-5-5f700e0484fa>](https://m2tnz9and6-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20230112-060047-RC00_501486188#) in <module>
     10 
     11 model = whisper.load_model("medium")
---> 12 result = whisper.transcribe(model, audio)
     13 
     14 json.dump(result, open(words_json, 'w'), indent=4, ensure_ascii = False)

/usr/local/lib/python3.8/dist-packages/whisper_timestamped/transcribe.py in transcribe(model, audio, language, task, refine_whisper_precision, min_word_duration, plot_word_alignment, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, fp16, condition_on_previous_text, initial_prompt, suppress_tokens, verbose)
    250     words = []
    251     for i, (segment, timestamped_words, token) in enumerate(zip(whisper_segments, timestamped_word_segments, tokens)):
--> 252         assert filter_tokens(token) == filter_tokens(segment["tokens"])
    253         offset = segment["seek"] * HOP_LENGTH / SAMPLE_RATE
    254         for timestamped_word in timestamped_words:


AssertionError: Got inconsistent logprobs length : 23 != 22

!whisper_timestamped "/content/drive/MyDrive/نشيد الهدى The Right Way لمحمد المقيط Muhammed Al Muqit مع الكلمات والترجمة الإنجليزية في الوصف [zVYaS_qXM3Y].flac" --model large --output_dir "/content/drive/MyDrive" --model_dir "/content/drive/MyDrive/Whisper" --language Arabic --device cuda --threads 2 --verbose True

:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
tcmalloc: large alloc 3087007744 bytes == 0xbade000 @ 0x7f5d5e218680 0x7f5d5e239824 0x5f97c1 0x649901 0x5c43c6 0x4f327e 0x64e618 0x505163 0x56bbe1 0x5f5ee6 0x56bab6 0x569d8a 0x5f60c3 0x56cc92 0x5f5ee6 0x56bab6 0x569d8a 0x68e267 0x67d9b1 0x67da2f 0x67dad1 0x67fbf7 0x6b8082 0x6b840d 0x7f5d5e017083 0x5faa2e
[00:00.000 --> 00:07.000] حبّي أنصار الهدى
[00:07.000 --> 00:15.000] حبّي ركب الفدا
[00:15.000 --> 00:23.000] وطلّوا بسأر الشاهد
[00:23.000 --> 00:30.000] وطلّوا بسأر الشاهد
[00:30.000 --> 00:36.000] لمتى سنظل ربودا نغرق في النوم ونشخر
[00:36.000 --> 00:41.000] وعن الآذان نسمّ ونغمّض كي لا نيسر
[00:41.000 --> 00:46.000] لقت أجراس الخطر والحالة جاءت تهذر
[00:46.000 --> 00:52.000] لمتى نتقع سقول من يمنع زحف المنكر
[00:52.000 --> 00:57.000] مات الإحساس فصرنا من غير فؤادي يذكر
[00:57.000 --> 01:02.000] نبصر إخوانا غرقى فنمر كما لا يبصر
[01:02.000 --> 01:08.000] فلنأمر بالمعروف والنهى عن فعل الشر
[01:08.000 --> 01:15.000] ولنتركها عذارا دجلا خلقت كين عذار
[01:15.000 --> 01:20.000] لا أملك فكرة الكلمات تطلع بالدين مورّر
[01:20.000 --> 01:25.000] أو شخصية تأثير إمّا عندي فلتتققر
[01:25.000 --> 01:31.000] ما تذكر ليس بعذر إنّي أكبل في يوم المحشر
[01:31.000 --> 01:42.000] لما لا تتقف فلما لا تترك ما منه تصغّر
[01:42.000 --> 01:52.000] بأمطار الهدى عبّي ركب الفدا
[01:52.000 --> 01:57.000] وطلّوا بثأر الشاهدة
[01:57.000 --> 02:03.000] فتيات اليوم كفاكي تركضت خلف سرابي
[02:03.000 --> 02:08.000] عودي نحو الإسلام لا تستمعي لذئابي
[02:08.000 --> 02:13.000] فالشرق دمار فيه والأرض عدو كتابي
[02:13.000 --> 02:19.000] همّهمو تمسو الديني وكذا نفضل ترابي
[02:19.000 --> 02:24.000] قتلوا الإنسانية في أعلاقهم بحرابي
[02:24.000 --> 02:31.000] هجروا الأخلاق فصاروا حيوانات في غابي
[02:31.000 --> 02:36.000] حيوانات في غابي يا من تعمل للدنيا
[02:36.000 --> 02:41.000] ليّنفعك فما تعمل أرضاك غدا تصلاهو
[02:41.000 --> 02:47.000] نارا فيها تتقلقل تعمر في دار فنائي
[02:47.000 --> 02:52.000] تهدم دار المستقبل وتضيع العمر الغالي
[02:52.000 --> 02:57.000] وبك الفترة لن تقتل كم سترائي وترائي
[02:57.000 --> 03:03.000] وتخادع فلا تخجل إن الله عليم لن
[03:03.000 --> 03:08.000] تخدعه أو تستوفل فغدا إن لن ترجعان
[03:08.000 --> 03:13.000] غيّك للرشد وتعقل ستنادى بنداءتين
[03:13.000 --> 03:18.000] يوم على الله ستقبل يا خائب يا كافر
[03:18.000 --> 03:24.000] يا خاسر يا فاجر يا علف اذهب ولتطلق أجران
[03:24.000 --> 03:30.000] ممن كنت له تعمال
[03:30.000 --> 03:36.000] وتعمل لن تال على المانونا
[03:36.000 --> 03:42.000] برياء أو أنا
[03:42.000 --> 03:45.000] فلتاكم صاحي راشيد
[03:45.000 --> 03:50.000] حبي أنصار الهدى
[03:50.000 --> 03:55.000] عبيئي ركب الفيدى
[03:55.000 --> 04:05.000] واطمو بسور الشهيد
Traceback (most recent call last):
  File "/usr/local/bin/whisper_timestamped", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/dist-packages/whisper_timestamped/transcribe.py", line 1025, in cli
    result = transcribe_timestamped(
  File "/usr/local/lib/python3.8/dist-packages/whisper_timestamped/transcribe.py", line 491, in transcribe_timestamped
    assert i_end == len(logprobs), f"Got inconsistent logprobs length : {len(logprobs)} != {i_end}"
AssertionError: Got inconsistent logprobs length : 23 != 22

Then it does not output the files.

AssertionError: Got empty transcription!

I'm getting a weird error with long (> 3 hour) videos. Shorter videos seem to output okay. Running the same command with plain whisper outputs properly.

whisper_timestamped --verbose=False --suppress_tokens="" --accurate --model=tiny --fp16=False -o /tmp /tmp/video.mov

Traceback (most recent call last):
  File "/usr/local/bin/whisper_timestamped", line 33, in <module>
    sys.exit(load_entry_point('whisper-timestamped==1.11.0', 'console_scripts', 'whisper_timestamped')())
  File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 1928, in cli
    result = transcribe_timestamped(
  File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 235, in transcribe_timestamped
    (transcription, words) = _transcribe_timestamped_naive(model, audio,
  File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 1026, in _transcribe_timestamped_naive
    assert len(tokens), "Got empty transcription!"
AssertionError: Got empty transcription!

How to write SRT file? Are models the same as whisper?

Thank you for the code. I want to write the word-level timestamp results in SRT format. How can I do that?
Are the model files (tiny, large-v2, large, etc.) you are using the same as the model files in the original code at https://github.com/openai/whisper?

import whisper_timestamped as whisper

audio = whisper.load_audio("audio.mp3")

model = whisper.load_model("tiny", device="cuda")

result = whisper.transcribe_timestamped(model, audio, language="en")
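
As for the second question: whisper-timestamped is an extension of the openai-whisper package, so the standard model names (tiny, large-v2, large, etc.) refer to the same checkpoints as in the original repository. For writing SRT from Python, here is a sketch using whisper's own writer (assuming a recent enough openai-whisper; the exact writer signature varies across versions):

import whisper_timestamped as whisper
from whisper.utils import get_writer  # provided by openai-whisper

audio = whisper.load_audio("audio.mp3")
model = whisper.load_model("tiny", device="cuda")
result = whisper.transcribe(model, audio, language="en")

writer = get_writer("srt", ".")  # writes ./audio.srt
writer(result, "audio.mp3")

Alternatively, the whisper_timestamped CLI writes SRT files directly, including an additional SRT file with word timestamps (see the output-files notes above).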

Consider Supporting CTranslate2 for faster inference

I recently learned about faster-whisper which uses the CTranslate2 library for faster inference. It seems you need to convert the whisper models first, but it claims the accuracy is the same for 4x speed improvements and reduced memory on both CPU and GPU.

I'm not sure if it would be feasible to support this but wanted to bring it up in case it was of interest. Feel free to close this issue if it is not possible.

Option --vad not working offline (when VAD torch model has been loaded already)

It seems the VAD option can't be used offline?
On the first run, torch.hub.load() creates a cache (for me, this is ~/.cache/torch).
But when run for the second time, this cache seems to be ignored and it will try to connect to github again.
(And that'll crash the whole thing if there's no internet)

As a side note, man that PyTorch is one nasty dependency to have...over 700 megs??
I almost gave up installing it (yeah I guess it can't be easily avoided).

Tested with 28d29e2 and Linux Mint 20.3.
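
A possible workaround sketch, assuming the silero VAD repository is already in the local torch hub cache from a previous online run (whisper-timestamped itself would need a small patch to load the model this way):

import os
import torch

# Point torch.hub at the cached checkout instead of the "snakers4/silero-vad"
# GitHub repository, so no network access is attempted.
cache_dir = os.path.expanduser("~/.cache/torch/hub/snakers4_silero-vad_master")
model, utils = torch.hub.load(repo_or_dir=cache_dir, model="silero_vad", source="local")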

[Bug] remove_last_null_duration_words

I got this error:

Using cache found in /home/mario/.cache/torch/hub/snakers4_silero-vad_master
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 305493/305493 [14:31<00:00, 350.46frames/s]
An additional token was added on segment 33
An additional token was added on segment 39
An additional token was added on segment 43
An additional token was added on segment 44
An additional token was added on segment 45
An additional token was added on segment 60
An additional token was added on segment 65
An additional token was added on segment 91
An additional token was added on segment 92
Traceback (most recent call last):
  File "/home/mario/code/video_transcriptions/app.py", line 14, in <module>
    result = whisper.transcribe(model, audio, language="no", vad=True)
  File "/home/mario/mambaforge/envs/video_transcriptions/lib/python3.9/site-packages/whisper_timestamped-1.12.1-py3.9.egg/whisper_timestamped/transcribe.py", line 264, in transcribe_timestamped
    transcription, words = remove_last_null_duration_words(transcription, words, recompute_text=True)
  File "/home/mario/mambaforge/envs/video_transcriptions/lib/python3.9/site-packages/whisper_timestamped-1.12.1-py3.9.egg/whisper_timestamped/transcribe.py", line 1822, in remove_last_null_duration_words
    assert text.endswith(full_word)

huggingface_hub.utils._validators.HFValidationError

Hi, the sample code below produces the following error -- what am I doing wrong?

import whisper_timestamped as whisper
import argparse
parser = argparse.ArgumentParser()


# function for transcribing audio file

def transcribe_audio(audio_file):
    audio = whisper.load_audio(audio_file)
    model = whisper.load_model("tiny", device="cpu")
    result = whisper.transcribe(model, audio, language="en")
    return result



import json

# use argparse take input from user from parameter "--input" and output to optional parameter "--output"


# Add the input argument
parser.add_argument("--input", help="Input file path", required=True)

# Add the output argument with a default value
parser.add_argument("--output", help="Output file path (default: output.json)")

# Parse the arguments
args = parser.parse_args()

result = transcribe_audio(args.input)

# if output is not specified, use the input file name with .json extension
if args.output is None:
    args.output = args.input + ".json"

# write the result to the output file
with open(args.output, "w") as f:
    f.write(json.dumps(result, indent = 2, ensure_ascii = False))
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/Users/julian/.local/share/virtualenvs/whisper-with-timestamps-O7h9XhvG/lib/python3.10/site-packages/openai_whisper-20230124-py3.10.egg/whisper/assets/multilingual'. Use `repo_type` argument if needed.

Transcription contains duplicated fragments

Tested on the 'medium' model.
I got duplicate fragments, but whisper recognized it well.
{"text": " Вместе с вами мы будем делать оборудование. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое. И мы будем делать оборудование, чтобы вы могли увидеть, что это такое.",

whisper:
{ "text": " Ну, одна, двухкомнатная интересует. Ну, я мужчина один. Тридцать три года русский. Сегодня вечером. Да, да, да. Да, да, но длительный срок. Ну, полгода, год. Нету, нету. Да, да, да. Борис. Да, конечно, какая вот, 50%. Да, да, да, я в курсе. Спасибо большое.",

Question: How to efficiently get attention weights with beam search decoding?

Hello, I just wanted to ask: if inference has to happen twice during beam search decoding, is that a limitation of the Whisper architecture, or something that is very difficult to implement in one pass? I'd like to know whether it is an area I can help in.

What would happen if you collected the attention weights on the fly while decoding, as in the naive solution?
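
For context, here is a minimal sketch of how cross-attention weights can be collected on the fly with PyTorch forward hooks; the module layout (model.decoder.blocks, with cross_attn returning an (output, qk) tuple) is that of the openai-whisper implementation:

import whisper

model = whisper.load_model("tiny")
attention_weights = []

def hook(module, inputs, outputs):
    # MultiHeadAttention.forward returns (output, qk), where qk holds the
    # raw attention logits before the softmax.
    attention_weights.append(outputs[1])

handles = [block.cross_attn.register_forward_hook(hook)
           for block in model.decoder.blocks]
# ... run the decoding here; attention_weights fills up as tokens are predicted ...
for h in handles:
    h.remove()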

word.strip()

whisper-timestamped strips all words by calling xx.strip().
whisper works quite well with e.g. "scriptio continua".
If one removes the xx.strip(), everything works as expected (getting the leading spaces).

Delay in the word level transcription

Hello, I'm not sure if this is a limitation of whisper or a bug, but sometimes the sentences/segments are delayed and then appear all at once.

This video can reproduce the delay: youtube
Downloaded from the mp4 from here: youtube downloader

I think it happens for sections that are followed by stretches with no speech/audio. In this particular video, there is an example of it occurring around the 24-second mark, and the text for that segment appears later all at once.

I used whisper large-v1 with temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), beam_size=5, best_of=5, condition_on_previous_text=False

Fail with "***.en" models

This was working yesterday; I may try reverting to a previous version.

Anyway, here is my code:

import whisper_timestamped

audio = whisper_timestamped.load_audio("memory_not_needed.mp3")
model = whisper_timestamped.load_model("medium.en")
result = whisper_timestamped.transcribe(model, audio)

Here is my error:

100%|██████████████████████████████████████████████████████████████████████████████| 80805/80805 [01:37<00:00, 831.13frames/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 5
      3 audio = whisper_timestamped.load_audio("memory_not_needed.mp3")
      4 # model = whisper_timestamped.load_model("medium.en")
----> 5 result = whisper_timestamped.transcribe(model, audio)

File d:\python\python39\lib\site-packages\whisper_timestamped\transcribe.py:209, in transcribe_timestamped(model, audio, language, task, remove_punctuation_from_words, compute_word_confidence, include_punctuation_in_confidence, refine_whisper_precision, min_word_duration, plot_word_alignment, seed, naive_approach, temperature, best_of, beam_size, patience, length_penalty, compression_ratio_threshold, logprob_threshold, no_speech_threshold, fp16, condition_on_previous_text, initial_prompt, suppress_tokens, sample_len, verbose)
    202 other_options = dict(
    203     no_speech_threshold=no_speech_threshold,
    204     logprob_threshold=logprob_threshold,
    205     compression_ratio_threshold=compression_ratio_threshold,
    206 )
    208 if naive_approach:
--> 209     (transcription, words) = _transcribe_timestamped_naive(model, audio, min_word_duration=min_word_duration, **alignment_options, **whisper_options, **other_options)
    210 else:
    211     (transcription, words) = _transcribe_timestamped_efficient(model, audio, **alignment_options, **whisper_options, **other_options)

File d:\python\python39\lib\site-packages\whisper_timestamped\transcribe.py:784, in _transcribe_timestamped_naive(model, audio, remove_punctuation_from_words, compute_word_confidence, include_punctuation_in_confidence, refine_whisper_precision_nframes, plot_word_alignment, min_word_duration, **whisper_options)
    781 tokens = tokens[3:] + [tokenizer.timestamp_begin + round((end_sample - start_sample) // AUDIO_SAMPLES_PER_TOKEN)]
    782 attention_weights = [w[:, :, 2:, :] for w in attention_weights]
--> 784 ws = perform_word_alignment(
    785     tokens,
    786     attention_weights,
    787     tokenizer,
    788     use_space=use_space,
    789     remove_punctuation_from_words=remove_punctuation_from_words,
    790     refine_whisper_precision_nframes=refine_whisper_precision_nframes,
    791     mfcc=mfcc,
    792     plot=plot_word_alignment,
    793 )
    795 i_start = 3
    796 segment_logprobs = []

File d:\python\python39\lib\site-packages\whisper_timestamped\transcribe.py:922, in perform_word_alignment(tokens, attention_weights, tokenizer, use_space, refine_whisper_precision_nframes, medfilt_width, qk_scale, most_top_layers, mfcc, plot, remove_punctuation_from_words, unfinished_decoding, debug)
    920 # Check start / end tokens
    921 if start_token < 0:
--> 922     raise RuntimeError(f"Missing start token in {tokenizer.decode_with_timestamps(tokens)}")
    923 if len(tokens) == 1 or end_token < 0:
    924     # This can happens when Whisper is stucked as a Language Model
    925     if debug:

RuntimeError: Missing start token in , it's Dr. Juice here, and today we're going to be talking about why you just don't<|6.60|>

start + end outside length of audio

This 15s audio: gaenswein15.zip

Command line:

python3 whisper_timestamped/transcribe.py ~/gaenswein15.wav --model large-v2 --language de

Timestamps:

    {
      "id": 2,
      "seek": 1300,
      "start": 27.16,
      "end": 27.86,
      "text": " Das hat er als emeritus Ritus gewünscht.",

Different results with whisper and whisper_timestamped

When I run an audio file through whisper I get a good transcription, but when it is run through whisper_timestamped the results are different with the same parameters.
Audio file:
https://drive.google.com/file/d/11FefP13yHkKOgSMUDjwRR3ztbtAfZCSD/view?usp=sharing

whisper:
script = model.transcribe(audio, language = 'Hindi', beam_size = 5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0))

whisper_timestamped:

audio = whisper.load_audio('/data_drive_500/vocals_skanda_sil.wav')
result = whisper.transcribe(model, audio, language='hi', beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0))

Suggestion: Print timestamped words on-the-fly with option verbose=True

Question: Does this algorithm fundamentally require the entire whisper transcription to be complete before processing, or could it be modified to output segments during processing, like whisper's -v verbose output? For example, if I had a 1-hour recording, could I modify whisper-timestamped to produce the word timestamps in "real time", segment by segment, rather than waiting until the whole 1-hour file is transcribed? I'm happy to try to do it myself, but I just wanted to make sure there's no fundamental reason it can't be done. Would also appreciate any advice for where you would recommend I make the changes. Thanks!

P.S. thanks for sharing this repo, amazing work!

Inconsistent number of segments: whisper_segments (1352) != timestamped_word_segments (1350)

Hi, thanks for making this package publicly available, and thanks in advance for your help!
Unfortunately, I'm still running into this issue:
#24

It seems to be specific to at least two of the files in the dataset I'm trying to transcribe. Some of the files work fine, but when the script gets to one of two specific files I get the following errors. NOTE: these errors are using the medium.en model - I've added error handling to try successively smaller models and to write the tracebacks to file. I'll update this when I know more (e.g., number of files that have problems, if all models have this problem or just a few, etc). Unfortunately I cannot share the audio due to privacy/research restrictions.

The Python function calls to Whisper:

model_name = 'medium'
model = whisper.load_model(model_name, device="cpu")
result = whisper.transcribe(model, audio, language="en")

Error message for the first file:

Processing file: D:\github\PAT_data\audio\MOH 110-006 V2 02Dec2020 Part 2.wav
100%|██████████| 233676/233676 [55:47<00:00, 69.80frames/s] 
Inconsistent number of segments: whisper_segments (1352) != timestamped_word_segments (1350)
Traceback (most recent call last):

  File ~\Anaconda3\envs\stt\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File d:\github\pat_private\driver_whisper_speech_to_text.py:56
    result = whisper.transcribe(model, audio, language="en")

  File ~\Anaconda3\envs\stt\lib\site-packages\whisper_timestamped\transcribe.py:259 in transcribe_timestamped
    (transcription, words) = _transcribe_timestamped_efficient(model, audio,

  File ~\Anaconda3\envs\stt\lib\site-packages\whisper_timestamped\transcribe.py:851 in _transcribe_timestamped_efficient
    assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"

AssertionError: Inconsistent number of segments: whisper_segments (1352) != timestamped_word_segments (1350)

Error message for the second file:

Processing file: D:\github\PAT_data\audio\MOH1 110-004 V5 Part 1 of 2 13NOV2020.wav
100%|██████████| 289939/289939 [57:12<00:00, 84.47frames/s] 
Inconsistent number of segments: whisper_segments (862) != timestamped_word_segments (861)
Traceback (most recent call last):

  File ~\Anaconda3\envs\stt\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File d:\github\pat_private\driver_whisper_speech_to_text.py:56
    result = whisper.transcribe(model, audio, language="en")

  File ~\Anaconda3\envs\stt\lib\site-packages\whisper_timestamped\transcribe.py:259 in transcribe_timestamped
    (transcription, words) = _transcribe_timestamped_efficient(model, audio,

  File ~\Anaconda3\envs\stt\lib\site-packages\whisper_timestamped\transcribe.py:851 in _transcribe_timestamped_efficient
    assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"

AssertionError: Inconsistent number of segments: whisper_segments (862) != timestamped_word_segments (861)

System and versions:
Windows 10 Enterprise (OS build: 19043.2364)
Running on a CPU (Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz 3.40 GHz)
Running using Python with Conda as my package manager (installed package versions below)
Packages in the environment:

# Name                    Version                   Build  Channel
absl-py                   1.4.0                    pypi_0    pypi
aiohttp                   3.8.4                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
alabaster                 0.7.12             pyhd3eb1b0_0
alembic                   1.10.2                   pypi_0    pypi
antlr4-python3-runtime    4.9.3                    pypi_0    pypi
arrow                     1.2.3           py310haa95532_1
asteroid-filterbanks      0.4.0                    pypi_0    pypi
astroid                   2.14.2          py310haa95532_0
asttokens                 2.0.5              pyhd3eb1b0_0
async-timeout             4.0.2                    pypi_0    pypi
atomicwrites              1.4.0                      py_0
attrs                     22.1.0          py310haa95532_0
audioread                 3.0.0                    pypi_0    pypi
autopep8                  1.6.0              pyhd3eb1b0_1
babel                     2.11.0          py310haa95532_0
backcall                  0.2.0              pyhd3eb1b0_0
backports-cached-property 1.0.2                    pypi_0    pypi
bcrypt                    3.2.0           py310h2bbff1b_1
beautifulsoup4            4.11.1          py310haa95532_0
binaryornot               0.4.4              pyhd3eb1b0_1
black                     22.6.0          py310haa95532_0
blas                      1.0                         mkl
bleach                    4.1.0              pyhd3eb1b0_0
bottleneck                1.3.5           py310h9128911_0
brotlipy                  0.7.0           py310h2bbff1b_1002
bzip2                     1.0.8                he774522_0
ca-certificates           2023.01.10           haa95532_0
cachetools                5.3.0                    pypi_0    pypi
certifi                   2022.12.7       py310haa95532_0
cffi                      1.15.1          py310h2bbff1b_3
chardet                   4.0.0           py310haa95532_1003
charset-normalizer        2.0.4              pyhd3eb1b0_0
click                     8.0.4           py310haa95532_0
cloudpickle               2.0.0              pyhd3eb1b0_0
cmaes                     0.9.1                    pypi_0    pypi
colorama                  0.4.6           py310haa95532_0
coloredlogs               15.0.1                   pypi_0    pypi
colorlog                  6.7.0                    pypi_0    pypi
comm                      0.1.2           py310haa95532_0
commonmark                0.9.1                    pypi_0    pypi
contourpy                 1.0.7                    pypi_0    pypi
cookiecutter              1.7.3              pyhd3eb1b0_0
cryptography              39.0.1          py310h21b164f_0
cycler                    0.11.0                   pypi_0    pypi
cython                    0.29.33                  pypi_0    pypi
debugpy                   1.5.1           py310hd77b12b_0
decorator                 5.1.1              pyhd3eb1b0_0
defusedxml                0.7.1              pyhd3eb1b0_0
diff-match-patch          20200713           pyhd3eb1b0_0
dill                      0.3.6           py310haa95532_0
docopt                    0.6.2                    pypi_0    pypi
docstring-to-markdown     0.11            py310haa95532_0
docutils                  0.18.1          py310haa95532_3
dtw-python                1.3.0                    pypi_0    pypi
einops                    0.3.2                    pypi_0    pypi
entrypoints               0.4             py310haa95532_0
executing                 0.8.3              pyhd3eb1b0_0
ffmpeg                    4.3.1                ha925a31_0    conda-forge
ffmpeg-python             0.2.0                    pypi_0    pypi
filelock                  3.9.0                    pypi_0    pypi
flake8                    6.0.0           py310haa95532_0
flatbuffers               23.3.3                   pypi_0    pypi
flit-core                 3.6.0              pyhd3eb1b0_0
fonttools                 4.39.0                   pypi_0    pypi
frozenlist                1.3.3                    pypi_0    pypi
fsspec                    2023.3.0                 pypi_0    pypi
future                    0.18.3                   pypi_0    pypi
giflib                    5.2.1                h8cc25b3_3
git                       2.34.1               haa95532_0
glib                      2.69.1               h5dc1a3c_2
google-auth               2.16.2                   pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
greenlet                  2.0.2                    pypi_0    pypi
grpcio                    1.51.3                   pypi_0    pypi
gst-plugins-base          1.18.5               h9e645db_0
gstreamer                 1.18.5               hd78058f_0
hmmlearn                  0.2.8                    pypi_0    pypi
huggingface-hub           0.13.1                   pypi_0    pypi
humanfriendly             10.0                     pypi_0    pypi
hyperpyyaml               1.1.0                    pypi_0    pypi
icu                       58.2                 ha925a31_3
idna                      3.4             py310haa95532_0
imagesize                 1.4.1           py310haa95532_0
importlib-metadata        4.11.3          py310haa95532_0
importlib_metadata        4.11.3               hd3eb1b0_0
inflection                0.5.1           py310haa95532_0
intel-openmp              2021.4.0          haa95532_3556
intervaltree              3.1.0              pyhd3eb1b0_0
ipykernel                 6.19.2          py310h9909e9c_0
ipython                   8.10.0          py310haa95532_0
ipython_genutils          0.2.0              pyhd3eb1b0_1
isort                     5.9.3              pyhd3eb1b0_0
jedi                      0.18.1          py310haa95532_1
jellyfish                 0.9.0           py310h2bbff1b_0
jinja2                    3.1.2           py310haa95532_0
jinja2-time               0.2.0              pyhd3eb1b0_3
joblib                    1.2.0                    pypi_0    pypi
jpeg                      9e                   h2bbff1b_1
jsonschema                4.17.3          py310haa95532_0
julius                    0.2.7                    pypi_0    pypi
jupyter_client            7.4.9           py310haa95532_0
jupyter_core              5.2.0           py310haa95532_0
jupyterlab_pygments       0.1.2                      py_0
keyring                   23.4.0          py310haa95532_0
kiwisolver                1.4.4                    pypi_0    pypi
lazy-object-proxy         1.6.0           py310h2bbff1b_0
lerc                      3.0                  hd77b12b_0
libclang                  12.0.0          default_h627e005_2
libdeflate                1.17                 h2bbff1b_0
libffi                    3.4.2                hd77b12b_6
libiconv                  1.16                 h2bbff1b_2
libogg                    1.3.5                h2bbff1b_1
libpng                    1.6.39               h8cc25b3_0
librosa                   0.9.2                    pypi_0    pypi
libsodium                 1.0.18               h62dcd97_0
libspatialindex           1.9.3                h6c2663c_0
libtiff                   4.5.0                h6c2663c_2
libvorbis                 1.3.7                he774522_0
libwebp                   1.2.4                hbc33d0d_1
libwebp-base              1.2.4                h2bbff1b_1
libxml2                   2.9.14               h0ad7f3c_0
libxslt                   1.1.35               h2bbff1b_0
llvmlite                  0.39.1                   pypi_0    pypi
lxml                      4.9.1           py310h1985fb9_0
lz4-c                     1.9.4                h2bbff1b_0
mako                      1.2.4                    pypi_0    pypi
markdown                  3.4.1                    pypi_0    pypi
markupsafe                2.1.1           py310h2bbff1b_0
matplotlib                3.7.1                    pypi_0    pypi
matplotlib-inline         0.1.6           py310haa95532_0
mccabe                    0.7.0              pyhd3eb1b0_0
mistune                   0.8.4           py310h2bbff1b_1000
mkl                       2021.4.0           haa95532_640
mkl-service               2.4.0           py310h2bbff1b_0
mkl_fft                   1.3.1           py310ha0764ea_0
mkl_random                1.2.2           py310h4ed8f06_0
more-itertools            9.1.0                    pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
multidict                 6.0.4                    pypi_0    pypi
mypy_extensions           0.4.3           py310haa95532_1
nbclient                  0.5.13          py310haa95532_0
nbconvert                 6.5.4           py310haa95532_0
nbformat                  5.7.0           py310haa95532_0
nest-asyncio              1.5.6           py310haa95532_0
networkx                  2.8.8                    pypi_0    pypi
numba                     0.56.4                   pypi_0    pypi
numexpr                   2.8.4           py310hd213c9f_0
numpy                     1.23.5          py310h60c9a35_0
numpy-base                1.23.5          py310h04254f7_0
numpydoc                  1.5.0           py310haa95532_0
oauthlib                  3.2.2                    pypi_0    pypi
omegaconf                 2.3.0                    pypi_0    pypi
onnxruntime               1.14.1                   pypi_0    pypi
openai-whisper            20230308                 pypi_0    pypi
openssl                   1.1.1t               h2bbff1b_0
optuna                    3.1.0                    pypi_0    pypi
packaging                 22.0            py310haa95532_0
pandas                    1.5.3           py310h4ed8f06_0
pandocfilters             1.5.0              pyhd3eb1b0_0
paramiko                  2.8.1              pyhd3eb1b0_0
parso                     0.8.3              pyhd3eb1b0_0
pathspec                  0.10.3          py310haa95532_0
pcre                      8.45                 hd77b12b_0
pexpect                   4.8.0              pyhd3eb1b0_3
pickleshare               0.7.5           pyhd3eb1b0_1003
pillow                    9.4.0                    pypi_0    pypi
pip                       23.0.1          py310haa95532_0
platformdirs              2.5.2           py310haa95532_0
pluggy                    1.0.0           py310haa95532_1
ply                       3.11            py310haa95532_0
pooch                     1.7.0                    pypi_0    pypi
poyo                      0.5.0              pyhd3eb1b0_0
primepy                   1.3                      pypi_0    pypi
prompt-toolkit            3.0.36          py310haa95532_0
protobuf                  3.20.1                   pypi_0    pypi
psutil                    5.9.0           py310h2bbff1b_0
ptyprocess                0.7.0              pyhd3eb1b0_2
pure_eval                 0.2.2              pyhd3eb1b0_0
pyannote-audio            2.1.1                    pypi_0    pypi
pyannote-core             4.5                      pypi_0    pypi
pyannote-database         4.1.3                    pypi_0    pypi
pyannote-metrics          3.2.1                    pypi_0    pypi
pyannote-pipeline         2.3                      pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pycodestyle               2.10.0          py310haa95532_0
pycparser                 2.21               pyhd3eb1b0_0
pydeprecate               0.3.2                    pypi_0    pypi
pydocstyle                6.3.0           py310haa95532_0
pydub                     0.25.1             pyhd8ed1ab_0    conda-forge
pyflakes                  3.0.1           py310haa95532_0
pygments                  2.11.2             pyhd3eb1b0_0
pylint                    2.16.2          py310haa95532_0
pylint-venv               2.3.0                    pypi_0    pypi
pyls-spyder               0.4.0              pyhd3eb1b0_0
pynacl                    1.5.0           py310h8cc25b3_0
pyopenssl                 23.0.0          py310haa95532_0
pyparsing                 3.0.9                    pypi_0    pypi
pyqt                      5.15.7          py310hd77b12b_0
pyqt5-sip                 12.11.0         py310hd77b12b_0
pyqtwebengine             5.15.7          py310hd77b12b_0
pyreadline3               3.4.1                    pypi_0    pypi
pyrsistent                0.18.0          py310h2bbff1b_0
pysocks                   1.7.1           py310haa95532_0
python                    3.10.9               h966fe2a_2
python-dateutil           2.8.2              pyhd3eb1b0_0
python-fastjsonschema     2.16.2          py310haa95532_0
python-lsp-black          1.2.1           py310haa95532_0
python-lsp-jsonrpc        1.0.0              pyhd3eb1b0_0
python-lsp-server         1.7.1           py310haa95532_0
python-slugify            5.0.2              pyhd3eb1b0_0
pytoolconfig              1.2.5           py310haa95532_1
pytorch-lightning         1.6.5                    pypi_0    pypi
pytorch-metric-learning   1.7.3                    pypi_0    pypi
pytz                      2022.7          py310haa95532_0
pywin32                   305             py310h2bbff1b_0
pywin32-ctypes            0.2.0           py310haa95532_1000
pyyaml                    6.0             py310h2bbff1b_1
pyzmq                     23.2.0          py310hd77b12b_0
qdarkstyle                3.0.2              pyhd3eb1b0_0
qstylizer                 0.2.2                    pypi_0    pypi
qt-main                   5.15.2               he8e5bd7_7
qt-webengine              5.15.9               hb9a9bb5_5
qtawesome                 1.2.2                    pypi_0    pypi
qtconsole                 5.4.0                    pypi_0    pypi
qtpy                      2.2.0           py310haa95532_0
qtwebkit                  5.212                h3ad3cdb_4
regex                     2022.10.31               pypi_0    pypi
requests                  2.28.1          py310haa95532_0
requests-oauthlib         1.3.1                    pypi_0    pypi
resampy                   0.4.2                    pypi_0    pypi
rich                      12.6.0                   pypi_0    pypi
rope                      1.7.0           py310haa95532_0
rsa                       4.9                      pypi_0    pypi
rtree                     1.0.1           py310h2eaa2aa_0
ruamel-yaml               0.17.21                  pypi_0    pypi
ruamel-yaml-clib          0.2.7                    pypi_0    pypi
scikit-learn              1.2.2                    pypi_0    pypi
scipy                     1.10.1                   pypi_0    pypi
semver                    2.13.0                   pypi_0    pypi
sentencepiece             0.1.97                   pypi_0    pypi
setuptools                65.6.3          py310haa95532_0
shellingham               1.5.0.post1              pypi_0    pypi
simplejson                3.18.3                   pypi_0    pypi
singledispatchmethod      1.0                      pypi_0    pypi
sip                       6.6.2           py310hd77b12b_0
six                       1.16.0             pyhd3eb1b0_1
snowballstemmer           2.2.0              pyhd3eb1b0_0
sortedcontainers          2.4.0              pyhd3eb1b0_0
soundfile                 0.10.3.post1             pypi_0    pypi
soupsieve                 2.3.2.post1     py310haa95532_0
speechbrain               0.5.13                   pypi_0    pypi
sphinx                    5.0.2           py310haa95532_0
sphinxcontrib-applehelp   1.0.2              pyhd3eb1b0_0
sphinxcontrib-devhelp     1.0.2              pyhd3eb1b0_0
sphinxcontrib-htmlhelp    2.0.0              pyhd3eb1b0_0
sphinxcontrib-jsmath      1.0.1              pyhd3eb1b0_0
sphinxcontrib-qthelp      1.0.3              pyhd3eb1b0_0
sphinxcontrib-serializinghtml 1.1.5              pyhd3eb1b0_0
spyder                    5.4.2           py310haa95532_0
spyder-kernels            2.4.2           py310haa95532_0
sqlalchemy                2.0.5.post1              pypi_0    pypi
sqlite                    3.40.1               h2bbff1b_0
stack_data                0.2.0              pyhd3eb1b0_0
sympy                     1.11.1                   pypi_0    pypi
tabulate                  0.9.0                    pypi_0    pypi
tensorboard               2.12.0                   pypi_0    pypi
tensorboard-data-server   0.7.0                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
text-unidecode            1.3                pyhd3eb1b0_0
textdistance              4.2.1              pyhd3eb1b0_0
threadpoolctl             3.1.0                    pypi_0    pypi
three-merge               0.1.1              pyhd3eb1b0_0
tinycss2                  1.2.1           py310haa95532_0
tk                        8.6.12               h2bbff1b_0
tokenizers                0.13.2                   pypi_0    pypi
toml                      0.10.2             pyhd3eb1b0_0
tomli                     2.0.1           py310haa95532_0
tomlkit                   0.11.1          py310haa95532_0
torch                     1.13.1                   pypi_0    pypi
torch-audiomentations     0.11.0                   pypi_0    pypi
torch-pitch-shift         1.2.2                    pypi_0    pypi
torchaudio                0.13.1                   pypi_0    pypi
torchmetrics              0.11.4                   pypi_0    pypi
tornado                   6.2             py310h2bbff1b_0
tqdm                      4.65.0                   pypi_0    pypi
traitlets                 5.7.1           py310haa95532_0
transformers              4.26.1                   pypi_0    pypi
typer                     0.7.0                    pypi_0    pypi
typing-extensions         4.4.0           py310haa95532_0
typing_extensions         4.4.0           py310haa95532_0
tzdata                    2022g                h04d1e81_0
ujson                     5.4.0           py310hd77b12b_0
unidecode                 1.2.0              pyhd3eb1b0_0
urllib3                   1.26.14         py310haa95532_0
vc                        14.2                 h21ff451_1
vs2015_runtime            14.27.29016          h5e58377_2
watchdog                  2.1.6           py310haa95532_0
wcwidth                   0.2.5              pyhd3eb1b0_0
webencodings              0.5.1           py310haa95532_1
werkzeug                  2.2.3                    pypi_0    pypi
whatthepatch              1.0.2           py310haa95532_0
wheel                     0.38.4          py310haa95532_0
whisper-timestamped       1.12.1                   pypi_0    pypi
win_inet_pton             1.1.0           py310haa95532_0
wincertstore              0.2             py310haa95532_2
wrapt                     1.14.1          py310h2bbff1b_0
xz                        5.2.10               h8cc25b3_1
yaml                      0.2.5                he774522_0
yapf                      0.31.0             pyhd3eb1b0_0
yarl                      1.8.2                    pypi_0    pypi
zeromq                    4.3.4                hd77b12b_0
zipp                      3.11.0          py310haa95532_0
zlib                      1.2.13               h8cc25b3_0
zstd                      1.5.2                h19a0ad4_0

Inconsistent number of segments: whisper_segments (462) != timestamped_word_segments (461)

Hello again, I have some more reproducible errors for you.
The errors are "Got start time outside of audio boundary" and "Inconsistent number of segments: whisper_segments (462) != timestamped_word_segments (461)".

This YouTube video reproduces the error: youtube
I downloaded the mp4 file of the YouTube video from here: youtube downloader

"condition_on_previous_text" is true, and the rest of the parameters are default settings.

"Got infinite logprob" assertion failure, with option condition_on_previous_text=False

Sometimes running audio triggers the "Got infinite logprob" assertion; all audio that triggers it works fine with the Whisper model from the OpenAI repo.
The error occurs in the "may_flush_segment" function:

# see GreedyDecoder.update()
chunck_indices = chunk_tokens_nosot + [tokenizer.eot]
assert len(chunk_logprobs) == len(chunck_indices), f"{len(chunk_logprobs)} != {len(chunck_indices)}"
logprobs = [logprob[i] for (logprob, i) in zip(chunk_logprobs, chunck_indices)]
assert min([p.isfinite().item() for p in logprobs]), "Got infinite logprob"

A sample of audio that I could get to reliably reproduce this error was the mp4 from this YouTube link -> https://www.youtube.com/watch?v=D9G1VOjN_84
I downloaded the MP4 from here -> https://yt1ss.net/en?q=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DD9G1VOjN_84
(I would upload the file, but it exceeds the 10 MB upload limit; the file size is 18 MB.)

The audio was run with the medium model, with condition_on_previous_text=False and the remaining parameters untouched.
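
For reference, a minimal reproduction sketch under the settings described above (medium model, condition_on_previous_text=False, everything else default); the file name is hypothetical:

import whisper_timestamped as whisper

model = whisper.load_model("medium")
audio = whisper.load_audio("sample.mp4")  # hypothetical path to the downloaded video
result = whisper.transcribe(model, audio, condition_on_previous_text=False)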

Word Timing Accuracy Falls Off After Pauses Example

Hey there, I love this project, thank you for all your work. I wanted to share an example of a video where the timing frequently falls behind or jumps ahead after pauses.

This is my transcribe call: whisper.transcribe(model, audio_file, verbose=None, condition_on_previous_text=False)
I am using the base model; the medium model has similar discrepancies, but they are less extreme. The timestamps I provided below are from the base model.

Video: https://www.youtube.com/watch?v=jPcCG0_U6Z4

For example, at 2:57 the audio says "Very unselfish pass", but whisper-timestamped places this line at 2:54.
Another example: at 6:13 the audio says "Important flick from Davis", but whisper-timestamped places it at 6:07.

Hope this helps!
Thank you.

** Edit **
If you are unable to replicate any timing issues with this, please feel free to close the issue.
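
One workaround that may help with drift around long pauses is the VAD option mentioned in the project description, which removes silences before the Whisper model runs. A sketch, assuming vad=True is accepted by transcribe as documented; the file name is hypothetical:

import whisper_timestamped as whisper

model = whisper.load_model("base")
audio = whisper.load_audio("match_commentary.mp4")  # hypothetical file name
# Run voice activity detection first so long silences do not shift timestamps.
result = whisper.transcribe(model, audio, vad=True, condition_on_previous_text=False)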

Sometimes segments with "no_speech_prob" larger than "no_speech_threshold" and "avg_logprob" lower than "logprob_threshold" still appear

Sorry to do this again, and I'm not sure if this is by design, but sometimes segments with both no_speech_prob larger than no_speech_threshold and avg_logprob lower than logprob_threshold still appear, which often results in hallucinated segments. This happened in the first segment; I'm not sure if it happens in other areas.

In this specific case this was with beam_size=5, condition_on_previous_text=False, best_of=5 and temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0).
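
For context, Whisper's decoding loop treats a segment as non-speech only when both conditions hold at once (no_speech_prob above no_speech_threshold and avg_logprob below logprob_threshold). A hedged post-filtering sketch that applies the same rule to a returned result, using Whisper's default thresholds:

# Drop segments that satisfy Whisper's no-speech heuristic but slipped through.
NO_SPEECH_THRESHOLD = 0.6   # Whisper's default no_speech_threshold
LOGPROB_THRESHOLD = -1.0    # Whisper's default logprob_threshold

def drop_no_speech_segments(result):
    result["segments"] = [
        seg for seg in result["segments"]
        if not (seg["no_speech_prob"] > NO_SPEECH_THRESHOLD
                and seg["avg_logprob"] < LOGPROB_THRESHOLD)
    ]
    return result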

Preparing metadata (setup.py)

I'm trying to install your project with pip and I get errors.
Please fix it.

pip3 install git+https://github.com/Jeronymous/whisper-timestamped

(screenshot of the error attached)

Make Whisper Requirement more flexible to be able to use a specific Whisper version (as some breakages were introduced in 20230306)

The latest changes to Whisper (adding word-level timestamps) have added a dynamic requirement (triton) that breaks if you don't have a specific environment. If we could change the requirements.txt in whisper-timestamped to target a specific Whisper version (preferably one before these latest changes, maybe the last stable release from January 24?), that would be much more stable and would not result in breakages. A sketch of such a pin is shown below.
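
A sketch of the proposed pin; 20230124 is the January 24 release of openai-whisper on PyPI, the last one before the word-timestamp changes:

# requirements.txt (sketch): pin openai-whisper to the last release before
# the 20230306 changes.
openai-whisper==20230124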

AssertionError: Inconsistent number of segments: whisper_segments (12) != timestamped_word_segments (11)

On one of the audio files I received an AssertionError; what could this be related to?

  File "C:\Users\tenet13\PycharmProjects\testWhisper\main.py", line 77, in handle_single_file
    transcribe_result = whisper_timestamped.transcribe(model, file_path, language='Russian')
  File "C:\Users\tenet13\PycharmProjects\testWhisper\venv\lib\site-packages\whisper_timestamped\transcribe.py", line 307, in transcribe_timestamped
    assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"
AssertionError: Inconsistent number of segments: whisper_segments (12) != timestamped_word_segments (11)

(screenshot attached)

Inconsistent number of segments error

Hi!

Recently I launched a transcription and received the following error:

File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 259, in transcribe_timestamped
(transcription, words) = _transcribe_timestamped_efficient(model, audio,
File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 851, in _transcribe_timestamped_efficient
assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"
AssertionError: Inconsistent number of segments: whisper_segments (57) != timestamped_word_segments (56)

Do you know the reason behind it?

If you need some more details, please let me know.
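
Not an answer to the root cause, but a hedged fallback sketch: retry with the slower alignment path when the efficient one raises this assertion. The naive_approach parameter is an assumption, inferred from the _transcribe_timestamped_efficient/_transcribe_timestamped_naive functions visible in the tracebacks on this page:

import whisper_timestamped

model = whisper_timestamped.load_model("base")
audio = whisper_timestamped.load_audio("input.wav")  # hypothetical file
try:
    result = whisper_timestamped.transcribe(model, audio)
except AssertionError:
    # Fall back to the naive (slower, re-decoding) alignment path.
    result = whisper_timestamped.transcribe(model, audio, naive_approach=True)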

Suggestion: Problem with small words in SRT files

I think it would be good to add an option to combine words, for example 2 words or a maximum of 3. Some words are too short, and in video editors where we could use the subtitles they would simply not appear; the only way to make them show up is to raise the frame rate, which almost forces you to use qualities greater than 4K. Examples of such words are "is" or "in" in English, or "de", "tal", "como" in Spanish. Filtering these small words would be almost impossible, unless words shorter than 4 characters were merged into a single time slot with a neighboring word; but I think a parameter for how many words go into each time slot is more convenient, and it would not bother anyone visually either. It is only a recommendation, sorry for the inconvenience. (A sketch of this idea follows.)
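
A minimal sketch of the suggested option, merging the per-word entries of a result into groups of at most N words before writing subtitle lines (field names follow the word output shown elsewhere on this page: "text", "start", "end"):

def group_words(words, max_words=3):
    # Merge consecutive word entries so each subtitle line holds up to
    # max_words words; short words inherit the duration of their group.
    groups = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        groups.append({
            "text": " ".join(w["text"] for w in chunk),
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
        })
    return groups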

Word-level output is combined for languages that don't use spaces

Japanese is a good example, here is a single word output:

{"text"=>"いきますニュースタブでのサイトメイク表記が実際と違う", "start"=>0.02, "end"=>4.18, "confidence"=>0.719}

Many words are combined together.
Here is an example audio to test with:

japanese-example.1.mp4
  • Update
    We are noticing that this is a situation where language detection does not occur properly inside _transcribe_timestamped_efficient(), but it does work well with _transcribe_timestamped_naive(). Based on logging inside should_use_space(), switching to naive fixes the issue (when using efficient, the language is detected as "en", and subsequently the incorrect spacing variable is used). Could you explain the difference between the two (efficient/naive)?
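
A hedged workaround sketch based on the update above: pass the language explicitly so should_use_space() does not depend on a mis-detected language (whether this alone is enough, without switching to the naive path, is an assumption):

import whisper_timestamped as whisper

model = whisper.load_model("base")
audio = whisper.load_audio("japanese-example.1.mp4")
# Forcing the language avoids the "en" mis-detection described above, so the
# no-space convention for Japanese is used when words are assembled.
result = whisper.transcribe(model, audio, language="ja")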

transcribe fails with error: `Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (200, 200) at dimension 2 of input [1, 1, 144]`

I'm getting an error running transcribe with these parameters:

model = whisper_timestamped.load_model(name="large-v2", device="cuda")
input = whisper_timestamped.load_audio(wav_file)
result = whisper_timestamped.transcribe(model, input, language="English", verbose=False, best_of=5, beam_size=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0))

If I remove best_of=5, beam_size=5, and temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), transcribe works, but the accuracy is very poor.

result = whisper_timestamped.transcribe(
  File "/root/miniconda3/envs/whisper-timestamped/lib/python3.10/site-packages/whisper_timestamped/transcribe.py", line 207, in transcribe_timestamped
    (transcription, words) = _transcribe_timestamped_naive(model, audio, **alignment_options, **whisper_options, **other_options)
  File "/root/miniconda3/envs/whisper-timestamped/lib/python3.10/site-packages/whisper_timestamped/transcribe.py", line 691, in _transcribe_timestamped_naive
    mfcc = whisper.log_mel_spectrogram(sub_audio).to(model.device)
  File "/root/miniconda3/envs/whisper-timestamped/lib/python3.10/site-packages/whisper/audio.py", line 115, in log_mel_spectrogram
    stft = torch.stft(audio, N_FFT, HOP_LENGTH, window=window, return_complex=True)
  File "/root/miniconda3/envs/whisper-timestamped/lib/python3.10/site-packages/torch/functional.py", line 630, in stft
    input = F.pad(input.view(extended_shape), [pad, pad], pad_mode)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (200, 200) at dimension 2 of input [1, 1, 144]
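
For what it's worth, the traceback shows log_mel_spectrogram receiving a 144-sample chunk, while torch.stft's reflect padding of (200, 200) requires an input longer than 200 samples. A hedged workaround sketch that zero-pads very short audio up to one STFT window before transcribing (N_FFT = 400 is Whisper's window size):

import numpy as np

N_FFT = 400  # Whisper's STFT window size

def pad_short_audio(audio, min_len=N_FFT):
    # torch.stft reflect-pads by N_FFT // 2 = 200 samples on each side, which
    # fails when the input has 200 samples or fewer; zero-pad such chunks.
    if len(audio) < min_len:
        audio = np.pad(audio, (0, min_len - len(audio)))
    return audio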

Issue of duplicate word lines

The script works just fine as I try to transcribe an audio file in the Arabic language, but when it reaches 6 min out of 8, it starts duplicating one line; the timing is correct but the text is not.

(screenshot attached)
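
Until the root cause is found, a hedged post-processing sketch that collapses consecutive segments carrying identical text (this masks the symptom rather than fixing the underlying repetition):

def drop_repeated_segments(result):
    # Keep a segment only if its text differs from the previous segment's.
    deduped = []
    for seg in result["segments"]:
        if not deduped or seg["text"].strip() != deduped[-1]["text"].strip():
            deduped.append(seg)
    result["segments"] = deduped
    return result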
