
cog-whisper's Introduction

NOTE:
Some folks reported a significant slowdown in the latest version, which includes the large-v2 checkpoint, so it has been temporarily removed from https://replicate.com/openai/whisper. It has been added at https://replicate.com/cjwbw/whisper instead if you want to access it.

I have personally tested both versions but did not observe the reported slowdown. The issue has been raised with the team to see how to proceed with merging large-v2 back into the mainline model.

Whisper

[Blog] [Paper] [Model card] [Colab example] Replicate

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
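For example, transcription takes a few lines of the Python API (a minimal sketch; "audio.mp3" is a placeholder path):

import whisper

# Load a checkpoint and transcribe; the language is detected automatically
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])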

Approach


A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
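In the Python API, the task specifier surfaces as a decoding option: the same model transcribes or translates depending on which special tokens are selected (a sketch using the lower-level API; "audio.mp3" is a placeholder):

import whisper

model = whisper.load_model("base")

# Prepare 30 seconds of log-Mel spectrogram input
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The task token switches the decoder from transcription to X->English translation
options = whisper.DecodingOptions(task="translate")
result = whisper.decode(model, mel, options)
print(result.text)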

Setup

We used Python 3.9.9 and PyTorch 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.7 or later and recent PyTorch versions. The codebase also depends on a few Python packages, most notably HuggingFace Transformers for their fast tokenizer implementation and ffmpeg-python for reading audio files. The following command will pull and install the latest commit from this repository, along with its Python dependencies:

pip install git+https://github.com/openai/whisper.git 

To update the package to the latest version of this repository, please run:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

It also requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

You may also need Rust installed, in case tokenizers does not provide a pre-built wheel for your platform. If you see installation errors during the pip install command above, please follow the Getting Started page to install the Rust development environment. Additionally, you may need to configure the PATH environment variable, e.g. export PATH="$HOME/.cargo/bin:$PATH". If the installation fails with No module named 'setuptools_rust', you need to install setuptools_rust, e.g. by running:

pip install setuptools-rust

Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.

Size    Parameters  English-only model  Multilingual model  Required VRAM  Relative speed
tiny    39 M        tiny.en             tiny                ~1 GB          ~32x
base    74 M        base.en             base                ~1 GB          ~16x
small   244 M       small.en            small               ~2 GB          ~6x
medium  769 M       medium.en           medium              ~5 GB          ~2x
large   1550 M      N/A                 large               ~10 GB         1x

For English-only applications, the .en models tend to perform better, especially for the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.

Whisper's performance varies widely depending on the language. The figure below shows a WER breakdown by language on the FLEURS dataset, using the large model. More WER and BLEU scores for the other models and datasets can be found in Appendix D of the paper.

WER breakdown by language

More examples

Please use the 🙌 Show and tell category in Discussions for sharing more example usages of Whisper and third-party extensions such as web demos, integrations with other tools, ports for different platforms, etc.

License

The code and the model weights of Whisper are released under the MIT License. See LICENSE for further details.

cog-whisper's People

Contributors

abumj, bfirsh, bquast, brainwane, bubthegreat, chenxwh, codebycaleb, cool-rr, corentinj, dmarx, drdaxxy, elieron, eudoxos, fcakyon, flockonus, gglanzani, hanacchi, jibinmathew69, jongwook, ldanilov, mgoin, michaelmonashev, nick-konovalchuk, sawadata, sradc, stupid-kid-af, szpasztor, tomstuart, vickianand, vulumecode


cog-whisper's Issues

Doesn't include English-only models as an option

Whisper docs say:

For English-only applications, the .en models tend to perform better, especially for the tiny.en and base.en models.

I've noticed it transcribing English audio files into German and Japanese. I'd like to try using the .en models instead.
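A sketch of how the English-only checkpoints might be exposed in predict.py (hypothetical; the parameter name and surrounding code are assumptions, not the repo's actual implementation):

# Hypothetical sketch, not the repo's actual predict.py
from cog import BasePredictor, Input, Path
import whisper

class Predictor(BasePredictor):
    def predict(
        self,
        audio: Path = Input(description="Audio file to transcribe"),
        model_name: str = Input(
            default="base",
            choices=[
                "tiny", "tiny.en", "base", "base.en",
                "small", "small.en", "medium", "medium.en", "large",
            ],
            description="Whisper checkpoint; .en variants are English-only",
        ),
    ) -> str:
        # Loading per request keeps the sketch short; a real predictor
        # would cache models in setup()
        model = whisper.load_model(model_name)
        return model.transcribe(str(audio))["text"]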

Both m4a and mp4 audio files aren't fully transcribed

Transcribing iPhone voice memos directly from their native m4a format didn't work.

It transcribed about half of my 25-minute memo. (If you have it output the timestamps, you can see it tries to read later audio but only transcribes ...)

If I convert it to an mp3 before sending it to cog-whisper (or the timestamp version), it succeeds.
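The workaround can be scripted with the ffmpeg-python package the project already depends on (a sketch; "memo.m4a" is a placeholder file):

import ffmpeg

# Re-encode the m4a voice memo to mp3 before uploading
ffmpeg.input("memo.m4a").output("memo.mp3").run()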

Similarly, someone showed up in Discord with an issue of mp4 files being truncated.

Significant latency regression in latest release

Hi @chenxwh and replicate,

The latest available version seems to have a significant latency regression compared to the version I have been using for some time now. Running the same input against the large-v1 model (New) and the large model (Old), on what I believe are warm models, shows drastically different performance characteristics.

From the Replicate runs page, New is ~10x slower than Old on equivalent data:

Version  ID          Model           Source  Status     Run Time      Created
New      imiwp7wkk…  openai/whisper  API     Succeeded  57.6 seconds  a minute ago
New      hhj3ijrde…  openai/whisper  API     Succeeded  44.4 seconds  2 minutes ago
Old      fdhdfyvmf…  openai/whisper  API     Succeeded  3.0 seconds   6 minutes ago

In my metrics you can see a latency shift also in the ~10x range (latency graphs for New and Old attached as screenshots).

New version sha: 23241e5731b44fcb5de68da8ebddae1ad97c5094d24f94ccb11f7c1d33d661e2
Old version sha: b6e7ea7aef18444c29d974fee51ffc1e47e1699cfaf4e5cde0ba47a8db74f3b6

Looking deeper, I decided to "bisect" versions with the following test (a code sketch follows the list):

  1. Warm up the model with one request
  2. When the warm up request returns, send another request and use that as a measure of performance
  3. Mark as bad if transcription time is >30s, otherwise mark good
    Bad: 23241e5731b44fcb5de68da8ebddae1ad97c5094d24f94ccb11f7c1d33d661e2
    Good: 089ea17a12d0b9fc2f81d620cc6e686de7a156007830789bf186392728ac25e8
    Good: 30414ee7c4fffc37e260fcab7842b5be470b9b840f2b608f5baa9bbef9a259ed
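For reference, a sketch of that test against the Replicate API (hypothetical code; the client usage and audio URL are assumptions, not what I actually ran):

import time
import replicate

# Candidate version under test (the "Bad" sha above)
VERSION = "openai/whisper:23241e5731b44fcb5de68da8ebddae1ad97c5094d24f94ccb11f7c1d33d661e2"
AUDIO = "https://example.com/sample.mp3"  # placeholder input

def timed_run():
    start = time.monotonic()
    replicate.run(VERSION, input={"audio": AUDIO})
    return time.monotonic() - start

timed_run()                # 1. warm up the model
elapsed = timed_run()      # 2. measure against the warm model
print("bad" if elapsed > 30 else "good")  # 3. >30 s marks the version bad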

So it really looks like the latest change introduced a regression. I'm going to pin my version away from the latest, but thought I would let the team know.

Can you please add a Language parameter?

If I already know which language the audio file uses, there's no need to waste time trying to detect it. Whisper supports a --language <lang> parameter for exactly that; it would be great to have the same option on the Replicate API. Cheers
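In the underlying Python API this is a single keyword argument (a sketch; "audio.mp3" is a placeholder):

import whisper

model = whisper.load_model("base")
# Forcing the language skips the detection pass
result = model.transcribe("audio.mp3", language="de")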

Add option to SRT translate output - Fix inside

Translation currently works only for plain text. That's not useful if you want to, for example, translate a YouTube video and upload English subtitles. The modified code below makes sure the translation output is also formatted as SRT.

Starting at line 137 of predict.py:

if translate:
    translationResult = model.transcribe(
        str(audio), task="translate", temperature=temperature, **args
    )
    # MODIFIED to translate to SRT
    translation = write_srt(translationResult["segments"])

return ModelOutput(
    segments=result["segments"],
    detected_language=LANGUAGES[result["language"]],
    transcription=transcription,
    translation=translation if translate else None,
)

I tested it on another model and it works great.

Two questions:

  1. What's the difference from the original Whisper?
  2. Do you know if this can run on a phone (entirely on the phone's own hardware)?

Translated subtitles

If translate-to-English is set, the only translation output is a block of plain text, even if transcription is set to output in subtitle format. The subtitle transcription is in the audio file's native language, but it should also be translated.

Also, just a small aside, but there should be a blank line between subtitle entries:

Output:

156
00:06:03,000 --> 00:06:04,000
そのトーンで
157
00:06:04,000 --> 00:06:05,000
いやうまい

Correct:

156
00:06:03,000 --> 00:06:04,000
そのトーンで

157
00:06:04,000 --> 00:06:05,000
いやうまい

No option to get output transcriptions in any popular subtitle format

Hi,

Thank you for making whisper available. It's a great tool.
After reading this article https://simonwillison.net/2022/Sep/30/action-transcription/ I decided to give whisper a go.
It worked remarkably well.
However, there's no way to get the output transcription in SRT or VTT format.

whisper does support both formats -> https://github.com/openai/whisper/blob/0b1ba3d46ebf7fe6f953acfd8cad62a4f851b49f/whisper/transcribe.py#L306-L312

It'd be great to have it, so the users could generate timed transcriptions (subtitles) for the videos they'd like to watch.
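Exposing it could be as simple as calling the helper Whisper already ships (a sketch assuming the write_srt helper at the commit linked above; "audio.mp3" is a placeholder, and the helper's location may differ in other versions):

import whisper
from whisper.utils import write_srt

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Serialize the timed segments as SRT
with open("audio.srt", "w", encoding="utf-8") as srt_file:
    write_srt(result["segments"], file=srt_file)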

Here's an example scenario describing the use case:

Given I have an audio track file in any language supported by whisper
When I run the model with an argument to generate timed transcription in "(srt|vtt)" format
And I GET the prediction via API
Then I should see the transcription and additional timed transcription in "(srt|vtt)" format

Thanks

Latest model is deleted?

https://replicate.com/openai/whisper/versions

The latest model (23241e5731b44fcb5de68da8ebddae1ad97c5094d24f94ccb11f7c1d33d661e2) is not found on the web, and I got a few errors with the message Model (version:23241e5731b44fcb5de68da8ebddae1ad97c5094d24f94ccb11f7c1d33d661e2) not found, defaulting to 30414ee7c4fffc37e260fcab7842b5be470b9b840f2b608f5baa9bbef9a259ed.
The parameters differ between these versions, so requests will not work if the fallback happens.
Is this a replicate issue?

