
cog-whisper's Introduction

NOTE:
Some folks reported a significant slowdown in the latest version, which includes the large-v2 checkpoint, so it has been temporarily removed from https://replicate.com/openai/whisper. It has been added at https://replicate.com/cjwbw/whisper instead if you want to access it.

I have personally tested both versions but did not observe the reported slowdown. The issue has been raised with the team to see how to proceed with merging large-v2 back into the mainline model.

Whisper

[Blog] [Paper] [Model card] [Colab example] Replicate

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
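For example, transcription takes a few lines of the Python API (a minimal sketch; "audio.mp3" is a placeholder path):

import whisper

# Load a checkpoint and transcribe; the language is detected automatically
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])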

Approach


A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
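In the Python API, the task specifier surfaces as a decoding option: the same model transcribes or translates depending on which special tokens are selected (a sketch using the lower-level API; "audio.mp3" is a placeholder):

import whisper

model = whisper.load_model("base")

# Prepare 30 seconds of log-Mel spectrogram input
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The task token switches the decoder from transcription to X->English translation
options = whisper.DecodingOptions(task="translate")
result = whisper.decode(model, mel, options)
print(result.text)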

Setup

We used Python 3.9.9 and PyTorch 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.7 or later and recent PyTorch versions. The codebase also depends on a few Python packages, most notably HuggingFace Transformers for their fast tokenizer implementation and ffmpeg-python for reading audio files. The following command will pull and install the latest commit from this repository, along with its Python dependencies:

pip install git+https://github.com/openai/whisper.git 

To update the package to the latest version of this repository, please run:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

It also requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

You may also need Rust installed, in case tokenizers does not provide a pre-built wheel for your platform. If you see installation errors during the pip install command above, please follow the Getting Started page to install the Rust development environment. Additionally, you may need to configure the PATH environment variable, e.g. export PATH="$HOME/.cargo/bin:$PATH". If the installation fails with No module named 'setuptools_rust', you need to install setuptools_rust, e.g. by running:

pip install setuptools-rust

Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.

Size    Parameters  English-only model  Multilingual model  Required VRAM  Relative speed
tiny    39 M        tiny.en             tiny                ~1 GB          ~32x
base    74 M        base.en             base                ~1 GB          ~16x
small   244 M       small.en            small               ~2 GB          ~6x
medium  769 M       medium.en           medium              ~5 GB          ~2x
large   1550 M      N/A                 large               ~10 GB         1x

For English-only applications, the .en models tend to perform better, especially for the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.

Whisper's performance varies widely depending on the language. The figure below shows a WER breakdown by language on the FLEURS dataset, using the large model. More WER and BLEU scores for the other models and datasets can be found in Appendix D of the paper.

WER breakdown by language

More examples

Please use the 🙌 Show and tell category in Discussions for sharing more example usages of Whisper and third-party extensions such as web demos, integrations with other tools, ports for different platforms, etc.

License

The code and the model weights of Whisper are released under the MIT License. See LICENSE for further details.

cog-whisper's People

Contributors

abumj, bfirsh, bquast, brainwane, bubthegreat, chenxwh, codebycaleb, cool-rr, corentinj, dmarx, drdaxxy, elieron, eudoxos, fcakyon, flockonus, gglanzani, hanacchi, jibinmathew69, jongwook, ldanilov, mgoin, michaelmonashev, nick-konovalchuk, sawadata, sradc, stupid-kid-af, szpasztor, tomstuart, vickianand, vulumecode


cog-whisper's Issues

Doesn't include English-only models as an option

Whisper docs say:

For English-only applications, the .en models tend to perform better, especially for the tiny.en and base.en models.

I've noticed it transcribing English audio files into German and Japanese. I'd like to try using the .en models instead.
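A sketch of how the English-only checkpoints might be exposed in predict.py (hypothetical; the parameter name and surrounding code are assumptions, not the repo's actual implementation):

# Hypothetical sketch, not the repo's actual predict.py
from cog import BasePredictor, Input, Path
import whisper

class Predictor(BasePredictor):
    def predict(
        self,
        audio: Path = Input(description="Audio file to transcribe"),
        model_name: str = Input(
            default="base",
            choices=[
                "tiny", "tiny.en", "base", "base.en",
                "small", "small.en", "medium", "medium.en", "large",
            ],
            description="Whisper checkpoint; .en variants are English-only",
        ),
    ) -> str:
        # Loading per request keeps the sketch short; a real predictor
        # would cache models in setup()
        model = whisper.load_model(model_name)
        return model.transcribe(str(audio))["text"]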

Both m4a and mp4 audio files aren't fully transcribed

Transcribing iPhone voice memos directly from their native m4a format didn't work.

It transcribed about half of my 25-minute memo. (If you have it output the timestamps, you can see it tries to read later audio but only transcribes ...)

If I convert it to an mp3 before sending it to cog-whisper (or the timestamp version), it succeeds.
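The workaround can be scripted with the ffmpeg-python package the project already depends on (a sketch; "memo.m4a" is a placeholder file):

import ffmpeg

# Re-encode the m4a voice memo to mp3 before uploading
ffmpeg.input("memo.m4a").output("memo.mp3").run()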

Similarly, someone showed up in Discord with an issue of mp4 files being truncated.

Significant latency regression in latest release

Hi @chenxwh and replicate,

The latest available version seems to have a significant latency regression compared to the version I have been using for some time now. Running the same input against the large-v1 model (New) and the large model (Old), on what I believe are warm models, shows drastically different performance characteristics.

From the Replicate runs page, New is ~10x slower than Old on equivalent data:

Version  ID          Model           Source  Status     Run Time      Created
New      imiwp7wkk…  openai/whisper  API     Succeeded  57.6 seconds  a minute ago
New      hhj3ijrde…  openai/whisper  API     Succeeded  44.4 seconds  2 minutes ago
Old      fdhdfyvmf…  openai/whisper  API     Succeeded  3.0 seconds   6 minutes ago

In my metrics you can see a latency shift also in the ~10x range (latency graphs for New and Old attached as screenshots).

New version sha: 23241e5731b44fcb5de68da8ebddae1ad97c5094d24f94ccb11f7c1d33d661e2
Old version sha: b6e7ea7aef18444c29d974fee51ffc1e47e1699cfaf4e5cde0ba47a8db74f3b6

Looking deeper, I decided to "bisect" versions with the following test (a code sketch follows the list):

  1. Warm up the model with one request
  2. When the warm up request returns, send another request and use that as a measure of performance
  3. Mark as bad if transcription time is >30s, otherwise mark good
    Bad: 23241e5731b44fcb5de68da8ebddae1ad97c5094d24f94ccb11f7c1d33d661e2
    Good: 089ea17a12d0b9fc2f81d620cc6e686de7a156007830789bf186392728ac25e8
    Good: 30414ee7c4fffc37e260fcab7842b5be470b9b840f2b608f5baa9bbef9a259ed
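For reference, a sketch of that test against the Replicate API (hypothetical code; the client usage and audio URL are assumptions, not what I actually ran):

import time
import replicate

# Candidate version under test (the "Bad" sha above)
VERSION = "openai/whisper:23241e5731b44fcb5de68da8ebddae1ad97c5094d24f94ccb11f7c1d33d661e2"
AUDIO = "https://example.com/sample.mp3"  # placeholder input

def timed_run():
    start = time.monotonic()
    replicate.run(VERSION, input={"audio": AUDIO})
    return time.monotonic() - start

timed_run()                # 1. warm up the model
elapsed = timed_run()      # 2. measure against the warm model
print("bad" if elapsed > 30 else "good")  # 3. >30 s marks the version bad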

So it really looks like the latest change introduced a regression. I'm going to pin my version away from the latest, but thought I would let the team know.

Can you please add a Language parameter?

If I already know which language the audio file uses, there's no need to waste time trying to detect it. Whisper supports a --language <lang> parameter for exactly that; it would be great to have the same option on the Replicate API. Cheers
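In the underlying Python API this is a single keyword argument (a sketch; "audio.mp3" is a placeholder):

import whisper

model = whisper.load_model("base")
# Forcing the language skips the detection pass
result = model.transcribe("audio.mp3", language="de")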

Add option to SRT translate output - Fix inside

Translation currently works only for plain text. That's not useful if you want to, for example, translate a YouTube video and upload English subtitles. The modified code below makes sure the translation output is also formatted as SRT.

Starting at line 137 of predict.py:

if translate:
    translationResult = model.transcribe(
        str(audio), task="translate", temperature=temperature, **args
    )
    # MODIFIED to translate to SRT
    translation = write_srt(translationResult["segments"])

return ModelOutput(
    segments=result["segments"],
    detected_language=LANGUAGES[result["language"]],
    transcription=transcription,
    translation=translation if translate else None,
)

I tested it on another model and it works great.

Two questions:

  1. What's the difference from the original Whisper?
  2. Do you know if this can run on a phone (entirely on the phone's own hardware)?

Translated subtitles

If translate-to-English is set, the only translation output is a block of plain text, even if transcription is set to output in subtitle format. The subtitle transcription is in the audio file's native language, but it should also be translated.

Also, just a small aside, but there should be a blank line between subtitle entries:

Output:

156
00:06:03,000 --> 00:06:04,000
そのトーンで
157
00:06:04,000 --> 00:06:05,000
いやうまい

Correct:

156
00:06:03,000 --> 00:06:04,000
そのトーンで

157
00:06:04,000 --> 00:06:05,000
いやうまい

No option to get output transcriptions in any popular subtitle format

Hi,

Thank you for making whisper available. It's a great tool.
After reading this article https://simonwillison.net/2022/Sep/30/action-transcription/ I decided to give whisper a go.
It worked remarkably well.
However, there's no way to get the output transcription in SRT or VTT format.

whisper does support both formats -> https://github.com/openai/whisper/blob/0b1ba3d46ebf7fe6f953acfd8cad62a4f851b49f/whisper/transcribe.py#L306-L312

It'd be great to have it, so the users could generate timed transcriptions (subtitles) for the videos they'd like to watch.
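Exposing it could be as simple as calling the helper Whisper already ships (a sketch assuming the write_srt helper at the commit linked above; "audio.mp3" is a placeholder, and the helper's location may differ in other versions):

import whisper
from whisper.utils import write_srt

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Serialize the timed segments as SRT
with open("audio.srt", "w", encoding="utf-8") as srt_file:
    write_srt(result["segments"], file=srt_file)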

Here's an example scenario describing the use case:

Given I have an audio track file in any language supported by whisper
When I run the model with an argument to generate timed transcription in "(srt|vtt)" format
And I GET the prediction via API
Then I should see the transcription and additional timed transcription in "(srt|vtt)" format

Thanks

Latest model is deleted?

https://replicate.com/openai/whisper/versions

The latest model (23241e5731b44fcb5de68da8ebddae1ad97c5094d24f94ccb11f7c1d33d661e2) is not found on the web, and I got a few errors with the message Model (version:23241e5731b44fcb5de68da8ebddae1ad97c5094d24f94ccb11f7c1d33d661e2) not found, defaulting to 30414ee7c4fffc37e260fcab7842b5be470b9b840f2b608f5baa9bbef9a259ed.
The parameters differ between these versions, so requests will not work if the fallback happens.
Is this a replicate issue?

