
Comments (8)

sanchit-gandhi commented on May 16, 2024

This looks more or less correct! The benchmarks we ran were from a bunch of YouTube videos (I can give you the URLs), and transcription time is somewhat dependent on the audio file. The slower transcription time here could be because Whisper is getting caught in a hallucination in one of the batches, causing it to generate until it hits the max length (448 tokens).

You could check whether the text has repetitions, or try instantiating the pipeline with a lower max length (we set it to 128 and got complete transcriptions):

# instantiate the pipeline in float16 with a lower max generation length
import jax.numpy as jnp
from whisper_jax import FlaxWhisperPipline
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.float16, batch_size=32, max_length=128)
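
To check for repetitions, a quick sketch along these lines could work - it assumes the pipeline output carries the transcription under a "text" key, as with the Transformers ASR pipeline, and the audio path is illustrative:

# count repeated trigrams in the transcript; one trigram dominating usually signals a hallucination loop
from collections import Counter
outputs = pipeline("audio.mp3")  # hypothetical input file
words = outputs["text"].split()
trigrams = Counter(zip(words, words[1:], words[2:]))
print(trigrams.most_common(5))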


sanchit-gandhi commented on May 16, 2024

Correct!


ahxxm commented on May 16, 2024

Reproduced the hallucination with this audio file on Hugging Face.

it's impressively fast

but 16G of memory seems not to be enough for the statement jax = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=16) - how much memory does instantiating the pipeline require? From a very rough observation, the GPU memory (Tesla T4, 14G) was filled instantly, then memory grew slowly until it hit 16G, and the process was OOM-killed
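
One hypothetical thing to try (not verified against this particular OOM, which may be host RAM rather than GPU memory): JAX preallocates most of the GPU memory up front by default, and this can be turned off or capped with its standard XLA environment variables; a smaller batch size also lowers peak memory at some cost in throughput.

import os
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"   # allocate on demand instead of upfront
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.5"  # or cap the preallocated fraction

import jax.numpy as jnp
from whisper_jax import FlaxWhisperPipline
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=8)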

Just followed the discussions in #7 and the Transformers issue - it seems we haven't found the cause yet.


sanchit-gandhi commented on May 16, 2024

It's also worth making sure your audio is already at 16 kHz so that we don't resample in the Flax Whisper pipeline (which can be lengthy for long audio files).
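
For example, a minimal sketch that resamples up front - it assumes the pipeline accepts a dict with "array" and "sampling_rate" keys, as the Transformers ASR pipeline does, and the file name is illustrative:

import librosa

# load and resample to 16 kHz once, so the pipeline receives audio at its expected rate
audio, sr = librosa.load("audio.mp3", sr=16000)
outputs = pipeline({"array": audio, "sampling_rate": sr})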


sanchit-gandhi commented on May 16, 2024

The absolute transcription time is somewhat dependent on the audio sample - since it's proportional to the number of tokens generated, it'll depend on speaking rate, propensity to hallucinate, speech-to-silence ratio, etc. Since what we really care about is the relative time between systems (rather than necessarily the absolute ones), it would be cool to benchmark with the same audio file using OpenAI's Whisper and Transformers' Whisper on GPU to see what we're aiming for.
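
A rough benchmarking sketch for that comparison - the openai-whisper and Transformers calls are their standard APIs, but the file name and timing scaffolding are illustrative, and model-load time is deliberately kept out of the measurement:

import time
import whisper  # openai-whisper
from transformers import pipeline as hf_pipeline

# OpenAI Whisper on GPU
model = whisper.load_model("large-v2")
start = time.time()
result = model.transcribe("audio.wav")
print(f"openai-whisper: {time.time() - start:.1f}s")

# Transformers Whisper on GPU, chunked long-form transcription
asr = hf_pipeline("automatic-speech-recognition", model="openai/whisper-large-v2", device=0, chunk_length_s=30)
start = time.time()
out = asr("audio.wav")
print(f"transformers: {time.time() - start:.1f}s")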


AndrewZhaoLuo commented on May 16, 2024

One more question so I can do some fair comparisons across libraries: if I am reading the codebase correctly, this is doing a greedy search (i.e. beam_size=1). Is that correct?
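
If so, a hedged sketch of pinning the other libraries to greedy decoding for the comparison - these are standard openai-whisper and Transformers options, though whether they match whisper-jax's decoding exactly is an assumption:

import whisper  # openai-whisper
from transformers import pipeline as hf_pipeline

# openai-whisper: temperature 0 with beam_size unset gives greedy decoding
model = whisper.load_model("large-v2")
result = model.transcribe("audio.wav", temperature=0.0, beam_size=None)

# Transformers: num_beams=1 forces greedy search
asr = hf_pipeline("automatic-speech-recognition", model="openai/whisper-large-v2", device=0)
out = asr("audio.wav", generate_kwargs={"num_beams": 1})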


AndrewZhaoLuo commented on May 16, 2024

Thanks for all your help.

Finally, it might be good to just have the audio you used to benchmark. @sanchit-gandhi, can you direct me to the YouTube video?


s-tomar commented on May 16, 2024

Hi,

On a CPU-only system (no TPU/GPU), the following degrades overall performance: for an audio file under 10 minutes, it consumes almost 25% more time.

import librosa
SAMPLING_RATE = 16000
# librosa resamples the mp3 to 16 kHz while loading, which is slow on CPU
audio, sr = librosa.load('test_audio.mp3', sr=SAMPLING_RATE)

I guess there are quite a few parameters to be tuned to achieve good/best performance, and improper tuning can worsen the situation 🤔
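
One hypothetical way around the resampling cost (not something verified in this thread): convert the file to 16 kHz once, offline, so librosa does not redo the sample-rate conversion on every run. Assumes ffmpeg is installed; file names are illustrative.

import subprocess
import librosa

# convert to 16 kHz mono WAV a single time, up front
subprocess.run(["ffmpeg", "-i", "test_audio.mp3", "-ar", "16000", "-ac", "1", "test_audio_16k.wav"], check=True)

# loading the pre-resampled file afterwards skips the costly conversion
audio, sr = librosa.load("test_audio_16k.wav", sr=16000)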

