
Comments (8)

sanchit-gandhi commented on May 16, 2024

This looks more or less correct! The benchmarks we ran were from a bunch of YouTube videos (I can give you the URLs), and transcription time is somewhat dependent on the audio file. The slower transcription time here could be because Whisper is getting caught in a hallucination in one of the batches, causing it to generate until it hits the max length (448 tokens).

You could check whether the text has repetitions, or try instantiating the pipeline with a lower max length (we set it to 128 and got complete transcriptions):

# instantiate the pipeline in float16 with a lower max generation length
import jax.numpy as jnp
from whisper_jax import FlaxWhisperPipline
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.float16, batch_size=32, max_length=128)
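
To check for repetitions, a quick sketch along these lines could work - it assumes the pipeline output carries the transcription under a "text" key, as with the Transformers ASR pipeline, and the audio path is illustrative:

# count repeated trigrams in the transcript; one trigram dominating usually signals a hallucination loop
from collections import Counter
outputs = pipeline("audio.mp3")  # hypothetical input file
words = outputs["text"].split()
trigrams = Counter(zip(words, words[1:], words[2:]))
print(trigrams.most_common(5))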


sanchit-gandhi commented on May 16, 2024

Correct!


ahxxm commented on May 16, 2024

Reproduced the hallucination with this audio file on Hugging Face.

it's impressively fast

but 16G of memory seems not to be enough for the statement jax = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=16) - how much memory does instantiating the pipeline require? From a very rough observation, the GPU memory (Tesla T4, 14G) was filled instantly, then memory grew slowly until it hit 16G, and the process was OOM-killed
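
One hypothetical thing to try (not verified against this particular OOM, which may be host RAM rather than GPU memory): JAX preallocates most of the GPU memory up front by default, and this can be turned off or capped with its standard XLA environment variables; a smaller batch size also lowers peak memory at some cost in throughput.

import os
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"   # allocate on demand instead of upfront
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.5"  # or cap the preallocated fraction

import jax.numpy as jnp
from whisper_jax import FlaxWhisperPipline
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=8)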

Just followed the discussions in #7 and the Transformers issue - it seems we haven't found the cause yet.


sanchit-gandhi commented on May 16, 2024

It's also worth making sure your audio is already at 16 kHz so that we don't resample in the Flax Whisper pipeline (which can be lengthy for long audio files).
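
For example, a minimal sketch that resamples up front - it assumes the pipeline accepts a dict with "array" and "sampling_rate" keys, as the Transformers ASR pipeline does, and the file name is illustrative:

import librosa

# load and resample to 16 kHz once, so the pipeline receives audio at its expected rate
audio, sr = librosa.load("audio.mp3", sr=16000)
outputs = pipeline({"array": audio, "sampling_rate": sr})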


sanchit-gandhi commented on May 16, 2024

The absolute transcription time is somewhat dependent on the audio sample - since it's proportional to the number of tokens generated, it'll depend on speaking rate, propensity to hallucinate, speech-to-silence ratio, etc. Since what we really care about is the relative time between systems (rather than necessarily the absolute ones), it would be cool to benchmark with the same audio file using OpenAI's Whisper and Transformers' Whisper on GPU to see what we're aiming for.
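
A rough benchmarking sketch for that comparison - the openai-whisper and Transformers calls are their standard APIs, but the file name and timing scaffolding are illustrative, and model-load time is deliberately kept out of the measurement:

import time
import whisper  # openai-whisper
from transformers import pipeline as hf_pipeline

# OpenAI Whisper on GPU
model = whisper.load_model("large-v2")
start = time.time()
result = model.transcribe("audio.wav")
print(f"openai-whisper: {time.time() - start:.1f}s")

# Transformers Whisper on GPU, chunked long-form transcription
asr = hf_pipeline("automatic-speech-recognition", model="openai/whisper-large-v2", device=0, chunk_length_s=30)
start = time.time()
out = asr("audio.wav")
print(f"transformers: {time.time() - start:.1f}s")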


AndrewZhaoLuo commented on May 16, 2024

One more question so I can do some fair comparisons across libraries: if I am reading the codebase correctly, this is doing a greedy search (i.e. beam_size=1). Is that correct?
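
If so, a hedged sketch of pinning the other libraries to greedy decoding for the comparison - these are standard openai-whisper and Transformers options, though whether they match whisper-jax's decoding exactly is an assumption:

import whisper  # openai-whisper
from transformers import pipeline as hf_pipeline

# openai-whisper: temperature 0 with beam_size unset gives greedy decoding
model = whisper.load_model("large-v2")
result = model.transcribe("audio.wav", temperature=0.0, beam_size=None)

# Transformers: num_beams=1 forces greedy search
asr = hf_pipeline("automatic-speech-recognition", model="openai/whisper-large-v2", device=0)
out = asr("audio.wav", generate_kwargs={"num_beams": 1})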


AndrewZhaoLuo commented on May 16, 2024

Thanks for all your help.

Finally, it might be good to just have the audio you used to benchmark. @sanchit-gandhi, can you direct me to the YouTube video?


s-tomar commented on May 16, 2024

Hi,

On a CPU-only system (no TPU/GPU), the following degrades overall performance: for an audio file under 10 minutes, it consumes almost 25% more time.

import librosa
SAMPLING_RATE = 16000
# librosa resamples the mp3 to 16 kHz while loading, which is slow on CPU
audio, sr = librosa.load('test_audio.mp3', sr=SAMPLING_RATE)

I guess there are quite a few parameters to be tuned to achieve good/best performance, and improper tuning can worsen the situation 🤔
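
One hypothetical way around the resampling cost (not something verified in this thread): convert the file to 16 kHz once, offline, so librosa does not redo the sample-rate conversion on every run. Assumes ffmpeg is installed; file names are illustrative.

import subprocess
import librosa

# convert to 16 kHz mono WAV a single time, up front
subprocess.run(["ffmpeg", "-i", "test_audio.mp3", "-ar", "16000", "-ac", "1", "test_audio_16k.wav"], check=True)

# loading the pre-resampled file afterwards skips the costly conversion
audio, sr = librosa.load("test_audio_16k.wav", sr=16000)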

