ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

License: MIT License

Makefile 0.41% Python 0.95% C 40.79% C++ 34.19% Shell 0.60% CMake 0.71% Batchfile 0.03% JavaScript 0.03% Go 1.09% Ruby 0.10% Objective-C 4.37% Objective-C++ 0.04% Cuda 10.52% Java 1.11% Metal 4.97% Dockerfile 0.04% Swift 0.04%
openai speech-to-text transformer whisper inference speech-recognition

whisper.cpp's Issues

Transcription time is 2.7x the wav file duration

Thanks for sharing whisper.cpp @ggerganov. I'm wondering if I'm missing something. I tried whisper.cpp on a 40-minute wav file, which took almost 2 hours to transcribe, which doesn't seem to be what others have experienced. I was transcribing on an 8-vCPU machine with 32 GB of memory. Are there any settings I'm missing? Appreciate your help.

Unfortunately I'm unable to share the wav file as it's private data.

whisper_model_load: loading model from 'models/ggml-large.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 5
whisper_model_load: mem_required = 4576.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 3255.34 MB
whisper_model_load: memory size = 304.38 MB
whisper_model_load: model size = 2950.66 MB

main: processing 'output/x.wav' (38688821 samples, 2418.1 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...
whisper_print_timings: load time = 4246.85 ms
whisper_print_timings: mel time = 31377.23 ms
whisper_print_timings: sample time = 3421.71 ms
whisper_print_timings: encode time = 4697475.00 ms / 146796.09 ms per layer
whisper_print_timings: decode time = 1830579.38 ms / 57205.61 ms per layer
whisper_print_timings: total time = 6568016.00 ms

/whisper.cpp/whisper.cpp:2305:17: internal compiler error: in reshape_init_class, at cp/decl.c:6465

Fully stumped. I only ran make. Compiler: cpp (Ubuntu 11.2.0-19ubuntu1) 11.2.0

whisper.cpp: In function ‘whisper_full_params whisper_full_default_params(whisper_decode_strategy)’:
whisper.cpp:2305:17: internal compiler error: in reshape_init_class, at cp/decl.c:6465
 2305 |                 };
      |                 ^
0x7f415aaa6d8f __libc_start_call_main
        ../sysdeps/nptl/libc_start_call_main.h:58
0x7f415aaa6e3f __libc_start_main_impl
        ../csu/libc-start.c:392
Please submit a full bug report, with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-11/README.Bugs> for instructions.

MUSL Linux builds

Hi there! I'm attempting to build whisper.cpp for MUSL Linux for some lightweight systems, and I figured I would note the issues I ran into during the build.

  1. Alpine appears not to include stdint.h or alloca.h in its standard library when you install gcc. This results in a slew of errors:
localhost:~/whisper.cpp# make libwhisper.a
cc  -O3 -std=c11   -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread   -c ggml.c
In file included from ggml.h:7,
                 from ggml.c:1:
/usr/lib/gcc/aarch64-alpine-linux-musl/12.2.1/include/stdint.h:9:26: error: no include path in which to search for stdint.h
    9 | # include_next <stdint.h>
      |                          ^
ggml.h:107:5: error: unknown type name 'int64_t'
  107 |     int64_t perf_cycles;
      |     ^~~~~~~
~~snip~~

ggml.c:6:10: fatal error: alloca.h: No such file or directory
    6 | #include <alloca.h>
      |          ^~~~~~~~~~
compilation terminated.
make: *** [Makefile:58: ggml.o] Error 1
localhost:~/whisper.cpp#

This fix is relatively simple: just install g++:

apk add g++
  2. clock_gettime and CLOCK_MONOTONIC are seemingly undefined regardless of the compiler used.
localhost:~/whisper.cpp# make libwhisper.a
cc  -O3 -std=c11   -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread   -c ggml.c
ggml.c: In function 'ggml_time_ms':
ggml.c:155:5: warning: implicit declaration of function 'clock_gettime' [-Wimplicit-function-declaration]
  155 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |     ^~~~~~~~~~~~~
ggml.c:155:19: error: 'CLOCK_MONOTONIC' undeclared (first use in this function)
  155 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |                   ^~~~~~~~~~~~~~~
ggml.c:155:19: note: each undeclared identifier is reported only once for each function it appears in
ggml.c: In function 'ggml_time_us':
ggml.c:161:19: error: 'CLOCK_MONOTONIC' undeclared (first use in this function)
  161 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |                   ^~~~~~~~~~~~~~~
make: *** [Makefile:58: ggml.o] Error 1
localhost:~/whisper.cpp# 

Digging around the internet suggests the fix is to insert #define _POSIX_C_SOURCE 199309L before the time.h header is included. Placing it on line 10 of ggml.c appears to work. It would be nice if this could be fixed upstream in some way. I would make a PR if I had sufficient knowledge to implement the required changes, which I don't.
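For reference, the workaround looks like this (a sketch; the exact placement just needs to come before the first include of <time.h>):

// near the top of ggml.c, before <time.h> is included
#define _POSIX_C_SOURCE 199309L // exposes clock_gettime() and CLOCK_MONOTONIC on musl

#include <time.h>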

Support for realtime audio input

Noting that the processing time is considerably shorter than the length of the speech, is it possible to feed the model real-time microphone output? Or does the inference run on the complete audio stream, instead of sample by sample?

This would greatly reduce latency for voice assistants and the like, since the audio would not need to be fully captured before being fed to the model. Basically the same as I did here with SODA: https://github.com/biemster/gasr, but with an open-source and multilingual model.

Windows build

It would be nice if someone could help and provide build instructions for Windows.

I think the only thing that might need an update is the pthread dependency in ggml.c.
The rest of the code should build successfully.

A .bat script to download the models would probably also be nice, since Bash is not available on Windows.

Thread 1 "main" received signal SIGILL, Illegal instruction.

Seemed to come from here:

0x00005555555586fb in _mm256_fmadd_ps (__C=..., __B=..., __A=...) at /usr/lib/gcc/x86_64-linux-gnu/7/include/fmaintrin.h:65
65	  return (__m256)__builtin_ia32_vfmaddps256 ((__v8sf)__A, (__v8sf)__B,

with backtrace

(gdb) bt
#0  0x00005555555586fb in _mm256_fmadd_ps (__C=..., __B=..., __A=...) at /usr/lib/gcc/x86_64-linux-gnu/7/include/fmaintrin.h:65
#1  ggml_vec_dot_f16 (n=96, s=0x7ffffffe4e54, x=0x7fff646b6ee0, y=0x7fff64746ee0) at ggml.c:375
#2  0x0000555555564766 in ggml_compute_forward_conv_1d_1s_f16_f32 (params=0x7ffffffe51c0, src0=0x7fff9025f0f0, src1=0x7fff65482030, dst=0x7fff6556c6f0) at ggml.c:4668
#3  0x0000555555564f40 in ggml_compute_forward_conv_1d_1s (params=0x7ffffffe51c0, src0=0x7fff9025f0f0, src1=0x7fff65482030, dst=0x7fff6556c6f0) at ggml.c:4806
#4  0x0000555555568707 in ggml_compute_forward (params=0x7ffffffe51c0, tensor=0x7fff6556c6f0) at ggml.c:5809
#5  0x000055555556a6ec in ggml_graph_compute (ctx=0x5555557f3b48 <g_state+104>, cgraph=0x7ffffffe5340) at ggml.c:6611
#6  0x0000555555580cb2 in whisper_encode (model=..., n_threads=4, mel_offset=0, mel_inp=..., features=std::vector of length 0, capacity 0) at main.cpp:1353
#7  0x0000555555584664 in main (argc=5, argv=0x7fffffffdb78) at main.cpp:2225

On Ubuntu 18.04, gcc 7.5.0, on an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz

ggml_graph_compute: Assertion `false' failed

Examples work fine for me, but I get an error when trying different wave files longer than 21 s:

./main -m models/ggml-base.en.bin mycut-22s-16khz.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
[...]
main: processing 'mycut-22s-16khz.wav' (352000 samples, 22.0 sec), 2 threads, lang = en, task = transcribe, timestamps = 1 ...
[...]

main: ggml.c:6658: ggml_graph_compute: Assertion `false' failed.
Aborted (core dumped)

It's apparently fixed if I comment out the assert and substitute it with cgraph->work = NULL; in ggml.c. But I guess it's not the best workaround, as it crashes again with a segfault if the audio duration is more than approximately 43 s.

Running on Ubuntu 18.04 LTS (GNU/Linux 4.15.0-187-generic x86_64), after fixing CACHE_LINE_SIZE and initializers issues #11

Any hint? Thanks!

Python bindings (C-style API)

Good day everyone!
I'm thinking about bindings for Python.

So far, I'm interested in 4 functionalities:

  1. Encoder processing
  2. Decoder processing
  3. Transcription of audio (feed audio bytes, get text)
  4. Item 3 plus the timings of all words (feed audio bytes, get text + times of each word). Of course, it's too early to think about word timings, since even in the Python implementation they are still not well done.

Perhaps in the near future I will try to take up this task, but I have no experience with Python bindings. So if there are craftsmen who can do it quickly (if it can be done quickly... 😃), that would be cool!

Build on FreeBSD

Hi,

I was able to compile this on FreeBSD 13.1-RELEASE-p2 amd64, with devel/gmake installed (using gmake instead of make) and with the following modifications:

--- Makefile_ori        2022-10-16 21:19:22.498824000 +0200
+++ Makefile    2022-10-16 22:40:53.787014000 +0200
@@ -22,10 +22,17 @@
        CFLAGS   += -pthread
        CXXFLAGS += -pthread
 endif
+ifeq ($(UNAME_S),FreeBSD)
+       CFLAGS   += -pthread
+       CXXFLAGS += -pthread
+endif
 
 # Architecture specific
 # TODO: probably these flags need to be tweaked on some architectures
 ifeq ($(UNAME_M),x86_64)
+       CFLAGS += -mavx -mavx2 -mfma -mf16c
+endif
+ifeq ($(UNAME_M),amd64)
        CFLAGS += -mavx -mavx2 -mfma -mf16c
 endif
 ifneq ($(filter arm%,$(UNAME_M)),)

(I don't know gmake Makefiles too well; this could be prettier with a logical OR here ...)

--- ggml.c_ori  2022-10-16 21:19:22.502786000 +0200
+++ ggml.c      2022-10-16 21:28:00.140594000 +0200
@@ -2,7 +2,7 @@

 #if defined(_MSC_VER) || defined(__MINGW32__)
 #include <malloc.h> // using malloc.h with MSC/MINGW
-#else
+#elif !defined(__FreeBSD__)
 #include <alloca.h>
 #endif

It seems it would not be too hard to merge these changes upstream ...

For downloading the models, ftp/wget is needed.

Kind regards,
abelbabel

Feature request

Hi @ggerganov
whisper.cpp looks promising, thank you for your work.
I know there is a timestamp limitation mentioned in the README currently.
Is it possible to include timestamps in the future? That would be useful when generating subtitles.
Or could whisper.cpp support a streaming mode with streaming audio?

whisper : mark speakers/voices (diarization)

Hi,

I'm not so much into the details of whisper or whisper.cpp, and I don't know if it is currently even possible with this foundation, but it would be nice if speakers, or speaker/voice changes, could be marked.

This would be very handy when processing interviews, radio/tv shows, films, etc.

Kind regards,
abelbabel

./whisper.cpp/whisper.h:121:81: error: unknown type name 'bool'

I'm attempting to automate rust-bindgen generation. This appears not to work, however, because bindgen uses clang, which does not implicitly #include <stdbool.h>. Adding #include <stdbool.h> at line 5 of whisper.h appears to fix this. I'm opening this issue to get feedback and others' thoughts.
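For reference, the change is just the following (the exact line is not important, as long as it comes before the first use of bool):

// near the top of whisper.h, so 'bool' is defined when the header is parsed as C (e.g. by clang via bindgen)
#include <stdbool.h>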

Request to support aarch64

Make errors out on an aarch64 server:

make base.en
#gcc -pthread -O3 -c ggml.c
gcc -pthread -O3 -mcpu=cortex-a72 -mfloat-abi=hard -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -c ggml.c
gcc: error: unrecognized command-line option ‘-mfloat-abi=hard’
gcc: error: unrecognized command-line option ‘-mfpu=neon-fp-armv8’
gcc: error: unrecognized command-line option ‘-mfp16-format=ieee’
gcc: error: unrecognized command-line option ‘-mno-unaligned-access’
make: *** [Makefile:7: ggml.o] Error 1

Perhaps this is enough for the C flags: -Ofast -g -mfpu=neon?

Token decoding issue - some characters are missing

./main -m models/ggml-medium.bin -l zh -f ~/Movies/samplecn16k.wav

output

[00:00.000 --> 00:16.000]  元����,其实就是����世界,而且要用����世界这个��来定��元����的话,要比元����本身更加����。到这里就出现问题了。那它为什么不叫����世界呢?最��单的原因就是,����世界这个说法大家已经听��了,而元������得更为新��,又包��成为了一个新的概念。
[00:16.000 --> 00:44.000]  现在的元����技��,����没有我们想象中那么先进。按照目前世界第一元����公司,Roblox公司对于元����的定��来看,它起��要具��8个要素,分别是身份、社交、成进、����、多元、��地、经��、文明。身份就是一个����身份,��现实中的角色无关,这个比��好理解。社交也就是社交系��。成进就是感知����的升��,要做到和现实世界的体��完全相同。����就��������,不会有卡��,多元就多元化,

With the OpenAI whisper CLI:

whisper --language zh ~/Movies/samplecn16k.wav
[00:00.000 --> 00:01.760] 元宇宙其实就虚拟世界
[00:01.760 --> 00:04.400] 而且要用虚拟世界这个词来定义元宇宙的话
[00:04.400 --> 00:06.400] 要比元宇宙本身更加准确
[00:06.400 --> 00:07.680] 但这里就出现问题了
[00:07.680 --> 00:09.360] 那它为什么不叫虚拟世界呢?
[00:09.360 --> 00:10.720] 最简单的原因就是
[00:10.720 --> 00:12.880] 虚拟世界这个说法大家已经听腻了
[00:12.880 --> 00:14.320] 而元宇宙显得更为吸引
[00:14.320 --> 00:16.200] 又包装成为了一个新的概念
[00:16.200 --> 00:17.440] 现在的元宇宙技术
[00:17.440 --> 00:19.160] 原有没有我们想象中那么先进
[00:19.160 --> 00:21.320] 按照目前世界第一元宇宙公司
[00:21.320 --> 00:23.480] 罗布洛克斯公司对于元宇宙的定义来看
[00:23.480 --> 00:25.080] 它起码要具备8个要素
[00:25.080 --> 00:30.680] 分别是身份、社交、成敬、延迟、多元、随地、经济、文明
[00:30.680 --> 00:32.280] 身份就是一个虚拟身份
[00:32.280 --> 00:33.640] 与现实中的角色无关
[00:33.640 --> 00:34.640] 这个比较好理解
[00:34.640 --> 00:36.200] 社交也就是社交系统
[00:36.200 --> 00:38.320] 成敬就是感知设备的升级
[00:38.320 --> 00:40.800] 要做到和现实世界的体验完全相同
[00:40.800 --> 00:42.080] 延迟就网络延迟
[00:42.080 --> 00:43.080] 不会有卡顿
[00:43.080 --> 00:44.200] 多元就多元化
[00:44.200 --> 00:45.600] 比如可以在里面玩游戏

Performance Xeon

Performance report.
Regarding V2 and V3: V2 is the version before this commit, V3 is after.

  • CPU: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
  • Task: 200 s of audio (7 diff files with diff quality)

V2 -t

model    T, s    -t (CPU threads)
tiny       64     1
tiny       21     4
tiny       21     8
tiny       80    16
tiny      175    24
base       42     8
base       93    16
small     110     8
small     190    16
large     420     8
large     537    16

V3 -t

model    T, s    -t (CPU threads)
tiny       84     1
tiny       32     4
tiny       28     8
tiny       56    16
tiny       86    24
base       58     8
base      125    16
small     104     8
small     177    16
large     570     8
large     850    16

V2 parallel

  • Using parallel bash computation
  • 7 parallel jobs, with -t specified for each job

model    T, s    -t (CPU threads per job)
tiny       17     1
tiny        9     2
tiny        5     4
base       56     1
base       25     2
base       16     4
small     155     1
small      86     2
small      53     4
large     788     1
large     428     2
large     260     4

Encode vs Decode time (V2 vs V3) tiny

V2

  • File 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 452.00 MB
main:     load time =    84.28 ms
main:      mel time =   118.88 ms
main:   sample time =    46.91 ms
main:   encode time =   531.27 ms / 132.82 ms per layer
main:   decode time =  3730.47 ms
main:    total time =  6181.17 ms
  • File 2
main:     load time =    80.49 ms
main:      mel time =    97.64 ms
main:   sample time =    13.85 ms
main:   encode time =   533.10 ms / 133.27 ms per layer
main:   decode time =  1036.91 ms
main:    total time =  2348.79 ms

V3

  • File 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 244.00 MB
main:     load time =   241.68 ms
main:      mel time =   656.11 ms
main:   sample time =  1202.84 ms
main:   encode time =  1736.55 ms / 434.14 ms per layer
main:   decode time =  8354.48 ms
main:    total time = 12211.61 ms
  • File 2
main:     load time =   243.57 ms
main:      mel time =   541.42 ms
main:   sample time =   209.42 ms
main:   encode time =  2901.70 ms / 725.42 ms per layer
main:   decode time =  1588.76 ms
main:    total time =  5501.20 ms

Make fails

g++ -O3 -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread -c whisper.cpp
whisper.cpp: In function ‘whisper_full_params whisper_full_default_params(whisper_decode_strategy)’:
whisper.cpp:2286:17: sorry, unimplemented: non-trivial designated initializers not supported
                 };
                 ^
whisper.cpp:2313:17: sorry, unimplemented: non-trivial designated initializers not supported
                 };
                 ^
Makefile:74: recipe for target 'whisper.o' failed
make: *** [whisper.o] Error 1

g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Cheaper hardware to run bigger model

Referring to our discussion at #8: I can run ggml-large.bin on the same input audio of 120 seconds (2 minutes) in around 54 minutes on a Samsung A52.

What is your suggestion for optimizations to run a bigger model on cheaper hardware:

  1. Selecting better hardware for array manipulation (Neon?)
  2. Improve algorithm
  3. Use GPU provided by hardware
  4. ... ?

I will be happy if you share resources I can learn from to achieve that goal.

Error: "whisper_full: failed to generate timestamp token - this should not happen"

I was running a task on a German-language YouTube video with the command line
./main -m ggml-base.bin bauer.wav -t 8 -l de -osrt
and the process ran OK until around the 4-minute mark, then I got the error:

"whisper_full: failed to generate timestamp token - this should not happen"

repeated several times, and the transcription never resumed.
I changed the command line to use 4 cores and didn't include the SRT file generation, and still got the same error.
Curiously, if I force English transcription with "-l en", the transcription is OK until 4 minutes or so and then the same sentence repeats until the end of the file.

I think this happened after the commit to reduce the sentence length.

C API threadsafety

I can't see any docs regarding thread safety for the C API. Information here would be very helpful for me and future users. Thanks!

How do I compile to a shared library without libc++_shared.so?

I want to experiment with using whisper in an app, but when I open it, an error occurs because the compiled library requires libc++_shared.so.

I use this bash script to build for the Android target:

/home/azkdev/Android/Sdk/ndk/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android21-clang -pthread -O3 -std=c11 -mavx -mavx2 -mfma -mf16c -c ./ggml.c -fPIC -lstdc++
/home/azkadev/Android/Sdk/ndk/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android21-clang++ -pthread -O3 -std=c++11 -mavx -mavx2 -mfma -mf16c -c ./whisper.cpp -fPIC -lstdc++
/home/azkadev/Android/Sdk/ndk/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android21-clang++ -pthread -O3 -std=c++11 ./main.cpp -fPIC -lstdc++ whisper.o ggml.o -o ./whisper.so --shared -fPIC -lstdc++

I have also tried this: clang-linking-so-library-libc-shared-so, but it doesn't work.

Error (screenshot omitted)

Can you give a build command so it doesn't need libc++_shared.so? Sorry, I'm still a beginner in C++.

Android example app

Implement a very basic Java application using whisper.cpp. It can be used as an example for running Whisper on Android.

The ggwave-java project can be used as a good starting point. It already provides the audio capture functionality. Instead of passing it to ggwave, we just need to pass it to whisper.cpp.

Edit:
Looking for volunteers to help with this - ideally, we would like to have the same functionality demonstrated as in the iOS example application.

ggml.c CACHE_LINE_SIZE error: initializer element is not constant

I get this on Ubuntu 18.04 gcc 7.5.0 (time to update, yes), and I don't immediately see how to fix it since I don't know __cpp_lib_hardware_interference_size. Otherwise a simple replacement with a #define would suffice.

gcc -pthread -O3 -mavx -mavx2 -mfma -mf16c -c ggml.c
ggml.c:183:36: error: initializer element is not constant
 const size_t CACHE_LINE_SIZE_F32 = CACHE_LINE_SIZE/sizeof(float);
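A sketch of the #define workaround mentioned above (the value 64 is an assumption matching typical x86 cache lines):

// replace the 'const size_t CACHE_LINE_SIZE = ...' definition with a preprocessor
// constant, which is a valid constant expression for file-scope initializers in C
#define CACHE_LINE_SIZE 64

const size_t CACHE_LINE_SIZE_F32 = CACHE_LINE_SIZE/sizeof(float);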

WASM port

We can easily build whisper.cpp as a WASM library using Emscripten:

mkdir build-em
cd build-em
emcmake cmake ..
make

It looks like a big subset of the SIMD intrinsics is already supported, so the performance might not be too bad:

https://emscripten.org/docs/porting/simd.html

So let's try running whisper.cpp directly in the browser!

  • The model file could either be fetched on load, or the user can drag and drop it in the browser window
  • We need a simple page that records a short audio at 16 kHz sampling rate and passes it to the WASM module for transcription. Probably something similar to this ggwave example can be used

What's the build process for Windows?

I tried running "make" and got this error:

process_begin: CreateProcess(NULL, uname -s, ...) failed.
process_begin: CreateProcess(NULL, uname -p, ...) failed.
process_begin: CreateProcess(NULL, uname -m, ...) failed.
cc  -O3 -std=c11   -Wall -Wextra -Wno-unused-parameter -Wno-unused-function   -c ggml.c
process_begin: CreateProcess(NULL, cc -O3 -std=c11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -c ggml.c, ...) failed.
make (e=2): The system cannot find the file specified.
make: *** [ggml.o] Error 2

Could someone guide me through building this program on Windows? Are there pre-built binaries available? I have Visual Studio 2022 and MinGW installed.

Running inference over a large batch of audio files

Hi! Firstly, thank you so much for this incredible work!

I have been running the tiny.en model on a large number of wav files stored in a folder. I am currently parallelizing the work over a multi-core machine using GNU parallel and running the following command:

find input_data/eng_wav_data -name "*.wav" | parallel 'time ./main -m models/ggml-tiny.en.bin -nt -f {} -t 1 > {.}.txt'

I found that currently the model is loaded each time a wav file is transcribed. Is there a way I can circumvent this and load the model only once? Any help would be appreciated. Thank you, and apologies if this issue has been resolved already.
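In the meantime, a minimal sketch of the idea, loading the model once and reusing the context for every file. It assumes the C API in whisper.h (whisper_init, whisper_full, whisper_full_n_segments, whisper_full_get_segment_text, whisper_free); the sampling-strategy enum and some function names have changed between versions, so treat the names as illustrative. The WAV reader is only a placeholder that assumes 16 kHz mono 16-bit PCM with a plain 44-byte header, not a robust parser.

#include "whisper.h"

#include <cstdint>
#include <cstdio>
#include <vector>

// placeholder reader: assumes 16 kHz, mono, 16-bit PCM WAV with a 44-byte header
static std::vector<float> read_wav_f32(const char * fname) {
    std::vector<float> pcm;
    FILE * f = fopen(fname, "rb");
    if (!f) return pcm;
    fseek(f, 44, SEEK_SET); // skip the canonical RIFF/WAVE header
    int16_t sample;
    while (fread(&sample, sizeof(sample), 1, f) == 1) {
        pcm.push_back(sample/32768.0f);
    }
    fclose(f);
    return pcm;
}

int main(int argc, char ** argv) {
    // load the model once ...
    struct whisper_context * ctx = whisper_init("models/ggml-tiny.en.bin");
    if (!ctx) return 1;

    // greedy strategy (named WHISPER_SAMPLING_GREEDY in later headers)
    struct whisper_full_params params = whisper_full_default_params(WHISPER_DECODE_GREEDY);
    params.n_threads = 1;

    // ... and reuse it for every file passed on the command line
    for (int i = 1; i < argc; ++i) {
        const std::vector<float> pcm = read_wav_f32(argv[i]);
        if (pcm.empty() || whisper_full(ctx, params, pcm.data(), (int) pcm.size()) != 0) {
            fprintf(stderr, "failed on %s\n", argv[i]);
            continue;
        }
        for (int s = 0; s < whisper_full_n_segments(ctx); ++s) {
            printf("%s", whisper_full_get_segment_text(ctx, s));
        }
        printf("\n");
    }

    whisper_free(ctx);
    return 0;
}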

Language selection

I'm glad you shared this implementation.
It's a steep increase in performance relative to Torch on the CPU.

You may already know this, but I found out how to enable recognition of a specific language.
We can just put this at line 2012 of main.cpp:

std::vector<whisper_vocab::id> prompt = { vocab.token_sot, vocab.token_lang, vocab.token_task };  

These 3 tokens are formed here:
https://github.com/openai/whisper/blob/8cf36f3508c9acd341a45eb2364239a3d81458b9/whisper/tokenizer.py#L324-L331

For specific use in main.cpp, you can simply specify the desired index manually. But for regular users, it would be cool to be able to specify which language they would prefer to see in the output.
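As an illustration (the token layout here is taken from the OpenAI tokenizer rather than verified against main.cpp, so treat the offsets as an assumption): the language tokens follow <|startoftranscript|> in the tokenizer's language order, so selecting a language manually boils down to something like:

// lang_id = index of the language in the tokenizer's language list ("en" = 0, "zh" = 1, "de" = 2, ...)
std::vector<whisper_vocab::id> prompt = {
    vocab.token_sot,               // <|startoftranscript|>
    vocab.token_sot + 1 + lang_id, // language token, e.g. <|de|>
    vocab.token_task,              // <|transcribe|> or <|translate|>
};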

[Feature] recognize data coming via pipe stream

Hi,

it would be great to have a simple app that takes data from a pipe and runs recognition on it ... similar to stream.cpp, but taking data from a pipe instead of from the audio device ...

It could also be an addition to the main example, so that you could use it like this:

cat samples/jfk.wav | ./main -m models/ggml-medium.bin -f -

Here something similar is done with vosk and Python. (ffmpeg pre-processing could be something people do on their own before filling the pipe, rather than being part of the app ...)
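A minimal sketch of what the reading side could look like, assuming raw 16-bit mono 16 kHz PCM on stdin (i.e. the ffmpeg conversion happens before the pipe) rather than WAV parsing; the binary name is made up:

#include <cstdint>
#include <cstdio>
#include <vector>

// read raw 16-bit mono 16 kHz PCM from stdin and convert it to the float samples
// that whisper_full() expects, e.g.:
//   ffmpeg -i input.mp3 -f s16le -ac 1 -ar 16000 - | ./main-from-pipe
int main() {
    std::vector<float> pcm;
    int16_t buf[4096];
    size_t n;
    while ((n = fread(buf, sizeof(int16_t), 4096, stdin)) > 0) {
        for (size_t i = 0; i < n; ++i) {
            pcm.push_back(buf[i]/32768.0f);
        }
    }
    fprintf(stderr, "read %zu samples (%.1f s at 16 kHz)\n", pcm.size(), pcm.size()/16000.0);
    // ... pass pcm.data() / pcm.size() to whisper_full() here ...
    return 0;
}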

Kind regards,
abelbabel

Hosting the ggml models in the cloud

Currently, I am hosting the ggml Whisper model files on my Linode server.
However, it has a limited network bandwidth per month and as more people start using whisper.cpp it won't be enough.

What are some good options for hosting ~10GB of data?

The only requirement is to be able to wget/curl the files directly - i.e. Google Drive and the like are not an option.

iOS example app

Implement a very basic iOS application using whisper.cpp

The ggwave-objc project can be used as a good starting point. It already provides the audio capture functionality. We just need to pass the captured data to whisper.cpp.

This code but with CUDA

Does anyone have any ideas on how to use this code but with CUDA libs? I want to move away from the Python version but keep PyTorch CUDA.

Compile error: internal compiler error

I have this issue when trying to compile the most recent version (as of 16 Oct 2022):

(base) user@pc:~/whisper.cpp$ make
g++ -O3 -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread -c whisper.cpp
whisper.cpp: In function ‘whisper_full_params whisper_full_default_params(whisper_decode_strategy)’:
whisper.cpp:2305:17: internal compiler error: in reshape_init_class, at cp/decl.c:6465
2305 | };
| ^
0x7fdf6ca75d8f __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
0x7fdf6ca75e3f __libc_start_main_impl
../csu/libc-start.c:392
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See file:///usr/share/doc/gcc-11/README.Bugs for instructions.
make: *** [Makefile:61: whisper.o] Error 1

Just to be sure it wasn't my setup, I compiled the fork I have at https://github.com/Topping1/whisper.cpp (2 commits ahead, 8 commits behind) and it compiled fine. For further verification, I ran a diff between the two whisper.cpp files and found this
(left is whisper.cpp at my repository, right is the updated one, as of 16-Oct-2022):

(diff screenshot omitted)

Do you know what might be causing the issue?

Add function to clear model words

Hi there! I'm trying to save compute resources by reusing WhisperContext objects in a STT server instance, but if no words were detected in the audio, it will cause whatever words were found in the last transcription that had words detected to be spit out again. This is a major issue, and I'd like a way to prevent this. The easiest way I can think of is adding a function to clear the words stored in the model. I considered adding such a feature to my app, but I realized this could cause serious overhead and introduce user privacy risks from storing many sentences compared to just clearing the words from the model itself. Thanks!

Comparison with torch jit

Great work! I find the implementation of ggml especially interesting. It looks like you implement all the basic neural-network building blocks with ggml. How does it compare with the torch.jit approach of using a PyTorch model in C++?

CMake builds run MUCH slower than Make builds

Hi, and thanks so much for this project. It's really, really fast. I've been compiling for M1 Mac, Intel Mac, and Windows, and I've noticed something across the board: CMake builds run much, much slower (3-4x) than Make builds. I would love to put some time into fixing this and PRing, but I'm really busy right now.

I may have time in a couple of weeks to contribute but just wanted to put this on your radar in case there's some obvious easy fix.

Output file

Hello there. It seems like redirecting the standard output with >, >>, or tee doesn't work. It would be nice to have an option to save the output to a specific file.

PyTorch performance for Linear layer is 4 times faster than matmul

I am doing some performance optimizations in ggml, and it seems that PyTorch's Linear layer currently outperforms my implementation by a factor of ~4 for big matrices. I am wondering what the secret is and if someone can give me some tips on how to achieve this performance.


Consider the following line from the original whisper implementation:

https://github.com/openai/whisper/blob/e90b8fa7e845ae184ed9aa0babcf3cde6f16719e/whisper/model.py#L73

This is effectively equivalent to a matrix multiplication of x with a square weights matrix from the model (encoder.blocks.0.attn.query.weight) and sum with a bias vector (encoder.blocks.0.attn.query.bias).

I compared the runtime for this line with an explicit matrix multiplication of same size matrices.
To do that, I replaced the line with this piece of code:

# original
        q = self.query(x)

# modified
        start = time.time()
        q = self.query(x)
        print('time for self.query(x) = ', time.time() - start)

        start = time.time()
        r0 = torch.rand(x.shape[1], x.shape[2], dtype=torch.float32)
        r1 = torch.rand(x.shape[2], x.shape[2], dtype=torch.float32)
        r2 = r0 @ r1
        print('time for r2 (mat_mul)  = ', time.time() - start)

        print(self.query)
        print(' x shape = ',  x.shape, ' dtype = ',  x.dtype)
        print('r0 shape = ', r0.shape, ' dtype = ', r0.dtype)
        print('r1 shape = ', r1.shape, ' dtype = ', r1.dtype)
        print('r2 shape = ', r2.shape, ' dtype = ', r2.dtype)

I would have expected the time for self.query(x) to be equal to the time for r2 (mat_mul).
However, here is the result on my MacBook when running the large model:

time for self.query(x) =  0.0034177303314208984
time for r2 (mat_mul)  =  0.012507200241088867
Linear(in_features=1280, out_features=1280, bias=True)
 x shape =  torch.Size([1, 1500, 1280])  dtype =  torch.float32
r0 shape =  torch.Size([1500, 1280])  dtype =  torch.float32
r1 shape =  torch.Size([1280, 1280])  dtype =  torch.float32
r2 shape =  torch.Size([1500, 1280])  dtype =  torch.float32

So the Linear layer is almost 4 times faster (3.4 ms vs 12.5 ms) compared to explicit matrix multiplication.


How do we explain this difference?

Is PyTorch using some int8 quantisation technique under the hood to speed up this layer? If so, how can I verify that this is the case?

Any insight will be very much appreciated!

Correct parameters for cross-compiling for ARM Android?

What are the correct parameters for cross-compiling for ARM Android? I'm using Intel Ubuntu, android-ndk-r25b.


ggml.c:232:16: warning: implicit declaration of function 'vfmaq_f32' is invalid in C99 [-Wimplicit-function-declaration]
        sum0 = vfmaq_f32(sum0, x0, y0);
               ^
ggml.c:232:14: error: assigning to 'float32x4_t' (vector of 4 'float32_t' values) from **incompatible type** 'int'
        sum0 = vfmaq_f32(sum0, x0, y0);
             ^ ~~~~~~~~~~~~~~~~~~~~~~~

./ggml.c:331:14: error: assigning to 'float16x8_t' (vector of 8 'float16_t' values) from **incompatible type** 'int'
        sum0 = vfmaq_f16(sum0, x0, y0);
             ^ ~~~~~~~~~~~~~~~~~~~~~~~             

SIGFPE on certain audio files

Hey there! I'm testing out whisper.cpp to see if it would be suitable for production use. However I'm running into a SIGFPE on certain audio files: namely those that do not produce any output from the model. Because of the way my system is set up, I'm unable to provide any test files that can reproduce this bug.

However, I was able to build the library with debug symbols and trigger the exception. It seems to be a divide-by-zero error on line 2349 of whisper.cpp:

int progress_cur = (100*seek)/whisper_n_len(ctx);

The GDB output is as follows:

Thread 21 "scripty_stt_ser" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7ffff7085700 (LWP 3869)]
0x0000555555599123 in whisper_full (ctx=0x5555556f6a80, params=..., samples=<optimized out>, n_samples=<optimized out>) at whisper.cpp:2349
2349            int progress_cur = (100*seek)/whisper_n_len(ctx);

Unfortunately, despite compiling with debug symbols (-g flag), bt gave no extra info beyond that:

(gdb) bt
#0  0x0000555555599123 in whisper_full (ctx=0x5555556f6a80, params=..., samples=<optimized out>, n_samples=<optimized out>) at whisper.cpp:2349
#1  0x0000555555593cf6 in whisper_rs::whisper_ctx::WhisperContext::full (self=<optimized out>, params=..., data=...) at src/whisper_ctx.rs:390
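A guard along these lines avoids the exception, though it only papers over the empty-input case rather than explaining why whisper_n_len(ctx) returns 0 here (a workaround sketch, not necessarily the proper fix):

// whisper.cpp, around the progress calculation in whisper_full()
const int n_len = whisper_n_len(ctx);
const int progress_cur = n_len > 0 ? (100*seek)/n_len : 0;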

Let me know if there's anything else I can do to help!

inlining failed in call to 'always_inline' 'vfmaq_f16': target specific option mismatch

I am trying to compile for ARM64 and there seems to be an issue with some vector functions:

> [linux/arm64 builder 5/5] RUN gcc -pthread -O3 -march=native -c ggml.c &&     g++ -pthread -O3 -std=c++11 -c main.cpp &&     g++ -pthread -o main ggml.o main.o:
#29 3.977 ggml.c:506:14: note: called from here
#29 3.977   506 |         y1 = vfmaq_f16(y1, x1, v8);
#29 3.977       |              ^~~~~~~~~~~~~~~~~~~~~
#29 3.978 In file included from ggml.c:47:
#29 3.978 /usr/lib/gcc/aarch64-linux-gnu/10/include/arm_neon.h:33208:1: error: inlining failed in call to 'always_inline' 'vfmaq_f16': target specific option mismatch
#29 3.978 33208 | vfmaq_f16 (float16x8_t __a, float16x8_t __b, float16x8_t __c)
#29 3.978       | ^~~~~~~~~
#29 3.978 ggml.c:505:14: note: called from here
#29 3.978   505 |         y0 = vfmaq_f16(y0, x0, v8);
#29 3.978       |              ^~~~~~~~~~~~~~~~~~~~~
------
Dockerfile:11
--------------------
  10 |     ADD whisper.cpp/ /build/
  11 | >>> RUN gcc -pthread -O3 -march=native -c ggml.c && \
  12 | >>>     g++ -pthread -O3 -std=c++11 -c main.cpp && \
  13 | >>>     g++ -pthread -o main ggml.o main.o
  14 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c gcc -pthread -O3 -march=native -c ggml.c &&     g++ -pthread -O3 -std=c++11 -c main.cpp &&     g++ -pthread -o main ggml.o main.o" did not complete successfully: exit code: 1

Tested on GitHub actions (logs) and on a Raspberry Pi 4.

Dockerfile:

# build image
FROM debian:bullseye-slim AS builder
WORKDIR /build/
RUN apt-get update && apt-get install --no-install-recommends -y \
    make gcc g++ wget \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*

# Install Whisper.cpp
ADD whisper.cpp/ /build/
RUN gcc -pthread -O3 -march=native -c ggml.c && \
    g++ -pthread -O3 -std=c++11 -c main.cpp && \
    g++ -pthread -o main ggml.o main.o

Tutorial on implementaion of ggml?

Hi Georgi, I know this is probably not the right platform for such a request, but could you make a tutorial or docs on how you went about implementing ggml, and especially its design?
I am personally lacking this skill.

Thank you

Unicode/Encoding Issue with Japanese Text

I'm trying to run Japanese audio files through whisper.cpp, and it is returning some "corrupted" output.

Here is the output from whisper and whisper.cpp for comparison:

Commands and output:

whisper output.wav --model large --language Japanese
さくらちゃん**神経もすっごくいいし、バトンもうまいんだけど

./main -m models/ggml-large.bin -l ja -f output.wav
さくらちゃん**神��もすっごくいいし、バトンもうまいんだけど。

The expected 「神経も」 portion is the following in hex:

0xE7A59E 0xE7B58C 0xE38282

The "corrupted" 「神��も」 portion is:

0xE7A59E 0xEFBFBD 0xEEBFBD 0xE38282
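0xEFBFBD is the UTF-8 encoding of U+FFFD (the replacement character), and two of them appear where the single 3-byte character 経 (0xE7B58C) should be, which suggests the character is being split across two BPE tokens and each token's bytes are converted to text on their own. A sketch of a consumer-side workaround (this diagnosis is an assumption, not a confirmed fix): buffer each token's bytes and only print prefixes that end on a complete UTF-8 sequence, carrying the incomplete tail over to the next token.

#include <string>

// return the length of the longest prefix of `s` that ends on a complete UTF-8
// sequence; any bytes past that point should be kept and prepended to the next
// token's text instead of being printed immediately
static size_t utf8_complete_prefix(const std::string & s) {
    size_t i = s.size();
    size_t cont = 0;
    // step back over trailing continuation bytes (10xxxxxx), at most 3
    while (i > 0 && cont < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++cont;
    }
    if (i == 0) {
        return 0; // only continuation bytes seen so far
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    const size_t need =
        (lead & 0x80) == 0x00 ? 1 :  // ASCII
        (lead & 0xE0) == 0xC0 ? 2 :  // 2-byte sequence
        (lead & 0xF0) == 0xE0 ? 3 :  // 3-byte sequence (most CJK)
        (lead & 0xF8) == 0xF0 ? 4 :  // 4-byte sequence
                                1;   // invalid lead byte, pass it through
    return (cont + 1 >= need) ? s.size() : i - 1;
}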


Note: I had to comment out a few lines from whisper.cpp around line 2300 for "make" to compile. I do not know if this would impact it.

                    .beam_search = {
                        //.n_past = 0,
                        //.beam_width = 10,
                        //.n_best = 5,
                    },

Need -pthread for make on ubuntu 20.04

Makefile

main: ggml.o main.o
	g++ -pthread -o main ggml.o main.o
	./main -h

ggml.o: ggml.c ggml.h
	gcc -pthread -O3 -mavx -mavx2 -mfma -mf16c -c ggml.c

main.o: main.cpp ggml.h
	g++ -pthread -O3 -std=c++11 -c main.cpp

Not working on MacOS (ARM)

Hi, I've been trying to get this to work a few times, but it always fails with an illegal hardware instruction error.

E.g. for ./main -m models/ggml-small.bin -f samples/jfk.wav I get the following output:

whisper_model_load: loading model from 'models/ggml-small.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1048.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 533.05 MB
fish: Job 1, './main -m models/ggml-small.b...' terminated by signal SIGILL (Illegal instruction)

I've tried other models as well, but the result is always the same.

make stream fails due to missing dependency

If you normally try to build stream with make stream, it will fail with:

g++ -O3 -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread stream.cpp ggml.o whisper.o -o stream `sdl2-config --cflags --libs`
/bin/sh: 1: sdl2-config: not found
stream.cpp:12:10: fatal error: SDL.h: No such file or directory
   12 | #include <SDL.h>
      |          ^~~~~~~
compilation terminated.
make: *** [Makefile:76: stream] Error 1

The missing dependency for this is https://www.libsdl.org/ and can be installed with:

sudo apt-get install libsdl2-dev

It would be nice to add this to the README; I might do this later if I have time.

Timestamps for words instead of sentence possible?

Do you think that could be possible in some way?

I would like to get the timestamp of each word instead of the sentence (word bundle).
That could be useful for some kind of karaoke lyrics generator,
or just for text "lip sync" in a video clip or 3D character synchronization.

Cheers
