picovoice / speech-to-text-benchmark

speech to text benchmark framework

Home Page: https://picovoice.ai/

License: Apache License 2.0

Python 100.00%
speech-recognition speech-to-text deepspeech voice-recognition offline privacy deep-learning deep-neural-networks google-speech-to-text aws-transcribe

speech-to-text-benchmark's Introduction

Speech-to-Text Benchmark

Made in Vancouver, Canada by Picovoice

This repo is a minimalist and extensible framework for benchmarking different speech-to-text engines.

Table of Contents

  • Data
  • Metrics
  • Engines
  • Usage
  • Results

Word Error Rate

Word error rate (WER) is the word-level edit distance between the reference transcript and the output of the speech-to-text engine, divided by the number of words in the reference transcript.
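For illustration, here is a minimal sketch of a word-level WER computation using a dynamic-programming edit distance. It is not the repository's implementation, and real benchmarks typically normalize text (casing, punctuation) before scoring.

def word_error_rate(reference: str, hypothesis: str) -> float:
    # Word-level Levenshtein distance divided by the number of words
    # in the reference transcript (illustrative sketch only).
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("go into your dance", "go into gour dance"))  # 0.25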

Core-Hour

The Core-Hour metric measures the computational efficiency of a speech-to-text engine: the number of CPU core-hours required to process one hour of audio. An engine with a lower Core-Hour value is more computationally efficient. We omit this metric for cloud-based engines.
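As a rough illustration of how this metric can be derived from a timed run (a sketch with hypothetical variable names, not the repository's measurement code):

def core_hours_per_audio_hour(wall_clock_seconds: float,
                              num_cores: int,
                              audio_seconds: float) -> float:
    # CPU core-hours spent, divided by hours of audio processed.
    cpu_hours = wall_clock_seconds * num_cores / 3600.0
    audio_hours = audio_seconds / 3600.0
    return cpu_hours / audio_hours

# Example: 90 seconds of wall-clock time on 4 cores for one hour of
# audio gives 0.1 Core-Hour.
print(core_hours_per_audio_hour(90.0, 4, 3600.0))  # 0.1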

Model Size

The aggregate size of models (acoustic and language), in MB. We omit this metric for cloud-based engines.

Engines

Usage

This benchmark has been developed and tested on Ubuntu 22.04.

  • Install FFmpeg.
  • Download the datasets.
  • Install the requirements:
pip3 install -r requirements.txt

In the following, we provide instructions for running the benchmark for each engine. The supported datasets are COMMON_VOICE, LIBRI_SPEECH_TEST_CLEAN, LIBRI_SPEECH_TEST_OTHER, and TED_LIUM.

Amazon Transcribe Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${AWS_PROFILE} with the name of the AWS profile you wish to use.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AMAZON_TRANSCRIBE \
--aws-profile ${AWS_PROFILE}

Azure Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${AZURE_SPEECH_KEY} and ${AZURE_SPEECH_LOCATION} with the corresponding information from your Azure account.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-speech-key ${AZURE_SPEECH_KEY} \
--azure-speech-location ${AZURE_SPEECH_LOCATION}

Google Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${GOOGLE_APPLICATION_CREDENTIALS} with the credentials downloaded from Google Cloud Platform.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine GOOGLE_SPEECH_TO_TEXT \
--google-application-credentials ${GOOGLE_APPLICATION_CREDENTIALS}

IBM Watson Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${WATSON_SPEECH_TO_TEXT_API_KEY} and ${WATSON_SPEECH_TO_TEXT_URL} with the credentials from your IBM account.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine IBM_WATSON_SPEECH_TO_TEXT \
--watson-speech-to-text-api-key ${WATSON_SPEECH_TO_TEXT_API_KEY} \
--watson-speech-to-text-url ${WATSON_SPEECH_TO_TEXT_URL}

OpenAI Whisper Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${WHISPER_MODEL} with the Whisper model type (WHISPER_TINY, WHISPER_BASE, WHISPER_SMALL, WHISPER_MEDIUM, or WHISPER_LARGE).

python3 benchmark.py \
--engine ${WHISPER_MODEL} \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER}

Picovoice Cheetah Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${PICOVOICE_ACCESS_KEY} with the AccessKey obtained from Picovoice Console.

python3 benchmark.py \
--engine PICOVOICE_CHEETAH \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}

Picovoice Leopard Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${PICOVOICE_ACCESS_KEY} with the AccessKey obtained from Picovoice Console.

python3 benchmark.py \
--engine PICOVOICE_LEOPARD \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}

Results

Word Error Rate (WER)

| Engine | LibriSpeech test-clean | LibriSpeech test-other | TED-LIUM | CommonVoice | Average |
| --- | --- | --- | --- | --- | --- |
| Amazon Transcribe | 2.6% | 5.6% | 3.8% | 8.7% | 5.2% |
| Azure Speech-to-Text | 2.8% | 6.2% | 4.6% | 8.9% | 5.6% |
| Google Speech-to-Text | 10.8% | 24.5% | 14.4% | 31.9% | 20.4% |
| Google Speech-to-Text Enhanced | 6.2% | 13.0% | 6.1% | 18.2% | 10.9% |
| IBM Watson Speech-to-Text | 10.9% | 26.2% | 11.7% | 39.4% | 22.0% |
| Whisper Large (Multilingual) | 3.7% | 5.4% | 4.6% | 9.0% | 5.7% |
| Whisper Medium | 3.3% | 6.2% | 4.6% | 10.2% | 6.1% |
| Whisper Small | 3.3% | 7.2% | 4.8% | 12.7% | 7.0% |
| Whisper Base | 4.3% | 10.4% | 5.4% | 17.9% | 9.5% |
| Whisper Tiny | 5.9% | 13.8% | 6.5% | 24.4% | 12.7% |
| Picovoice Cheetah | 5.6% | 12.1% | 7.7% | 17.5% | 10.7% |
| Picovoice Leopard | 5.3% | 11.3% | 7.2% | 16.2% | 10.0% |

Core-Hour & Model Size

To obtain these results, we ran the benchmark across the entire TED-LIUM dataset and recorded the processing time. The measurements were carried out on an Ubuntu 22.04 machine with an AMD Ryzen 9 5900X CPU (12 cores @ 3.70 GHz), 64 GB of RAM, and NVMe storage, using 10 cores simultaneously. We omit Whisper Large (Multilingual) from this benchmark.

| Engine | Core-Hour | Model Size (MB) |
| --- | --- | --- |
| Whisper Medium | 1.50 | 1457 |
| Whisper Small | 0.89 | 462 |
| Whisper Base | 0.28 | 139 |
| Whisper Tiny | 0.15 | 73 |
| Picovoice Leopard | 0.05 | 36 |
| Picovoice Cheetah | 0.09 | 31 |

speech-to-text-benchmark's People

Contributors

bejager, kenarsa

speech-to-text-benchmark's Issues

MacBook Pro M3 Crashes

When I try:

python3 benchmark.py \
--engine WHISPER_LARGE \
--dataset LIBRI_SPEECH_TEST_OTHER \
--dataset-folder /Users/aedell/Documents/GitHub/speech-to-text-benchmark/datasets/LibriSpeech/test-other

my MacBook Pro's memory spikes and then the machine locks up. According to ChatGPT, the issue is:

Concurrent Execution: The script uses ProcessPoolExecutor for parallel processing. If the dataset is large and the number of processes (num_workers) is not optimally set, this could lead to high memory usage as each process might consume a significant amount of memory.
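A minimal sketch of the mitigation described here, capping the worker count explicitly (the helper names are hypothetical and this is not benchmark.py's actual code):

import os
from concurrent.futures import ProcessPoolExecutor

def transcribe_one(path: str) -> str:
    # Placeholder: load the engine and transcribe a single file.
    ...

def run(paths):
    # Cap the number of worker processes so each engine instance's
    # memory footprint stays bounded.
    max_workers = min(4, os.cpu_count() or 1)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(transcribe_one, paths, chunksize=8))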

Dataset and Engine Initialization: Depending on how the Dataset and Engine classes are implemented (not visible in the provided script), loading large datasets or initializing multiple instances of speech-to-text engines could consume a lot of memory.

Memory Leaks: If there are any memory leaks within the Engine or Dataset classes (e.g., not properly releasing resources), repeated calls in a loop could exacerbate memory consumption over time.

Any suggestions?

Consider adding whisper.cpp and faster-whisper to comparison

Both run on CPU.

whisper.cpp is a port of Whisper to C++; it's now probably one of the most popular open-source speech-to-text models running offline on CPU. With the smallest tiny model (75 MB) it's 5-6 times slower than Leopard in my tests, though.

faster-whisper is a reimplementation of the Whisper model using CTranslate2. It's not as popular yet, but in my tests it's 5 times faster than whisper.cpp on the same model, which makes it comparable to Leopard in terms of speed (even though it used the 75 MB tiny model).

[CONTRIBUTION] Speech Dataset Generator

Hi everyone!

My name is David Martin Rius and I have just published this project on GitHub: https://github.com/davidmartinrius/speech-dataset-generator/

Now you can create datasets automatically from any audio file or list of audio files.

I hope you find it useful.

Here are the key functionalities of the project:

  1. Dataset Generation: Creation of multilingual datasets with Mean Opinion Score (MOS).

  2. Silence Removal: It includes a feature to remove silences from audio files, enhancing the overall quality.

  3. Sound Quality Improvement: It improves the quality of the audio when needed.

  4. Audio Segmentation: It can segment audio files within specified second ranges.

  5. Transcription: The project transcribes the segmented audio, providing a textual representation.

  6. Gender Identification: It identifies the gender of each speaker in the audio.

  7. Pyannote Embeddings: Utilizes pyannote embeddings for speaker detection across multiple audio files.

  8. Automatic Speaker Naming: Automatically assigns names to speakers detected in multiple audios.

  9. Multiple Speaker Detection: Capable of detecting multiple speakers within each audio file.

  10. Store speaker embeddings: The speakers are detected and stored in a Chroma database, so you do not need to assign a speaker name.

  11. Syllabic and words-per-minute metrics

Feel free to explore the project at https://github.com/davidmartinrius/speech-dataset-generator

David Martin Rius

Great Work

Nice work, guys. I will be keeping an eye on this as it progresses. Currently I find its accuracy to be far off, but in time I bet this project will get better and better.

It is super fast!! Can't wait to see where it goes

feature request: compare with more ASR engines

Hi all, just to say thank you for the benchmark.
I propose adding more speech recognition engines to your tests. Below are some engines to add to your benchmark:

  1. WIT.ai
    official API doc: https://wit.ai/docs/http/20170307#post__speech_link
    https://www.liip.ch/en/blog/speech-recognition-with-wit-ai

  2. IBM Watson speech to text
    official API doc: https://www.ibm.com/watson/services/speech-to-text/
    https://www.pragnakalp.com/speech-recognition-speech-to-text-python-using-google-api-wit-ai-ibm-cmusphinx/

  3. Microsoft Cognitive Service Speech To text
    official API doc: https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/

  4. Kaldi
    official API doc: https://kaldi-asr.org/doc/about.html

thanks again
giorgio

Text Enhancement

Hi!

I've tried Cheetah briefly in the past and I could only get capitalized letters and spaces for the transcription output. The cloud providers featured on this benchmark offer a transcription with punctuation and appropriate letter capitalization.

Is it still the case? If so, maybe add this information to the README?

Thanks.

WER?

The PicovoiceCheetahASREngine is super fast, but it is not accurate based on my tests.

Is it suitable to use Levenshtein distance as the WER?

From Wikipedia

The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level.
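To make the distinction concrete, here is a small illustrative sketch (not the benchmark's code) that applies the same edit-distance routine at the character level and at the word level, using one of the pairs listed below:

def edit_distance(a, b):
    # Works on any sequence: a string (characters) or a list of words.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[len(b)]

ref = "just like an organ"
hyp = "just like in organ"
print(edit_distance(ref, hyp))                  # 1 character edit
print(edit_distance(ref.split(), hyp.split()))  # 1 word edit -> WER 1/4 = 0.25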

Here are some transcripts from Common Voice's cv-valid-dev set.

from "go into your dance" to "go into gour dance"
from "when you start to eat like this something is the matter" to "when you start to eat like this something is the matter"
from "i had seen all that it would presently bring me" to "i have feen all that it would presently bring me"
from "he moved about invisible but everyone could hear him" to "he moved abut and vasible but everyone could hear him"
from "mr lee can't be bothered now" to "mister lincal be bothered now"
from "the shepherd swore that he would" to "the shepherd swore that he would"
from "it must have fallen while i was sitting over there" to "i must have fallnwhile i was sitting over there"
from "just like an organ" to "just like in organ"
from "and the solid part was called the philosopher's stone" to "and the solid park was caurds af alows of her sto"
from "raisins are delicious" to "raisoned arylitious"
from "get the governor on the phone" to "got the governor on the fo"
from "so then try he said to the englishman" to "so then try he said to the englishman"
from "i thought about whether we should find coins and models in it and so on" to "i thought about whether of we should find coins and lodtels in it and to on"
from "the angel touched the man's shoulder and they were both projected far into the future" to "the ageel touched lhe ment showter and thet eiere both projectted far into the futur"
from "lots of places sell tea around here the merchant said" to "the lass of pases selt he aroud here thand much in se"
from "everyone on earth has a treasure that awaits him his heart said" to "onper one on perds has a triture that the waiys hom hes hart said"
from "i'm beginning to like this" to "i am beginning to like this"
from "all they wanted was food and water" to "all they wanted witsh food and water"
from "it has happened many times before" to "it does not many danse before"
from "but most important he was able every day to live out his dream" to "but most important he was able every day to live pout his dream"
from "because you have already lost your savings twice" to "because you have already lost hor savings twice"
from "whenever he could he sought out a new road to travel" to "whenever he colld he sai tout and yow rod to trallo"
from "he was about the same age and height as the boy" to "he was about the same age and height as the"
from "drawing from my own experience as a learner of english and german i value engaging activities that involve everyday conversation" to "rong from mor own experience as the lone of english and german i tout jenin gatehing at tomatis that involve every day contersation"
from "they set off running wildly into the trees" to "they set off fonning wild leans hrough the trees"
from "but finally the merchant appeared and asked the boy to shear four sheep" to "the finally the merchant oppeared in asked the boy to sheer for sheep"
from "the merchant looked anxiously at the boy" to "the merchon non anxiously af the bo"
from "the nurse waddled around the ward" to "the nervese watld around the ward"
from "where did he keep his money" to "where do he caep his momey"
from "i learned the alchemist's secrets in my travels" to "i learned the alchemistsecrets in my trave"
from "everything in life is an omen said the englishman now closing the journal he was reading" to "averything in life is an omen said the englishman now closing the journal he was realing"
from "the boy was startled" to "the boy wast storted"
from "half an hour later his shovel hit something solid" to "al aleyre lihter he shavell hit subething sollid"
from "i heard a faint movement under my feet" to "thhard faint movement under my feet"
from "it is i the boy answered" to "it is i the boy answered"
from "its been a long time since she last read chekhov and because of that she no longer feels like the heroine of her own story" to "tin along time since she last read checkel and because of that she no longer fiel like the herror one of her ens story"
from "i learned how to care for sheep and i haven't forgotten how that's done" to "i learn how to kare for sheep and i haven't forgotten 'll dats then"
from "they placed the symbols of the pilgrimage on the doors of their houses" to "e place the sembols of the tigrimice on the doors of their houses"
from "i heard a faint movement under my feet" to "i heard a faint movement on the my fet"
from "he could always go back to being a shepherd" to "he could alwas go back to being a shepherd"
from "its lower end was still embedded" to "it's lower and wesstill imbidded"
from "hundreds must have seen it and taken it for a falling star" to "hundreds must have seen it and taken in for a falling stan"
from "as the sun rose the men began to beat the boy" to "as the sun rosed man began to peat the boy"
from "i feel i ought to take care of her" to "i feel i aught to take car of her"
from "i have to find a man who knows that universal language" to "i have to find a mann  who knows that univrsal lang"
from "the burning fire had been extinguished" to "the burning far had been extinguished"
from "how come you speak spanish he asked" to "i'l gons biaks that anish he sk"
from "i thought tonight i'd put miss kelly there" to "i thought tonig to id but miss carly there"
from "the cursor blinked expectantly" to "the curser bliked expectantly"
from "the boy mumbled an answer that allowed him to avoid responding to her question" to "the boy mumbled and onser that a loued him to avoide was ponding to her question"
from "fresh coffee is much better than the freeze dried stuff" to "fresh coffee is much better then the freezes drives stuff"
from "but you will love her and she'll return your love" to "but you were love her and she'll return your love"
from "i think they're going to last for a long time he said to the monk" to "i think they're going to last for a long time he said to the mon"
from "and then he perceived it very slowly" to "and hany perceived it very slowly"
from "my wife pointed out to me the brightness of the red green and yellow signal lights" to "my wife pointed out to me the brightness of the red green and yellow signal lights"
from "they become the soul o f the world" to "then come to saw of the warl"
from "what's going on here" to "what's going on here"
from "the alchemist knocked on the gate of the monastery" to "is the alchemist nocked on the gait of the monastery"
from "her manipulation failed" to "ere minipulation fai"
from "why the shots stopped after the tenth no one on earth has tried to explain" to "why te shot stopped after the tenth no one on earth has tried ting thran"
from "that's the man who knows all the secrets of the world she said" to "that's the man hur knows all the secrets of the word she is said"
from "they reached the center of a large plaza where the market was held" to "there reach the sentr af a large plase awere the market was helt"
from "he was thinking about omens and someone had appeared" to "he was thinking about omens and someone had appeared"
from "no sense messing up the streets" to "no fhende metting up the street"
from "that's what i'm not supposed to say" to "but if what i've not tre posed to fary"
2018-08-08 15:53:18,429 WER = 0.336652

Compare against Mozilla DeepSpeech TFLite and update to current DeepSpeech?

Heads up, Mozilla DeepSpeech v0.7.0 has been out for a bit with a word error rate of 5.97%.

There is also a TensorFlow Lite model weighing in at 45 MB; not sure if this was overlooked earlier, but it has been around since January 2019. This TFLite model is likely a better thing to compare Picovoice Leopard & Cheetah against if the focus is on low-resource devices.

I hope your team is faring well in spite of this pandemic!
