Comments (14)
Now if we exclude "bad" files from here, we will get more interesting results.
I cannot say that all of these files have poor annotation, but the majority do.
from open_stt.
@akreal
As you requested.
Some results of a more or less well trained model (csv+feather formats).
share_results_1.zip
You can open the feather file like this:
import pandas as pd
df = pd.read_feather('../data/share_results_1.feather')
Overall, this model is not overfitted and there is no post-processing yet.
Perfect, thank you!
As you can see, the model is not fully fitted yet (we are still in the exploratory phase),
but it already works perfectly on some of the easier datasets.
Obviously, I exclude the following datasets from the file:
- private
- TTS - they are too easy, they are for diversity only
- ASR datasets - because you cannot really use them for validation
We have almost finished collecting v05 and searching hyper-params, and will be posting new benchmarks and new data soon.
@snakers4 What model did you use for the benchmark?
@m1ckyro5a
A wav2letter-inspired fork of a fork of deep speech pytorch.
@snakers4 How about deepspeech2? Which model is better?
It is hard to tell yet.
For us, performance is currently limited more by the data than by the model.
Of course, we compared some models side by side (CNN, RNN), only to find that RNNs were a bit better for the same number of weight updates, but slower in general.
Some benchmarks we ran on LibriSpeech:
network_bench.xlsx
From now on I will give the benchmark files a bit more structure:
- Path
- Annotation / prediction
- CER, WER
- File path in the file db
Please note that the exclusion files (#7) were previously based on these benchmarks as well.
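The thread does not say how the CER and WER columns are computed. A minimal sketch, assuming the usual definition (Levenshtein edit distance divided by the reference length, over characters for CER and over whitespace-separated words for WER), could look like this. The function names are illustrative and are not taken from the project code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(annotation, prediction):
    """Character error rate: character edits divided by annotation length."""
    if not annotation:
        return float(len(prediction) > 0)
    return edit_distance(annotation, prediction) / len(annotation)


def wer(annotation, prediction):
    """Word error rate: word edits divided by the number of annotation words."""
    ref_words = annotation.split()
    hyp_words = prediction.split()
    if not ref_words:
        return float(len(hyp_words) > 0)
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Both rates can exceed 1.0 when the prediction is much longer than the annotation, which is worth keeping in mind when reading thresholds against them.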
All charts contain CER
Dataset benchmark v05
File
Model
CNN trained with CTC loss
Tuning with phonemes
Youtube
Audio books
TTS
Academic datasets
ASR datasets
Pranks are very noisy by default
Radio
Strict exclude file for distillation
An idea on how to set thresholds:
CLEAN_THRESHOLDS = {
    # very strict conditions, datasets are clean, no problem
    'tts_russian_addresses_rhvoice_4voices': 0.2,
    'private_buriy_audiobooks_2': 0.1,
    # strict conditions, datasets vary
    'public_youtube700': 0.2,
    'public_youtube1120': 0.2,
    'public_youtube1120_hq': 0.2,
    'public_lecture_1': 0.2,
    'public_series_1': 0.2,
    # strict conditions, dataset mostly clean
    'radio_2': 0.2,
    # very strict conditions, datasets are dirty
    'asr_public_phone_calls_1': 0.2,
    'asr_public_phone_calls_2': 0.2,
    'asr_public_stories_1': 0.2,
    'asr_public_stories_2': 0.2,
    # mostly just to filter outliers
    'ru_tts': 0.4,
    'ru_ru': 0.4,
    'voxforge_ru': 0.4,
    'russian_single': 0.4,
}
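One possible way to apply such thresholds is sketched below. It assumes that the values are upper bounds on per-file CER, that the bound is inclusive, and that each benchmark row carries its dataset name and CER under the keys `dataset` and `cer`. None of this is confirmed in the thread; the key names and the `select_clean` helper are hypothetical.

```python
# Shortened copy of the thresholds; the full dict would be used in practice.
CLEAN_THRESHOLDS = {
    'public_youtube700': 0.2,
    'ru_tts': 0.4,
}


def select_clean(rows, thresholds, dataset_key='dataset', cer_key='cer'):
    """Keep rows whose CER does not exceed the threshold of their dataset.

    `rows` is any iterable of dicts, for example from csv.DictReader.
    Rows from datasets without a threshold are dropped (an assumption).
    """
    clean = []
    for row in rows:
        limit = thresholds.get(row[dataset_key])
        if limit is not None and float(row[cer_key]) <= limit:
            clean.append(row)
    return clean


rows = [
    {'dataset': 'public_youtube700', 'cer': '0.10'},
    {'dataset': 'public_youtube700', 'cer': '0.35'},
    {'dataset': 'ru_tts', 'cer': '0.35'},
]
clean_rows = select_clean(rows, CLEAN_THRESHOLDS)
```

Dropping rows from datasets that have no threshold is the conservative choice for a strict exclude file; keeping them instead would only need a default limit in `thresholds.get`.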
One more comment: the model was not overfitted; it was selected for the best generalization.
https://ru-open-stt.ams3.digitaloceanspaces.com/benchmark_v05_public.csv.zip is in fact a gzip-compressed file (not a zip-compressed one), so it should be decompressed with zcat benchmark_v05_public.csv.zip > benchmark_v05_public.csv
Unzipping fails with:
$ unzip benchmark_v05_public.csv.zip
Archive: benchmark_v05_public.csv.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of benchmark_v05_public.csv.zip or
benchmark_v05_public.csv.zip.zip, and cannot find benchmark_v05_public.csv.zip.ZIP, period.
After gzip decompression, the first line contains some unexpected content:
$ head -n 1 benchmark_v05_public.csv
data/dataset_cleaning/benchmark_v05_public.csv0000644000175000001441656463430613513563560021050 0ustar kerasusers
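The `ustar` string in that first line is the magic marker of a tar header, which suggests that the download is a gzip-compressed tar archive rather than a bare gzip-compressed CSV. Under that assumption, a minimal sketch for pulling the CSV bytes out with the Python standard library follows. The helper name `read_first_member` is illustrative.

```python
import tarfile


def read_first_member(source):
    """Return the bytes of the first regular file in a gzip-compressed tar.

    `source` is either a filesystem path or a binary file-like object.
    """
    if hasattr(source, 'read'):
        kwargs = {'fileobj': source}
    else:
        kwargs = {'name': source}
    with tarfile.open(mode='r:gz', **kwargs) as archive:
        for member in archive:
            if member.isfile():
                return archive.extractfile(member).read()
    raise ValueError('archive contains no regular files')
```

Calling `read_first_member('benchmark_v05_public.csv.zip')` should then return the raw CSV content if the assumption holds. On the command line, `tar -xzf benchmark_v05_public.csv.zip` would be the equivalent step.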
Hi! What datasets have speaker labels?
Is there any information on which release will include the speaker labels?
Thanks a lot!
We decided not to update and / or maintain these for reasons.