Below I will post some of the results on the public part of the dataset Both train

Some benchmarks on the datasets (open_stt, CLOSED)

snakers4 commented on May 22, 2024

Some benchmarks on the datasets

from open_stt.

Comments (14)

snakers4 commented on May 22, 2024

Now if we exclude "bad" files from here, we will get more interesting results.
I cannot say that all of these files have poor annotation, but the majority do.

share_results_v02.zip

snakers4 commented on May 22, 2024

@akreal
As you requested.
Some results of a more or less well trained model (csv+feather formats).
share_results_1.zip

You can open the feather file like this:

import pandas as pd
df = pd.read_feather('../data/share_results_1.feather')

Overall, this model is not overfitted and there is no post-processing yet.

akreal commented on May 22, 2024

Perfect, thank you!

snakers4 commented on May 22, 2024

As you can see, the model is not fully fitted yet (we are still in the exploratory phase), but it already works very well on some of the easier datasets.

I excluded the following datasets from the file:

- private datasets
- TTS datasets: they are too easy and are there for diversity only
- ASR datasets: you cannot really use them for validation

snakers4 commented on May 22, 2024

We have almost finished collecting v05 and searching hyper-parameters; we will be posting new benchmarks and new data soon.

m1ckyro5a commented on May 22, 2024

@snakers4 What model did you use for benchmark?

snakers4 commented on May 22, 2024

@m1ckyro5a
A wav2letter-inspired fork of a fork of deep speech pytorch.

m1ckyro5a commented on May 22, 2024

@snakers4 How about deepspeech2? Which model is better?

snakers4 commented on May 22, 2024

It is hard to tell yet.
Right now our performance is limited more by the data than by the model.
We did compare some models side by side (CNN, RNN) and found that RNNs were a bit better for the same number of weight updates, but slower in general.

Some benchmarks we ran on LibriSpeech:
network_bench.xlsx

snakers4 commented on May 22, 2024

From now on I will structure the benchmark files a bit. Each file contains:

- Path
- Annotation / prediction
- CER, WER
- File path in the file db

Please note that the exclusion files (#7) were previously based on these benchmarks as well.

All charts contain CER
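For readers who want to recompute the CER and WER columns from the annotation / prediction pairs, here is a minimal sketch using the standard edit-distance definitions. The thread does not say how the text was normalized before scoring, so the function names and the absence of any normalization here are assumptions, and the numbers may not match the published files exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(annotation, prediction):
    """Character error rate: character edits divided by annotation length."""
    return edit_distance(annotation, prediction) / max(len(annotation), 1)


def wer(annotation, prediction):
    """Word error rate: word edits divided by the number of annotation words."""
    ref_words = annotation.split()
    return edit_distance(ref_words, prediction.split()) / max(len(ref_words), 1)
```

Both rates can exceed 1.0 when the prediction is much longer than the annotation, which is worth keeping in mind when reading the tails of the charts.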

Dataset benchmark v05

File

Model

CNN trained with CTC loss
Tuning with phonemes

Youtube

TED talks are much cleaner

Audio books

Notice the second normal bump

TTS

Academic datasets

ASR datasets

Pranks are very noisy by default

Radio

Quite good fit as well

Strict exclude file for distillation

An idea on how to set thresholds:

CLEAN_THRESHOLDS = {
    # very strict conditions, datasets are clean, no problem
    'tts_russian_addresses_rhvoice_4voices': 0.2,
    'private_buriy_audiobooks_2': 0.1,

    # strict conditions, datasets vary
    'public_youtube700': 0.2,
    'public_youtube1120': 0.2,
    'public_youtube1120_hq': 0.2,
    'public_lecture_1': 0.2,
    'public_series_1': 0.2,

    # strict conditions, dataset mostly clean
    'radio_2': 0.2,

    # very strict conditions, datasets are dirty
    'asr_public_phone_calls_1': 0.2,
    'asr_public_phone_calls_2': 0.2,
    'asr_public_stories_1': 0.2,
    'asr_public_stories_2': 0.2,

    # mostly just to filter outliers
    'ru_tts': 0.4,
    'ru_ru': 0.4,
    'voxforge_ru': 0.4,
    'russian_single': 0.4,
}
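One way such per-dataset thresholds could be applied is sketched below with pandas. This is an illustration, not the code used to build the actual exclude file: it assumes the thresholds are upper bounds on CER, and the column names `dataset`, `cer` and `path`, the helper `select_clean`, and the tiny example frame are all hypothetical.

```python
import pandas as pd

# Two thresholds copied from the dictionary above, for illustration only.
EXAMPLE_THRESHOLDS = {'public_youtube700': 0.2, 'ru_tts': 0.4}


def select_clean(df, thresholds, dataset_col='dataset', cer_col='cer'):
    """Keep rows whose CER is at or below the threshold of their dataset.

    Rows from datasets that have no threshold are dropped.
    """
    limit = df[dataset_col].map(thresholds)  # NaN where no threshold is defined
    return df[df[cer_col] <= limit]          # comparisons against NaN are False


# Hypothetical benchmark rows (column names are assumptions).
df = pd.DataFrame({
    'path': ['a.wav', 'b.wav', 'c.wav', 'd.wav'],
    'dataset': ['public_youtube700', 'public_youtube700', 'ru_tts', 'unknown_set'],
    'cer': [0.05, 0.35, 0.30, 0.01],
})
clean = select_clean(df, EXAMPLE_THRESHOLDS)
excluded = df.drop(clean.index)  # candidates for an exclude file
```

Dropping rows from datasets without a threshold is a deliberately strict choice for this sketch; keeping them instead would only require filling the missing limits with a default value.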

snakers4 commented on May 22, 2024

One more comment: the model was not over-fitted; it was selected based on optimal generalization.

from open_stt.

vadimkantorov commented on May 22, 2024

https://ru-open-stt.ams3.digitaloceanspaces.com/benchmark_v05_public.csv.zip is in fact a gzip-compressed file (not a zip-compressed one), so one should decompress it with zcat benchmark_v05_public.csv.zip > benchmark_v05_public.csv

unzipping fails with:

 $ unzip benchmark_v05_public.csv.zip
Archive:  benchmark_v05_public.csv.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of benchmark_v05_public.csv.zip or
        benchmark_v05_public.csv.zip.zip, and cannot find benchmark_v05_public.csv.zip.ZIP, period.

after gzip-decompression the first line contains some weird stuff:

$ head -n 1 benchmark_v05_public.csv
data/dataset_cleaning/benchmark_v05_public.csv0000644000175000001441656463430613513563560021050 0ustar  kerasusers
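The `ustar` marker in that first line is the magic string of a tar header, which suggests the download is a gzip-compressed tar archive rather than a plain gzipped CSV. Under that assumption, a sketch like the following reads the CSV member directly with Python's `tarfile` module. The helper name `read_csv_from_targz` is hypothetical, and the CSV layout (delimiter, header row) is not described in this thread, so any `read_csv` options may need adjusting.

```python
import io
import tarfile

import pandas as pd


def read_csv_from_targz(archive_path, member_suffix='.csv', **read_csv_kwargs):
    """Read the first member ending in `member_suffix` from a gzip-compressed tar."""
    with tarfile.open(archive_path, mode='r:gz') as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(member_suffix):
                with tar.extractfile(member) as fh:
                    # Load the member into memory so pandas sees a seekable buffer.
                    return pd.read_csv(io.BytesIO(fh.read()), **read_csv_kwargs)
    raise FileNotFoundError(f'no {member_suffix} member in {archive_path}')
```

If the assumption holds, `read_csv_from_targz('benchmark_v05_public.csv.zip')` would return the benchmark table without the stray header bytes; extracting with `tar -xzf` on the command line should work as well.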

johnnych7027 commented on May 22, 2024

Hi! What datasets have speaker labels?
Is there any information in which release the speaker labels will be?
Thanks a lot!

snakers4 commented on May 22, 2024

We decided not to update and / or maintain these for reasons.
