Comments (14)

snakers4 commented on May 22, 2024

Now if we exclude "bad" files from here, we will get more interesting results.
I cannot say that all of these files have poor annotation, but the majority do.

share_results_v02.zip

image

from open_stt.

snakers4 commented on May 22, 2024

@akreal
As you requested.
Some results from a reasonably well-trained model (CSV and feather formats):
share_results_1.zip

You can open the feather file like this:

import pandas as pd
df = pd.read_feather('../data/share_results_1.feather')

Overall, this model is not overfitted and there is no post-processing yet.

akreal commented on May 22, 2024

Perfect, thank you!

snakers4 commented on May 22, 2024

As you can see, the model is not fully fitted yet (we are still in the exploratory phase),
but it already works perfectly on some of the easier datasets.

image

Obviously, I excluded the following datasets from the file:

  • private
  • TTS - they are too easy, they are for diversity only
  • ASR datasets - because you cannot really use them for validation

snakers4 commented on May 22, 2024

We have almost finished collecting v05 and searching hyperparameters; new benchmarks and new data will be posted soon.

m1ckyro5a commented on May 22, 2024

@snakers4 What model did you use for the benchmark?

snakers4 commented on May 22, 2024

@m1ckyro5a
A wav2letter-inspired fork of a fork of deep speech pytorch.

m1ckyro5a commented on May 22, 2024

@snakers4 How about deepspeech2? Which model is better?

snakers4 commented on May 22, 2024

It is hard to tell yet.
For us, performance is currently limited more by the data than by the model.
Of course, we compared some models side by side (CNN, RNN), only to find that RNNs were a bit better for the same number of weight updates, but slower in general.

Some benchmarks we ran on LibriSpeech:
network_bench.xlsx

snakers4 commented on May 22, 2024

From now on I will give the benchmark files a bit more structure:

  • Path
  • Annotation / prediction
  • CER, WER
  • File path in the file db

Please note that the exclusion files (#7) were previously based on these benchmarks as well.

All charts contain CER
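
For reference, CER and WER are commonly computed as an edit distance divided by the reference length. The following is a minimal sketch under that assumption, using plain Levenshtein distance and no text normalization; the function names are illustrative and are not taken from the open_stt code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        return 0.0 if len(hyp) == 0 else 1.0
    return edit_distance(ref, hyp) / len(ref)


def cer(annotation, prediction):
    """Character error rate: distance over characters."""
    return error_rate(annotation, prediction)


def wer(annotation, prediction):
    """Word error rate: distance over whitespace-separated words."""
    return error_rate(annotation.split(), prediction.split())
```

With this definition both rates can exceed 1.0 when the prediction is much longer than the annotation, which is worth keeping in mind when reading the tails of the charts.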

Dataset benchmark v05

File

File

Model

CNN trained with CTC loss
Tuning with phonemes

Youtube

TED talks are much cleaner
youtube

Audio books

Notice the second normal bump
books

TTS

tts

Academic datasets

academic

ASR datasets

Pranks are very noisy by default
asr

Radio

Quite good fit as well
radio

Strict exclude file for distillation

An idea on how to set thresholds:

CLEAN_THRESHOLDS = {
    # very strict conditions, datasets are clean, no problem
    'tts_russian_addresses_rhvoice_4voices':0.2,
    'private_buriy_audiobooks_2':0.1,
    
    # strict conditions, datasets vary
    'public_youtube700':0.2,
    'public_youtube1120':0.2,
    'public_youtube1120_hq':0.2,
    'public_lecture_1':0.2,
    'public_series_1':0.2,
    
    # strict conditions, dataset mostly clean
    'radio_2':0.2,

    # very strict conditions, datasets are dirty
    'asr_public_phone_calls_1':0.2,
    'asr_public_phone_calls_2':0.2,
    'asr_public_stories_1':0.2,
    'asr_public_stories_2':0.2,
    
    # mostly just to filter outliers
    'ru_tts':0.4,
    'ru_ru':0.4,
    'voxforge_ru':0.4,
    'russian_single':0.4
}
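
One way such thresholds could be applied is sketched below. It assumes that a file is excluded when its CER is above the threshold of its dataset, and that each benchmark row carries `dataset`, `path` and `cer` fields; those field names, the helper `build_exclude_list` and the sample rows are hypothetical and only illustrate the idea.

```python
# Subset of the thresholds above, repeated so the sketch is self-contained.
CLEAN_THRESHOLDS = {
    'public_youtube700': 0.2,
    'ru_tts': 0.4,
}


def build_exclude_list(rows, thresholds):
    """Return paths of files whose CER is above their dataset's threshold.

    Datasets without a threshold are left untouched.
    """
    excluded = []
    for row in rows:
        limit = thresholds.get(row['dataset'])
        if limit is not None and row['cer'] > limit:
            excluded.append(row['path'])
    return excluded


# Hypothetical benchmark rows, not real results.
rows = [
    {'dataset': 'public_youtube700', 'path': 'a.wav', 'cer': 0.05},
    {'dataset': 'public_youtube700', 'path': 'b.wav', 'cer': 0.35},
    {'dataset': 'ru_tts', 'path': 'c.wav', 'cer': 0.35},
    {'dataset': 'unknown_dataset', 'path': 'd.wav', 'cer': 0.90},
]
exclude = build_exclude_list(rows, CLEAN_THRESHOLDS)
```

Here the same CER of 0.35 excludes a file from `public_youtube700` but keeps one from `ru_tts`, which is the point of having per-dataset thresholds.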

snakers4 commented on May 22, 2024

One more comment: the model was not overfitted; it was selected for the best generalization.

vadimkantorov commented on May 22, 2024

https://ru-open-stt.ams3.digitaloceanspaces.com/benchmark_v05_public.csv.zip is in fact a gzip-compressed file (not a zip-compressed one), so one should decompress it with zcat benchmark_v05_public.csv.zip > benchmark_v05_public.csv

unzipping fails with:

 $ unzip benchmark_v05_public.csv.zip
Archive:  benchmark_v05_public.csv.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of benchmark_v05_public.csv.zip or
        benchmark_v05_public.csv.zip.zip, and cannot find benchmark_v05_public.csv.zip.ZIP, period.

after gzip-decompression the first line contains some weird stuff:

$ head -n 1 benchmark_v05_public.csv
data/dataset_cleaning/benchmark_v05_public.csv0000644000175000001441656463430613513563560021050 0ustar  kerasusers
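
The `ustar` marker in that first line is the magic string of a tar header, which suggests the file is a gzip-compressed tar archive rather than a bare gzipped CSV. Below is a sketch that assumes this reading is correct and that the first regular file in the archive is the CSV; `extract_first_file` is an illustrative helper, not part of any existing tool.

```python
import shutil
import tarfile


def extract_first_file(archive_path, out_path):
    """Copy the first regular file of a .tar.gz archive to out_path.

    Returns the member's name inside the archive, or None if the archive
    holds no regular file.
    """
    with tarfile.open(archive_path, mode='r:gz') as tar:
        for member in tar:
            if member.isfile():
                with tar.extractfile(member) as src, open(out_path, 'wb') as dst:
                    shutil.copyfileobj(src, dst)
                return member.name
    return None
```

Under the same assumption, calling `extract_first_file('benchmark_v05_public.csv.zip', 'benchmark_v05_public.csv')` should produce a CSV without the header debris, and `tar -xzf benchmark_v05_public.csv.zip` should do the same from the shell.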

johnnych7027 commented on May 22, 2024

Hi! Which datasets have speaker labels?
Is there any information on which release will include the speaker labels?
Thanks a lot!

snakers4 commented on May 22, 2024

We decided not to update and / or maintain these for reasons.
