
Comments (9)

eziolotta commented on May 25, 2024

Even if the texts of two examples are identical, the speakers may be different.
I think different speakers are useful even when they say the same thing.

In the example you mention (Pirandello's Mattia Pascal), the speaker is the same (both clips derive from the same LibriVox source), but the segments differ: the MLS one is longer, so they are not duplicates.

I think real duplicates are hard to find. Could we keep them?

from deepspeech-italian-model.

eziolotta commented on May 25, 2024

To solve the apostrophe bug in M-AILABS and MLS, we would need to parse both strings (original and normalized).
I have made this fix, along with other changes, and will open a PR soon.
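
For illustration only, here is a minimal sketch of what comparing the two strings could look like. It is not the fix mentioned above: the function name `restore_apostrophes` and the whitespace tokenization are assumptions. The idea is to collect every apostrophe-bearing word from the original text, then use its collapsed form to repair the normalized text.

```python
import re

def restore_apostrophes(original, normalized):
    """Hypothetical helper: put back apostrophes that normalization dropped.

    Builds a map from the collapsed form ("dellanima") to the form found in
    the original text ("dell'anima"), then rewrites the normalized tokens.
    """
    restored = {}
    # Words made of letters joined by a straight or typographic apostrophe.
    for word in re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)+", original.lower()):
        word = word.replace("’", "'")
        restored[word.replace("'", "")] = word
    return " ".join(restored.get(tok, tok) for tok in normalized.split())

fixed = restore_apostrophes("L'ombra dell'anima, d'amore.",
                            "lombra dellanima damore")
print(fixed)  # l'ombra dell'anima d'amore
```

One limitation of this sketch: if a sentence contains both "l'oro" and "loro", every "loro" would be rewritten, so ambiguous tokens still need a manual check.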


eziolotta commented on May 25, 2024

In M-AILABS my fix works fine (we have the original text!).
In MLS we may need to reuse the raw M-AILABS data, as you say. I'll try...


eziolotta commented on May 25, 2024

List of apostrophe errors in M-AILABS:
mailabs_fixed_token.txt


nefastosaturo commented on May 25, 2024

So starting from the mailabs_fixed_token, I tried to detect the problematic MLS books.

Right now I have checked:

Verga, Novelle, "Vita dei campi", book id: 656
656_Verga_Novelle.zip
Pascoli, Myricae, book id: 1590
1590_pascoli.zip
Machiavelli, Il Principe, book id: 10624 <--- I was thinking of discarding this one: there are too many Latinisms

In each zip file you'll find a different set of around 50 wrong words. Some of them already have a correction, but most don't.

There is also a file listing sentences with strange behaviour (odd characters, bigger errors such as words run together without spaces, and so on). I will check those tokens in a later step.

If you can, please pick one set or subset and fill in the correct words. That would be awesome!

The format is one pair per line, the wrong token followed by the corrected one, separated by a comma, e.g.:

dellanima,dell'anima
damore,d'amore
unaltro,un altro

If you think a token could be ambiguous (e.g. loro,l'oro), please flag it with SKIP:

loro,l'oro,SKIP
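
As a rough sketch of how such a corrections file could be consumed, the snippet below parses the comma-separated pairs, ignores rows flagged with SKIP, and applies the rest to a whitespace-tokenized sentence. The function names `load_corrections` and `apply_corrections` are made up for this example, and the sample rows are the ones quoted above.

```python
import csv

def load_corrections(lines):
    """Return {wrong: correct}, leaving out SKIP-flagged and incomplete rows."""
    corrections = {}
    for row in csv.reader(lines):
        row = [cell.strip() for cell in row]
        if len(row) < 2 or not row[0] or not row[1]:
            continue  # blank line or token with no correction yet
        if len(row) >= 3 and row[2].upper() == "SKIP":
            continue  # ambiguous token, needs a human decision
        corrections[row[0]] = row[1]
    return corrections

def apply_corrections(sentence, corrections):
    """Replace each whitespace-separated token that has a known correction."""
    return " ".join(corrections.get(tok, tok) for tok in sentence.split())

sample = [
    "dellanima,dell'anima",
    "damore,d'amore",
    "unaltro,un altro",
    "loro,l'oro,SKIP",
]
corrections = load_corrections(sample)
fixed = apply_corrections("loro parlano damore e dellanima", corrections)
print(fixed)  # loro parlano d'amore e dell'anima
```

Because "loro" is flagged SKIP, it is left untouched, which is the safe default for tokens that are valid words in both spellings.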


Sav22999 commented on May 25, 2024

@nefastosaturo I'll take the first one: Verga, Novelle, id=656.


Sav22999 commented on May 25, 2024

Et voilà, I think I've done everything (I hope it's correct): 656_Verga_Novelle.zip


eziolotta commented on May 25, 2024

To check all the MLS texts, the CSV generated by the importer may help.
train_full.zip


eziolotta commented on May 25, 2024

In M-AILABS there are more examples to exclude:

  • the transcription does not match the spoken words :-(
  • the audio is truncated before the end of the transcription

(folder mix\novelle_per_un_anno_06)
novelle06_16_pirandello_f000028
novelle06_16_pirandello_f000029
novelle06_16_pirandello_f000030
novelle06_16_pirandello_f000031
novelle06_16_pirandello_f000032
novelle06_16_pirandello_f000033
novelle06_16_pirandello_f000034
novelle06_16_pirandello_f000035
novelle06_16_pirandello_f000036
novelle06_16_pirandello_f000037
novelle06_16_pirandello_f000038
novelle06_16_pirandello_f000039
novelle06_16_pirandello_f000040
novelle06_16_pirandello_f000041

novelle06_17_pirandello_f000387

I found them because three of them were filtered out by the importer (see the too_short audio check);
I then checked the whole novelle06_16 and novelle06_17 blocks by hand.
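
A possible way to drop these clips from an importer CSV is sketched below. It assumes the usual DeepSpeech column layout (`wav_filename`, `wav_filesize`, `transcript`) and that the clip id is the file name without its extension. The sample paths and the helper name `drop_excluded` are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath

# Clip ids listed above: f000028..f000041 in block 16, plus one in block 17.
EXCLUDED = {f"novelle06_16_pirandello_f{n:06d}" for n in range(28, 42)}
EXCLUDED.add("novelle06_17_pirandello_f000387")

def drop_excluded(rows, excluded=EXCLUDED):
    """Keep only the rows whose clip id (file stem) is not in `excluded`."""
    kept = []
    for row in rows:
        # Normalize Windows separators so the stem is extracted reliably.
        path = PurePosixPath(row["wav_filename"].replace("\\", "/"))
        if path.stem not in excluded:
            kept.append(row)
    return kept

# Hypothetical sample rows, only to show the call.
sample_csv = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "mix/novelle_per_un_anno_06/wavs/novelle06_16_pirandello_f000028.wav,1000,testo uno\n"
    "mix/novelle_per_un_anno_06/wavs/novelle06_16_pirandello_f000027.wav,1000,testo due\n"
)
kept = drop_excluded(csv.DictReader(sample_csv))
print(len(kept))  # 1
```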

