I was checking these two dataset. The first thing that came in my mi

MLS and MAILABS: considerations and issues ( Have you seen my apostrophe?) about deepspeech-italian-model HOT 9 OPEN

nefastosaturo commented on May 25, 2024

MLS and MAILABS: considerations and issues ( Have you seen my apostrophe?)

from deepspeech-italian-model.

Comments (9)

eziolotta commented on May 25, 2024

Even if the texts of the examples are the same, the speakers may be different.
I think different speakers can be useful even if they say the same thing.

In the example you mention (Pirandello's Mattia Pascal), the speaker is the same (both clips are derived from the same LibriVox recording), but they are different segments: the MLS one is longer, so they are not duplicates.

I think it's hard to find real duplicates, so we could keep them...?

eziolotta commented on May 25, 2024

To solve the apostrophe bug in M-AILABS and MLS, we would need to parse both strings (original and normalized).
I made this fix, along with other changes, and I'll open a PR soon.
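
To illustrate the idea of parsing both strings, here is a minimal sketch. It is not the fix from the upcoming PR: the function names are hypothetical, and it assumes that the normalized text is lowercase, space-separated, and differs from the original only by the dropped apostrophes.

```python
import re

# Letters joined by one or more apostrophes (straight or typographic),
# e.g. "dell'anima". [^\W\d_] matches any Unicode letter.
ELIDED = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)+")

def apostrophe_fixes(original):
    """Map each broken token (apostrophe dropped) to its form in the original."""
    fixes = {}
    for token in ELIDED.findall(original.lower()):
        fixed = token.replace("\u2019", "'")
        fixes[fixed.replace("'", "")] = fixed
    return fixes

def restore_apostrophes(original, normalized):
    """Put back the apostrophes that normalization removed (hypothetical helper)."""
    fixes = apostrophe_fixes(original)
    return " ".join(fixes.get(tok, tok) for tok in normalized.split())

print(restore_apostrophes("Dell'anima e d'amore.", "dellanima e damore"))
```

Because the mapping is built per sentence, a sentence containing both "loro" and "l'oro" would still be ambiguous and would need a manual check.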

eziolotta commented on May 25, 2024

In M-AILABS my fix works fine (we have the original text!).
In MLS we may need to reuse the raw data of M-AILABS, as you say. I'll try...

eziolotta commented on May 25, 2024

M-AILABS list of apostrophe errors:
mailabs_fixed_token.txt

nefastosaturo commented on May 25, 2024

Starting from the mailabs_fixed_token list, I tried to detect the problematic MLS books.

Right now I have checked:

Verga, Novelle, "Vita dei campi", book id: 656
656_Verga_Novelle.zip
Pascoli, Myricae, book id: 1590
1590_pascoli.zip
Machiavelli, Il Principe, book id: 10624 <--- I was thinking of discarding this one, there are too many Latinisms

In each zip file you'll find a different set of around 50 wrong words. Some of them already have a correction, most of them don't.

There is also a file with sentences that look strange (odd characters, bigger errors such as words run together without spaces, and so on). I will check those tokens in a future step.

If you can, please choose one set or subset and fill in the correct words. That would be awesome!

The format is one pair per line: the wrong token, a comma, then the corrected form, e.g.:

dellanima,dell'anima
damore,d'amore
unaltro,un altro

If you think that a token could be ambiguous (e.g. loro,l'oro), please flag it with SKIP:

loro,l'oro,SKIP
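
For anyone who wants to apply the filled-in lists later, here is a rough sketch of how a file in this format could be read and applied. The function names are hypothetical, and it assumes transcripts are space-separated tokens; rows without a correction are ignored and rows flagged SKIP are kept apart for manual review.

```python
import csv

def load_corrections(lines):
    """Parse `wrong,correct[,SKIP]` rows into (corrections, skipped)."""
    corrections, skipped = {}, set()
    for row in csv.reader(lines):
        row = [field.strip() for field in row]
        if len(row) < 2 or not row[0] or not row[1]:
            continue  # blank line, or no correction filled in yet
        if len(row) > 2 and row[2].upper() == "SKIP":
            skipped.add(row[0])  # ambiguous token: leave it untouched
        else:
            corrections[row[0]] = row[1]
    return corrections, skipped

def apply_corrections(transcript, corrections):
    """Replace every token that has a known correction."""
    return " ".join(corrections.get(tok, tok) for tok in transcript.split())

sample = ["dellanima,dell'anima", "damore,d'amore", "unaltro,un altro", "loro,l'oro,SKIP"]
corrections, skipped = load_corrections(sample)
print(apply_corrections("loro parlano damore e dellanima", corrections))
```

The same `load_corrections` call works on an open file handle, since `csv.reader` accepts any iterable of lines.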

Sav22999 commented on May 25, 2024

@nefastosaturo I'll take the first one, Verga Novelle, id=656.

Sav22999 commented on May 25, 2024

Et voilà, I think I've done everything (I hope it's correct): 656_Verga_Novelle.zip

eziolotta commented on May 25, 2024

To check all the texts in MLS, the CSV generated by the importer may help.
train_full.zip
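
As a possible way to use that CSV, the sketch below counts hits of known broken tokens per book. It relies on two assumptions that should be checked against the actual file: the CSV has the usual DeepSpeech columns (`wav_filename`, `wav_filesize`, `transcript`), and the audio file names keep the MLS `speaker_book_segment` pattern. The function name is hypothetical.

```python
import csv
import os
from collections import Counter

def suspect_hits_per_book(csv_path, suspects):
    """Count occurrences of suspect tokens, grouped by MLS book id."""
    hits = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            stem = os.path.splitext(os.path.basename(row["wav_filename"]))[0]
            parts = stem.split("_")
            # Assumed naming: <speaker>_<book>_<segment>
            book_id = parts[1] if len(parts) >= 3 else "unknown"
            hits[book_id] += sum(tok in suspects for tok in row["transcript"].split())
    return hits
```

Calling `suspect_hits_per_book("train_full.csv", {"dellanima", "damore"}).most_common()` would then rank the books by number of hits, with `train_full.csv` standing in for whatever file the zip contains.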

eziolotta commented on May 25, 2024

On M-AILABS there are other examples to exclude:

the transcription does not match the spoken words :-(
the audio is truncated before the end of the transcription

(folder mix\novelle_per_un_anno_06)
novelle06_16_pirandello_f000028
novelle06_16_pirandello_f000029
novelle06_16_pirandello_f000030
novelle06_16_pirandello_f000031
novelle06_16_pirandello_f000032
novelle06_16_pirandello_f000033
novelle06_16_pirandello_f000034
novelle06_16_pirandello_f000035
novelle06_16_pirandello_f000036
novelle06_16_pirandello_f000037
novelle06_16_pirandello_f000038
novelle06_16_pirandello_f000039
novelle06_16_pirandello_f000040
novelle06_16_pirandello_f000041

novelle06_17_pirandello_f000387

I was able to find them because 3 of them were filtered out by the importer (see the too_short audio check),
then I checked by hand the whole novelle06_16 and novelle06_17 blocks.
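
A simple way to drop these clips could look like the sketch below. It assumes the rows come from a DeepSpeech-style CSV with a `wav_filename` column and that each file name (without extension) matches the ids listed above; the helper names are hypothetical.

```python
import os

# novelle06_16_pirandello_f000028 .. f000041, plus the single novelle06_17 clip
EXCLUDED = {f"novelle06_16_pirandello_f{n:06d}" for n in range(28, 42)}
EXCLUDED.add("novelle06_17_pirandello_f000387")

def is_excluded(wav_filename):
    """True if the clip id (file name without extension) is on the exclusion list."""
    stem = os.path.splitext(os.path.basename(wav_filename))[0]
    return stem in EXCLUDED

def drop_excluded(rows):
    """Keep only the rows whose audio is not on the exclusion list."""
    return [row for row in rows if not is_excluded(row["wav_filename"])]
```

Keeping the ids in one set makes it easy to extend the list if more mismatched or truncated clips turn up.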
