Comments (9)
Even if the texts of the examples are the same, the speakers may be different.
I think different speakers can be useful even if they say the same thing.
In the example you mention (Pirandello's Mattia Pascal) the speaker is the same (both clips are derived from the same LibriVox clips), but the segments are different: the MLS one is longer, so they are not duplicates.
I think it's hard to find real duplicates, so we could keep them ...?
from deepspeech-italian-model.
To solve the apostrophe bug in M-AILABS and MLS, we would need to parse both strings (original and normalized).
I've made this fix, along with other changes; I'll open a PR soon.
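A minimal sketch of the idea of parsing both strings, not the actual fix from the PR: it assumes the original text still carries the apostrophes and the normalized text has dropped them without inserting a space. The function name `restore_apostrophes` is hypothetical.

```python
import re

def restore_apostrophes(original: str, normalized: str) -> str:
    """Re-insert apostrophes lost in normalization, using the original text as reference."""
    fixes = {}
    # Find words such as "dell'anima" (straight or typographic apostrophe) in the original.
    for word in re.findall(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)+", original.lower()):
        word = word.replace("\u2019", "'")
        # Key is the collapsed form the normalizer produced, e.g. "dellanima".
        fixes[word.replace("'", "")] = word
    # Replace every normalized token that matches a collapsed form.
    return " ".join(fixes.get(tok, tok) for tok in normalized.split())


print(restore_apostrophes("Dell'anima e d'amore.", "dellanima e damore"))
```

One limitation of this sketch: if the same sentence contains both a legitimate word and its apostrophized twin (e.g. loro and l'oro), every occurrence gets replaced, so such cases would still need a positional alignment or a manual check.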
In M-AILABS my fix works fine (we have the original text!).
For MLS we may need to reuse the raw M-AILABS data, as you say. I'll try...
M-AILABS list of apostrophe errors:
mailabs_fixed_token.txt
Starting from mailabs_fixed_token, I tried to detect the problematic MLS books.
So far I have checked:
Verga, Novelle, "Vita dei campi", book id: 656
656_Verga_Novelle.zip
Pascoli, Myricae, book id: 1590
1590_pascoli.zip
Machiavelli, Il Principe, book id: 10624 <--- I was thinking of discarding this one; there are too many Latinisms
Each zip file contains a different set of around 50 wrong words. Some of them already have a correction; most of them don't.
There is also a file listing sentences with strange behaviour (strange characters, bigger errors such as words run together without spaces, and so on). I will check those tokens in a future step.
If you can, please choose one set or subset and fill in the correct words. That would be awesome!
The format is:
wrong_word,correct_word
e.g.:
dellanima,dell'anima
damore,d'amore
unaltro,un altro
If you think a token could be ambiguous (e.g. loro,l'oro), please flag it with SKIP:
loro,l'oro,SKIP
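To make the format concrete, here is a rough sketch of how a correction list like this could be read and applied. It assumes one pair per line with an optional third SKIP column; the helper names `load_corrections` and `apply_corrections` are made up for illustration and are not part of the importer.

```python
import csv
import io

def load_corrections(text):
    """Split correction rows into usable fixes and SKIP-flagged (ambiguous) ones."""
    corrections, skipped = {}, {}
    for row in csv.reader(io.StringIO(text)):
        row = [cell.strip() for cell in row]
        # Ignore rows that have no correction filled in yet.
        if len(row) < 2 or not row[0] or not row[1]:
            continue
        wrong, right = row[0], row[1]
        if len(row) > 2 and row[2].upper() == "SKIP":
            skipped[wrong] = right
        else:
            corrections[wrong] = right
    return corrections, skipped

def apply_corrections(sentence, corrections):
    """Replace each whitespace-separated token that has a known correction."""
    return " ".join(corrections.get(tok, tok) for tok in sentence.split())


sample = "dellanima,dell'anima\ndamore,d'amore\nunaltro,un altro\nloro,l'oro,SKIP\n"
corrections, skipped = load_corrections(sample)
print(apply_corrections("loro parlano di unaltro", corrections))
```

Tokens flagged with SKIP are kept in a separate dictionary so they can be reviewed by hand instead of being replaced blindly.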
@nefastosaturo I'll take the first one: Verga, Novelle, id=656.
Et voilà, I think I've done everything (I hope it's correct): 656_Verga_Novelle.zip
To check all the texts in MLS, the CSV generated by the importer may help:
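As a possible way to use that CSV, the sketch below scans the transcripts for tokens already known to be wrong. It assumes the usual DeepSpeech importer columns (`wav_filename`, `wav_filesize`, `transcript`); the function name `find_suspect_rows` and the sample rows are hypothetical.

```python
import csv
import io

def find_suspect_rows(csv_text, bad_tokens):
    """Return (wav_filename, token) pairs for every transcript token found in bad_tokens."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for tok in row["transcript"].split():
            if tok in bad_tokens:
                hits.append((row["wav_filename"], tok))
    return hits


# Made-up rows, only to show the expected shape of the importer output.
sample = (
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,parlava dellanima\n"
    "b.wav,2000,un giorno\n"
)
print(find_suspect_rows(sample, {"dellanima", "damore"}))
```

For the real file, the same function could be fed the contents of the extracted CSV read with `open(..., encoding="utf-8")`.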
train_full.zip
On M-AILABS there are other examples to exclude:
- the transcription does not match the spoken words :-(
- the audio is truncated before the end of the transcription
(folder mix\novelle_per_un_anno_06)
novelle06_16_pirandello_f000028
novelle06_16_pirandello_f000029
novelle06_16_pirandello_f000030
novelle06_16_pirandello_f000031
novelle06_16_pirandello_f000032
novelle06_16_pirandello_f000033
novelle06_16_pirandello_f000034
novelle06_16_pirandello_f000035
novelle06_16_pirandello_f000036
novelle06_16_pirandello_f000037
novelle06_16_pirandello_f000038
novelle06_16_pirandello_f000039
novelle06_16_pirandello_f000040
novelle06_16_pirandello_f000041
novelle06_17_pirandello_f000387
I was able to find them because three of them were filtered out by the importer (see the too_short audio check),
then I checked the whole novelle06_16 and novelle06_17 blocks by hand.
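The clips listed above could be dropped at import time with a small exclusion set. This is only a sketch: `is_excluded` is a hypothetical helper, not something the importer currently provides, and the example path layout is assumed.

```python
from pathlib import PurePath

# f000028 .. f000041 from novelle06_16, plus the single clip from novelle06_17.
EXCLUDED = {f"novelle06_16_pirandello_f{n:06d}" for n in range(28, 42)}
EXCLUDED.add("novelle06_17_pirandello_f000387")

def is_excluded(wav_path):
    """True if the clip id (file name without extension) is in the exclusion set."""
    # Normalize Windows-style separators so the stem is extracted on any OS.
    return PurePath(wav_path.replace("\\", "/")).stem in EXCLUDED


print(is_excluded(r"mix\novelle_per_un_anno_06\novelle06_16_pirandello_f000028.wav"))
```

Keeping the ids in one set makes it easy to extend the list when more bad blocks turn up.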