Tooling for producing an Italian model (public release available) for DeepSpeech and a text corpus.

License: GNU General Public License v3.0


deepspeech-italian-model's Introduction

DeepSpeech Italian Model

A collection of the tools used to generate a machine learning model for the Italian language from the Common Voice project. You can find us on Telegram with our bot @mozitabot in the Developers group, where we coordinate and discuss development, or on the forum.


License

The code and the scripts are released under the GPLv3 license, while the released models are under CC0, i.e. in the public domain.

Rules

  • Tickets and pull requests in English
  • Readme in Italian

Requirements

Python 3.7+

Quick Start

Use Colab:


Open In Colab

or:

   # Create and activate a virtual environment
   python3 -m venv $HOME/tmp/deepspeech-venv/
   source $HOME/tmp/deepspeech-venv/bin/activate

   # Install DeepSpeech
   pip3 install deepspeech==0.9.3

   # Download and extract the files for the Italian model (check the latest released version!)
   curl -LO https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/download/2020.08.07/model_tensorflow_it.tar.xz
   tar xvf model_tensorflow_it.tar.xz

   # Or use the Italian model with transfer learning from the English one (check the latest released version!)
   curl -LO https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/download/2020.08.07/transfer_model_tensorflow_it.tar.xz
   tar xvf transfer_model_tensorflow_it.tar.xz

   # Extract a random sample from the cv_tiny dataset
   wget -c https://github.com/MozillaItalia/DeepSpeech-Italian-Model/files/4610711/cv-it_tiny.tar.gz -O - | tar -xz common_voice_it_19574474.wav

   # Transcribe (MONO audio, WAV format, sampled at 16000 Hz)
   deepspeech --model output_graph.pbmm --scorer scorer --audio common_voice_it_19574474.wav

Differences between the pure Italian model and the transfer learning one

Since 08/2020 we release the model in two versions: the pure one, trained only on Italian-language datasets (listed in the release), and the transfer learning version.
The second version applies transfer learning from the official English-language model released by Mozilla, which includes other datasets besides Common Voice for a total of more than 7000 hours of material. This model has proven much more reliable at recognition, given the few hours of Italian audio available to us at the moment.

Development

Corpora for the language model

The MITADS folder contains all the scripts used to generate the MITADS text corpus. For more information, refer to the corresponding README.

Model training

Refer to the README in the DeepSpeech folder for the documentation needed to build the Docker image used to train the acoustic and language models.

Generating the model with Colab Open In Colab

Refer to the README in notebooks.

How to develop with DeepSpeech

Refer to our wiki (work in progress), which contains links and other material.

Resources

deepspeech-italian-model's People

Contributors

alex179ohm, dag7dev, danieltinazzi, dependabot[bot], eliatolin, eziolotta, gianluigimemoli, ilyasmg, jotaro-sama, lissyx, mone27, mte90, nefastosaturo, paolo-losi, pascaldr, xrmx


deepspeech-italian-model's Issues

Remove # from dataset


I Initializing variables...
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:42 | Steps: 45 | Loss: 184.312358
Epoch 0 | Validation | Elapsed Time: 0:00:08 | Steps: 31 | Loss: 165.173247 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
I Saved new best validating model with loss 165.173247 to: /mnt/checkpoints/best_dev-45
Epoch 1 |   Training | Elapsed Time: 0:00:39 | Steps: 45 | Loss: 171.796456
Epoch 1 | Validation | Elapsed Time: 0:00:07 | Steps: 31 | Loss: 162.115896 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 1 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
WARNING:tensorflow:From /home/trainer/ds-train/lib/python3.6/site-packages/tensorflow/python/training/saver.py:960: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
W0925 18:24:51.043290 140513397847872 deprecation.py:323] From /home/trainer/ds-train/lib/python3.6/site-packages/tensorflow/python/training/saver.py:960: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
I Saved new best validating model with loss 162.115896 to: /mnt/checkpoints/best_dev-90
Epoch 2 |   Training | Elapsed Time: 0:00:39 | Steps: 45 | Loss: 150.745584
Epoch 2 | Validation | Elapsed Time: 0:00:07 | Steps: 31 | Loss: 136.187367 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 2 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
I Saved new best validating model with loss 136.187367 to: /mnt/checkpoints/best_dev-135
Epoch 3 |   Training | Elapsed Time: 0:00:39 | Steps: 45 | Loss: 127.614623
Epoch 3 | Validation | Elapsed Time: 0:00:08 | Steps: 31 | Loss: 123.730088 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 3 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
I Saved new best validating model with loss 123.730088 to: /mnt/checkpoints/best_dev-180
Epoch 4 |   Training | Elapsed Time: 0:00:39 | Steps: 45 | Loss: 114.725798
Epoch 4 | Validation | Elapsed Time: 0:00:08 | Steps: 31 | Loss: 115.417479 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 4 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
I Saved new best validating model with loss 115.417479 to: /mnt/checkpoints/best_dev-225
Epoch 5 |   Training | Elapsed Time: 0:00:39 | Steps: 45 | Loss: 104.686398
Epoch 5 | Validation | Elapsed Time: 0:00:08 | Steps: 31 | Loss: 136.464502 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 5 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
I Early stop triggered as (for last 4 steps) validation loss: 136.464502 with standard deviation: 8.535361 and mean: 125.111644
I FINISHED optimization in 0:04:52.388711
E While processing /mnt/extracted/data/cv-it/clips/common_voice_it_17894238.wav:
E   "ERROR: Your transcripts contain characters (e.g. '#') which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your [train,dev,test].csv transcripts, and then add all these to data/alphabet.txt."

Support on text corpus generation for python threads

I am wondering if we can speed up the scripts for the text corpus generation using Python threads.

This is something we can do once all the scripts are working, so we can adapt them to split their work between reading and cleaning the data.

Our estimate is that it can take around 4 hours now.

[output cleaning] ted_importer.py

These are some problems I've found looking at the ted_importer.py output. I'll list them starting from the most serious, at least for me :)

code issues:

  • clean_log() is not defined
  • bs4 lxml parser: bs4.FeatureNotFound is raised; it works with BeautifulSoup(data, 'html.parser')

output issues:

  • symbols: ♪ ; ♫ ; T ∇ Sτ ; E=mc² ; 31¼%

  • html escape: (sanitize.py escapehtml() could be useful):

& 
"

  • some unknown characters, e.g.:
Nessuno aveva mai studiato l'�involucro
L'�immagine alle mie spalle mostra
  • Some sentences start with 00; it seems that newlines split numbers:
Ed ecco come userò il mio premio di 10
00 dollari
  • accents: E' vs È , ò vs o´
falo´
È un infuso creato 
  • when there is an apostrophe, the next word follows without a space. There are a lot of these:
E'il drone
E'l'applicazione
E'stata
realta'virtuale
E'più
L'immagine che venne un po'dopo aveva una spiegazione semplice
  • invalid Roman numerals:
La natura nel senso del IXX secolo, giusto
VV: Sì, tre persone sono scese sul fondo dell'Oceano Pacifico

and related to this last example: a lot of sentences start with 2 letters followed by ":"

AC: Se dimagrisci un po'
AG: Ci sono certamente implicazioni tecniche
ZF: Scendi tu dal palco
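
A possible starting point for the last two cases (a hedged sketch, not part of ted_importer.py yet): strip the two-letter speaker prefixes and normalize the "E'" spelling of the accented È:

    import re

    SPEAKER_PREFIX = re.compile(r"^[A-Z]{2}:\s*")   # e.g. "AC: Se dimagrisci un po'"
    WRONG_ACCENT = re.compile(r"\bE'\s*")           # "E'il drone" -> "È il drone"

    def clean_sentence(sentence):
        sentence = SPEAKER_PREFIX.sub("", sentence)
        sentence = WRONG_ACCENT.sub("È ", sentence)
        return sentence.strip()

    print(clean_sentence("AC: Se dimagrisci un po'"))   # Se dimagrisci un po'
    print(clean_sentence("E'il drone"))                 # È il drone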

Remove lingualibre

In order to simplify development of the Italian model I propose to remove the lingualibre dataset.
The reason is that the Italian lingualibre data is only 4 minutes long, so it is not very useful for improving the dataset.
The main issue, as pointed out in #17, is that the maximum test batch size is 16 due to the small lingualibre dataset, and I have not found an easy way to specify it correctly in the Docker image (i.e. make test_batch_size and batch_size different). Moreover, it would force the use of a suboptimal test batch size for the other dataset.

Not clear how to do a simple speech recognition

It would be great if the instructions in the README were dumb-proof.

I just tried to follow them and the results were nonsensical.

It may well be due to an error on our side or to the environment (WSL), but looking at the release I suspect that some data is missing (I just strictly followed what's in the README).

Parallelize OpenSubtitles exporter

The exporter currently has two problems that severely limit its parallelization:

  • first and foremost, each time it submits a job to the thread pool, it waits for it to finish before submitting the next one (total_lines += future.result());
  • with Python, threads are not so effective at parallelization because of the GIL (unless your jobs are I/O bound, which is not the case here), so a process pool should be used instead of a thread pool.
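
A minimal sketch of the suggested fix, assuming the per-chunk work lives in a function like process_chunk (hypothetical name): submit every job up front, collect the results afterwards, and use a process pool so the CPU-bound cleaning is not serialized by the GIL:

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def process_chunk(chunk):
        # Hypothetical stand-in for the exporter's CPU-bound cleaning of one chunk of
        # subtitle lines; it returns how many lines it would keep.
        return len([line for line in chunk if line.strip()])

    def export_all(chunks):
        total_lines = 0
        with ProcessPoolExecutor() as pool:
            # Submit everything first instead of waiting on each future in turn.
            futures = [pool.submit(process_chunk, chunk) for chunk in chunks]
            for future in as_completed(futures):
                total_lines += future.result()
        return total_lines

    if __name__ == "__main__":
        print(export_all([["riga uno", "", "riga due"], ["riga tre"]]))  # 3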

Migrate Deepspeech scripts to Italian

We need to migrate all the bash scripts and the Dockerfile, replacing the French references (both in files and in parameters) with Italian ones.
Right now, localizing the README in that folder into English or Italian is not a priority: https://github.com/MozillaItalia/commonvoice-it/tree/master/DeepSpeech
The scripts in that folder download various packages from other resources (like lingualibre) to add more data to the model generation, then package and generate the model for deepspeech.

Until this is done we cannot generate the Italian model to use with deepspeech.

Migrate Common Voice data

Missing files:

  • CommonVoice-Data/names.py (needs data from an Italian source, and to be properly patched to parse Italian data) #4
  • CommonVoice-Data/libretheatre.py (needs data from an Italian source, maybe has to be rewritten entirely) #5
  • CommonVoice-Data/wikipedia.py (done)
  • CommonVoice-Data/wikisource.py (needs an Italian translation of the book "Le forceures de blocus", has to be rewritten to be able to scrape the Italian book) #5
  • CommonVoice-Data/framabook.py #5
  • CommonVoice-Data/utils.py (needs to be adapted to the Italian language)

Originally posted by @alex179ohm in #2 (comment)

MITADS - Transcript roman numbers

We have the issue that the text corpus includes Roman numerals; we need to convert them to regular numbers, but also to spot false positives and so on.

We need a way to detect Roman numerals without matching other text that happens to use those letters.
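
As a starting point, a sketch of strict Roman numeral detection (it rejects malformed tokens such as "IXX" or speaker tags like "VV") plus conversion to Arabic digits:

    import re

    ROMAN = re.compile(r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")
    VALUES = {"M": 1000, "D": 500, "C": 100, "L": 50, "X": 10, "V": 5, "I": 1}

    def roman_to_int(token):
        if not ROMAN.match(token):
            return None  # not a well-formed Roman numeral
        total = 0
        for char, following in zip(token, token[1:] + " "):
            value = VALUES[char]
            total += -value if VALUES.get(following, 0) > value else value
        return total

    print(roman_to_int("XIX"))  # 19
    print(roman_to_int("IXX"))  # None: invalid, as found in the corpus
    print(roman_to_int("VV"))   # None: likely a speaker tag, not a numeral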

Support for development

Hi everyone,
I would like to know whether there is a forum, chat or other official or unofficial channel where we can discuss the development of Italian DeepSpeech and give/get support during training or while developing a custom language model.

Thanks

Change Mitads to another name on public release

The purpose of this ticket is to find a name for the text corpus we are working on, which will be used as a reference everywhere, probably also outside this project.
Basically it will be like a brand name; it is important that it is easy to recognize the original/author (at least for me).

I personally don't like the MITADS name because it is difficult to pronounce and to understand what it means.
Here I propose some other random ideas:

  • MozIta
  • MozItaDS
  • ItaSpeech (best for me)
  • ItaDS

Looking forward to hearing your input

Originally posted by @mone27 in #36 (comment)

Generate an Italian text corpus

After discussing with the community (especially @paolo-losi), the model needs a better text corpus; right now the available ones have issues with the licensing we need:
CC0 or public domain, or a CC license allowing commercial use. We want to release just the scripts and not the final dataset, to avoid any trouble.
The point is to create a corpus not from Wikipedia, encyclopedic content, manuals and so on, but from colloquial resources like chats, discussions/emails and quotes, which are closer to what voice recognition actually needs.

So we need to replace https://github.com/MozillaItalia/DeepSpeech-Italian-Model/blob/master/DeepSpeech/it/build_lm.sh#L12 with something else.

So our idea is to generate a static txt file on the fly with a billion words from this kind of text.
We need material from 1920 onwards to get a more modern Italian.
For every resource we need sanitization and cleaning to remove symbols and other unneeded stuff.

Workflow

  • Create a new folder MITADS = Mozilla Italia DeepSpeech (we can change the codename for the corpus, refer to #65)
  • For every source, create an exporter that sanitizes the data to remove symbols (<>/ etc.) and chapter titles, and generates a single txt file
    • Gutenberg
    • Wikiquote
    • Opensubtitle
    • Wikisource
    • Eulogos
    • An.ANA.S.
    • QALL-ME
    • TED transcription
    • Wacky
  • Tool to aggregate all of them and remove duplicates

What tools should we use?

Considering that the deepspeech model is built on a Linux machine we could use Bash, but it is not very fast, so we will use Python.
Also, this corpus doesn't need to be regenerated at every model generation: we generate it once for all of them.

Migrate book testers

There are different scripts in CommonVoice-Data that are used to download stuff in CC0 and test the model generated:

  • wikisource.py (needs to be migrated to an Italian book)
  • libretheatre.py (replace with something else)
  • framabook.py (replace with something else)
  • project-gutenberg.py (adapt to the Italian language)
  • assemble nationale (replace with something else)
  • bano (replace with something else)

We need to replace them because they don't exist in Italian, so we can create similar aggregators from:

Different errors in bash scripts

Hi there,
(do I have to write in English or can I write in Italian?)

I have tested the docker instance and I've found these errors:

  • file "import_lingualibre.sh" line 7:
    wget https://lingualibre.fr/datasets/Q385-ita-Italian.zip -O /mnt/source/lingua_libre_Q385-ita-Italian_train.zip

change "/mnt/source/" into "/mnt/sources/"

  • file "generate_alphabet.sh" line :11,12,13
    the execution of this script ends with:
    sed -i s/#//g '/mnt/extracted/data/*test.csv'
    sed: can't read /mnt/extracted/data/*test.csv: No such file or directory

my workaround was to specify the directories like this:
sed -i 's/#//g' /mnt/extracted/data/cv-it/clips/*test.csv
sed -i 's/#//g' /mnt/extracted/data/cv-it/clips/*train.csv
sed -i 's/#//g' /mnt/extracted/data/cv-it/clips/*dev.csv
sed -i 's/#//g' /mnt/extracted/data/lingualibre/*test.csv
sed -i 's/#//g' /mnt/extracted/data/lingualibre/*train.csv
sed -i 's/#//g' /mnt/extracted/data/lingualibre/*dev.csv

  • file "build_lm.sh" line 46:
    seems that the endpoint doesn't exist
    + rm /mnt/lm/lm.arpa
    + '[' '!' -f /mnt/lm/trie ']'
    + curl -sSL https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.master.ba56407376f1e1109be33ac87bcb6eb9709b18be.cpu/artifacts/public/native_client.tar.xz
    + pixz -d
    + tar -xf -
    can not seek in input: Illegal seek
    Not an XZ file
    tar: This does not look like a tar archive
    tar: Exiting with failure status due to previous errors

browsing that url gives: ResourceNotFound

P.S.
Every time pixz returns:
can not seek in input: Illegal seek
hope this is only a warning
P.P.S.
I've tried to format this thread as well as possible but it seems I can't... sorry if it is too chaotic.

Hope to help in some way

Regards
Massimo

Training pipeline

edit 2: never mind the earlier edit about changing the batch size to 128, it crashes

I think it is better to define a training pipeline, as the official DeepSpeech releases do.

We don't have the same amount of hours and video cards as the DeepSpeech team, so let's start with the 0.6 version hyperparameters.

I was thinking of some kind of pipeline to apply either when training a model from scratch or when starting from a pretrained checkpoint (transfer learning). What do you think?

PIPELINE 1 (with 0.6 hyperparameters from the fr repo)

I step

  • generate the scorer with LM_ALPHA and LM_BETA = 0

  • EPOCHS=30
    BATCH_SIZE=64
    N_HIDDEN=2048
    LEARNING_RATE=0.0001
    DROPOUT=0.4
    EARLY_STOP
    ES_EPOCHS (early stop after)=10
    MAX_TO_KEEP=3 (we can keep more checkpoints when we have more disk space)
    DROP_SOURCE_LAYERS=1 (if using transfer learning)
    USE_AUTOMATIC_MIXED_PRECISION (if training from scratch)

II step:

  • use LM_OPTIMIZER to search good ALPHA and BETA values
  • MAX_ALPHA=5 MAX_BETA=5 MAX_ITER=600

III step:

  • EPOCHS=30
    BATCH_SIZE=64
    N_HIDDEN=2048
    LEARNING_RATE=0.00001 (lower LR)
    DROPOUT=0.4
    EARLY_STOP
    ES_EPOCHS=10
    MAX_TO_KEEP=3
    DROP_SOURCE_LAYERS=1 (if using transfer learning)
    USE_AUTOMATIC_MIXED_PRECISION (if training from scratch)

or:

PIPELINE 2

I step

  • generate the scorer with LM_ALPHA and LM_BETA = 0

  • EPOCHS=100
    BATCH_SIZE=64
    N_HIDDEN=2048
    LEARNING_RATE=0.0001
    DROPOUT=0.4
    EARLY_STOP
    ES_EPOCHS (early stop after)=25 (default value)
    MAX_TO_KEEP=3
    REDUCE_LR_ON_PLATEAU=1 (when learning got stuck, LR will be reduced)
    PLATEAU_EPOCHS=10 (default; number of epochs to consider for reduce-LR-on-plateau, smaller than ES_EPOCHS)
    DROP_SOURCE_LAYERS=1 (if using transfer learning)
    USE_AUTOMATIC_MIXED_PRECISION (if training from scratch)

II step:

  • use LM_OPTIMIZER to search good ALPHA and BETA values
  • MAX_ALPHA=5 MAX_BETA=5 MAX_ITER=600

[output cleaning] qallme_importer.py

Found those issues:

  • random "???" strings
  • accents like these: e` a` o` i` u`, e.g.:
    sai dire il nome della squadra che giochera` contro il Pine`

Those could be fixed by adding some regex rules to the mapping_normalization list:

instead of removing only the single square brackets:
[re.compile('\[.*?\]'), u'']
and then:

[re.compile('a`'), u'à']
[re.compile('u`'), u'ù']
[re.compile('i`'), u'ì']
[re.compile('o`'), u'ò']

and about e`:

[re.compile('perche`'), u'perché']
[re.compile(' ne`'), u'né']
.. list of other words that need é instead of è ...
[re.compile(' e`'), u'è']
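
A minimal sketch of how such a list could be applied, assuming mapping_normalization is a list of (compiled regex, replacement) pairs as in the fragments above (the real qallme_importer.py may apply it differently):

    import re

    mapping_normalization = [
        [re.compile(r"\[.*?\]"), u""],
        [re.compile(r"a`"), u"à"],
        [re.compile(r"u`"), u"ù"],
        [re.compile(r"i`"), u"ì"],
        [re.compile(r"o`"), u"ò"],
        [re.compile(r"perche`"), u"perché"],
        [re.compile(r" e`"), u" è"],
    ]

    def normalize(sentence):
        for pattern, replacement in mapping_normalization:
            sentence = pattern.sub(replacement, sentence)
        return sentence

    print(normalize("sai dire il nome della squadra che giochera` contro il Pine`"))
    # -> sai dire il nome della squadra che giocherà contro il Pine`
    # ("e`" without a leading space is left alone, since it may map to either é or è)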

Archive audio+text to download

List of resources we could use to add more datasets for DeepSpeech (maybe generating a custom dataset based on the Common Voice dataset organization - there is a sample in the readme - or building it on the fly to avoid license issues):

Check also: #34

Otherwise we can evaluate these tools to generate a dataset based on YouTube:

Another solution is to use https://github.com/srinivr/kaldi-long-audio-alignment with the Italian model to automatically split text+audio into small fragments to speed things up.

The most important part is that the data needs to be aggregated to avoid license issues; this means the files need to be all together and it must not be possible to recreate the original files.

incompatibility of the italian model with deepspeech 0.5.1

deepspeech version 0.5.1, installed with pip in a fresh virtualenv, cannot properly
load the Italian model. It seems that the kenlm version used to train the model is more recent
than the version linked against deepspeech 0.5.1.

$ deepspeech --model italian/output_graph.pbmm --audio test.wav --lm italian/lm.binary --trie italian/trie --alphabet italian/alphabet.txt 
Loading model from file italian/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
2019-11-13 11:02:34.378997: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-13 11:02:34.387460: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-11-13 11:02:34.387495: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-11-13 11:02:34.387508: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-11-13 11:02:34.387609: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.0109s.
Loading language model from files italian/lm.binary italian/trie
Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.
Loaded language model in 0.0439s.
Running inference.
Error running session: Not found: PruneForTargets: Some target nodes not found: initialize_state 
Segmentation fault (core dumped)

[output cleaning] eulogos importer

I think we have a lot of issues here.

There are issues related to the syntax used by users in chat rooms, messages with fancy characters from chatbots, and symbols used for text graphics. I'll just leave some recurring examples here.

Chat room info messages:

DCC CHAT ip <ip.address.format.here>

CoTiDie is [email protected] * CoTiDieMoRi

U:\>tracert 195.94.177.137
Tracing route to TARAS [195.94.177.137]
over a maximum of 30 hops:
e questa è la sua stazione
NetBIOS Remote Machine Name Table
Name               Type         Status
TARAS            UNIQUE      Registered\
EULOGOS          GROUP       Registered
TARAS            UNIQUE      Registered
TARAS            UNIQUE      Registered
EULOGOS          GROUP       Registered
FRANCESCO        UNIQUE      Registered
MAC Address = 52-54-AB-DD-22-98

wolvie is AWAY since Mon Oct 13 11:00:29 1997 Reason: 4OKKUPATO: lavoro

El_Diablo ----==>>>>------>12,11 EL-GRECO  ----==>>>>------>

] [Time/0h 0m] [Log/On] [Page/On]

free-join 1,15 -The Most Advanced Script Ever Seen-
free-join 15,15  14,14                                  15,15
free-join 15,15  14,14  14,1-16=14º15 14°14S15ho16wD15ow14N 14P15r16O14°15,1 14º16=14-14,14  15,15
free-join 15,15  14,14                                  15,15

E_D-away Set Away: Tuesday 10/14/97 Pager: On MsgLog: Off Beeper: Off Reason: not chatting
Type /ctcp E_D-away PAGE REASON to get my attention

CI-WUGY (AWAY:scripting...) gØne since: 4:34pm

<U+0081> -=º °ShowDowN v6.5 PrO° º=- <U+0081>

Junes [^?Auto-Set Back^?] at (2:47:51pm) Away for: 46mins 28secs -LOG OFF-

/ctcp nextphase This_Is_Not_A_Fucking_CTCP___This_Is_A_CoCoNuts_Island_CTCP_ :D

Deth got [<U+008D>(8Lemon)<U+008D>] [<U+008D>(8Lemon)<U+008D>] [<U+008D>(--2-7---)<U+008D>]

Symbols and text noise made by nicknames or by chatting:

ciao a tutti ;9

.... .... 

\\olverin pensa che sia il caso di sganciare

c e qualcuno????????????????????????????,,

kimy [Pitch], usate le query please

/msg drago ciao

adios agua.......................- - ->

Pannella usa la tromba§

Mannaccia **§§°ç°ç§é*ç°é*ç°é°çé°ç°

etupensicheseiofossiilpresidentedellajamaicastareiacazzeggiarequicontuttelefighechecisonola'???????????????'  :> [08]

/ ___| |_ _|    / \     / _ \  | |
| |      | |    / _ \   | | | | | |
\____| |___| /_/   \_\  \___/  (_)

italia .'..'.

Que|o io avrei il crack del kali

/Msg LiveFast, sono il suo agente

che ca^?^?o hai capito

different languages:

nothing je t ai dit que je suis la
If Anyone Speaks Too Long Texts He Will Be Kicked
hello th^?^?e girls

Improve download status for Mitads

This is what happens with parlareitaliano:

...17066%, 0 MB, 58052 KB/s, 0 seconds passedDownloading in ./parsing/parlareitaliano/b01f001f.hsw
...12047%, 0 MB, 45343 KB/s, 0 seconds passedDownloading in ./parsing/parlareitaliano/b01f003f.hsw
...24824%, 0 MB, 59074 KB/s, 0 seconds passedDownloading in ./parsing/parlareitaliano/b01f005f.hsw

We need to test the status bar of all the importers, because sometimes it doesn't work flawlessly.
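
As a hedged sketch of what a more robust progress callback could look like, assuming the importers download with urllib.request.urlretrieve-style report hooks (an assumption about the current code):

    import sys
    import urllib.request

    def report_progress(block_num, block_size, total_size):
        # total_size is -1 or 0 when the server does not send Content-Length,
        # which is one way percentages like "17066%" can appear.
        downloaded = block_num * block_size
        if total_size > 0:
            percent = min(100, downloaded * 100 // total_size)
            sys.stdout.write("\r...%d%%, %d KB" % (percent, downloaded // 1024))
        else:
            sys.stdout.write("\r...%d KB downloaded" % (downloaded // 1024))
        sys.stdout.flush()

    # Hypothetical usage inside an importer:
    # urllib.request.urlretrieve(url, destination, reporthook=report_progress)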

Trie file Generation

Hi all,

thanks for the hard work that you've put into creating this fork. I'm in the process of creating a reduced dictionary to control a robot, but I'm having some issues with the trie file generation. I've tried to generate the trie file using the generate_trie script of the 0.7.0a1 native client without success. I've even tried to run it on the lm_binary that comes with the "2020.03.13" release but it still fails to recognize words. This is, for instance, the result if I try to get "cinque sei sette otto nove dieci" recognized:

example

Obviously everything works fine if I use the lm.binary and trie provided together with the model.

Define general rules for all sentences

I think it is important to define some rules for processing sentences from all importers.
These checks can either be done in the wrapper script or in sanitize.py (this can be more efficient).
My proposal is:

  • everything should be converted to lowercase
  • if a sentence matches the regex [^\s'abcdefghijklmnopqrstuvwxyzàèéìíòóôùú,\.!?:;] it contains invalid characters and should be discarded (see the sketch below)

The discarding will be done after trying to clean the sentences (e.g. removing trailing dashes or unescaping HTML).
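
A minimal sketch of the proposed rules (not yet in sanitize.py), using the whitelist regex above:

    import re

    # Lowercase everything, then discard sentences that still contain characters
    # outside the allowed set.
    INVALID_CHARS = re.compile(r"[^\s'abcdefghijklmnopqrstuvwxyzàèéìíòóôùú,\.!?:;]")

    def keep_sentence(sentence):
        sentence = sentence.lower()
        return None if INVALID_CHARS.search(sentence) else sentence

    print(keep_sentence("È un infuso creato"))  # è un infuso creato
    print(keep_sentence("E=mc² non va bene"))   # None: contains characters outside the whitelist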

Document the versioning

Just write in the readme about how we are releasing the model.

My idea is 2019.2-0.1: this means using the whole set of scripts with the 2nd CV Italian dataset, but released as version 0.1 because of different testing or other reasons.
What do you think? @astrastefania @mone27

Wrapper script for corpus generation

The idea is a script that executes the others and, at the end, generates a single txt file, performing some sanitization on it (see the sketch after the list), such as:

  • Remove duplicate lines; right now I have 3 lines with "Respira."
  • Remove empty lines
  • Remove all the lines that contain characters matching [^A-Za-z0-9àÈèìéòù,;:'.! ]
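
A sketch of the wrapper step described above (the exporter output file names are hypothetical): concatenate the exporter outputs, drop empty and duplicate lines, and discard lines containing characters outside the allowed set:

    import re

    INVALID = re.compile(r"[^A-Za-z0-9àÈèìéòù,;:'.! ]")
    EXPORTER_OUTPUTS = ["gutenberg.txt", "wikiquote.txt", "opensubtitles.txt"]  # assumed names

    seen = set()
    with open("mitads_corpus.txt", "w", encoding="utf-8") as corpus:
        for path in EXPORTER_OUTPUTS:
            with open(path, encoding="utf-8") as source:
                for line in source:
                    line = line.strip()
                    if not line or line in seen or INVALID.search(line):
                        continue
                    seen.add(line)
                    corpus.write(line + "\n")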

New books from Librivox

New package Clips-Mitads and Importer for Clips dataset

Ref: http://www.clips.unina.it/it/index.jsp

Tasks:

For the first 3 steps we need to parse the txt of every recording to generate a single CSV, package this CSV with all the wav files, and remove the rest of the files.

The new package name is Clips-Mitads, just as a reference.

CSV to create

wav_filename,wav_filesize,transcript
common_voice_it_19574474.wav,175148,ben degna di ammirazione
common_voice_it_19574387.wav,291884,noi possiamo benissimo non ritrovarci in quello che facciamo

Unfinished scripts: https://gist.github.com/Mte90/116e5d8a17973b7bd9bd9050662736dd

  • The CSV is missing the wav filesize (see the sketch below)
  • The extraction of the rar needs to avoid overwrites and take the files from the "etichettate" folder if it exists
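
A sketch for the missing wav_filesize column, assuming one .txt transcript next to each .wav file (a hypothetical layout; the real Clips archive structure may differ):

    import csv
    import os
    from pathlib import Path

    clips_dir = Path("clips")  # assumed folder with the extracted wav + txt files

    with open("clips_mitads.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav in sorted(clips_dir.glob("*.wav")):
            transcript = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
            writer.writerow([wav.name, os.path.getsize(wav), transcript])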
