Texts for the Ukrainian Text-to-Speech dataset

Overview

This repository contains scripts that generate the text part of Text-to-Speech datasets for Ukrainian.

Related works

Open Source Ukrainian Text-to-Speech datasets: https://github.com/egorsmkv/ukrainian-tts-datasets

Install

pip install --upgrade ukrainian-word-stress

pip install git+https://github.com/NeonBohdan/ukrainian-accentor-transformer.git

Data

The raw text files are in the data/raw folder.

Apply scripts to data

Extract appropriate sentences

python tools/extract_exclamations.py --in_file data/raw/boh_joho_batkiv.txt >> data/exclamations.txt
python tools/extract_exclamations.py --in_file data/raw/boyci_za_pravdu.txt >> data/exclamations.txt
python tools/extract_exclamations.py --in_file data/raw/chorna_rada.txt >> data/exclamations.txt
python tools/extract_exclamations.py --in_file data/raw/duma_mushketery.txt >> data/exclamations.txt
python tools/extract_exclamations.py --in_file data/raw/franko.txt >> data/exclamations.txt
python tools/extract_exclamations.py --in_file data/raw/leontovich_hronika_grechok.txt >> data/exclamations.txt

python tools/extract_questions.py --in_file data/raw/boh_joho_batkiv.txt >> data/questions.txt
python tools/extract_questions.py --in_file data/raw/boyci_za_pravdu.txt >> data/questions.txt
python tools/extract_questions.py --in_file data/raw/chorna_rada.txt >> data/questions.txt
python tools/extract_questions.py --in_file data/raw/duma_mushketery.txt >> data/questions.txt
python tools/extract_questions.py --in_file data/raw/franko.txt >> data/questions.txt
python tools/extract_questions.py --in_file data/raw/leontovich_hronika_grechok.txt >> data/questions.txt

python tools/extract_obvious.py --in_file data/raw/boh_joho_batkiv.txt >> data/obvious.txt
python tools/extract_obvious.py --in_file data/raw/boyci_za_pravdu.txt >> data/obvious.txt
python tools/extract_obvious.py --in_file data/raw/chorna_rada.txt >> data/obvious.txt
python tools/extract_obvious.py --in_file data/raw/duma_mushketery.txt >> data/obvious.txt
python tools/extract_obvious.py --in_file data/raw/franko.txt >> data/obvious.txt
python tools/extract_obvious.py --in_file data/raw/leontovich_hronika_grechok.txt >> data/obvious.txt
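The eighteen commands above follow one pattern: each extraction script is run over each raw file, and its output is appended to one file per script. As a convenience, that pattern could be driven from a small Python script. This is only a sketch: the names `build_jobs` and `run_jobs` are hypothetical and not part of this repository, and it assumes it is run from the repository root with the scripts and files listed above in place.

```python
import subprocess
import sys
from pathlib import Path

# Raw files, as listed in the commands above.
RAW_FILES = [
    "boh_joho_batkiv.txt",
    "boyci_za_pravdu.txt",
    "chorna_rada.txt",
    "duma_mushketery.txt",
    "franko.txt",
    "leontovich_hronika_grechok.txt",
]

# Extraction script -> file that collects its output.
EXTRACTORS = {
    "tools/extract_exclamations.py": "data/exclamations.txt",
    "tools/extract_questions.py": "data/questions.txt",
    "tools/extract_obvious.py": "data/obvious.txt",
}


def build_jobs(raw_dir="data/raw"):
    """Return (argv, output_file) pairs, one per script and raw file."""
    jobs = []
    for script, out_file in EXTRACTORS.items():
        for name in RAW_FILES:
            in_file = (Path(raw_dir) / name).as_posix()
            argv = [sys.executable, script, "--in_file", in_file]
            jobs.append((argv, out_file))
    return jobs


def run_jobs(jobs):
    """Run each job, appending stdout to its output file (like `>>`)."""
    for argv, out_file in jobs:
        with open(out_file, "ab") as out:
            subprocess.run(argv, stdout=out, check=True)
```

Calling `run_jobs(build_jobs())` from the repository root would then be equivalent to running the commands one by one. Because the output is appended, running it twice duplicates the extracted sentences, just as repeating the shell commands would.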

Add stresses

python tools/add_stresses.py --in_file datasets/unstressed.txt >> datasets/stressed.txt

python tools/add_stresses_only_csv_transformer.py --in_file data/exclamations.txt >> datasets/stressed/exclamations.csv
python tools/add_stresses_only_csv_transformer.py --in_file data/questions.txt >> datasets/stressed/questions.csv
python tools/add_stresses_only_csv_transformer.py --in_file data/obvious.txt >> datasets/stressed/obvious_3.csv
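For readers who want to experiment with stress marking outside these scripts, the sketch below shows one way to apply the ukrainian-word-stress package (installed in the Install step) line by line. It is an illustration under assumptions, not a description of how tools/add_stresses.py works: the helper names `stress_lines` and `make_stressifier` are hypothetical, and the `Stressifier` class is used with its default settings.

```python
def stress_lines(lines, stressify):
    """Apply `stressify` to every non-empty line and return the results.

    `stressify` is any callable that maps a sentence to its stressed form.
    Blank lines are skipped and surrounding whitespace is removed.
    """
    result = []
    for line in lines:
        text = line.strip()
        if text:
            result.append(stressify(text))
    return result


def make_stressifier():
    """Create a stressifier from the ukrainian-word-stress package.

    The import is done lazily so that `stress_lines` can be used and
    tested without the package (and its language models) being installed.
    """
    from ukrainian_word_stress import Stressifier

    return Stressifier()
```

With the package installed, something like `stress_lines(open("data/questions.txt", encoding="utf-8"), make_stressifier())` would return the stressed sentences as a list. Passing the stressifier in as a parameter also makes it easy to swap in a different accentor.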

Convert the dataset obtained from the Online Microphone [1] to a convenient format

python tools/prepare_dataset.py --raw_files ../done/lada/ --save_to ../dataset_lada

Notes

[1] The Online Microphone is proprietary software for recording speakers, made by Yehor Smoliakov.

Acknowledgements

Ukrainian word stress: https://github.com/lang-uk/ukrainian-word-stress
Ukrainian accentor: https://github.com/NeonBohdan/ukrainian-accentor-transformer
Wikisource: https://uk.wikisource.org
