tomiinek / multilingual_text_to_speech
An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.
License: MIT License
What type of loss_total number, etc, should I be looking for to verify that things seem to be training correctly?
I'm currently at step: 3.792k, 3 hours 12 minutes, total loss 0.2972
Providing pre-phonetized content?
Is there a way to specify phonemes, stress, and vowel lengths for the inputs and skip the phonemizer step?
I'm looking for a way to synthesize two different Native American languages: Cherokee and Mohegan.
The Mohegan language is a stress based language, and I'm hoping the phonetics would map close enough to one or more of the languages already trained for the creation of language lesson materials.
The Cherokee language is a tone based language and this factor presents a challenge.
I'm wondering if it might be possible to "bootstrap" a language such as Cherokee using espeak-ng generated training audio and phonetics then use voice cloning for the actual output. (And skip training the vocoder on the espeak-ng audio).
Thank you for the great job.
I would like to retrain the wavernn vocoder (with the generated_training configuration) and I'm not sure how to proceed.
Have you used ground truth mel coefficients as inputs or have you used ground truth aligned mel coefficients?
In the latter case (gta): what is the sequence of commands to use? The gta.py script does not take the json parameters file as the train.py script does (via the hyper_parameter argument), so it seems to be usable only with the default (ljspeech) configuration. Moreover, it generates .npy files for the ground truth aligned mel coefficients with names that do not match the wav file names, whereas the preprocess.py script in the WaveRnn project expects the two files (wav and mel .npy) to have the same root.
Thanks again
Hello, I have a problem with the encoder of the synthesizer.
Did you use the original Tacotron 2 encoder, which includes convolution and LSTM layers?
Why do you use convolution and highway convolution in the encoder?
Thanks.
Hello
If I train on just two datasets, how should I set parameters such as generator_dim and generator_bottleneck_dim in generator_switching.json?
Hello, would it be possible for you to make Google Colab notebooks for training and synthesis with our own custom audio files uploaded to Google Drive? That would make the project easy to use on weak computers.
Note: the current Colab notebooks use pre-trained models and work perfectly, but it would be great to have notebooks for training and synthesizing with new models based on our own custom multilingual voices.
Thank you
Hello,
I spent a lot of time trying to understand how the WaveRNN by Tomiinek works so that I could retrain it myself, but I failed.
There is no documentation, and I tried a lot of things...
For the moment I am blocked after generating the GTAs because I am not able to link a GTA file to a WAV file.
Can you help me, please?
Thanks
Hello,
Thank you for making this code base available! this is absolutely fascinating stuff.
I am working on a research project where I need to produce accented English audio from custom text, i.e. I have audio of native speakers to which I want to "add an accent". I was able to successfully run the two Google Colab notebooks you provided and found that the model is able to output dynamically accented audio. However, I noticed that English is not included in the audio files that the models were trained on.
I want to add English as a supported language. One way I see to do so is to download the Common Voice English database or a subset of it, clean it into the format of your "cleaned" Common Voice dataset, and then follow the steps to train the models essentially from scratch. There are a couple of issues I see with my plan: a) your WaveRNN has been pre-trained on CSS10 data, which doesn't include English, so if that's an issue I might also have to train WaveRNN again; b) I am not entirely sure yet how to "clean" the Common Voice data; c) the Common Voice English database is 50 GB, which is too big.
Essentially, I am hoping there is a simpler way to fine-tune the existing models to support English. If you could provide me with some direction on this, I'd very much appreciate it!
Best,
Vic
Hello and first of all thank you for your work!
What would be the steps to add a language? In this case French from France (and not from Quebec).
Can I use the result from this Training colab ?
https://colab.research.google.com/drive/14X73UiywnoL9VS30iPDcX4WXxwZWv2e2
Thank you! :)
Is it possible to stop training then add additional speakers to train.txt and val.txt then resume training?
Will the added speakers show up as additional entries in the model param dict or will it halt with an error?
Thank you for your awesome paper & code.
I have a question about adversarial speaker classifier.
I think that, in training, the speaker classifier doesn't affect the rest of the model.
In training, the speaker classifier and the rest of the model are independent of each other, right?
If my understanding is right, what is the speaker classifier used for?
Thank you.
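The hyperparameter names that appear in configs elsewhere on this page (reversal_classifier, reversal_classifier_w, reversal_gradient_clipping) suggest the classifier is attached through a gradient reversal layer. Under that assumption, the two parts are not independent: the classifier is trained normally, while the encoder receives the classifier's gradient with its sign flipped, which pushes speaker information out of the encoder output. The toy scalar model below is a hypothetical illustration of that mechanism, not the repository's code; only the weight 0.125 is taken from the posted configs.

```python
# Toy scalar model (hypothetical, for illustration only):
#   encoder:    h = w * x
#   classifier: score = v * h, loss = (score - y) ** 2

def classifier_loss(w, v, x, y):
    h = w * x
    return (v * h - y) ** 2

def grads(w, v, x, y, reversal_weight):
    h = w * x
    dloss_dscore = 2.0 * (v * h - y)
    grad_v = dloss_dscore * h                # the classifier itself learns normally
    grad_h = dloss_dscore * v                # gradient arriving at the encoder output
    grad_w = -reversal_weight * grad_h * x   # reversal layer: flip sign and scale
    return grad_w, grad_v

w, v, x, y = 0.5, 1.0, 2.0, 3.0
lr = 0.01

before = classifier_loss(w, v, x, y)
grad_w, grad_v = grads(w, v, x, y, reversal_weight=0.125)
w_after = w - lr * grad_w                    # update the encoder only
after = classifier_loss(w_after, v, x, y)

# The encoder step makes the classifier's job harder, not easier.
print(before, after)
```

In this sketch a plain gradient step on the encoder weight raises the classifier loss, which is the intended adversarial effect.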
I have to use small batches, (size 10), because of memory constraints on my GPU.
How do I set the number of batches for gradient accumulation as part of the hyperparameters?
Or is this calculated automatically from "ideal batch" / "actual batch" somewhere?
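Nothing on this page says whether the training script exposes a gradient accumulation setting, so the following is only a generic sketch of the idea. It assumes a mean-reduced loss and equally sized micro-batches; the function names and the toy data are hypothetical. Each micro-batch gradient is scaled by 1/steps, so the accumulated gradient matches the gradient of one large batch.

```python
import math

def accumulation_steps(target_batch, micro_batch):
    """Number of micro-batches needed to cover one 'ideal' batch."""
    return math.ceil(target_batch / micro_batch)

def mse_grad(w, batch):
    """d/dw of mean((w * x - y) ** 2) over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

# 20 toy samples of y = 2x + 1; the "ideal" batch is all 20, the GPU fits 10.
data = [(float(i), 2.0 * i + 1.0) for i in range(1, 21)]
w = 0.3
micro = 10
steps = accumulation_steps(len(data), micro)

accumulated = 0.0
for k in range(steps):
    chunk = data[k * micro:(k + 1) * micro]
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += mse_grad(w, chunk) / steps

full = mse_grad(w, data)
print(steps, accumulated, full)
```

In a real training loop the optimizer step would be taken once every `steps` micro-batches, after the scaled gradients have been summed.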
Hi Tomiinek!
Thank You for your great Paper & Source Code.
I want to expand a language like Korean.
Also, I want to apply voice cloning in the Korean Language.
For voice cloning, I think the Korean dataset should be made of multi-speakers... Did I get it right??
If so, can you point me to some parts I can refer to? It would be very helpful.
Thank You.
Hi! I'm here to ask you a question about something strange.
I'm trying to train in a 4 x V100 environment, but every time an epoch ends, training crashes with an OOM error.
I think we have enough memory, do you have any idea why this is happening?
Hello, this project is very nice, and thank you for sharing it!
There is an error when I run train.py: RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)
Environment: the system is Windows 10, and everything else is consistent with requirement.txt.
I don't know what causes this problem. Is it a problem with Windows 10, or do I need to change the code?
When I try to run the TextToSpeechDataset.create_meta_file method using the css10 dataset, I get this error.
Building phoneme dictionary: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0.0%
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/Multilingual_Text_to_Speech/utils/text.py in _phonemize(text, language)
90 seperators = Separator(word=' ', phone='')
---> 91 phonemes = phonemize(text, separator=seperators, backend='espeak', language=language)
92 except RuntimeError:
9 frames
RuntimeError: espeak not installed on your system
During handling of the above exception, another exception occurred:
FileNotFoundError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/epitran/simple.py in _load_g2p_map(self, code, rev)
90 except IndexError:
91 raise DatafileError('Add an appropriately-named mapping to the data/maps directory.')
---> 92 with open(path, 'rb') as f:
93 reader = csv.reader(f, encoding='utf-8')
94 orth, phon = next(reader)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/dist-packages/epitran/data/map/chinese.csv'
I have installed the epitran and for installing espeak I used these:
! pip install python-espeak
and got this error
Collecting python-espeak
Using cached https://files.pythonhosted.org/packages/59/5b/45437090dbd71ee9f586dc7f650c6e8c4815bd8bff9b2923d4db5b9120ed/python-espeak-0.6.3.tar.gz
Building wheels for collected packages: python-espeak
Building wheel for python-espeak (setup.py) ... error
ERROR: Failed building wheel for python-espeak
Running setup.py clean for python-espeak
Failed to build python-espeak
Installing collected packages: python-espeak
Running setup.py install for python-espeak ... error
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-nlsz0hgq/python-espeak/setup.py'"'"'; __file__='"'"'/tmp/pip-install-nlsz0hgq/python-espeak/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-ahzj2e1q/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
Hi Tomiinek!! I have a question about Generated-Encoder.
As I understand it, after training, the Generated-Encoder's fully-connected layer effectively becomes a dictionary:
{Key: language id embedding, Value: conv or batch-norm weights}. Right??
If so, rather than generating the weights with a fully-connected layer, wouldn't it be better to fix the weights per language and use them directly?
If not, is there a case where the weights are changed by something other than the language after training?
Thank You!!
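The question can be made concrete with a toy parameter generator. This is a sketch under assumptions: the generator is modeled as a single affine map from a language embedding to a flat weight vector, and all numbers, names, and the two language codes are hypothetical. With frozen parameters, the output for each trained language can indeed be computed once and cached. A lookup table would only stop being equivalent if an embedding outside the table were fed in, for example a blend of two language embeddings.

```python
def generate(G, b, embedding):
    """Affine generator: one generated weight per row of G."""
    return [sum(g * e for g, e in zip(row, embedding)) + bias
            for row, bias in zip(G, b)]

# Hypothetical frozen generator (3 generated weights, embedding dimension 2).
G = [[0.5, -1.0], [2.0, 0.25], [0.0, 1.5]]
b = [0.1, 0.0, -0.2]
language_embeddings = {"fr": [1.0, 0.0], "de": [0.0, 1.0]}

# After training, generation can be done once per language and cached.
cached = {lang: generate(G, b, emb) for lang, emb in language_embeddings.items()}

# An embedding that is not in the table (here a 50/50 blend) still yields
# weights from the generator, which a fixed per-language table cannot provide.
mixed_weights = generate(G, b, [0.5, 0.5])

print(cached)
print(mixed_weights)
```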
What would be the minimum GPU RAM recommended and the minimum batch size to produce acceptable results?
Also, how long did training take for the pre-trained weights?
I currently have a GeForce GTX 1060 6GB.
My purpose is to make a Vietnamese Speech Synthesis model be able to pronounce English.
But, I have just a Vietnamese dataset with 2 female speakers (13100 and 16500 utterances ~ around 50-55 hours).
I want to train this dataset with LibriTTS with 127 female speakers(~ 58 hours).
Do you have any suggestions for my experiments?
Hi @Tomiinek !!
Thanks to your great project, we were able to add an additional fun TTS project to our team's project called PORORO.
You can easily use many natural language processing and voice tasks including your TTS using pororo.
And I've expanded your project to English, Korean, and Jejueo. You can check at this page.
Thank you for your great project!!
Hi, @Tomiinek . It's a nice job, and it's an honor to see this project.
I have some questions about train.txt; hope you can solve my puzzle.
In /data/css10/ we can see the original train.txt. After prepare_css_spectrograms.py is run, we get the spectrograms and linear spectrograms, and the structure of train.txt changes:
(original) 000285|chinese|chinese|chinese/call_to_arms/call_to_arms_0285.wav|||húixiāngdòu de húizì, zěnyáng xiě de?|
(processed) 000285|chinese|chinese|chinese/call_to_arms/call_to_arms_0285.wav|spectrograms\000285.npy|linear_spectrograms\000285.npy|húixiāngdòu de húizì, zěnyáng xiě de?|
So, how do I get the original train.txt file? I want to create it for another dataset.
I have questions about the meaning of these variables: idx, s, ph
idx: As long as we make sure that idx is unique and points to a specific audio file, we can define it in any way we want, right? Usually the id is derived from the file name, but I found that you did not do this. Can I define this variable with the filename of the audio?
s: speaker? I'm puzzled. If the dataset has only one person's voice, is it defined as the language name? Otherwise, is it defined as the serial number of the different people (0,1,2,3,4....)?
ph: I don't understand the meaning of this variable. Does it mean "\n"?
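The sample line quoted in this post splits into eight "|"-separated fields, the last of which is empty. The sketch below parses and rebuilds such a line. The field names are assumptions inferred from the example and from the variable names in the question (in particular, which of the two "chinese" fields is the speaker and which is the language, and that the trailing field holds phonemes, are guesses); they are not confirmed by the repository.

```python
from typing import NamedTuple

class MetaEntry(NamedTuple):
    # Field names are assumed, not taken from the repository.
    idx: str
    speaker: str
    language: str
    audio: str
    spectrogram: str
    linear_spectrogram: str
    text: str
    phonemes: str

def parse_meta_line(line: str) -> MetaEntry:
    """Split one meta-file line; fails if the text itself contains '|'."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 8:
        raise ValueError(f"expected 8 '|'-separated fields, got {len(fields)}")
    return MetaEntry(*fields)

def format_meta_line(entry: MetaEntry) -> str:
    return "|".join(entry)

original = ("000285|chinese|chinese|chinese/call_to_arms/call_to_arms_0285.wav"
            "|||húixiāngdòu de húizì, zěnyáng xiě de?|")
entry = parse_meta_line(original)
print(entry.idx, entry.audio, repr(entry.phonemes))
```

For a new dataset, lines of the "original" shape could be produced by filling in the id, speaker, language, audio path, and text, and leaving the spectrogram and trailing fields empty.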
Very interesting work! I am wondering whether there is a paper available for it :)
Hello, this project is so nice, and thank you for sharing it!
I've trained an English and Chinese model with a total of hundreds of speakers in each language, using LibriTTS, thchs30 (a Chinese dataset), and a private dataset. All data are resampled to 22k and denoised. This time I am trying to use phonemes (phonemize), and in particular I add tones for Chinese.
It has now trained for 25k steps and the loss is dropping well. The result is OK and the pronunciation is right. But it still has problems:
So I'm wondering if you can give me some advice to optimize it. Thx!
Here are my params:
"balanced_sampling": True,
"batch_size": 80,
"case_sensitive": False,
"checkpoint_each_epochs": 20,
"encoder_dimension": 256,
"encoder_type": "generated",
"epochs": 1000,
"generator_bottleneck_dim": 1,
"generator_dim": 2,
"languages": ["zh", "en"],
"language_embedding_dimension": 0,
"learning_rate": 0.001,
"learning_rate_decay_each": 10000,
"learning_rate_decay_start": 10000,
"use_phonemes":True,
"multi_language": True,
"multi_speaker": True,
"perfect_sampling": True,
"predict_linear": False,
"reversal_classifier": True,
"reversal_classifier_dim": 256,
"reversal_classifier_w": 0.125,
"reversal_gradient_clipping": 0.25,
"speaker_embedding_dimension": 256,
Hi Tomiinek!!
I'm sorry I seem to ask you questions all the time.
I'm so interested in your project that much, so please understand!
I'm going to train the WaveRNN model to fit my dataset.
But I couldn't find anything in your WaveRNN repo about the expected dataset format.
Can you give me any advice on how to configure dataset formats or give me relevant materials?
Thank You!!
Hi Tomiinek!!
I have a question about vocoder.
Your model supports 10 languages but one vocoder was used.
I wonder whether the voice quality would improve if I used 10 vocoders (one for each language)?
I'd like to hear your thoughts.
Thank You!
Thanks for your excellent work. I'm trying to synthesize Chinese-English code-switched speech. The datasets are as follows:
ST-CMDS: Chinese, 853 speakers, 120 utterances per speaker, samplerate 16k
VCTK: English, 108 speakers, 123~502 utterances per speaker, samplerate 48k
Chinese transcriptions are converted to pinyin with tones by pypinyin. VCTK audios are downsampled to 16k with ffmpeg.
I'm training model on 8 v100 32G gpus. Parameters are:
"balanced_sampling": true,
"batch_size": 512,
"case_sensitive": true,
"characters": " !',-.?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzàáèéìíòóùúüāēěīńōūǎǐǒǔǘǚǜ",
"checkpoint_each_epochs": 1,
"dataset": "stcmds_vctk",
"encoder_dimension": 256,
"encoder_type": "generated",
"epochs": 300,
"generator_bottleneck_dim": 1,
"generator_dim": 2,
"languages": ["en", "zh"],
"language_embedding_dimension": 0,
"learning_rate": 0.005,
"learning_rate_decay_each": 1000,
"learning_rate_decay_start": 1000,
"multi_language": true,
"multi_speaker": true,
"perfect_sampling": true,
"predict_linear": false,
"reversal_classifier": true,
"reversal_classifier_dim": 256,
"reversal_classifier_w": 0.125,
"reversal_gradient_clipping": 0.25,
"speaker_embedding_dimension": 1024,
"version": "GENERATED-SWITCHING"
After training for 37 hours, logs are:
Eval/classifier seems to be too small, and the audio in "Audio/forced" and "Audio/generated" is very unclear.
Any advice, please? Thanks in advance.
Thanks for your efforts.
Could you please tell me what this code in the "HighwayConvBlockGenerated" class does?
When training on my custom data, it raises an error saying that the dimensions of x and (h2 * p) do not match.
The dimension of (h2 * p) is half that of x, so they cannot be added.
The error: RuntimeError: The size of tensor a (3584) must match the size of tensor b (1792) at non-singleton dimension 1
def forward(self, x):
    e, x = x
    _, h = super(HighwayConvBlockGenerated, self).forward((e, x))
    chunks = torch.chunk(h, 2 * self._groups, 1)
    h1 = torch.cat(chunks[0::2], 1)
    h2 = torch.cat(chunks[1::2], 1)
    p = self._gate(h1)
    return e, h2 * p + x * (1.0 - p)
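The shape arithmetic of the forward method above can be followed with plain lists standing in for the channel dimension. This is only an illustration with hypothetical sizes: `chunk` mimics torch.chunk for the evenly divisible case, and it says nothing about why the convolution output has the wrong width in the reported setup. The even-indexed chunks form h1 (fed to the gate) and the odd-indexed chunks form h2, so the sum with x only works when the convolution output h has twice as many channels as x.

```python
def chunk(seq, n):
    """Split seq into n equal contiguous chunks (len(seq) divisible by n)."""
    size = len(seq) // n
    return [seq[i * size:(i + 1) * size] for i in range(n)]

def split_gate(h_channels, groups):
    """Even chunks go to the gate input h1, odd chunks to the candidate h2."""
    chunks = chunk(h_channels, 2 * groups)
    h1 = [c for part in chunks[0::2] for c in part]
    h2 = [c for part in chunks[1::2] for c in part]
    return h1, h2

x_channels = 8
groups = 2

# Expected case: h has 2 * x_channels channels, so h1 and h2 match x.
h = list(range(2 * x_channels))
h1, h2 = split_gate(h, groups)

# Mismatch case: h is only as wide as x, so h2 is half as wide as x,
# the same 2:1 ratio as 3584 vs 1792 in the reported error.
h_bad = list(range(x_channels))
_, h2_bad = split_gate(h_bad, groups)

print(len(h1), len(h2), len(h2_bad))
```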
I have trained this model for 7k steps, but I cannot get any acceptable audio. The dataset has 40 hours of audio, and I predict the linear spectrogram and then use Griffin-Lim to get the waveform.
audio.zip
I used two language datasets: VCTK (English, 30 speakers, 11654 clips) and STCMDS (Chinese, 30 speakers, 3600 clips).
The result is good when I synthesize Chinese text.
But when I used three language datasets, VCTK, STCMDS, and TAT (Minnan language, 30 speakers, 8586 clips), the result is bad.
When I synthesize Chinese text, the audio sounds as if it went through a voice changer, and the speech is too fast.
The compressed file contains waveforms synthesized with the two-language and three-language models.
I don't have any idea what causes this problem.
Wouldn't model performance be better if we increase the number of model parameters?
Have you ever done an experiment like this?
Where can I download the pretrained model? "https://github.com/Tomiinek/Multilingual_Text_to_Speech/releases/download/v1.0/$tacotron_chpt" I can't find this file path now.
Thanks very much!
Hello, I am struggling to get an idea of the workflow of this awesome project.
As you mentioned in the readme, we have to use comvoi.zip, and for now I am using the css10 dataset for three languages.
Everything is set up, but could somebody please explain the workflow: what needs to be fixed, and why, to train the model and then synthesize with WaveRNN?
I have a lot of questions but let's first see if someone will guide me for the initial steps.
There appears to be some long term running memory leak, probably related to graphs. As the training progresses, my Xorg memory consumption gradually increases. If I stop the training, the memory is instantly released.
I suspect it is related to graphs because of the following warning:
RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
fig = plt.figure(figsize=(16, 4))
Hi! Tomiinek!!
I trained Tacotron with the generated-training.json config.
With your WaveRNN weight file, the TTS results are pretty good.
But the WaveRNN you released was pre-trained on only 5 languages, so the results for other languages were somewhat disappointing.
So I trained WaveRNN with your WaveRNN repository.
I trained it following your comment here.
I've trained WaveRNN for 600k steps, but it doesn't produce any voice at all.
So I'm confused. The WaveRNN you released produces voice well, so why can't mine do it at all?
The figure above is a waveform made from my WaveRNN.
The figure above is a waveform made from your WaveRNN.
The training loss is about 2.3. Here are a sample quant and mel spectrogram: Quant Sample, Mel Sample.
I think that if there is a (mel spectrogram, waveform) pair, training should work, but I don't know why it doesn't.
The figures below are images of the Quant Sample and Mel Sample.
I did not touch the parameters of hparams.py in the WaveRNN repository. Could you check it for me?
Please help me Tomiinek.
Thank you.
Hello,
This is a great repo! Thanks.
I need to train for Urdu (Arabic script). How can I make a dataset like yours? Do I need to change the vocabulary somewhere?
Hello Tomáš,
I have a problem when I try to train my dataset with WaveRNN.
It returns me this error:
voxfurem@cc:~/nvme/src/WaveRNN$ python train_wavernn.py --force
Initialising Model...
Trainable Parameters: 4.744M
Path=/media/nvme/src/WaveRNN/dataset
Batch size=32
Paths = <utils.paths.Paths object at 0x7fab0adebca0>
NUM EXIST = 3
Restoring from latest checkpoint...
Loading latest weights: /media/nvme/src/WaveRNN/checkpoints/css_raw.wavernn/latest_weights.pyt
Loading latest optimizer state: /media/nvme/src/WaveRNN/checkpoints/css_raw.wavernn/latest_optim.pyt
| Epoch: 1/147059 (1/68) | Loss: 6.9385 | LR: 0.0010 | 1.3 steps/s | Step: 0k | Traceback (most recent call last):
File "train_wavernn.py", line 136, in <module>
voc_train_loop(paths, voc_model, loss_func, optimizer, scheduler, train_set, test_set, total_steps)
File "train_wavernn.py", line 31, in voc_train_loop
for i, (x, y, m) in enumerate(train_set, 1):
File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 838, in _next_data
return self._process_data(data)
File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 1.
Original Traceback (most recent call last):
File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/media/nvme/src/WaveRNN/utils/dataset.py", line 109, in collate_vocoder
labels = np.stack(labels).astype(np.int64)
File "/home/voxfurem/.local/lib/python3.8/site-packages/numpy/core/shape_base.py", line 416, in stack
raise ValueError('all input arrays must have the same shape')
**ValueError: all input arrays must have the same shape**
I spent some hours trying to understand this error. I wrote a script to search for the file that might be the cause, I restarted with another voice package, etc., but I failed to find a solution.
Do you have any idea of the root cause of this problem?
Thanks in advance
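Since np.stack requires every array in the batch to have the same shape, one way to narrow this down is to record the shape of each per-sample array just before the stack and look for the odd ones out. The helper below is a hypothetical debugging aid, not part of WaveRNN; the sample names and shapes are invented, and in practice the dictionary would be filled from `.shape` of the arrays built in the collate function.

```python
from collections import Counter

def find_shape_outliers(shapes):
    """shapes: dict of sample name -> shape tuple.
    Returns the most common shape and the names that deviate from it."""
    counts = Counter(shapes.values())
    common, _ = counts.most_common(1)[0]
    outliers = sorted(name for name, shape in shapes.items() if shape != common)
    return common, outliers

# Invented example: one sample yields a shorter label window than the rest.
shapes = {"a": (1100,), "b": (1100,), "c": (640,), "d": (1100,)}
common, outliers = find_shape_outliers(shapes)
print(common, outliers)
```

A sample that comes out shorter than the others would then point at the audio or mel file worth inspecting.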
Hello and thank you for sharing your work!
I was wondering if you could provide a list of the available languages and speakers in your pretrained models.
In addition, do you happen to have a female Spanish speaker?
Thank you very much in advance!
Lucía
Hi Tomiinek!!
I used your WaveRNN Repo to train vocoder.
I've formatted the data, I've trained WaveRNN, but I haven't been able to produce a voice at all.
Is the hyperparameter the same as the one at hparams.py when you train your model?
Firstly, thanks for this great work.
I'm trying to use your model for streaming text-to-speech applications. The quality is good, but the speed is slow. I'm using a GPU Tesla V4 with 16G RAM.
Is there any change to the configuration that could help speed up the process?
Currently, producing 2 seconds of audio takes 14 seconds of processing starting from the text to the produced audio
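These two numbers can be summarized as a real-time factor (processing time divided by audio duration). The small sketch below just does that arithmetic; the function name is hypothetical. Streaming without gaps generally needs a factor below 1, so the reported setup is roughly seven times too slow.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Seconds of compute spent per second of generated audio."""
    return processing_seconds / audio_seconds

rtf = real_time_factor(14.0, 2.0)
print(rtf)
```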
Hi Tomiinek!!
I have a question. What hyperparameters need to be changed if the language is extended?
You experimented using CSS10. (10 Languages)
If I experiment with 14 languages, what hyperparameters would I need to change?
Please give me your wisdom.
Hi Tommy, could you please explain the different uses of prepare_css_spectrograms.py and TextToSpeechDataset.create_meta_file?
And how do I run TextToSpeechDataset.create_meta_file if I am using the css10 dataset? For example, which settings for the batch file and which other settings would you recommend?
I hope you won't ignore me this time.
I'm working on seeing what can be done with smaller batch sizes with the existing data sets, and currently have my batch size set to '10'. This results in a bit over 4GB memory being allocated on my NVIDIA GPU which only has 5GB-5.5GB available memory.
I'm using pre-generated spectrograms.
However, the GPU usage seems a tad low and the CPU usage high.
Right now top shows python at 100% CPU utilization. With GPU usage fluctuation between about 45% to 88%.
Is there some other data that can be pre-calculated and cached to reduce python bottlenecking on the CPU?
Is it possible to use the saved npy output files with a different vocoder?
If yes, are there any special steps required for shaping, etc?
I am trying to train the system on the CSS10 dataset and am receiving the error stated in the title of this issue.
The error is raised following the training command PYTHONIOENCODING=utf-8 python3 train.py --hyper_parameters generated_switching
as instructed in the documentation. I have downloaded all languages of CSS10 and changed the dataset parameters in params.py file as such:
******************* DATASET SPECIFICATION *******************
dataset = "css10"
cache_spectrograms = True
languages = ['zh', 'fi', 'de', 'el', 'hu', 'ja', 'kss', 'ru', 'es', 'fr', 'nl']
balanced_sampling = False
perfect_sampling = False
I have also preprocessed all the spectrograms with the prepare spectrograms file. I do not understand where I went wrong in following the documentation. I understand the error but not its cause, and suppressing it in the code does not work. Can anyone help?
Please include instructions on how to resume training starting with your 70k iteration weights.
Would it be possible to add additional languages as part of a fine tuning process?
I want to train using generated_training on two languages "russian"+"french".
I downloaded the CSS10 data for these languages from the links you gave, ran prepare_css_spectrograms.py, and created a custom generated_training.json file.
I run train.py on 2 GPUs
I have several issues:
In addition, it's been running for 3 hours but nothing seems to happen (tensorboard is empty). Is that normal?
How long should a training run like that take?
PS. I switched tensorboard to tensorboardX in utils/logging.py due to an issue I had. I don't know if that has any effect...
Thank you for sharing
I want to ask about the setup for 2 languages: Vietnamese and English (using LJSpeech).
Could you tell me which places I need to configure?
I find your code a bit confusing.