Comments (31)
Oh, I see.
First, training on only two languages does not make full use of the model's capabilities. The more languages you have, the more information can be shared across languages.
Second, you should have more speakers for each language (or very similar voices for both mono-speaker languages). The model has to learn the difference between language- and speaker-dependent information. There is the multi-speaker VCTK dataset for English (a subset with a few speakers should be sufficient) and some Chinese voices in my cleaned Common Voice data (see readme). You do not need many examples per speaker (50 transcript-recording pairs per speaker can be enough), but you should have more speakers with diverse voices (such as 10 or 20). If this is your case, just add these multi-speaker data to your current dataset and it should work better.
Third, you should definitely reduce generator_dim to something like 2..4, and generator_bottleneck_dim should be lower than that, e.g., 1 or 2. Also, speaker_embedding_dimension should roughly correspond to the number of speakers you have; so with around 20 speakers, use 16 or 32.
Finally, there is reversal_classifier_w, which controls the weight of the adversarial speaker classifier's loss. This parameter is really tricky: high values prevent the model from converging, low values have no effect. However, you should first try to make your data multi-speaker.
from multilingual_text_to_speech.
About "language_embedding_dimension" and "generator_dim": do they have the same meaning?
And why is "language_embedding_dimension" set to zero when training the five-language model?
@YihWenWang Hi, I'm trying to train a Chinese-English mixed TTS model too, for code-switched sentences like in daily conversation:
——A:请问你从事什么领域的研究? (What field do you do research in?)
——B:我从事Computer Vision方面的研究工作。 (I work on Computer Vision.)
I plan to use the LJSpeech, ST-CMDS, and Biaobei datasets to train it. As in the above example, I am not familiar with the TTS field. Could you give me a simple suggestion of the steps needed to adapt this project?
I use the VCTK (English, 30 speakers) and STCMDS (Mandarin, 30 speakers) datasets.
My steps:
- Download the datasets and put them into the "data" folder.
- Organize train.txt and val.txt. Each entry must contain the speaker, language, audio path, spectrogram, linear spectrogram, and text. Transliteration is not strictly necessary.
- I use the "pypinyin" package to convert Mandarin text.
- In the "data" folder, prepare_css_spectrograms.py must be modified: change the path of the dataset.
- Check whether the "sample_rate" parameter matches the sample rate of the audio in the datasets.
- Adjust the "generator_bottleneck_dim" and "generator_dim" parameters according to the number of languages.
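To illustrate the metadata step, here is a minimal sketch of composing one such line. The pipe-separated column order is my assumption, so check the dataset-loading code in the repository for the format it actually parses:

```python
# Minimal sketch of composing one metadata line for train.txt.
# The field order below is a guess for illustration -- verify it against
# the dataset loader in the repository before generating real files.
def make_line(idx, speaker, language, audio_path, spec_path, lin_path, text):
    # One utterance per line, fields separated by '|'.
    return "|".join([idx, speaker, language, audio_path, spec_path, lin_path, text])

line = make_line("000001", "p225", "en", "audio/p225_001.wav",
                 "spectrograms/p225_001.npy", "linear_spectrograms/p225_001.npy",
                 "Please call Stella.")
print(line)
```

The speaker id, paths, and text above are placeholders; the point is only that every utterance carries its speaker and language label alongside the spectrogram paths.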
Thank you for your reply. Your suggestion gives me a clearer idea of how to approach this work.
Thanks again!
Hi! I'm training on LJSpeech and Biaobei now: hours=12, epochs=25, steps=9K. Currently, the English begins to sound like words, but the Chinese still sounds like nothing.
So I would like to ask how long you trained before the Chinese result began to sound like Chinese characters, and how long it took to get a final result you consider good.
Hello, I used a V100 and trained for three days.
Hi! 😄
I am not sure I understand your question 😟 What do you mean by "two kinds of datasets"? You can change the parameters arbitrarily or create another file with your parameters. The dataset is specified by the dataset parameter (with values such as css_comvoi, css10, ljspeech).
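A hedged sketch of what such a custom parameters file could look like. The keys below are only the ones mentioned in this thread (and the values follow the advice given earlier here); the exact file structure and any remaining required keys should be taken from the repository's existing params files, e.g. generator_switching.json:

```json
{
    "dataset": "css10",
    "languages": ["english", "chinese"],
    "generator_dim": 4,
    "generator_bottleneck_dim": 2,
    "speaker_embedding_dimension": 32,
    "reversal_classifier_w": 0.125,
    "sample_rate": 22050
}
```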
Thanks.
I trained on a dataset with two languages (English and Chinese) and didn't change any parameters except "languages" in generator_switching.json, but I got bad results by epoch 120.
When I synthesized both languages in a single sentence with a fixed speaker, the result contained the voices of two speakers, not one.
I have no idea what causes this.
Okay, thank you very much.
I will try training with a dataset that includes more speakers for each language and adjust the parameters.
Hello,
I use the VCTK dataset (108 speakers) for English and the THCHS-30 dataset (60 speakers) for Chinese.
My generator_switching.json settings:
generator_dim = 4,
generator_bottleneck_dim = 2,
speaker_embedding_dimension = 64,
reversal_classifier_w = 0.125
With this training result, I can synthesize two languages in a sentence with one speaker's voice.
But there is a problem: if I synthesize a sentence such as "recommend the some 社會書。", the volume of the second half of the sentence becomes lower.
I have no idea what causes this.
Hello 😁
Do I understand correctly that the voice stays the same throughout the whole sentence, but the volume changes? This might be caused by recordings (from two different datasets) being normalized in different ways. Do the corresponding spectrograms have similar magnitudes?
You can try to normalize your audio files and repeat training with the new data. For example, you can run this command to normalize every .wav in your-directory to the same volume level:
find "your-directory" -name '*.wav' | while read f; do
  sox "${f}" tmp.wav gain -n -3 && mv tmp.wav "${f}"
done
Hope it helps 😇 Goodbye
Thanks for your suggestion, I will try it.
But I still have a question: if the sample rate of the dataset is 16000 Hz, which parameters should I modify besides the sample rate?
You do not have to change stft_window_ms or stft_shift_ms, because these values are in milliseconds. However, you can reduce num_fft to a lower value such as 1024, because stft_window_ms / 1000 * sample_rate gives you something around 800 samples.
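As a sanity check of the num_fft advice, assuming a hypothetical stft_window_ms of 50 (the actual value is not stated in this thread, so take it from your params file):

```python
import math

# Window length in samples for a given STFT window (in ms) and sample rate;
# num_fft is typically the next power of two >= the window length.
# stft_window_ms = 50 is an assumed example value, not taken from the repo.
def fft_size_for(stft_window_ms, sample_rate):
    window_samples = int(stft_window_ms / 1000 * sample_rate)
    return 2 ** math.ceil(math.log2(window_samples))

print(fft_size_for(50, 16000))   # 800-sample window -> num_fft = 1024
print(fft_size_for(50, 22050))   # 1102-sample window -> num_fft = 2048
```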
Sorry, I want to ask you another question.
How do I get the mel-spectrogram from the .npy file?
I want to use Waveglow to synthesize the waveform.
Hm, spectrograms are stored as two-dimensional numpy arrays and saved into .npy files. Just use numpy.load to load them back into memory.
If you want to train the Waveglow model on spectrograms produced by your Tacotron, use the gta.py script, which produces ground-truth-aligned (GTA) spectrograms given your model and the original spectrograms.
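A minimal self-contained sketch of the save/load roundtrip; the random array here stands in for a real mel-spectrogram:

```python
import numpy as np

# Spectrograms are plain 2-D numpy arrays saved with numpy.save;
# numpy.load brings them back. We simulate one saved file here.
mel = np.random.uniform(-60.0, 0.0, size=(80, 120)).astype(np.float32)
np.save("example_mel.npy", mel)

restored = np.load("example_mel.npy")   # shape: (num_mels, frames)
print(restored.shape)                   # (80, 120)
```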
But I'm not training Waveglow; I have already trained it.
I want to feed the synthesized .npy file to Waveglow and have it synthesize the audio.
The figure shows the mel-spectrogram I loaded from the .npy file and tried to synthesize audio from with Waveglow.
But I get noisy audio.
Spectrograms seem to be ok, so I am afraid I cannot help you directly.
Just a few hints that come to my mind and may help with debugging:
- Does synthesis using Griffin-Lim work?
- What is the range of the spectrogram values?
- Does Waveglow expect spectrograms normalized in some way?
- What do the spectrograms produced by Waveglow's preprocessing look like?
No, they don't. language_embedding_dimension specifies the dimension of the language embedding concatenated to the decoder input. generator_dim defines the dimension of the language embedding used in the parameter generator. language_embedding_dimension is set to zero because the model gets enough information about the language from the encoder.
@YihWenWang Hello Wang, Can you share your synthesized speech samples?
THX!
These are my synthesized speech samples.
I used the VCTK and STCMDS datasets to train this model.
My synthesized text is "Recommend the some 社會書。".
Thanks.
VCTK_STCMDS.zip
@YihWenWang Samples sound nice, did you train this model based only on VCTK and STCMDS?
Yes, I only used the VCTK (English, 30 speakers) and STCMDS (Mandarin, 30 speakers) datasets.
By the way, do you use phonemes when training?
No, I didn't use phonemes for training.
I just use the text, speaker, language, spectrogram, and linear spectrogram labels.
Thanks. I have a problem with the tones of Mandarin: whether I use pinyin or phonemes for training, the pronunciation of the four tones is not accurate.
I also use the "pypinyin" package to convert Mandarin text.
The differences between us are:
I use the pinyin package from the requirements.txt file.
I use the Biaobei dataset (10,000 utterances) and the LJSpeech dataset (5,000 utterances).
I don't know what causes this. Do you have any ideas, or could you share params.py with me?
May I ask, did you remove the silence from the VCTK dataset?
@YihWenWang How did you get the Tacotron mel spectrograms to work with Waveglow in the end? They seem similar, but look like they are normalized differently.
Tacotron value examples:
[-54.77068739 -47.15882725 -45.828745 -44.59372329 -43.22799777
 -42.7517943 -42.11187298 -42.25688537 -42.81581903 -43.02588636, ...]
Waveglow value examples (for the same audio file):
[-3.9470453 -2.820666 -2.7616765 -2.5435247 -2.5574331 -2.2251318
 -2.0958776 -2.0956624 -2.170114 -2.0375078, ...]
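One possible explanation, stated as an assumption rather than a fact about either codebase: the Tacotron values look like decibels (20·log10 of amplitude), while Waveglow's preprocessing typically takes the natural log of mel values. If that holds for both pipelines, the two scales differ by a constant factor:

```python
import math

# If Tacotron stores decibels (20 * log10(a)) and the vocoder expects the
# natural log (ln(a)), the conversion is linear: ln(a) = dB * ln(10) / 20.
# This is an assumption about both pipelines -- verify the actual scaling
# in each preprocessing script before relying on it.
DB_TO_LN = math.log(10) / 20        # ~0.11513

def db_to_natural_log(values):
    return [v * DB_TO_LN for v in values]

print(db_to_natural_log([-54.77, -47.16, -45.83]))
```

Note that -54.77 dB maps to about -6.3, which is still not the -3.9 seen above, so a different reference level or mel-filter normalization probably also differs between the two pipelines.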