
Comments (31)

Tomiinek commented on June 15, 2024

Oh, I see.

First, training on just two languages does not make full use of the model's capabilities. The more languages you have, the more information there is that can be shared across them.

Second, you should have more speakers for each language (or very similar voices for both single-speaker languages). The model has to learn to separate language-dependent from speaker-dependent information. There is the multi-speaker VCTK dataset for English (a subset with a few speakers should be sufficient) and some Chinese voices in my cleaned Common Voice data (see the readme). You do not need many examples per speaker (50 transcript-recording pairs per speaker can be enough), but you should have more speakers with diverse voices (say 10 or 20). If this is your case, just add this multi-speaker data to your current dataset and the results should improve.

Third, you should definitely reduce generator_dim to something like 2-4, and generator_bottleneck_dim should be lower than that, e.g., 1 or 2. Also, speaker_embedding_dimension should roughly correspond to the number of speakers you have, so with around 20 speakers, something like 16 or 32.

Finally, there is reversal_classifier_w, which controls the weight of the adversarial speaker classifier's loss. This parameter is really tricky: high values prevent the model from converging, while low values have no effect. However, you should first try to make your data multi-speaker.
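
To make this concrete, here is a hedged example only: plausible values for a two-language setup with roughly 20 speakers, written as a Python dict mirroring what you would put in your parameter JSON file. The key names are the ones discussed in this thread; the exact values depend on your data.

params = {
    "generator_dim": 2,                 # 2-4 for two languages
    "generator_bottleneck_dim": 1,      # lower than generator_dim
    "speaker_embedding_dimension": 16,  # roughly matches ~20 speakers
    "reversal_classifier_w": 0.125,     # tune carefully; too high prevents convergence
}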

YihWenWang commented on June 15, 2024

About "language_embedding_dimension" and "generator_dim" , do they have the same meaning ?
And then, why language_embedding_dimension is set zero when training the five languages model ?

YihWenWang commented on June 15, 2024

@YihWenWang Hi, I'm trying to train a Chinese-English mixed TTS model too,
for code-switched sentences like the ones in everyday conversation, such as:

——A: 请问你从事什么领域的研究? (What field do you do research in?)
——B: 我从事Computer Vision方面的研究工作。 (I do research on Computer Vision.)

I plan to use the LJSpeech, ST-CMDS, and Biaobei datasets to train it.
I want to synthesize sentences like the example above, but I am not familiar with the TTS field.
Could you give me a brief outline of the steps needed to adapt this project?

I use the VCTK (English, 30 speakers) and STCMDS (Mandarin, 30 speakers) datasets.
My steps:

  1. Download the datasets and put them into the "data" folder.
  2. Prepare train.txt and val.txt. Each entry must contain the speaker label, language, audio, spectrogram, linear spectrogram, and text; transliteration is not strictly required.
  3. I use the "pypinyin" package to convert Mandarin text to pinyin (see the sketch after this list).
  4. Modify prepare_css_spectrograms.py in the "data" folder so that it points to your dataset paths.
  5. Make sure the "sample rate" parameter matches the sample rate of the audio in your datasets.
  6. Adjust the "generator_bottleneck_dim" and "generator_dim" parameters according to the number of languages.
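
As a small illustration of step 3 (a sketch only; Style.TONE3 is just one of the tone notations pypinyin supports, and the exact text format expected by the metadata files may differ):

from pypinyin import lazy_pinyin, Style

text = "推荐一些社会书"
# convert Mandarin characters to pinyin with tone numbers, e.g. "tui1 jian4 yi1 xie1 she4 hui4 shu1"
pinyin = " ".join(lazy_pinyin(text, style=Style.TONE3))
print(pinyin)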

leijue222 commented on June 15, 2024

Thank you for your reply. Your suggestions give me a clearer idea of how to do this work.
Thanks again!

leijue222 commented on June 15, 2024

These are my synthesized speech samples.
I used the VCTK and STCMDS datasets to train this model.
My synthesized text is "Recommend the some 社會書。".
Thanks.
VCTK_STCMDS.zip

Hi! I am training on LJSpeech and Biaobei now (hours = 12, epoch = 25, steps = 9K). Currently, the English output is starting to sound like words, but the Chinese still sounds like nothing.
So I would like to ask how long you trained before the Chinese output started to sound like Chinese characters, and how long it took to get a final result you consider good.

YihWenWang commented on June 15, 2024

Hello, I used a V100 and trained for three days.

Tomiinek commented on June 15, 2024

Hi! 😄
I am not sure I understand your question 😟 What do you mean by "two kinds of datasets"? You can change the parameters arbitrarily or create another file with your own parameters. The dataset is specified by the dataset parameter (with values such as css_comvoi, css10, ljspeech).

YihWenWang commented on June 15, 2024

Thanks. I trained on a dataset containing two languages (English and Chinese) and didn't change any parameters except for "languages" in generator_switching.json, but I got a bad result by epoch 120.
When I synthesized a sentence containing both languages with a fixed speaker, the result contained the voices of two different speakers rather than one.
I have no idea what causes this.

YihWenWang commented on June 15, 2024

Okay, thank you very much.
I will try training with a dataset that includes more speakers for each language and adjust the parameters.

YihWenWang commented on June 15, 2024

Hello,
I use the VCTK dataset (108 speakers) for English and the THCHS-30 dataset (60 speakers) for Chinese.
My generated_switching.json settings:
generator_dim = 4,
generator_bottleneck_dim = 2,
speaker_embedding_dimension = 64,
reversal_classifier_w = 0.125
With the trained model, I can synthesize both languages in one sentence with a single speaker's voice.
But there is a problem: if I synthesize a sentence such as "recommend the some 社會書。", the volume of the second half of the sentence becomes lower.
I have no idea what causes this.

Tomiinek commented on June 15, 2024

你好 (hello) 😁

Do I understand correctly that the voice stays the same throughout the whole sentence, but the volume changes? This might be caused by the recordings (from the two different datasets) being normalized in different ways. Do the corresponding spectrograms have similar magnitudes?

You can try to normalize your audio files and repeat training with the new data. For example, you can run the following command to normalize every .wav file in your-directory to the same peak level.

find "your-directory" -name '*.wav' | while read f; do
    sox "${f}" tmp.wav gain -n -3 && mv tmp.wav "${f}"
done

Hope it helps 😇 再见 (goodbye)

YihWenWang commented on June 15, 2024

Thanks for your suggestion. I will try it.
But I still have a question: if the sample rate of the dataset is 16,000 Hz, which parameters should I modify besides the sample rate?

Tomiinek commented on June 15, 2024

You do not have to change stft_window_ms or stft_shift_ms, because these values are in milliseconds. However, you can reduce num_fft to a lower value such as 1024, because stft_window_ms / 1000 * sample_rate gives you a window of roughly 800 samples, and num_fft only needs to be the next power of two above that.
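
A quick sanity check of that arithmetic (a sketch only; the 50 ms window below is an assumption about the stft_window_ms value):

sample_rate = 16000
stft_window_ms = 50
window_samples = int(sample_rate * stft_window_ms / 1000)  # 800 samples per window
num_fft = 1 << (window_samples - 1).bit_length()           # next power of two: 1024
print(window_samples, num_fft)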

YihWenWang commented on June 15, 2024

Sorry, I want to ask another question.
How do I get the mel spectrogram back from the .npy file?
I want to use Waveglow to synthesize the waveform.

Tomiinek commented on June 15, 2024

Hm, spectrograms are stored in two-dimensional numpy arrays and saved into .npy files. Just use numpy.load to load them back into memory.

If you want to train the Waveglow model on spectrograms produced by your Tacotron, use the gta.py script, which produces ground-truth-aligned (GTA) spectrograms given your model and the original spectrograms.
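
A minimal sketch of loading one of these files (the path below is hypothetical; point it at one of your generated spectrograms):

import numpy as np

mel = np.load("output/mel_spectrograms/000001.npy")  # hypothetical path
print(mel.shape)  # a 2-D array, e.g. (num_mel_bins, num_frames)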

YihWenWang commented on June 15, 2024

[screenshot: the mel spectrogram loaded from the .npy file]
But I don't need to train Waveglow; I have already trained it.
I want to feed the synthesized .npy file to Waveglow and have it synthesize the audio.
The figure shows the mel spectrogram I loaded from the .npy file and tried to synthesize audio from with Waveglow.
But I only get noise.

Tomiinek commented on June 15, 2024

Spectrograms seem to be ok, so I am afraid I cannot help you.

Just a few hints that come to mind and might help you debug:

  • Does synthesis using Griffin-Lim work?
  • What is the range of the spectrogram values?
  • Does Waveglow expect spectrograms normalized in some particular way?
  • What do the spectrograms produced by the Waveglow preprocessing look like?

Tomiinek commented on June 15, 2024

No, they don't. language_embedding_dimension specifies the dimension of the language embedding concatenated to the decoder input, while generator_dim defines the dimension of the language embedding used in the parameter generator. language_embedding_dimension is set to zero in that configuration because the model already has enough information about the language from the encoder.
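
A purely illustrative sketch of the difference (the module and variable names below are assumptions for the example, not the repository's actual code):

import torch
import torch.nn as nn

num_languages = 5
language_embedding_dimension = 4   # set to 0 in the five-language config
generator_dim = 2
layer_in, layer_out = 8, 8         # size of the generated layer, for illustration

decoder_lang_emb = nn.Embedding(num_languages, language_embedding_dimension)
generator_lang_emb = nn.Embedding(num_languages, generator_dim)
weight_generator = nn.Linear(generator_dim, layer_in * layer_out)  # predicts a weight matrix

lang = torch.tensor([1])
decoder_input = torch.randn(1, 16)

# (1) language_embedding_dimension: the embedding is simply concatenated to the decoder input
decoder_input_with_lang = torch.cat([decoder_input, decoder_lang_emb(lang)], dim=-1)

# (2) generator_dim: the embedding parameterizes the weights of a language-specific layer
w = weight_generator(generator_lang_emb(lang)).view(layer_out, layer_in)
x = torch.randn(1, layer_in)
y = x @ w.t()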

lightwithshadow commented on June 15, 2024

@YihWenWang Hello Wang, can you share your synthesized speech samples?
Thanks!

YihWenWang commented on June 15, 2024

These are my synthesized speech samples.
I used the VCTK and STCMDS datasets to train this model.
My synthesized text is "Recommend the some 社會書。".
Thanks.
VCTK_STCMDS.zip

Maxxiey commented on June 15, 2024

@YihWenWang The samples sound nice. Did you train this model using only VCTK and STCMDS?

YihWenWang commented on June 15, 2024

Yes, I only used the VCTK (English, 30 speakers) and STCMDS (Mandarin, 30 speakers) datasets.

leijue222 commented on June 15, 2024

@YihWenWang Hi, I'm trying to train a Chinese-English mixed TTS model too,
for code-switched sentences like the ones in everyday conversation, such as:

——A: 请问你从事什么领域的研究? (What field do you do research in?)
——B: 我从事Computer Vision方面的研究工作。 (I do research on Computer Vision.)

I plan to use the LJSpeech, ST-CMDS, and Biaobei datasets to train it.
I want to synthesize sentences like the example above, but I am not familiar with the TTS field.
Could you give me a brief outline of the steps needed to adapt this project?

leijue222 commented on June 15, 2024

By the way, do you use phonemes when training?

YihWenWang commented on June 15, 2024

No, I didn't use phonemes when training.
I just use the labels of text, speaker, language, spectrogram, and linear spectrogram.
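
Purely as an illustration of what such a labeled entry could look like (the column order and paths below are assumptions for the sketch, not necessarily the project's actual metadata format):

line = "000001|spk01|zh|audio/000001.wav|mel/000001.npy|lin/000001.npy|tui1 jian4 yi1 xie1 she4 hui4 shu1"
utt_id, speaker, language, wav_path, mel_path, lin_path, text = line.split("|")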

leijue222 commented on June 15, 2024

Thanks. I have a problem with the Mandarin tones.
No matter whether I use pinyin or phonemes for training, the pronunciation of the four tones is not accurate.

I use the "pypinyin" package to convert Mandarin text.

The differences between us are:
I use the pinyin package listed in the requirements.txt file.
I use the Biaobei dataset (10,000 utterances) and the LJSpeech dataset (5,000 utterances).

I don't know what causes this. Do you have any ideas, or could you share your params.py with me?

SayHelloRudy commented on June 15, 2024

May I ask, did you remove the silence from the VCTK dataset?

DoritoDog commented on June 15, 2024

@YihWenWang How did you get the Tacotron mel spectrograms to work with Waveglow in the end? The two look similar, but it seems they are normalized differently somehow.

Tacotron value examples

[-54.77068739 -47.15882725 -45.828745   -44.59372329 -43.22799777
 -42.7517943  -42.11187298 -42.25688537 -42.81581903 -43.02588636, ...]

Waveglow value examples (for same audio file)

[-3.9470453 -2.820666  -2.7616765 -2.5435247 -2.5574331 -2.2251318
 -2.0958776 -2.0956624 -2.170114  -2.0375078, ...]
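
One hedged observation rather than a confirmed fix: the Tacotron values above look like decibel magnitudes (20·log10 of amplitude), while the Waveglow values look like natural-log mel amplitudes. If that were the only difference (i.e. no extra reference-level or min-max normalization in either pipeline), a rough conversion would be:

import numpy as np

def db_to_log_mel(mel_db):
    # 20 * log10(x) dB  ->  ln(x)
    return mel_db * np.log(10.0) / 20.0

tacotron_frame = np.array([-54.77, -47.16, -45.83, -44.59, -43.23])
print(db_to_log_mel(tacotron_frame))  # about -6.3 .. -5.0, still offset from Waveglow's -3.9 .. -2.0,
                                      # suggesting an additional normalization step in one of the pipelines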
