
Comments (31)

Tomiinek commented on June 15, 2024

Oh, I see.

First, training on just two languages does not make full use of the model's capabilities. The more languages you have, the more information there is that can be shared across them.

Second, you should have more speakers for each language (or very similar voices for both single-speaker languages). The model has to learn to separate language-dependent from speaker-dependent information. There is the multi-speaker VCTK dataset for English (a subset with a few speakers should be sufficient) and some Chinese voices in my cleaned Common Voice data (see the readme). You do not need many examples per speaker (50 transcript-recording pairs per speaker can be enough), but you should have more speakers with diverse voices (say 10 or 20). If this is your case, just add this multi-speaker data to your current dataset and the results should improve.

Third, you should definitely reduce generator_dim to something like 2-4, and generator_bottleneck_dim should be lower than that, e.g., 1 or 2. Also, speaker_embedding_dimension should roughly correspond to the number of speakers you have, so with around 20 speakers, something like 16 or 32.

Finally, there is reversal_classifier_w, which controls the weight of the adversarial speaker classifier's loss. This parameter is really tricky: high values prevent the model from converging, while low values have no effect. However, you should first try to make your data multi-speaker.
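
To make this concrete, here is a hedged example only: plausible values for a two-language setup with roughly 20 speakers, written as a Python dict mirroring what you would put in your parameter JSON file. The key names are the ones discussed in this thread; the exact values depend on your data.

params = {
    "generator_dim": 2,                 # 2-4 for two languages
    "generator_bottleneck_dim": 1,      # lower than generator_dim
    "speaker_embedding_dimension": 16,  # roughly matches ~20 speakers
    "reversal_classifier_w": 0.125,     # tune carefully; too high prevents convergence
}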

YihWenWang commented on June 15, 2024

About "language_embedding_dimension" and "generator_dim" , do they have the same meaning ?
And then, why language_embedding_dimension is set zero when training the five languages model ?

YihWenWang commented on June 15, 2024

@YihWenWang Hi, I'm trying to train a Chinese-English mixed TTS model too,
for code-switched sentences like the ones in everyday conversation, such as:

——A: 请问你从事什么领域的研究? (What field do you do research in?)
——B: 我从事Computer Vision方面的研究工作。 (I do research on Computer Vision.)

I plan to use the LJSpeech, ST-CMDS, and Biaobei datasets to train it.
I want to synthesize sentences like the example above, but I am not familiar with the TTS field.
Could you give me a brief outline of the steps needed to adapt this project?

I use the VCTK (English, 30 speakers) and STCMDS (Mandarin, 30 speakers) datasets.
My steps:

  1. Download the datasets and put them into the "data" folder.
  2. Prepare train.txt and val.txt. Each entry must contain the speaker label, language, audio, spectrogram, linear spectrogram, and text; transliteration is not strictly required.
  3. I use the "pypinyin" package to convert Mandarin text to pinyin (see the sketch after this list).
  4. Modify prepare_css_spectrograms.py in the "data" folder so that it points to your dataset paths.
  5. Make sure the "sample rate" parameter matches the sample rate of the audio in your datasets.
  6. Adjust the "generator_bottleneck_dim" and "generator_dim" parameters according to the number of languages.
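
As a small illustration of step 3 (a sketch only; Style.TONE3 is just one of the tone notations pypinyin supports, and the exact text format expected by the metadata files may differ):

from pypinyin import lazy_pinyin, Style

text = "推荐一些社会书"
# convert Mandarin characters to pinyin with tone numbers, e.g. "tui1 jian4 yi1 xie1 she4 hui4 shu1"
pinyin = " ".join(lazy_pinyin(text, style=Style.TONE3))
print(pinyin)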

leijue222 commented on June 15, 2024

Thank you for your reply. Your suggestions give me a clearer idea of how to do this work.
Thanks again!

leijue222 commented on June 15, 2024

These are my synthesized speech samples.
I used the VCTK and STCMDS datasets to train this model.
My synthesized text is "Recommend the some 社會書。".
Thanks.
VCTK_STCMDS.zip

Hi! I am training on LJSpeech and Biaobei now (hours = 12, epoch = 25, steps = 9K). Currently, the English output is starting to sound like words, but the Chinese still sounds like nothing.
So I would like to ask how long you trained before the Chinese output started to sound like Chinese characters, and how long it took to get a final result you consider good.

YihWenWang commented on June 15, 2024

Hello, I used a V100 and trained for three days.

Tomiinek commented on June 15, 2024

Hi! 😄
I am not sure I understand your question 😟 What do you mean by "two kinds of datasets"? You can change the parameters arbitrarily or create another file with your own parameters. The dataset is specified by the dataset parameter (with values such as css_comvoi, css10, ljspeech).

YihWenWang commented on June 15, 2024

Thanks. I trained on a dataset containing two languages (English and Chinese) and didn't change any parameters except for "languages" in generator_switching.json, but I got a bad result by epoch 120.
When I synthesized a sentence containing both languages with a fixed speaker, the result contained the voices of two different speakers rather than one.
I have no idea what causes this.

YihWenWang commented on June 15, 2024

Okay, thank you very much.
I will try training with a dataset that includes more speakers for each language and adjust the parameters.

YihWenWang commented on June 15, 2024

Hello,
I use the VCTK dataset (108 speakers) for English and the THCHS-30 dataset (60 speakers) for Chinese.
My generated_switching.json settings:
generator_dim = 4,
generator_bottleneck_dim = 2,
speaker_embedding_dimension = 64,
reversal_classifier_w = 0.125
With the trained model, I can synthesize both languages in one sentence with a single speaker's voice.
But there is a problem: if I synthesize a sentence such as "recommend the some 社會書。", the volume of the second half of the sentence becomes lower.
I have no idea what causes this.

Tomiinek commented on June 15, 2024

你好 (hello) 😁

Do I understand correctly that the voice stays the same throughout the whole sentence, but the volume changes? This might be caused by the recordings (from the two different datasets) being normalized in different ways. Do the corresponding spectrograms have similar magnitudes?

You can try to normalize your audio files and repeat training with the new data. For example, you can run the following command to normalize every .wav file in your-directory to the same peak level.

find "your-directory" -name '*.wav' | while read f; do
    sox "${f}" tmp.wav gain -n -3 && mv tmp.wav "${f}"
done

Hope it helps 😇 再见 (goodbye)

YihWenWang commented on June 15, 2024

Thanks for your suggestion. I will try it.
But I still have a question: if the sample rate of the dataset is 16,000 Hz, which parameters should I modify besides the sample rate?

Tomiinek commented on June 15, 2024

You do not have to change stft_window_ms or stft_shift_ms, because these values are in milliseconds. However, you can reduce num_fft to a lower value such as 1024, because stft_window_ms / 1000 * sample_rate gives you a window of roughly 800 samples, and num_fft only needs to be the next power of two above that.
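
A quick sanity check of that arithmetic (a sketch only; the 50 ms window below is an assumption about the stft_window_ms value):

sample_rate = 16000
stft_window_ms = 50
window_samples = int(sample_rate * stft_window_ms / 1000)  # 800 samples per window
num_fft = 1 << (window_samples - 1).bit_length()           # next power of two: 1024
print(window_samples, num_fft)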

YihWenWang commented on June 15, 2024

Sorry, I want to ask another question.
How do I get the mel spectrogram back from the .npy file?
I want to use Waveglow to synthesize the waveform.

Tomiinek commented on June 15, 2024

Hm, spectrograms are stored in two-dimensional numpy arrays and saved into .npy files. Just use numpy.load to load them back into memory.

If you want to train the Waveglow model on spectrograms produced by your Tacotron, use the gta.py script, which produces ground-truth-aligned (GTA) spectrograms given your model and the original spectrograms.
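
A minimal sketch of loading one of these files (the path below is hypothetical; point it at one of your generated spectrograms):

import numpy as np

mel = np.load("output/mel_spectrograms/000001.npy")  # hypothetical path
print(mel.shape)  # a 2-D array, e.g. (num_mel_bins, num_frames)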

YihWenWang commented on June 15, 2024

[screenshot: the mel spectrogram loaded from the .npy file]
But I don't need to train Waveglow; I have already trained it.
I want to feed the synthesized .npy file to Waveglow and have it synthesize the audio.
The figure shows the mel spectrogram I loaded from the .npy file and tried to synthesize audio from with Waveglow.
But I only get noise.

Tomiinek commented on June 15, 2024

Spectrograms seem to be ok, so I am afraid I cannot help you.

Just a few hints that come to mind and might help you debug:

  • Does synthesis using Griffin-Lim work?
  • What is the range of the spectrogram values?
  • Does Waveglow expect spectrograms normalized in some particular way?
  • What do the spectrograms produced by the Waveglow preprocessing look like?

Tomiinek commented on June 15, 2024

No, they don't. language_embedding_dimension specifies the dimension of the language embedding concatenated to the decoder input, while generator_dim defines the dimension of the language embedding used in the parameter generator. language_embedding_dimension is set to zero in that configuration because the model already has enough information about the language from the encoder.
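
A purely illustrative sketch of the difference (the module and variable names below are assumptions for the example, not the repository's actual code):

import torch
import torch.nn as nn

num_languages = 5
language_embedding_dimension = 4   # set to 0 in the five-language config
generator_dim = 2
layer_in, layer_out = 8, 8         # size of the generated layer, for illustration

decoder_lang_emb = nn.Embedding(num_languages, language_embedding_dimension)
generator_lang_emb = nn.Embedding(num_languages, generator_dim)
weight_generator = nn.Linear(generator_dim, layer_in * layer_out)  # predicts a weight matrix

lang = torch.tensor([1])
decoder_input = torch.randn(1, 16)

# (1) language_embedding_dimension: the embedding is simply concatenated to the decoder input
decoder_input_with_lang = torch.cat([decoder_input, decoder_lang_emb(lang)], dim=-1)

# (2) generator_dim: the embedding parameterizes the weights of a language-specific layer
w = weight_generator(generator_lang_emb(lang)).view(layer_out, layer_in)
x = torch.randn(1, layer_in)
y = x @ w.t()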

lightwithshadow commented on June 15, 2024

@YihWenWang Hello Wang, can you share your synthesized speech samples?
Thanks!

YihWenWang commented on June 15, 2024

These are my synthesized speech samples.
I used the VCTK and STCMDS datasets to train this model.
My synthesized text is "Recommend the some 社會書。".
Thanks.
VCTK_STCMDS.zip

Maxxiey commented on June 15, 2024

@YihWenWang The samples sound nice. Did you train this model using only VCTK and STCMDS?

YihWenWang commented on June 15, 2024

Yes, I only used the VCTK (English, 30 speakers) and STCMDS (Mandarin, 30 speakers) datasets.

leijue222 commented on June 15, 2024

@YihWenWang Hi, I'm trying to train a Chinese-English mixed TTS model too,
for code-switched sentences like the ones in everyday conversation, such as:

——A: 请问你从事什么领域的研究? (What field do you do research in?)
——B: 我从事Computer Vision方面的研究工作。 (I do research on Computer Vision.)

I plan to use the LJSpeech, ST-CMDS, and Biaobei datasets to train it.
I want to synthesize sentences like the example above, but I am not familiar with the TTS field.
Could you give me a brief outline of the steps needed to adapt this project?

leijue222 commented on June 15, 2024

By the way, do you use phonemes when training?

YihWenWang commented on June 15, 2024

No, I didn't use phonemes when training.
I just use the labels of text, speaker, language, spectrogram, and linear spectrogram.
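
Purely as an illustration of what such a labeled entry could look like (the column order and paths below are assumptions for the sketch, not necessarily the project's actual metadata format):

line = "000001|spk01|zh|audio/000001.wav|mel/000001.npy|lin/000001.npy|tui1 jian4 yi1 xie1 she4 hui4 shu1"
utt_id, speaker, language, wav_path, mel_path, lin_path, text = line.split("|")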

leijue222 commented on June 15, 2024

Thanks. I have a problem with the Mandarin tones.
No matter whether I use pinyin or phonemes for training, the pronunciation of the four tones is not accurate.

I use the "pypinyin" package to convert Mandarin text.

The differences between us are:
I use the pinyin package listed in the requirements.txt file.
I use the Biaobei dataset (10,000 utterances) and the LJSpeech dataset (5,000 utterances).

I don't know what causes this. Do you have any ideas, or could you share your params.py with me?

SayHelloRudy commented on June 15, 2024

May I ask, did you remove the silence from the VCTK dataset?

DoritoDog commented on June 15, 2024

@YihWenWang How did you get the Tacotron mel spectrograms to work with Waveglow in the end? The two look similar, but it seems they are normalized differently somehow.

Tacotron value examples

[-54.77068739 -47.15882725 -45.828745   -44.59372329 -43.22799777
 -42.7517943  -42.11187298 -42.25688537 -42.81581903 -43.02588636, ...]

Waveglow value examples (for same audio file)

[-3.9470453 -2.820666  -2.7616765 -2.5435247 -2.5574331 -2.2251318
 -2.0958776 -2.0956624 -2.170114  -2.0375078, ...]
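
One hedged observation rather than a confirmed fix: the Tacotron values above look like decibel magnitudes (20·log10 of amplitude), while the Waveglow values look like natural-log mel amplitudes. If that were the only difference (i.e. no extra reference-level or min-max normalization in either pipeline), a rough conversion would be:

import numpy as np

def db_to_log_mel(mel_db):
    # 20 * log10(x) dB  ->  ln(x)
    return mel_db * np.log(10.0) / 20.0

tacotron_frame = np.array([-54.77, -47.16, -45.83, -44.59, -43.23])
print(db_to_log_mel(tacotron_frame))  # about -6.3 .. -5.0, still offset from Waveglow's -3.9 .. -2.0,
                                      # suggesting an additional normalization step in one of the pipelines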
