tomiinek / multilingual_text_to_speech

An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.

License: MIT License

Python 69.83% Shell 4.88% Jupyter Notebook 25.30%

code-switching multilingual speech-synthesis text-to-speech tts voice-cloning


multilingual_text_to_speech's Issues

How effective is my training so far?

What type of loss_total number, etc, should I be looking for to verify that things seem to be training correctly?

I'm currently at step: 3.792k, 3 hours 12 minutes, total loss 0.2972

Is there a way to specify phonemes, stress, and vowel lengths for the inputs?

Providing pre-phonetized content?

Is there a way to specify phonemes, stress, and vowel lengths for the inputs and skip the phonemizer step?

I'm looking for a way to synthesize two different Native American languages: Cherokee and Mohegan.

Mohegan

The Mohegan language is a stress-based language, and I'm hoping its phonetics map closely enough to one or more of the already-trained languages to create language lesson materials.

Cherokee

The Cherokee language is a tone-based language, which presents a challenge.

I'm wondering if it might be possible to "bootstrap" a language such as Cherokee using espeak-ng generated training audio and phonetics then use voice cloning for the actual output. (And skip training the vocoder on the espeak-ng audio).

Wavernn vocoder retraining

Thank you for the great job.
I would like to retrain the wavernn vocoder (with the generated_training configuration) and I'm not sure how to proceed.
Have you used ground truth mel coefficients as inputs or have you used ground truth aligned mel coefficients?
In the latter case (GTA), what is the sequence of commands to use? The gta.py script does not take the JSON parameters file as train.py does (via the hyper_parameter argument), so it seems to be usable only with the default (ljspeech) configuration. Moreover, it writes the ground-truth-aligned mel coefficients to .npy files whose names do not match the wav file names, whereas the preprocess.py script in the WaveRNN project expects the wav file and the mel .npy file to share the same root name.
Thanks again

train failed on multiGPUs

Training works fine on a single GPU but fails on 3 GPUs.

I only changed "max_gpus" to 3. Is there anything else I should change?

CUDA_VISIBLE_DEVICES=0,1,2 PYTHONIOENCODING=utf-8 nohup python -u train.py --hyper_parameters generated_training > log.file 2>&1 &

synthesize problem

Hello, when I synthesized a sentence, a problem occurred.

I tried another sentence, and the same problem occurred again.

About Generate Convolution Encoder

Hello, I have a problem with the encoder of the synthesizer.
Did you use the original Tacotron 2 encoder, which consists of convolution and LSTM layers?
Why do you use convolution and highway convolution in the encoder?

Thanks.

Set parameters for training two languages dataset

Hello
If I train on just two datasets (two languages), how should I set parameters such as generator_dim and generator_bottleneck_dim in generator_switching.json?

Google Colab For Training and Synthesis custom Audio sets on Weak Computers?

Hello, would it be possible for you to provide Google Colab notebooks for training and synthesis with our own custom audio files uploaded to Google Drive? That would make the project much easier to use on weak computers.

Note: the current Colab notebooks use the pre-trained models and work perfectly, but it would be great to also have notebooks for training and synthesizing with new models based on our own custom multilingual voices.

Thank you

Unable to find the procedure to retrain the WaveRNN vocoder

Hello,

I spent a lot of time trying to understand how Tomiinek's WaveRNN works so I could retrain it myself, but I failed.
There is no documentation, and I tried a lot of things...

For the moment I am stuck after generating the GTA files, because I cannot link a GTA file to its WAV file.

Can you help me, please?

Thanks

How to add english as a supported language?

Hello,

Thank you for making this code base available! this is absolutely fascinating stuff.

I am working on a research project where I need to produce accented English audio from custom text, i.e. I have audio of native speakers to which I want to "add an accent". I was able to run the two Google Colab notebooks you provided and found that the model can output dynamically accented audio. However, I noticed that English is not included in the audio files the models were trained on.

I want to add English as a supported language. One way I see to do so is to download the Common Voice English dataset or a subset of it, clean it into the format of your "cleaned" Common Voice dataset, and then follow the steps to train the models essentially from scratch. I see a few issues with this plan: a) your WaveRNN was pre-trained on CSS10 data, which doesn't include English, so if that's an issue I might also have to train WaveRNN again; b) I am not entirely sure yet how to "clean" the Common Voice data; c) the Common Voice English dataset is 50 GB, which is too big.

Essentially, I am hoping there is a simpler way to fine-tune the existing models to support English. If you could give me some direction on this, I'd very much appreciate it!

Best,

Vic

Integrate a new language

Hello and first of all thank you for your work!

What would be the steps to add a language? In this case French from France (and not from Quebec).

Can I use the result from this Training colab ?

https://colab.research.google.com/drive/14X73UiywnoL9VS30iPDcX4WXxwZWv2e2

Thank you! :)

Question: Adding speakers to existing languages.

Is it possible to stop training then add additional speakers to train.txt and val.txt then resume training?

Will the added speakers show up as additional entries in the model param dict or will it halt with an error?

I have a question about the paper.

Hi Tommiinek!
I have a question about the paper.

In the figure above, I think the input is multi-language.
Can you explain how the language ID is embedded?

Thank You!

Question : Adversarial Speaker Classifier

Thank you for your awesome paper & code.
I have a question about adversarial speaker classifier.

I think that, during training, the speaker classifier doesn't affect the rest of the model.
That is, the speaker classifier and the rest of the model are independent of each other during training. Right?

If my understanding is right, what is the speaker classifier used for?

Thank you.
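
For context on the question above: adversarial classifiers of this kind are commonly attached through a gradient reversal layer, in which case the classifier does influence the encoder during training. The formulation below is a sketch under that assumption. The symbols $R_\lambda$, $\lambda$, $L_{\mathrm{tts}}$ and $L_{\mathrm{spk}}$ are introduced here for illustration, and whether $\lambda$ corresponds to the `reversal_classifier_w` setting is also an assumption.

```latex
% Forward pass: identity. Backward pass: gradient scaled by -\lambda.
R_\lambda(h) = h, \qquad \frac{\partial R_\lambda(h)}{\partial h} = -\lambda I

% Resulting gradient for the encoder parameters \theta_{\mathrm{enc}}:
\frac{\partial L}{\partial \theta_{\mathrm{enc}}}
  = \frac{\partial L_{\mathrm{tts}}}{\partial \theta_{\mathrm{enc}}}
  - \lambda \, \frac{\partial L_{\mathrm{spk}}}{\partial \theta_{\mathrm{enc}}}
```

Under this reading, the classifier is trained to predict the speaker while the encoder is pushed to make that prediction harder, which would encourage speaker-independent text encodings.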

How to set gradient accumulation ?

I have to use small batches, (size 10), because of memory constraints on my GPU.

How do I set the number of batches over which gradients are accumulated as part of the hyperparameters?

Or is this calculated automatically from "ideal batch" / "actual batch" somewhere?
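
Whether the training script exposes a gradient accumulation setting is not stated in this issue. As a generic illustration of the technique, the sketch below uses a toy 1-D least-squares model in plain Python (all names are hypothetical, and this is not the repository's PyTorch code). It shows that accumulating size-weighted gradients from four micro-batches of 10 reproduces the gradient of one batch of 40.

```python
# Sketch: gradient accumulation for a 1-D least-squares model y = w * x.

def grad_mse(w, batch):
    """Gradient of the mean squared error over one batch with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Sum micro-batch gradients, each weighted by its share of the data."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        # Weighting by len(micro) / len(data) makes the sum equal to the
        # gradient of the mean loss over the full batch.
        total += grad_mse(w, micro) * len(micro) / len(data)
    return total


data = [(float(x), 3.0 * x) for x in range(1, 41)]  # 40 samples, true w = 3
w = 0.5
full = grad_mse(w, data)               # one "ideal" batch of 40
accum = accumulated_grad(w, data, 10)  # four micro-batches of 10
w_next = w - 1e-4 * accum              # one optimizer step after accumulating
print(full, accum, w_next)
```

In a PyTorch training loop the same idea is usually written by dividing each micro-batch loss by the number of accumulation steps, calling `backward()` on every micro-batch, and calling `optimizer.step()` and `optimizer.zero_grad()` only once per accumulated group.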

How can I expand a language like Korean?

Hi Tomminek!
Thank You for your great Paper & Source Code.

I want to expand a language like Korean.
Also, I want to apply voice cloning in the Korean Language.

For voice cloning, I think the Korean dataset should be made of multi-speakers... Did I get it right??
If so, then can you tell some parts I can refer to?? It would be very helpful.

Thank You.

CUDA Out-Of-Memory Error

Hi! I'm here to ask you a question about something strange.
I'm trying to train in an environment with four V100 GPUs, but an OOM error occurs whenever an epoch ends.

I think we have enough memory, do you have any idea why this is happening?

training problem

Hello, this project is very nice, and thank you for sharing it!
There is an error when I run train.py: RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

Environment: the system is Windows 10, and the rest of the environment is consistent with requirement.txt.
I don't know what causes this problem. Is it a problem with Windows 10, or do I need to change the code?

espeak and epitran error?

When I try to run the TextToSpeechDataset.create_meta_file method using the css10 dataset, I get this error.

Building phoneme dictionary: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0.0%
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/Multilingual_Text_to_Speech/utils/text.py in _phonemize(text, language)
     90         seperators = Separator(word=' ', phone='')
---> 91         phonemes = phonemize(text, separator=seperators, backend='espeak', language=language)
     92     except RuntimeError:

9 frames
RuntimeError: espeak not installed on your system

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/epitran/simple.py in _load_g2p_map(self, code, rev)
     90         except IndexError:
     91             raise DatafileError('Add an appropriately-named mapping to the data/maps directory.')
---> 92         with open(path, 'rb') as f:
     93             reader = csv.reader(f, encoding='utf-8')
     94             orth, phon = next(reader)

FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/dist-packages/epitran/data/map/chinese.csv'

I have installed the epitran and for installing espeak I used these:

! pip install python-espeak
and got this error

Collecting python-espeak
  Using cached https://files.pythonhosted.org/packages/59/5b/45437090dbd71ee9f586dc7f650c6e8c4815bd8bff9b2923d4db5b9120ed/python-espeak-0.6.3.tar.gz
Building wheels for collected packages: python-espeak
  Building wheel for python-espeak (setup.py) ... error
  ERROR: Failed building wheel for python-espeak
  Running setup.py clean for python-espeak
Failed to build python-espeak
Installing collected packages: python-espeak
    Running setup.py install for python-espeak ... error
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-nlsz0hgq/python-espeak/setup.py'"'"'; __file__='"'"'/tmp/pip-install-nlsz0hgq/python-espeak/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-ahzj2e1q/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
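
The first traceback reports that the eSpeak program itself is missing. The `python-espeak` pip package is a separate Python binding and, as far as I know, does not provide that program. A minimal sketch for checking whether an eSpeak executable is on the PATH follows; the function name `find_espeak` is hypothetical, and the install hint assumes a Debian-like system such as Colab.

```python
import shutil


def find_espeak(candidates=("espeak-ng", "espeak")):
    """Return the path of the first eSpeak executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


espeak_path = find_espeak()
if espeak_path is None:
    # Assumption: a Debian-like system where the package is named "espeak".
    print("No eSpeak binary on PATH; try the system package, e.g. apt-get install espeak")
else:
    print(f"Found eSpeak at {espeak_path}")
```

If no executable is found, installing the system package rather than a pip package is the likely fix. The second error (the missing `chinese.csv` map in epitran) comes from the fallback path and may disappear once the eSpeak backend works.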

[Question] About Generated Encoder

Hi Tomiinek!! I have a question about Generated-Encoder.
As I understand it, after training, the Generated-Encoder's fully-connected layer effectively becomes a dictionary:
{Key: language ID embedding, Value: conv or batch-norm weights}. Right??

If so, rather than generating the weights with a fully-connected layer, wouldn't it be better to store fixed weights per language and use those?
If not, is there a case where the weights are changed by something other than the language after training?

Thank You!!

Documentation Request: minimum GPU RAM recommended and the minimum batch size

What would be the minimum GPU RAM recommended and the minimum batch size to produce acceptable results?

Also, how long did training take for the pre-trained weights?

I currently have a GeForce GTX 1060 6GB.

Train with a small number of speakers

My purpose is to make a Vietnamese Speech Synthesis model be able to pronounce English.

But, I have just a Vietnamese dataset with 2 female speakers (13100 and 16500 utterances ~ around 50-55 hours).
I want to train this dataset with LibriTTS with 127 female speakers(~ 58 hours).

Do you have any suggestions for my experiments?

Extension of this project.

Hi @Tomiinek !!
Thanks to your great project, we were able to add an additional fun TTS project to our team's project called PORORO.
You can easily use many natural language processing and voice tasks including your TTS using pororo.

And I've expanded your project to English, Korean, and Jejueo. You can check at this page.
Thank you for your great project!!

Discuss Chinese-English mixed TTS.

Hi, @Tomiinek . It's a nice job, and it's an honor to see this project.
I have some questions about train.txt; hope you can solve my puzzle.

How to get the original train.txt file?
Git clone this project, under /data/css10/, we can see the train.txt (original).
After the prepare_css_spectrograms.py file is processed, we got the spectrogram and linear spectrogram and changed the structure of train.txt (processed)

(original) 000285|chinese|chinese|chinese/call_to_arms/call_to_arms_0285.wav|||húixiāngdòu de húizì， zěnyáng xiě de?|
(processed) 000285|chinese|chinese|chinese/call_to_arms/call_to_arms_0285.wav|spectrograms\000285.npy|linear_spectrograms\000285.npy|húixiāngdòu de húizì， zěnyáng xiě de?|

So, how to get the train.txt (original) file? I want to create it for another dataset.

The structure of train.txt

I have questions about the meaning of these variables: idx, s, ph

Multilingual_Text_to_Speech/data/prepare_css_spectrograms.py

Line 57 in ca00959

idx, s, l, a, _, _, raw_text, ph = i

idx: As long as we make sure that IDX is unique and points to specific audio, we can define it in any way we want, right? Usually, the id is defined by the name of the file, but I found that you did not do this. Can I define this variable with the filename of the audio?
s: Speaker? This puzzles me. If the dataset contains only one person's voice, is it set to the language name? Otherwise, is it set to a serial number for each person (0, 1, 2, 3, 4, ...)?
ph: I don't understand the meaning of this variable. Does it mean "\n"?
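
The quoted train.txt lines split into eight `|`-separated fields, which matches the eight names unpacked in prepare_css_spectrograms.py. The sketch below parses one such line. The field names are my reading of the example (speaker, language, audio path, the two spectrogram paths that are filled in after processing, raw text, and a final field that looks like the phonemized text and is empty here), so treat them as assumptions; `MetaEntry` and `parse_meta_line` are hypothetical helpers.

```python
from typing import NamedTuple


class MetaEntry(NamedTuple):
    idx: str                 # unique utterance id
    speaker: str             # "s" in the script (assumed meaning)
    language: str            # "l" in the script (assumed meaning)
    audio: str               # "a": path to the wav file
    spectrogram: str         # empty before preprocessing
    linear_spectrogram: str  # empty before preprocessing
    raw_text: str
    phonemes: str            # "ph": assumed to hold phonemized text


def parse_meta_line(line: str) -> MetaEntry:
    """Split one '|'-separated metafile line into its eight fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    return MetaEntry(*fields)


original = ("000285|chinese|chinese|chinese/call_to_arms/call_to_arms_0285.wav"
            "|||húixiāngdòu de húizì， zěnyáng xiě de?|")
entry = parse_meta_line(original)
print(entry.idx, entry.speaker, entry.audio, repr(entry.phonemes))
```

Under this reading, the trailing `|` in the original file simply leaves the last field empty, which would explain why `ph` looks like nothing more than the end of the line.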

Any paper of this work available?

Very interesting paper! I'm wondering whether a paper on this work is available :)

The problem of voice quality and voice conversion

Hello, this project is so nice, and thank you for sharing it!
I've trained an English and Chinese model with hundreds of speakers per language using LibriTTS, thchs30 (a Chinese dataset), and a private dataset. All data were resampled to 22k and denoised. This time I tried using phonemes (phonemize) and, in particular, added tones for Chinese.
It has now trained for 25k steps and the loss is dropping well. The result is OK and the pronunciation is correct. But it still has problems:

The inferred audio always contains noise.
When I try voice conversion (which is my main task) to make a speaker who has never spoken Chinese speak it well, the output audio is not in his voice at all. That really confuses me.

So I'm wondering if you can give me some advice on how to improve it. Thx!
Here are my params:

"balanced_sampling": True,
"batch_size": 80,
"case_sensitive": False,
"checkpoint_each_epochs": 20,
"encoder_dimension": 256,
"encoder_type": "generated",
"epochs": 1000,
"generator_bottleneck_dim": 1,
"generator_dim": 2,
"languages": ["zh", "en"],
"language_embedding_dimension": 0,
"learning_rate": 0.001,
"learning_rate_decay_each": 10000,
"learning_rate_decay_start": 10000,
"use_phonemes":True,
"multi_language": True,
"multi_speaker": True,
"perfect_sampling": True,
"predict_linear": False,
"reversal_classifier": True,
"reversal_classifier_dim": 256,
"reversal_classifier_w": 0.125,
"reversal_gradient_clipping": 0.25,
"speaker_embedding_dimension": 256,

WaveRNN Dataset format

Hi Tomiinek!!
I'm sorry I seem to ask you questions all the time.
I'm so interested in your project that much, so please understand!

I'm going to train the WaveRNN model to fit my dataset.
But I couldn't find anything in your WaveRNN repo about the format the dataset should have.

Can you give me any advice on how to configure dataset formats or give me relevant materials?

Thank You!!

I have a question about Vocoder.

Hi Tomiinek!!

I have a question about vocoder.
Your model supports 10 languages but one vocoder was used.

I wonder whether the voice quality would improve if I used 10 vocoders (one per language).
I'd like to hear your thoughts.

Thank You!

Eval/classifier is too small, and synthesised speech unclear.

Thanks for your excellent work. I'm trying to synthesize Chinese-English code-switched speech. The datasets are as follows:
ST-CMDS: Chinese, 853 speakers, 120 utterances per speaker, sample rate 16k
VCTK: English, 108 speakers, 123~502 utterances per speaker, sample rate 48k
Chinese transcriptions are converted to pinyin with tones using pypinyin. VCTK audio is downsampled to 16k with ffmpeg.
I'm training model on 8 v100 32G gpus. Parameters are:
"balanced_sampling": true,
"batch_size": 512,
"case_sensitive": true,
"characters": " !',-.?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzàáèéìíòóùúüāēěīńōūǎǐǒǔǘǚǜ",
"checkpoint_each_epochs": 1,
"dataset": "stcmds_vctk",
"encoder_dimension": 256,
"encoder_type": "generated",
"epochs": 300,
"generator_bottleneck_dim": 1,
"generator_dim": 2,
"languages": ["en", "zh"],
"language_embedding_dimension": 0,
"learning_rate": 0.005,
"learning_rate_decay_each": 1000,
"learning_rate_decay_start": 1000,
"multi_language": true,
"multi_speaker": true,
"perfect_sampling": true,
"predict_linear": false,
"reversal_classifier": true,
"reversal_classifier_dim": 256,
"reversal_classifier_w": 0.125,
"reversal_gradient_clipping": 0.25,
"speaker_embedding_dimension": 1024,
"version": "GENERATED-SWITCHING"
After training for 37 hours, logs are:

Eval/classifier seems to be too small, and the audio in "Audio/forced" and "Audio/generated" is very unclear.
Any advice, please? Thanks in advance.

HighwayConvBlockGenerated class return size miss match error

Thanks for your efforts.

Could you please tell me what this code in the "HighwayConvBlockGenerated" class does?
When training on my custom data, it raises an error saying that the dimensions of x and (h2 * p) do not match.
The dimension of (h2 * p) is half that of x, so they cannot be added.
The error: RuntimeError: The size of tensor a (3584) must match the size of tensor b (1792) at non-singleton dimension 1

def forward(self, x):
    e, x = x
    _, h = super(HighwayConvBlockGenerated, self).forward((e, x))
    chunks = torch.chunk(h, 2 * self._groups, 1)
    h1 = torch.cat(chunks[0::2], 1)
    h2 = torch.cat(chunks[1::2], 1)
    p = self._gate(h1)
    return e, h2 * p + x * (1.0 - p)
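
A shape-only sketch of what the forward method above does along the channel dimension, using plain Python lists in place of tensors. The helper names `chunk` and `split_highway` and the group count of 2 are hypothetical. The sketch only illustrates that `h2` always has half as many channels as the convolution output `h`, so the highway sum needs `h` to have twice the channels of `x`. That the convolution in the failing run produced 3584 rather than 7168 channels is a guess from the reported shapes, not something the issue confirms.

```python
def chunk(seq, n):
    """Split seq into n equal contiguous parts (assumes len(seq) is divisible by n)."""
    size = len(seq) // n
    return [seq[i * size:(i + 1) * size] for i in range(n)]


def split_highway(channels, groups):
    """Mimic the even/odd chunk split of the forward method on channel indices."""
    chunks = chunk(channels, 2 * groups)
    h1 = [c for part in chunks[0::2] for c in part]  # even chunks: gate input
    h2 = [c for part in chunks[1::2] for c in part]  # odd chunks: candidate values
    return h1, h2


h1, h2 = split_highway(list(range(8)), groups=2)
print(h1, h2)  # [0, 1, 4, 5] [2, 3, 6, 7]

# h2 always has half the channels of the convolution output h.
x_channels = 3584
_, ok = split_highway(list(range(2 * x_channels)), groups=2)
_, bad = split_highway(list(range(x_channels)), groups=2)
print(len(ok), len(bad))  # 3584 1792
```

The second case reproduces the 3584 versus 1792 mismatch from the error message, so checking the output channel count of the generated convolution for the custom configuration would be a reasonable first step.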

When this model can converge?

I have trained this model for 7k steps, but I cannot get any acceptable audio. The dataset contains 40 hours of audio; I predict linear spectrograms and then use Griffin-Lim to get the waveform.

The synthesized audio sounds like it went through a voice changer.

audio.zip
I used datasets for two languages: VCTK (English, 30 speakers, 11654 clips) and STCMDS (Chinese, 30 speakers, 3600 clips).
The result is good when I synthesize Chinese text.
But when I used datasets for three languages, namely VCTK, STCMDS, and TAT (Minnan language, 30 speakers, 8586 clips), the result was bad.
When I synthesize Chinese text, the audio sounds like it went through a voice changer, and the speech is too fast.
The compressed file contains synthesized waveforms from both the two-language and the three-language models.
I have no idea what causes this problem.

[Question] Model Capacity

Wouldn't model performance be better if we increase the number of model parameters?
Have you ever done an experiment like this?

could you tell me where's the pretrained models? Thanks.

Where can I download the pretrained model? I can't find the file at "https://github.com/Tomiinek/Multilingual_Text_to_Speech/releases/download/v1.0/$tacotron_chpt" anymore.

Thanks very much!

Basic workflow on how to train the model and synthesize it? When to use the datasets. Help! 🙏

Hello, I am struggling to get an idea of the workflow of this awesome project.

The readme says we have to use comvoi.zip, and for now I am using the css10 dataset for three languages.
Everything is set up, but could somebody please explain the workflow: what needs to be configured, and why, to train the model and then synthesize with WaveRNN?

I have a lot of questions but let's first see if someone will guide me for the initial steps.

Memory leak

There appears to be some long term running memory leak, probably related to graphs. As the training progresses, my Xorg memory consumption gradually increases. If I stop the training, the memory is instantly released.

I suspect it is related to graphs because of the following warning:

RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  fig = plt.figure(figsize=(16, 4))

[Question] WaveRNN Vocoder

Hi! Tomiinek!!
I trained tacotron with generated-training.json config.
With your WaveRNN weight file, result of tts are pretty good.

But the WaveRNN model you released was pre-trained on only 5 languages, so the results for other languages were somewhat disappointing.
So I trained WaveRNN with your WaveRNN repository,
following your comment here.

I've trained WaveRNN for 600k steps, but it doesn't produce any voice at all.
So I'm confused. The WaveRNN you released produces voice well, so why can't mine do it at all?

The figure above is a waveform made from my WaveRNN.

The figure above is a waveform made from your WaveRNN.

The training loss scale is about 2.3. Here are sample quant and mel spectrogram. Quant Sample Mel Sample.
I think training should work as long as there are (mel spectrogram, waveform) pairs, but I don't know why it doesn't.
The figures below show the quant sample and the mel sample.

I did not touch the parameters of hparams.py in the WaveRNN repository. Could you check it for me?
Please help me Tomiinek.
Thank you.

I want add support for Urdu

Hello,
This is a great repo, thanks!
I need to train a model for Urdu (Arabic script). How can I make a dataset like yours? Do I need to change the vocabulary somewhere?

Train pb on WaveRNN

Hello Tomáš,

I have a problem when I try to train my dataset with WaveRNN.
It returns me this error:

voxfurem@cc:~/nvme/src/WaveRNN$ python train_wavernn.py --force
Initialising Model...
Trainable Parameters: 4.744M
Path=/media/nvme/src/WaveRNN/dataset
Batch size=32
Paths = <utils.paths.Paths object at 0x7fab0adebca0>
NUM EXIST = 3
Restoring from latest checkpoint...
Loading latest weights: /media/nvme/src/WaveRNN/checkpoints/css_raw.wavernn/latest_weights.pyt
Loading latest optimizer state: /media/nvme/src/WaveRNN/checkpoints/css_raw.wavernn/latest_optim.pyt
| Epoch: 1/147059 (1/68) | Loss: 6.9385 | LR: 0.0010 | 1.3 steps/s | Step: 0k | Traceback (most recent call last):
  File "train_wavernn.py", line 136, in <module>
    voc_train_loop(paths, voc_model, loss_func, optimizer, scheduler, train_set, test_set, total_steps)
  File "train_wavernn.py", line 31, in voc_train_loop
    for i, (x, y, m) in enumerate(train_set, 1):
  File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 838, in _next_data
    return self._process_data(data)
  File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/voxfurem/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/media/nvme/src/WaveRNN/utils/dataset.py", line 109, in collate_vocoder
    labels = np.stack(labels).astype(np.int64)
  File "/home/voxfurem/.local/lib/python3.8/site-packages/numpy/core/shape_base.py", line 416, in stack
    raise ValueError('all input arrays must have the same shape')
**ValueError: all input arrays must have the same shape**

I spent some hours trying to understand this error. I wrote a script to search for the file that might be causing it, restarted with another voice package, etc., but I failed to find a solution.

Do you have any idea what the root cause of this problem is?

Thanks in advance
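
One possible cause of this `np.stack` error, offered as an assumption and not a diagnosis, is a clip whose mel spectrogram is shorter than the window sliced from every item in a batch: short clips then yield shorter slices, and stacking fails. The sketch below flags such clips from a table of frame counts. The function names, the windowing formula, and the values of `seq_len`, `hop_length` and `pad` are all hypothetical and should be replaced with those from the actual hparams and collate function.

```python
def min_frames_required(seq_len, hop_length, pad):
    """Mel frames one training window needs under the assumed windowing scheme."""
    return seq_len // hop_length + 2 * pad


def too_short(frame_counts, seq_len, hop_length, pad):
    """Return the ids of clips whose mel has fewer frames than one window."""
    need = min_frames_required(seq_len, hop_length, pad)
    return sorted(name for name, n in frame_counts.items() if n < need)


# Hypothetical frame counts, e.g. collected from the shapes of the mel .npy files.
frame_counts = {"clip_a": 100, "clip_b": 5, "clip_c": 9}
short_clips = too_short(frame_counts, seq_len=1280, hop_length=256, pad=2)
print(short_clips)  # ['clip_b']
```

Removing or padding the flagged clips and retrying would confirm or rule out this explanation. A length mismatch between a clip's mel file and its quantized audio file is another candidate worth checking.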

options list

Hello and thank you for sharing your work!
I was wondering if you could provide a list of the available languages and speakers in your pretrained models.
In addition, do you happen to have a female Spanish speaker?
Thank you very much in advance!
Lucía

WaveRNN's not learning well.

Hi Tomiinek!!
I used your WaveRNN Repo to train vocoder.
I've formatted the data, I've trained WaveRNN, but I haven't been able to produce a voice at all.

Is the hyperparameter the same as the one at hparams.py when you train your model?

Streaming text to speech

Firstly, thanks for this great work.
I'm trying to use your model for streaming text-to-speech applications. The quality is good, but the speed is slow. I'm using a GPU Tesla V4 with 16G RAM.
Is there any change to the configuration that could help speed up the process?

Currently, producing 2 seconds of audio takes 14 seconds of processing, measured from the input text to the produced audio.
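
The figures above can be expressed as a real-time factor (processing time divided by audio duration), the usual yardstick for streaming synthesis. A minimal sketch with a hypothetical helper name:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by the duration of the audio produced."""
    return processing_seconds / audio_seconds


rtf = real_time_factor(14.0, 2.0)
print(rtf)  # 7.0
```

A real-time factor below 1 is needed for audio to be produced at least as fast as it is played, so the reported setup would need roughly a sevenfold speed-up before streaming is feasible.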

[Question] What hyperparameters need to be changed if the language is extended?

Hi Tomiinek!!
I have a question. What hyperparameters need to be changed if the language is extended?

You experimented using CSS10. (10 Languages)
If I experiment with 14 languages, which hyperparameters would I need to change?
Please give me your wisdom.

Correct file to generate the spectrograms?

Hi Tommy, could you please explain the different uses of prepare_css_spectrograms.py and TextToSpeechDataset.create_meta_file?

And how do I run TextToSpeechDataset.create_meta_file if I am using the css10 dataset? For example, which settings for the batch file and which other settings would you recommend?

I hope you won't ignore me this time.

CPU Bottleneck?

I'm working on seeing what can be done with smaller batch sizes with the existing data sets, and currently have my batch size set to '10'. This results in a bit over 4GB memory being allocated on my NVIDIA GPU which only has 5GB-5.5GB available memory.

I'm using pre-generated spectrograms.

However, the GPU usage seems a tad low and the CPU usage high.

Right now top shows python at 100% CPU utilization. With GPU usage fluctuation between about 45% to 88%.

Is there some other data that can be pre-calculated and cached to reduce python bottlenecking on the CPU?

Using the NPY output with a different vocoder?

Is it possible to use the saved npy output files with a different vocoder?

If yes, are there any special steps required for shaping, etc?

AssertionError: Validation set contains speakers which are not present in train set!

I am trying to train the system on the CSS10 dataset and am receiving the error stated in the title of this issue.

The error is raised following the training command PYTHONIOENCODING=utf-8 python3 train.py --hyper_parameters generated_switching as instructed in the documentation. I have downloaded all languages of CSS10 and changed the dataset parameters in params.py file as such:

******************* DATASET SPECIFICATION *******************

dataset = "css10"
cache_spectrograms = True
languages = ['zh', 'fi', 'de', 'el', 'hu', 'ja', 'kss', 'ru', 'es', 'fr', 'nl']
balanced_sampling = False
perfect_sampling = False

I have also preprocessed all the spectrograms with the prepare-spectrograms script. I do not understand where I went wrong in following the documentation. I understand the error but not its cause, and suppressing it in the code does not work. Can anyone help?
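
To see which speakers trigger the assertion, the two metafiles can be compared directly. The sketch below assumes the `|`-separated format quoted in another issue on this page, with the speaker in the second column. The helper names and sample lines are hypothetical; in practice the lines would be read from train.txt and val.txt.

```python
def speakers_in(lines):
    """Collect the speaker field (second column) from '|'-separated metafile lines."""
    return {line.split("|")[1] for line in lines if line.strip()}


def missing_speakers(train_lines, val_lines):
    """Speakers that occur in the validation lines but not in the training lines."""
    return sorted(speakers_in(val_lines) - speakers_in(train_lines))


# Hypothetical sample lines standing in for the contents of train.txt and val.txt.
train_lines = [
    "000001|chinese|chinese|chinese/a.wav|||text one|",
    "000002|german|german|german/b.wav|||text two|",
]
val_lines = [
    "000003|chinese|chinese|chinese/c.wav|||text three|",
    "000004|french|french|french/d.wav|||text four|",
]
print(missing_speakers(train_lines, val_lines))  # ['french']
```

Any speaker reported here has validation utterances but no training utterances, which is the condition the assertion message describes.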

Documentation Request: Include instructions on how to fine tune pre-existing weights

Please include instructions on how to resume training starting with your 70k iteration weights.

Would it be possible to add additional languages as part of a fine tuning process?

problem training

I want to train using generated_training on two languages "russian"+"french".
I downloaded CSS10 data from links you gave on these languages, ran the prepare_css_spectrograms.py, created a custom generated_training.json file

I run train.py on 2 GPUs

I have several issues:

In addition, it's been running for 3 hours but nothing seems to happen (TensorBoard is empty). Is that normal?
How long should a training run like that take?

PS. I switched tensorboard to tensorboardX in utils/logging.py due to an issue I had. I don't know if that has any effect...

How to add other language?

Thank you for sharing
I want to ask about setting up training for 2 languages: Vietnamese and English (using LJSpeech).
Could you tell me which places need to be configured?
I find your code a bit confusing.
