Comments (18)
@traderpedroso thanks for answering all my questions in such detail. I will try to build a multilingual TTS model and will report back if it is successful.
You have to train the PL-BERT model on a dataset in the specific language you want. A text dataset larger than 30MB is sufficient, though you can use a bigger one. Then use that trained PL-BERT model in StyleTTS2. Since you want to work with multilingual data, you of course need a phonemizer and tokenizer that support each specific language. And you have to train StyleTTS2 (stage 1 and stage 2) with the dataset for that language (train_list.txt, val_list.txt and OOD_texts.txt).
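As a sketch of the language-specific phonemization step (assuming the phonemizer package with an espeak-ng backend; the sample strings and language codes are just illustrations):

# Sketch: per-language phonemization for preparing PL-BERT / StyleTTS2 text data.
# Assumes the `phonemizer` package with espeak-ng installed.
from phonemizer import phonemize

# Each language uses its own espeak voice code.
hindi_ipa = phonemize("नमस्ते दुनिया", language="hi", backend="espeak", strip=True)
french_ipa = phonemize("bonjour le monde", language="fr", backend="espeak", strip=True)
print(hindi_ipa)
print(french_ipa)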
@SandyPanda-MLDL Thanks for the quick reply and for answering the first question; I understand the PL-BERT training with a multilingual dataset now.
How about the other three questions?
2. Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English?
3. For multiple languages, do we need to add data for each language to OOD_data: "Data/OOD_texts.txt"?
4. Do we need to add a language ID during data preparation, similar to adding a speaker ID in train_list.txt/val_list.txt?
According to the README, the ASR model performs well in other languages. I tested it and it does indeed work fine. However, when I trained my own ASR model, StyleTTS improved dramatically. After that, I decided to train all the models on my own data and achieved quality on par with what the model delivers in English.
@traderpedroso Thanks for the reply.
- Did you fine-tune the ASR model (https://github.com/yl4579/AuxiliaryASR) on top of the existing ASR model, or train it from scratch with multiple languages?
- Did you also try to train the PL-BERT model with multiple languages? If yes, can we combine multiple languages, and do we need equal amounts of training data for each language?
I used the PL-BERT recommended in the multilingual repository (https://huggingface.co/papercup-ai/multilingual-pl-bert) and it worked perfectly. For the ASR, I tested fine-tuning and also training from scratch; both approaches gave me the same result. Of course, the ASR that I trained from scratch was for a single language.
From my experience training StyleTTS 2, it's only worthwhile because inference is very fast and uses little VRAM; the training cost makes it somewhat impractical. Besides, you can only train the second stage on a single GPU. Granted, I didn't train the model from scratch, which would be even more expensive, but I can guarantee the quality is sensational. Another advantage of StyleTTS 2 is that it doesn't hallucinate; the generated audio is extremely reliable, especially for real-time streaming applications that don't need monitoring. However, in terms of cost vs. benefit, I personally prefer Tortoise for the final outcome.
@traderpedroso Thanks, I understood the AuxiliaryASR part. I will train it from scratch if the quality is bad.
- My use case is multilingual TTS for Indian languages, but Indian languages are not covered by PL-BERT (https://huggingface.co/papercup-ai/multilingual-pl-bert), so do you think we can still use multilingual-pl-bert for unseen languages?
- For multiple languages, do we need to add data for each language to OOD_data: "Data/OOD_texts.txt"?
- Do we need to add a language ID during data preparation for the multilingual use case, similar to adding a speaker ID in train_list.txt/val_list.txt? Otherwise, how will the model know which language to select at inference time?
Make sure the speaker IDs are numbers; I personally used large numbers, such as 3000, 3001, etc. You need to fine-tune the multilingual-pl-bert with your language if it is not listed. You do not need to add a language ID. Keep the data in the same format as the example in the Data folder.
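For reference, each line in train_list.txt/val_list.txt follows an audio_path|phonemized_text|speaker_id layout; a made-up two-speaker example (filenames and phoneme strings invented):

wavs/spk3000_0001.wav|ˈoːlɐ ˈmũdu|3000
wavs/spk3001_0001.wav|bɔ̃ʒuʁ lə mɔ̃d|3001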
I added data in the same language I trained on to Data/OOD_texts.txt, but honestly, I believe it has little relevance, because for the first 20 epochs I trained with the original Data/OOD_texts.txt and the model was already generating quality audio.
For inference, you need to provide a dropdown list to select the language for your G2P (in this case, the phonemizer), or use a library that detects the language and switches the language flag in the phonemizer, for example en-us, it, fr, etc.
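A minimal sketch of that detect-and-switch idea (assuming the langdetect package; the language map is illustrative, not exhaustive):

# Sketch: choose the phonemizer language from the input text at inference time.
# Assumes the `langdetect` package; extend ESPEAK_LANG for the languages you trained.
from langdetect import detect

ESPEAK_LANG = {"en": "en-us", "it": "it", "fr": "fr-fr", "hi": "hi"}

def phonemizer_lang(text: str) -> str:
    # Fall back to English if detection fails or the language was not trained.
    try:
        return ESPEAK_LANG.get(detect(text), "en-us")
    except Exception:
        return "en-us"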
@traderpedroso How many hours of audio data did you use for training?
6 hours of audio at 24000 Hz, batch size 4, on an A100 80GB, running for 10 hours for 10 epochs. The first time, I trained with a len of 300 for 30 epochs and the quality was bad; after that, I fine-tuned the same model for 10 epochs with a len of 800, and after the second epoch it was generating perfect audio.
That is much less audio data than I expected. For the len that you changed, do you mean the max_len in the config?
Yes, a max_len of 800, but I found a more efficient way to train the fourth model I trained. This time I followed this approach: first, I trained the model on audio clips from 2 seconds up to a maximum of 4 seconds, with a max_len of 300. Of course, the final quality wasn't interesting, but it trained fully for 50 epochs in less than 2 hours. Then I fine-tuned for 5 epochs on audio clips of the same length, 8 seconds. The model turned out perfect, with zero noise in the end and smooth pronunciation; it became very humanized, much better, and I spent fewer resources on training. The 8-second clips can come from various speakers, with a maximum of 80 seconds per speaker. In my case, I trained with 50 speakers, and for the fine-tuning only a one-hour dataset with max_len 800.
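In config terms, the two passes described above amount to switching max_len and the epoch count between runs; a sketch, assuming PyYAML and that your fine-tuning config follows the repo's Configs/config_ft.yml layout with max_len and epochs keys:

# Sketch: flip the config between the two training passes described above.
# Assumes PyYAML; key names follow the repo's Configs/config_ft.yml.
import yaml

with open("Configs/config_ft.yml") as f:
    cfg = yaml.safe_load(f)

# Pass 1: 2-4 s clips, max_len 300, 50 epochs.
cfg["max_len"] = 300
cfg["epochs"] = 50
# Pass 2 (fine-tuning): 8 s clips, max_len 800, 5 epochs.
# cfg["max_len"], cfg["epochs"] = 800, 5

with open("Configs/config_ft.yml", "w") as f:
    yaml.safe_dump(cfg, f)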
@traderpedroso Thanks for your insights.
- I was able to successfully fine-tune the model with 4 hours of data for a single Indic speaker, but at the end of the audio I hear some noisy click sounds. Any pointers for solving this issue?
- I was able to use max_len=400 for initial training and max_len=300 for joint training; if I increase max_len, I get OOM. Did you use max_len=800 for joint training as well?
Hey! Were you able to build the PL-BERT model for Hindi? I seem to be in the same situation as you.
You need to add silence padding to your audio before training; I added 500 ms to the beginning and end of each audio file. Then, during inference, I implemented a workaround with:
import numpy as np
import scipy.io.wavfile

# normalizer, split_sentence, voices, and styletts2importable come from my own setup.

def trim_audio(audio_np_array, sample_rate=24000, trim_ms=350):
    # Cut trim_ms of the (padded) silence from both ends of the generated audio.
    trim_samples = int(trim_ms * sample_rate / 1000)
    if len(audio_np_array) > 2 * trim_samples:
        trimmed_audio_np = audio_np_array[trim_samples:-trim_samples]
    else:
        trimmed_audio_np = audio_np_array
    return trimmed_audio_np

def tts(input: str, voice="Bia", output_sample_rate=24000, alpha=0.7, beta=0.7, diffusion_steps=5, embedding_scale=2, output_wav_file=None):
    text = normalizer(input)
    if text.strip() == "":
        raise ValueError("insert some text")
    if len(text) > 50000:
        raise ValueError("max 50,000 characters")
    # Synthesize sentence by sentence, trimming the padding from each chunk.
    texts = split_sentence(text)
    audios = []
    for t in texts:
        audio = styletts2importable.inference(
            t,
            voices[voice],
            alpha=alpha,
            beta=beta,
            diffusion_steps=diffusion_steps,
            embedding_scale=embedding_scale,
        )
        audios.append(trim_audio(audio))
    output_audio = np.concatenate(audios)
    if output_wav_file:
        scipy.io.wavfile.write(output_wav_file, rate=output_sample_rate, data=output_audio)
    return output_sample_rate, output_audio
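For the training-side padding mentioned above, a minimal sketch (assuming the soundfile package and mono clips; the file path is a placeholder):

# Sketch: add 500 ms of silence to both ends of a training clip before training.
# Assumes `soundfile` and mono audio; the file path is hypothetical.
import numpy as np
import soundfile as sf

def pad_with_silence(path, pad_ms=500):
    audio, sr = sf.read(path)
    pad = np.zeros(int(sr * pad_ms / 1000), dtype=audio.dtype)
    sf.write(path, np.concatenate([pad, audio, pad]), sr)

pad_with_silence("wavs/spk3000_0001.wav")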
@traderpedroso Thanks for the detailed answer and the code; this is very helpful.
@tanishbajaj101 I have trained a Hindi StyleTTS2 model with the existing English BERT model, and it seems to work fine without any issues, so I have not yet explored a Hindi PL-BERT model.
Related Issues (20)
- FP8 Fine Tuning Crashes
- Error Message After Using a fine tuned ASR Model
- Stage 2 Training Fails with NaN Loss on Single GPU Due to Inconsistent Checkpoint Keys
- Getting CUDA Out of memory error in Stage2 training
- In training Stage1 after 49th epoch getting RuntimeError: you can only change requires_grad flags of leaf variables, g_loss.requires_grad = True
- First stage training after 49th epoch (i.e., when epoch >= TMA_epoch)
- Getting error in d_loss.backward() of first_stage training
- Can the model learn accents not supported by espeak-ng?
- Joint training is failing with Assertion error
- In 2nd stage training AttributeError: 'AudioDiffusionConditional' object has no attribute 'module'
- Questions about Differentiable Duration Modeling
- weird chinese pronunciation
- Training PL-BERT on styletts2-community/multilingual-pl-bert
- Can anyone please share checkpoints that we get after we complete both stages of training
- Model Size of fine tuned Model
- Can StyleTTS2 use phonemization from different languages to finetune or train?
- StyleTTS Python API doesn't detect devanagari script
- After training 1 epoch, train_first.py crashes: RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 1, 800]
- Do we need lr scheduler?