Comments (18)
@traderpedroso thanks for answering all my questions in such detail. I will try to build a multilingual TTS model and will report back if it is successful.
You have to train the PL-BERT model on a dataset in the specific language you want. A text dataset larger than 30MB is sufficient, though you can use a bigger one. Then use that trained PL-BERT model in StyleTTS2. Since you want to work with multilingual data, you of course need a phonemizer and tokenizer that support each specific language. And you have to train StyleTTS2 (stage 1 and stage 2) with the dataset for that language (train_list.txt, val_list.txt and OOD_texts.txt).
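As a sketch of the language-specific phonemization step (assuming the phonemizer package with an espeak-ng backend; the sample strings and language codes are just illustrations):

# Sketch: per-language phonemization for preparing PL-BERT / StyleTTS2 text data.
# Assumes the `phonemizer` package with espeak-ng installed.
from phonemizer import phonemize

# Each language uses its own espeak voice code.
hindi_ipa = phonemize("नमस्ते दुनिया", language="hi", backend="espeak", strip=True)
french_ipa = phonemize("bonjour le monde", language="fr", backend="espeak", strip=True)
print(hindi_ipa)
print(french_ipa)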
@SandyPanda-MLDL Thanks for the quick reply and for answering the first question; I understand the PL-BERT training with a multilingual dataset now.
How about the other three questions?
2. Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English?
3. For multiple languages, do we need to add data for each language to OOD_data: "Data/OOD_texts.txt"?
4. Do we need to add a language ID during data preparation, similar to adding a speaker ID in train_list.txt/val_list.txt?
According to the README, the ASR model performs well in other languages. I tested it and it does indeed work fine. However, when I trained my own ASR model, StyleTTS improved dramatically. After that, I decided to train all the models on my own data and achieved quality on par with what the model delivers in English.
@traderpedroso Thanks for the reply.
- Did you fine-tune the ASR model (https://github.com/yl4579/AuxiliaryASR) on top of the existing ASR model, or train it from scratch with multiple languages?
- Did you also try to train the PL-BERT model with multiple languages? If yes, can we combine multiple languages, and do we need equal amounts of training data for each language?
I used the PL-BERT recommended in the multilingual repository (https://huggingface.co/papercup-ai/multilingual-pl-bert) and it worked perfectly. For the ASR, I tested fine-tuning and also training from scratch; both approaches gave me the same result. Of course, the ASR that I trained from scratch was for a single language.
From my experience training StyleTTS 2, it's only worthwhile because inference is very fast and uses little VRAM; the training cost makes it somewhat impractical. Besides, you can only train the second stage on a single GPU. Granted, I didn't train the model from scratch, which would be even more expensive, but I can guarantee the quality is sensational. Another advantage of StyleTTS 2 is that it doesn't hallucinate; the generated audio is extremely reliable, especially for real-time streaming applications that don't need monitoring. However, in terms of cost vs. benefit, I personally prefer Tortoise for the final outcome.
@traderpedroso Thanks, I understood the AuxiliaryASR part. I will train it from scratch if the quality is bad.
- My use case is multilingual TTS for Indian languages, but Indian languages are not covered by PL-BERT (https://huggingface.co/papercup-ai/multilingual-pl-bert), so do you think we can still use multilingual-pl-bert for unseen languages?
- For multiple languages, do we need to add data for each language to OOD_data: "Data/OOD_texts.txt"?
- Do we need to add a language ID during data preparation for the multilingual use case, similar to adding a speaker ID in train_list.txt/val_list.txt? Otherwise, how will the model know which language to select at inference time?
Make sure the speaker IDs are numbers; I personally used large numbers, such as 3000, 3001, etc. You need to fine-tune the multilingual-pl-bert with your language if it is not listed. You do not need to add a language ID. Keep the data in the same format as the example in the Data folder.
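For reference, each line in train_list.txt/val_list.txt follows an audio_path|phonemized_text|speaker_id layout; a made-up two-speaker example (filenames and phoneme strings invented):

wavs/spk3000_0001.wav|ˈoːlɐ ˈmũdu|3000
wavs/spk3001_0001.wav|bɔ̃ʒuʁ lə mɔ̃d|3001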
I added data in the same language I trained on to Data/OOD_texts.txt, but honestly, I believe it has little relevance, because for the first 20 epochs I trained with the original Data/OOD_texts.txt and the model was already generating quality audio.
For inference, you need to provide a dropdown list to select the language for your G2P (in this case, the phonemizer), or use a library that detects the language and switches the language flag in the phonemizer, for example en-us, it, fr, etc.
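A minimal sketch of that detect-and-switch idea (assuming the langdetect package; the language map is illustrative, not exhaustive):

# Sketch: choose the phonemizer language from the input text at inference time.
# Assumes the `langdetect` package; extend ESPEAK_LANG for the languages you trained.
from langdetect import detect

ESPEAK_LANG = {"en": "en-us", "it": "it", "fr": "fr-fr", "hi": "hi"}

def phonemizer_lang(text: str) -> str:
    # Fall back to English if detection fails or the language was not trained.
    try:
        return ESPEAK_LANG.get(detect(text), "en-us")
    except Exception:
        return "en-us"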
@traderpedroso How many hours of audio data did you use for training?
6 hours of audio at 24000 Hz, batch size 4, on an A100 80GB, running for 10 hours for 10 epochs. The first time, I trained with a len of 300 for 30 epochs and the quality was bad; after that, I fine-tuned the same model for 10 epochs with a len of 800, and after the second epoch it was generating perfect audio.
That is much less audio data than I expected. For the len that you changed, do you mean the max_len in the config?
Yes, a max_len of 800, but I found a more efficient way to train the fourth model I trained. This time I followed this approach: first, I trained the model on audio clips from 2 seconds up to a maximum of 4 seconds, with a max_len of 300. Of course, the final quality wasn't interesting, but it trained fully for 50 epochs in less than 2 hours. Then I fine-tuned for 5 epochs on audio clips of the same length, 8 seconds. The model turned out perfect, with zero noise in the end and smooth pronunciation; it became very humanized, much better, and I spent fewer resources on training. The 8-second clips can come from various speakers, with a maximum of 80 seconds per speaker. In my case, I trained with 50 speakers, and for the fine-tuning only a one-hour dataset with max_len 800.
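In config terms, the two passes described above amount to switching max_len and the epoch count between runs; a sketch, assuming PyYAML and that your fine-tuning config follows the repo's Configs/config_ft.yml layout with max_len and epochs keys:

# Sketch: flip the config between the two training passes described above.
# Assumes PyYAML; key names follow the repo's Configs/config_ft.yml.
import yaml

with open("Configs/config_ft.yml") as f:
    cfg = yaml.safe_load(f)

# Pass 1: 2-4 s clips, max_len 300, 50 epochs.
cfg["max_len"] = 300
cfg["epochs"] = 50
# Pass 2 (fine-tuning): 8 s clips, max_len 800, 5 epochs.
# cfg["max_len"], cfg["epochs"] = 800, 5

with open("Configs/config_ft.yml", "w") as f:
    yaml.safe_dump(cfg, f)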
@traderpedroso Thanks for your insights.
- I was able to successfully fine-tune the model with 4 hours of data for a single Indic speaker, but at the end of the audio I hear some noisy click sounds. Any pointers for solving this issue?
- I was able to use max_len=400 for initial training and max_len=300 for joint training; if I increase max_len, I get OOM. Did you use max_len=800 for joint training as well?
Hey! Were you able to build the PL-BERT model for Hindi? I seem to be in the same situation as you.
You need to add silence padding to your audio before training; I added 500 ms to the beginning and end of each audio file. Then, during inference, I implemented a workaround with:
import numpy as np
import scipy.io.wavfile

# normalizer, split_sentence, voices, and styletts2importable come from my own setup.

def trim_audio(audio_np_array, sample_rate=24000, trim_ms=350):
    # Cut trim_ms of the (padded) silence from both ends of the generated audio.
    trim_samples = int(trim_ms * sample_rate / 1000)
    if len(audio_np_array) > 2 * trim_samples:
        trimmed_audio_np = audio_np_array[trim_samples:-trim_samples]
    else:
        trimmed_audio_np = audio_np_array
    return trimmed_audio_np

def tts(input: str, voice="Bia", output_sample_rate=24000, alpha=0.7, beta=0.7, diffusion_steps=5, embedding_scale=2, output_wav_file=None):
    text = normalizer(input)
    if text.strip() == "":
        raise ValueError("insert some text")
    if len(text) > 50000:
        raise ValueError("max 50,000 characters")
    # Synthesize sentence by sentence, trimming the padding from each chunk.
    texts = split_sentence(text)
    audios = []
    for t in texts:
        audio = styletts2importable.inference(
            t,
            voices[voice],
            alpha=alpha,
            beta=beta,
            diffusion_steps=diffusion_steps,
            embedding_scale=embedding_scale,
        )
        audios.append(trim_audio(audio))
    output_audio = np.concatenate(audios)
    if output_wav_file:
        scipy.io.wavfile.write(output_wav_file, rate=output_sample_rate, data=output_audio)
    return output_sample_rate, output_audio
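For the training-side padding mentioned above, a minimal sketch (assuming the soundfile package and mono clips; the file path is a placeholder):

# Sketch: add 500 ms of silence to both ends of a training clip before training.
# Assumes `soundfile` and mono audio; the file path is hypothetical.
import numpy as np
import soundfile as sf

def pad_with_silence(path, pad_ms=500):
    audio, sr = sf.read(path)
    pad = np.zeros(int(sr * pad_ms / 1000), dtype=audio.dtype)
    sf.write(path, np.concatenate([pad, audio, pad]), sr)

pad_with_silence("wavs/spk3000_0001.wav")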
@traderpedroso Thanks for the detailed answer and the code; this is very helpful.
@tanishbajaj101 I have trained a Hindi StyleTTS2 model with the existing English BERT model, and it seems to work fine without any issues, so I have not yet explored a Hindi PL-BERT model.
Related Issues (20)
- FP8 Fine Tuning Crashes
- Error Message After Using a fine tuned ASR Model
- Stage 2 Training Fails with NaN Loss on Single GPU Due to Inconsistent Checkpoint Keys
- Getting CUDA Out of memory error in Stage2 training
- In training Stage1 after 49th epoch getting RuntimeError: you can only change requires_grad flags of leaf variables, g_loss.requires_grad = True
- First stage training after 49th epoch (i.e., when epoch >= TMA_epoch)
- Getting error in d_loss.backward() of first_stage training
- Can the model learn accents not supported by espeak-ng?
- Joint training is failing with Assertion error
- In 2nd stage training AttributeError: 'AudioDiffusionConditional' object has no attribute 'module'
- Questions about Differentiable Duration Modeling
- weird chinese pronunciation
- Training PL-BERT on styletts2-community/multilingual-pl-bert
- Can anyone please share checkpoints that we get after we complete both stages of training
- Model Size of fine tuned Model
- Can StyleTTS2 use phonemization from different languages to finetune or train?
- StyleTTS Python API doesn't detect devanagari script
- After training 1 epoch, train_first.py crashes: RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 1, 800]
- Do we need lr scheduler?