
ntt123 / viettts


Vietnamese Text to Speech library

License: MIT License

Shell 1.06% Python 88.59% Jupyter Notebook 10.34%
tts-engines deep-learning tacotron vocoder hifi-gan vietnam vietnamese text-to-speech

viettts's People

Contributors

ntt123

viettts's Issues

improve tts

Hi, me again.
I'm training your TTS on my own dataset of about 16 hours.
First, because my utterances are similar to yours, I'm training the acoustic model in two ways:

  • Continuing from your acoustic checkpoint up to 1.46M steps: val loss 0.227, and it appears to be converging.

    • Here is my result:
      [image]
  • Training from scratch: about 800k steps, val loss 0.301
    [image]

Full details are here: https://drive.google.com/drive/folders/1j0OT7KgJOk5hmcOVNPdcdkaekRRxHekk?usp=sharing
Second, I trained the HiFi-GAN vocoder (using the 1.46M-step acoustic model) for about 290k steps.
My transcript text: "xin chào tôi là phương anh bản thử số chín"

I'm stuck. Should I focus on the acoustic model, the vocoder, or the dataset to improve the result?
Thanks!

[Gen Loss and Mel-Spec Error]

Hi,
Could you share the Gen Loss and Mel-Spec Error you had when you got the final model?
I have been training on another dataset for around 3 days. The loss and error are decreasing, but the voice quality at test time is quite creepy :)
Thank you so much. I hope you see my question.

How to handle english words in vietnamese text

Hi,
Based on your repo and your answers, I have successfully built a Vietnamese text-to-speech app with my own dataset. It sounds good in the majority of cases, but I am still stuck on how to handle English words (e.g., vaccine, morning) that appear in the text. I have created a list that maps English words to Vietnamese pronunciations (e.g., vaccine -> vắc xin), and I update it whenever new English words appear. However, this seems inefficient.
Do you have any advice for this case? Thank you so much.
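
For readers who want to see the lookup-table approach described above in code, here is a minimal sketch. It is not taken from the repository: the function name `respell_english` and the table `EN_TO_VI` are hypothetical, and only the "vaccine" entry comes from the post (the "morning" respelling is an illustrative guess).

```python
import re
import unicodedata

# Hypothetical pronunciation table: English word -> Vietnamese respelling.
# Only "vaccine" comes from the post above; "morning" is an illustrative guess.
EN_TO_VI = {
    "vaccine": "vắc xin",
    "morning": "mo ninh",
}

def respell_english(text, table=EN_TO_VI):
    """Replace whole words found in `table` with their Vietnamese respelling."""
    # NFC keeps each accented Vietnamese letter as a single code point,
    # so the word pattern below does not split words at combining marks.
    text = unicodedata.normalize("NFC", text)

    def swap(match):
        word = match.group(0)
        return table.get(word.lower(), word)  # unknown words pass through

    # \w+ matches Unicode letters in Python 3, so Vietnamese words stay intact.
    return re.sub(r"\w+", swap, text)
```

Running this before the text reaches the TTS front end keeps the table in one place. Words missing from the table are left unchanged, so they can be logged and added later.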

Error when fine tune new dataset

Hi, I preprocessed my data following your pipeline, but when I run the fine-tuning code I get this error:

checkpoints directory : small_cp_hifigan
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
2021-06-18 07:29:10.395782: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Epoch: 1
/usr/local/lib/python3.7/dist-packages/torch/functional.py:581: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at /pytorch/aten/src/ATen/native/SpectralOps.cpp:639.)
normalized, onesided, return_complex)
Steps : 0, Gen Loss Total : 88.451, Mel-Spec. Error : 1.812, s/b : 2.910
train.py:199: UserWarning: Using a target size (torch.Size([1, 80, 305])) that is different to the input size (torch.Size([1, 80, 304])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
Traceback (most recent call last):
File "train.py", line 271, in <module>
main()
File "train.py", line 267, in main
train(0, a, h)
File "train.py", line 199, in train
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2897, in l1_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/usr/local/lib/python3.7/dist-packages/torch/functional.py", line 74, in broadcast_tensors
return _VF.broadcast_tensors(tensors) # type: ignore
RuntimeError: The size of tensor a (304) must match the size of tensor b (305) at non-singleton dimension 2

I don't know why. Maybe the error occurs when you freeze a tensor? What is dimension 2? Help!
Thanks!
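
On the "dimension 2" question: the shapes in the log, [1, 80, 305] and [1, 80, 304], read as (batch, mel bins, frames), so dimension 2 is the frame (time) axis and the two mels differ by one frame. The sketch below shows one common workaround, cropping both to the shorter length before taking the L1 error. It is an assumption, not the repository's fix: the helper name `trim_to_common_length` is hypothetical, NumPy arrays stand in for the PyTorch tensors, and the same slicing works on tensors.

```python
import numpy as np

def trim_to_common_length(a, b):
    """Crop two (batch, n_mels, frames) arrays to the shorter frame count."""
    frames = min(a.shape[2], b.shape[2])
    return a[:, :, :frames], b[:, :, :frames]

# Shapes taken from the warning in the log; the values are placeholders.
y_mel = np.zeros((1, 80, 304))        # precomputed mel
y_g_hat_mel = np.zeros((1, 80, 305))  # mel of the generated audio

y_mel, y_g_hat_mel = trim_to_common_length(y_mel, y_g_hat_mel)
l1_error = np.abs(y_mel - y_g_hat_mel).mean()
```

Cropping hides the symptom only. A one-frame difference often means the precomputed mels and the wav files were produced with slightly different framing, which is worth checking in the preprocessing step.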

Graphemes to phonemes

Your synthesized audio clips sound good!
But each phoneme is just a single character such as "ô ư i u ...". What about more complex units (which might work better, I think), such as: hươu -> h ươu, thái -> th ái?
I am trying FastSpeech2, but the results do not seem good. Have you tried it?
It would be great if we could keep in touch.
Thanks!

Audio playback speed is too fast

The reading speed is too fast when I use 48 kHz audio samples. Is there a way to slow the audio down? I look forward to everyone's feedback. Thank you so much.
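
One possible cause, offered as an assumption because the post gives few details, is a sample-rate mismatch. Playback duration equals the number of samples divided by the sample rate in the file header, so audio generated at one rate and saved or played at a higher rate sounds sped up. The sketch below shows the effect with Python's standard `wave` module. The helper names and the 16 kHz and 48 kHz figures are illustrative.

```python
import io
import wave

def write_wav(buffer, pcm_bytes, sample_rate):
    """Write 16-bit mono PCM with the given sample rate in the header."""
    with wave.open(buffer, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)            # 2 bytes per sample = 16-bit audio
        f.setframerate(sample_rate)
        f.writeframes(pcm_bytes)

def duration_seconds(buffer):
    """Read the header back and compute the playback duration."""
    buffer.seek(0)
    with wave.open(buffer, "rb") as f:
        return f.getnframes() / f.getframerate()

pcm = b"\x00\x00" * 16000            # 16000 silent samples: one second at 16 kHz
right, wrong = io.BytesIO(), io.BytesIO()
write_wav(right, pcm, 16000)         # header matches the data
write_wav(wrong, pcm, 48000)         # same samples, header claims 48 kHz

print(duration_seconds(right), duration_seconds(wrong))
```

If this is the cause, the usual remedy is to resample the training audio to the rate the model config expects, or to write the output with the rate the model actually produces.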

could not synchronize on CUDA context

Today I ran your acoustic model on Colab and got this issue:


training: 0% 0/1900001 [00:00<?, ?it/s]2021-12-07 03:51:13.659473: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2085] Execution of replica 0 failed: INTERNAL: CUBLAS_STATUS_EXECUTION_FAILED
training: 0% 0/1900001 [00:16<?, ?it/s]
Traceback (most recent call last):
File "/content/drive/MyDrive/vietTTS/vietTTS/nat/acoustic_trainer.py", line 139, in <module>
train()
File "/content/drive/MyDrive/vietTTS/vietTTS/nat/acoustic_trainer.py", line 101, in train
loss, (params, aux, rng, optim_state) = update(params, aux, rng, optim_state, batch)
File "/usr/local/lib/python3.7/dist-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/jax/_src/api.py", line 419, in cache_miss
donated_invars=donated_invars, inline=inline)
File "/usr/local/lib/python3.7/dist-packages/jax/core.py", line 1632, in bind
return call_bind(self, fun, *args, **params)
File "/usr/local/lib/python3.7/dist-packages/jax/core.py", line 1623, in call_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "/usr/local/lib/python3.7/dist-packages/jax/core.py", line 1635, in process
return trace.process_call(self, fun, tracers, params)
File "/usr/local/lib/python3.7/dist-packages/jax/core.py", line 627, in process_call
return primitive.impl(f, *tracers, **params)
File "/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py", line 690, in _xla_call_impl
out = compiled_fun(*args)
File "/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py", line 1100, in _execute_compiled
out_bufs = compiled.execute(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: INTERNAL: CUBLAS_STATUS_EXECUTION_FAILED

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/drive/MyDrive/vietTTS/vietTTS/nat/acoustic_trainer.py", line 139, in <module>
train()
File "/content/drive/MyDrive/vietTTS/vietTTS/nat/acoustic_trainer.py", line 101, in train
loss, (params, aux, rng, optim_state) = update(params, aux, rng, optim_state, batch)
File "/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py", line 1100, in _execute_compiled
out_bufs = compiled.execute(input_bufs)
RuntimeError: INTERNAL: CUBLAS_STATUS_EXECUTION_FAILED
2021-12-07 03:51:14.389335: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1047] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***

_PyModule_ClearDict
PyImport_Cleanup
Py_FinalizeEx

_Py_UnixMain
__libc_start_main
_start

*** End stack trace ***

2021-12-07 03:51:14.389456: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:124] Check failed: pair.first->SynchronizeAllActivity()

I guess this issue comes from mismatched versions of the requirements.
Could you please specify the exact versions of your dependencies or update the requirements?

Problems with Special Phonemes

Hi,
I have trained on my own dataset, and the results look good. However, I have a problem with special phonemes. With silence_duration = 0, the TTS does not pause at special phonemes and sometimes produces noise. With silence_duration = 0.1 or higher, the TTS always produces noise.
I guess something is wrong with my *.textgrid files because they contain no sil or sp phones (https://drive.google.com/drive/folders/1RAsq-qPMjHMn-seJy3iapWAmLmjwz9nb?usp=sharing), and I have no idea how to fix it.
This is my demo: http://6640-113-161-90-253.ngrok.io/
Can you give me some advice?
Thank you.

Error when fine tuning using new dataset.

Hi, I'm new to ML. I trained the HiFi-GAN vocoder for about 100K steps (your pre-trained model is at 800K) and the sound is acceptable. Now I want to use a different dataset to train the model. Should I train a new HiFi-GAN model or continue training the pre-trained model? I'm not sure. When I chose the fine-tuning option with the VIVOS dataset:
%cd '/content/drive/MyDrive/vietTTS/hifi-gan'
!python3 train.py --fine_tuning True --config ../assets/hifigan/config.json --input_wavs_dir=data --input_training_file=train_files.txt --input_validation_file=val_files.txt

And I got this error:
checkpoints directory : cp_hifigan
Loading 'cp_hifigan/g_00105000'
Complete.
Loading 'cp_hifigan/do_00110000'
Complete.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
2021-06-15 03:07:39.878886: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Epoch: 119
Traceback (most recent call last):
File "train.py", line 271, in <module>
main()
File "train.py", line 267, in main
train(0, a, h)
File "train.py", line 113, in train
for i, batch in enumerate(train_loader):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
data.reraise()
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/content/drive/My Drive/vietTTS/hifi-gan/meldataset.py", line 144, in __getitem__
os.path.join(self.base_mels_path, os.path.splitext(os.path.split(filename)[-1])[0] + '.npy'))
File "/usr/local/lib/python3.7/dist-packages/numpy/lib/npyio.py", line 416, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'ft_dataset/VIVOSSPK46_184.npy'

I think my own dataset did not include these new files when I trained the vocoder before. How can I fix it?
My second question is: what happens if I continue training my pre-trained vocoder model on a different dataset?
Thanks!
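
The traceback shows the data loader looking for `ft_dataset/VIVOSSPK46_184.npy`, which suggests that fine-tuning mode expects a precomputed mel file for every utterance in the training list. The sketch below lists which utterances lack one. It assumes each line of the file list starts with the utterance name followed by `|`, and that the mels sit in a directory named after the one in the error message. Both are assumptions, and the function name `missing_mel_files` is hypothetical.

```python
import os

def missing_mel_files(filelist_path, mels_dir):
    """Return utterance names in the file list with no matching .npy in mels_dir."""
    missing = []
    with open(filelist_path, encoding="utf-8") as f:
        for line in f:
            name = line.strip().split("|")[0]   # assumed "name|transcript" format
            if not name:
                continue                         # skip blank lines
            stem = os.path.splitext(os.path.basename(name))[0]
            if not os.path.isfile(os.path.join(mels_dir, stem + ".npy")):
                missing.append(stem)
    return missing
```

Calling `missing_mel_files("train_files.txt", "ft_dataset")` before training shows whether the mels for the new dataset still need to be generated with the acoustic model.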

Need training to read numbers

I tried to read this text: "Xin mời bệnh nhân thứ 1"
The function works, but it keeps the number at the end of the sentence as is.
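
A common stopgap is to turn digits into words before the text reaches the model. The sketch below spells each digit out one by one. It is a deliberate simplification and not how the repository normalizes text: the name `spell_digits` is hypothetical, and proper reading of multi-digit numbers and ordinals would need real number-to-words rules.

```python
import re

# Vietnamese names for the digits 0-9.
DIGIT_WORDS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]

def spell_digits(text):
    """Replace each run of ASCII digits with its digits read one by one."""
    def swap(match):
        return " ".join(DIGIT_WORDS[int(d)] for d in match.group(0))
    return re.sub(r"[0-9]+", swap, text)

print(spell_digits("Xin mời bệnh nhân thứ 1"))
```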

[Textgrid for dataset]

I am creating TextGrid files for my dataset. Can you guide me on how to create them, or point me to some information? Thank you so much!

Hifi GAN with 24kHz

Hi @NTT123
Thanks for your work!
What is the HiFi-GAN config for 24 kHz? (The default is 16 kHz.)
Or at least, could you share your tips for calculating the parameters?
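
As a starting point for calculating the parameters, the sketch below checks the constraints that follow from how the upstream HiFi-GAN generator is built. The generator turns each mel frame into `hop_size` samples, so the product of `upsample_rates` must equal `hop_size`. Each transposed convolution is padded by `(kernel - rate) // 2`, so `kernel - rate` should be even to avoid an off-by-one output length. The 24 kHz values are hypothetical and are not taken from this repository. The key names follow the upstream HiFi-GAN config files.

```python
from math import prod

def check_hifigan_config(cfg):
    """Return a list of problems found in a HiFi-GAN-style config dict."""
    problems = []
    rates, kernels = cfg["upsample_rates"], cfg["upsample_kernel_sizes"]
    if prod(rates) != cfg["hop_size"]:
        problems.append("product of upsample_rates must equal hop_size")
    if len(rates) != len(kernels):
        problems.append("one kernel size is needed per upsample rate")
    for rate, kernel in zip(rates, kernels):
        if (kernel - rate) % 2 != 0:
            problems.append(f"kernel {kernel} minus rate {rate} should be even")
    if cfg["win_size"] > cfg["n_fft"]:
        problems.append("win_size must not exceed n_fft")
    return problems

# Hypothetical 24 kHz values, not taken from the repository.
cfg_24k = {
    "sampling_rate": 24000,
    "hop_size": 300,                          # 12.5 ms per frame at 24 kHz
    "win_size": 1200,                         # 50 ms analysis window
    "n_fft": 2048,
    "upsample_rates": [5, 5, 4, 3],           # 5 * 5 * 4 * 3 = 300
    "upsample_kernel_sizes": [11, 11, 8, 7],  # kernel minus rate is even
}

print(check_hifigan_config(cfg_24k))
```

The acoustic model must be trained on mels computed with the same `sampling_rate`, `hop_size`, `win_size`, and `n_fft`, otherwise the vocoder receives features it was not trained on.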

Prediction time is too slow

I see an issue: text2mel uses only 1 CPU, while mel2wave uses all CPUs. Do you have any solution to optimize the processing time? Thank you so much!

How to create lexicon.txt file

It's a great repo.
I have tried to train my own model, but I am still stuck at preparing the dataset. Can you show me how to create a lexicon.txt file for my dataset, so that I can use MFA to create the TextGrid files?
Thank you very much.
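
Other posts on this page indicate that the phonemes here are single characters, so one plausible way to build lexicon.txt is to spell every word in the transcripts letter by letter. The sketch below does that, writing each line as the word followed by its space-separated characters, which is the general shape of an MFA pronunciation dictionary. The function name `build_lexicon` is hypothetical, and the transcripts are assumed to be already cleaned of punctuation.

```python
import unicodedata

def build_lexicon(transcripts):
    """Map every distinct word to its characters, one dictionary line per word."""
    words = set()
    for line in transcripts:
        # NFC keeps each accented Vietnamese letter as one character.
        line = unicodedata.normalize("NFC", line.lower())
        words.update(line.split())
    return [f"{word} {' '.join(word)}" for word in sorted(words)]

lines = build_lexicon(["xin chào", "xin mời"])
print("\n".join(lines))
```

Writing the returned lines to lexicon.txt with UTF-8 encoding gives MFA one entry per word found in the dataset.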

High RAM usage acoustic model

After MFA training, I have 39778 TextGrids and 39778 wav files. When I run the acoustic trainer, it uses 14 GB of RAM and keeps increasing. How can I fix this? Thank you.
[Screenshot from 2021-10-27 14-40-13]

How to convert hifigan V3 to haiku correctly?

Hi,
I'm trying to use resblock: 2 with the HiFi-GAN v3 config. Could you please help me fix the conversion script so that inference works correctly?

The model converted with convert_torch_model_to_haiku.py does not work with v3.

Here is the error I'm dealing with:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/vietTTS/vietTTS/synthesizer.py", line 38, in <module>
    wave = mel2wave(mel)
  File "/content/vietTTS/vietTTS/hifigan/mel2wave.py", line 40, in mel2wave
    wav, aux = forward.apply(params, aux, rng, mel)
  File "/usr/local/lib/python3.7/dist-packages/haiku/_src/transform.py", line 400, in apply_fn
    out = f(*args, **kwargs)
  File "/content/vietTTS/vietTTS/hifigan/mel2wave.py", line 32, in forward
    return net(x)
  File "/usr/local/lib/python3.7/dist-packages/haiku/_src/module.py", line 433, in wrapped
    out = f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/haiku/_src/module.py", line 284, in run_interceptors
    return bound_method(*args, **kwargs)
  File "/content/vietTTS/vietTTS/hifigan/model.py", line 119, in __call__
    xs = self.resblocks[i * self.num_kernels + j](x)
  File "/usr/local/lib/python3.7/dist-packages/haiku/_src/module.py", line 433, in wrapped
    out = f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/haiku/_src/module.py", line 284, in run_interceptors
    return bound_method(*args, **kwargs)
  File "/content/vietTTS/vietTTS/hifigan/model.py", line 72, in __call__
    xt = c(xt)
  File "/usr/local/lib/python3.7/dist-packages/haiku/_src/module.py", line 433, in wrapped
    out = f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/haiku/_src/module.py", line 284, in run_interceptors
    return bound_method(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/haiku/_src/conv.py", line 200, in __call__
    w = hk.get_parameter("w", w_shape, inputs.dtype, init=w_init)
  File "/usr/local/lib/python3.7/dist-packages/haiku/_src/base.py", line 333, in get_parameter
    name, bundle_name))
ValueError: Unable to retrieve parameter 'w' for module 'generator/~/res_block1_0/~/conv1_d'. All parameters must be created as part of `init`.

Training with other dataset

Hi all, I see you recommend using MFA to prepare the dataset (TextGrid files). I tried it, but it has a lot of issues with Vietnamese. I generated a lexicon.txt file with the g2p model, but when I use the acoustic model to generate the TextGrid files, I get this error:
"There were phones in the dictionary that do not have acoustic models: au_T4, au_T6, eu_T5, ieu_T5, ui2_T2, uoi2_T2, uoi3_T6, uou_T1, uou_T3".
And I tried using your lexicon.txt file but the error is : "There were phones in the dictionary that do not have acoustic models: a, c, d, e, i, o, q, u, y, à, á, â, ã, è, é, ê, ì, í, ò, ó, ô, õ, ù, ú, ý, ă, đ, ĩ, ũ, ơ, ư, ạ, ả, ấ, ầ, ẩ, ẫ, ậ, ắ, ằ, ẳ, ẵ, ặ, ẹ, ẻ, ẽ, ế, ề, ể, ễ, ệ, ỉ, ị, ọ, ỏ, ố, ồ, ổ, ỗ, ộ, ớ, ờ, ở, ỡ, ợ, ụ, ủ, ứ, ừ, ử, ữ, ự, ỳ, ỵ, ỷ, ỹ"

Could you share the g2p and acoustic models you used? Thank you so much.

How to add marker of sil, sp to TextGrid after MFA?

Hi @NTT123,
First of all, thank you for your brilliant work! I have successfully trained my dataset with MFA, but the generated .TextGrid files contain no markers for silence or short pauses. Could you please help me detect and add these symbols to the TextGrid files?
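
One way to approach this is to treat every stretch of time not covered by a labelled phone as a pause and label it by length. The sketch below works on a plain list of (start, end, label) tuples and does not parse TextGrid files itself. The function name `fill_silences`, the 0.2-second threshold between "sp" and "sil", and the choice to treat empty labels as gaps are all assumptions.

```python
def fill_silences(intervals, total_duration, sp_max=0.2, eps=1e-6):
    """Label uncovered time as 'sp' (short pause) or 'sil' (longer silence).

    intervals: (start, end, label) tuples sorted by start time.
    Intervals with an empty label are treated as uncovered time.
    """
    filled, cursor = [], 0.0
    for start, end, label in intervals:
        if not label.strip():
            continue                      # empty label: count it as a gap
        gap = start - cursor
        if gap > eps:
            filled.append((cursor, start, "sp" if gap <= sp_max else "sil"))
        filled.append((start, end, label))
        cursor = end
    if total_duration - cursor > eps:     # trailing silence up to the file end
        filled.append((cursor, total_duration, "sil"))
    return filled

phones = [(0.5, 1.0, "x"), (1.0, 1.1, ""), (1.1, 1.5, "i"), (2.0, 2.4, "n")]
print([label for _, _, label in fill_silences(phones, total_duration=3.0)])
```

The threshold is worth tuning by listening to a few utterances, since what counts as a short pause depends on the speaker and the recording.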

mfa version

Hi,
I want to train FastSpeech2 on a Persian dataset with durations extracted from MFA. Which version of MFA do you use for training? Do you apply any preprocessing to your datasets?

EOFError: Ran out of input

I got an error when continuing training from my checkpoint:
Resuming from latest checkpoint at /content/drive/MyDrive/vietTTS_Model/acoustic_latest_ckpt.pickle
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/vietTTS/vietTTS/nat/acoustic_trainer.py", line 183, in <module>
    train()
  File "/content/vietTTS/vietTTS/nat/acoustic_trainer.py", line 114, in train
    dic = pickle.load(f)
EOFError: Ran out of input
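
`pickle.load` raises `EOFError: Ran out of input` when the file is empty or cut short, so the checkpoint was probably left incomplete, for example by a session that stopped while it was being written (an assumption, since the post does not say). The sketch below is not the trainer's code. It shows one way to write checkpoints so that an interrupted save does not destroy the previous one, plus a loader that reports a damaged file instead of crashing. Both function names are hypothetical.

```python
import os
import pickle
import tempfile

def save_checkpoint_atomic(obj, path):
    """Write to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())          # push the bytes to disk before renaming
        os.replace(tmp_path, path)        # the old file stays intact until this point
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path):
    """Return the unpickled checkpoint, or None if the file is empty or truncated."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (EOFError, pickle.UnpicklingError):
        return None
```

How atomic the rename is on a mounted Google Drive folder is not something this sketch can guarantee, so keeping one older checkpoint as a fallback is a sensible extra precaution.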

acoustic training generate strange melspectrogram

Hi @NTT123,
After I pulled the latest PR #22, reran MFA with the new format, and added --disable_textgrid_cleanup,
the mel I generate looks like this, even after training for 61k steps:
[image: mel_061000]
Previously, my dataset could already produce speech at this step, but now it always generates a buzzing sound.
Should I keep training?
