
fatchord / wavernn

Stars: 2.1K | Watchers: 86 | Forks: 699 | Size: 242.13 MB

WaveRNN Vocoder + TTS

Home Page: https://fatchord.github.io/model_outputs/

License: MIT License

Python 100.00%
wavernn pytorch neural-vocoder speech-synthesis tts tacotron text-to-speech

wavernn's People

Contributors

dependabot[bot], fatchord, mazzzystar, sih4sing5hong5, thebutlah


wavernn's Issues

What is the training time?

Hi, I'm thinking of building WaveRNN as part of a masters project, and I was wondering what the training time is like. I only have a single GTX 1060 3GB, so I'm concerned the training time would make this unrealistic. Also, I have done a term of machine learning modules, but we didn't really cover conditioning networks on features of the input data (for example speaker id, current phoneme, syllable, word, etc.). Could you please point me to any information/literature on this?

Any help is much appreciated. Thanks!

Constructing new mels as Input

@fatchord, thanks for your work on this. The samples you have are fantastic, and the model converges really quickly.

How do I go about creating the mel as the input? Do I need to train another model that produces mels and pipe that as the input? Or should I be able to take any wav file, construct a mel, and pass that as the input?
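For reference, a minimal sketch of the second option: computing a mel spectrogram directly from a wav file with librosa. The parameters shown (sample rate, FFT size, hop length, number of mels) are assumptions and must match whatever the vocoder was trained with.

    # A minimal sketch, assuming librosa and typical 22.05 kHz settings.
    import librosa
    import numpy as np

    wav, sr = librosa.load("example.wav", sr=22050)  # hypothetical input file
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=2048, hop_length=275, n_mels=80)
    log_mel = np.log(np.clip(mel, 1e-5, None))       # log-compress for the model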

About the input to the GRU

Hi, I have a question about the input in this picture:

[image]

x is a single float, so if we concatenate x with the mel features, the contribution of x will be very small. Why not mu-law encode the input and one-hot it instead?
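To make the suggestion concrete, a hedged sketch in PyTorch (the 9-bit depth is an assumption and may not match the repo's settings):

    import math
    import torch
    import torch.nn.functional as F

    def mulaw_encode(x, bits=9):
        # map floats in [-1, 1] to integer classes in [0, 2**bits - 1]
        mu = 2 ** bits - 1
        y = torch.sign(x) * torch.log1p(mu * torch.abs(x)) / math.log1p(mu)
        return ((y + 1) / 2 * mu + 0.5).long()

    x = torch.rand(4) * 2 - 1                              # raw samples in [-1, 1]
    one_hot = F.one_hot(mulaw_encode(x), num_classes=512)  # (4, 512) network input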

Slow inference time?

Thanks for this great implementation @fatchord! On a P100 I can generate only about 1,600 samples/second, i.e. much slower than real time. Is this expected, or have I done something wrong?

It looks like this implementation is using 10 res blocks, so maybe this is expected? Is there any way to make it 4x real time like the WaveRNN paper does?

Note that I am talking only about the vocoder (i.e. mel spectrogram -> wav), not the tacotron part.

dsp library not included

Thank you very much for sharing this! I'm trying to run your NB4a and NB4b code. NB4a imports the dsp library, which is not in this repo. Would you please include it or point me to it? Thanks!

Faster implementation of WaveRNN and licensing

Hi, I managed to make the implementation of WaveRNN much faster by allowing it to use cuDNN's implementation of GRU:

https://github.com/mkotha/WaveRNN

The code is based on yours, although I have heavily modified it.

I'd like to make the above code publicly available, either dedicated to the public domain or released under an open source license. However, I realize that you didn't release your code under an open source license. Would it be possible to get permission to release the code?
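As a hedged illustration of where the speedup comes from (not the linked repo's actual code): torch.nn.GRU runs a whole sequence in one fused cuDNN call on GPU, while stepping a GRUCell from Python launches a kernel per timestep.

    import torch

    rnn = torch.nn.GRU(input_size=128, hidden_size=512, batch_first=True)
    x = torch.randn(1, 1000, 128)        # (batch, time, features)
    out, h_n = rnn(x)                    # one fused call over all 1000 steps

    cell = torch.nn.GRUCell(128, 512)
    h = torch.zeros(1, 512)
    for t in range(x.size(1)):           # 1000 separate small kernel launches
        h = cell(x[:, t], h)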

Subscaling WaveRNN

Hello,

Thanks for the great work!

Any plans for subscaling WaveRNN implementation? Or is the current WaveRNN implementation already fast enough (compared to say WaveNet generation)?

Possible to share pre-trained weights?

Hi @fatchord, great work on the code and architecture.
I was hoping you could share the pre-trained model, so I can study it on my end before training on my own dataset. It would be of great help, thanks!

At each epoch, loss of first batch is very different from subsequent batches

Hi @fatchord , thank you so much for sharing your great work!

I'm trying to train your alternate model. What I've noticed is that, at each epoch, the loss of the first batch is very different from (usually much smaller than) the loss of subsequent batches. Is this normal? Why is this the case? Does it mean the model after the first batch is significantly better, since it has a smaller loss?

Here is an example:

[image: wavernn_loss_example]

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

# tail -f log.txt -n 5000

Initialising Model...

Trainable Parameters: 4.481M

Loading Model: "checkpoints/9bit_mulaw/latest_weights.pyt"

/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [611,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [202,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [301,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [402,0,0] Assertion `t >= 0 && t < n_classes` failed.
| Epoch: 1/157 (345/3205) | Loss: 4.921 | 0.54 steps/s | Step: 0k | Traceback (most recent call last):
  File "train.py", line 89, in <module>
    train_loop(model, optimiser, train_set, test_set, lr)
  File "train.py", line 43, in train_loop
    loss.backward()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

It seems CUDA 10 is needed, but my current environment cannot support CUDA 10. Do you have any ideas for running this without CUDA 10?
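A side note on the log itself: the `t >= 0 && t < n_classes` assertion typically fires when a target label passed to the loss falls outside [0, n_classes). A hedged sanity check (the file name is hypothetical; the 9-bit assumption comes from the checkpoint name above):

    import numpy as np

    n_classes = 2 ** 9                        # 9-bit mu-law => 512 classes
    y = np.load("quant/example.npy")          # hypothetical quantised target file
    assert 0 <= y.min() and y.max() < n_classes, \
        f"targets out of range: [{y.min()}, {y.max()}]"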

Fine-tuning with GTA helpful?

Have you seen observable improvements after fine-tuning with GTA (ground-truth-aligned) features? In my experiments, the quality of the generated speech doesn't improve compared to the model trained only on ground-truth mel spectrograms.

TTS samples

Hi fatchord, great work! For the TTS samples you posted, did you use the vocoder trained on the ground truth or on the output of a seq2seq text-to-features model?

Pretrained models

Thanks a lot for this implementation. It would be great to provide pre-trained models for the generated samples.

dataset issue

Hi!
Can you tell me the data type of x when you load the data:
m = np.load(f'{self.path}mel/{file}.npy')
x = np.load(f'{self.path}quant/{file}.npy')
In my case, I normalised the wave data by 32768, giving float values like 0.001, 0.02, 0.34, and so on. But in NB4b:
coarse = np.stack(coarse).astype(np.int64)
astype(np.int64) truncates all of these values (0.001, 0.02, 0.34, ...) to zero, so the model can't train because the targets are all zero.
Can you tell me how to fix it? Thanks!
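A hedged sketch of the likely fix (the 9-bit depth is an assumption): quantise the [-1, 1] floats into integer bins before casting, rather than casting the floats directly.

    import numpy as np

    bits = 9
    wav = np.array([0.001, 0.02, 0.34])     # floats in [-1, 1]
    quant = ((wav + 1.0) / 2.0 * (2 ** bits - 1)).astype(np.int64)
    print(quant)                            # non-zero classes: [255 260 342]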

Training model with a dataset

Any hints on how to use NB2 to train on a dataset (say, a directory with multiple audio files) and then use the trained model to generate one of those samples?

Thank you in advance

Hey, is there a mistake, or am I wrong?

In your "Fit a 30min Sample" notebook, I see the GRU update is: hidden_coarse = u_coarse * hidden_coarse + (1. - u_coarse) * e_coarse
Shouldn't it be: hidden_coarse = u_coarse * e_coarse + (1. - u_coarse) * hidden_coarse?

By the way, thank you for your dedication, your work is great!
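A hedged aside: the two forms are equivalent up to relabelling the gate (u versus 1 - u), so neither is wrong in isolation; what matters is consistency with how the gate is trained. For what it's worth, PyTorch's documented GRU update, h' = (1 - z) * n + z * h, takes the same form as the notebook's.

    import torch

    h, e = torch.randn(4), torch.randn(4)    # previous hidden state, candidate
    u = torch.sigmoid(torch.randn(4))        # update gate

    form_a = u * h + (1 - u) * e             # the notebook's convention
    u2 = 1 - u                               # relabel the gate
    form_b = u2 * e + (1 - u2) * h           # the convention in the question
    assert torch.allclose(form_a, form_b)    # identical outputs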

Assertion Error

Hi,

I get a strange error when I launch the preprocess.py file.
The folder with my wav files is found and no other files are contained there.

How can I get through this error?

/WaveRNN$ python preprocess.py

7690 wav files found in "/home/ubuntu/wav/"

Traceback (most recent call last):
  File "preprocess.py", line 52, in <module>
    text_dict = ljspeech(path)
  File "/home/ubuntu/WaveRNN/utils/text/recipes.py", line 9, in ljspeech
    assert len(csv_file) == 1
AssertionError

16 kHz Implementation

Hi,

Great work. Thanks for sharing.

I'm trying to run the repo with a 16 kHz sampling rate, but after a few epochs GPU training crashes every time. I feel like I should adapt the network, but I couldn't find a solution yet.

Can you share your way of doing this?
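As a hedged starting point (names follow common hparams.py conventions and are assumptions, not this repo's exact fields), these settings usually have to move together, and the upsample factors must multiply to the hop length:

    sample_rate = 16000
    hop_length = 200                   # 12.5 ms frames at 16 kHz
    win_length = 800                   # 50 ms window
    n_fft = 1024
    upsample_factors = (5, 5, 8)       # product must equal hop_length
    assert 5 * 5 * 8 == hop_length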

Floating point exception (core dumped)

I tried to train your tacotron model on top of the LJ pretrained checkpoint. I just ran train_tacotron.py, but when I run gen_tacotron.py, I get the following:

Initialising WaveRNN Model...

Trainable Parameters: 4.481M

Loading Weights: "checkpoints/lj.wavernn/latest_weights.pyt"


Initialising Tacotron Model...

Trainable Parameters: 11.078M

Loading Weights: "checkpoints/lj.tacotron/latest_weights.pyt"

+---------+----------+---+-----------------+----------------+-----------------+
| WaveRNN | Tacotron | r | Generation Mode | Target Samples | Overlap Samples |
+---------+----------+---+-----------------+----------------+-----------------+
|  804k   |   197k   | 1 |     Batched     |     11000      |       550       |
+---------+----------+---+-----------------+----------------+-----------------+
 

| Generating 1/6
Floating point exception (core dumped)

Any ideas on how I can go on debugging this?

RuntimeError in NB4

Is anyone else having problems running NB4? The code raises a RuntimeError about tensor dimensions when I run the model.generate(5000) line. The error is as follows:

[screenshot: runtimeerrorwavernn]

Does anyone have a suggestion on how to fix this error?
Thank you in advance,
Gwena

About WaveRNN, not Alternative.

Hi,
The Alternative model is easy to train and gives good results.
I also trained the WaveRNN from your NB1/2/3, but the quality is not good even after 7 days (loss > 5). Comparing it with your Alternative model, I wonder if adding NN upsampling or a ResNet would make it better. Would you share your thoughts? Thanks.

For real-time generation

Hi, thanks for such a good implementation of WaveRNN. I am working to integrate this WaveRNN implementation with Tacotron for a TTS task, and I got good results way faster than WaveNet, but still way slower than real time (10 seconds of audio in nearly 3 minutes or so).
Currently this model gives me around 1,500 samples/sec on my GTX 1080 Ti. But in the WaveRNN paper they claim 96,000 samples/sec by optimising WaveRNN-896 for a P100 GPU, and they even show subscaling on a mobile CPU. Is it possible to optimise this WaveRNN to that level, so that we get at least real-time sampling on a GPU?

Question about function melspectrogram()

Hi - I tried your alternate model and it worked well with no trouble, so I am thankful for your work.
But I noticed the output of your melspectrogram() function often clips to 1.0 on LJSpeech data.
(Of course, it might be my bad implementation.)
The code also seems similar to keithito/tacotron. In Keith's version he later changed one line to
S = _amp_to_db(_linear_to_mel(np.abs(D))) - hparams.ref_level_db
in response to an issue filed by Rafael Valle. I wonder whether this difference was intentional (or maybe it isn't relevant).
Thanks.
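For context, a hedged reconstruction of the keithito-style normalisation under discussion (names and values are assumptions, not this repo's exact code). Without the ref_level_db subtraction, loud frames sit above 0 dB and the final clip pins them at 1.0:

    import numpy as np

    min_level_db = -100
    ref_level_db = 20

    def normalize(S_db):
        # S_db: mel spectrogram in decibels
        return np.clip((S_db - min_level_db) / -min_level_db, 0, 1)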

Why randint?

mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]

Why is np.random.randint used here?

How to train a new model

hi,

I tried to train a new model using my own wav files, but I get an error at the preprocess.py step.
How is that supposed to work?

mu-law during generation

I am companding my target waveform with mu-law before quantisation. However, I am not sure whether I should expand the signal during autoregressive generation, or leave it as is and expand it once the entire signal has been generated.

I see that the restoration of the quantized signal happens here:

" sample = 2 * distrib.sample().float() / (self.n_classes - 1.) - 1.\n",

My question is whether I should expand the signal from mu-law right after this line, e.g.:
sample = torch.sign(sample) * (1 / (2 ** bits - 1)) * ((1 + (2 ** bits - 1)) ** torch.abs(sample) - 1)
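A hedged sketch of that expansion as a reusable function (the 9-bit default is an assumption); it matches the formula above with mu = 2 ** bits - 1:

    import torch

    def mulaw_expand(sample, bits=9):
        # invert mu-law companding for samples in [-1, 1]
        mu = 2 ** bits - 1
        return torch.sign(sample) * ((1 + mu) ** torch.abs(sample) - 1) / mu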

Upsampling network may be simplified

I think the upsampling network can be replaced by a single torch.nn.Upsample operation with scale_factor equal to hop_length and mode='linear'.

I took a network trained on LJSpeech data and looked at the output of the upsampling layer. The upsampled mels from the upsampling network match very closely (up to an arbitrary scaling factor) the values I get by linearly interpolating between the original mel values.

Auxiliary network is still needed, though.
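A minimal sketch of the proposed replacement (hop_length = 275 is an assumption based on common 22.05 kHz configurations):

    import torch

    hop_length = 275
    up = torch.nn.Upsample(scale_factor=hop_length, mode='linear',
                           align_corners=False)
    mels = torch.randn(1, 80, 100)     # (batch, n_mels, frames)
    upsampled = up(mels)               # (1, 80, 100 * hop_length)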

Effect of Resnet CNN (aux) features.

I plotted the CNN output of my trained model. It does not seem useful to me. Do you see something that maybe I am missing? Or have you tried training without the ResNet features?

[image: plotted CNN output]

new model paper/details

Hi, your new model sounds very good. Any chance you will write it up in a paper/blog post? What's the new vocoder, is it more WaveRNN-like or WaveNet-like?
