Comments (30)
@maozhiqiang You're very welcome. Please keep in mind that there are some details (optimizer, learning rate, conditioning equations, signal-splitting technique, etc.) that were left out of the paper, so I've had to basically guess at what these are and improvise accordingly.
from wavernn.
Hi there, I'm not 100% sure, but the matrix multiplication for the inputs is quite small compared to the others in the equations, so perhaps that's why it's left out of N.
As for conditioning, I've tried a couple of things - I've passed the conditioning features through a dense layer and then biased the gates, and I've also tried concatenating them to the input. Both work pretty much the same, but the results are only okayish - the speech is intelligible but it has this trembling quality, which might point to my upsampling network being incorrect. Have a listen:
wavernn_conditioned.tar.gz
I'm currently training another model with different upsampling network & bigger dataset. If results are good I'll upload to the repo.
@fatchord Is there any chance you could upload the code version you used to condition on mel-spectrograms?
Hi guys,
I've managed to implement a modified version of WaveRNN (I use two parallel RNN layers instead of modifying the cell) and a slightly modified version of Tacotron to output conditioned spectrograms. I tested the approach on Romanian (directly on characters) and got some pretty OK results. You can check them out here (I also added the results obtained with HTS trained on the same corpus, with lots of features extracted from text - syllables, phonetic transcription, POS, chunking etc.):
https://tiberiu44.github.io/TTS-Cube/
The project's repo is:
https://github.com/tiberiu44/TTS-Cube
I'm currently working on adding some more text-features (unsupervised) and on conditioning the model on multiple speakers.
Hope this helps,
Tibi
Hi there, I believe you can - that's what they said they did in the paper. They use linguistic/pitch features as conditioning for WaveRNN.
Right now I'm trying to condition with upsampled mel spectrograms (like the tacotron2 vocoder) but the sound quality isn't great - I'm getting a whispering/gravelly sound, like the speaker has a really bad cold! I still have a couple more ideas to test out so if I have any success I'll upload it to the repo.
@fatchord Thank you for your quick reply! There are some points in the paper that I haven't fully understood, so I'm using your code to understand them!
@fatchord Thank you for your detailed explanation!
hi @fatchord! Thanks for your implementation, it puts some clarity on equation (2) in the paper. There are a few things I'm not sure I understand. First, it is stated that there are N=5 matrix multiplications, but I count 6:
- I x_t
- R h_{t-1}
- O_1 y_c
- O_2 relu(O_1 y_c)
- O_3 y_f
- O_4 relu(O_3 y_f)
Now, talking about conditioning: all conditioning equations are left out, which feels pretty convenient because it basically allows them to claim N=5. If you add the conditioning operations, this number should rise to at least 6 (or 7). I cannot see in your repository how you have tackled conditioning. I guess you are using a similar approach to WaveNet, where you have a 1D 1x1 convolution on the conditioning variable and the result is just added to the hidden state of the RNN?
@naifrec When sampling, matmuls 1 & 2 can be merged, e.g. [R I] * [h_{t-1}; x_t]
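The merge works because stacking the weight matrices column-wise and concatenating the vectors gives the same result as summing the two products. A minimal numpy sketch (sizes are illustrative, not the paper's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_in = 8, 2            # illustrative sizes, not the paper's 896 units

R = rng.standard_normal((3 * hidden, hidden))  # recurrent weights (3 gates)
I = rng.standard_normal((3 * hidden, n_in))    # input weights
h_prev = rng.standard_normal(hidden)
x_t = rng.standard_normal(n_in)

# Two separate matmuls...
separate = R @ h_prev + I @ x_t

# ...equal one merged matmul over the concatenated vector [h_{t-1}; x_t].
merged = np.concatenate([R, I], axis=1) @ np.concatenate([h_prev, x_t])

assert np.allclose(separate, merged)
```

One big matmul is usually faster than two small ones at sampling time, which is why the count matters for the paper's N.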
@fatchord Very nice result - do they use ground-truth c_t?
The conditioning formula is missing from the paper; maybe we should email the authors.
I concatenate linguistic (green line) / acoustic (red line) features with x_t, training on cmu_us_slt_arctic:
@lifeiteng Thanks, it's not a terrible result - at least it's working somewhat. However, on the same dataset I get a smoother quality with WaveNet (8-bit):
wavenet_gen_glr_dataset.tar.gz
I'm not sure what you mean by ground-truth c_t - do you mean conditioning features at time t?
By the way, I'd love to hear your wavernn samples so feel free to post them here!
@fatchord I mean: are those samples synthesized from acoustic features only, not including ground-truth coarse & fine as inputs (like in training)?
Doing experiments now.
BTW, you can try 9-bit or 10-bit u-law in WaveNet.
@fatchord Hi! For mel as a local condition, do you upsample it? And how do you feed it as input to WaveRNN?
@lifeiteng Re: generation - thanks for clarifying - those are synthesized from mel spectrograms only; the model didn't see any ground-truth audio samples when generating.
Re: 9/10 bit wavenet - actually I've been wanting to try this out but my gpu has been busy with other experiments. Did you try it? Any success with it?
@maozhiqiang I upsampled with a 2D conv transpose layer to keep the channels separate. Have a look at how kan-bayashi does it in his wavenet repo - it's basically the same. Regarding input/conditioning, you have a lot of options: you can concatenate it to the coarse and fine samples, you can transform it with a dense layer then split it and bias the gates, and another option, I guess, is to transform it and add it to the hidden state after the gates have been computed. There might be other ways, but these are the most obvious to me right now.
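One of those options - a dense layer on the conditioning frame whose output is split across the gates as a bias - can be sketched in plain numpy. All sizes and the GRU layout here are my assumptions for illustration, not the repo's actual code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
hidden, n_in, n_mels = 16, 2, 80   # illustrative sizes only

Wx = rng.standard_normal((3 * hidden, n_in)) * 0.1    # input weights
Wh = rng.standard_normal((3 * hidden, hidden)) * 0.1  # recurrent weights
Wc = rng.standard_normal((3 * hidden, n_mels)) * 0.1  # hypothetical dense
                                                      # layer on the mel frame

def gru_step(x, h_prev, mel):
    """One GRU step whose gate pre-activations are biased by conditioning."""
    b_u, b_r, b_e = np.split(Wc @ mel, 3)   # split the projection across gates
    ax = np.split(Wx @ x, 3)
    ah = np.split(Wh @ h_prev, 3)
    u = sigmoid(ax[0] + ah[0] + b_u)        # update gate
    r = sigmoid(ax[1] + ah[1] + b_r)        # reset gate
    e = np.tanh(ax[2] + r * ah[2] + b_e)    # candidate state
    return u * h_prev + (1.0 - u) * e

h = np.zeros(hidden)
h = gru_step(rng.standard_normal(n_in), h, rng.standard_normal(n_mels))
```

Concatenating the mel frame to the input instead just means widening `Wx`; biasing the gates keeps the input matmul small, which matters for the paper's operation count.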
@fatchord wavenet - I tried 9-bit (i.e. a 512-way softmax) in WaveNet and got a much cleaner result than the 8-bit version. I saw a paper recently which claims 10-bit is better than 8-bit.
Now I like WaveRNN more, because training WaveNet needs several days of GPU time to get a reasonable result.
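For reference, here is a generic u-law encode/decode at a configurable bit depth (the standard companding formulation, not code from either repo); the reconstruction error drops as the bit depth grows:

```python
import numpy as np

def mulaw_encode(x, bits):
    """mu-law compand a signal in [-1, 1], then quantize to 2**bits levels."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # still in [-1, 1]
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)  # in [0, mu]

def mulaw_decode(q, bits):
    """Invert the quantization and the companding."""
    mu = 2 ** bits - 1
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 1001)
err_8 = np.max(np.abs(mulaw_decode(mulaw_encode(x, 8), 8) - x))
err_10 = np.max(np.abs(mulaw_decode(mulaw_encode(x, 10), 10) - x))
assert err_10 < err_8   # more bits, less quantization noise
```

The trade-off is the softmax size: 10-bit means a 1024-way output layer, which slows both training and sampling.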
@fatchord Thanks! I will implement it like you said!
@lifeiteng Sounds very interesting, have you got a link to the paper?
@fatchord http://festvox.org/blizzard/bc2017/USTC-NELSLIP_Blizzard2017.pdf
"As the original 8bit quantization introduced quantization noise in synthetic speeches, we proposed to use a 10bit quantization scheme instead, in order to alleviate this problem. WaveNet with 3 blocks, which was 30 layers in total, was used in our system."
@lifeiteng Thanks! Nice read, I really love the idea of using a GAN to fix the smoothness from the mse loss.
Hi @fatchord - you've got some nice quality results there both from wavernn and wavenet!
Did you manage to overcome that trembling quality with your wavernn 198k sample? It's something we've been puzzling over too...
Our intuition points to the following areas:
- The continuity of the RNN is broken by re-initializing the hidden state after 960 samples (which would point to a periodic tremble)
- The upsampling network may produce artifacts
- Experimenting with different conditioning sites (i.e. dense layer / concat at gates)
- If all else fails, more data?
To preserve continuity, we are looking into TBPTT and longer input sequence lengths (like 2000 instead of 960) as well as increasing the length of our batches from the dataloader so we can carry forward the hidden states longer on each step.
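A minimal sketch of that carry-forward idea (hypothetical sizes and a plain GRU; the loss computation is elided):

```python
import torch
import torch.nn as nn

# Hypothetical TBPTT loop: the hidden state is carried across consecutive
# chunks of one utterance, but detached so gradients stop at chunk boundaries.
rnn = nn.GRU(input_size=2, hidden_size=32, batch_first=True)
utterance = torch.randn(1, 2000, 2)    # one long sample sequence
chunk_len = 500

h = None
for start in range(0, utterance.size(1), chunk_len):
    chunk = utterance[:, start:start + chunk_len]
    out, h = rnn(chunk, h)
    # ... compute the loss on `out` and call backward() here ...
    h = h.detach()                     # keep the state, cut the graph
```

Without the `detach()`, the graph grows across the whole utterance and memory blows up; with it, the RNN still sees a continuous state at generation-like horizons.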
Our upsampling is the same conv2d approach as in kan-bayashi's wavenet. (We tried different stride/conv2d settings, like stacking smaller upsampling convs etc., but found that a single 2D conv that upsamples the mel spectrograms by 'hop_length' times works best.)
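For concreteness, that single-conv upsampling can be sketched with a 2D transposed conv whose kernel spans only the time axis. The hop length and kernel/padding choices below are illustrative assumptions (kernel = 2×stride with padding = stride/2 gives exactly hop_length× upsampling), not the repo's actual hparams:

```python
import torch
import torch.nn as nn

hop_length, n_mels = 256, 80   # illustrative values, not the repo's hparams

# One ConvTranspose2d that stretches only the time axis; kernel height 1
# leaves the mel bins untouched, so the channels stay separate.
upsample = nn.ConvTranspose2d(
    in_channels=1, out_channels=1,
    kernel_size=(1, 2 * hop_length),
    stride=(1, hop_length),
    padding=(0, hop_length // 2),
)

mels = torch.randn(1, 1, n_mels, 12)   # (batch, channel, mel bins, frames)
out = upsample(mels)                   # time axis is now 12 * hop_length
```

A learned transposed conv can interpolate more smoothly than naive frame repetition, which is one place upsampling artifacts (and the trembling) could come from.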
We're exploring different conditioning sites now.
The last approach is to train it on a huge dataset like LJSpeech and just wait it out, but our initial overfitting attempts on WaveRNN didn't seem to get rid of the jittery/trembling quality...
Did you find anything that worked? i.e wavernn comparable to wavenet quality?
cheers,
shreyas
@lifeiteng Did you message the authors to make sense of the conditioning formula?
@MlWoo Yes, but I didn't get a response.
@lifeiteng Thanks a lot. The curves you show look good. Did you get them by upsampling the conditional features first to adapt the time resolution and then projecting them to bias the gates?
That's a data-feeding bug. Actually, I haven't had luck with WaveRNN.
@lifeiteng what a pity!
@tiberiu44 Thanks a lot for your contribution. I have tried to train the vocoder with your code, but it is very slow on a 1080 Ti. I want to confirm this because I am not familiar with the DyNet framework.
Hi @MlWoo
It is slow, but it will be even slower if you don't use the GPU version of DyNet. See this issue: tiberiu44/TTS-Cube#2
You need to compile DyNET with CUDA support. Also run the training process for the vocoder with --use-gpu --autobatch and --batch-size=4000.
Similar settings go for the encoder (it ignores the batch-size at this point).
@tiberiu44 I did compile DyNet with CUDA and ran the model with the CUDA backend. I haven't checked your model yet; maybe the model is just large. Thanks for your reply.