Comments (30)
@maozhiqiang You're very welcome. Please keep in mind that there are some details (optimizer, learning rate, conditioning equations, signal-splitting technique, etc.) that were left out of the paper, so I've had to basically guess at what these are and improvise accordingly.
from wavernn.
Hi there, I'm not 100% sure, but the matrix multiplication for the inputs is quite small compared to the others in the equations, so perhaps that's why it's left out of N.
As for conditioning, I've tried a couple of things - I've passed the conditioning features through a dense layer and then biased the gates, and I've also tried concatenating them to the input. Both work pretty much the same, but the results are only okayish - the speech is intelligible but it has this trembling quality, which might point to my upsampling network being incorrect. Have a listen:
wavernn_conditioned.tar.gz
I'm currently training another model with different upsampling network & bigger dataset. If results are good I'll upload to the repo.
@fatchord Is there any chance you could upload the code version you used to condition on mel-spectrograms?
Hi guys,
I've managed to implement a modified version of WaveRNN (I use two parallel RNN layers instead of modifying the cell) and a slightly modified version of Tacotron to output conditioned spectrograms. I tested the approach on Romanian (directly on characters) and got some pretty OK results. You can check them out here (I also added the results obtained with HTS trained on the same corpus, with lots of features extracted from text - syllables, phonetic transcription, POS, chunking etc.):
https://tiberiu44.github.io/TTS-Cube/
The project's repo is:
https://github.com/tiberiu44/TTS-Cube
I'm currently working on adding some more text-features (unsupervised) and on conditioning the model on multiple speakers.
Hope this helps,
Tibi
Hi there, I believe you can - that's what they said they did in the paper. They use linguistic/pitch features as conditioning for WaveRNN.
Right now I'm trying to condition with upsampled mel spectrograms (like the tacotron2 vocoder) but the sound quality isn't great - I'm getting a whispering/gravelly sound, like the speaker has a really bad cold! I still have a couple more ideas to test out so if I have any success I'll upload it to the repo.
@fatchord Thank you for your quick reply! There are some points in the paper that I haven't fully understood, so I'm using your code to understand them!
@fatchord Thank you for your detailed explanation!
hi @fatchord! Thanks for your implementation, it puts some clarity on equation (2) in the paper. There are a few things I'm not sure I understand. First, it is stated that there are N=5 matrix multiplications, but I count 6:
- I x_t
- R h_{t-1}
- O_1 y_c
- O_2 relu(O_1 y_c)
- O_3 y_f
- O_4 relu(O_3 y_f)
Now, talking about conditioning: all conditioning equations are left out, which feels pretty convenient because it basically allows them to claim N=5. If you add the conditioning operations, this number should rise to at least 6 (or 7). I cannot see in your repository how you have tackled conditioning. I guess you are using a similar approach to WaveNet, where you have a 1D 1x1 convolution on the conditioning variable and the result is just added to the hidden state of the RNN?
@naifrec When sampling, matmuls 1 & 2 can be merged, e.g. [R I] * [h_{t-1}; x_t]
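The merge works because stacking the weight matrices column-wise and concatenating the vectors gives the same result as summing the two products. A minimal numpy sketch (sizes are illustrative, not the paper's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_in = 8, 2            # illustrative sizes, not the paper's 896 units

R = rng.standard_normal((3 * hidden, hidden))  # recurrent weights (3 gates)
I = rng.standard_normal((3 * hidden, n_in))    # input weights
h_prev = rng.standard_normal(hidden)
x_t = rng.standard_normal(n_in)

# Two separate matmuls...
separate = R @ h_prev + I @ x_t

# ...equal one merged matmul over the concatenated vector [h_{t-1}; x_t].
merged = np.concatenate([R, I], axis=1) @ np.concatenate([h_prev, x_t])

assert np.allclose(separate, merged)
```

One big matmul is usually faster than two small ones at sampling time, which is why the count matters for the paper's N.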
@fatchord Very nice result - do they use ground-truth c_t?
The conditioning formula is missing from the paper; maybe we should email the authors.
I concatenate linguistic (green line) / acoustic (red line) features with x_t, training on cmu_us_slt_arctic:
@lifeiteng Thanks, it's not a terrible result - at least it's working somewhat. However, on the same dataset I get a smoother quality with WaveNet (8-bit):
wavenet_gen_glr_dataset.tar.gz
I'm not sure what you mean by ground-truth c_t - do you mean conditioning features at time t?
By the way, I'd love to hear your wavernn samples so feel free to post them here!
@fatchord I mean: are those samples synthesized from acoustic features only, not including ground-truth coarse & fine as inputs (like in training)?
Doing experiments now.
BTW, you can try 9-bit or 10-bit u-law in WaveNet.
@fatchord Hi! For mel as a local condition, do you upsample it? And how do you feed it as input to WaveRNN?
@lifeiteng Re: generation - thanks for clarifying - those are synthesized from mel spectrograms only; the model didn't see any ground-truth audio samples when generating.
Re: 9/10 bit wavenet - actually I've been wanting to try this out but my gpu has been busy with other experiments. Did you try it? Any success with it?
@maozhiqiang I upsampled with a 2D conv transpose layer to keep the channels separate. Have a look at how kan-bayashi does it in his wavenet repo - it's basically the same. Regarding input/conditioning, you have a lot of options: you can concatenate it to the coarse and fine samples, you can transform it with a dense layer then split it and bias the gates, and another option, I guess, is to transform it and add it to the hidden state after the gates have been computed. There might be other ways, but these are the most obvious to me right now.
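One of those options - a dense layer on the conditioning frame whose output is split across the gates as a bias - can be sketched in plain numpy. All sizes and the GRU layout here are my assumptions for illustration, not the repo's actual code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
hidden, n_in, n_mels = 16, 2, 80   # illustrative sizes only

Wx = rng.standard_normal((3 * hidden, n_in)) * 0.1    # input weights
Wh = rng.standard_normal((3 * hidden, hidden)) * 0.1  # recurrent weights
Wc = rng.standard_normal((3 * hidden, n_mels)) * 0.1  # hypothetical dense
                                                      # layer on the mel frame

def gru_step(x, h_prev, mel):
    """One GRU step whose gate pre-activations are biased by conditioning."""
    b_u, b_r, b_e = np.split(Wc @ mel, 3)   # split the projection across gates
    ax = np.split(Wx @ x, 3)
    ah = np.split(Wh @ h_prev, 3)
    u = sigmoid(ax[0] + ah[0] + b_u)        # update gate
    r = sigmoid(ax[1] + ah[1] + b_r)        # reset gate
    e = np.tanh(ax[2] + r * ah[2] + b_e)    # candidate state
    return u * h_prev + (1.0 - u) * e

h = np.zeros(hidden)
h = gru_step(rng.standard_normal(n_in), h, rng.standard_normal(n_mels))
```

Concatenating the mel frame to the input instead just means widening `Wx`; biasing the gates keeps the input matmul small, which matters for the paper's operation count.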
@fatchord wavenet - I tried 9-bit (i.e. a 512-way softmax) in WaveNet and got a much cleaner result than the 8-bit version. I saw a paper recently which claims 10-bit is better than 8-bit.
Now I like WaveRNN more, because training WaveNet needs several days of GPU time to get a reasonable result.
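For reference, here is a generic u-law encode/decode at a configurable bit depth (the standard companding formulation, not code from either repo); the reconstruction error drops as the bit depth grows:

```python
import numpy as np

def mulaw_encode(x, bits):
    """mu-law compand a signal in [-1, 1], then quantize to 2**bits levels."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # still in [-1, 1]
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)  # in [0, mu]

def mulaw_decode(q, bits):
    """Invert the quantization and the companding."""
    mu = 2 ** bits - 1
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 1001)
err_8 = np.max(np.abs(mulaw_decode(mulaw_encode(x, 8), 8) - x))
err_10 = np.max(np.abs(mulaw_decode(mulaw_encode(x, 10), 10) - x))
assert err_10 < err_8   # more bits, less quantization noise
```

The trade-off is the softmax size: 10-bit means a 1024-way output layer, which slows both training and sampling.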
@fatchord Thanks! I will implement it like you said!
@lifeiteng Sounds very interesting, have you got a link to the paper?
@fatchord http://festvox.org/blizzard/bc2017/USTC-NELSLIP_Blizzard2017.pdf
"As the original 8bit quantization introduced quantization noise in synthetic speeches, we proposed to use a 10bit quantization scheme instead, in order to alleviate this problem. WaveNet with 3 blocks, which was 30 layers in total, was used in our system."
@lifeiteng Thanks! Nice read, I really love the idea of using a GAN to fix the smoothness from the mse loss.
Hi @fatchord - you've got some nice quality results there both from wavernn and wavenet!
Did you manage to overcome that trembling quality with your wavernn 198k sample? It's something we've been puzzling over too...
Our intuition points to the following areas:
- The continuity of the RNN is broken by re-initializing the hidden state after 960 samples (which would point to a periodic tremble)
- The upsampling network may produce artifacts
- Experimenting with different conditioning sites (i.e. dense layer / concat at gates)
- If all else fails, more data?
To preserve continuity, we are looking into TBPTT and longer input sequence lengths (like 2000 instead of 960) as well as increasing the length of our batches from the dataloader so we can carry forward the hidden states longer on each step.
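A minimal sketch of that carry-forward idea (hypothetical sizes and a plain GRU; the loss computation is elided):

```python
import torch
import torch.nn as nn

# Hypothetical TBPTT loop: the hidden state is carried across consecutive
# chunks of one utterance, but detached so gradients stop at chunk boundaries.
rnn = nn.GRU(input_size=2, hidden_size=32, batch_first=True)
utterance = torch.randn(1, 2000, 2)    # one long sample sequence
chunk_len = 500

h = None
for start in range(0, utterance.size(1), chunk_len):
    chunk = utterance[:, start:start + chunk_len]
    out, h = rnn(chunk, h)
    # ... compute the loss on `out` and call backward() here ...
    h = h.detach()                     # keep the state, cut the graph
```

Without the `detach()`, the graph grows across the whole utterance and memory blows up; with it, the RNN still sees a continuous state at generation-like horizons.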
Our upsampling is the same conv2d approach as in kan-bayashi's wavenet. (We tried different stride/conv2d settings, like stacking smaller upsampling convs etc., but found that a single 2D conv that upsamples the mel spectrograms by 'hop_length' times works best.)
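For concreteness, that single-conv upsampling can be sketched with a 2D transposed conv whose kernel spans only the time axis. The hop length and kernel/padding choices below are illustrative assumptions (kernel = 2×stride with padding = stride/2 gives exactly hop_length× upsampling), not the repo's actual hparams:

```python
import torch
import torch.nn as nn

hop_length, n_mels = 256, 80   # illustrative values, not the repo's hparams

# One ConvTranspose2d that stretches only the time axis; kernel height 1
# leaves the mel bins untouched, so the channels stay separate.
upsample = nn.ConvTranspose2d(
    in_channels=1, out_channels=1,
    kernel_size=(1, 2 * hop_length),
    stride=(1, hop_length),
    padding=(0, hop_length // 2),
)

mels = torch.randn(1, 1, n_mels, 12)   # (batch, channel, mel bins, frames)
out = upsample(mels)                   # time axis is now 12 * hop_length
```

A learned transposed conv can interpolate more smoothly than naive frame repetition, which is one place upsampling artifacts (and the trembling) could come from.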
We're exploring different conditioning sites now.
The last approach is to train it on a huge dataset like LJSpeech and just wait it out, but our initial overfitting attempts on WaveRNN didn't seem to get rid of the jittery/trembling quality...
Did you find anything that worked? i.e wavernn comparable to wavenet quality?
cheers,
shreyas
@lifeiteng Did you message the authors to make sense of the conditioning formula?
@MlWoo Yes, but I didn't get a response.
@lifeiteng Thanks a lot. The curves you show look good. Did you get them by upsampling the conditional features first to adapt the time resolution and then projecting them to bias the gates?
That's a data-feeding bug. Actually, I haven't had luck with WaveRNN.
@lifeiteng what a pity!
@tiberiu44 Thanks a lot for your contribution. I have tried to train the vocoder with your code, but it is very slow on a 1080 Ti. I want to confirm this because I am not familiar with the DyNet framework.
Hi @MlWoo
It is slow, but it will be even slower if you don't use the GPU version of DyNet. See this issue: tiberiu44/TTS-Cube#2
You need to compile DyNET with CUDA support. Also run the training process for the vocoder with --use-gpu --autobatch and --batch-size=4000.
Similar settings go for the encoder (it ignores the batch-size at this point).
@tiberiu44 I did compile DyNet with CUDA and ran the model with the CUDA backend. I haven't checked your model yet; maybe the model is just large. Thanks for your reply.