
What is the training time? · about wavernn · 19 comments · CLOSED

fatchord avatar fatchord commented on July 17, 2024
What is the training time?

from wavernn.

Comments (19)

fatchord avatar fatchord commented on July 17, 2024 3

@room101b Hi, this model is very slow to train - it will take around a week with the implementation in this repo. The most obvious way to speed up training, I think, is to create an optimised CUDA kernel for the model, and even then I wouldn't be sure that would speed things up much. Add to that, the 1060 is slower than the card I've used (a 1080), so I wouldn't recommend this particular model unless you want to achieve real-time inference on a mobile app or something like that.

Instead I would recommend checking out FFTNet. I implemented it yesterday and it trains like a beast: without conditioning I was getting 7.5 batches/second - that's 10 times faster than WaveRNN/WaveNet! Also, it wasn't too hard on GPU memory either - much lighter than WaveNet, although not as compact as WaveRNN.

Link: http://gfx.cs.princeton.edu/pubs/Jin_2018_FAR/fftnet-jin2018.pdf

As for conditioning, in my experience these vocoder models are quite robust to conditioning sites. Wavenet and FFTNet papers both have details on how to condition so those are a good start.
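The local-conditioning idea described in those papers (upsample frame-rate features to sample rate, then add a projection of them inside each layer) can be sketched as follows; the hop size, shapes, and random weights here are all assumptions for illustration, not code from either repo:

```python
import numpy as np

# Hedged sketch of local conditioning: frame-level features (e.g. mels) are
# upsampled to sample rate and a linear projection of them is added to the
# layer activations. All sizes and weights below are toy assumptions.
hop = 4                                # samples per feature frame (toy value)
T_frames, C_feat, C_hidden = 6, 3, 8
rng = np.random.default_rng(1)
mel = rng.standard_normal((T_frames, C_feat))

mel_up = np.repeat(mel, hop, axis=0)   # nearest-neighbour upsample to sample rate
V = rng.standard_normal((C_feat, C_hidden)) * 0.1
h = rng.standard_normal((T_frames * hop, C_hidden))  # stand-in layer activations
h = h + mel_up @ V                     # conditioning added at the layer input
print(h.shape)                         # (24, 8)
```

The same addition can be made at every layer of the stack, which is one of the "conditioning sites" both papers discuss.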


lifeiteng avatar lifeiteng commented on July 17, 2024 2

@fatchord like this

N = 2048                                  # receptive field
x = batch * M * C                         # input: M samples, C channels
padding = batch * N * C                   # zeros prepended to x
z = [padding, x]                          # length M + N
while N > 1:
    z = W_L * z[:, :M + N/2] + W_R * z[:, N/2:]
    z = relu(W * relu(z))
    N = N / 2
z = relu(W * z)
logits = W * z                            # outputs 0, 1, ..., M-1, M
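For what it's worth, the halving loop above runs as written when sketched in NumPy; the toy sizes and random weights below are assumptions (N = 16 stands in for 2048, dense matrices stand in for the 1x1 convolutions), not trained values:

```python
import numpy as np

# Minimal NumPy sketch of the FFTNet halving loop. Weights are hypothetical
# random matrices; only the shapes and the length bookkeeping matter here.
rng = np.random.default_rng(0)
B, M, C, N = 2, 64, 8, 16              # batch, output length, channels, receptive field

x = rng.standard_normal((B, M, C))
z = np.concatenate([np.zeros((B, N, C)), x], axis=1)   # zero-pad by N -> length M + N

while N > 1:
    W_L = rng.standard_normal((C, C)) * 0.1
    W_R = rng.standard_normal((C, C)) * 0.1
    W = rng.standard_normal((C, C)) * 0.1
    # combine the two branches offset by N/2, then a 1x1 "conv" with ReLUs
    z = z[:, : z.shape[1] - N // 2] @ W_L + z[:, N // 2 :] @ W_R
    z = np.maximum(0.0, np.maximum(0.0, z) @ W)
    N //= 2

print(z.shape)   # (2, 65, 8): M + 1 steps, the extra one dropped before the loss
```

Each pass shortens the sequence by N/2 samples, so the initial pad of N shrinks to a single extra output step by the time N reaches 1, matching the "drop one step" point discussed below in the thread.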


james-wynne-dev avatar james-wynne-dev commented on July 17, 2024

@fatchord Thanks for the response, it's very helpful. My project will be doing something similar to NSynth, but generating abstract, musique concrète style noises. If you have any further advice it would be appreciated. Thanks.


fatchord avatar fatchord commented on July 17, 2024

@room101b You're very welcome and your project sounds very interesting. By the way, I've uploaded what I've done so far with FFTNet if you wanna check it out:
https://github.com/fatchord/FFTNet


lifeiteng avatar lifeiteng commented on July 17, 2024

Since CNNs parallelize easily on a GPU, training of CNN-like models (WaveNet, FFTNet) is fast.

    model              training   inference
    WaveNet            faster     slowest
    WaveRNN            slower     faster
    FFTNet             fastest    slower
    Parallel WaveNet   fast       fastest

@fatchord awesome!
I also implemented FFTNet yesterday and got a positive result (dumped from training; cached inference is hard to implement in TensorFlow, WIP):

step 10k

[waveform image: allison_lls007_04488]

step 130k

[waveform image: allison_lls006_03649]


fatchord avatar fatchord commented on July 17, 2024

@lifeiteng That looks pretty darn good! One thing in the paper had me scratching my head and I'd love to get your input on it.

In section 2.3.2 they say to zero pad by N (they don't explicitly define N but I strongly got the impression it was the receptive field for any given layer in the stack):

z[0:M] = W_L ∗ x[-N:M-N] + W_R ∗ x[0:M]

But if the previous equation (without zero padding) was:

z = W_L ∗ x[0:N/2] + W_R ∗ x[N/2:N]

Wouldn't that mean that the equation from 2.3.2 should read:

z[0:M] = W_L ∗ x[-N/2:M-N/2] + W_R ∗ x[0:M]

Am I missing something?
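A toy numeric check of that reading (scalar weights stand in for the 1x1 convolutions W_L and W_R; the values are arbitrary assumptions): if each layer's left branch looks N/2 steps behind the right branch, then padding the layer input by N/2 keeps the output aligned at length M:

```python
import numpy as np

# Per-layer view of the padded FFTNet equation: the left branch reads
# x[t - N/2] and the right branch reads x[t], so a left pad of N/2 zeros
# yields exactly M output steps. Scalar "weights" stand in for W_L, W_R.
M, N = 8, 4
x = np.arange(1, M + 1, dtype=float)   # x = [1, 2, ..., 8]
w_l, w_r = 1.0, 1.0

xp = np.concatenate([np.zeros(N // 2), x])   # pad by N/2 on the left
z = w_l * xp[:M] + w_r * xp[N // 2:]         # both slices have length M
print(z)   # [ 1.  2.  4.  6.  8. 10. 12. 14.] -> z[t] = x[t - N/2] + x[t]
```

Under this per-layer reading the shift is indeed N/2, as the question suggests; padding by the full receptive field N only makes sense once, at the input to the whole stack.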


fatchord avatar fatchord commented on July 17, 2024

@lifeiteng Thanks! Yeah, that was what I was doing initially, but the output tensor has an extra N steps if you do it that way - just chop it off before backprop?


lifeiteng avatar lifeiteng commented on July 17, 2024

logits = Wz -> 0, 1, ..., M-1, M: just one extra output step (M); yes, drop it.


fatchord avatar fatchord commented on July 17, 2024

@lifeiteng My bad, I was padding inside the layer (like a bloody idiot!). Thanks again.


lifeiteng avatar lifeiteng commented on July 17, 2024

@fatchord I have sent you a gitter invitation for more in-depth communication.


fatchord avatar fatchord commented on July 17, 2024

@lifeiteng Thanks, I'll make an account on Gitter now so.


iovdin avatar iovdin commented on July 17, 2024

@fatchord your 12k iteration sample sounds good.
If WaveRNN is just a very tuned RNN, then training an nn.GRU with 1024 hidden units on mu-law input/output should, after 12k steps, produce a slightly worse but comparable sample. But it is far, far from that.
Any idea why is that?


fatchord avatar fatchord commented on July 17, 2024

@iovdin "Far far from that" as in good or bad? Can you post a sample from your experiment?


iovdin avatar iovdin commented on July 17, 2024

@fatchord Okay, with weight decay and a lower learning rate it seems to sound better ("far, far from that" meant really bad):
https://lera.ai/s/3318a1


fatchord avatar fatchord commented on July 17, 2024

@iovdin It doesn't sound too bad - especially considering it's so early in training. Also, the 16 bits in WaveRNN make a big difference when it comes to noise reduction and dynamic range - mu-law can only do so much to reduce noise at lower bit depths.
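For reference, this is the textbook mu-law companding formula (mu = 255) with an 8-bit quantization round trip, not the exact code from either repo; it illustrates why 8-bit mu-law still leaves audible quantization noise compared with a 16-bit scheme:

```python
import numpy as np

# Standard mu-law companding (mu = 255). 8-bit mu-law gives only 256
# levels (nonlinearly spaced), versus 65536 levels at 16 bits.
def mu_law_encode(x, mu=255):
    # x in [-1, 1] -> companded value in [-1, 1]
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_decode(y, mu=255):
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1.0, 1.0, 5)
q = np.round((mu_law_encode(x) + 1) / 2 * 255)   # quantize to 8 bits
x_hat = mu_law_decode(q / 255 * 2 - 1)           # dequantize
print(np.max(np.abs(x - x_hat)))                 # residual quantization error
```

The encode/decode pair is exact without quantization; the error comes entirely from rounding to 256 levels, which is the noise floor mu-law cannot get below.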


iovdin avatar iovdin commented on July 17, 2024

@fatchord the counter shows tens of steps, i.e. it is actually 100k steps at seq_len 128, comparable to your 12k steps at seq_len 960.


fatchord avatar fatchord commented on July 17, 2024

@iovdin That sounds like too small a number of time steps for training. Even at a low sample rate of 16kHz, the lowest audible frequency starts around 30Hz, which is ~500 steps per period. I would recommend upping it to around 1000 steps.
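The arithmetic behind the ~500-step figure:

```python
# Samples needed to cover one full period of the lowest audible frequency
# at a 16 kHz sample rate.
sample_rate = 16_000
lowest_hz = 30
period_in_samples = sample_rate // lowest_hz
print(period_in_samples)   # 533
```

A BPTT window of 128 samples therefore sees well under a quarter of a 30 Hz period, while ~1000 steps covers roughly two full periods.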


iovdin avatar iovdin commented on July 17, 2024

@fatchord The guys from DeepSound trained SampleRNN with 128 BPTT steps: http://deepsound.io/samplernn_first.html


fatchord avatar fatchord commented on July 17, 2024

@iovdin Cool link though - thanks!

I'm not too familiar with SampleRNN (although it's a very interesting model), so I can't really comment on it much.

Actually - doesn't SampleRNN operate on frames of samples? Perhaps it's 128 frames of 16 samples each? Again, I haven't read that paper yet, so I could be wrong on that.

