Comments (8)
Hi @macarbonneau,
No problem, I'm glad you found the repo useful. I haven't tried using the end (or beginning) segments but there's no real reason it shouldn't work. The thinking behind using the middle segment was to match the training and inference conditions as much as possible. At inference time most of the input to the autoregressive part of the model (rnn2) will have context from both the past and the future. So taking the middle segment is "closer" to what the network will see during inference. If you used the end segment, for example, the autoregressive component wouldn't have future context at training time and the mismatch might cause problems during generation.
Hope that explains my thinking. If anything is unclear let me know.
One of the negative side effects of only using the middle segment is that there are sometimes small artifacts at the beginning or end of the generated audio. For the best quality it might be worth putting in some extra time to train on the entire segment.
from universalvocoding.
Hi @wade3han,
Yeah, I should add some comments explaining those parameters.
First, sample_frames is the number of frames sampled from the mel spectrogram that get fed into the conditioning network (rnn1 in the model). The output then gets upsampled and sent to the auto-regressive part (rnn2 in the model). But if we set sample_frames to 40, then after upsampling there are 40 x 200 = 8000 samples, which takes far too long to train on.
To speed things up I only take the middle audio_slice_frames, upsample them, and then use that to condition rnn2. The pad parameter is just how many frames are on either side of the middle audio_slice_frames. So for the default config this would be (40 - 8) / 2 = 16 frames. To account for only taking the middle frames I also padded the mel spectrograms by pad frames on both sides in preprocess.py.
I hope that helps.
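To make the arithmetic above concrete, here is a minimal sketch of the slicing, assuming the default config values mentioned in this thread (sample_frames=40, audio_slice_frames=8, and a hop length of 200 samples per frame). The variable names and the random stand-in spectrogram are illustrative, not the repo's actual code.

```python
import numpy as np

# Illustrative values matching the default config discussed above.
sample_frames = 40       # mel frames fed to the conditioning network (rnn1)
audio_slice_frames = 8   # middle frames kept to condition rnn2
hop_length = 200         # audio samples per mel frame

# Frames on either side of the middle slice: (40 - 8) / 2 = 16.
pad = (sample_frames - audio_slice_frames) // 2

# A random array standing in for a real mel spectrogram: (frames, n_mels).
mel = np.random.randn(sample_frames, 80).astype(np.float32)

# rnn1 sees all 40 frames; only the middle 8 are upsampled for rnn2.
middle = mel[pad:pad + audio_slice_frames]       # shape (8, 80)
upsampled_len = audio_slice_frames * hop_length  # 8 * 200 = 1600 samples

print(pad, middle.shape, upsampled_len)
```

Training on the full window would instead mean 40 * 200 = 8000 autoregressive steps per example, which is the cost the middle slice avoids.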
Thanks for your reply!
I guessed that strange artifacts like the ones below happen because of those hyperparameters. Have you seen those artifacts too? I got them mainly at the beginning and end of the audio files.
No problem.
Are you using the pretrained model and the generate.py script? Also, what input audio are you using? Is it something from the ZeroSpeech corpus or your own?
I'm not getting any of those artifacts. For example, here's the original audio:
And here's the reconstruction:
Well, I was training a new model from scratch on a Korean speech corpus. It contains 300 hours of utterances from various speakers, and I started getting those artifacts after I tried using audio_slice_frames=16 instead of 8. I believed a bigger audio_slice_frames could help training.
Actually, I'm not sure why those artifacts are generated... I will share an update if I figure out why. Please share your opinion if you have any ideas.
@bshall I have one question about your first reply in this thread. Instead of using 40 mel frames, why not feed just 8 mel frames into the rnn1 layer directly?
Hi @dipjyoti92, sorry about the delay, I've been away. I found that using only 8 frames as input to the rnn1 layer results in the generated audio being only silence. I think 8 frames is too short for the rnn to learn to appropriately use the reset gate, although I haven't investigated this thoroughly.
Hello @bshall !
Thank you for the awesome repo. Your code is very clean, I'm impressed. I'm playing a bit with your implementation and I have a question: why do you take the middle of the mel segment? Why not just the end? Is there a benefit to having the padding at the end?
Thank you!!