Comments (9)
Hi @Approximetal,
Quick question: are those samples from out-of-domain speakers or were those speakers in the training data?
from universalvocoding.
They are in the training dataset.
Hmmm, how large is the dataset (in hours)? Also how much data is there for each speaker? I'm not sure how much impact dataset size has but it may be a good first thing to investigate.
Otherwise, a good option might be to try fine-tuning the pretrained model on your dataset.
I trained on four datasets together: VCTK-Corpus, VCC2020, a Chinese dataset of about 24 hours, and a multilingual audiobook dataset of about 150 hours.
By the way, how do I fine-tune the pretrained model? I fine-tuned on VCTK and VCC2020 starting from the 1000k-step pretrained model, decreased the batch size to 2 and the learning rate to 5e-5, but it doesn't work.
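For reference, a minimal sketch of resuming from a pretrained checkpoint with a lowered learning rate. The checkpoint keys ("model", "optimizer") are assumptions for illustration, not necessarily the repo's actual checkpoint format:

```python
import torch

def load_for_finetuning(model, optimizer, checkpoint_path, lr=5e-5):
    """Load pretrained weights and optimizer state, then lower the LR.

    The checkpoint keys ("model", "optimizer") are assumptions and may
    differ from the repo's actual checkpoint layout.
    """
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    # Override the pretrained learning rate with the fine-tuning rate;
    # loading the optimizer state would otherwise restore the old LR.
    for group in optimizer.param_groups:
        group["lr"] = lr
    return model, optimizer
```

Note that restoring the optimizer state and then forgetting to override the learning rate is an easy way for fine-tuning to silently continue at the pretraining rate.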
Sorry about the delay @Approximetal. That's a lot of data so I don't think finetuning is necessary. Since you have a big dataset my guess is that the large variation in speakers, recording conditions, and background noise causes the output distribution over next audio samples to be flatter. Sampling from the distribution could then introduce some noise.
I'm not sure how to get rid of the noise completely. One option is to try a larger model. If you have time to experiment you could increase fc_channels or conditioning_channels here. Otherwise, if you only intend to generate audio from a smaller subset of speakers at test time, you could train for a few epochs on only those speakers.
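As a rough sanity check on capacity, the fully connected head's parameter count grows linearly in fc_channels. A small illustration (the two-layer head shape and the 256-class mu-law output are assumptions about the architecture, not taken from the code):

```python
def fc_param_count(rnn_channels, fc_channels, n_classes=256):
    # Assumed head: RNN output -> fc_channels -> mu-law class logits,
    # two linear layers, each with a weight matrix and a bias vector.
    fc1 = rnn_channels * fc_channels + fc_channels
    fc2 = fc_channels * n_classes + n_classes
    return fc1 + fc2

print(fc_param_count(896, 256))  # baseline width
print(fc_param_count(896, 512))  # roughly doubles the head's parameters
```

Widening layers adds capacity without deepening the network, which sidesteps the vanishing-gradient risk that comes with stacking more recurrent layers.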
If you try any of these ideas or get better results some other way please let me know.
Hope that helps.
Thank you for your advice; I will try it later and update you with my results. Actually, I tried increasing the number of layers and the dimensions of the network, but the loss didn't decrease, and I haven't found the reason.
Also, sometimes a sharp noise occurs at the head of the synthesized audio. When I comment out the padding mel = torch.nn.functional.pad(mel, [0, 0, 4, 0, 0, 0], "constant"), the sharp noise disappears. Do you know why this happens?
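For context, a quick check of what that pad call does. F.pad reads the width list from the last dimension backwards, so for a (batch, time, mels) tensor (the layout here is an assumption about how the spectrogram is passed), [0, 0, 4, 0, 0, 0] prepends four all-zero frames along the time axis:

```python
import torch
import torch.nn.functional as F

mel = torch.ones(1, 10, 80)  # dummy spectrogram: 1 utterance, 10 frames, 80 mels
padded = F.pad(mel, [0, 0, 4, 0, 0, 0], "constant")

print(padded.shape)               # four extra frames at the start of the time axis
print(padded[0, :4].abs().sum())  # the prepended frames are all zeros
```

One speculative explanation for the click: if the spectrogram is log-scaled and normalized, all-zero frames do not correspond to acoustic silence, so the model may be conditioned on four frames of out-of-distribution input at the very start.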
Thanks for the update @Approximetal,
My guess is that adding more layers might result in vanishing gradients. Just changing the width of the layers might work, though.
@wade3han also reported a similar noise issue at the beginning of the audio in #12. I think this may be because the initial hidden state of the RNN layers may be off. Adding a few silent frames to the spectrogram might give the RNNs a chance to warm up.
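The warm-up idea could be sketched as follows. The hop_length value, the silence value, and the vocoder's (batch, time, mels) -> waveform interface are all assumptions here; on a normalized log-mel scale, "silence" may be a floor value rather than zero:

```python
import torch

def vocode_with_warmup(vocoder, mel, n_warmup=4, hop_length=200, silence_value=0.0):
    """Prepend silent frames so the RNN state settles, then trim the
    corresponding audio samples from the generated waveform.

    vocoder is assumed to map a (batch, time, mels) spectrogram to a
    (batch, samples) waveform with hop_length samples per frame.
    """
    batch, _, n_mels = mel.shape
    silence = torch.full((batch, n_warmup, n_mels), silence_value)
    padded = torch.cat([silence, mel], dim=1)
    wav = vocoder(padded)
    # Drop the audio generated from the warm-up frames.
    return wav[..., n_warmup * hop_length:]
```

The trimming step matters: without it, the warm-up frames would simply move the artifact into the output rather than remove it.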
Otherwise, as I mention here, it might be worth training without slicing out the middle segment before passing it to the autoregressive layer. This way the network can learn to generate audio using the initial RNN state.
Sorry, I made a mistake: I meant that when I comment out the padding in the inference/generation step, the noise disappears. By the way, do I only need to adjust fc_channels and conditioning_channels? What about the others? Is it necessary to expand other channels to match the increase in fc_channels?
I tried two parameter sets:
"conditioning_channels": 512,
"embedding_dim": 512,
"rnn_channels": 896,
"fc_channels": 512,
and
"conditioning_channels": 256,
"embedding_dim": 256,
"rnn_channels": 896,
"fc_channels": 512,
But the loss didn't decrease at all... @bshall
Hi @Approximetal,
That's weird. I'll look into those parameters and get back to you shortly. The changes you made look correct, so I'm not sure what the problem is.