Comments (10)
I'm not an expert on this, and I think it would be hard to tell without running experiments. If I had to guess (going by how similar models are scaled up for image generation), I would increase the number of ResNet blocks, e.g. `num_blocks: [2, 2, 2, 3, 3, 3]` or `num_blocks: [2, 2, 2, 4, 4, 4]`, depending on how large you want to go. You could also play with the multipliers to increase the number of channels, e.g. `multipliers: [1, 2, 4, 4, 4, 8, 8]`.
from audio-diffusion-pytorch.
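As a rough sanity check on the settings above, the per-stage widths implied by a multiplier list can be computed directly. A minimal sketch (the base channel count of 128 is an assumption for illustration, not something stated in the thread):

```python
# Sketch: per-stage channel widths implied by a UNet multiplier list.
# base_channels = 128 is assumed here for illustration; the real value
# depends on the model config.
base_channels = 128
multipliers = [1, 2, 4, 4, 4, 8, 8]  # suggested scaled-up setting
num_blocks = [2, 2, 2, 4, 4, 4]      # ResNet blocks per downsampling stage

widths = [base_channels * m for m in multipliers]
print(widths)           # [128, 256, 512, 512, 512, 1024, 1024]
print(sum(num_blocks))  # 18 ResNet blocks along the encoder path
```

This makes it easy to see where the parameter count grows when you bump either list.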
Thanks! I'll try those settings. So you'd leave the attention features/heads/etc. the same?
Heads, definitely. You could increase `attention_features` to 128; then you would have a total of 128 × 8 = 1024 hidden features, which matches Imagen if I'm not wrong. They use twice as many attention hidden features as channels (since 128 channels with a multiplier of 4 gives 512 channels).
Btw let me know if you get good results and which setting you end up using :)
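The arithmetic above can be made explicit; the numbers below just restate the comment's reasoning (the stage multiplier of 4 is the one mentioned in the comment):

```python
# Attention width vs. channel width, as described above.
attention_heads = 8
attention_features = 128            # per-head feature size after the increase
hidden = attention_heads * attention_features
print(hidden)                       # 1024 total attention hidden features

base_channels = 128
stage_multiplier = 4                # multiplier at the stage using attention
channels = base_channels * stage_multiplier
print(channels)                     # 512
print(hidden // channels)           # 2 -> attention width is twice the channels
```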
I stuck with the numbers you described in your first comment and left `attention_features` untouched. You can follow the results here: https://wandb.ai/zaptrem/diffusion-pop-3?workspace=user-zaptrem
I also ran into a couple of issues with my PC overheating (really hot this weekend!) and doubled the dataset size halfway through, which explains the weird loss curves. Additionally, this doesn't include your more recent `context_channels` commit. Do you think it's worth restarting with the increased attention and context channels, or seeing this one through further?
Also, does this library use the VAE encoding trick Stable Diffusion uses to increase efficiency?
Thanks for sharing! The `context_channels` commit is for some experiments I'm doing with conditioning, so it's not necessary for unconditional generation. It's hard to tell what's worth trying; I would wait for this experiment to finish and maybe run another where you only change the attention size, to compare which is more influential.
I tried using a VAE to increase efficiency, but it's very hard to train a good one: there's no good loss function for audio, and it's also hard to make diffusion work on the latents. I would leave that aside for now if you don't want to do lots of experiments :)
> it's very hard to train a good VAE
Could one just steal the pretrained VQVAEs from OpenAI's Jukebox? Or is that type not useful for efficiency improvements like that of Stable Diffusion?
I'm not sure that would work, since in order to add noise to the encoded input it needs to be in the range [-1, 1] with a mean of 0. Maybe if properly regularized.
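One way to see the issue: diffusion training adds Gaussian noise to inputs that are roughly zero-mean and bounded, so an off-the-shelf latent space would at minimum need to be standardized first. A minimal sketch of that normalization (the latent values and statistics here are made up for illustration):

```python
# Sketch: standardizing hypothetical latent codes so they are roughly
# zero-mean before adding diffusion noise. In practice the statistics
# would be estimated over the whole training set; these values are
# illustrative only.
latents = [3.2, 5.0, 4.1, 2.7]  # hypothetical raw VQ-VAE latent values

mean = sum(latents) / len(latents)
std = (sum((x - mean) ** 2 for x in latents) / len(latents)) ** 0.5
normalized = [(x - mean) / std for x in latents]

# After standardization the mean is ~0, so Gaussian noise can be added
# on a comparable scale.
print(abs(sum(normalized) / len(normalized)) < 1e-6)  # True
```

Whether this alone is enough regularization for a Jukebox-style VQ-VAE latent space is exactly the open question in the comment above.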
That makes sense. Is there any reason a pyramid of diffusers, à la Jukebox's transformer priors, couldn't do a similar job? That was my plan once I got something resembling acceptable results out of this level. Also, is there a rule of thumb for when to end training? Or do people just wait until changes are no longer audible/visible?
I switched to the larger attention features version and am getting slightly more encouraging results: https://wandb.ai/zaptrem/diffusion-pop-4?workspace=user-zaptrem
I think I should keep scaling.
Is the learning rate falloff determined by number of epochs, or steps?
> That makes sense. Is there any reason a pyramid of diffusers, à la Jukebox's transformer priors, couldn't do a similar job?
When you say a pyramid of diffusers do you mean like: a first diffusion model predicting a source at 12kHz, then a second upsampling that to 24kHz, and a third to 48kHz?
> Also, is there a rule of thumb for when to end training? Or do people just wait until changes are no longer audible/visible?
There isn't. I've noticed that sometimes, even when the loss seems to have converged, the quality continues to improve a bit afterwards. It's hard to find a rule that always applies, since there's no good metric for audio quality.
> I switched to the larger attention features version and am getting slightly more encouraging results: https://wandb.ai/zaptrem/diffusion-pop-4?workspace=user-zaptrem
That's very interesting! (For some reason, the provided link seems to be dead)
> Is the learning rate falloff determined by number of epochs, or steps?
I didn't add any LR scheduler, but I think other people use `InverseLR`, `CosineAnnealingLR`, or `LambdaLR` scheduling. Also, ideally you would keep a second copy of the model updated with an EMA of the weights and sample from that, so that sampling is more stable; see for example the trainer in imagen-pytorch. It's something I might add to the trainer in the future.
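The EMA idea mentioned above can be sketched in a few lines; the decay value of 0.999 is a common choice assumed here for illustration, not something specified in the thread:

```python
# Sketch: maintaining an EMA copy of model weights for sampling.
# decay=0.999 is a typical value, assumed here for illustration.
def ema_update(ema_weights, model_weights, decay=0.999):
    """Blend the EMA copy toward the current weights after each train step."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, model_weights)]

ema = [0.0, 0.0]       # EMA copy (toy 2-parameter "model")
current = [1.0, -1.0]  # current (noisier) training weights

for _ in range(1000):  # over many steps the EMA drifts toward the weights
    ema = ema_update(ema, current)

print(ema)  # slowly approaches [1.0, -1.0]
```

In a real trainer the same update would run over the model's parameter tensors after each optimizer step, and sampling would use the EMA copy.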
Btw, I'm going to move this issue into the general discussion :)