Comments (4)
Yes. If you apply individual (per-sample) normalization to the spectrogram when fine-tuning Stable Diffusion, you cannot denormalize the output back to the raw spectrogram. You therefore cannot use HiFi-GAN to convert it back to audio, since HiFi-GAN is trained on <raw audio, raw spectrogram> pairs.
from auffusion.
Hi,
Thank you for your attention! Our work is inspired by Riffusion, and we aim to demonstrate that pretrained text-to-image (T2I) models possess strong cross-modal alignment and generative capabilities. These abilities can be transferred to text-to-audio (TTA) tasks, achieving performance comparable to other TTA models. This approach can save significant time and resources compared to training from scratch.
The key differences between our model and Riffusion include:
- Riffusion generates only 5-second music clips and hasn't been tested on TTA tasks. For a fair comparison, we reimplemented Riffusion to produce 10-second clips by adjusting the sample rate (from 44.1 kHz to 16 kHz) and the hop size. We also trained our models and Riffusion on the same dataset.
- Riffusion uses individual normalization and quantizes the spectrogram (float) into an image (int). This introduces a non-reversible step (individual normalization) and precision loss (float to int), so their spectrogram conversion cannot be reversed with the neural vocoder HiFi-GAN. They therefore adopt the Griffin-Lim algorithm, which is an iterative inversion procedure and so is not sensitive to the absolute numerical values of the generated spectrogram. In contrast, we adopt global normalization and feed the spectrogram (float) directly into the VAE encoder, so we can use HiFi-GAN to convert the VAE decoder output (float) back to audio. The whole spectrogram conversion process thus involves no precision loss and no non-reversible step.
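The normalization point above can be illustrated with a minimal NumPy sketch. This is not the Auffusion or Riffusion code: the function names, the `GLOBAL_MIN`/`GLOBAL_MAX` constants, and the random "log-mel" arrays are all hypothetical and chosen only for illustration. It shows that per-sample min-max normalization maps two spectrograms with different absolute levels to the same array (so the original scale cannot be recovered), that fixed global statistics give an invertible mapping, and that 8-bit quantization adds a bounded rounding error.

```python
import numpy as np

# Hypothetical dataset-wide log-mel range (illustrative values only).
GLOBAL_MIN, GLOBAL_MAX = -11.5, 2.5


def individual_normalize(spec):
    """Per-sample min-max scaling; the per-sample min/max are not kept."""
    lo, hi = spec.min(), spec.max()
    return (spec - lo) / (hi - lo)


def global_normalize(spec):
    """Scaling with fixed statistics shared by every sample."""
    return (spec - GLOBAL_MIN) / (GLOBAL_MAX - GLOBAL_MIN)


def global_denormalize(norm):
    """Exact inverse of global_normalize (up to float rounding)."""
    return norm * (GLOBAL_MAX - GLOBAL_MIN) + GLOBAL_MIN


def quantize_uint8(norm):
    """Map values in [0, 1] to 8-bit integers, as when saving an image."""
    return np.round(norm * 255).astype(np.uint8)


def dequantize_uint8(img):
    """Map 8-bit integers back to floats in [0, 1]."""
    return img.astype(np.float32) / 255.0


rng = np.random.default_rng(0)
quiet = rng.uniform(-10.0, -6.0, size=(80, 100)).astype(np.float32)
loud = quiet + 4.0  # same content, higher level in the log domain

# Individual normalization erases the level difference between the two.
same_after_individual = np.allclose(
    individual_normalize(quiet), individual_normalize(loud), atol=1e-4
)

# Global normalization can be undone, so the raw spectrogram is recovered.
roundtrip = global_denormalize(global_normalize(quiet))
global_is_reversible = np.allclose(roundtrip, quiet, atol=1e-4)

# Quantizing to uint8 leaves a rounding error of at most half a step.
norm = global_normalize(quiet)
quant_error = np.abs(dequantize_uint8(quantize_uint8(norm)) - norm).max()
```

In this sketch a vocoder trained on raw spectrograms could be fed `roundtrip` directly, whereas nothing in the individually normalized array indicates whether it came from `quiet` or `loud`.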
Thus, our carefully designed feature space transformation pipeline can fully utilize the capacity of T2I models. We also provide insightful assessments of text-audio alignment in TTA tasks, examining the impact of different text encoder choices through innovative cross-attention map visualizations between text and the generated spectrogram.
I hope my answer has addressed your concerns.
Thank you for your response! To make sure I understand you correctly: do you mean that HiFi-GAN cannot be applied when individual normalization is used? If HiFi-GAN can be applied to the individually normalized version, did you test its performance?
Gotcha. Thank you for your explanation.