Hi, Thanks a ton for your work and for sharing this project with everyone. It's su

Key Differences with Riffusion? (happylittlecat2333/auffusion)

Comments (4)

happylittlecat2333 commented on June 14, 2024 2

Yes. If you normalize each spectrogram individually when fine-tuning Stable Diffusion, you cannot denormalize the output back to a raw spectrogram. You therefore cannot use HiFi-GAN to convert the result back to audio, because HiFi-GAN is trained on <raw audio, raw spectrogram> pairs.
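To make the irreversibility concrete, here is a minimal NumPy sketch. It is not code from the auffusion or Riffusion repositories: the helper `individual_normalize`, the min-max scheme, and the value ranges are assumptions chosen only for illustration. It shows two different spectrograms collapsing to the same normalized array, so the raw values cannot be recovered from the normalized one alone.

```python
import numpy as np

def individual_normalize(spec):
    """Min-max scale one spectrogram to [0, 1] using its own min and max.

    Hypothetical helper: the exact per-sample scheme is an assumption.
    """
    lo, hi = spec.min(), spec.max()
    return (spec - lo) / (hi - lo)

rng = np.random.default_rng(0)

# Hypothetical log-mel values for one clip.
quiet = rng.uniform(-11.0, -4.0, size=(4, 6))
# A second clip with the same pattern but a different level and dynamic range.
loud = 0.5 * quiet + 1.0

# Both clips map to the same normalized array, so the per-clip min and max
# (which are discarded) would be needed to get the raw values back.
same_image = np.allclose(individual_normalize(quiet), individual_normalize(loud))
inputs_differ = not np.allclose(quiet, loud)
print(same_image, inputs_differ)
```

A vocoder trained on raw spectrograms would need to tell `quiet` and `loud` apart, which the normalized array no longer allows.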


happylittlecat2333 commented on June 14, 2024

Hi,
Thank you for your attention! Our work is inspired by Riffusion. We aim to demonstrate that pretrained text-to-image (T2I) models possess strong generative and cross-modal alignment capabilities, and that these capabilities can be transferred to text-to-audio (TTA) tasks with performance comparable to other TTA models. This approach can save significant time and resources compared to training from scratch.

The key differences between our model and Riffusion include:

Riffusion generates only 5-second music clips and hasn't been tested on TTA tasks. For a fair comparison, we reimplemented Riffusion to produce 10-second clips by adjusting the sample rate (from 44.1 kHz to 16 kHz) and the hop size. We also trained our models and Riffusion on the same dataset.
Riffusion normalizes each spectrogram individually and quantizes it from float to an integer image. The individual normalization is irreversible and the float-to-int quantization loses precision, so the generated image cannot be converted back to a raw spectrogram that the neural vocoder HiFi-GAN can consume. Riffusion therefore uses the Griffin-Lim algorithm, which is less sensitive to the absolute values of the generated spectrogram because it reconstructs the phase iteratively. In contrast, we adopt global normalization and feed the float spectrogram directly into the VAE encoder, so the float output of the VAE decoder can be denormalized and passed to HiFi-GAN. Our spectrogram conversion therefore involves no quantization loss and no irreversible step.
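The contrast between the two conversion paths can be sketched with NumPy. This is only an illustration under stated assumptions, not the actual auffusion pipeline: the constants `GLOBAL_MIN` and `GLOBAL_MAX`, the helper names, and the 8-bit quantization are hypothetical, and the VAE and HiFi-GAN stages are left out entirely.

```python
import numpy as np

# Hypothetical dataset-wide bounds for log-mel values; the real constants
# are not stated in this thread.
GLOBAL_MIN, GLOBAL_MAX = -11.0, 3.0

def global_normalize(spec):
    """Scale with fixed, dataset-wide constants (invertible)."""
    return (spec - GLOBAL_MIN) / (GLOBAL_MAX - GLOBAL_MIN)

def global_denormalize(norm):
    """Exact inverse of global_normalize, up to float rounding."""
    return norm * (GLOBAL_MAX - GLOBAL_MIN) + GLOBAL_MIN

def quantize_uint8(norm):
    """Float in [0, 1] to an 8-bit image, as an image-style pipeline might do."""
    return np.round(norm * 255.0).astype(np.uint8)

def dequantize_uint8(img):
    return img.astype(np.float64) / 255.0

rng = np.random.default_rng(0)
spec = rng.uniform(-10.0, 2.0, size=(8, 16))  # hypothetical raw log-mel values

# Float path: normalize, then denormalize with the same constants.
float_roundtrip = global_denormalize(global_normalize(spec))
# Integer path: the same, but with 8-bit quantization in the middle.
int_roundtrip = global_denormalize(
    dequantize_uint8(quantize_uint8(global_normalize(spec)))
)

float_error = np.abs(float_roundtrip - spec).max()
int_error = np.abs(int_roundtrip - spec).max()
print(f"float path max error: {float_error:.2e}")
print(f"uint8 path max error: {int_error:.2e}")
```

In this sketch the float path recovers the input up to floating-point rounding, while the 8-bit path carries an error of up to half a quantization step, scaled by the normalization range.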

Thus, our carefully designed feature space transformation pipeline can fully utilize the capacity of T2I models. We also provide insightful assessments of text-audio alignment in TTA tasks, examining the impact of different text encoder choices through innovative cross-attention map visualizations between text and the generated spectrogram.

I hope my answer has addressed your concerns.


IFICL commented on June 14, 2024

Thank you for your response! To make sure I understand you correctly: do you mean that HiFi-GAN cannot be applied when individual normalization is used? If HiFi-GAN can be applied to the individually normalized version, did you test its performance?


IFICL commented on June 14, 2024

Gotcha. Thank you for your explanation.


