
Comments (4)

happylittlecat2333 commented on June 14, 2024

Yes. If you apply individual (per-sample) normalization to the spectrograms when fine-tuning Stable Diffusion, you cannot denormalize the output back to a raw spectrogram. You therefore cannot use HiFi-GAN to convert it back to audio, since HiFi-GAN is trained on <raw audio, raw spectrogram> pairs.
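To illustrate the point (this is a toy sketch, not the Auffusion code): with per-sample min-max scaling, two spectrograms at different levels can collapse to the same normalized values, so the raw scale cannot be recovered from the normalized output alone. The function names, the toy values, and the bounds `GLOBAL_MIN` / `GLOBAL_MAX` are all assumptions made up for the example.

```python
# Toy 1-D "log-mel" values; real spectrograms are 2-D arrays.

def normalize_individual(spec):
    """Min-max scale using this spectrogram's own min and max."""
    lo, hi = min(spec), max(spec)
    return [(v - lo) / (hi - lo) for v in spec]

# Assumed dataset-wide bounds (illustrative constants, not Auffusion's values).
GLOBAL_MIN, GLOBAL_MAX = -12.0, 4.0

def normalize_global(spec):
    """Scale with fixed constants shared by every sample."""
    return [(v - GLOBAL_MIN) / (GLOBAL_MAX - GLOBAL_MIN) for v in spec]

def denormalize_global(norm):
    """Exact inverse of normalize_global, since the constants are known."""
    return [v * (GLOBAL_MAX - GLOBAL_MIN) + GLOBAL_MIN for v in norm]

quiet = [-10.0, -8.0, -6.0]
loud = [-4.0, 0.0, 4.0]

# Individual normalization: both inputs become identical, so the original
# level is lost unless each sample's min and max are stored separately.
ind_quiet = normalize_individual(quiet)
ind_loud = normalize_individual(loud)
print(ind_quiet == ind_loud)

# Global normalization: the mapping can be undone with the shared constants.
print(denormalize_global(normalize_global(quiet)) == quiet)
print(denormalize_global(normalize_global(loud)) == loud)
```

A generated spectrogram has no stored per-sample min and max to fall back on, which is why the individually normalized version cannot be mapped back to the raw range a vocoder was trained on.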

from auffusion.

happylittlecat2333 commented on June 14, 2024

Hi,
Thank you for your attention! Our work is inspired by Riffusion, and we aim to demonstrate that pretrained T2I models possess strong cross-modal alignment with generative capabilities. Additionally, their abilities can be transferred to TTA tasks, achieving performance comparable to other TTA models. This approach can save significant time and resources compared to training from scratch.

The key differences between our model and Riffusion include:

  1. Riffusion generates only 5-second music clips and hasn't been tested in TTA tasks. For a fair comparison, we reimplemented Riffusion to produce 10-second clips by adjusting the sample rate (from 44.1k to 16k) and hop size. We also use the same dataset to train our models and Riffusion.

  2. Riffusion uses individual normalization and quantizes the spectrogram (float) into an image (int), which introduces a non-reversible process (individual normalization) and precision loss (float to int). As a result, their spectrogram conversion process cannot be reversed using the neural vocoder HiFi-GAN. Therefore, they adopt the Griffin-Lim algorithm, which is not sensitive to the initial numerical value of the generated spectrogram, because Griffin-Lim uses an iterative reverse process. In contrast, we adopt global normalization and input the spectrogram (float) directly into the VAE encoder. Therefore, we can use HiFi-GAN to convert back from the VAE decoder output (float). The whole process is a lossless audio conversion, without any precision loss or non-reversible process.
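The precision loss from the float-to-int step in point 2 can be sketched as follows. This is only an illustration under assumed settings (8-bit pixels, values already scaled to [0, 1]); the helper names are hypothetical and do not come from Riffusion or Auffusion.

```python
def quantize_to_uint8(norm):
    """Map floats in [0, 1] to integer pixel levels 0..255."""
    return [round(v * 255) for v in norm]

def dequantize_from_uint8(pixels):
    """Map pixel levels back to floats in [0, 1]."""
    return [p / 255 for p in pixels]

# 1001 evenly spaced normalized spectrogram values.
norm = [i / 1000 for i in range(1001)]

pixels = quantize_to_uint8(norm)
restored = dequantize_from_uint8(pixels)

# Rounding error introduced by the image round trip (at most half a level).
max_error = max(abs(a - b) for a, b in zip(norm, restored))

# Many distinct float values share one pixel level after quantization.
distinct_levels = len(set(pixels))

print(f"max round-trip error: {max_error:.6f}")
print(f"{len(norm)} input values -> {distinct_levels} pixel levels")
```

Keeping the spectrogram in floating point end to end avoids this rounding step entirely, which is the property the reply above relies on when feeding the VAE decoder output to HiFi-GAN.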

Thus, our carefully designed feature space transformation pipeline can fully utilize the capacity of T2I models. We also provide insightful assessments of text-audio alignment in TTA tasks, examining the impact of different text encoder choices through innovative cross-attention map visualizations between text and the generated spectrogram.

I hope my answer has addressed your concerns.


IFICL commented on June 14, 2024

Thank you for your response! To make sure I understand you correctly, do you mean that HiFi-GAN cannot be applied when using individual normalization? If HiFi-GAN can be applied to the individually normalized version, did you test its performance?


IFICL commented on June 14, 2024

Gotcha. Thank you for your explanation.

