
happylittlecat2333 / auffusion


Official code and models for the paper "Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation"

Home Page: https://auffusion.github.io/

License: Other

Python 0.70% Shell 0.01% Jupyter Notebook 99.30%
audio-generation diffusion diffusion-models large-language-models text-to-audio

auffusion's Issues

About pre-trained VAE

Hi, do you directly use the pre-trained VAE from LDM, or is the VAE first pre-trained on audio spectrograms? Thank you very much.
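
For context, by "directly use the pre-trained VAE" I mean loading the Stable Diffusion autoencoder unchanged and pushing spectrogram images through it, roughly as in the sketch below (the checkpoint name and tensor shape are just my examples, not taken from your code):

import torch
from diffusers import AutoencoderKL

# Illustrative only: the checkpoint name is an example of mine,
# not necessarily the one Auffusion actually uses.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# a mel-spectrogram rendered as a 3-channel image, scaled to [-1, 1]
spec_image = torch.randn(1, 3, 256, 1024)

with torch.no_grad():
    latents = vae.encode(spec_image).latent_dist.sample() * vae.config.scaling_factor
    recon = vae.decode(latents / vae.config.scaling_factor).sample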

pt_to_numpy in auffusion_pipeline.py has 'staticmethod' object is not callable error

I have followed the installation steps and run the code in a Jupyter notebook. Upon running the lines below, I get the following error:

TypeError                                 Traceback (most recent call last)
Cell In[6], line 2
      1 prompt = "Birds singing sweetly in a blooming garden"
----> 2 output = pipeline(prompt=prompt)
      3 audio = output.audios[0]
      4 sf.write(f"{prompt}.wav", audio, samplerate=16000)

File /opt/conda/envs/auffusion/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/work/Auffusion/auffusion_pipeline.py:1026, in AuffusionPipeline.__call__(self, prompt, height, width, num_inference_steps, guidance_scale, negative_prompt, num_images_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, output_type, return_dict, callback, callback_steps, cross_attention_kwargs, guidance_rescale, duration)
   1023     spectrograms.append(spectrogram)
   1025 # Convert to PIL
-> 1026 images = pt_to_numpy(image)    
   1027 images = numpy_to_pil(images)
   1028 images = [image_add_color(image) for image in images]

TypeError: 'staticmethod' object is not callable
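
For reference, a bare staticmethod object only became callable as a regular function in Python 3.10, which is why the decorated module-level helpers in auffusion_pipeline.py fail under the Python 3.9 environment shown in the traceback. Below is a minimal sketch of plain-function equivalents one could substitute, mirroring the diffusers VaeImageProcessor helpers (the repo's actual implementations may differ slightly); alternatively, running under Python 3.10 or newer avoids the error.

import numpy as np
import torch
from PIL import Image

def pt_to_numpy(images: torch.Tensor) -> np.ndarray:
    # (B, C, H, W) float tensor in [0, 1] -> (B, H, W, C) numpy array
    # (sketch of the diffusers helper; the repo's version may differ)
    return images.cpu().permute(0, 2, 3, 1).float().numpy()

def numpy_to_pil(images: np.ndarray) -> list:
    # (B, H, W, C) float array in [0, 1] -> list of PIL images
    if images.ndim == 3:
        images = images[None, ...]
    images = (images * 255).round().astype("uint8")
    return [Image.fromarray(img) for img in images]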

Key Differences with Riffusion?

Hi,
Thanks a ton for your work and for sharing this project with everyone. It's super helpful for the community!

While reading the paper, I had several questions. My main one concerns the key differences from Riffusion. Riffusion did not release a paper, but they explained their approach on their website (the link no longer exists, so I used the Wayback Machine to recover it). As far as I can tell, Riffusion is also fine-tuned from Stable Diffusion and trained on paired mel-spectrograms and text descriptions. You also reimplemented it and trained it on the same dataset, as described in the supplementary material.

From my understanding, Auffusion makes several additional changes:

  • Applies global normalization to the mel-spectrograms instead of per-sample normalization
  • Uses CLAP + FLAN-T5 instead of CLIP as the text encoder for conditioning
  • Uses a HiFi-GAN vocoder instead of the Griffin-Lim algorithm to convert mel-spectrograms back to waveforms

There may be more differences I missed; perhaps the authors can help identify them. I also wonder which parts bring such large improvements. I understand that the HiFi-GAN vocoder can bring a large improvement, and from the experiments the choice of text encoder only makes a small difference. Is global normalization also very helpful, or does the HiFi-GAN vocoder do most of the work?
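
To make sure I am reading the normalization point correctly, here is a toy sketch of the two schemes as I understand them (the dB range below is my own assumption, not a value from your code):

import numpy as np

def normalize_per_sample(mel_db: np.ndarray) -> np.ndarray:
    # each spectrogram is stretched to [0, 1] with its own min/max,
    # so absolute loudness information is lost
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)

def normalize_global(mel_db: np.ndarray, db_min: float = -80.0, db_max: float = 0.0) -> np.ndarray:
    # every spectrogram shares one fixed range (db_min/db_max are assumed values),
    # so the same dB value always maps to the same pixel intensity across the dataset
    return np.clip((mel_db - db_min) / (db_max - db_min), 0.0, 1.0)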

Looking forward to your reply.

the code for the audio-to-audio generation

This is very exciting work!
I see that audio-to-audio generation has been marked as completed in the TODO list, but I couldn't find it.
Can you tell me where the inference code for audio-to-audio generation is?

Question about CLAP score evaluation

Hi,

In your repo, you mention that you use https://huggingface.co/laion/clap-htsat-unfused to compute the CLAP score, and I am trying to reproduce the same evaluation. However, I noticed that the CLAP weights are trained at a 48 kHz sampling rate, while your model only produces audio at 16 kHz.

I wonder how you perform the evaluation. Did you upsample the audio with an off-the-shelf model?
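
For reference, this is roughly how I am computing the score at the moment, resampling the 16 kHz output to 48 kHz before feeding it to CLAP (a sketch of my own setup, not your evaluation script; the file name and prompt are placeholders):

import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# load a 16 kHz generation and resample it to CLAP's expected 48 kHz
# ("sample.wav" and the prompt below are placeholders)
audio_16k, _ = librosa.load("sample.wav", sr=16000)
audio_48k = librosa.resample(audio_16k, orig_sr=16000, target_sr=48000)

inputs = processor(
    text=["Birds singing sweetly in a blooming garden"],
    audios=[audio_48k],
    sampling_rate=48000,
    return_tensors="pt",
)
with torch.no_grad():
    out = model(**inputs)

# cosine similarity between audio and text embeddings
audio_emb = out.audio_embeds / out.audio_embeds.norm(dim=-1, keepdim=True)
text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
clap_score = (audio_emb * text_emb).sum(dim=-1)
print(clap_score.item())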

`AttributeError: 'NoneType' object has no attribute 'shape'` when giving negative_prompt

Thanks for this great work on Auffusion!
I opened a PR for this issue. Feel free to merge it or to resolve it more officially in your own way.

Copied from PR:
Currently, when a negative_prompt is given to the pipeline, an error is raised:

Traceback (most recent call last):
  File "D:\path\Auffusion\test.py", line 8, in <module>
    output = pipeline(prompt=prompt, negative_prompt="Low quality, average quality.")
  File "D:\path\Auffusion\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\path\Auffusion\auffusion_pipeline.py", line 934, in __call__
    prompt_embeds = self._encode_prompt(
  File "D:\path\Auffusion\auffusion_pipeline.py", line 757, in _encode_prompt
    seq_len = negative_prompt_embeds.shape[1]
AttributeError: 'NoneType' object has no attribute 'shape'
It seems that negative_prompt_embeds is never initialized in the elif isinstance(negative_prompt, str) branch in auffusion_pipeline.py:

...
if negative_prompt is None:
    negative_prompt_embeds = torch.zeros_like(prompt_embeds).to(dtype=prompt_embeds.dtype, device=device)
elif isinstance(negative_prompt, str):
    negative_prompt = [negative_prompt]
    # negative_prompt_embeds remains None here.
...
else:
    negative_prompt_embeds = get_prompt_embeds(negative_prompt, device)
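
One possible fix, in line with the existing else branch, is to embed the wrapped string instead of leaving negative_prompt_embeds as None (a sketch; the actual change in my PR may differ in detail):

if negative_prompt is None:
    negative_prompt_embeds = torch.zeros_like(prompt_embeds).to(dtype=prompt_embeds.dtype, device=device)
elif isinstance(negative_prompt, str):
    # wrap the single string into a list and embed it, instead of
    # leaving negative_prompt_embeds as None
    negative_prompt_embeds = get_prompt_embeds([negative_prompt], device)
else:
    negative_prompt_embeds = get_prompt_embeds(negative_prompt, device)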
