happylittlecat2333 / auffusion

Official codes and models of the paper "Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation"

Home Page: https://auffusion.github.io/

License: Other

Python 0.70% Shell 0.01% Jupyter Notebook 99.30%
audio-generation diffusion diffusion-models large-language-models text-to-audio

auffusion's Introduction

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Paper | Model | Website and Examples | Audio Manipulation Notebooks | Hugging Face Models | Google Colab

Description

Auffusion is a latent diffusion model (LDM) for text-to-audio (TTA) generation. It can generate realistic audio from textual prompts, including human sounds, animal sounds, natural and artificial sounds, and sound effects. Auffusion adapts text-to-image (T2I) diffusion frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.

🚀 News

Auffusion Model Family

Model Name Model Path
Auffusion https://huggingface.co/auffusion/auffusion
Auffusion-Full https://huggingface.co/auffusion/auffusion-full
Auffusion-Full-no-adapter https://huggingface.co/auffusion/auffusion-full-no-adapter
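
The checkpoints above can presumably all be loaded through the same pipeline class by swapping the Hugging Face repo id (a minimal sketch; see the Quickstart Guide below for full usage):

from auffusion_pipeline import AuffusionPipeline

# Assumption: Auffusion-Full loads with the same AuffusionPipeline class as
# the base model; only the repo id from the table above changes.
pipeline = AuffusionPipeline.from_pretrained("auffusion/auffusion-full")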

📀 Prerequisites

Our code is built on PyTorch 2.0.1. The requirements file pins torch==2.0.1, but you may need to install a specific CUDA build of torch depending on your GPU. We also depend on diffusers==0.18.2.

Clone the repository and install the requirements:

git clone https://github.com/happylittlecat2333/Auffusion/
cd Auffusion
pip install -r requirements.txt

You might also need to install libsndfile1 for soundfile to work properly on Linux:

(sudo) apt-get install libsndfile1
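
After installation, a quick sanity check (a sketch, not part of the official setup) that the pinned versions and a CUDA-enabled torch build are in place:

import torch
import diffusers

# Expect torch 2.0.1 and diffusers 0.18.2; cuda.is_available() should be True
# when a GPU build of torch matching your CUDA driver is installed.
print(torch.__version__, diffusers.__version__, torch.cuda.is_available())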

⭐ Quickstart Guide

Download the Auffusion model and generate audio from a text prompt:

import IPython
import soundfile as sf
from auffusion_pipeline import AuffusionPipeline

# The checkpoint is downloaded from Hugging Face on first use and cached.
pipeline = AuffusionPipeline.from_pretrained("auffusion/auffusion")

prompt = "Birds singing sweetly in a blooming garden"
output = pipeline(prompt=prompt)
audio = output.audios[0]

# Auffusion outputs 16 kHz waveforms; write to disk and play inline.
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)

The Auffusion model is automatically downloaded from Hugging Face and cached locally; subsequent runs load it directly from the cache.

By default, the pipeline samples from the latent diffusion model with num_inference_steps=100 and guidance_scale=7.5. You can also vary these parameters for different results.

prompt = "Rolling thunder with lightning strikes"
output = pipeline(prompt=prompt, num_inference_steps=100, guidance_scale=7.5)
audio = output.audios[0]
IPython.display.Audio(data=audio, rate=16000)
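
Judging from the pipeline's __call__ signature, it also accepts diffusers-style arguments such as generator, negative_prompt, num_images_per_prompt, and duration. A hedged sketch for reproducible (seeded) sampling:

import torch

prompt = "Rolling thunder with lightning strikes"
# Seed the sampler so repeated runs produce the same audio (this assumes the
# pipeline forwards `generator` to its latent sampling, as in diffusers).
generator = torch.Generator().manual_seed(42)
output = pipeline(prompt=prompt, num_inference_steps=100,
                  guidance_scale=7.5, generator=generator)
audio = output.audios[0]
IPython.display.Audio(data=audio, rate=16000)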

More generated samples are shown here. You can also try out the colab notebook to generate your own audio samples.

🐍 How to make inferences?

From our released checkpoints in Hugging Face Hub

To generate audio for the AudioCaps test set from our Hugging Face checkpoints:

python inference.py \
--pretrained_model_name_or_path="auffusion/auffusion" \
--test_data_dir="./data/test_audiocaps.raw.json" \
--output_dir="./output/auffusion_hf" \
--enable_xformers_memory_efficient_attention

Note

We use the evaluation tools from https://github.com/haoheliu/audioldm_eval to evaluate our models, and we adopt https://huggingface.co/laion/clap-htsat-unfused to compute the CLAP score.

Some data instances originally released in AudioCaps have since been removed from YouTube and are no longer available. We therefore evaluated our models on all instances that were still available as of June 2023.
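
For reference, below is a minimal sketch of how a CLAP score can be computed with laion/clap-htsat-unfused through the transformers library; it is an illustration under our own assumptions, not the exact evaluation script used for the paper. The CLAP checkpoint expects 48 kHz audio, so the 16 kHz Auffusion output is resampled first, and the file path is hypothetical.

import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# CLAP was trained on 48 kHz audio, so resample the 16 kHz generation first.
audio_16k, _ = librosa.load("generated_sample.wav", sr=16000)  # hypothetical path
audio_48k = librosa.resample(audio_16k, orig_sr=16000, target_sr=48000)

caption = "Birds singing sweetly in a blooming garden"
inputs = processor(text=[caption], audios=[audio_48k],
                   sampling_rate=48000, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    audio_emb = model.get_audio_features(input_features=inputs["input_features"])

# CLAP score: cosine similarity between the text and audio embeddings.
clap_score = torch.nn.functional.cosine_similarity(text_emb, audio_emb).item()
print(clap_score)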

Audio Manipulation

We show some examples of audio manipulation using Auffusion. Current audio manipulation methods include text-guided style transfer, audio-to-audio generation, audio inpainting, audio super-resolution, and prompt2prompt-based control (word swap and reweighting); see the TODO list below for their status.

All the audio manipulation code examples can be found in the notebooks.

TODO

"Buy Me A Coffee"

  • Publish demo website and arXiv link.
  • Publish Auffusion and Auffusion-Full checkpoints.
  • Add text-guided style transfer.
  • Add audio-to-audio generation.
  • Add audio inpainting.
  • Add word_swap and reweight prompt2prompt-based control.
  • Add audio super-resolution.
  • Build Gradio web application.
  • Add audio-to-audio, inpainting into Gradio web application.
  • Add style-transfer into Gradio web application.
  • Add audio super-resolution into Gradio web application.
  • Add prompt2prompt-based control into Gradio web application.
  • Add data preprocess and training code.

📚 Citation

Please consider citing the following article if you found our work useful:

@article{xue2024auffusion,
  title={Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation}, 
  author={Jinlong Xue and Yayue Deng and Yingming Gao and Ya Li},
  journal={arXiv preprint arXiv:2401.01044},
  year={2024}
}

🙏 Acknowledgement

Part of the code is borrowed from the following repos; we would like to thank their authors for their contributions.

Contact

If you have any problems regarding the paper, code, models, or the project itself, please feel free to open an issue or contact Jinlong Xue directly :)

auffusion's People

Contributors: eltociear, happylittlecat2333, lingchul, mia11939


auffusion's Issues

Key Differences with Riffusion?

Hi,
Thanks a ton for your work and for sharing this project with everyone. It's super helpful for the community!

While reading the paper, I had several questions. My main concern is the key difference from the Riffusion work. Riffusion didn't release a paper, but they explained their approach on their website (the link no longer exists, so I used the Wayback Machine to recall it). As far as I can tell, Riffusion is also fine-tuned from Stable Diffusion and trained on paired mel-spectrograms and text descriptions. You also reimplemented this and trained it on the same dataset, as described in the supplementary material.

From my understanding of your work, Auffusion additionally does the following:

  • Applies global normalization to the mel-spectrogram instead of per-instance normalization
  • Changes the text encoder to CLAP + FlanT5 instead of CLIP for text conditioning
  • Uses a HiFi-GAN vocoder to convert the mel-spectrogram back to audio instead of the Griffin-Lim algorithm

I assume there may be more differences I didn't notice; perhaps the authors can help identify them. I also wonder which parts bring such large improvements. I understand the HiFi-GAN vocoder can bring a large improvement, and from the experiments the choice of text encoder only makes a small difference. Is global normalization also very helpful, or does the HiFi-GAN vocoder do all the work?

Looking forward to your reply.

Question about CLAP score evaluation

Hi,

In your repo, you mention that you use https://huggingface.co/laion/clap-htsat-unfused to compute the CLAP score, and I am trying to reproduce the same evaluation. However, I noticed that the CLAP weights were trained at a 48 kHz sampling rate, while your model only produces audio at 16 kHz.

I wonder how you perform the evaluation. Did you upsample the audio with off-the-shelf models?

the code for the audio-to-audio generation

This is very exciting work!!
I see that the audio-to-audio generation item has been marked complete in the TODO list, but I couldn't find it.
Can you tell me where the inference code for audio-to-audio generation is?

About pre-trained VAE

Hi, do you directly use the pre-trained VAE from the LDM, or is the VAE first pre-trained on audio spectrograms? Thank you very much.

pt_to_numpy in auffusion_pipeline.py has 'staticmethod' object is not callable error

I have followed the installation steps and run the code in a Jupyter notebook. Upon running the quickstart snippet, I get the following error:

TypeError                                 Traceback (most recent call last)
Cell In[6], line 2
      1 prompt = "Birds singing sweetly in a blooming garden"
----> 2 output = pipeline(prompt=prompt)
      3 audio = output.audios[0]
      4 sf.write(f"{prompt}.wav", audio, samplerate=16000)

File /opt/conda/envs/auffusion/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/work/Auffusion/auffusion_pipeline.py:1026, in AuffusionPipeline.__call__(self, prompt, height, width, num_inference_steps, guidance_scale, negative_prompt, num_images_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, output_type, return_dict, callback, callback_steps, cross_attention_kwargs, guidance_rescale, duration)
   1023     spectrograms.append(spectrogram)
   1025 # Convert to PIL
-> 1026 images = pt_to_numpy(image)    
   1027 images = numpy_to_pil(images)
   1028 images = [image_add_color(image) for image in images]

TypeError: 'staticmethod' object is not callable
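
A likely cause (an assumption on our part, not a confirmed diagnosis): the traceback calls pt_to_numpy as a module-level name, and on Python 3.9 a bare staticmethod object is not callable; Python 3.10 changed this. A hypothetical workaround sketch, assuming auffusion_pipeline.py defines pt_to_numpy at module level with a leftover @staticmethod decorator:

# Hypothetical workaround: either run under Python >= 3.10 (where staticmethod
# objects became callable) or drop the stray decorator so pt_to_numpy is a
# plain module-level function, e.g.
def pt_to_numpy(images):
    # (B, C, H, W) float tensor -> (B, H, W, C) float32 NumPy array, matching
    # the conversion performed by diffusers' VaeImageProcessor.pt_to_numpy.
    return images.cpu().permute(0, 2, 3, 1).float().numpy()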

`AttributeError: 'NoneType' object has no attribute 'shape'` when giving negative_prompt

Thanks for this great work, Auffusion!
I opened a PR for this issue. Feel free to merge it or resolve it more officially in your own way.

Copied from the PR:
Currently, passing a negative_prompt to the pipeline raises an error:

Traceback (most recent call last):
  File "D:\path\Auffusion\test.py", line 8, in <module>
    output = pipeline(prompt=prompt, negative_prompt="Low quality, average quality.")
  File "D:\path\Auffusion\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\path\Auffusion\auffusion_pipeline.py", line 934, in __call__
    prompt_embeds = self._encode_prompt(
  File "D:\path\Auffusion\auffusion_pipeline.py", line 757, in _encode_prompt
    seq_len = negative_prompt_embeds.shape[1]
AttributeError: 'NoneType' object has no attribute 'shape'
It seems that negative_prompt_embeds is never initialized in the elif isinstance(negative_prompt, str) branch in auffusion_pipeline.py:

...
if negative_prompt is None:
    negative_prompt_embeds = torch.zeros_like(prompt_embeds).to(dtype=prompt_embeds.dtype, device=device)
elif isinstance(negative_prompt, str):
    negative_prompt = [negative_prompt]
    # negative_prompt_embeds remains None here.
...
else:
    negative_prompt_embeds = get_prompt_embeds(negative_prompt, device)
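
A fix along these lines (a sketch of what the diagnosis above implies, not necessarily the exact patch in the PR) is to fold the string case into the embedding path so negative_prompt_embeds is always assigned:

if negative_prompt is None:
    negative_prompt_embeds = torch.zeros_like(prompt_embeds).to(
        dtype=prompt_embeds.dtype, device=device
    )
else:
    # Wrap a bare string into a list, then compute embeddings so
    # negative_prompt_embeds is never left as None.
    if isinstance(negative_prompt, str):
        negative_prompt = [negative_prompt]
    negative_prompt_embeds = get_prompt_embeds(negative_prompt, device)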
