
happylittlecat2333 / auffusion


Official code and models for the paper "Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation"

Home Page: https://auffusion.github.io/

License: Other

Python 0.70% Shell 0.01% Jupyter Notebook 99.30%
audio-generation diffusion diffusion-models large-language-models text-to-audio

auffusion's Issues

About pre-trained VAE

Hi, do you directly use the pre-trained VAE from LDM, or is the VAE first pre-trained on audio spectrograms? Thank you very much.
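
For context, by "directly use the pre-trained VAE" I mean loading the Stable Diffusion autoencoder unchanged and pushing spectrogram images through it, roughly as in the sketch below (the checkpoint name and tensor shape are just my examples, not taken from your code):

import torch
from diffusers import AutoencoderKL

# Illustrative only: the checkpoint name is an example of mine,
# not necessarily the one Auffusion actually uses.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# a mel-spectrogram rendered as a 3-channel image, scaled to [-1, 1]
spec_image = torch.randn(1, 3, 256, 1024)

with torch.no_grad():
    latents = vae.encode(spec_image).latent_dist.sample() * vae.config.scaling_factor
    recon = vae.decode(latents / vae.config.scaling_factor).sample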

pt_to_numpy in auffusion_pipeline.py has 'staticmethod' object is not callable error

I have followed the installation steps and run the code in a Jupyter notebook. Upon running the lines below, I get the following error:

TypeError                                 Traceback (most recent call last)
Cell In[6], line 2
      1 prompt = "Birds singing sweetly in a blooming garden"
----> 2 output = pipeline(prompt=prompt)
      3 audio = output.audios[0]
      4 sf.write(f"{prompt}.wav", audio, samplerate=16000)

File /opt/conda/envs/auffusion/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/work/Auffusion/auffusion_pipeline.py:1026, in AuffusionPipeline.__call__(self, prompt, height, width, num_inference_steps, guidance_scale, negative_prompt, num_images_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, output_type, return_dict, callback, callback_steps, cross_attention_kwargs, guidance_rescale, duration)
   1023     spectrograms.append(spectrogram)
   1025 # Convert to PIL
-> 1026 images = pt_to_numpy(image)    
   1027 images = numpy_to_pil(images)
   1028 images = [image_add_color(image) for image in images]

TypeError: 'staticmethod' object is not callable
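
For reference, a bare staticmethod object only became callable as a regular function in Python 3.10, which is why the decorated module-level helpers in auffusion_pipeline.py fail under the Python 3.9 environment shown in the traceback. Below is a minimal sketch of plain-function equivalents one could substitute, mirroring the diffusers VaeImageProcessor helpers (the repo's actual implementations may differ slightly); alternatively, running under Python 3.10 or newer avoids the error.

import numpy as np
import torch
from PIL import Image

def pt_to_numpy(images: torch.Tensor) -> np.ndarray:
    # (B, C, H, W) float tensor in [0, 1] -> (B, H, W, C) numpy array
    # (sketch of the diffusers helper; the repo's version may differ)
    return images.cpu().permute(0, 2, 3, 1).float().numpy()

def numpy_to_pil(images: np.ndarray) -> list:
    # (B, H, W, C) float array in [0, 1] -> list of PIL images
    if images.ndim == 3:
        images = images[None, ...]
    images = (images * 255).round().astype("uint8")
    return [Image.fromarray(img) for img in images]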

Key Differences with Riffusion?

Hi,
Thanks a ton for your work and for sharing this project with everyone. It's super helpful for the community!

While reading the paper, I had several questions. My main one concerns the key differences from Riffusion. Riffusion did not release a paper, but they explained their approach on their website (the link no longer exists, so I used the Wayback Machine to recover it). As far as I can tell, Riffusion is also fine-tuned from Stable Diffusion and trained on paired mel-spectrograms and text descriptions. You also reimplemented it and trained it on the same dataset, as described in the supplementary material.

From my understanding, Auffusion makes several additional changes:

  • Applies global normalization to the mel-spectrograms instead of per-sample normalization
  • Uses CLAP + FLAN-T5 instead of CLIP as the text encoder for conditioning
  • Uses a HiFi-GAN vocoder instead of the Griffin-Lim algorithm to convert mel-spectrograms back to waveforms

There may be more differences I missed; perhaps the authors can help identify them. I also wonder which parts bring such large improvements. I understand that the HiFi-GAN vocoder can bring a large improvement, and from the experiments the choice of text encoder only makes a small difference. Is global normalization also very helpful, or does the HiFi-GAN vocoder do most of the work?
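
To make sure I am reading the normalization point correctly, here is a toy sketch of the two schemes as I understand them (the dB range below is my own assumption, not a value from your code):

import numpy as np

def normalize_per_sample(mel_db: np.ndarray) -> np.ndarray:
    # each spectrogram is stretched to [0, 1] with its own min/max,
    # so absolute loudness information is lost
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)

def normalize_global(mel_db: np.ndarray, db_min: float = -80.0, db_max: float = 0.0) -> np.ndarray:
    # every spectrogram shares one fixed range (db_min/db_max are assumed values),
    # so the same dB value always maps to the same pixel intensity across the dataset
    return np.clip((mel_db - db_min) / (db_max - db_min), 0.0, 1.0)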

Looking forward to your reply.

the code for the audio-to-audio generation

This is very exciting work!
I see that audio-to-audio generation has been marked as completed in the TODO list, but I couldn't find it.
Can you tell me where the inference code for audio-to-audio generation is?

Question about CLAP score evaluation

Hi,

In your repo, you mention that you use https://huggingface.co/laion/clap-htsat-unfused to compute the CLAP score, and I am trying to reproduce the same evaluation. However, I noticed that the CLAP weights are trained at a 48 kHz sampling rate, while your model only produces audio at 16 kHz.

I wonder how you perform the evaluation. Did you upsample the audio with an off-the-shelf model?
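
For reference, this is roughly how I am computing the score at the moment, resampling the 16 kHz output to 48 kHz before feeding it to CLAP (a sketch of my own setup, not your evaluation script; the file name and prompt are placeholders):

import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# load a 16 kHz generation and resample it to CLAP's expected 48 kHz
# ("sample.wav" and the prompt below are placeholders)
audio_16k, _ = librosa.load("sample.wav", sr=16000)
audio_48k = librosa.resample(audio_16k, orig_sr=16000, target_sr=48000)

inputs = processor(
    text=["Birds singing sweetly in a blooming garden"],
    audios=[audio_48k],
    sampling_rate=48000,
    return_tensors="pt",
)
with torch.no_grad():
    out = model(**inputs)

# cosine similarity between audio and text embeddings
audio_emb = out.audio_embeds / out.audio_embeds.norm(dim=-1, keepdim=True)
text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
clap_score = (audio_emb * text_emb).sum(dim=-1)
print(clap_score.item())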

`AttributeError: 'NoneType' object has no attribute 'shape'` when giving negative_prompt

Thanks for this great work on Auffusion!
I opened a PR for this issue. Feel free to merge it or to resolve it more officially in your own way.

Copied from PR:
Currently, when a negative_prompt is given to the pipeline, an error is raised:

Traceback (most recent call last):
  File "D:\path\Auffusion\test.py", line 8, in <module>
    output = pipeline(prompt=prompt, negative_prompt="Low quality, average quality.")
  File "D:\path\Auffusion\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\path\Auffusion\auffusion_pipeline.py", line 934, in __call__
    prompt_embeds = self._encode_prompt(
  File "D:\path\Auffusion\auffusion_pipeline.py", line 757, in _encode_prompt
    seq_len = negative_prompt_embeds.shape[1]
AttributeError: 'NoneType' object has no attribute 'shape'
It seems that negative_prompt_embeds is never initialized in the elif isinstance(negative_prompt, str) branch in auffusion_pipeline.py:

...
if negative_prompt is None:
    negative_prompt_embeds = torch.zeros_like(prompt_embeds).to(dtype=prompt_embeds.dtype, device=device)
elif isinstance(negative_prompt, str):
    negative_prompt = [negative_prompt]
    # negative_prompt_embeds remains None here.
...
else:
    negative_prompt_embeds = get_prompt_embeds(negative_prompt, device)
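
One possible fix, in line with the existing else branch, is to embed the wrapped string instead of leaving negative_prompt_embeds as None (a sketch; the actual change in my PR may differ in detail):

if negative_prompt is None:
    negative_prompt_embeds = torch.zeros_like(prompt_embeds).to(dtype=prompt_embeds.dtype, device=device)
elif isinstance(negative_prompt, str):
    # wrap the single string into a list and embed it, instead of
    # leaving negative_prompt_embeds as None
    negative_prompt_embeds = get_prompt_embeds([negative_prompt], device)
else:
    negative_prompt_embeds = get_prompt_embeds(negative_prompt, device)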
