declare-lab / tango

A family of diffusion models for text-to-audio generation.

Home Page: https://tango2-web.github.io/

License: Other

Python 91.59% Shell 0.02% Makefile 0.03% Dockerfile 0.09% MDX 8.28%
audio-generation diffusion diffusion-models language-models large-language-models text-to-audio

tango's Introduction

TANGO: Text to Audio using iNstruction-Guided diffusiOn

Paper | Model | Website and Examples | More Examples | Huggingface Demo | Replicate demo and API

🎆 🧨 🔥 🎠 Mustango is now part of Tango! Enjoy generating music from textual prompts. Click here

🎆 🧨 🔥 🎠 The Mustango demo is out! Try it here

🎆 🧨 🔥 🎠 Meet Mustango, an exciting addition to the vibrant landscape of multimodal large language models designed for controlled music generation. Mustango leverages a Latent Diffusion Model (LDM), Flan-T5, and musical features to do the magic!

🔥 Tango has been accepted at ACM MM 2023.

🔥 The demo of TANGO is live on Hugging Face Spaces.

📣 We are releasing Tango-Full-FT-AudioCaps, which was first pre-trained on TangoPromptBank, a collection of diverse text-audio pairs, and later fine-tuned on AudioCaps. This checkpoint obtained state-of-the-art results for text-to-audio generation on AudioCaps.

📣 We are excited to share that Oracle Cloud has sponsored the project Tango.

Tango Model Family

| Model Name | Model Path |
|---|---|
| Tango | https://huggingface.co/declare-lab/tango |
| Tango-Full-FT-Audiocaps (state-of-the-art) | https://huggingface.co/declare-lab/tango-full-ft-audiocaps |
| Tango-Full-FT-Audio-Music-Caps | https://huggingface.co/declare-lab/tango-full-ft-audio-music-caps |
| Tango-Full | https://huggingface.co/declare-lab/tango-full |

Description

TANGO is a latent diffusion model (LDM) for text-to-audio (TTA) generation. TANGO can generate realistic audios, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. We perform comparably to current state-of-the-art models for TTA across both objective and subjective metrics, despite training the LDM on a 63-times-smaller dataset. We release our model, training and inference code, and pre-trained checkpoints for the research community.

Quickstart Guide

Download the TANGO model and generate audio from a text prompt:

import IPython
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
CheerClap.webm

The model will be downloaded automatically and saved in the cache. Subsequent runs will load the model directly from the cache.
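
If you prefer to keep the weights somewhere other than the default cache, one option (a sketch, assuming the checkpoint is fetched through the Hugging Face Hub so the standard cache environment variables apply) is to redirect the cache before loading:

import os

# Assumption: Tango resolves "declare-lab/tango" via the Hugging Face Hub,
# so redirecting the Hub cache keeps the weights in a project-local folder.
os.environ["HF_HOME"] = "./hf_cache"

from tango import Tango

tango = Tango("declare-lab/tango")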

The generate function uses 100 steps by default to sample from the latent diffusion model. We recommend using 200 steps for generating better-quality audio, though this comes at the cost of increased run-time.

prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
Thunder.webm

Use the generate_for_batch function to generate multiple audio samples for a batch of text prompts:

prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)

This will generate two samples for each of the three text prompts.
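
As a small usage sketch (assuming generate_for_batch returns, for each prompt, a list of waveforms of length samples), you could save every generated sample like this:

import soundfile as sf

# Hypothetical indexing: audios[i][j] is the j-th sample for the i-th prompt.
for i, prompt in enumerate(prompts):
    for j, audio in enumerate(audios[i]):
        sf.write(f"prompt_{i}_sample_{j}.wav", audio, samplerate=16000)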

More generated samples are shown here.

Prerequisites

Our code is built on PyTorch 1.13.1+cu117. We pin torch==1.13.1 in the requirements file, but you might need to install a specific CUDA build of torch depending on your GPU.

Install the dependencies from requirements.txt:

git clone https://github.com/declare-lab/tango/
cd tango
pip install -r requirements.txt

You might also need to install libsndfile1 for soundfile to work properly on Linux:

(sudo) apt-get install libsndfile1

Datasets

Follow the instructions given in the AudioCaps repository for downloading the data. The audio locations and corresponding captions are provided in our data directory. The *.json files are used for training and evaluation. Once you have downloaded your copy of the data, you should be able to map it to the file locations provided in our data/*.json files using the file ids.
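
As a rough illustration of that mapping (a sketch only: the json file name, the field names, and the line-per-entry layout are assumptions about the data/*.json files, so adjust them to what you find in the repository), you could rewrite the provided locations to point at your local download:

import json
import os

AUDIOCAPS_DIR = "/path/to/your/audiocaps/audio"   # hypothetical local download directory
SOURCE_JSON = "data/train_audiocaps.json"          # hypothetical file name from the data directory

remapped = []
with open(SOURCE_JSON) as f:
    for line in f:
        item = json.loads(line)                    # assumed fields: "dataset", "location", "captions"
        file_id = os.path.basename(item["location"])
        local_path = os.path.join(AUDIOCAPS_DIR, file_id)
        if os.path.exists(local_path):             # keep only the clips you actually have
            item["location"] = local_path
            remapped.append(item)

with open("data/train_audiocaps_local.json", "w") as f:
    for item in remapped:
        f.write(json.dumps(item) + "\n")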

Note that we cannot distribute the data because of copyright issues.

How to train?

We use the accelerate package from Hugging Face for multi-GPU training. Run accelerate config from the terminal and set up your run configuration by answering the questions asked.

You can now train TANGO on the AudioCaps dataset using:

accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config.json" \
--freeze_text_encoder --augment --snr_gamma 5

The argument --augment uses augmented data for training, as reported in our paper. We recommend training for at least 40 epochs, which is the default in train.py.

To start training from our released checkpoint, use the --hf_model argument.

accelerate launch train.py \
--hf_model "declare-lab/tango" \
--unet_model_config="configs/diffusion_model_config.json" \
--freeze_text_encoder --augment --snr_gamma 5

Check train.py and train.sh for the full list of arguments and how to use them.

The training script should automatically download the AudioLDM weights from here. However, if the download is slow or if you face any other issues, you can: i) download the audioldm-s-full file from here, ii) rename it to audioldm-s-full.ckpt, and iii) keep it in the /home/user/.cache/audioldm/ directory.
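
For steps (ii) and (iii), a small helper along these lines (the cache path comes from the instructions above; the source path is wherever you saved the manual download) can put the file where the training script expects it:

import os
import shutil

def install_audioldm_checkpoint(downloaded_path):
    """Copy a manually downloaded AudioLDM checkpoint into ~/.cache/audioldm/."""
    cache_dir = os.path.expanduser("~/.cache/audioldm")
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, "audioldm-s-full.ckpt")
    shutil.copyfile(downloaded_path, target)
    return target

# Example (hypothetical source path):
# install_audioldm_checkpoint("/downloads/audioldm-s-full")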

How to make inferences?

From your trained checkpoints

Checkpoints from training will be saved in the saved/*/ directory.

To perform audio generation and objective evaluation on the AudioCaps test set from your trained checkpoint:

CUDA_VISIBLE_DEVICES=0 python inference.py \
--original_args="saved/*/summary.jsonl" \
--model="saved/*/best/pytorch_model_2.bin" \

Check inference.py and inference.sh for the full list of arguments and how to use them.

From our released checkpoints in Hugging Face Hub

To perform audio generation and objective evaluation on the AudioCaps test set from our Hugging Face checkpoints:

python inference_hf.py --checkpoint="declare-lab/tango"

Note

We use functionalities from audioldm_eval for objective evaluation in inference.py. It requires the gold reference audio files and generated audio files to have the same names. You need to create the directory data/audiocaps_test_references/subset and keep the reference audio files there. The files should be named output_0.wav, output_1.wav, and so on, where the indices correspond to the line indices in data/test_audiocaps_subset.json.
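
A sketch of preparing that reference directory (assuming you keep the original AudioCaps test clips locally and that each line of data/test_audiocaps_subset.json carries the clip's location; both are assumptions to adapt to your setup):

import json
import os
import shutil

REFERENCE_DIR = "data/audiocaps_test_references/subset"
os.makedirs(REFERENCE_DIR, exist_ok=True)

with open("data/test_audiocaps_subset.json") as f:
    for index, line in enumerate(f):
        item = json.loads(line)   # assumed to contain a "location" field for the reference wav
        shutil.copyfile(item["location"], os.path.join(REFERENCE_DIR, f"output_{index}.wav"))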

We use the term subset because some data instances originally released in AudioCaps have since been removed from YouTube and are no longer available. We thus evaluated our models on all the instances that were available as of 8th April, 2023.

We use wandb to log training and inference results.

Experimental Results

| Model | Datasets | Text | #Params | FD ↓ | KL ↓ | FAD ↓ | OVL ↑ | REL ↑ |
|---|---|---|---|---|---|---|---|---|
| Ground truth | − | − | − | − | − | − | 91.61 | 86.78 |
| DiffSound | AS+AC | ✓ | 400M | 47.68 | 2.52 | 7.75 | − | − |
| AudioGen | AS+AC+8 others | ✗ | 285M | − | 2.09 | 3.13 | − | − |
| AudioLDM-S | AC | ✗ | 181M | 29.48 | 1.97 | 2.43 | − | − |
| AudioLDM-L | AC | ✗ | 739M | 27.12 | 1.86 | 2.08 | − | − |
| AudioLDM-M-Full-FT‡ | AS+AC+2 others | ✗ | 416M | 26.12 | 1.26 | 2.57 | 79.85 | 76.84 |
| AudioLDM-L-Full‡ | AS+AC+2 others | ✗ | 739M | 32.46 | 1.76 | 4.18 | 78.63 | 62.69 |
| AudioLDM-L-Full-FT | AS+AC+2 others | ✗ | 739M | 23.31 | 1.59 | 1.96 | − | − |
| TANGO | AC | ✓ | 866M | 24.52 | 1.37 | 1.59 | 85.94 | 80.36 |

Citation

Please consider citing the following article if you found our work useful:

@article{ghosal2023tango,
  title={Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model},
  author={Ghosal, Deepanway and Majumder, Navonil and Mehrish, Ambuj and Poria, Soujanya},
  journal={arXiv preprint arXiv:2304.13731},
  year={2023}
}

Limitations

TANGO is trained on the small AudioCaps dataset so it may not generate good audio samples related to concepts that it has not seen in training (e.g. singing). For the same reason, TANGO is not always able to finely control its generations over textual control prompts. For example, the generations from TANGO for prompts Chopping tomatoes on a wooden table and Chopping potatoes on a metal table are very similar. Chopping vegetables on a table also produces similar audio samples. Training text-to-audio generation models on larger datasets is thus required for the model to learn the composition of textual concepts and varied text-audio mappings.

We are training another version of TANGO on larger datasets to enhance its generalization, compositional and controllable generation ability.

Acknowledgement

We borrow the code in audioldm and audioldm_eval from the AudioLDM repositories. We thank the AudioLDM team for open-sourcing their code.

tango's People

Contributors

chenxwh, deepanwayx, dependabot[bot], nmder, soujanyaporia


tango's Issues

Noisy audio samples

Hi,

I am trying to reproduce the results and have run the inference code. However, the generated audio samples are completely noisy. Any suggestions as to what might be going wrong here? I am sharing my inference.py script and the command I have used to run the code:

CUDA_VISIBLE_DEVICES=0 python inference.py --test_file="../audiocaps/test_audiocaps_subset.json" --text_encoder_name="google/flan-t5-large" --scheduler_name="configs/stable_diffusion_2.1.json" --unet_model_config="configs/diffusion_model_config.json" --model="../audioldm-s-full.ckpt" --batch_size=6

tango_files.zip

about tango-full-ft-audiocaps

Hi, thanks for your great open source work!

In your work, I noticed that you used the AudioCaps dataset to fine-tune the tango-full checkpoint.
Could you share the command for your fine-tuning process?
Do I need to modify the learning rate (default 3e-5), and should I use --hf_model or --resume_from_checkpoint in the command?

Looking forward to your reply, thanks again! 😊

Could you provide the ChatGPT prompt of sound?

The latter is obtained by prompting ChatGPT to explain the sound generated when a boat moves on the sea. Using this ChatGPT-generated description of the sound, TANGO provides superior results.

So, could you provide the ChatGPT prompt of sound?

After reconstructing an audio clip with just the VAE, I only get noise

Here is my code. Is there something wrong with how I am using the VAE?

def recon_vae(self, filename):
    """Reconstruct audio using only the VAE (encode to latent, then decode back)."""
    with torch.no_grad():
        # Load and resample to 16 kHz mono
        waveform, sample_rate = torchaudio.load(filename)
        waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)[0]
        # Normalize
        waveform = waveform - torch.mean(waveform)
        waveform = waveform / (torch.max(torch.abs(waveform)) + 1e-8)
        waveform = 0.5 * waveform
        waveform = waveform / torch.max(torch.abs(waveform))
        waveform = 0.5 * waveform

        audio = torch.unsqueeze(waveform, 0)
        audio = torch.nan_to_num(torch.clip(audio, -1, 1))
        audio = torch.autograd.Variable(audio, requires_grad=False)
        # Mel spectrogram -> VAE latent -> reconstructed mel -> waveform
        melspec, log_magnitudes_stft, energy = self.stft.mel_spectrogram(audio)
        melspec = melspec.transpose(1, 2)
        melspec = melspec.unsqueeze(1)
        truth_latent = self.vae.get_first_stage_encoding(self.vae.encode_first_stage(melspec))

        mel_recon = self.vae.decode_first_stage(truth_latent)
        wave = self.vae.decode_to_waveform(mel_recon)
    return wave[0], waveform

Sound cloning

I'm looking to build on your research. I understand this isn't within the scope of your project; I'm just curious and wanted the creators' thoughts. I want to retrain, repurpose, and experiment with this for expressive TTS instead of generic TTS. I'm somewhat new to working with these models.

OBJECTIVES ->

  1. retrain on a more dynamic dataset
  2. synthetic dataset -> speech/text[w/special utterances]{real speech/lofi speech from 'BARK'}, speech w/synthetic audio environments generated by 'tango'/text[I have a rather large dataset]

EXPECTATIONS ->

  1. most expressive hybrid TTS [TTS with semantically conditioned background environments]

QUESTIONS ->

  1. What are your thoughts on approaching voice cloning with this style of architecture? I figure I should approach it like inpainting?
  2. If possible, wouldn't it clone any artifacts contained in the speech audio?

CLOSING THOUGHTS ->
I'm open to sharing my results with you privately. I appreciate your contribution to the community.

16 GB of GPU memory runs out

Hi.

I'm trying to train this model on a single P100 with 16 GB memory but seem to be running out of memory with a batch size of 2. Do I need more than 16 GB for this model? How can I reduce the GPU memory usage?

Cheers,

About data augment

Hi!
Thanks so much for this work! When I tried to train the model on AudioCaps (didn't change the training script other than file paths), I got this issue:
File "/tango/train.py", line 553, in
main()
File "/tango/train.py", line 459, in main
mixed_mel, _, _, mixed_captions = torch_tools.augment_wav_to_fbank(audios, text, len(audios),
File "/tango/tools/torch_tools.py", line 118, in augment_wav_to_fbank
waveform, captions = augment(paths, texts)
File "/tango/tools/torch_tools.py", line 108, in augment
waveform = torch.tensor(np.concatenate(mixed_sounds, 0))
File "<array_function internals>", line 180, in concatenate
ValueError: need at least one array to concatenate

It would be highly appreciated if you could kindly help me with this problem, thanks!

Downloading AudioCaps data

Hi,

I'm trying to download the AudioCaps data in order to train the Tango model. However, I'm not seeing any instructions in the AudioCaps repository on how to download it. Can you share any scripts or instructions on how to download and format the audio to train Tango?

Thanks!

The demo won't work if sentencepiece is not installed

Hello,

I created a new environment with Python 3.8, installed all the requirements.txt dependencies, and then ran the demo. After following the steps of the Jupyter notebook in #10 (comment), it didn't work. I solved it by installing sentencepiece (pip install sentencepiece).

The demo could not run without sentencepiece; I was getting this error:
OSError: [Errno 22] Invalid argument: '/home/pc/.cache/hub/models--google--flan-t5-large/blobs/W/"031ce30f2198693aabefeac9588257b81658d7f4.lock'

I have solved it already; if someone else has this problem, here is the solution.

Inference on server without GPU

CUDA is required for this to work. Can you enable CPU inference on a server? I know it could be slow, but I don't mind the wait for experimentation purposes.

License

Hi, thanks for making this great model. On your webpage, you claim that this model is "open source," however the CC-BY-NC-ND license is not open source. According to the OSI, open source software must not restrict commercial use:

  1. No Discrimination Against Fields of Endeavor
    The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

I would deeply appreciate it if you could make this model truly open source. Thank you!

apple m1?

The error message I've got:

  File "tango/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 221, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Hardware.

Hello.

Amazing work.

What kind of hardware should I expect to need to be able to run the model?

Thank you.

Training on own model

Hi, amazing work all!
From what I understand it is possible to train the model on my own data. I've got a sound library, which I want to use to train the model. Do I just replace the .json elements in the data folder?

/Update:
I replaced the data folder json's with mine, and after some back and forth I'm stuck at the following error. Of course the columns match:

ValueError: Couldn't cast
dataset: string
location: string
to
{'dataset': Value(dtype='string', id=None), 'location': Value(dtype='string', id=None), 'captions': Value(dtype='string', id=None)}
because column names don't match

Perhaps the issue is how the file is made? I pulled the metadata into xlsx, then made a Python script to produce the json and did some formatting fixes.

Thanks!

P.S. How big is your training setup in terms of GPUs?

Nan loss in training

Hi
Thanks for sharing your project.
When I trained your model based on your config, the train and val loss was NaN.
I tried many times but the results are still the same.
Can you tell me the possible reasons and how to solve it?

Inference

Hello, during the inference phase, do I only need to use the 886 audio files from your data/test_audiocaps_subset.json? I have been unable to obtain the results from your paper, even when using your checkpoint.

Update Requirement infos (ipython, libsndfile1)

Congratulations on your great project!

After going through your instructions, I found that I needed to install IPython manually. Please consider adding ipython to the requirements list.

Additionally, for soundfile to function properly (at least on Linux in my case), libsndfile1 is necessary. Maybe you also want to add this to the prerequisites documentation:
(sudo) apt-get install libsndfile1

Is it possible to do fine tuning and incremental training?

I'm curious if it would be possible to fine tune the TANGO model through incremental training.

For example, I'd like to start with the audiocaps dataset, then fine tune the model with new audio datasets on a recurring basis.

Do you have any thoughts on how this could be achieved?

don't have a correct `repo_id` and `repo_type`?

import IPython
import soundfile as sf
from tango import Tango

print("init tango begin!")
tango = Tango("declare-lab/tango")
print("init tango!")

prompt = "An audience cheering and clapping"
print("prompt input!")

audio = tango.generate(prompt)
print("audio generate!")

sf.write(f"{prompt}.wav", audio, samplerate=16000)
# IPython.display.Audio(data=audio, rate=16000)
@deepanwayx @nmder @soujanyaporia
When I run the code, the following error occurs:

(screenshot: 2023-04-30 03-49-50)

It seems that I don't have a correct repo_id and repo_type.
How can I deal with it?

Is possible an inference with just a RTX 3080 of 10GB?

Hello,

I know it is very little memory, but it is what I have by now.

By default, the demo code won't run inference because of CUDA out-of-memory errors. I tried reducing the inference batch size to just 1, but it is not enough.

Do you know a way to reduce the memory consumption when running the inference?

I know that the best solution is to upgrade the GPU to a RTX 3090/4090/A6000, but before that I would like to try another way if possible.

Thank you!

David Martin Rius

AttributeError: 'AudioDiffusion' object has no attribute 'device'

Hi,

I am trying to run the following command from the README.md

accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config.json" \
--freeze_text_encoder --augment --snr_gamma 5 \

However, I am getting this AttributeError

Traceback (most recent call last):
  File "/Users/fielguhit/Workspace/tango/train.py", line 534, in <module>
    main()
  File "/Users/fielguhit/Workspace/tango/train.py", line 434, in main
    device = model.device
  File "/Users/fielguhit/.pyenv/versions/3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AudioDiffusion' object has no attribute 'device'

Can you please advise?

encoder_attention_mask error

Hello there, thanks a lot for this awesome tool.
I have a problem running it; I get the error below.
Note that I use the CPU method, because I get an OOM error when I use the GPU, since I only have 8 GB of VRAM.

PS C:\Users\Genesis\anaconda3\tango> & C:/Users/Genesis/anaconda3/envs/tango/python.exe c:/Users/Genesis/anaconda3/tango/generate.py
Fetching 8 files: 100%|██████████| 8/8 [00:00<?, ?it/s]
c:\Users\Genesis\anaconda3\tango\audioldm\audio\stft.py:42: FutureWarning: Pass size=1024 as keyword args. From version 0.10 passing these as positional arguments will result in an error
fft_window = pad_center(fft_window, filter_length)
c:\Users\Genesis\anaconda3\tango\audioldm\audio\stft.py:151: FutureWarning: Pass sr=16000, n_fft=1024, n_mels=64, fmin=0, fmax=8000 as keyword args. From version 0.10 passing these as positional arguments will result in an error
mel_basis = librosa_mel_fn(
UNet initialized randomly.
Some weights of the model checkpoint at google/flan-t5-large were not used when initializing T5EncoderModel: ['decoder.block.16.layer.0.SelfAttention.v.weight', 'decoder.block.14.layer.0.layer_norm.weight', 'decoder.block.7.layer.0.layer_norm.weight', 'decoder.block.10.layer.1.layer_norm.weight', 'decoder.block.13.layer.0.SelfAttention.v.weight', 'decoder.block.20.layer.2.DenseReluDense.wo.weight', 'decoder.block.6.layer.1.EncDecAttention.v.weight', 'decoder.block.16.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.1.EncDecAttention.k.weight', 'decoder.block.2.layer.0.layer_norm.weight', 'decoder.block.8.layer.2.layer_norm.weight', 'decoder.block.16.layer.1.EncDecAttention.v.weight', 'decoder.block.13.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.3.layer.1.EncDecAttention.k.weight', 'decoder.block.4.layer.0.SelfAttention.q.weight', 'decoder.block.1.layer.0.layer_norm.weight', 'decoder.block.20.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.10.layer.1.EncDecAttention.v.weight', 'decoder.block.8.layer.1.layer_norm.weight', 'decoder.block.11.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.1.layer_norm.weight', 'decoder.block.1.layer.2.DenseReluDense.wo.weight', 'decoder.block.13.layer.0.SelfAttention.q.weight', 'decoder.block.16.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.3.layer.0.SelfAttention.v.weight', 'decoder.block.16.layer.0.layer_norm.weight', 'decoder.block.11.layer.1.EncDecAttention.k.weight', 'decoder.block.18.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.17.layer.1.EncDecAttention.o.weight', 'decoder.block.7.layer.1.EncDecAttention.v.weight', 'decoder.block.11.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.13.layer.0.SelfAttention.k.weight', 'decoder.block.3.layer.0.SelfAttention.o.weight', 'decoder.block.13.layer.2.layer_norm.weight', 'decoder.block.12.layer.0.SelfAttention.k.weight', 'decoder.block.18.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.2.DenseReluDense.wo.weight', 'decoder.block.20.layer.0.SelfAttention.v.weight', 'decoder.block.4.layer.1.EncDecAttention.v.weight', 'decoder.block.8.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.14.layer.1.EncDecAttention.o.weight', 'decoder.block.7.layer.0.SelfAttention.o.weight', 'decoder.block.20.layer.0.SelfAttention.q.weight', 'decoder.block.21.layer.0.SelfAttention.k.weight', 'decoder.block.8.layer.0.SelfAttention.v.weight', 'decoder.block.20.layer.2.layer_norm.weight', 'decoder.block.23.layer.1.EncDecAttention.q.weight', 'decoder.block.17.layer.1.EncDecAttention.q.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.10.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.0.layer.1.EncDecAttention.v.weight', 'decoder.block.5.layer.1.EncDecAttention.k.weight', 'decoder.block.5.layer.0.SelfAttention.k.weight', 'decoder.block.8.layer.1.EncDecAttention.k.weight', 'decoder.block.8.layer.0.SelfAttention.o.weight', 'decoder.block.9.layer.1.layer_norm.weight', 'decoder.block.14.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.20.layer.1.EncDecAttention.k.weight', 'decoder.block.0.layer.1.EncDecAttention.o.weight', 'decoder.block.4.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.1.EncDecAttention.o.weight', 'decoder.block.15.layer.1.EncDecAttention.v.weight', 'decoder.block.6.layer.2.layer_norm.weight', 'decoder.block.12.layer.0.SelfAttention.o.weight', 'decoder.block.13.layer.1.EncDecAttention.q.weight', 
'decoder.block.21.layer.1.layer_norm.weight', 'decoder.block.4.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.1.EncDecAttention.o.weight', 'decoder.block.18.layer.0.SelfAttention.k.weight', 'decoder.block.16.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.11.layer.2.DenseReluDense.wo.weight', 'decoder.block.13.layer.1.EncDecAttention.o.weight', 'decoder.block.20.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.7.layer.0.SelfAttention.k.weight', 'decoder.block.13.layer.1.EncDecAttention.v.weight', 'decoder.block.23.layer.0.layer_norm.weight', 'decoder.block.17.layer.0.SelfAttention.o.weight', 'decoder.block.15.layer.0.SelfAttention.q.weight', 'decoder.block.17.layer.0.layer_norm.weight', 'decoder.block.11.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.1.layer.0.SelfAttention.q.weight', 'decoder.block.16.layer.0.SelfAttention.k.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.8.layer.0.SelfAttention.k.weight', 'decoder.block.20.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.13.layer.0.SelfAttention.o.weight', 'decoder.block.11.layer.2.layer_norm.weight', 'decoder.block.1.layer.1.layer_norm.weight', 'decoder.block.20.layer.1.EncDecAttention.q.weight', 'decoder.block.22.layer.1.EncDecAttention.k.weight', 'decoder.block.6.layer.0.SelfAttention.k.weight', 'decoder.block.18.layer.1.EncDecAttention.v.weight', 'decoder.block.9.layer.0.layer_norm.weight', 'decoder.block.23.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.2.layer.0.SelfAttention.k.weight', 'decoder.block.4.layer.2.layer_norm.weight',
'decoder.block.1.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.13.layer.0.layer_norm.weight', 'decoder.block.2.layer.1.layer_norm.weight', 'decoder.block.6.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.10.layer.1.EncDecAttention.o.weight', 'decoder.block.21.layer.2.DenseReluDense.wo.weight', 'decoder.block.3.layer.1.EncDecAttention.q.weight', 'decoder.block.12.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.2.layer_norm.weight', 'decoder.block.19.layer.0.SelfAttention.k.weight', 'decoder.block.8.layer.0.SelfAttention.q.weight', 'decoder.block.18.layer.0.SelfAttention.v.weight', 'decoder.block.6.layer.1.EncDecAttention.o.weight', 'decoder.block.5.layer.2.DenseReluDense.wo.weight', 'decoder.block.8.layer.0.layer_norm.weight', 'decoder.block.21.layer.0.SelfAttention.o.weight', 'decoder.block.1.layer.2.layer_norm.weight', 'decoder.block.22.layer.2.layer_norm.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.9.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.15.layer.1.EncDecAttention.o.weight', 'decoder.block.23.layer.2.layer_norm.weight', 'decoder.block.19.layer.1.EncDecAttention.q.weight', 'decoder.block.15.layer.1.EncDecAttention.k.weight', 'decoder.block.19.layer.0.layer_norm.weight', 'decoder.block.17.layer.2.layer_norm.weight', 'decoder.block.18.layer.1.layer_norm.weight', 'decoder.block.21.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.20.layer.1.layer_norm.weight', 'decoder.block.3.layer.1.layer_norm.weight', 'decoder.block.5.layer.0.SelfAttention.o.weight', 'decoder.block.3.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.16.layer.1.layer_norm.weight', 'decoder.block.11.layer.1.layer_norm.weight', 'decoder.block.3.layer.0.SelfAttention.k.weight', 'decoder.block.9.layer.0.SelfAttention.k.weight', 'decoder.block.20.layer.1.EncDecAttention.o.weight', 'decoder.block.19.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.4.layer.1.layer_norm.weight', 'decoder.block.11.layer.0.SelfAttention.v.weight', 'decoder.block.18.layer.0.layer_norm.weight', 'decoder.block.23.layer.0.SelfAttention.q.weight', 'decoder.block.0.layer.1.EncDecAttention.q.weight', 'decoder.block.16.layer.0.SelfAttention.o.weight', 'decoder.block.14.layer.0.SelfAttention.o.weight', 'decoder.block.21.layer.1.EncDecAttention.o.weight', 'decoder.block.6.layer.0.SelfAttention.v.weight', 'decoder.block.19.layer.2.layer_norm.weight', 'decoder.block.8.layer.1.EncDecAttention.q.weight', 'decoder.block.10.layer.1.EncDecAttention.k.weight', 'decoder.block.4.layer.2.DenseReluDense.wo.weight', 'decoder.block.17.layer.1.EncDecAttention.v.weight', 'decoder.block.7.layer.1.EncDecAttention.q.weight', 'decoder.block.13.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.18.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.2.layer_norm.weight', 'decoder.block.0.layer.0.layer_norm.weight', 'decoder.block.7.layer.1.EncDecAttention.o.weight', 'decoder.block.8.layer.2.DenseReluDense.wo.weight', 'decoder.embed_tokens.weight', 'decoder.block.12.layer.1.EncDecAttention.v.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.5.layer.1.EncDecAttention.v.weight', 'decoder.block.2.layer.0.SelfAttention.o.weight', 'decoder.block.22.layer.0.layer_norm.weight', 'decoder.block.6.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.23.layer.2.DenseReluDense.wo.weight', 'decoder.block.5.layer.1.EncDecAttention.o.weight', 'decoder.block.11.layer.1.EncDecAttention.o.weight', 'decoder.block.3.layer.1.EncDecAttention.o.weight', 
'decoder.block.14.layer.0.SelfAttention.k.weight', 'decoder.block.14.layer.2.layer_norm.weight', 'decoder.block.7.layer.0.SelfAttention.v.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.16.layer.0.SelfAttention.q.weight', 'decoder.block.9.layer.2.DenseReluDense.wo.weight', 'decoder.block.7.layer.2.DenseReluDense.wo.weight', 'decoder.block.22.layer.0.SelfAttention.q.weight', 'decoder.block.4.layer.0.SelfAttention.o.weight', 'decoder.block.19.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.1.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.4.layer.1.EncDecAttention.k.weight', 'decoder.block.20.layer.0.layer_norm.weight', 'decoder.block.10.layer.0.layer_norm.weight', 'decoder.block.6.layer.0.SelfAttention.q.weight', 'decoder.block.12.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.3.layer.0.SelfAttention.q.weight', 'decoder.block.10.layer.0.SelfAttention.q.weight', 'decoder.block.9.layer.0.SelfAttention.o.weight', 'decoder.block.0.layer.0.SelfAttention.k.weight', 'decoder.block.6.layer.0.layer_norm.weight', 'decoder.block.13.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.17.layer.1.EncDecAttention.k.weight', 'decoder.block.8.layer.1.EncDecAttention.v.weight', 'decoder.block.17.layer.2.DenseReluDense.wo.weight', 'decoder.block.18.layer.2.layer_norm.weight', 'decoder.block.6.layer.2.DenseReluDense.wo.weight', 'decoder.block.10.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.0.SelfAttention.q.weight', 'decoder.block.1.layer.1.EncDecAttention.q.weight', 'decoder.block.7.layer.2.layer_norm.weight', 'decoder.block.7.layer.0.SelfAttention.q.weight', 'decoder.block.4.layer.0.layer_norm.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.14.layer.1.EncDecAttention.v.weight', 'decoder.block.18.layer.1.EncDecAttention.k.weight', 'decoder.block.10.layer.0.SelfAttention.v.weight', 'decoder.block.20.layer.0.SelfAttention.o.weight', 'decoder.block.15.layer.0.SelfAttention.o.weight', 'decoder.block.12.layer.0.SelfAttention.q.weight', 'decoder.block.1.layer.0.SelfAttention.o.weight', 'decoder.block.2.layer.0.SelfAttention.v.weight', 'decoder.block.4.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.2.DenseReluDense.wo.weight', 'decoder.block.19.layer.1.EncDecAttention.o.weight', 'decoder.block.23.layer.1.EncDecAttention.k.weight', 'decoder.block.5.layer.0.SelfAttention.q.weight', 'decoder.block.12.layer.0.layer_norm.weight', 'decoder.block.14.layer.1.EncDecAttention.k.weight', 'decoder.block.17.layer.0.SelfAttention.q.weight', 'decoder.block.21.layer.2.layer_norm.weight', 'decoder.block.10.layer.1.EncDecAttention.q.weight', 'decoder.block.19.layer.0.SelfAttention.q.weight', 'decoder.block.5.layer.0.SelfAttention.v.weight', 'decoder.block.19.layer.1.EncDecAttention.k.weight', 'decoder.block.1.layer.0.SelfAttention.v.weight', 'decoder.block.12.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.18.layer.1.EncDecAttention.q.weight', 'decoder.block.12.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.0.SelfAttention.k.weight', 'decoder.block.3.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.7.layer.1.EncDecAttention.k.weight', 'decoder.block.12.layer.1.EncDecAttention.o.weight', 'decoder.block.16.layer.1.EncDecAttention.k.weight', 'decoder.block.14.layer.0.SelfAttention.v.weight', 'decoder.block.23.layer.1.layer_norm.weight', 'decoder.block.1.layer.1.EncDecAttention.k.weight', 'decoder.block.6.layer.1.EncDecAttention.k.weight', 
'decoder.block.16.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.6.layer.0.SelfAttention.o.weight', 'decoder.block.14.layer.2.DenseReluDense.wo.weight', 'decoder.block.12.layer.1.EncDecAttention.k.weight', 'decoder.block.15.layer.0.layer_norm.weight', 'decoder.block.22.layer.1.EncDecAttention.v.weight', 'decoder.block.23.layer.0.SelfAttention.k.weight', 'decoder.block.12.layer.2.layer_norm.weight', 'decoder.block.7.layer.1.layer_norm.weight', 'decoder.block.0.layer.1.layer_norm.weight', 'decoder.block.5.layer.2.layer_norm.weight', 'decoder.block.22.layer.0.SelfAttention.v.weight', 'decoder.block.11.layer.0.layer_norm.weight', 'decoder.block.1.layer.0.SelfAttention.k.weight', 'decoder.block.13.layer.1.layer_norm.weight', 'decoder.block.23.layer.0.SelfAttention.o.weight', 'decoder.block.17.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.15.layer.1.EncDecAttention.q.weight', 'decoder.block.21.layer.0.layer_norm.weight', 'decoder.block.12.layer.1.layer_norm.weight', 'decoder.block.19.layer.2.DenseReluDense.wo.weight', 'decoder.block.22.layer.0.SelfAttention.o.weight', 'lm_head.weight', 'decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight', 'decoder.block.11.layer.1.EncDecAttention.q.weight', 'decoder.block.16.layer.2.layer_norm.weight', 'decoder.block.14.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.19.layer.0.SelfAttention.o.weight', 'decoder.block.18.layer.2.DenseReluDense.wi_1.weight', 'decoder.final_layer_norm.weight', 'decoder.block.10.layer.0.SelfAttention.o.weight', 'decoder.block.17.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.15.layer.2.layer_norm.weight', 'decoder.block.19.layer.1.layer_norm.weight', 'decoder.block.20.layer.1.EncDecAttention.v.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.5.layer.1.EncDecAttention.q.weight', 'decoder.block.14.layer.1.EncDecAttention.q.weight', 'decoder.block.23.layer.1.EncDecAttention.v.weight', 'decoder.block.22.layer.2.DenseReluDense.wo.weight', 'decoder.block.12.layer.0.SelfAttention.v.weight', 'decoder.block.1.layer.1.EncDecAttention.o.weight', 'decoder.block.21.layer.0.SelfAttention.q.weight', 'decoder.block.11.layer.0.SelfAttention.o.weight', 'decoder.block.18.layer.0.SelfAttention.o.weight', 'decoder.block.5.layer.0.layer_norm.weight', 'decoder.block.21.layer.0.SelfAttention.v.weight', 'decoder.block.9.layer.1.EncDecAttention.v.weight', 'decoder.block.2.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.9.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.18.layer.0.SelfAttention.q.weight', 'decoder.block.8.layer.1.EncDecAttention.o.weight', 'decoder.block.17.layer.0.SelfAttention.v.weight', 'decoder.block.3.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.1.layer_norm.weight', 'decoder.block.0.layer.0.SelfAttention.v.weight', 'decoder.block.21.layer.1.EncDecAttention.q.weight', 'decoder.block.3.layer.2.layer_norm.weight', 'decoder.block.16.layer.1.EncDecAttention.q.weight', 'decoder.block.9.layer.1.EncDecAttention.q.weight', 'decoder.block.15.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.14.layer.1.layer_norm.weight', 'decoder.block.1.layer.1.EncDecAttention.v.weight', 'decoder.block.9.layer.2.layer_norm.weight', 'decoder.block.2.layer.0.SelfAttention.q.weight', 'decoder.block.19.layer.0.SelfAttention.v.weight', 'decoder.block.19.layer.1.EncDecAttention.v.weight', 'decoder.block.3.layer.0.layer_norm.weight', 'decoder.block.2.layer.2.DenseReluDense.wo.weight', 'decoder.block.4.layer.0.SelfAttention.v.weight', 
'decoder.block.0.layer.0.SelfAttention.o.weight', 'decoder.block.23.layer.0.SelfAttention.v.weight', 'decoder.block.14.layer.0.SelfAttention.q.weight',
'decoder.block.22.layer.1.EncDecAttention.q.weight', 'decoder.block.15.layer.0.SelfAttention.v.weight', 'decoder.block.3.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.8.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.21.layer.1.EncDecAttention.v.weight', 'decoder.block.10.layer.2.layer_norm.weight', 'decoder.block.9.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.1.EncDecAttention.k.weight', 'decoder.block.2.layer.1.EncDecAttention.v.weight', 'decoder.block.2.layer.1.EncDecAttention.q.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.17.layer.1.layer_norm.weight', 'decoder.block.9.layer.1.EncDecAttention.k.weight', 'decoder.block.21.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.23.layer.1.EncDecAttention.o.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.17.layer.0.SelfAttention.k.weight', 'decoder.block.21.layer.1.EncDecAttention.k.weight', 'decoder.block.9.layer.0.SelfAttention.v.weight', 'decoder.block.11.layer.1.EncDecAttention.v.weight', 'decoder.block.5.layer.1.layer_norm.weight', 'decoder.block.9.layer.0.SelfAttention.q.weight', 'decoder.block.23.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.6.layer.1.layer_norm.weight', 'decoder.block.11.layer.0.SelfAttention.q.weight', 'decoder.block.13.layer.1.EncDecAttention.k.weight', 'decoder.block.22.layer.0.SelfAttention.k.weight', 'decoder.block.6.layer.1.EncDecAttention.q.weight']

  • This IS expected if you are initializing T5EncoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing T5EncoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Successfully loaded checkpoint from: declare-lab/tango
    c:\Users\Genesis\anaconda3\tango\models.py:223: FutureWarning: Accessing config attribute in_channels directly via 'UNet2DConditionModel' object attribute is deprecated. Please access 'in_channels' over 'UNet2DConditionModel's config object instead, e.g. 'unet.config.in_channels'.
    num_channels_latents = self.unet.in_channels
    Traceback (most recent call last):
    File "c:\Users\Genesis\anaconda3\tango\generate.py", line 8, in
    audio = tango.generate(prompt, steps=50)
    File "c:\Users\Genesis\anaconda3\tango\tango.py", line 46, in generate
    latents = self.model.inference([prompt], self.scheduler, steps, guidance, samples, disable_progress=disable_progress)
    File "C:\Users\Genesis\anaconda3\envs\tango\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
    File "c:\Users\Genesis\anaconda3\tango\models.py", line 234, in inference
    noise_pred = self.unet(
    File "C:\Users\Genesis\anaconda3\envs\tango\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
    TypeError: UNet2DConditionModel.forward() got an unexpected keyword argument 'encoder_attention_mask'

Cannot install on Windows

git clone https://github.com/declare-lab/tango/
cd tango
pip install -r requirements.txt

Ends with

Collecting accelerate==0.18.0
  Using cached accelerate-0.18.0-py3-none-any.whl (215 kB)
Collecting bitsandbytes==0.38.1
  Using cached bitsandbytes-0.38.1-py3-none-any.whl (104.3 MB)
Collecting black==22.3.0
  Using cached black-22.3.0-cp310-cp310-win_amd64.whl (1.1 MB)
Collecting compel==1.1.3
  Using cached compel-1.1.3-py3-none-any.whl (24 kB)
Collecting d4rl==1.1
  Using cached d4rl-1.1-py3-none-any.whl (26.4 MB)
Collecting data_generator==1.0.1
  Using cached Data_Generator-1.0.1-py3-none-any.whl (11 kB)
Collecting deepdiff==6.3.0
  Using cached deepdiff-6.3.0-py3-none-any.whl (69 kB)
Collecting diffusion==6.9.1
  Using cached diffusion-6.9.1-1-py3-none-any.whl (179 kB)
Collecting einops==0.6.1
  Using cached einops-0.6.1-py3-none-any.whl (42 kB)
Collecting flash_attn==1.0.3.post0
  Using cached flash_attn-1.0.3.post0.tar.gz (2.0 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "C:\Users\Jason\AppData\Local\Temp\pip-install-ftl4rk2a\flash-attn_362c7a8bea504f679c59d4a480161f3e\setup.py", line 6, in <module>
          from packaging.version import parse, Version
      ModuleNotFoundError: No module named 'packaging'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Reproduce result

How can I reproduce my evaluation result? I noticed that two evaluations of the same sample give different results. Can I set a seed?

Download Models to a Local Folder Instead of Cache Dir

At the moment, Tango downloads the model data from Hugging Face to a .cache dir:

/.cache/huggingface/hub/models--declare-lab--tango/snapshots/{some hashes}

It would be nice if the model were stored in the tango dir, e.g. tango/models.
This way it would also become easier to use tango with Docker volumes.
