yangdongchao / text-to-sound-synthesis

The source code of our paper "Diffsound: discrete diffusion model for text-to-sound generation"

Home Page: http://dongchaoyang.top/text-to-sound-synthesis-demo/


text-to-sound-synthesis's Introduction

Text-to-sound Generation

This paper has been accepted by IEEE Transactions on Audio, Speech and Language Processing.
All of the pre-trained models have been uploaded to Hugging Face: https://huggingface.co/Dongchao/Diffsound/tree/main
This is the open-source code for our paper "Diffsound: discrete diffusion model for text-to-sound generation".
You can find the paper on arXiv: https://arxiv.org/pdf/2207.09983v1.pdf
The demo page is http://dongchaoyang.top/text-to-sound-synthesis-demo/
2022/08/03: We have uploaded the training code for the VQ-VAE, the baseline text-to-sound generation method (an autoregressive model), and the Diffsound code. Because GitHub limits file sizes, we will upload the pre-trained models to Google Drive.
2022/08/06: We have uploaded the pre-trained model to Google Drive; please refer to https://drive.google.com/drive/folders/193It90mEBDPoyLghn4kFzkugbkF_aC8v?usp=sharing
Note that a pre-trained Diffsound model is very large, so we have only uploaded one AudioSet-pretrained model for now. We will try to upload more models to other free storage; if you know of any free shared storage, please let us know, we would greatly appreciate it.
2022/08/09: We have uploaded the Diffsound model trained on the AudioCaps dataset, the baseline AR model, and the codebook trained on AudioSet with a size of 512. Please refer to https://pan.baidu.com/s/1R9YYxECqa6Fj1t4qbdVvPQ (password: lsyr).
2022/12/06: In our previous setting we used the wrong sample rate when loading wav files, which meant that speech could not be generated well. We have now updated the feature extraction module: https://github.com/yangdongchao/Text-to-sound-Synthesis/blob/master/Codebook/feature_extraction/extract_mel_spectrogram.py#L167 . We will re-train our models; all of the pre-trained models can be found on the PKU disk: https://disk.pku.edu.cn:443/link/87DE08BDA2521CB54F4911393EB36B4A . More details will be updated soon.
2023/01/11: The latest pre-trained model on AudioSet has been released; please refer to the PKU disk: https://disk.pku.edu.cn:443/link/E36C91C27830FAF0B9326D8EA685A193 (valid until 2024-06-30 23:59).
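
As a point of reference for that fix, the sketch below shows the general idea of resampling every wav file to one explicit target sample rate before computing the mel spectrogram. It is only an illustration: the target rate and mel parameters are assumptions, not the exact values used in extract_mel_spectrogram.py.

import librosa
import numpy as np

def load_wav_at_target_sr(path, target_sr=22050):
    # Resample to an explicit target rate instead of trusting each file's native rate;
    # target_sr=22050 is an illustrative assumption, not the repository's actual setting.
    wav, _ = librosa.load(path, sr=target_sr)
    return wav

def log_mel_spectrogram(wav, sr=22050, n_mels=80):
    # Placeholder mel settings, for illustration only.
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))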

Overview

(Figure: framework overview.)

Pretrained Model

We release four text-to-sound pretrained models: a VQ-VAE trained on AudioSet, a vocoder trained on AudioSet, and generation models trained on AudioCaps and on AudioSet.

Inference

Please refer to the readme.md file in the Codebook folder to see how to run inference.
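
For orientation, the sampling entry point that appears throughout the issues below is Diffsound/evaluation/generate_samples_batch.py, normally run as "python3 evaluation/generate_samples_batch.py" from the Diffsound folder. The minimal sketch below shows only the wrapper it constructs; the constructor call comes from the tracebacks quoted in the issues, while the import path and all file paths are assumptions.

from evaluation.generate_samples_batch import Diffsound  # assumed import path

config_path = "evaluation/caps_text.yaml"             # config file referenced in the issues
pretrained_model_path = "diffsound_audiocaps.pth"     # placeholder: a downloaded checkpoint
ckpt_vocoder = "vocoder_checkpoint_dir"               # placeholder: a downloaded vocoder

# Constructor signature as shown in the issue tracebacks on this page.
sampler = Diffsound(config=config_path, path=pretrained_model_path, ckpt_vocoder=ckpt_vocoder)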

Training

Please refer to the readme.md file in the Codebook folder to see how to train your network.

Reference

This project is based on the following open-source code:
https://github.com/XinhaoMei/ACT
https://github.com/cientgu/VQ-Diffusion
https://github.com/CompVis/taming-transformers
https://github.com/lonePatient/Bert-Multi-Label-Text-Classification
https://github.com/v-iashin/SpecVQGAN

Cite

@article{yang2022diffsound,
  title={Diffsound: Discrete Diffusion Model for Text-to-sound Generation},
  author={Yang, Dongchao and Yu, Jianwei and Wang, Helin and Wang, Wen and Weng, Chao and Zou, Yuexian and Yu, Dong},
  journal={arXiv e-prints},
  pages={arXiv--2207},
  year={2022}
}

License

MIT license


text-to-sound-synthesis's Issues

Missing libraries "ftfy" "regex" "einops"

Thank you for your help. With a fresh environment, I get missing libraries: "ftfy", "regex", and "einops". After installing these through Conda, I still get the same KeyError:
Perhaps my environment (installing via "conda env create" with your config in Codebook) is not really clean? I would greatly appreciate any help you can offer in this matter. Thank you!

(specvqgan) dto@thexder:/apdcephfs/share_1316500/donchaoyang/code3/Text-to-sound-Synthesis/Diffsound$ python3 evaluation/generate_samples_batch.py
Restored from /apdcephfs/share_1316500/donchaoyang/code3/SpecVQGAN/logs/2022-04-24T23-17-27_audioset_codebook256/checkpoints/last.ckpt
Traceback (most recent call last):
File "evaluation/generate_samples_batch.py", line 204, in
Diffsound = Diffsound(config=config_path, path=pretrained_model_path, ckpt_vocoder=ckpt_vocoder)
File "evaluation/generate_samples_batch.py", line 44, in init
self.info = self.get_model(ema=True, model_path=path, config_path=config)
File "evaluation/generate_samples_batch.py", line 64, in get_model
model = build_model(config) # load the DALLE model
File "evaluation/../sound_synthesis/modeling/build.py", line 5, in build_model
return instantiate_from_config(config['model'])
File "evaluation/../sound_synthesis/utils/misc.py", line 132, in instantiate_from_config
return cls(**config.get("params", dict()))
File "evaluation/../sound_synthesis/modeling/models/dalle_spec.py", line 40, in init
self.transformer = instantiate_from_config(diffusion_config)
File "evaluation/../sound_synthesis/utils/misc.py", line 132, in instantiate_from_config
return cls(**config.get("params", dict()))
File "evaluation/../sound_synthesis/modeling/transformers/diffusion_transformer.py", line 172, in init
self.condition_emb = instantiate_from_config(condition_emb_config) # load the model that produces the condition embedding
File "evaluation/../sound_synthesis/utils/misc.py", line 132, in instantiate_from_config
return cls(**config.get("params", dict()))
File "evaluation/../sound_synthesis/modeling/embeddings/clip_text_embedding.py", line 25, in init
model, _ = clip.load(clip_name, device='cpu',jit=False)
File "evaluation/../sound_synthesis/modeling/modules/clip/clip.py", line 114, in load
model = build_model(state_dict or model.state_dict()).to(device)
File "evaluation/../sound_synthesis/modeling/modules/clip/model.py", line 409, in build_model
vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
KeyError: 'visual.layer1.0.conv1.weight'

Looking forward to a Colab version like Disco Diffusion

When I first tried Disco Diffusion, I already imagined applying diffusion models to sound, and this is exactly what I wanted. I'm really looking forward to a Colab version like Disco Diffusion, or a Paddle version. I would love to use it in my interactive sound installation.

ImportError: cannot import name 'rank_zero_warn' from 'pytorch_lightning.utilities.distributed'

(specvqgan) dto@thexder:/apdcephfs/share_1316500/donchaoyang/code3/Text-to-sound-Synthesis/Diffsound$ python3 evaluation/generate_samples_batch.py
/home/dto/.local/lib/python3.8/site-packages/torch/cuda/init.py:83: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 10010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
/home/dto/.local/lib/python3.8/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn('User provided device_type of 'cuda', but CUDA is not available. Disabling')
Traceback (most recent call last):
File "evaluation/generate_samples_batch.py", line 204, in
Diffsound = Diffsound(config=config_path, path=pretrained_model_path, ckpt_vocoder=ckpt_vocoder)
File "evaluation/generate_samples_batch.py", line 44, in init
self.info = self.get_model(ema=True, model_path=path, config_path=config)
File "evaluation/generate_samples_batch.py", line 64, in get_model
model = build_model(config) # load the DALLE model
File "evaluation/../sound_synthesis/modeling/build.py", line 5, in build_model
return instantiate_from_config(config['model'])
File "evaluation/../sound_synthesis/utils/misc.py", line 132, in instantiate_from_config
return cls(**config.get("params", dict()))
File "evaluation/../sound_synthesis/modeling/models/dalle_spec.py", line 38, in init
self.init_content_codec_from_ckpt(content_codec_config)
File "evaluation/../sound_synthesis/modeling/models/dalle_spec.py", line 46, in init_content_codec_from_ckpt
model = instantiate_from_config(content_codec_config) # get the first-stage model
File "evaluation/../sound_synthesis/utils/misc.py", line 131, in instantiate_from_config
cls = getattr(importlib.import_module(module, package=None), cls)
File "/home/dto/miniconda3/envs/specvqgan/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 783, in exec_module
File "", line 219, in _call_with_frames_removed
File "evaluation/../sound_synthesis/modeling/codecs/spec_codec/vqgan.py", line 3, in
import pytorch_lightning as pl
File "/home/dto/.local/lib/python3.8/site-packages/pytorch_lightning/init.py", line 42, in
import("pkg_resources").declare_namespace(name)
File "/home/dto/miniconda3/envs/specvqgan/lib/python3.8/site-packages/pkg_resources/init.py", line 2304, in declare_namespace
_handle_ns(packageName, path_item)
File "/home/dto/miniconda3/envs/specvqgan/lib/python3.8/site-packages/pkg_resources/init.py", line 2237, in _handle_ns
loader.load_module(packageName)
File "/home/dto/miniconda3/envs/specvqgan/lib/python3.8/site-packages/pytorch_lightning/init.py", line 28, in
from pytorch_lightning import metrics # noqa: E402
File "/home/dto/miniconda3/envs/specvqgan/lib/python3.8/site-packages/pytorch_lightning/metrics/init.py", line 14, in
from pytorch_lightning.metrics.classification import ( # noqa: F401
File "/home/dto/miniconda3/envs/specvqgan/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/init.py", line 14, in
from pytorch_lightning.metrics.classification.accuracy import Accuracy # noqa: F401
File "/home/dto/miniconda3/envs/specvqgan/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 19, in
from pytorch_lightning.metrics.metric import Metric
File "/home/dto/miniconda3/envs/specvqgan/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py", line 20, in
from pytorch_lightning.utilities.distributed import rank_zero_warn
ImportError: cannot import name 'rank_zero_warn' from 'pytorch_lightning.utilities.distributed' (/home/dto/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py)
(specvqgan) dto@thexder:/apdcephfs/share_1316500/donchaoyang/code3/Text-to-sound-Synthesis/Diffsound$

Where to download the melception.pt to evaluate?

In Codebook/readme.md, it says we can run python Codebook/evaluate.py to compute the metrics. However, it needs /apdcephfs/share_1316500/donchaoyang/code3/Codebook/evaluation/logs/melception.pt to extract features. I couldn't find that checkpoint in this project, nor a download link. Could you provide a download link? Thank you.

Issues when sampling with the newest pretrained model

Dear authors,

I am trying to use the pretrained model listed in readme.md:
"2022/08/09: We have uploaded the Diffsound model trained on the AudioCaps dataset, the baseline AR model, and the codebook trained on AudioSet with a size of 512. (https://disk.pku.edu.cn/link/DA2EAC5BBBF43C9CAB37E0872E50A0E4)"

When I try to run the command "python evaluation/generate_samples_batch.py" to sample some audio, the code raises an error:
"RuntimeError: Error(s) in loading state_dict for VQModel:
size mismatch for quantize.embedding.weight: copying a param with shape torch.Size([512, 256]) from checkpoint, the shape in current model is torch.Size([256, 256])
"

I have already tried many revised versions of your 'caps_text.yaml' (changing several 256 values to 512), but none of them works. Could you please share a way to sample from your newest trained model? Thanks a lot.

How do we use BERT or CLIP features?

It is shown in Codebook readme.md that
"For the text features, we provide two types of features, (1) use BERT (2) use CLIP
For BERT features, please run
python generete_text_fea/predict_one.py
For CLIP features, please run
python generete_text_fea/generate_fea_clip.py "

However, when I tried running 'python3 ./Diffsound/train_spec.py --name caps_train --config_file ./Diffsound/configs/caps_512.yaml --tensorboard --load_path None', I found that it only calls the "./Diffsound/sound_synthesis/modeling/embeddings/clip_text_embedding.py" file and loads the ViT-B-32 model. No part of it reads the files generated by BERT or CLIP.

Should we do any pre-processing on the text and save the features as files, in order to avoid running the ViT-B-32 model each time? Can we change the model to other models like BERT or CLIP? (A sketch of such pre-processing is given after this question.)

Thanks a lot.
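
For reference, here is a minimal sketch of pre-computing and caching CLIP text features, roughly what generete_text_fea/generate_fea_clip.py is expected to do. It uses the OpenAI CLIP package; the captions, output path, and feature layout are placeholders and assumptions, not the repository's actual format.

import clip
import torch

device = "cpu"
# ViT-B/32 is the CLIP variant mentioned above; the loading style mirrors clip_text_embedding.py.
model, _ = clip.load("ViT-B/32", device=device, jit=False)

captions = ["a dog barking", "rain falling on a roof"]  # placeholder captions
with torch.no_grad():
    tokens = clip.tokenize(captions).to(device)
    text_features = model.encode_text(tokens)  # (num_captions, 512) for ViT-B/32

# Cache the features so the text encoder does not have to run at every training step.
torch.save({c: f.cpu() for c, f in zip(captions, text_features)}, "clip_text_features.pt")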

ModuleNotFoundError: No module named 'vocoder'

(specvqgan) dto@thexder:/apdcephfs/share_1316500/donchaoyang/code3/Text-to-sound-Synthesis/Codebook$ python3 evaluation/generate_samples.py
Traceback (most recent call last):
File "evaluation/generate_samples.py", line 17, in
from train import instantiate_from_config
File "/apdcephfs/share_1316500/donchaoyang/code3/Text-to-sound-Synthesis/Codebook/./train.py", line 30, in
from vocoder.modules import Generator
ModuleNotFoundError: No module named 'vocoder'

Add License

Can you please add a license covering the use of the pre-trained models and the code?

Baidu link

Hi, amazing repo! Thank you very much for releasing your code.

Unfortunately I can't download from https://pan.baidu.com/s/1R9YYxECqa6Fj1t4qbdVvPQ as UK phone numbers are not supported for baidu sign-up.

Are there any alternative download links available for this model?

Cheers,
Barney

ModuleNotFoundError: No module named 'clip'

ModuleNotFoundError: No module named 'clip'
ERROR conda.cli.main_run:execute(49): conda run python /mnt/d/miaoxiangyang/Text-to-sound-Synthesis/Codebook/generete_text_fea/generate_fea_clip.py failed. (See above for error)

About CC_pretrained model

Dear authors,

When I try to reproduce your code, it seems that in your "Diffsound/running_command/run_train_audioset.py" file, you train the model by loading a checkpoint passed as "--load_path OUTPUT/pretrained_model/CC_pretrained.pth".

Do we have to find and download CC_pretrained.pth? Where can we find it?

Thanks a lot.

Please add a LICENSE file

Thank you for your great work!
Please do add a License to the repo such that it's clear to incoming users what the agreement is.

How to use the codebook with the size of 512?

Hi, thank you for sharing your great project!

I have a question about your released models.

At pan.baidu.com, you shared your trained codebook model with a size of 512.
But diffsound_audiocaps.pth (6.32GB) also contains a codebook with a size of 256, not 512.

I confirmed this with the following code.

import torch

cb_model_path = '../download/baidu/2022-04-22T19-35-05_audioset_codebook512/checkpoints/last.ckpt'
cb_model = torch.load(cb_model_path)
print(cb_model['state_dict']['quantize.embedding.weight'].shape) # (512, 256), ok

ds_model_path = "../download/baidu/diffsound_audiocaps.pth"
ds_model = torch.load(ds_model_path, map_location="cpu")
print(ds_model['model']['content_codec.quantize.embedding.weight'].shape) # (256, 256), should be (512, 256)

This causes a dimension mismatch.
In your generate_samples_batch.py, the codebook is first loaded from model.params.content_codec_config.params.ckpt_path in the yaml, and then diffsound_audiocaps.pth is loaded. But the latter checkpoint contains a codebook of size 256, as mentioned above.

Do I need to drop the codebook weights from diffsound_audiocaps.pth (one possible workaround is sketched below)? Or do you have an appropriate diffsound model?

Thank you,
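
Below is a minimal sketch of that first option, i.e. dropping the size-256 codebook weights from the Diffsound checkpoint so that the size-512 codebook specified by ckpt_path in the yaml is kept instead. The key prefix follows the shapes printed in the snippet above; the exact set of keys to remove and the strict=False loading step are assumptions, not a confirmed fix from the authors.

import torch

ds_model_path = "../download/baidu/diffsound_audiocaps.pth"   # path from the snippet above
ckpt = torch.load(ds_model_path, map_location="cpu")

# Drop the codebook entries carried inside the Diffsound checkpoint.
ckpt["model"] = {k: v for k, v in ckpt["model"].items()
                 if not k.startswith("content_codec.quantize")}

torch.save(ckpt, "../download/baidu/diffsound_audiocaps_no_codebook.pth")
# The model would then have to be loaded with load_state_dict(..., strict=False)
# so the missing codebook keys are ignored and the yaml-specified codebook is used.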

provide examples with [mask] token?

Hi, could you please share some caption examples for pretraining on AudioSet? I'm a little confused about the [mask] token setting for the CLIP text encoder.

Missing package "image_synthesis"

(specvqgan) dto@thexder:~/src/tss/Diffsound$ python evaluation/generate_samples_batch.py
Traceback (most recent call last):
File "evaluation/generate_samples_batch.py", line 19, in
from image_synthesis.utils.io import load_yaml_config
ModuleNotFoundError: No module named 'image_synthesis'

It does not seem to be installable through Anaconda.

Embedding shape issue

Hi, I'm trying to use the pretrained weights of the codebook trained on audioset with a size of 512.
However, I'm confused about the dimension parameters that should be changed.
What should I change in 'Diffsound/evaluation/caps_text.yaml'?

Thank you.

Storage of models and more

Hi in the readme.md you have:

Note that a pre-trained Diffsound model is very large, so we have only uploaded one AudioSet-pretrained model for now. We will try to upload more models to other free storage; if you know of any free shared storage, please let us know, we would greatly appreciate it.

You can upload your models to Hugging Face free of charge, and you can also add a model card with code examples, etc. Check out a tutorial here.

You can also set up a Space for demoing your model!
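
For reference, a minimal sketch of what such an upload could look like with the huggingface_hub package. The repo id is taken from the Hugging Face link at the top of this page, but the file name and layout are placeholders, not the authors' actual setup.

from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="Dongchao/Diffsound", exist_ok=True)   # repo id from the README link
api.upload_file(
    path_or_fileobj="OUTPUT/pretrained_model/diffsound_audiocaps.pth",  # placeholder local path
    path_in_repo="diffsound_audiocaps.pth",
    repo_id="Dongchao/Diffsound",
)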

KeyError: 'visual.layer1.0.conv1.weight'

(specvqgan) dto@thexder:/apdcephfs/share_1316500/donchaoyang/code3/Text-to-sound-Synthesis/Diffsound$ python3 evaluation/generate_samples_batch.py
/home/dto/miniconda3/envs/specvqgan/lib/python3.8/site-packages/torch/cuda/init.py:83: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 10010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
/home/dto/miniconda3/envs/specvqgan/lib/python3.8/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn('User provided device_type of 'cuda', but CUDA is not available. Disabling')
Restored from /apdcephfs/share_1316500/donchaoyang/code3/SpecVQGAN/logs/2022-04-24T23-17-27_audioset_codebook256/checkpoints/last.ckpt
Traceback (most recent call last):
File "evaluation/generate_samples_batch.py", line 204, in
Diffsound = Diffsound(config=config_path, path=pretrained_model_path, ckpt_vocoder=ckpt_vocoder)
File "evaluation/generate_samples_batch.py", line 44, in init
self.info = self.get_model(ema=True, model_path=path, config_path=config)
File "evaluation/generate_samples_batch.py", line 64, in get_model
model = build_model(config) # load the DALLE model
File "evaluation/../sound_synthesis/modeling/build.py", line 5, in build_model
return instantiate_from_config(config['model'])
File "evaluation/../sound_synthesis/utils/misc.py", line 132, in instantiate_from_config
return cls(**config.get("params", dict()))
File "evaluation/../sound_synthesis/modeling/models/dalle_spec.py", line 40, in init
self.transformer = instantiate_from_config(diffusion_config)
File "evaluation/../sound_synthesis/utils/misc.py", line 132, in instantiate_from_config
return cls(**config.get("params", dict()))
File "evaluation/../sound_synthesis/modeling/transformers/diffusion_transformer.py", line 172, in init
self.condition_emb = instantiate_from_config(condition_emb_config) # load the model that produces the condition embedding
File "evaluation/../sound_synthesis/utils/misc.py", line 132, in instantiate_from_config
return cls(**config.get("params", dict()))
File "evaluation/../sound_synthesis/modeling/embeddings/clip_text_embedding.py", line 25, in init
model, _ = clip.load(clip_name, device='cpu',jit=False)
File "evaluation/../sound_synthesis/modeling/modules/clip/clip.py", line 114, in load
model = build_model(state_dict or model.state_dict()).to(device)
File "evaluation/../sound_synthesis/modeling/modules/clip/model.py", line 409, in build_model
vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
KeyError: 'visual.layer1.0.conv1.weight'
