rongjiehuang / FastDiff

PyTorch Implementation of FastDiff (IJCAI'22)

Python 98.38% Jupyter Notebook 1.62%
ijcai2022 neural-vocoder speech-synthesis text-to-speech vocoder

fastdiff's Introduction

Hi there 👋

I am Rongjie Huang (黄融杰). I did my graduate studies at the College of Computer Science and Software, Zhejiang University, supervised by Prof. Zhou Zhao, and obtained my Bachelor's degree at Zhejiang University as well. During my graduate studies, I was lucky to collaborate with the CMU Speech Team led by Prof. Shinji Watanabe and the Audio Research Team at Zhejiang University. I was grateful to intern or collaborate at TikTok, Shanghai AI Lab (OpenGVLab), Tencent Seattle Lab, and Alibaba DAMO Academy, with Yi Ren, Jinglin Liu, Chunlei Zhang, and Dong Yu.

My research interests include Multi-Modal Generative AI, Multi-Modal Language Processing, and AI4Science. I have published first-author papers at top international AI conferences such as NeurIPS, ICLR, ICML, ACL, and IJCAI.

I am actively looking for academic collaborations; feel free to drop me an email.

📎 Homepages

💻 Selected Research Papers

Generative AI for Speech, Singing, and Audio: Spoken Large Language Model, Text-to-Audio Synthesis, Text-to-Speech Synthesis, Singing Voice Synthesis

Audio-Visual Language Processing: Audio-Visual Speech-to-Speech Translation, Self-Supervised Learning

My full paper list is available on my personal homepage.

Spoken Large Language Model

Text-to-Speech Synthesis

Text-to-Audio Synthesis

Audio-Visual Language Processing

Singing Voice Synthesis

fastdiff's People

Contributors

rongjiehuang

fastdiff's Issues

Question about hyperparameter N

I noticed that the hyperparameter N is set in the example you gave.
[screenshot]
However, when training on my own dataset, I did not set N.
[screenshot]
Is it possible to leave the parameter N unset? Or how should I choose N?
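
For context, a hedged sketch of how N is typically supplied, based on the inference commands quoted further down this page (the exact values here are illustrative assumptions):

python tasks/run.py --config modules/FastDiff/config/FastDiff.yaml --exp_name FastDiff --infer --hparams='test_input_dir=wavs,N=4'

N would then select the number of reverse diffusion steps used at sampling time (e.g. 4 for fast sampling versus the full 1000-step training schedule), so it mainly matters for inference rather than for training itself.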

ModuleNotFoundError: No module named 'utils.rnnoise'

Hello! I ran into this error when running your code:
File "/qwork4/twu/miniconda/envs/tts/lib/python3.6/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 994, in _gcd_import
File "", line 971, in _find_and_load
File "", line 955, in _find_and_load_unlocked
File "", line 665, in _load_unlocked
File "", line 678, in exec_module
File "", line 219, in _call_with_frames_removed
File "/qdell3data/qwork/txw94/qdell3/TTS/FastDiff-main/egs/datasets/audio/lj/pre_align.py", line 1, in
from data_gen.tts.vocoder_pre_align import VocoderPreAlign
File "/home/twu/.pycharm_helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/qdell3data/qwork/txw94/qdell3/TTS/FastDiff-main/data_gen/tts/vocoder_pre_align.py", line 17, in
from utils.rnnoise import rnnoise
File "/home/twu/.pycharm_helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
ModuleNotFoundError: No module named 'utils.rnnoise'

Is your utils package missing an rnnoise module?
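
A hedged workaround, assuming the rnnoise-based denoising step is not required for your dataset: make the import in data_gen/tts/vocoder_pre_align.py optional. The None fallback below is an illustration, not something present in the repo.

try:
    from utils.rnnoise import rnnoise  # original import from the repo
except ModuleNotFoundError:
    rnnoise = None  # hypothetical fallback: skip rnnoise denoising when the module is absent

Any code path that actually calls rnnoise would then need to check for None, or the corresponding denoising option should be disabled in the config if one exists.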

Pretrained models are lost

Hi, thanks for your great work.
It looks like the link to the pretrained models is invalid. Could you please update it? Thanks!

Question about FastDiff-TTS

Hello, thank you for sharing your code with the community.
I'm trying to implement the FastDiff-TTS model with my own dataset.

My model pronounces words well after 120k training steps, but the sound quality is not good yet.
So I have some questions about FastDiff-TTS's behavior.

  1. I used a pre-derived noise schedule for noise scheduling. If I use the pre-derived schedule, is FastDiff's sound quality limited?
  2. How many training steps are required for good sound quality or convergence?
  3. I can hear a faint noise in your demo audio. Is there any way to remove that noise?
  4. Have you tried multi-speaker TTS with FastDiff-TTS?

The audio samples from my model are at the URL below.
https://lime-honeycrisp-5e3.notion.site/Multi-speaker-FastDiff-TTS-5bae38d4562144059bf84651f603ff28

Thank you.

Optimizer parameters

Thank you for making your code publicly available!

I have a question about the optimizer parameters. The paper says β1 = 0.9, β2 = 0.98, ε = 1e−9 were used (Section 5.1). On the other hand, this repository uses the default AdamW values β1 = 0.9, β2 = 0.999, ε = 1e−8:

self.optimizer = optimizer = torch.optim.AdamW(
    self.model.parameters(),
    lr=float(hparams['lr']), weight_decay=float(hparams['weight_decay']))

Which setup is your recommendation?
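
For comparison, a minimal sketch of the optimizer built with the paper's reported values (Section 5.1); model and hparams are assumed to be defined as in the snippet above:

import torch

# Paper setting (Sec. 5.1): beta1 = 0.9, beta2 = 0.98, eps = 1e-9
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=float(hparams['lr']),
    betas=(0.9, 0.98),
    eps=1e-9,
    weight_decay=float(hparams['weight_decay']))

Whether the repository default or the paper setting works better in practice is exactly the question being asked here.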

Integrating with Tacotron 1

I am trying to integrate this model with Tacotron 1 for TTS, but I do not have the exact audio corresponding to the mel spectrograms produced by Tacotron, so how should I go about training the model? When I pass the speaker audio from the dataset together with the mel spectrogram generated by Tacotron, I get an assertion error: assert in_length == (kernel_length * hop_size). Please let me know if I am missing something, or what the strategy should be for training the model to handle model-generated spectrograms.
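
The assertion implies the waveform length must equal the number of mel frames times the hop size. Below is a hedged sketch of aligning the two before training; the hop size of 256 is an assumption borrowed from the 22.05 kHz inference example later on this page.

import torch
import torch.nn.functional as F

HOP_SIZE = 256  # assumed hop size for 22.05 kHz audio

def align_lengths(wav, mel, hop_size=HOP_SIZE):
    """Trim or zero-pad the waveform so wav length == num_mel_frames * hop_size."""
    target_len = mel.shape[-1] * hop_size
    if wav.shape[-1] >= target_len:
        wav = wav[..., :target_len]                        # trim extra samples
    else:
        wav = F.pad(wav, (0, target_len - wav.shape[-1]))  # zero-pad the tail
    return wav, mel

Note this only resolves the length mismatch; whether training the vocoder on Tacotron-generated rather than ground-truth mel spectrograms is advisable is a separate question.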

pretrained model

Hi, the link to the pretrained model is broken. Can you provide it again? Thank you!

Training config sample rate 24 kHz

Hi, I noticed that your pretrained model on LibriTTS uses 22.05 kHz.

  1. What should I change in the base config if I want to train a model at 24 kHz? (See the config sketch below.)
  2. I also wonder whether I could fine-tune the 22.05 kHz pretrained model on 24 kHz data.

Looking forward to your reply, thanks!
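
For what it's worth, a hedged sketch of the sample-rate-related hparams that usually need to change together; the key names are assumptions based on the config keys quoted elsewhere on this page and on similar configs, so they should be checked against the actual YAML:

audio_sample_rate: 24000   # assumed key name
hop_size: 300              # often rescaled so the frame shift keeps the same duration in milliseconds
win_size: 1200             # rescaled together with hop_size
fft_size: 2048
fmax: 12000                # must stay at or below sample_rate / 2

As for fine-tuning the 22.05 kHz checkpoint on 24 kHz data: once the hop size changes, the model's upsampling geometry and conditioning features would likely no longer match what the checkpoint was trained on.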

Pretrained model is lost

Hi, it seems the link to the pretrained model is broken :(
Could you help fix it? Thanks in advance!
[screenshot]

Question about noise scheduling process.

Hello, I'm trying to implement the noise scheduling process by referring to BDDM's implementation (BDDM/sampler.py),

and I have some questions about the noise scheduling process for FastDiff-TTS.

  1. In the FastDiff paper, alphaN and betaN are set as hyperparameters, e.g. αˆt = 0.54, βˆt = 0.70. Can I use these hyperparameters for my own FastDiff-TTS module, or for another number of reverse steps (e.g. 6, 8, 10, ...)? How are they calculated?

  2. For BDDM, searching alphaN and betaN requires a greedy search with search_bin = 9, plus a further searching step = 10 that adds noise to the parameters, e.g. _alpha_param = alpha_param * (0.95 + np.random.rand() * 0.1).
    Does FastDiff require a similar process?

  3. For BDDM, STOI and PESQ are estimated on the generated audio to find the best noise schedule. How should the best parameters be selected based on these two indicators?

  4. Are STOI and PESQ also needed in the parameter search process for FastDiff?

  5. In BDDM, num_reverse_steps = math.floor(T / tau). But in FastDiff, T = 1000, tau = 200, and num_reverse_steps = 4. Do I need to calculate num_reverse_steps as math.floor(T / tau) - 1?
    [screenshot]

Thank you.
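
For context on the reverse-step schedules, a minimal sketch: the 4-step values are copied from the LJSpeech inference example later on this page, while the linspace fallback for other step counts is only an illustrative assumption, not the noise-predictor/grid-search procedure the paper uses to derive schedules.

import torch

# 4-step sampling schedule, as used in the LJSpeech inference example below
noise_schedule_4 = torch.FloatTensor([3.2176e-04, 2.5743e-03, 2.5376e-02, 7.0414e-01])

# Naive linear schedule for an arbitrary number of reverse steps (illustrative only)
def linear_schedule(num_steps, beta_min=1e-4, beta_max=0.02):
    return torch.linspace(beta_min, beta_max, num_steps)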

train problem

Thank you for your excellent work, but I have a problem. I trained following the instructions without changing any parameters, using 1000k training steps, and generated results with demo_tts, but the synthesis is not satisfactory: there is a certain amount of noise and electrical hum in the audio. I don't know where the problem is, please help!

issue with demo yaml file

Hello, I was following the demo and ran into some trouble with the YAML files: I tried grabbing many parameters from PortaSpeech/diffspeech/config.yaml and adding them to modules/FastDiff/config/FastDiff.yaml, but that causes other issues. Can you point me to the correct YAML file?

process problem

python data_gen/tts/bin/pre_align.py --config /public/home/yao_yh/code/code/FastDiff-main/modules/FastDiff/config/FastDiff.yaml
Traceback (most recent call last):
File "/public/home/yao_yh/code/code/FastDiff-main/data_gen/tts/bin/pre_align.py", line 6, in
from utils.hparams import set_hparams, hparams
ModuleNotFoundError: No module named 'utils.hparams'

I installed the environment and deployed the code according to the readme, the files exist in utils, but I keep getting the above error, I don't know if it's a problem with the file directory?
Also, the link to the processed LJSpeech file is no longer working, thanks!
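
A common cause of this kind of ModuleNotFoundError is running the script from outside the repository root, so the repo's utils package is not on the Python path. A hedged fix is to run from the FastDiff-main directory with it on PYTHONPATH, for example:

cd FastDiff-main
export PYTHONPATH=.
python data_gen/tts/bin/pre_align.py --config modules/FastDiff/config/FastDiff.yaml

The same likely applies to the "No module named 'utils'" report further down this page.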

No module named 'modules.tts'

Hello, thank you for your awesome work.
When I run your script for

Inference for text-to-speech synthesis

in the README, I get an error:

Traceback (most recent call last):
File "inference/tts/ds.py", line 5, in <module>
from modules.tts.diffspeech.shallow_diffusion_tts import GaussianDiffusion
ModuleNotFoundError: No module named 'modules.tts'

I didn't find

modules.tts.diffspeech.shallow_diffusion_tts

in the repo, could you help me solve it?
Thank you.

cannot import name 'waveglow' from 'vocoders'

Hi,
It seems that some vocoder wrappers are missing from the "vocoders" directory. They are imported in "vocoders/__init__.py" but are not present in the directory.

Here is the traceback after running the binarization step:

Traceback (most recent call last):
  File "data_gen/tts/bin/binarize.py", line 23, in <module>
    binarize()
  File "data_gen/tts/bin/binarize.py", line 16, in binarize
    binarizer_cls = getattr(importlib.import_module(pkg), cls_name)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/payman/TTS/FastDiff/data_gen/tts/vocoder_binarizer.py", line 18, in <module>
    from vocoders.base_vocoder import get_vocoder_cls
  File "/home/payman/TTS/FastDiff/vocoders/__init__.py", line 2, in <module>
    from vocoders import waveglow
ImportError: cannot import name 'waveglow' from 'vocoders' (/home/payman/TTS/FastDiff/vocoders/__init__.py)
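
A hedged workaround, assuming only the FastDiff vocoder itself is needed: edit vocoders/__init__.py and comment out (or delete) the imports of wrapper modules that are not actually present in the directory, e.g.:

# Sketch: in vocoders/__init__.py, drop imports of wrappers missing from the repo
# from vocoders import waveglow   # missing module; this line raises the ImportError above
# from vocoders import pwg        # likewise reported missing in the Colab issue below

The binarizer in the traceback only needs vocoders.base_vocoder.get_vocoder_cls, so removing the broken imports should let the binarization step proceed.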

About audio sample rate and other hparams

I noticed that the supported datasets of this repository have different audio sample rates. Is the output sample rate of the FastDiff vocoder bound to its training data, or does it have a fixed sample rate to which all training data is downsampled?

If the sample rate is modifiable, which config file(s) and which hparam(s) should be edited? And as the sample rate changes, which other hparams should be changed together with it?
(i.e., what should I do if I want to train a vocoder with a higher sample rate?)

By the way, I found myself somewhat confused when dealing with all the .yaml config files and hparams. I finally managed to start the training process, but I still cannot fully understand what they mean and how they are organized. It would be much appreciated if more detailed explanations could be provided in the README or documentation.

Loss NaN occurred during FastDiff training

Hello, Huang! Is it expected to get a NaN loss when training FastDiff without loading pretrained weights? How should the parameters be handled or adjusted if this happens? The loss also fails to converge; how can I handle it?
[screenshot: 2023-04-03 14-05-08]

Noisy outputs when running LJSpeech checkpoint on Tacotron mel spectrograms

Hey @Rongjiehuang,

Thanks a lot for open-sourcing the checkpoint for the FastDiff vocoder for LJSpeech!

I played around with the code a bit and I'm only getting quite noisy generations when decoding the mel spectrogram of a tacotron with FastDiff's vocoder.

Here is the code to reproduce it:

#!/usr/bin/env python3
import torch
from modules.FastDiff.module.FastDiff_model import FastDiff
from utils import audio
from modules.FastDiff.module.util import compute_hyperparams_given_schedule, sampling_given_noise_schedule

HOP_SIZE = 256  # for 22050 frequency

# download checkpoint to this folder
state_dict = torch.load("./checkpoints/LJSpeech/model_ckpt_steps_500000.ckpt")["state_dict"]["model"]
model = FastDiff().cuda()
model.load_state_dict(state_dict)

train_noise_schedule = noise_schedule = torch.linspace(1e-06, 0.01, 1000)
diffusion_hyperparams = compute_hyperparams_given_schedule(noise_schedule)

# load noise schedule for 200 sampling steps
#noise_schedule = torch.linspace(0.0001, 0.02, 200).cuda()
# load noise schedule for 4 sampling steps
noise_schedule = torch.FloatTensor([3.2176e-04, 2.5743e-03, 2.5376e-02, 7.0414e-01]).cuda()

tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to("cuda").eval()

text = "Hello world, I missed you so much."
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    mels, _, _ = tacotron2.infer(sequences, lengths)

audio_length = mels.shape[-1] * HOP_SIZE
pred_wav = sampling_given_noise_schedule(
    model, (1, 1, audio_length), diffusion_hyperparams, noise_schedule,
    condition=mels, ddim=False, return_sequence=False)

pred_wav = pred_wav / pred_wav.abs().max()
audio.save_wav(pred_wav.view(-1).cpu().float().numpy(), './test.wav', 22050)

After listening to test.wav, one can identify the correct sentence, but the output is extremely noisy. Any ideas what the reason could be? Are any of the hyperparameters incorrectly set? Or does FastDiff only work with a certain type of mel spectrogram?

It would be very nice if you could take a quick look to check whether I have messed up some part of the code 😅
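
One plausible cause worth checking (an assumption, not a confirmed diagnosis): FastDiff is conditioned on mel spectrograms produced by its own preprocessing, and Tacotron2's mels may use a different log base, normalization, or frequency range. A quick diagnostic is to compare value ranges against a mel taken from the binarized training data:

# Diagnostic sketch: compare Tacotron2 mel statistics with a training mel from FastDiff's
# binarized dataset ("train_mel" is a placeholder for such a feature, loaded separately).
print("tacotron2 mel:", mels.min().item(), mels.max().item(), mels.mean().item())
# print("fastdiff mel:", train_mel.min(), train_mel.max(), train_mel.mean())
# If the ranges differ substantially (e.g. natural log vs. log10, or normalized vs. raw),
# the conditioning input must be converted to match the training features before sampling.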

Release of fine-tuned FastSpeech2 model

From the paper, it seems like a FastSpeech2 model was trained end-to-end in combination with the diffusion vocoder. Are you planning on releasing its weights as well?

This would be a super nice addition for the community ❤️

demo_tts.py missing packages & run.py KeyError

I was trying to run the repo on colab:

  • Inference for text-to-speech synthesis
  • Inference from wav file
    using the commands given in the ReadMe file, but I am facing some errors that I cannot get past.

  1. I used the default TTS command !python demo_tts.py but it returns an error:

     | models Trainable Parameters: 15.315M
     07/21 02:09:50 PM NumExpr defaulting to 2 threads.
     Traceback (most recent call last):
       File "/content/drive/MyDrive/FastDiff_main/FastDiff/tasks/base_task.py", line 39, in _get_data_loader
         value = getattr(self, attr_name)
       File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1208, in __getattr__
         type(self).__name__, name))
     AttributeError: 'FastDiffTask' object has no attribute '_lazy_test_dataloader'

       File "/content/drive/MyDrive/FastDiff_main/FastDiff/data_gen/tts/vocoder_binarizer.py", line 18, in <module>
         from vocoders.base_vocoder import get_vocoder_cls
       File "/content/drive/MyDrive/FastDiff_main/FastDiff/vocoders/__init__.py", line 1, in <module>
         from vocoders import pwg
     ImportError: cannot import name 'pwg' from 'vocoders' (/content/drive/MyDrive/FastDiff_main/FastDiff/vocoders/__init__.py)

     I searched for the origin of pwg but didn't find anything related.

  2. Inference from a wav file using the command !python run.py --config=='modules/FastDiff/config/FastDiff.yaml'--exp_name $trial --infer --hparams='test_input_dir=wavs,N=$2' (with a wavs folder created at FastDiff/wavs/), but it returns the following error:

     | libtmux load error.
     Traceback (most recent call last):
       File "run.py", line 35, in <module>
         set_hparams()
       File "/content/drive/MyDrive/FastDiff_main/FastDiff/utils/hparams.py", line 96, in set_hparams
         if v in ['True', 'False'] or type(config_node[k]) in [bool, list, dict]:
     KeyError: 'test_input_dir'

     Is there anything wrong with the command?
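
Two things stand out in the item-2 command, though neither is confirmed as the root cause: there is a doubled '=' after --config and no space before --exp_name, and $2 inside single quotes is not expanded by the shell. A hedged cleaned-up invocation (keeping the $trial placeholder and substituting an explicit N):

!python run.py --config='modules/FastDiff/config/FastDiff.yaml' --exp_name $trial --infer --hparams='test_input_dir=wavs,N=4'

Separately, the quoted line in utils/hparams.py indexes config_node[k], which suggests an --hparams override is only accepted for keys already present in the loaded config; if test_input_dir is absent from FastDiff.yaml (and its parent configs), that alone would produce the KeyError.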

No module named 'utils'

A stupid question maybe: I cloned the repo and ran a mel-spectrogram inference command, and it throws the following error:

python tasks/run.py --config modules/FastDiff/config/FastDiff.yaml --exp_name FastDiff --infer --hparams='test_mel_dir=mels,use_wav=False,N=4'

Traceback (most recent call last):
  File "/home/.../FastDiff/tasks/run.py", line 3, in <module>
    from utils.hparams import set_hparams, hparams
ModuleNotFoundError: No module named 'utils'

The other commands, like inference from wav files, also throw the same error.

Is utils/ddp_utils.py missing?

Dear Rongjie, thank you for sharing your code with the community, and congratulations on having your paper accepted.

When I try to follow the instructions for inference on wav files, I get the following error:

File "/home/ubuntu/FastDiff/utils/trainer.py", line 19, in <module>
    from utils.ddp_utils import DDP
ModuleNotFoundError: No module named 'utils.ddp_utils'

When I look in the utils/ directory, I see a `tts_utils.py` but no `ddp_utils.py`.  I don't find that file anywhere in this repo.

Is this supposed to be the same as [NATSpeech's ddp_utils](https://github.com/NATSpeech/NATSpeech/blob/f209f8410438bd73232ddc4997768e49ec2b1b84/utils/commons/ddp_utils.py)? 

Thanks.

Finetune on my own dataset

[screenshot]

Hello, I want to fine-tune the model with my own dataset. I'd like to understand how the data should be structured in the following folders: raw_data_dir, processed_data_dir, binary_data_dir.

Thanks
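
For reference, a hedged sketch of how these directories are usually organized in this codebase; the exact file names are assumptions that should be checked against the preprocessing scripts and config:

raw_data_dir/        # your raw corpus, e.g. LJSpeech-style wavs plus a transcript/metadata file
processed_data_dir/  # written by data_gen/tts/bin/pre_align.py from raw_data_dir
binary_data_dir/     # written by data_gen/tts/bin/binarize.py; the binarized features used for training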

Pre-defined noise schedule

Thank you for making your code publicly available!

I have a question about the pre-defined noise schedule. The paper says β = Linear(1e−4, 0.005, 1000) was used (Table 7). On the other hand, the default setting of this repository is β = Linear(1e−6, 0.01, 1000):

T: 1000
beta_0: 0.000001
beta_T: 0.01

Which setup is your recommendation?

To my understanding, the noise-predictor-derived noise schedule depends on a trained score network, which means it also depends on the pre-defined noise schedule. Is this correct?
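
For clarity, a minimal sketch of the two pre-defined schedules being compared, with values taken directly from this question:

import torch

# Paper, Table 7: beta = Linear(1e-4, 0.005, 1000)
beta_paper = torch.linspace(1e-4, 0.005, 1000)

# Repository default (config above): beta = Linear(1e-6, 0.01, 1000)
beta_repo = torch.linspace(1e-6, 0.01, 1000)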
