
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis


Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao

PyTorch Implementation of FastDiff (IJCAI'22): a conditional diffusion probabilistic model capable of generating high fidelity speech efficiently.


We provide our implementation and pretrained models as open source in this repository.

Visit our demo page for audio samples.

News

  • April 22, 2022: FastDiff was accepted by IJCAI 2022. The full version of the code (including pre-trained models, more datasets, and more neural vocoders) is expected to be released around the IJCAI 2022 conference (before July 2022). Please star us and stay tuned!
  • June 21, 2022: The LJSpeech checkpoint and demo code are provided.

Quick Start

We provide an example of how you can generate high-fidelity samples using FastDiff.

To try it on your own dataset, simply clone this repo to a local machine equipped with an NVIDIA GPU and CUDA/cuDNN, and follow the instructions below.

Supported Datasets and Pretrained Models

You can also use the pretrained models we provide. Details for each dataset are as follows:

Dataset  | Config                                         | Pretrained Model
LJSpeech | modules/FastDiff/config/FastDiff.yaml          | OneDrive
LibriTTS | modules/FastDiff/config/FastDiff_libritts.yaml | Coming Soon
VCTK     | modules/FastDiff/config/FastDiff_vctk.yaml     | Coming Soon

More supported datasets are coming soon.

Put the checkpoints in checkpoints/$your_experiment_name/model_ckpt_steps_*.ckpt
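For example, the LJSpeech checkpoint could be placed as follows (the experiment name FastDiff matches the demo below; the step count in the filename is illustrative and depends on the checkpoint you download):

# create the experiment folder and move the downloaded checkpoint into it
mkdir -p checkpoints/FastDiff
mv model_ckpt_steps_500000.ckpt checkpoints/FastDiff/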

Dependencies

See the requirements listed in requirement.txt.
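A minimal way to install them, assuming a Python environment with a CUDA-enabled PyTorch build already set up, is:

# install the dependencies listed in the repository's requirements file
pip install -r requirement.txt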

Multi-GPU

By default, this implementation uses as many GPUs in parallel as torch.cuda.device_count() returns. You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
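For example, to make only the first two GPUs visible during training (the config path and experiment name are placeholders, as in the training command further below):

# restrict PyTorch to GPUs 0 and 1 for this run
CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config $path/to/config --exp_name $your_experiment_name --reset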

Inference for text-to-speech synthesis

  1. Download the LJSpeech checkpoint and put it in checkpoints/FastDiff/model_ckpt_steps_*.ckpt
  2. Specify the input $text and an integer index $model_index to choose the TTS model: 0 (PortaSpeech, Ren et al.), 1 (FastSpeech 2, Ren et al.), or 2 (DiffSpeech, Liu et al.).
  3. Set N for reverse sampling, which is a trade-off between quality and speed.
  4. Run the following command.
CUDA_VISIBLE_DEVICES=$GPU python egs/demo_tts.py --N $N --text $text --model $model_index 
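A concrete invocation might look like this (the GPU id, sentence, sampling steps N=4, and model choice are illustrative):

# synthesize one sentence with FastSpeech 2 as the acoustic model and 4 reverse steps
CUDA_VISIBLE_DEVICES=0 python egs/demo_tts.py --N 4 --text "the quick brown fox jumps over the lazy dog" --model 1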

Generated wav files are saved in checkpoints/FastDiff/ by default.
Note: For better quality, it's recommended to finetune the FastDiff model.

Inference from wav file

  1. Make a wavs directory and copy the wav files to be processed into it.
  2. Set N for reverse sampling, which is a trade-off between quality and speed.
  3. Run the following command.
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config $path/to/config  --exp_name $your_experiment_name --infer --hparams='test_input_dir=wavs,N=$N'
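For instance, using the LJSpeech config from the table above and an experiment named FastDiff (the GPU id and N=4 are illustrative):

# copy input audio into wavs/ and resynthesize it with the pretrained LJSpeech vocoder
mkdir wavs && cp /path/to/your/*.wav wavs/
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config modules/FastDiff/config/FastDiff.yaml --exp_name FastDiff --infer --hparams='test_input_dir=wavs,N=4'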

Generated wav files are saved in checkpoints/$your_experiment_name/ by default.

Inference for end-to-end speech synthesis

  1. Make a mels directory and copy the generated mel-spectrogram files into it.
     You can generate mel-spectrograms using Tacotron 2, Glow-TTS, and so forth.
  2. Set N for reverse sampling, which is a trade-off between quality and speed.
  3. Run the following command.
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config $path/to/config --exp_name $your_experiment_name --infer --hparams='test_mel_dir=mels,use_wav=False,N=$N'
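As in the wav-file case, a concrete run might look like this (the GPU id, experiment name, and N=4 are illustrative; the config path comes from the table above):

# vocode pre-computed mel-spectrograms stored in mels/
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config modules/FastDiff/config/FastDiff.yaml --exp_name FastDiff --infer --hparams='test_mel_dir=mels,use_wav=False,N=4'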

Generated wav files are saved in checkpoints/$your_experiment_name/ by default.

Note: If you find the output wav noisy, it's likely because of the mel-preprocessing mismatch between the acoustic and vocoder models.

Train your own model

Data Preparation and Configuration

  1. Set raw_data_dir, processed_data_dir, and binary_data_dir in the config file.
  2. Download the dataset to raw_data_dir. Note: the dataset structure needs to follow egs/datasets/audio/*/pre_align.py, or you can rewrite pre_align.py according to your dataset.
  3. Preprocess the dataset.
# Preprocess step: unify the file structure.
python data_gen/tts/bin/pre_align.py --config $path/to/config
# Binarization step: Binarize data for fast IO.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config
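For LJSpeech, for example, these two steps could be run with the config from the table above (the GPU id is illustrative):

# unify the file structure, then binarize the data for the LJSpeech config
python data_gen/tts/bin/pre_align.py --config modules/FastDiff/config/FastDiff.yaml
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config modules/FastDiff/config/FastDiff.yaml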

Training the Refinement Network

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config $path/to/config  --exp_name $your_experiment_name --reset

Training the Noise Predictor Network

Coming Soon.

Noise Scheduling

Coming soon; in the meantime, you can use our pre-derived noise schedule.

Inference

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config $path/to/config  --exp_name $your_experiment_name --infer

Acknowledgements

This implementation uses parts of the code from the following GitHub repos: NATSpeech, Tacotron2, and DiffWave-Vocoder, as described in our code.

Citations

If you find this code useful in your research, please consider citing:

@article{huang2022fastdiff,
  title={FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis},
  author={Huang, Rongjie and Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong and Ren, Yi and Zhao, Zhou},
  journal={arXiv preprint arXiv:2204.09934},
  year={2022}
}

Disclaimer

This is not an officially supported Tencent product.
