keonlee9420 / diffgan-tts

PyTorch Implementation of DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

License: MIT License

Python 100.00%
text-to-speech deep-neural-networks pytorch tts speech-synthesis generative-model ddpm diffusion neural-tts non-autoregressive

diffgan-tts's Introduction

DiffGAN-TTS - PyTorch Implementation

PyTorch implementation of DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

Repository Status

  • Naive Version of DiffGAN-TTS
  • Active Shallow Diffusion Mechanism: DiffGAN-TTS (two-stage)

Audio Samples

Audio samples are available at /demo.

Quickstart

In the following, DATASET refers to the name of a dataset, such as LJSpeech or VCTK.

MODEL refers to the type of model (choose from 'naive', 'aux', 'shallow').

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You have to download the pretrained models and put them in

  • output/ckpt/DATASET_naive/ for the 'naive' model.
  • output/ckpt/DATASET_shallow/ for the 'shallow' model. Please note that the checkpoint of the 'shallow' model contains both the 'shallow' and 'aux' models, and these two models share all directories except results throughout the whole process.

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET

For a multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
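For example, a minimal sketch for looking up a speaker ID, assuming speakers.json maps speaker names to the integer IDs expected by --speaker_id:

    import json

    # speakers.json is written during preprocessing and maps speaker names to integer IDs.
    with open("preprocessed_data/VCTK/speakers.json") as f:
        speakers = json.load(f)

    print(sorted(speakers)[:5])      # inspect available speaker names, e.g. 'p225', 'p226', ...
    speaker_id = speakers["p225"]    # value to pass as --speaker_id (example VCTK speaker)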

Batch Inference

Batch inference is also supported; try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --model MODEL --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, you can increase the speaking rate by 20% and decrease the volume by 20% with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8
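Pitch can presumably be adjusted the same way; assuming the script also exposes a matching --pitch_control flag (an assumption based on the FastSpeech2-style interface, not verified here), raising the pitch by 20% would look like

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --pitch_control 1.2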

Please note that this controllability originates from FastSpeech2 and is not a primary focus of DiffGAN-TTS.

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.

  • VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.

Preprocessing

  • For multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.

  • Run

    python3 prepare_align.py --dataset DATASET
    

    to prepare the data for alignment.

    For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here; unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself (see the example command at the end of this section).

    After that, run the preprocessing script by

    python3 preprocess.py --dataset DATASET
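    If you run the aligner yourself (before the preprocessing script above), a typical MFA 2.x invocation looks roughly like the following; the lexicon path and acoustic-model name are placeholders rather than values taken from this repository:

    mfa align raw_data/DATASET lexicon/librispeech-lexicon.txt english_us_arpa preprocessed_data/DATASET/TextGrid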
    

Training

You can train three types of model: 'naive', 'aux', and 'shallow'.

  • Training Naive Version ('naive'):

    Train the naive version with

    python3 train.py --model naive --dataset DATASET
    
  • Training Basic Acoustic Model for Shallow Version ('aux'):

    To train the shallow version, we need a pre-trained FastSpeech2. The command below trains the FastSpeech2 modules, including the auxiliary (mel) decoder.

    python3 train.py --model aux --dataset DATASET
    
  • Training Shallow Version ('shallow'):

    To leverage the pre-trained FastSpeech2, including the auxiliary (mel) decoder, you must pass --restore_step with the final step of the auxiliary FastSpeech2 training, as in the following command.

    python3 train.py --model shallow --restore_step RESTORE_STEP --dataset DATASET
    

    For example, if the last checkpoint was saved at 200000 steps during the auxiliary training, set --restore_step to 200000. The script will then load and freeze the aux model and continue training under the active shallow diffusion mechanism.
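    Continuing this example, the command would be

    python3 train.py --model shallow --restore_step 200000 --dataset DATASET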

TensorBoard

Use

tensorboard --logdir output/log/DATASET

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

[Figures: Naive Diffusion / Shallow Diffusion]

Notes

  • In addition to the Diffusion Decoder, the Variance Adaptor is also conditioned on speaker information.
  • The unconditional and conditional outputs of the JCU discriminator are averaged in each loss calculation, as in VocGAN.
  • Some differences in data and preprocessing compared to the original paper:
    • VCTK (109 speakers) is used instead of the 228-speaker Mandarin Chinese corpus.
    • DiffSpeech's audio config is followed, e.g., a sample rate of 22,050 Hz rather than 24,000 Hz.
    • DiffSpeech's variance extraction and modeling are also followed.
  • lambda_fm is fixed to a constant scalar, since the dynamically scaled value computed as L_recon/L_fm makes the model explode.
  • There are two speaker-embedding options for the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER does). You can toggle between them in the config by choosing 'none' or 'DeepSpeaker' (see the config sketch after this list).
  • DeepSpeaker on the VCTK dataset shows clear identification among speakers; a t-SNE plot of the extracted speaker embeddings (figure not reproduced here) illustrates this.
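A sketch of the corresponding config entry; the file location and key name here are assumptions and may differ in this repository:

    # preprocess.yaml (assumed file and key names)
    preprocessing:
      speaker_embedder: "DeepSpeaker"   # or "none" to train a speaker embedding table from scratch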

Citation

Please cite this repository using the "Cite this repository" button in the About section (top right of the main page).


diffgan-tts's People

Contributors

keonlee9420

diffgan-tts's Issues

ERROR

File "train.py", line 320, in 3.24s/it]
main(args, configs)
File "train.py", line 196, in main
figs, wav_reconstruction, wav_prediction, tag = synth_one_sample(
File "/data/workspace/liukaiyang/TTS/DiffGAN-TTS-main/utils/tools.py", line 227, in synth_one_sample
mels = [mel_pred[0, :mel_len].float().detach().transpose(0, 1) for mel_pred in diffusion.sampling()]
File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/data/workspace/liukaiyang/TTS/DiffGAN-TTS-main/model/diffusion.py", line 157, in sampling
b, *_, device = *self.cond.shape, self.cond.device
File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'GaussianDiffusion' object has no attribute 'cond'

Thanks for your work! I seem to get an error…

Implementation performance

Hi,
thank you very much for your great work!

I was wondering whether you conducted any evaluations of model performance and voice quality for the multi-speaker results, e.g. MOS or sMOS?

After listening to the demos you provided, I found that the generated voices for speakers p257 and p250 are quite similar. (I suppose p250-265.wav and p257-243.wav come from different speakers.)
p250
demo/VCTK/shallow_diffusion_400k/demo_VCTK_shallow_diffusion_400k_p250-265.wav

p257
demo/VCTK/shallow_diffusion_400k/demo_VCTK_shallow_diffusion_400k_p257-243.wav

Could you please give me a hint?

Thanks!

Can I ask you some questions about mel-spectrogram?

Hi @keonlee9420, I have some questions about the mel-spectrogram shown in the attached picture.
The alignment in the mel-spectrogram above has formed, but the horizontal details have not emerged yet. What do you think is causing this?

'GaussianDiffusion' object has no attribute 'cond' when training with multi-GPU

File "train.py", line 320, in 3.24s/it]
main(args, configs)
File "train.py", line 196, in main
figs, wav_reconstruction, wav_prediction, tag = synth_one_sample(
File "/data/workspace/liukaiyang/TTS/DiffGAN-TTS-main/utils/tools.py", line 227, in synth_one_sample
mels = [mel_pred[0, :mel_len].float().detach().transpose(0, 1) for mel_pred in diffusion.sampling()]
File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/data/workspace/liukaiyang/TTS/DiffGAN-TTS-main/model/diffusion.py", line 157, in sampling
b, *_, device = *self.cond.shape, self.cond.device
File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'GaussianDiffusion' object has no attribute 'cond'

Hi,
thank you very much for your great work!

The model runs well on a single GPU but encounters the above problem when training with multiple GPUs.

The problem only arises in the validation phase, since sampling() requires the cond parameter, but self.cond is not defined during validation.

This was reported in a previous issue, but without any solution.

Could you please give some hints?

About question of code and synthesis

Hi @keonlee9420, thank you for your suggestions these days. I have successfully integrated PortaSpeech on top of this model. I have a few questions for you, thank you!

  1. In DiffGAN-TTS, get_mask_from_lengths returns mask, while in PortaSpeech it returns ~mask. I would like to know the difference between them.
  2. In DiffGAN-TTS, what is the ~ doing in def diffuse_trace(self, x_start, mask)? In my integrated model, get_mask_from_lengths returns ~mask.
    If I delete the ~ in diffuse_trace, the synthesized mel is wrong and the audio sounds like running water, while if I keep the ~ in diffuse_trace, the mel is also wrong and the audio sounds robotic. (See the mask sketch at the end of this issue.)
    Thank you very much!
  • Deng Yan
  • 2022.5.9
  • GuangXi University
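A minimal sketch of the two masking conventions discussed above, using a hypothetical get_mask_from_lengths helper; which convention a codebase follows determines whether the ~ is needed when zeroing out padded frames:

    import torch

    def get_mask_from_lengths(lengths, max_len=None):
        # Hypothetical helper following the convention described for DiffGAN-TTS above:
        # True marks PADDED positions.
        if max_len is None:
            max_len = int(lengths.max().item())
        ids = torch.arange(max_len, device=lengths.device).unsqueeze(0)  # (1, T)
        return ids >= lengths.unsqueeze(1)                               # (B, T)

    lengths = torch.tensor([3, 5])
    pad_mask = get_mask_from_lengths(lengths)   # True = padding (DiffGAN-TTS-style, per the question)
    valid_mask = ~pad_mask                      # True = real frames (PortaSpeech-style, per the question)

    x = torch.randn(2, 5, 80)                                 # (batch, frames, mel bins)
    x_masked = x.masked_fill(pad_mask.unsqueeze(-1), 0.0)     # zero padding with a padding mask
    x_masked_alt = x * valid_mask.unsqueeze(-1)               # same result with a validity mask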

What do the mlp and Mish functions in modules.py do?

import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish activation: a smooth, non-monotonic alternative to ReLU.
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

# From modules.py: an MLP that expands to residual_channels * 4, applies Mish, and
# projects back to residual_channels (LinearNorm is the repository's linear-layer wrapper).
self.mlp = nn.Sequential(
    LinearNorm(residual_channels, residual_channels * 4),
    Mish(),
    LinearNorm(residual_channels * 4, residual_channels),
)

TypeError: 'NoneType' object is not subscriptable

Traceback (most recent call last):
  File "train.py", line 307, in <module>
    main(args, configs)
  File "train.py", line 99, in main
    output = model(*(batch[2:]))
TypeError: 'NoneType' object is not subscriptable

How can I solve this problem? Thank You!

Checkpoints for Mandarin

Dear Keon Lee,

I am a research assistant at the City University of Hong Kong, I currently conduct research related to neurolinguistics and appreciate your work about text to speech generation. I am wondering that could you share the pretrained checkpoints for mandarin datasets, so I could continue my work based on your results. Many thanks
Honghao

On Input Output Convolutional Mismatch during Training

I encounter a problem when training reaches validation, which happens at 1000 steps.

Traceback (most recent call last):
  File "train.py", line 321, in <module>
    main(args, configs)
  File "train.py", line 196, in main
    figs, wav_reconstruction, wav_prediction, tag = synth_one_sample(
  File "/home/wxk/diff/DiffGAN-TTS-main/utils/tools.py", line 227, in synth_one_sample
    mels = [mel_pred[0, :mel_len].float().detach().transpose(0, 1) for mel_pred in diffusion.sampling(cond=cond)]
  File "/home/wxk/anaconda3/envs/diffgan/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wxk/diff/DiffGAN-TTS-main/model/diffusion.py", line 162, in sampling
    x = self.p_sample(xs[-1], torch.full((b,), i, device=device, dtype=torch.long), cond, spk_emb)
  File "/home/wxk/anaconda3/envs/diffgan/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wxk/diff/DiffGAN-TTS-main/model/diffusion.py", line 124, in p_sample
    x_0_pred = self.denoise_fn(x_t, t, cond, spk_emb)
  File "/home/wxk/anaconda3/envs/diffgan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wxk/diff/DiffGAN-TTS-main/model/modules.py", line 618, in forward
    x, skip_connection = layer(x, conditioner, diffusion_step, speaker_emb)
  File "/home/wxk/anaconda3/envs/diffgan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wxk/diff/DiffGAN-TTS-main/model/blocks.py", line 670, in forward
    conditioner = self.conditioner_projection(conditioner)
  File "/home/wxk/anaconda3/envs/diffgan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wxk/diff/DiffGAN-TTS-main/model/blocks.py", line 191, in forward
    conv_signal = self.conv(signal)
  File "/home/wxk/anaconda3/envs/diffgan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wxk/anaconda3/envs/diffgan/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 307, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/wxk/anaconda3/envs/diffgan/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 303, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [256, 256, 1], expected input[32, 839, 256] to have 256 channels, but got 839 channels instead

How should we solve this?

License Issue

Hi @keonlee9420, this software depends on praat-parselmouth which is GPL-licensed, which means all software that depends on it must also be GPL-licensed. Might it be possible to switch to DeepPhonemizer, licensed under the MIT license? Thanks in advance!

EDIT: This package also uses unidecode, which is also GPL-licensed. Might it be possible to switch to text-unidecode? Thanks!

Issues with Audio Quality for Longer Text Inputs Using VCTK Pretrained Model

Hello @keonlee9420,

I've been working with the VCTK pretrained model provided in the GitHub repository and encountered some issues regarding the audio quality for longer text inputs. While the initial few seconds of the generated audio (approximately the first 3 seconds) are of high quality, the audio quality noticeably drops or becomes unnatural after this point. This occurs regardless of whether I use the naive, aux, or shallow methods.

In the paper "DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs", a dataset of 228 Mandarin Chinese speakers with over 200 hours of speech data was used, but the GitHub implementation utilizes the VCTK dataset, which has around 44 hours of speech data. I'm curious if the difference in the dataset, both in terms of the length of individual speech samples and the total volume of data, might be influencing the quality of the generated audio, especially for longer text inputs.

I have a couple of specific questions:

  1. Are there differences in the average length of speech samples between the Mandarin dataset used in the paper and the VCTK dataset used in the implementation?
  2. Are there any inherent limitations in the model regarding the length of text input for maintaining high-quality audio synthesis?

I suspect that the discrepancies between the datasets might be impacting the model's performance, particularly for longer inputs. Any insights or suggestions would be greatly appreciated.

Thank you for your work on this project and for any assistance you can provide.

process data

Hi, when I use the VCTK dataset, preprocessing fails with "UnboundLocalError: local variable 'f0' referenced before assignment",
but using LJSpeech is fine. By the way, when I train the naive model on LJSpeech, I get "'GaussianDiffusion' object has no attribute 'cond'"; is this caused by the torch version?

Why minimize l1(\hat{x_0}, x_0) + l1(\hat{x_1}, x_0) when optimizing the aux model?

Hi keonlee,
Thanks for sharing the code!
I found that when training the aux model, we get \hat{x_0} from G, then diffuse it to \hat{x_1}, and finally obtain a prediction list [\hat{x_0}, \hat{x_1}]. When calculating the mel loss, the L1 loss of both predictions against the target is added. This confuses me: I understand l1(x_0, \hat{x_0}), but why not l1(x_1, \hat{x_1})?
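In this notation, the mel loss being asked about is

    L_mel = l1(x_0, \hat{x_0}) + l1(x_0, \hat{x_1})

whereas the expected alternative would be

    L_mel' = l1(x_0, \hat{x_0}) + l1(x_1, \hat{x_1})

where x_1 denotes the ground-truth mel diffused by the same step used to obtain \hat{x_1}.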

about the preprocessed data VCTK

I encountered some problems with the VCTK dataset again. I followed the process but got "UnboundLocalError: local variable 'f0' referenced before assignment". I wonder if it would be possible to package the preprocessed_data for VCTK and send it to [email protected]

VCTK generation fails

Hello, thank you very much for your brilliant open-source project. I have been able to do single and batch generations using the LJSpeech dataset. However, when I try to replicate the results for the VCTK dataset, it fails.

I run the following command,
!python3 synthesize.py --text "Hello World" --model naive --restore_step 300000 --mode single --dataset VCTK

I obtain the following output:

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.

==================================== Inference Configuration ====================================
 ---> Type of Modeling: naive
 ---> Total Batch Size: 32
 ---> Path of ckpt: ./output/ckpt/VCTK_naive
 ---> Path of log: ./output/log/VCTK_naive
 ---> Path of result: ./output/result/VCTK_naive
================================================================================================
Removing weight norm...
Traceback (most recent call last):
  File "synthesize.py", line 264, in <module>
    )) if load_spker_embed else None
  File "/usr/local/lib/python3.7/dist-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: './preprocessed_data/VCTK/spker_embed/p225-spker_embed.npy' 

I tried to investigate further and discovered that the specific speaker embedding folder and file did not exist in my directory. Any pointer to how I can solve the issue will be appreciated.

stft

Hello, thank you very much for the open-source project.
I ran into a problem: the model converged during training, and the generated mel-spectrogram looks very good, but when I feed the mel-spectrogram into my own HiFi-GAN vocoder, the resulting wav is just murmuring. I am sure the vocoder's sample rate, hop length, and window length are consistent with the DiffGAN model, so I guess the problem lies in how the audio is converted into a mel-spectrogram. I noticed that you use pytorch-stft for this, which gives quite different results from librosa.stft?

About preprocess

Hi, I want to run "python3 preprocess.py --dataset VCTK" after "python3 prepare_align.py --dataset VCTK",
but in
./preprocessor/preprocessor.py
line 115: tg_path = os.path.join(self.out_dir, "TextGrid", speaker, "{}.TextGrid".format(basename))
I cannot find any file named "*.TextGrid". I would like to know when it is created.

After the step "python3 prepare_align.py --dataset VCTK" I only get files named ".lab" and ".wav", and no ".TextGrid" files.

Thanks

About DiffSVC

Hello, sorry for bothering you.
Have you been in contact with DiffSVC? I saw a DiffSVC codebase that is similar to yours, but it is incomplete.

Python and dependency versions

Hello! Could you update the docs to note that python==3.8 makes things easier (praat-parselmouth only distributes binaries for Python 3.8 and I couldn't get it to compile), that numba==0.49 is required (numba.decorators was removed in 0.50) along with resampy==0.3.1 (to match numba==0.49), and add python_speech_features, pandas, and tensorflow to requirements.txt?
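A sketch of the pins and additions described above, with version numbers as given in this issue (not independently verified):

    # proposed additions to requirements.txt, per this issue
    numba==0.49
    resampy==0.3.1
    python_speech_features
    pandas
    tensorflow

These would be used together with a Python 3.8 environment, as noted above for praat-parselmouth.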

Some of the problems that occur in training

Hi @keonlee9420, I encountered some problems during the training stage. The loss occasionally fluctuates a lot during training, sometimes jumping from around 3 to tens or hundreds. After I enabled shuffling of the training set, the problem sometimes appears and sometimes does not. I encountered this in the naive, aux, and shallow stages. Thank you, my friend! Best wishes to you!
