
License: MIT License

Language: Python 100.0%

Topics: artificial-intelligence, deep-learning, singing-synthesis, speech-synthesis, latent-diffusion, residual-vector-quantization, zero-shot

naturalspeech2-pytorch's Introduction

Natural Speech 2 - Pytorch (wip)

Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch

NaturalSpeech 2 is a TTS system that leverages a neural audio codec with continuous latent vectors and a latent diffusion model with non-autoregressive generation to enable natural and zero-shot text-to-speech synthesis.

This repository will use denoising diffusion rather than a score-based SDE, and may also offer an elucidated version. It will also offer improvements for the attention / transformer components wherever applicable.

Appreciation

  • Stability and 🤗 Huggingface for their generous sponsorships to work on and open source cutting edge artificial intelligence research

  • 馃 Huggingface for the amazing accelerate library

  • Manmay for submitting the initial code for phoneme, pitch, duration, and speech prompt encoders as well as the multilingual phonemizer and phoneme aligner!

  • Manmay for wiring up the complete end-to-end conditioning of the diffusion network!

  • You? If you are an aspiring ML / AI engineer or work in the TTS field and would like to contribute to open-sourcing the state of the art, jump right in!

Install

$ pip install naturalspeech2-pytorch

Usage

import torch
from naturalspeech2_pytorch import (
    EncodecWrapper,
    Model,
    NaturalSpeech2
)

# use encodec as an example

codec = EncodecWrapper()

model = Model(
    dim = 128,
    depth = 6
)

# natural speech diffusion model

diffusion = NaturalSpeech2(
    model = model,
    codec = codec,
    timesteps = 1000
).cuda()

# mock raw audio data

raw_audio = torch.randn(4, 327680).cuda()

loss = diffusion(raw_audio)
loss.backward()

# do the above in a loop for a lot of raw audio data...
# then you can sample from your generative model like so

generated_audio = diffusion.sample(length = 1024) # (1, 327680)
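
To listen to the result, the sampled tensor can be written straight to disk with torchaudio; a small sketch, assuming the codec operates at Encodec's 24 kHz sample rate:

import torchaudio

# generated_audio has shape (1, 327680): one channel, roughly 13.7 seconds at an assumed 24 kHz
torchaudio.save('generated.wav', generated_audio.cpu(), sample_rate = 24000)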

With conditioning

ex.

import torch
from naturalspeech2_pytorch import (
    EncodecWrapper,
    Model,
    NaturalSpeech2,
    SpeechPromptEncoder
)

# use encodec as an example

codec = EncodecWrapper()

model = Model(
    dim = 128,
    depth = 6,
    dim_prompt = 512,
    cond_drop_prob = 0.25,                  # dropout prompt conditioning with this probability, for classifier free guidance
    condition_on_prompt = True
)

# natural speech diffusion model

diffusion = NaturalSpeech2(
    model = model,
    codec = codec,
    timesteps = 1000
)

# mock raw audio data

raw_audio = torch.randn(4, 327680)
prompt = torch.randn(4, 32768)               # they randomly excised a range on the audio for the prompt during training, eventually will take care of this auto-magically

text = torch.randint(0, 100, (4, 100))
text_lens = torch.tensor([100, 50, 80, 100])

# forwards and backwards

loss = diffusion(
    audio = raw_audio,
    text = text,
    text_lens = text_lens,
    prompt = prompt
)

loss.backward()

# after much training

generated_audio = diffusion.sample(
    length = 1024,
    text = text,
    prompt = prompt
) # (1, 327680)

Or, if you want a Trainer class to take care of the training and sampling loop, simply do

from naturalspeech2_pytorch import Trainer

trainer = Trainer(
    diffusion_model = diffusion,     # diffusion model + codec from above
    folder = '/path/to/speech',
    train_batch_size = 16,
    gradient_accumulate_every = 2,
)

trainer.train()

Todo

  • complete perceiver then cross attention conditioning on ddpm side

  • add classifier free guidance, even if not in paper

  • complete duration / pitch prediction during training - thanks to Manmay

  • make sure pyworld way of computing pitch can also work

  • consult phd student in TTS field about pyworld usage

  • also offer direct summation conditioning using spear-tts text-to-semantic module, if available

  • add self-conditioning on ddpm side

  • take care of automatic slicing of audio for prompt, being aware of minimal audio segment as allowed by the codec model

  • make sure curtail_from_left works for encodec, figure out what they are doing

Citations

@inproceedings{Shen2023NaturalSpeech2L,
    title   = {NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers},
    author  = {Kai Shen and Zeqian Ju and Xu Tan and Yanqing Liu and Yichong Leng and Lei He and Tao Qin and Sheng Zhao and Jiang Bian},
    year    = {2023}
}
@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    url     = {https://arxiv.org/abs/2002.05202}
}
@inproceedings{dao2022flashattention,
    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author  = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year    = {2022}
}
@article{Salimans2022ProgressiveDF,
    title   = {Progressive Distillation for Fast Sampling of Diffusion Models},
    author  = {Tim Salimans and Jonathan Ho},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2202.00512}
}
@inproceedings{Hang2023EfficientDT,
    title   = {Efficient Diffusion Training via Min-SNR Weighting Strategy},
    author  = {Tiankai Hang and Shuyang Gu and Chen Li and Jianmin Bao and Dong Chen and Han Hu and Xin Geng and Baining Guo},
    year    = {2023}
}
@article{Alayrac2022FlamingoAV,
    title   = {Flamingo: a Visual Language Model for Few-Shot Learning},
    author  = {Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan},
    journal  = {ArXiv},
    year     = {2022},
    volume   = {abs/2204.14198}
}
@article{Badlani2021OneTA,
    title   = {One TTS Alignment to Rule Them All},
    author  = {Rohan Badlani and Adrian Lancucki and Kevin J. Shih and Rafael Valle and Wei Ping and Bryan Catanzaro},
    journal = {ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year    = {2021},
    pages   = {6092-6096},
    url     = {https://api.semanticscholar.org/CorpusID:237277973}
}

naturalspeech2-pytorch's People

Contributors

amiasato, lucidrains, manmay-nakhashi, p0p4k


naturalspeech2-pytorch's Issues

About result

Good work! Did you train it yourself? How did the results turn out?

Question about generate speech

heya my lad,
I have run your Usage code to generate speech, but nothing was output.
Could you please give a full example that accepts a prompt and text and then generates the speech?
Or give some simple steps to reach this goal.

WaveNet

@lucidrains
I read the paper. Section 4.2 says: "Specifically, we use a FiLM layer [38] at every 3 WaveNet layers to fuse the condition information processed by the second Q-K-V attention in the prompting mechanism in the diffusion model."

But your model applies the FiLM layer in every layer:

class WavenetResBlock(nn.Module):
    def __init__(
        self,
        dim,
        *,
        dilation,
        kernel_size = 3,
        skip_conv = False,
        dim_cond_mult = None
    ):
        super().__init__()

        self.cond = exists(dim_cond_mult)
        self.to_time_cond = None

        if self.cond:
            self.to_time_cond = nn.Linear(dim * dim_cond_mult, dim * 2)

        self.conv = CausalConv1d(dim, dim, kernel_size, dilation = dilation)
        self.res_conv = CausalConv1d(dim, dim, 1)
        self.skip_conv = CausalConv1d(dim, dim, 1) if skip_conv else None

    def forward(self, x, t = None):

        if self.cond:
            assert exists(t)
            t = self.to_time_cond(t)
            t = rearrange(t, 'b c -> b c 1')
            t_gamma, t_beta = t.chunk(2, dim = -2)

        res = self.res_conv(x)

        x = self.conv(x)

        if self.cond:  # need to apply this only every 3 layers (layer % 3)
            x = x * t_gamma + t_beta

        x = x.tanh() * x.sigmoid()

        x = x + res

        skip = None
        if exists(self.skip_conv):
            skip = self.skip_conv(x)

        return x, skip

Is this an oversight, or is it a specific trick needed to make the WaveNet work?
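
For what it's worth, here is a minimal sketch of how the conditioning could be restricted to every third WaveNet layer, as the paper describes. This is not the repository's code: layer_index is a hypothetical extra argument, and CausalConv1d / exists / rearrange are assumed to be the same helpers used in the block above.

import torch.nn as nn

class WavenetResBlockEvery3(nn.Module):
    def __init__(self, dim, *, dilation, layer_index, kernel_size = 3, dim_cond_mult = None):
        super().__init__()
        # FiLM only on every 3rd layer, and only when conditioning is present
        self.use_film = (dim_cond_mult is not None) and (layer_index % 3 == 2)
        self.to_time_cond = nn.Linear(dim * dim_cond_mult, dim * 2) if self.use_film else None

        self.conv = CausalConv1d(dim, dim, kernel_size, dilation = dilation)
        self.res_conv = CausalConv1d(dim, dim, 1)

    def forward(self, x, t = None):
        res = self.res_conv(x)
        x = self.conv(x)

        if self.use_film:
            assert exists(t)
            cond = rearrange(self.to_time_cond(t), 'b c -> b c 1')
            gamma, beta = cond.chunk(2, dim = -2)
            x = x * gamma + beta          # FiLM modulation, applied on every 3rd layer only

        x = x.tanh() * x.sigmoid()
        return x + res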

loss.backward()??

In Usage:
loss = diffusion(raw_audio)
loss.backward()
Thank you for your work, very nice! And I'm sorry, as a newbie, I have to ask two stupid questions:

  1. Where does this backward() go? I didn't find any follow-up to it (see the sketch below), which leads to my second question.
  2. Is naturalspeech2 a model or a method? I don't know how to train it, or whether I even need to train it.
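
For context, loss.backward() only computes gradients; a training loop then steps an optimizer. A minimal sketch around the Usage example, assuming a standard Adam optimizer and a dataloader of raw waveform batches (the Trainer class shown further below does this for you):

import torch
from torch.optim import Adam

# `diffusion` as constructed in the Usage section above
optimizer = Adam(diffusion.parameters(), lr = 3e-4)   # learning rate is an assumption

for raw_audio in dataloader:                          # dataloader of raw waveform batches (assumed)
    loss = diffusion(raw_audio)
    loss.backward()                                   # compute gradients
    optimizer.step()                                  # update the model
    optimizer.zero_grad()                             # reset gradients for the next step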

How to generate song with my voice?

Great work.
I have some song audio files (wav), and I have recorded my own voice.
I want to know how to generate these songs in my voice.
Thanks!

unconditional version seems not to work correctly

Hi, thanks for the great job!

I've followed the training process below, using pure audio without any conditional input.

I've tried a dataset with 110,000+ audio clips and also a dataset with only 500 clips, but after training for 100,000 iterations the output from the sampling pipeline with the trained network was meaningless.

Could you please tell me if there is anything wrong, or have you trained anything meaningful yourself?

Thanks.

from naturalspeech2_pytorch import Trainer

trainer = Trainer(
    diffusion_model = diffusion,     # diffusion model + codec from above
    folder = '/path/to/speech',
    train_batch_size = 16,
    gradient_accumulate_every = 2,
)

trainer.train()

Will this work continue?

This is an impressive replication effort, but I haven't seen any commits in the past two months. Will it continue?

Trainer support for audio file, prompt pairs

Most of my data is split into file.wav and file.txt pairs, or into JSON files with "path/to/file.wav": "the transcription of the audio" mappings. It looks like the Trainer only supports audio files. Is there a way to get prompt/transcription support?
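
One rough workaround, until the Trainer grows transcription support, is to drive the conditional forward pass yourself from (file.wav, file.txt) pairs. A sketch, assuming torchaudio for loading and a hypothetical tokenize() stand-in for whatever text / phoneme tokenizer you use; the prompt excision here is a crude placeholder:

# hypothetical manual loop over (file.wav, file.txt) pairs - the Trainer does not take
# transcriptions, so this drives the conditional forward pass of `diffusion` (built as
# in the "With conditioning" example above) directly

import glob
import torch
import torchaudio

wav_paths = sorted(glob.glob('/path/to/speech/*.wav'))

for wav_path in wav_paths:
    audio, sr = torchaudio.load(wav_path)           # (channels, samples)
    audio = audio.mean(dim = 0, keepdim = True)     # mono -> (1, samples); crop / pad to a length your codec accepts

    with open(wav_path.replace('.wav', '.txt')) as f:
        text_ids = tokenize(f.read())               # (1, text_len) long tensor from your tokenizer (hypothetical)

    prompt = audio[:, :32768]                       # crude stand-in for random prompt excision

    loss = diffusion(
        audio = audio,
        text = text_ids,
        text_lens = torch.tensor([text_ids.shape[-1]]),
        prompt = prompt
    )
    loss.backward()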

Some issues about implementing DurationPitchPredictor

The Duration/Pitch Predictor described in the NaturalSpeech 2 paper has been preliminarily implemented in the code. Here are three small questions:

  1. The paper states that there is one attention layer for every three convolutional layers, but in the code every convolution has an attention. Is this still a work in progress, or set up this way intentionally?

  2. The paper uses Layer Normalization, while the code uses RMSNorm. Does this work better?

  3. The paper does not mention residual connections around the convolutional part, and the code does not use them either. Could 30 layers be too deep without residuals, leading to vanishing gradients?

Thank you very much for your contribution!

how to run this repo?

Hello,
A very basic question, but how can I run this project locally?
Even a very general description would help.
Thanks

Discuss details regarding loss function

Hi @lucidrains,

[screenshot of the loss terms from the paper]
Needed a little clarification on the 2nd loss term. As per the authors, the denoiser model predicts $\hat{z}_0$ rather than the score, so we need to compute the score for the 2nd loss term; per the paper, the formula is the following:

pred score = $\lambda^{-1} (\hat{z}_0 - z_t)$

$\hat{z}_0$ is the output of the denoiser model and $z_t$ is the noisy input, but we need to calculate $\lambda$, which is the variance of the $p(z_t \mid z_0)$ distribution.
So as per your code:
$\lambda$ = sigma from
https://github.com/lucidrains/naturalspeech2-pytorch/blob/900581e52534cb3451b4f2715bf8ffa6466c84be/naturalspeech2_pytorch/naturalspeech2_pytorch.py#L1140
and the second loss term would be:

score = (pred - noised_audio) / sigma
score_loss = F.mse_loss(score, noise, reduction = 'none')
score_loss = reduce(score_loss, 'b ... -> b', 'mean')

So am I calculating $\lambda$ and the score loss correctly, or am I missing something?

Thanks
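
For reference, the terms above collected into one function (this just restates the question's own math, with sigma standing in for $\lambda$; whether that matches the author's intent is exactly what is being asked):

import torch.nn.functional as F
from einops import reduce

def score_loss_term(pred, noised_audio, noise, sigma):
    # pred = denoiser output z0_hat, noised_audio = z_t, sigma = assumed stand-in for lambda
    score = (pred - noised_audio) / sigma                  # predicted score, per the formula above
    loss = F.mse_loss(score, noise, reduction = 'none')    # compare against the noise target
    return reduce(loss, 'b ... -> b', 'mean')              # mean over all but the batch dimension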

example model

Thanks for your work.
Are there any pre-trained models (.pt) for this project? If not, do you plan to provide them?
It would be great if there were models available to check how this project works.

Model study questions

Please let me know the format of the data needed to train the model, and provide a detailed guide for training it. Thank you.

Torch, CUDA version

Hi,

Thanks for your work.

I am trying to run the training as mentioned in the Readme file. I am getting this error:
RuntimeError: GET was unable to find an engine to execute this computation

I am using a Tesla V100 GPU.

Could you clarify which Torch / CUDA versions you used for your training?

ValueError: common is not allowed

Reproducer:

import torch
from naturalspeech2_pytorch import (
    EncodecWrapper,
    Model,
    NaturalSpeech2,
    SpeechPromptEncoder
)
Traceback (most recent call last):
  File "/p/i/tts/ns2.py", line 2, in <module>
    from naturalspeech2_pytorch import (
  File "/home/i/.local/lib/python3.11/site-packages/naturalspeech2_pytorch/__init__.py", line 8, in <module>
    from naturalspeech2_pytorch.naturalspeech2_pytorch import (
  File "/home/i/.local/lib/python3.11/site-packages/naturalspeech2_pytorch/naturalspeech2_pytorch.py", line 22, in <module>
    from audiolm_pytorch import SoundStream, EncodecWrapper
  File "/home/i/.local/lib/python3.11/site-packages/audiolm_pytorch/__init__.py", line 8, in <module>
    from audiolm_pytorch.audiolm_pytorch import AudioLM
  File "/home/i/.local/lib/python3.11/site-packages/audiolm_pytorch/audiolm_pytorch.py", line 16, in <module>
    from audiolm_pytorch.vq_wav2vec import FairseqVQWav2Vec
  File "/home/i/.local/lib/python3.11/site-packages/audiolm_pytorch/vq_wav2vec.py", line 7, in <module>
    import fairseq
  File "/home/i/.local/lib/python3.11/site-packages/fairseq/__init__.py", line 20, in <module>
    from fairseq.distributed import utils as distributed_utils
  File "/home/i/.local/lib/python3.11/site-packages/fairseq/distributed/__init__.py", line 7, in <module>
    from .fully_sharded_data_parallel import (
  File "/home/i/.local/lib/python3.11/site-packages/fairseq/distributed/fully_sharded_data_parallel.py", line 10, in <module>
    from fairseq.dataclass.configs import DistributedTrainingConfig
  File "/home/i/.local/lib/python3.11/site-packages/fairseq/dataclass/__init__.py", line 6, in <module>
    from .configs import FairseqDataclass
  File "/home/i/.local/lib/python3.11/site-packages/fairseq/dataclass/configs.py", line 1104, in <module>
    @dataclass
     ^^^^^^^^^
  File "/usr/lib64/python3.11/dataclasses.py", line 1230, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "/usr/lib64/python3.11/dataclasses.py", line 1220, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/dataclasses.py", line 958, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/dataclasses.py", line 815, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'fairseq.dataclass.configs.CommonEvalConfig'> for field common is not allowed: use default_factory

Process finished with exit code 1

multiple GPU?

[screenshot of the error]
Hi, @lucidrains
I found that at step=1000 the system saves the .pt file automatically and outputs two audio files at the same time, but at that point a single GPU always reports the error above. I plan to use multiple GPUs, but you don't seem to document how to use them. Can you show how?
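
Not documented in the README, but the repository credits the 🤗 accelerate library and lucidrains' Trainer classes typically build on it, so multi-GPU runs would usually be launched through accelerate's CLI rather than by changing the training code. A sketch, assuming your training script is train.py:

$ accelerate config          # one-time: answer the prompts and select multi-GPU
$ accelerate launch train.py # runs the same script across the configured GPUs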

Generate speech from Text

Hello, I am currently looking into this code and have a question:
Is there a specific place where I can put in the desired text to synthesize speech from it?

What I see so far is only a process of random sampling.

Thank you

@wonwooo

Can you provide the training code for that model?


import torch

from naturalspeech2_pytorch import Trainer, EncodecWrapper, Model, NaturalSpeech2, SpeechPromptEncoder

codec = EncodecWrapper()

def main():
    model = Model(
        dim = 128,
        depth = 6,
        dim_prompt = 512,
        cond_drop_prob = 0.25,
        condition_on_prompt = True
    )

    diffusion = NaturalSpeech2(
        model = model,
        codec = codec,
        timesteps = 50
    )

    raw_audio = torch.randn(4, 327680)
    prompt = torch.randn(4, 32768)

    text = torch.randint(0, 100, (4, 100))
    text_lens = torch.tensor([100, 50, 80, 100])

    # forwards and backwards

    loss = diffusion(
        audio = raw_audio,
        text = text,
        text_lens = text_lens,
        prompt = prompt,
    )

    loss.backward()

    # after much training

    generated_audio = diffusion.sample(
        length = 1024,
        text = text,
        prompt = prompt,
    )

    trainer = Trainer(
        diffusion_model = diffusion,
        folder = 'C:\\naturalspeech2-pytorch\\ansunghun',
        train_batch_size = 16,
        gradient_accumulate_every = 2,
        train_num_steps = 5,
        save_and_sample_every = 100,
    )

    trainer.train()
    trainer.save_checkpoint('C:\\naturalspeech2-pytorch\\ansunghun\\checkpoint.pt')

if __name__ == '__main__':
    from multiprocessing import freeze_support
    freeze_support()
    main()


An error occurs in that code.


Traceback (most recent call last):
  File "test.py", line 62, in <module>
    main()
  File "test.py", line 56, in main
    trainer.train()
  File "C:\naturalspeech2-pytorch\naturalspeech2_pytorch\naturalspeech2_pytorch.py", line 1875, in train
    loss = self.model(data)
  File "C:\Users\user\.conda\envs\svc\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\.conda\envs\svc\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\naturalspeech2-pytorch\naturalspeech2_pytorch\naturalspeech2_pytorch.py", line 1522, in forward
    text_max_length = text.shape[-1]
AttributeError: 'NoneType' object has no attribute 'shape'

two pitch?

The first pitch, in sample(), is as follows:

duration, pitch = self.duration_pitch(phoneme_enc, prompt_enc)
pitch = rearrange(pitch, 'b n -> b 1 n')

The second pitch, in the forward() of NaturalSpeech2, is as follows:

if not exists(pitch):
    assert exists(audio) and audio.ndim == 2
    assert exists(self.target_sample_hz)

    if self.calc_pitch_with_pyworld:
        pitch = compute_pitch_pyworld(
            audio,
            sample_rate = self.target_sample_hz,
            hop_length = self.mel_hop_length
        )
    else:
        pitch = compute_pitch_pytorch(audio, self.target_sample_hz)

    pitch = rearrange(pitch, 'b n -> b 1 n')

  1. Personally, I think the first pitch comes from the prompt and the second pitch comes from the training data, right?
  2. Personally, I think the prompt is a small part of the training data, e.g. the training data is 10 s and the prompt takes 2 s of it, right? (see the sketch below)
  3. Since the prompt and the training data share the same input format, why are the pitch calculation methods different?
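
On point 2, a tiny illustrative sketch of what random prompt excision from the training audio could look like (the Readme's comment says the paper's prompts were made this way and that it will eventually be handled automatically; excise_prompt and the 32768-sample length are assumptions, not repository code):

import torch

def excise_prompt(audio, prompt_len = 32768):
    # audio: (batch, samples); take a random contiguous slice of each item as its prompt
    batch, total = audio.shape
    starts = torch.randint(0, total - prompt_len + 1, (batch,)).tolist()
    return torch.stack([audio[i, s : s + prompt_len] for i, s in enumerate(starts)])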

Error when following the readme

Hello, there seems to be an error in the readme. I tried on both my machine and Colab, and got this error:

File <@beartype(naturalspeech2_pytorch.naturalspeech2_pytorch.NaturalSpeech2.__init__) at 0x7f1ba95b9550>:24, in __init__(__beartype_func, __beartype_conf, __beartype_get_violation, __beartype_object_125973616, __beartype_object_139756782786176, *args, **kwargs)

BeartypeCallHintParamViolation: @beartyped naturalspeech2_pytorch.naturalspeech2_pytorch.NaturalSpeech2.__init__() parameter model="Transformer(
  (layers): ModuleList(
    (0-11): 12 x ModuleList(
      (0): RMSNorm()
     ... violates type hint <class 'naturalspeech2_pytorch.naturalspeech2_pytorch.Model'>, as <class "naturalspeech2_pytorch.naturalspeech2_pytorch.Transformer"> "Transformer(
  (layers): ModuleList(
    (0-11): 12 x ModuleList(
      (0): RMSNorm()
     ... not instance of <class "naturalspeech2_pytorch.naturalspeech2_pytorch.Model">.
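
For what it's worth, the violation message says a Transformer instance was passed where NaturalSpeech2 expects a Model. Constructing the wrapper as in the Usage section avoids this; a minimal sketch mirroring the Readme:

from naturalspeech2_pytorch import EncodecWrapper, Model, NaturalSpeech2

codec = EncodecWrapper()

# NaturalSpeech2 is type-checked (via beartype) to accept Model, not the inner Transformer
model = Model(dim = 128, depth = 6)

diffusion = NaturalSpeech2(model = model, codec = codec, timesteps = 1000)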

audio2audio?

I'm curious how difficult it would be to get this model to support audio2audio training.

For example, the input is noisy speech and the output is denoised speech.

This basically assumes that we would have a finetuning step with (input, output) pairs.
