bshall / universalvocoding

A PyTorch implementation of "Robust Universal Neural Vocoding"

Home Page: https://bshall.github.io/UniversalVocoding/

License: MIT License

Python 100.00%
wavernn speech-synthesis pytorch neural-vocoder

universalvocoding's Introduction

Towards Achieving Robust Universal Neural Vocoding

A PyTorch implementation of Towards Achieving Robust Universal Neural Vocoding. Audio samples can be found here. A Colab demo can be found here. An accompanying Tacotron implementation can be found here.

Fig 1: Architecture of the vocoder.

Quick Start

Ensure you have Python 3.6 or greater and PyTorch 1.7 or greater installed. Then install the package with:

pip install univoc

Example Usage

import torch
import soundfile as sf
from univoc import Vocoder

# download pretrained weights (and optionally move to GPU)
vocoder = Vocoder.from_pretrained(
    "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt"
).cuda()

# load log-Mel spectrogram from file or from tts (see https://github.com/bshall/Tacotron for example)
mel = ...

# generate waveform
with torch.no_grad():
    wav, sr = vocoder.generate(mel)

# save output
sf.write("path/to/save.wav", wav, sr)
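
For reference, a minimal sketch of one way to prepare mel from a spectrogram saved to disk. The file path, the (frames, n_mels) orientation, and the leading batch dimension are illustrative assumptions, not documented behaviour of the package:

import numpy as np
import torch

# hypothetical .npy file written by a preprocessing step
mel = np.load("path/to/mel.npy")        # assumed shape: (frames, n_mels)
mel = torch.from_numpy(mel).float()     # convert to a torch FloatTensor (generate() calls tensor methods such as .size(dim))
mel = mel.unsqueeze(0).cuda()           # add a batch dimension and move to the same device as the vocoder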

Train from Scratch

  1. Clone the repo:
git clone https://github.com/bshall/UniversalVocoding
cd ./UniversalVocoding
  2. Install requirements:
pip install -r requirements.txt
  3. Download and extract the LJ-Speech dataset:
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvjf LJSpeech-1.1.tar.bz2
  4. Download the train split here and extract it in the root directory of the repo.
  5. Extract Mel spectrograms and preprocess audio:
python preprocess.py in_dir=path/to/LJSpeech-1.1 out_dir=datasets/LJSpeech-1.1
  6. Train the model:
python train.py checkpoint_dir=ljspeech dataset_dir=datasets/LJSpeech-1.1

Pretrained Models

Pretrained weights for the 10-bit LJ-Speech model are available here.

Notable Differences from the Paper

  1. Trained on 16kHz audio from a single speaker. For an older version trained on 102 different speakers from the ZeroSpeech 2019: TTS without T English dataset, click here.
  2. Uses an embedding layer instead of one-hot encoding.
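
As a rough illustration of the second point (not the repository's actual code): an embedding lookup retrieves the same rows that multiplying a one-hot vector by a weight matrix would produce, without materialising the one-hot vectors. The 256-dimensional embedding size below is a made-up example value; only the 1024 classes follow from 10-bit samples.

import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, embedding_dim = 1024, 256            # 10-bit audio -> 1024 classes; 256 is illustrative
samples = torch.randint(0, num_classes, (8,))     # a batch of quantised sample indices

# one-hot route: build a 1024-dim vector per sample, then project it with a linear layer
one_hot = F.one_hot(samples, num_classes).float()
proj = nn.Linear(num_classes, embedding_dim, bias=False)
out_one_hot = proj(one_hot)

# embedding route: look the rows up directly, no one-hot tensor needed
emb = nn.Embedding(num_classes, embedding_dim)
with torch.no_grad():
    emb.weight.copy_(proj.weight.t())             # give both routes the same parameters

out_embedding = emb(samples)
print(torch.allclose(out_one_hot, out_embedding))  # True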

Acknowledgements

universalvocoding's People

Contributors

bshall


universalvocoding's Issues

Question about preprocess.py

Hello.

In preprocess.py line 17,

wav /= np.abs(wav).max() * 0.999

I'm wondering why you chose to use * 0.999 here. It causes wav to take values slightly above 1.0. Is this a bug or intended?

Thanks.
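
As an aside, a small sketch of the arithmetic in question (illustrative only): because the augmented assignment divides by the whole right-hand side, the peak ends up at 1 / 0.999, slightly above 1.0, whereas multiplying by 0.999 after dividing by the max would keep the peak just below 1.0.

import numpy as np

wav = np.array([0.1, -0.8, 0.5])

a = wav / (np.abs(wav).max() * 0.999)   # as written in preprocess.py: peak becomes 1 / 0.999
b = wav / np.abs(wav).max() * 0.999     # alternative ordering: peak becomes 0.999

print(np.abs(a).max())                  # ~1.001
print(np.abs(b).max())                  # 0.999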

Changing parameters

If I want to change the parameters of the model, what do I need to change?
I want to change sampling_rate, num_fft, num_mels, hop_length, and win_length, so pretty much every parameter.

Is there anything to change other than config.json?

Usage of audio_slice_frames, sample_frames, pad

Hello,

I saw that you use pad, audio_slice_frames, and sample_frames, but I can't understand what these parameters do. Can you explain their meaning?

Also, the WaveRNN model used the padded mel input in the first GRU layer, whereas here you slice the padding off after the first layer. Is it important to use the padded mel in the first GRU?

Thanks.

Result with other datasets

Summary

I will share my results from using the Universal Vocoder on other datasets.

Thanks for your great library and the impressive results/demo.
Since you seem interested in results on other datasets (#2), I will share mine. (If you are not, please feel free to ignore this!)

I forked this repository and used it on another dataset, JSUT (a single Japanese female speaker, about 10 hours in total).
Although the model was trained on a single female speaker, it works very well even on out-of-domain test data (other female speakers, a male speaker, and even an English speaker).
Below is the result/demo.
https://tarepan.github.io/UniversalVocoding

My impression is that RNN_MS (the Universal Vocoder) seems to learn characteristics of the human mouth/vocal tract, which are independent of language. Very interesting.

I would be glad if my results are useful for your further experiments.
Again, thanks for your great library.

Generate audio from mag spectrogram

Hey, thanks for your work in this project, it is really good.

I'm trying to use this vocoder to generate wavs from magnitude spectrograms produced by another neural network. Using Griffin-Lim gets me decent audio, but it sounds a bit robotic, so I think your vocoder would improve it a lot.

The biggest difference between the parameters of the two networks is n_fft: my spectrograms use 1024 and your network uses 2048. If I use your pre-trained model and change only n_fft, the resulting audio is sped up a bit and the voice gets really high-pitched.

I tried retraining the network after changing only n_fft, but the results were not good; there was a lot of noise.

Any leads on what I might try next?
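
One hedged sketch of a possible direction (not a verified recipe): instead of reinterpreting a 2048-point model with 1024-point features, project the 1024-point magnitude spectrogram onto a mel filterbank built for n_fft=1024 with the vocoder's sample rate and mel count. The sr, n_mels, file path, and log floor below are placeholder assumptions; whether the pretrained weights tolerate features produced this way depends on how closely the rest of the preprocessing matches the repo's preprocess.py.

import librosa
import numpy as np

sr, n_mels = 16000, 80                               # placeholders: use the vocoder's config values
mag = np.load("path/to/magnitude_spectrogram.npy")   # assumed shape: (1 + 1024 // 2, frames)

mel_basis = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=n_mels)
mel = np.log(np.maximum(mel_basis @ mag, 1e-5))      # the repo's exact log/normalisation may differ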

About Speaker Voice

I was playing with the preprocessing parameters and was able to change the sound of the synthesized voice a bit.
I was wondering if there is a cleverer way to control it in terms of pitch, energy, style, timbre, etc.
Thanks!

Generating samples from generated Mel-spectrograms

@bshall - First of all, thank you for this implementation. In this issue, you pointed out that you had generated a sample from a Mel-spectrogram produced by a VQVAE, and it sounds pretty good.

My question is: how would one go about generating audio from Mel-spectrograms? Do we need to preprocess the Mel-spectrogram, if that's the only thing we're given?

Why the embedding layer instead of the one-hot audio vector?

Hello,

In the original implementation of this model, the authors employed a one-hot audio vector of dimension 1024. Unfortunately, the authors did not say much about this one-hot vector in the paper and did not explain its purpose in the model. Given that its dimension is 1024 = 2^10, and that the authors use 10-bit audio samples, I assume this vector is related to the prediction of each bit of each audio sample, but that's just a guess.

So, I have two (actually three) questions:

  1. What is the purpose of the one-hot audio vector in the original implementation?
  2. Why did you replace the one-hot vector with an embedding layer? What changed in the model behavior with this replacement?

Thank you very much

audio_slice_frames in v0.2

Summary

audio_slice_frames seems to be deprecated in v0.2.
Was the 10-bit model trained with this version?

Context

The conditioning network (rnn1) and the autoregressive network (rnn2) used different numbers of sample frames (#12).
This was controlled in VocoderDataset by sample_frames and audio_slice_frames.

Question

In v0.2, there seems to be no audio_slice_frames.
Is it deprecated?

And was the 10-bit (LJ-Speech) model trained without this different frame usage?

Result remains little noise, but loss does not decrease

Hi, I trained your model on my own dataset for 1000k iterations; the output sounds stable, with only a little background noise. However, the loss stays around 2.6, and the noise did not disappear after another 1000k steps. I tried reducing the batch size to 2 and the learning rate to 5e-5, but it did not help. How can I deal with this?
samples.zip

generate_audio questions

Hi, recently I have been trying different mel preprocessing methods. The mel range is [-4, 4], and the audio processing stays the same (mulaw_encode and mulaw_decode are unchanged). However, the generated audio contains a lot of noise, even though the mel looks normal.

original audio: (spectrogram image)

generated audio: (spectrogram image)

How can I deal with this? Looking forward to your response, thank you. @bshall

How to improve performance

Hello,
It takes 25 seconds to generate three seconds of audio (sample rate 22050, about 15 words). Do you have any ideas for performance optimization? I would be happy to discuss it. Thank you.

About generated samples

"A PyTorch implementation of Robust Universal Neural Vocoding. Audio samples can be found here."

For the samples at the link you gave: were they generated from ground-truth spectrograms or from spectrograms predicted by an acoustic model?

Help needed. Trying to get the vocoder working with output from a multilingual Tacotron

Hello,

I'm trying to figure out what I need to do so that my numpy array can be vocoded by the UniversalVocoder.

Attached is a sample npy file.

The output is from a modified https://github.com/Tomiinek/Multilingual_Text_to_Speech

import os

import numpy


def main():
    import torch
    import soundfile as sf
    from univoc import Vocoder

    cwd: str = os.getcwd()

    # download pretrained weights (and optionally move to GPU)
    vocoder: Vocoder = Vocoder.from_pretrained(
            "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt").cuda()

    # load log-Mel spectrogram from file or from tts (see https://github.com/bshall/Tacotron for example)
    mel = numpy.load(os.path.join(cwd, "tmp.npy"))

    # generate waveform
    with torch.no_grad():
        wav, sr = vocoder.generate(mel)

    # save output
    sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)


if __name__ == "__main__":
    main()
Traceback (most recent call last):
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 29, in <module>
    main()
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 22, in main
    wav, sr = vocoder.generate(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/univoc/model.py", line 102, in generate
    mel, _ = self.rnn1(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 821, in forward
    max_batch_size = input.size(0) if self.batch_first else input.size(1)
TypeError: 'int' object is not callable

tmp.npy.zip
wavernn-vocoded.zip
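
For what it's worth, the traceback points at mel still being a NumPy array when it reaches generate(): ndarray.size is an int attribute, so the internal input.size(...) call raises TypeError: 'int' object is not callable. A minimal sketch of converting it first; the (frames, n_mels) orientation and the leading batch dimension are assumptions, not confirmed against the package:

import numpy as np
import torch
from univoc import Vocoder

vocoder = Vocoder.from_pretrained(
    "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt"
).cuda()

mel = np.load("tmp.npy")                 # NumPy array straight from the Tacotron fork
mel = torch.from_numpy(mel).float()      # torch tensor, so tensor.size(dim) is callable
if mel.dim() == 2:
    mel = mel.unsqueeze(0)               # assumed: a leading batch dimension is expected
mel = mel.cuda()                         # same device as the vocoder

with torch.no_grad():
    wav, sr = vocoder.generate(mel)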

How long does it take to train from scratch?

Thank you for sharing your great work.
As I have changed many parameters (n_mel, fft, hop, window, etc.), I am training this model from scratch on the VCTK dataset.
Could you tell me what environment you used and how long it took?
I have a GeForce RTX 2080 Ti, and it seems it will take a whole month :(

mulaw encoding

Hi bshall,

I have a question about the mu-law encoding function. I am wondering why mulaw_encode() returns np.floor((fx + 1) / 2 * mu + 0.5) instead of returning fx directly.
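
For reference, a small illustrative sketch of the two steps involved (mu-law companding, then quantisation of the companded value fx from [-1, 1] into integer class indices in [0, mu]); this follows the standard formula and is not copied from the repository:

import numpy as np

def mulaw_encode(x, mu=2 ** 10 - 1):
    # compand: map x in [-1, 1] to fx in [-1, 1] with mu-law compression
    fx = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # quantise: map fx to integer indices 0..mu, which the model predicts as classes
    return np.floor((fx + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(y, mu=2 ** 10 - 1):
    # undo the quantisation, then expand back to the waveform domain
    fx = 2 * y.astype(np.float64) / mu - 1
    return np.sign(fx) * np.expm1(np.abs(fx) * np.log1p(mu)) / mu

x = np.array([-0.5, 0.0, 0.25, 0.9])
print(mulaw_encode(x))                   # integer class indices in [0, 1023]
print(mulaw_decode(mulaw_encode(x)))     # approximately recovers x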

preprocessing_mel question

Hi, I have a question about the mel preprocessing. I use the following preprocessing method, but the generated audio file is silent.

def melspectrogram(wav, hparams):
    D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
    # i.e. np.dot(_mel_basis, np.abs(D)) followed by the dB conversion
    S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db

    if hparams.signal_normalization:
        return _normalize(S, hparams)
    return S

def _stft(y, hparams):
    if hparams.use_lws:  # False in my config
        return _lws_processor(hparams).stft(y).T
    else:
        # i.e. librosa.stft(y, n_fft=num_fft, hop_length=hop_length, win_length=win_length)
        return librosa.stft(y=y, n_fft=hparams.n_fft, hop_length=get_hop_size(hparams), win_length=hparams.win_size)

def _linear_to_mel(spectogram, hparams):
    global _mel_basis
    if _mel_basis is None:
        _mel_basis = _build_mel_basis(hparams)
    return np.dot(_mel_basis, spectogram)

def _amp_to_db(x, hparams):
    # with min_level_db = -100 this is np.exp(-100 / 20 * np.log(10)), i.e. min_level = 10 ** (-100 / 20)
    min_level = np.exp(hparams.min_level_db / 20 * np.log(10))
    return 20 * np.log10(np.maximum(min_level, x))

def _normalize(S, hparams):
    if hparams.allow_clipping_in_normalization:  # True in my config
        if hparams.symmetric_mels:  # True in my config
            return np.clip((2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value,
                           -hparams.max_abs_value, hparams.max_abs_value)
        else:
            return np.clip(hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db)), 0, hparams.max_abs_value)

The main differences are the line S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db and the _normalize function, with hparams.ref_level_db = 20 and hparams.max_abs_value = 4.
My data is in [-4, 4], while your preprocessing produces data in [0, 1]. Does the data range have such a big influence on the model? I don't understand, so I am asking for your help. Thank you.
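
If the pretrained vocoder really was trained on mels normalised to [0, 1], as described above, one simple thing to try is an affine rescale of the symmetric [-4, 4] features before feeding them in. A hedged sketch, assuming max_abs_value = 4; the two pipelines may still differ in the underlying dB range, so matching this repo's preprocess.py exactly is the safer route.

import numpy as np

max_abs_value = 4.0                                              # hparams.max_abs_value in the quoted config

mel_symmetric = np.load("path/to/mel.npy")                       # values in [-4, 4]
mel_01 = (mel_symmetric + max_abs_value) / (2 * max_abs_value)   # linear map from [-4, 4] to [0, 1]
mel_01 = np.clip(mel_01, 0.0, 1.0)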

num_steps of training for those demo sample?

Hi,

This repo is really great. May I ask how many training steps (with batch_size 32) were required for your demo samples? Given the amount of training data used here (around 26 hours of recordings), I guess the 100k num_steps provided in config.json is not enough, right?

Many thanks!
