
How to perform text to speech (lpcnet, closed, 73 comments)

xiph commented on June 30, 2024

How to perform text to speech

Comments (73)

candlewill commented on June 30, 2024

@bearlu007 Here is some of my code you could use as a reference:

  1. 55d to 20d:

import numpy as np

def reduce_dim(features):
	""" reduce dimension from 55d to 20d
	keep features[0:18] and features[36:38] only
	:param features: 55d
	:return: 20d
	"""
	N, D = features.shape
	assert D == 55, "Dimension error. %sx%s" % (N, D)
	features = np.concatenate((features[:, 0:18], features[:, 36:38]), axis=1)
	assert features.shape[1] == 20, "Dimension error. %s" % str(features.shape)
	return features

  2. Convert 20d back to 55d at test time (input is the (N, 20) prediction):

	N = input.shape[0]
	features = np.zeros((N, 55))
	features[:, 0:18] = input[:, 0:18]
	features[:, 36:38] = input[:, 18:20]
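A quick round-trip check of the two snippets above (shapes and values are hypothetical, just for illustration):

import numpy as np

feats55 = np.random.randn(100, 55).astype(np.float32)  # stand-in for frames read from a .f32 file
feats20 = reduce_dim(feats55)                          # 55d -> 20d

restored = np.zeros((feats20.shape[0], 55), dtype=np.float32)  # 20d -> 55d, zeros elsewhere
restored[:, 0:18] = feats20[:, 0:18]
restored[:, 36:38] = feats20[:, 18:20]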

gosha20777 commented on June 30, 2024

SO IT WORKS. Here are my samples.
https://yadi.sk/d/mBUJVSCzVVd2fQ
I achieved the result in the following steps:

  1. Load the pre-trained model.
  2. Take a WAV sample from a trained Tacotron-2 without a vocoder (01.wav).
  3. Convert it to 16-bit 16 kHz mono raw PCM
    (sox taco2-out.wav -b 16 -s -c 1 -r 16k -t raw - > input.s16).
  4. Compile the data processing program (./compile.sh) and run it (./dump_data input.s16 exc.s8 features.f32 pred.s16 pcm.s16) to get the features.f32 file.
  5. Synthesize speech with LPCNet (./test_lpcnet.py features.f32 > pcm.txt).
    It works quite slowly...
  6. Convert pcm.txt to PCNet-out.wav (ffmpeg -f s16le -ar 16k -ac 1 -i pcm.txt PCNet-out.wav).

So, am I right? But why is it so slow?
P.S. With an RNN vocoder I've gotten better results...

So if I'm right, I'll try to connect Tacotron-2 and LPCNet. Or... would it be a better choice to use something else instead of Tacotron-2?
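The raw-PCM-to-WAV conversion in step 6 can also be done in Python if ffmpeg isn't handy (a sketch; it assumes the test_lpcnet.py output really is raw 16-bit little-endian mono PCM despite the .txt name):

import numpy as np
from scipy.io import wavfile

pcm = np.fromfile("pcm.txt", dtype=np.int16)  # raw s16le samples from test_lpcnet.py
wavfile.write("PCNet-out.wav", 16000, pcm)    # wrap in a 16 kHz mono WAV header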

gosha20777 commented on June 30, 2024

Thanks for your response. Yes, it works. Of course, I synthesized the sound from Tacotron 2 to demonstrate the result (so to speak, to show progress). I tested LPCNet for Korean and Russian. The results are impressive. I will develop an implementation of Tacotron 2 for a closer connection with LPCNet, to make an end-to-end TTS system. If Tacotron 2 runs on the server (without the WaveNet vocoder) and LPCNet runs on the clients, it solves many problems and reduces server load by up to 10 times.

attitudechunfeng commented on June 30, 2024

@gosha20777 Which acoustic features did you use when you trained the TTS model? I've trained with both the 55-dimension features and the 21-dimension features; however, the results are not good.

jmvalin commented on June 30, 2024

Are you training end-to-end or are you just learning the LPCNet features from text? Also, make sure that the LPC features are not predicted, but rather computed directly from the predicted cepstral features.

jmvalin commented on June 30, 2024

LPCNet is basically one half of a TTS system. It takes an acoustic feature vector every 10 ms and outputs speech samples. For TTS, you also need a network that takes in characters and outputs these acoustic feature vectors.
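In concrete numbers (a trivial sketch): at LPCNet's 16 kHz rate, a 10 ms frame is 160 samples, so N feature frames map to N * 160 output samples.

SAMPLE_RATE = 16000             # LPCNet operates at 16 kHz
HOP = SAMPLE_RATE * 10 // 1000  # one feature vector every 10 ms -> 160 samples

def expected_samples(n_frames):
    """Output sample count for a given number of feature frames."""
    return n_frames * HOP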

changeforan commented on June 30, 2024

@jmvalin Hi, I have trained a Taco2 model to predict the 18-band Bark-scale cepstrum and the 2 pitch parameters.
Can you tell me how to compute the LPC from the Bark-scale cepstrum, or which part of denoise.c does this work?
Thank you.

jmvalin commented on June 30, 2024

@changeforan To compute the LPC coefficients, look for the _celt_lpc() function in denoise.c. The process starts from Ex, computed by compute_band_energy(), so you'd need to invert a few more steps, but that shouldn't be too hard.

changeforan commented on June 30, 2024

@jmvalin Thanks for your quick response, but I am still confused.
It seems like you compute the LPC at line 399 and assign them to features[39:55] at line 448, but if features[0:18] are the Bark-scale coefficients, they are computed after line 399.
After reading your paper, I think features[39:55] should be computed from features[0:18]:
18-band Bark-frequency cepstrum ----> PSD ----> auto-correlation ----> LPC
Am I right?

jmvalin commented on June 30, 2024

The LPC are the same as if they'd been computed on features[0:18]. The spectrum on which they're computed in the C code is the same one that's used to compute the cepstrum, and the operation is reversible.
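To make the chain above concrete, here is a minimal NumPy sketch of the inversion: cepstrum -> log band energies -> per-bin PSD -> autocorrelation -> LPC. The band layout, the log base, and the interpolation are illustrative assumptions; the exact 18-band table and scaling live in the C code (denoise.c), so this is a sketch of the idea, not the repo's implementation.

import numpy as np
from scipy.fftpack import idct

def levinson_durbin(r, order):
    """Standard Levinson-Durbin recursion: autocorrelation r[0..order] -> LPC."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]

def ceps_to_lpc(ceps, band_bins, n_fft=320, order=16):
    """Sketch of 18-d Bark cepstrum -> PSD -> autocorrelation -> 16 LPCs.
    band_bins: center FFT bin of each band (must match the C band table)."""
    log_e = idct(ceps, norm='ortho')            # inverse DCT: cepstrum -> log band energies
    band_e = np.exp(log_e)                      # the log base is an assumption here
    bins = np.arange(n_fft // 2 + 1)
    psd = np.interp(bins, band_bins, band_e)    # spread band energies over FFT bins
    spec = np.concatenate([psd, psd[-2:0:-1]])  # full symmetric power spectrum
    r = np.fft.ifft(spec).real[:order + 1]      # Wiener-Khinchin: PSD -> autocorrelation
    return levinson_durbin(r, order)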

jmvalin commented on June 30, 2024

Well, the way it's normally supposed to work is that you train Tacotron (or whatever network) to directly output features that LPCNet can use. No need to run the synthesis twice (though in this case I guess it was easier for testing purposes).

gosha20777 commented on June 30, 2024

I've got features from an English multi-speaker dataset. About 8 hours.

attitudechunfeng commented on June 30, 2024

With the original 55-dimension features, or other features?

gosha20777 commented on June 30, 2024

Hmm. I'm not sure... But in my opinion it was the 20-dim features.

Train for a LONG TIME. I trained it for about 5 days on 2x Nvidia 1080 Ti. I used the horovod library to parallelize it.

gosha20777 commented on June 30, 2024

I can give you a pretrained model if you want.

attitudechunfeng commented on June 30, 2024

I can't understand what the 120-dim features are or how you extract them. I'd appreciate some explanation. In my opinion, the paper claims to use 20-dim features, while the code seems to actually use 55-dim features.

gosha20777 commented on June 30, 2024

Oh no! Not 120-dim but 20-dim! I'm so sorry :)

attitudechunfeng commented on June 30, 2024

In the code, it seems like 21-dim features rather than 20-dim. I've tried to predict the 21-dim features; however, the results sound unstable. My backbone model is not from the Taco series, but a traditional RNN model.

changeforan commented on June 30, 2024

@attitudechunfeng I have reviewed the code and found that features[18:36] is assigned to zero, features[36] and features[37] are about pitch, features[38] is not used at all, and features[39:55] are about LPC.
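Summarizing that layout as (hypothetical) Python constants:

# Hypothetical slice constants for the 55-d dump_data feature vector,
# following the layout described above.
CEPSTRUM     = slice(0, 18)   # 18-band Bark-scale cepstrum
ZERO_PAD     = slice(18, 36)  # written as zeros
PITCH_PERIOD = 36             # pitch period parameter
PITCH_CORR   = 37             # pitch correlation parameter
UNUSED       = 38             # not used at all
LPC          = slice(39, 55)  # 16 LPC coefficients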

attitudechunfeng commented on June 30, 2024

So it means that I only need to predict [0:18] and [36:38], 20-dim features in total? Do you have good results using these features? @changeforan

changeforan commented on June 30, 2024

> So it means that I only need to predict [0:18] and [36:38], 20-dim features in total? Do you have good results using these features?

With a Taco2 model, yes.

jmvalin commented on June 30, 2024

FYI, I don't think features[38] is useful for anything. OTOH, features[18:36] could potentially be useful for TTS.

hdmjdp commented on June 30, 2024

@attitudechunfeng The 21st dim is not to be predicted.

attitudechunfeng commented on June 30, 2024

@hdmjdp What do you mean? Can you explain in more detail?

hdmjdp commented on June 30, 2024

@attitudechunfeng It means you don't need to predict the period, so the net outputs 20 dims.

candlewill commented on June 30, 2024

I tried to predict the LPCNet parameters directly using a Tacotron model. The generated voice is not very good, and the attention looks very strange. Here are some attention plots and samples (in Chinese). Has anyone else seen this, and does anyone know how to explain it?

[attention plot]

More:
tacotron_lpcnet.zip

azraelkuan commented on June 30, 2024

@candlewill Maybe you used the wrong features, as jmvalin said; my alignment is very good. And compared to the mel spectrogram, it is much easier to get the alignment.

[alignment plot]

candlewill commented on June 30, 2024

Thanks @jmvalin and @azraelkuan. I predicted all of the 55d features when doing end-to-end training. I will try to change the features to predict.

ohleo commented on June 30, 2024

@azraelkuan
Looks great!
Could you share your synthesized speech from Tacotron + LPCNet?

LPCNet acoustic features:
features[0:18]  : 18-dim Bark-scale cepstrum
features[18:36] : not used
features[36]    : pitch period (what is this value?)
features[37]    : pitch correlation (what is this value?)
features[39:55] : LPC (calculated from the cepstrum)
window_size (= n_fft) = 320 (is that right?)
frame_shift (= hop_size) = 160 (is that right?)

And did you train Tacotron to predict the 20-dim feature (concatenating the 18-dim cepstrum and the 2 pitch params) instead of the 80-dim mel-spectrogram?
(In that case, the decoder LSTM input would be the 20-dim concatenated feature.)

Or is only the 18-dim cepstrum the input of the decoder LSTM, with the 2 pitch params predicted by a dense projection, like the stop token?

Could you explain the detailed structure or give tips for training?
(e.g., window_size, hop_size (= frame shift), and normalization of the features)

I would appreciate your reply.

azraelkuan commented on June 30, 2024

Feature: the 20-dim concatenated feature; I do not split them. I cannot share the samples, sorry.

hdmjdp commented on June 30, 2024

@azraelkuan Which Tacotron repo did you use?

azraelkuan commented on June 30, 2024

@hdmjdp https://github.com/keithito/tacotron

candlewill commented on June 30, 2024

I changed the features to predict, and then the attention could be learnt well. Here are some samples in 16k PCM format, generated from an end2end+LPCNet model.
e2e_lpcnet_samples.zip

hdmjdp commented on June 30, 2024

@azraelkuan Why not use Tacotron 2?

hdmjdp commented on June 30, 2024

@candlewill How do you convert Chinese characters to vectors?

bearlu007 commented on June 30, 2024

> I changed the features to predict, and then the attention could be learnt well. Here are some samples in 16k PCM format, generated from an end2end+LPCNet model.
> e2e_lpcnet_samples.zip

May I know how you changed your features for modeling and prediction?

@candlewill Thanks

bearlu007 commented on June 30, 2024

> @bearlu007 Here is some of my code you could use as a reference: [the reduce_dim snippet quoted above]

Clear enough. Thanks a lot.

attitudechunfeng commented on June 30, 2024

@azraelkuan I have a question about the predicted features. When training with Tacotron, do you only use the LPCNet features, or the LPCNet features plus the linear spectrogram?

azraelkuan commented on June 30, 2024

@attitudechunfeng Only the LPCNet features, 20 dimensions.

attitudechunfeng commented on June 30, 2024

Thanks for your quick reply. And after how many steps does the alignment become good?

azraelkuan commented on June 30, 2024

@attitudechunfeng About 5k steps; I use the real LPCNet features in the training decode step (teacher forcing).

hdmjdp commented on June 30, 2024

@azraelkuan So this repo cannot predict when to stop?

azraelkuan commented on June 30, 2024

@hdmjdp You can add a stop token to predict it.

hdmjdp commented on June 30, 2024

@azraelkuan How do I add it in the decoder cell?
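For anyone else with this question: in Tacotron 2 the stop token is simply an extra dense projection from the same decoder output that produces the acoustic frame, trained with binary cross-entropy against a 0/1 end-of-utterance target. A minimal Keras-style sketch (layer names and sizes are hypothetical, not taken from any repo in this thread):

from tensorflow.keras import layers

# Per-step decoder output, shape (batch, steps, units); 1024 units is an assumption.
decoder_out = layers.Input(shape=(None, 1024))

# 20-d LPCNet feature projection (18 cepstral + 2 pitch params).
features = layers.Dense(20, name="feature_projection")(decoder_out)

# Stop-token head: per-step probability that synthesis should end,
# trained with binary cross-entropy against a 0/1 "last frame" target.
stop = layers.Dense(1, activation="sigmoid", name="stop_token")(decoder_out)
# At inference, stop decoding once the stop probability exceeds ~0.5.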

hyzhan commented on June 30, 2024

@jmvalin If I want to normalize the cepstral coefficients, how should I choose the normalization range? The magnitude of the cepstral coefficients seems to vary a lot.

jmvalin commented on June 30, 2024

Why do you want to normalize the cepstral coefficients?

hyzhan commented on June 30, 2024

I tried to combine Tacotron with LPCNet, which succeeded on a big dataset but failed on a small one. (Feature extraction on the dataset takes only one round.) The Tacotron output may have a period greater than 3.1, which I think will cause problems when training the LPCNet network (although training does not report an error). So I plan to normalize the cepstrum and pitch parameters.

hdmjdp commented on June 30, 2024

@jmvalin Hi, in your Makefile you give the A53's compile options. Does this mean that this repo can run in real time on an A53 chip? But we find it runs much slower than real time. Why?

jmvalin commented on June 30, 2024

LPCNet is not yet real-time on the A53. That's a pretty slow chip. We've managed real-time performance on an iPhone 6, though, so it should run in real time on most modern smartphones. Just not on a Raspberry Pi yet. That may eventually be achievable, but that's not what we're working on atm.

hdmjdp commented on June 30, 2024

@jmvalin Thanks. We tested LPCNet on a phone with an A73 chip; it cannot run in real time yet. I will try training it with 32x1 sparse blocks, so it can use 17 registers. What do you think?

Coastchb commented on June 30, 2024

> @bearlu007 Here is some of my code you could use as a reference: [the reduce_dim snippet quoted above]

Hi, @candlewill.
At test time, do you predict the 20-dim features with Tacotron, convert them back to 55-dim by zero-padding the other 35 dimensions, and then synthesize directly with LPCNet?

pgmbayes commented on June 30, 2024

Hi Team - (@candlewill or @azraelkuan, if you can help out that would be amazing)
I'm getting started with speech synthesis and TTS. This might be a naive question, so please bear with my ignorance.

Given a predicted 80-dimensional mel-spectrogram from, say, DeepVoice or Tacotron, what are the steps to post-process it so that it can be fed directly as input (18 Bark-scale cepstral coefficients and 2 pitch params) to LPCNet?

Goal: numpy array (.npy file) from TTS -> features.f32, without generating a waveform and converting that to a raw audio file to be fed into LPCNet.

Assume that my base TTS model is not trained e2e for LPCNet features, and let's say I use the reduce_dim function quoted above to reduce my predicted 80-dimensional mel-spectrogram down to 20d. Where in this repo should I start in order to generate a test_features.f32 from a numpy array? I've been looking into dump_data.c but am a bit lost. Any pointers (e.g. correcting my naive assumptions, which file and line in the repo to use, a strategy for converting an npy array into features.f32 directly, etc.) would be super appreciated! Thanks y'all.

MlWoo commented on June 30, 2024

@pgmbayes You can refer to my repo. You can turn on the tacotron2 macro to make it work with the DeepVoice or Tacotron repo.

azraelkuan commented on June 30, 2024

@pgmbayes Why not just predict the 18 Bark-scale cepstral coefficients and 2 pitch params directly with Tacotron or DeepVoice?

HallidayReadyOne commented on June 30, 2024

Hi @candlewill, how many epochs would it take to get samples like e2e_lpcnet_samples.zip? Thank you.

candlewill commented on June 30, 2024

@HallidayReadyOne I trained it for 120 epochs, which is the default parameter. What's more important is that before using LPCNet, you should make sure your end2end model can predict the LPCNet features well.

HallidayReadyOne commented on June 30, 2024

@candlewill Thank you for the kind reply. Yep, the text2feature model is important. I have trained a Tacotron model to predict the LPCNet features. The attention alignment is quite good now. However, the output of LPCNet (about 18 epochs) is unstable.

ZhaoZeqing commented on June 30, 2024

Hi @candlewill, how many steps and what batch size did you use when you trained the end2end model to predict the LPCNet features? Thanks!

alokprasad commented on June 30, 2024

> I can give you a pretrained model if you want.

Can you share the pretrained model?

alokprasad commented on June 30, 2024

> I think features[39:55] should be computed from features[0:18]:
> 18-band Bark-frequency cepstrum ----> PSD ----> auto-correlation ----> LPC

@changeforan
How can I get the 20-dim features from text using Tacotron or Tacotron 2 so they can be fed to LPCNet? Is there a repo or steps that I can follow?

cahuja1992 commented on June 30, 2024

> @pgmbayes You can refer to my repo. You can turn on the tacotron2 macro to make it work with the DeepVoice or Tacotron repo.

How can we pass Tacotron features so that they are converted to features.f32?
Suppose we are writing features from Tacotron 2 as *.npy or *.pkl, or even a raw binary file.

MlWoo commented on June 30, 2024

@cahuja1992
The script is very simple.

import numpy as np

npy_data = np.load("mel_220k_0.npy")
npy_data = npy_data.astype(np.float32)  # the .f32 format is raw 32-bit floats
npy_data = npy_data.reshape((-1,))
npy_data.tofile("mel_220k_0.s32")       # writes raw little-endian float32, no header

cahuja1992 commented on June 30, 2024

> The script is very simple. [the np.load / tofile snippet above]

From Tacotron we would be using model.mel_outputs, which comes out with shape (1, 1000, 80) for an audio clip. In order to match the dimensions for LPCNet, what should the Tacotron 2 parameters be?

The default parameters are as follows:
num_mels=80,
num_freq=1025,
sample_rate=20000,
frame_length_ms=50,
frame_shift_ms=12.5,
preemphasis=0.97,
min_level_db=-100,
ref_level_db=20,
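For reference, LPCNet's framing per this thread is 16 kHz audio with a 320-sample window and a 160-sample shift (20 ms / 10 ms), and the target is the 20-dim LPCNet feature rather than a mel-spectrogram. A hypothetical keithito-style hparams adjustment could look like this (names follow that repo's conventions; values are inferred from the thread, and the cleaner route per jmvalin is to train the model to emit LPCNet features directly):

# Hypothetical hparams to line up a keithito-style Tacotron with LPCNet.
num_mels=20,          # 18 Bark cepstral coefficients + 2 pitch params, not mel bands
sample_rate=16000,    # LPCNet operates at 16 kHz
frame_length_ms=20,   # 320-sample window
frame_shift_ms=10,    # 160-sample hop -> one feature vector every 10 ms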

estherxue commented on June 30, 2024

Hi @candlewill, I have listened to your samples. They are better than the ones I generated. I used Tacotron 2 to predict the 20-dim features and trained LPCNet with my own data. It seems that the samples from predicted features have pitch problems compared to samples generated with ground-truth features. Could you share some information about your Tacotron 2 training, for example, the loss function?

alokprasad commented on June 30, 2024

@jmvalin
I am trying to integrate Tacotron and LPCNet. For that I am doing end-to-end training and followed the steps below, but I am not getting even a fair result; the synthesized voice contains noise.
Please let me know if I am missing or doing anything wrong here.

I am trying the steps below to generate features from Tacotron and use them to generate speech from LPCNet.

Training Tacotron for LPCNet

  1. Change hparams.py with the following parameters:
    num_mels=20,
    sample_rate=22050 (as the LJSpeech dataset has 22050 Hz sampling)

  2. Download LJSpeech:
    https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2

  3. Start training:
    python3 preprocess.py --base_dir /ws/sandbox/tacatron --dataset ljspeech
    python3 train.py --input /ws/sandbox/tacatron/training/train.txt

  4. Checkpoints are created every 1000 iterations
    at ~/tacotron/logs-tacotron/model.ckpt-1000;
    use a checkpoint with low error.

  5. Now, using metadata.csv of LJSpeech, generate the sentences array for eval.py
    (this contains the text for all the sample LJSpeech wav files).
    Also convert all wavs to PCM and merge them in the same order as metadata.csv.

  6. Modify synthesizer.py to dump a 55-dimension numpy array for all the text:

    mel_features, wav = self.session.run([self.model.mel_outputs, self.wav_output], feed_dict=feed_dict)
    features = mel_features[0][:, :55]
    f = open("mel_op.npy", "ab")
    features.tofile(f)
    f.close()
    => mel_op.npy contains the 55-dim features for the text.

  7. Run eval.py:
    python3 eval.py --checkpoint /tacotron/logs-tacotron/model.ckpt-123000
    This will generate mel_op.npy for all text present in the sentences array.

Training LPCNet using features generated from Tacotron

Here we have to use the concatenated PCM file and mel_op.npy.

  1. Convert mel_op.npy to mel_op.f32:

    import numpy as np
    npy_data = np.fromfile("mel_op.npy", dtype=np.float32)  # tofile() wrote raw float32, so read it back as float32
    npy_data = npy_data.reshape((-1,))
    npy_data.tofile("mel_op.f32")

  2. Merge all the wav files of any one folder of the LJSpeech dataset and generate a single PCM file.
    Then use the following to generate features.f32 and data.u8:

    make dump_data taco=1
    ./dump_data -train merge-LJ028.pcm features.f32 data.u8 (features.f32 and data.u8 are autogenerated)
    Use only data.u8, plus mel_op.f32 from step 1.

  3. Training:
    ./src/train_lpcnet.py mel_op.f32 data.u8
    This will generate an lpcnet*.h5 file.

Usage

Generate test_features.f32 from Tacotron (npy -> f32), then:
./src/test_lpcnet.py test_features.f32 test.s16
play test.s16
(Note: the .h5 path is hard-coded in test_lpcnet.py; modify it for your .h5 file.)

attitudechunfeng commented on June 30, 2024

> I will try training it with 32x1 sparse blocks, so it can use 17 registers. What do you think?

@hdmjdp Any progress with the 32x1 sparse blocks? I've tried on an A73 chip; when increasing the sparsity it can reach about 1.0x real time, but that is still a little slow.

ZhaoZeqing commented on June 30, 2024

> @candlewill Maybe you used the wrong features, as jmvalin said; my alignment is very good.

Hi @azraelkuan, for the Tacotron model, what did you use as input? Phones, pinyin, or English words? Thanks!

Jwei-Lee commented on June 30, 2024

> @bearlu007 Here is some of my code you could use as a reference: [the reduce_dim snippet quoted above]

@candlewill Did you train LPCNet with the 55-dim features? And was the 55-dim feature just generated with LPCNet's dump_data, without any other processing?

gongchenghhu commented on June 30, 2024

> I am trying to integrate Tacotron and LPCNet... [alokprasad's step-by-step recipe, quoted in full above]

@alokprasad I have tried your idea above: I merged all the mel_op.f32 files extracted by Tacotron 2 into a single final mel_op.f32. But I found that there is a mismatch between mel_op.f32 and data.u8; that is, the frame count of mel_op.f32 differs from the frame count of data.u8. I want to know how you solved this.
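One way to make that mismatch visible before training is to compare the frame counts implied by each file's size (a small diagnostic sketch, assuming 55 float32 values per feature frame as in the steps above):

import os

FLOATS_PER_FRAME = 55   # dump_data writes 55 float32 values per frame
BYTES_PER_FLOAT = 4

def f32_frames(path):
    """Frame count of a raw .f32 feature file."""
    return os.path.getsize(path) // (FLOATS_PER_FRAME * BYTES_PER_FLOAT)

print(f32_frames("mel_op.f32"))    # frames from the Tacotron side
print(f32_frames("features.f32"))  # frames from dump_data
# A mismatch usually means the TTS front end's frame shift / sample rate
# doesn't line up with LPCNet's 160-sample hop at 16 kHz.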

kkokdari commented on June 30, 2024

> Thanks @jmvalin and @azraelkuan. I predicted all of the 55d features when doing end-to-end training. I will try to change the features to predict.

Hi! Have you resolved this problem?

Ben654987 commented on June 30, 2024

You should try the "Text Speaker" app. This is the best text to speech app. It has so many natural sounding voices to choose from. It is useful for listening to study files and much more. It can even extract text from scanned pages and websites and read them out loud. I use it most often to create mp3 files of my study files so I can listen to them on the go. Great product. https://www.deskshare.com/text-to-speech-software.aspx

alokprasad commented on June 30, 2024

@Ben654987 Please stop pasting your ad here.
