ljy-m / ppg_tacotron Goto Github PK

An implementation of deep-voice-conversion using PyTorch

Python 100.00%

ppg_tacotron's Issues

Starting from scratch

You have had success with this where many others have failed, as the issues in the original repo show. It might be a long shot, but would you consider putting together a very basic video showing you getting this repo working from scratch? Your instructions are clearer than those of other related repos, but they still require significant prior knowledge of these systems to follow.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 1036: invalid start byte

Applying it to Chinese

I want to apply it to Chinese. I'm working on a dataset in the same format as TIMIT.
What do the first two numbers represent in the PHN file? Is it the timestamp of this phoneme?
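For readers with the same question: in TIMIT, each PHN line has the form `start end phoneme`, where the two numbers are the first and last sample offsets of the segment rather than times in seconds, and the audio is sampled at 16 kHz. A minimal sketch of converting them to seconds follows. The helper name `parse_phn` and the example lines are illustrative assumptions, not code from this repo.

```python
from typing import List, Tuple

TIMIT_SAMPLE_RATE = 16000  # TIMIT audio is sampled at 16 kHz


def parse_phn(text: str, sample_rate: int = TIMIT_SAMPLE_RATE) -> List[Tuple[float, float, str]]:
    """Turn PHN lines 'start end phoneme' into (start_sec, end_sec, phoneme) tuples.

    Hypothetical helper for illustration; start and end are sample offsets.
    """
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, phoneme = int(parts[0]), int(parts[1]), parts[2]
        segments.append((start / sample_rate, end / sample_rate, phoneme))
    return segments


# Illustrative PHN content in the TIMIT layout
example = "0 3050 h#\n3050 4559 sh\n4559 5723 ix\n"
segments = parse_phn(example)
for start_sec, end_sec, phoneme in segments:
    print(f"{phoneme}: {start_sec:.4f}s to {end_sec:.4f}s")
```

A dataset in another language would need the same three columns, with offsets expressed in samples at whatever sample rate its audio uses.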

Customized Voice Conversion

After training for enough epochs, the results of imitating CMU Arctic are finally becoming more and more satisfying!

But when it comes to creating a customized voice converter, I have no idea which wav features I should preserve when making my own "CMU Arctic" dataset. Do I need to keep every feature exactly the same as in CMU's dataset, such as the 2-second wav duration?

Also, I saw the issue about applying this project to Mandarin. If I use pyPinYin, do I need a bigger dataset than TIMIT because of the great variance of pinyin? And what library should I use to get every word's duration so that I can make a PHN file like TIMIT's? (This is the most confusing question for me.)

Could you please release a Pretrained model?

spec2wav is too slow

spec2wav is too slow.
I tried running this project on a 1080 Ti GPU with an Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, and this part takes a long time.
My input is a 3-second audio clip, and it takes 15 seconds to process.

Is there any way to optimize it?

num_samples should be a positive integer value, but got num_samples=0

Please help me.
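For context, PyTorch's `RandomSampler` raises this message when a shuffled `DataLoader` is built over a dataset of length zero, so a common cause is a dataset path or glob pattern that matches no files. The sketch below fails early with a clearer message. The helper name `list_wavs` and the demo directory are hypothetical, and the pattern this repo actually uses may differ.

```python
import glob
import os
import tempfile


def list_wavs(pattern):
    """Return the sorted .wav paths matching pattern, or fail loudly if none match.

    Hypothetical helper: an empty list here would later surface as
    'num_samples=0' once the files are wrapped in a shuffled DataLoader.
    """
    files = sorted(glob.glob(pattern, recursive=True))
    if not files:
        raise FileNotFoundError(
            f"No .wav files match {pattern!r}; check the dataset path in your config"
        )
    return files


# Demo with a throwaway directory containing one empty placeholder file
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "a.wav"), "wb").close()
found = list_wavs(os.path.join(demo_dir, "*.wav"))
print(f"found {len(found)} file(s)")
```

Running the same check against the real dataset path before training shows whether the loader has anything to sample from.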

A Quantitative Question About Training.

I have successfully gone through the whole process, from training net1 to training net2 and converting.
But after training net1 for 15000 iterations and net2 for 15000 iterations, the converted result is still inaudible.
Can you share roughly how many iterations of net1 and net2 training it takes, in your experience, to obtain an acceptable result?

net1 training
Loss : [0.629394], Accuracy : [0.792945]
net2 training
Loss : [0.009412], Loss_spec : [0.006773], Loss_mel : [0.002639]

Opening a .wav file in audio_operation.py (line 162) raises an error

When using the 'open' function to read a .wav file like this:

open('/content/data/dataset/arctic/bdl/arctic_a0001.wav',encoding='utf-8').read().splitlines()

an error occurs, indicating a decoding failure:

'utf-8' codec can't decode byte 0x86 in position 4: invalid start byte

Is that due to a different Python version? I'm wondering why this error happens.
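A note on the likely cause: a .wav file is binary, so decoding it as UTF-8 text can fail on any Python 3 version. One possibility, which is an assumption here since the surrounding code is not shown, is that this line expects a text file listing wav paths and was given a wav path instead. If the goal is to read the audio itself, the standard-library `wave` module opens it in binary mode. The sketch below writes a tiny demo file first so that it is self-contained; `read_wav` and `demo_path` are illustrative names.

```python
import os
import struct
import tempfile
import wave


def read_wav(path):
    """Read a PCM .wav file in binary mode.

    Returns (sample_rate, channels, sample_width_bytes, raw_frames).
    """
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
        return wf.getframerate(), wf.getnchannels(), wf.getsampwidth(), frames


# Write a 4-sample, 16-bit mono demo file so the example runs anywhere
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 1000, -1000, 0))

rate, channels, width, frames = read_wav(demo_path)
print(rate, channels, width, len(frames))
```

Plain `open(path, 'rb')` also avoids the decoding step, but it returns the raw bytes including the header, which is rarely what audio code wants.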

Applying to Mandarin Chinese

Have you applied this work to Mandarin Chinese?
