auspicious3000 / speechsplit

Unsupervised Speech Decomposition Via Triple Information Bottleneck

Home Page: http://arxiv.org/abs/2004.11284

License: MIT License

Languages: Jupyter Notebook 7.02%, Python 92.98%
Topics: voice-conversion, unsupervised-learning, disentangled-representations

speechsplit's Introduction

Unsupervised Speech Decomposition Via Triple Information Bottleneck

This repository provides a PyTorch implementation of SpeechSplit, which enables more detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch.

This is a short video that explains the main concepts of our work. If you find this work useful and use it in your research, please consider citing our paper.

SpeechSplit

@article{qian2020unsupervised,
  title={Unsupervised speech decomposition via triple information bottleneck},
  author={Qian, Kaizhi and Zhang, Yang and Chang, Shiyu and Cox, David and Hasegawa-Johnson, Mark},
  journal={arXiv preprint arXiv:2004.11284},
  year={2020}
}

Audio Demo

The audio demo for SpeechSplit can be found here

Dependencies

  • Python 3.6
  • Numpy
  • Scipy
  • PyTorch >= v1.2.0
  • librosa
  • pysptk
  • soundfile
  • matplotlib
  • wavenet_vocoder: pip install wavenet_vocoder==0.1.1 (for more information, please refer to https://github.com/r9y9/wavenet_vocoder); a quick import check is sketched below
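
A quick, optional sanity check (not part of the repo) to confirm the dependencies import; it only prints whatever versions happen to be installed:

import numpy, scipy, torch, librosa, pysptk, soundfile, matplotlib
import wavenet_vocoder

for m in (numpy, scipy, torch, librosa, pysptk, soundfile, matplotlib, wavenet_vocoder):
    print(m.__name__, getattr(m, '__version__', 'unknown'))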

To Run Demo

Download pre-trained models to assets

Download the same WaveNet vocoder model as in AutoVC to assets

The fast and high-quality hifi-gan v1 (https://github.com/jik876/hifi-gan) pre-trained model is now available here.

Run demo.ipynb

Please refer to AutoVC if you have any problems with the vocoder part, because they share the same vocoder scripts.

To Train

Download training data to assets. The provided training data is very small and intended for code verification purposes only. Please use the scripts to prepare your own data for training.

  1. Extract spectrogram and f0: python make_spect_f0.py

  2. Generate training metadata: python make_metadata.py

  3. Run the training scripts: python main.py

Please refer to Appendix B.4 for training guidance.

Final Words

This project is part of ongoing research. We hope this repo is useful for your research. If you need any help or have any suggestions for improving the framework, please raise an issue and we will do our best to get back to you as soon as possible.

speechsplit's People

Contributors

auspicious3000

speechsplit's Issues

How to fix the vibrato result?

Hi everyone,

I was trying to make my own Generator model; however, the results always carry vibrato.

datasets: VCTK + LibriSpeech clean-100 + LibriSpeech clean-360 (with no data augmentation)
Instead of using one-hot speaker id, I was using speaker embedding.
The validation loss is 47.18.

Here is my result.
The intonation and naturalness sound okay, but the voice sounds like someone speaking in front of a fan, with the microphone three steps away from the speaker.

Could anyone give me some advice or suggestions that might fix this kind of issue?
Should I change the datasets, or is data augmentation all I need?
Thanks in advance.

Learning rate for P (F0_Converter)

I have integrated training for P in solver.py but am unsure of what learning rate to use.

The default for G is 0.001, but I doubt this is also correct for P.

What initial LR should I use for P?

Many thanks.
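
For reference, a minimal sketch of how a separate optimizer for P might be added where the solver builds its optimizers; the learning rate below is an assumption to start from, not a value confirmed by the authors:

# Hypothetical addition to solver.py, next to the existing g_optimizer.
# lr=1e-4 is an assumption; tune it against the P/loss_id validation curve.
self.p_optimizer = torch.optim.Adam(self.P.parameters(), lr=1e-4, betas=(0.9, 0.999))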

How can we construct a demo.pkl-like file?

demo.pkl is a list of 6 entries.

  1. What does each entry of the list represent? I figured out that the last entry is for identification, but I still have no idea what the other entries mean.

  2. How can we manually construct a demo.pkl-like file? Are there any APIs to make one?

Thank you.

I wrote the following code but it produces the wrong pkl! Who can help me?

Extract spectrogram and f0: python make_spect_f0.py

Generate training metadata: python make_metadata.py

My code is based on the above steps.
Can anyone help me?

import os
import sys
import pickle
import numpy as np
import soundfile as sf
from scipy import signal
from librosa.filters import mel
from numpy.random import RandomState
from pysptk import sptk
from utils import butter_highpass
from utils import speaker_normalization
from utils import pySTFT
import torch
from autovc.model_bl import D_VECTOR
from collections import OrderedDict

mel_basis = mel(16000, 1024, fmin=90, fmax=7600, n_mels=80).T
min_level = np.exp(-100 / 20 * np.log(10))
b, a = butter_highpass(30, 16000, order=5)

C = D_VECTOR(dim_input=80, dim_cell=768, dim_emb=256).eval().cuda()
c_checkpoint = torch.load('assets/3000000-BL.ckpt')
new_state_dict = OrderedDict()
for key, val in c_checkpoint['model_b'].items():
    new_key = key[7:]
    new_state_dict[new_key] = val
C.load_state_dict(new_state_dict)
num_uttrs = 1
len_crop = 128

spk2gen = pickle.load(open('assets/spk2gen.pkl', "rb"))

# Modify as needed
rootDir = 'assets/wavs'
targetDir_f0 = 'assets/raptf0'
targetDir = 'assets/spmel'

dirName, subdirList, _ = next(os.walk(rootDir))
print('Found directory: %s' % dirName)
speakers = []
for subdir in sorted(subdirList):
    print(subdir)

    if not os.path.exists(os.path.join(targetDir, subdir)):
        os.makedirs(os.path.join(targetDir, subdir))
    if not os.path.exists(os.path.join(targetDir_f0, subdir)):
        os.makedirs(os.path.join(targetDir_f0, subdir))
    _, _, fileList = next(os.walk(os.path.join(dirName, subdir)))

    if spk2gen[subdir] == 'M':
        lo, hi = 50, 250
    elif spk2gen[subdir] == 'F':
        lo, hi = 100, 600
    else:
        raise ValueError
    utterances = []
    utterances.append(subdir)
    _, _, fileList = next(os.walk(os.path.join(dirName, subdir)))
    # make speaker embedding  [Speaker_Name, One-hot, [Mel, normed-F0, length, utterance_name]]
    assert len(fileList) >= num_uttrs
    idx_uttrs = np.random.choice(len(fileList), size=num_uttrs, replace=False)
    utterances.append(idx_uttrs)

    prng = RandomState(int(subdir[1:]))

    for i in range(num_uttrs):
        dirName2 = dirName.replace("wavs", "spmel")
        npyfile = fileList[idx_uttrs[i]].replace("wav", "npy")
        tmp = np.load(os.path.join(dirName2, subdir, npyfile))
        # choose another utterance if the current one is too short
        embs = []
        left = np.random.randint(0, tmp.shape[0] - len_crop)
        melsp = torch.from_numpy(tmp[np.newaxis, left:left + len_crop, :]).cuda()
        emb = C(melsp)
        embs.append(emb.detach().squeeze().cpu().numpy())
        # embs1 = emb.detach().squeeze().cpu().numpy()

        # read audio file
        x, fs = sf.read(os.path.join(dirName, subdir, fileList[idx_uttrs[i]]))
        assert fs == 16000
        if x.shape[0] % 256 == 0:
            x = np.concatenate((x, np.array([1e-06])), axis=0)
        y = signal.filtfilt(b, a, x)
        wav = y * 0.96 + (prng.rand(y.shape[0]) - 0.5) * 1e-06

        # compute spectrogram
        D = pySTFT(wav).T
        D_mel = np.dot(D, mel_basis)
        D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16
        S = (D_db + 100) / 100

        # extract f0  [Speaker_Name, One-hot, [Mel, normed-F0, length, utterance_name]]
        f0_rapt = sptk.rapt(wav.astype(np.float32) * 32768, fs, 256, min=lo, max=hi, otype=2)
        index_nonzero = (f0_rapt != -1e10)
        mean_f0, std_f0 = np.mean(f0_rapt[index_nonzero]), np.std(f0_rapt[index_nonzero])
        f0_norm = speaker_normalization(f0_rapt, index_nonzero, mean_f0, std_f0)
        embs.append(f0_norm)
        # embs2 = f0_norm
        embs.append(tmp.shape[0])
        # embs3 = tmp.shape[0]
        embs.append(subdir)
        # embs4 = subdir

        embss = tuple(embs)
    utterances.append(embss)
    speakers.append(utterances)

with open(os.path.join(rootDir, 'train.pkl'), 'wb') as handle:
    pickle.dump(speakers, handle)

AttributeError: 'HParams' object has no attribute 'builder'

When running demo.ipynb, I am presented with this error:

AttributeError Traceback (most recent call last)
in ()
10 os.makedirs('results')
11
---> 12 model = build_model().to(device)
13 checkpoint = torch.load("/content/SpeechSplit/checkpoint_step001000000_ema.pth")
14 model.load_state_dict(checkpoint["state_dict"])

/root/wavenet_vocoder/autovc/synthesis.py in build_model()

AttributeError: 'HParams' object has no attribute 'builder'

This has been mentioned before in #1, but no solutions have been posted. This repo has very few instructions, and the ones that exist are vague and lack detail. A more comprehensive installation tutorial would be helpful.

Training bottleneck hparams

Can I get a confirmation on tuning the bottlenecks in line with Appendix B.4?

  • "The first operation is to increase the channel dimension of the encoder output"
    Does this refer to dim_enc, dim_enc_2 and dim_enc_3?

  • "The second operation is to increase the sampling rate of the down-sampled code"
    Does this refer to freq, freq_2 and freq_3?

Training Pitch F0 converter model P

Hi,
Thanks for the complete code!
I wanted to check how I can train the F0 converter P.
train.py only trains the SpeechSplit model G.

Kindly help.

data preprocessing and final loss value

Hi. I wanted to ask whether you performed data normalization of the audio after trimming all the silences.
And if you did, what method did you use? (Maybe a link to a paper, lecture, or package, please?)

What was the final validation loss of G and P when training was almost done? My result looks like this, and I'm not sure whether it's a reasonable number.
[training-loss screenshot]

How to get generated speech from the output of the trained Generator?

I have trained the Generator model with my own data. However, there does not seem to be any code for generating speech from the trained Generator. I checked demo.ipynb to find out how, and it indicates that a trained F0_Converter is needed.
So I would like to ask the author: is it necessary to train an F0_Converter first in order to generate speech from the trained Generator (I found no code for training the F0_Converter), or can we just use the pretrained F0_Converter?
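
For anyone stuck at the same point, here is a hedged sketch (not an official script) of vocoding the Generator's output mel with the same WaveNet vocoder that demo.ipynb uses. It assumes the demo's checkpoint path, and that G and its inputs (x_f0_intrp_org, x_real_org, emb_org) are prepared exactly as in solver.py; whether to convert F0 with P first depends on the conversion you want.

import torch
import soundfile as sf
from synthesis import build_model, wavegen   # same imports demo.ipynb relies on

device = 'cuda'
vocoder = build_model().to(device)
ckpt = torch.load('assets/checkpoint_step001000000_ema.pth')
vocoder.load_state_dict(ckpt['state_dict'])

with torch.no_grad():
    x_pred = G(x_f0_intrp_org, x_real_org, emb_org)             # same call as in solver.py
waveform = wavegen(vocoder, c=x_pred.squeeze(0).cpu().numpy())  # mel (T, 80) -> waveform
sf.write('results/output.wav', waveform, 16000)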

How to train your own model and apply it? I have come this far but am having a problem at solver.py

OK, I have downloaded Visual Studio Code to debug and understand the code.

I see that make_spect_f0.py is used to generate the raptf0 and spmel folders with values.

make_spect_f0.py reads a folder and decides from the spk2gen.pkl file whether it contains a male or a female voice.

As a start, I deleted the raptf0, spmel and wavs folders.

Then I created a wavs folder and, inside it, another folder named p285, which is assigned as a male speaker.

Inside p285 I put my wav file myfile.wav, which is more than 2 hours long.

Question 1: Does it have to be 16 kHz and mono, or can we use maximum quality?

After I ran make_spect_f0.py, it created myfile.npy in both the raptf0 and spmel folders.

Then I ran make_metadata.py and it created train.pkl inside spmel.

Then when I run main.py I get the error below in solver.py.

I want to train a model; I don't want to test.

Then I want to use this model to convert the style of a speech to the trained model.

So I need help. Thank you.

@auspicious3000

[error screenshot]

'pad_seq_to_2' from utils is not found

Hi there,

I am getting the following error:

ImportError: cannot import name 'pad_seq_to_2'

No documentation online seems to cover this function in the 'utils' package. I would be really grateful for some clarity on this issue!

Thanks :)

spk2gen

Is the spk2gen.pkl file available?

Downsampling for VCTK corpus

The sampling rate of the VCTK corpus is 48 kHz, while the model requires a sampling rate of 16 kHz. To match the sampling rate, I used librosa's resample function and my code looks like:

import librosa

y, sr = librosa.load(wav_file, sr=48000)
y_16k = librosa.resample(y, sr, 16000)

Is this the same code you used for downsampling the audio? I want to clarify this because I want to make sure the data distribution is the same.

No module named 'synthesis'

I want to run the demo having been interested by the paper, but I am facing problems as follows:

To reproduce

  • Clone the repo
  • Install dependencies
  • pip install wavenet-vocoder==0.1.1
  • Add trained models to assets
  • Run 'demo.ipynb' cell 1, then cell 2

Issue

The following error is thrown:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-2-e12167bbae28> in <module>
      4 import pickle
      5 import os
----> 6 from synthesis import build_model
      7 from synthesis import wavegen
      8 

ModuleNotFoundError: No module named 'synthesis' 

I believe this is because installing wavenet-vocoder through pip does not install the necessary 'synthesis' module. Neither the wavenet-vocoder repo nor this one makes it completely clear how this synthesis module is accessed.

Any help would be appreciated!

Tuning bottlenecks according to Appendix B.4

Although the tuning process mentioned is very intuitive, there seems to be no theoretical guarantee that the same bottleneck sizes will work for all speakers. I think it is a research problem in itself to decide the bottlenecks directly from the speech (without going through the manual tuning process).

But practically speaking, a single set of bottleneck sizes might work well for most cases. Is that the case with the sizes used in the repo? Has anyone tried the same sizes on a different dataset? Since training takes a long time for each iteration of the tuning process, for every new speaker or dataset I'm afraid the approach might become very impractical to use.

@auspicious3000 any insights or help are very much appreciated

Unable to run demo.ipynb

To Run Demo
(done) Download pre-trained models to assets
(done) Download the same WaveNet vocoder model as in AutoVC to assets
(done) wavenet_vocoder git checkout 44e0e36 (for more information, please refer to https://github.com/r9y9/wavenet_vocoder)
(done) Run demo.ipynb

ModuleNotFoundError: No module named 'synthesis'

F0 Converter for P - loss function values

I am trying to replicate your work. I am currently building the F0 converter model to generate the P checkpoint, and I am stuck at the loss calculation.

I see that when I use the F0_Converter model P, I get a 257-dimensional one-hot-encoded output.

Demo.ipynb

f0_pred = P(uttr_org_pad, f0_trg_onehot)[0]
f0_pred.shape
> torch.Size([192, 257])

I wanted to ask: when training the F0 converter model, what value are you using to calculate the loss?

I tried using the following, but I am not sure whether it is the right way.
This is what I am doing to generate f0_pred and to calculate the loss:

f0_pred = self.P(x_real_org,f0_org_intrp)[0]
p_loss_id = F.mse_loss(f0_pred,f0_org_intrp,reduction='mean')

I just want to know if I am on the right track.
Can you help me out here @auspicious3000
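
One possibility, consistent with the solver-style snippet quoted in the "Training G and P at different sample rates" issue further down this page, is to treat the 257 quantized F0 bins as classes and use cross-entropy against the target bin indices instead of MSE. This is only a sketch, not a confirmed answer from the authors:

# Sketch: cross-entropy over quantized F0 bins.
# Depending on your P, you may need self.P(...)[0] as in demo.ipynb.
f0_pred = self.P(x_real_org, f0_org_intrp)              # (B, T, 257) scores over F0 bins
f0_org_idx = f0_org_intrp.argmax(dim=2)                 # (B, T) target bin indices
p_loss_id = F.cross_entropy(f0_pred.transpose(1, 2),    # cross_entropy expects (B, C, T)
                            f0_org_idx, reduction='mean')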

Training G and P at different sample rates

I am attempting to retrain at 22050 Hz. At this sample rate, validation loss for G and P does not decrease (P's actually increases steadily). I am using test samples from every speaker in the train set. Both training loss_ids decrease as expected.

I train G according to this code in solver.py:

self.G = self.G.train()
# G Identity mapping loss
x_f0 = torch.cat((x_real_org, f0_org), dim=-1)
x_f0_intrp = self.Interp(x_f0, len_org) 

f0_org_intrp = quantize_f0_torch(x_f0_intrp[:,:,-1])[0]
x_f0_intrp_org = torch.cat((x_f0_intrp[:,:,:-1], f0_org_intrp), dim=-1)

# G forward
x_pred = self.G(x_f0_intrp_org, x_real_org, emb_org)
g_loss_id = F.mse_loss(x_pred, x_real_org, reduction='mean') 

# Backward and optimize.
self.g_optimizer.zero_grad()
g_loss_id.backward()
self.g_optimizer.step()

loss['G/loss_id'] = g_loss_id.item()

and train P according to this code:

self.P = self.P.train()
# Preprocess f0_trg for P 
x_f0_trg = torch.cat((x_real_org, f0_org), dim=-1)
x_f0_intrp_trg = self.Interp(x_f0_trg, len_org) 
# Target for P
f0_trg_intrp = quantize_f0_torch(x_f0_intrp_trg[:,:,-1])[0]
f0_trg_intrp_indx = f0_trg_intrp.argmax(2)

# P forward
f0_pred = self.P(x_real_org,f0_trg_intrp)
p_loss_id = F.cross_entropy(f0_pred.transpose(1,2),f0_trg_intrp_indx, reduction='mean')


self.p_optimizer.zero_grad()
p_loss_id.backward()
self.p_optimizer.step()
loss['P/loss_id'] = p_loss_id.item()

  • Do the authors or anyone else have experience tuning bottlenecks or other hparams at this sample rate? Appendix B.4 does not specifically cover sample rate but is otherwise helpful.
  • Can anyone validate my training code for P?

I suspect this may be due to the LSTMs in the encoders and decoders, since at a different sample rate the vocal features appear over a different time scale; however, any other suggestions would be appreciated.

make_metadata.py logic puzzle

# use hardcoded onehot embeddings in order to be consistent with the test speakers
# modify as needed
# may use generalized speaker embedding for zero-shot conversion
spkid = np.zeros((82,), dtype=np.float32)
if speaker == 'p226':
    spkid[1] = 1.0
else:
    spkid[7] = 1.0
utterances.append(spkid)

Hi,
I don't understand the logic of this snippet.
Shouldn't every speaker's id be different? Why are there only two?
If I want to train on 20 VCTK speakers, does this part have to be modified?
Could you explain it a bit to me?
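
For what it's worth, here is a minimal sketch (an assumption, not the authors' recommendation) of giving every training speaker its own one-hot slot instead of the two hardcoded ones; it assumes the surrounding make_metadata.py loop, where speaker iterates over the sorted speaker directories (subdirList):

# Hypothetical replacement for the hardcoded block above: one distinct one-hot
# index per speaker. Keep the vector length in sync with the speaker-embedding
# dimension in hparams (82 in the provided configuration), or enlarge both together.
spk_index = {spk: i for i, spk in enumerate(sorted(subdirList))}
spkid = np.zeros((82,), dtype=np.float32)
spkid[spk_index[speaker]] = 1.0
utterances.append(spkid)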

How to compute GPE FFE

Thanks for your awesome contribution with this paper. I want to evaluate my synthesized audio. Can you share the code to compute GPE, VDE and FFE? Thank you!
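
The repo does not ship an evaluation script, so here is a minimal sketch using the usual definitions (GPE: relative F0 error above 20% on frames voiced in both tracks; VDE: voicing-decision mismatches over all frames; FFE: the union of both error types over all frames). It assumes the two F0 tracks are frame-aligned and use 0 for unvoiced frames:

import numpy as np

def f0_metrics(f0_ref, f0_pred, tol=0.2):
    """GPE, VDE and FFE for two frame-aligned F0 tracks (0 = unvoiced)."""
    f0_ref = np.asarray(f0_ref, dtype=np.float64)
    f0_pred = np.asarray(f0_pred, dtype=np.float64)
    voiced_ref, voiced_pred = f0_ref > 0, f0_pred > 0
    both_voiced = voiced_ref & voiced_pred
    gross = both_voiced & (np.abs(f0_pred - f0_ref) > tol * f0_ref)
    voicing_err = voiced_ref != voiced_pred
    n = len(f0_ref)
    gpe = gross.sum() / max(both_voiced.sum(), 1)   # pitch errors among frames voiced in both
    vde = voicing_err.sum() / n                     # voicing decision errors
    ffe = (gross.sum() + voicing_err.sum()) / n     # F0 frame error
    return gpe, vde, ffe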

How do I solve this error when executing the last cell?

AttributeError Traceback (most recent call last)
in ()
10 os.makedirs('results')
11
---> 12 model = build_model().to(device)
13 checkpoint = torch.load("/content/SpeechSplit/checkpoint_step001000000_ema.pth")
14 model.load_state_dict(checkpoint["state_dict"])

/root/wavenet_vocoder/autovc/synthesis.py in build_model()

AttributeError: 'HParams' object has no attribute 'builder'

F0 mean and std

Hey.

Do you know the F0 mean and std calculated here for the pre-trained weights?

Thanks

The validation loss is rising and fluctuating; is that a normal situation?

Greetings, thanks for such a good project.
In my experiment, I used the same VCTK dataset as yours, and I have only trained for 68,000 steps. The log of my experiment looks like this:

[training-log screenshot]

I noticed that the validation loss is rising and the training loss also has some fluctuating peaks. Is this a normal phenomenon?

Thank you in advance :)

Question: how many speakers were used to train the pre-trained model?

Hi, thanks for the great work.
Sorry for the rudimentary question.
I have a question about the pre-trained model in demo.ipynb.
The paper says it was trained on 20 speakers, but the speaker ID vector used in demo.ipynb has a size of 82, which looks like information for 82 speakers.
Please tell me how many speakers were used in the pre-trained model and why the speaker ID in demo.ipynb has a size of 82.

The validation loss is rising

The loss on my training set looks normal, but the loss on the validation set keeps rising. The structure of my validation set is:
[speaker, speaker_onehot, (spmel, raptf0, len, chapter)],
where spmel and raptf0 were extracted directly by make_spect_f0.py.
Is there any problem with this?

I have tried several times and the validation loss keeps rising.

How to retrain the G model?

Hello, I want to retrain the G model. I used the VCTK corpus and deleted some flac files that were too long or too short. I tried several times and the results were not good. I only modified the number of speakers in the model parameters. Can you send me a copy of your training data, or tell me what preprocessing you applied to the VCTK corpus and which speakers you selected? My email: [email protected]. Thanks!

How to split "Accent" information of the speaker?

Thanks for the codebase. Good work!

In the paper, speech is split into timbre (via the speaker embedding), pitch, rhythm and content. If I am not wrong, the accent information of the speaker is not captured by the speaker embedding. (I know this because when I experimented with the AutoVC codebase, the speaker embedding did not capture the accent info; the accent of the source speech was always carried into the voice conversion output.)

Any ideas on how to split the accent information from speech?

Thanks,
Pravin

Content

Is this project an evolution of AutoVC?
Can I transfer only rhythm, timbre, or pitch from audio1 to audio2 when they have different content?

Help with the missing synthesis module in the demo

Hi,
Thanks for this amazing work!
I was trying to quickly run the demo. I am currently stuck on importing the synthesis module to generate audio from the mels.
Kindly let me know where I can find it!

Thanks,
Shubham

What does `demo.pkl` consist of?

Does anyone know how to make a file like demo.pkl?
I've tried printing the data out, but I still have no idea. Below is my code and what I got:

demoData = pickle.load(open(os.path.join('assets', 'demo.pkl'), "rb"))
print(demoData[0][0])
print(demoData[0][1].shape)
print(demoData[0][2][0].shape)
print(demoData[0][2][1].shape)
print(demoData[0][2][2])
print(demoData[0][2][3])
result:
>> p226
>> (1, 82)
>> (135, 80)
>> (135,)
>> 135
>> 003002

I suppose it contains 3 things:

  1. speaker name
  2. speaker id
  3. ??? → Could anyone give me a hint about the remaining items, and how I can make one myself?

Thanks in advance
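
Based purely on the shapes printed above and on the structure described in the "The validation loss is rising" issue earlier on this page ([speaker, speaker_onehot, (spmel, raptf0, len, utterance id)]), a minimal sketch for building one demo.pkl-style entry could look like the following; the paths and the one-hot slot are assumptions, and the real file may carry additional fields:

import os
import pickle
import numpy as np

spmel = np.load('assets/spmel/p226/003002.npy')     # (T, 80) mel spectrogram, hypothetical path
raptf0 = np.load('assets/raptf0/p226/003002.npy')   # (T,) normalized F0, hypothetical path
spkid = np.zeros((1, 82), dtype=np.float32)
spkid[0, 1] = 1.0                                   # p226's hardcoded slot in make_metadata.py

entry = ['p226', spkid, (spmel, raptf0, len(spmel), '003002')]
with open(os.path.join('assets', 'my_demo.pkl'), 'wb') as handle:
    pickle.dump([entry], handle)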

Using ParallelWaveGAN instead of WaveNet

Is it possible to use the PWG vocoder (https://github.com/kan-bayashi/ParallelWaveGAN) instead of WaveNet on the output of the decoder? Specifically, do I need to change the frame length and frame hop to make the mel spectrograms compatible with PWG?

WaveNet inference is very slow, so it would help if we were able to use other neural vocoders directly. That way we could just fine-tune the given pretrained SpeechSplit models instead of training again from scratch.

mel spectrogram normalization range

Hi, I observed that the range of the spectrograms saved in the npy files is about -0.2 to 0.8. I am wondering why you normalize the spectrogram into this range. For what reason?
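
For what it's worth, that range follows directly from the normalization in make_spect_f0.py quoted earlier on this page: mel magnitudes are floored at min_level (-100 dB), a 16 dB reference is subtracted, and the result is mapped through (dB + 100) / 100. Silence therefore lands at (-100 - 16 + 100) / 100 = -0.16, which matches the observed lower end, and typical speech energy stays well below 1, so it looks like a simple squashing of the dB mel into roughly [0, 1]; only the authors can confirm the exact motivation.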

How to check speaker disentanglement during training?

What I have done: I purposely set an all-zero speaker embedding vector during testing, for both the image representation and the loss measure (MSE, where I assume higher is better).

As a result, I can clearly observe a significant MSE (around 33) after a few days of training. However, when doing actual voice conversion (from one speaker to another), the model only achieves reconstruction without voice conversion.

If possible, I would really appreciate knowing whether there are other ways to test voice conversion during training.

Many thanks.

How is the utterance id in demo.pkl calculated?

In demo.pkl, the last field is the utterance id, which is given as 003002 in the demo. How do I find the corresponding wav in the VCTK corpus whose id is 003002? I am also wondering how the utterance id is obtained; you don't seem to mention how to derive it from the VCTK corpus.
Thank you!

RuntimeError: Error(s) in loading state_dict for WaveNet

Hello,

Thanks for uploading the code! I wanted to let you know I'm having some issues with running the code from the demo, getting this error:

RuntimeError: Error(s) in loading state_dict for WaveNet:
	Missing key(s) in state_dict: "upsample_net.conv_in.weight", "upsample_net.upsample.up_layers.1.weight_g", "upsample_net.upsample.up_layers.1.weight_v", "upsample_net.upsample.up_layers.3.weight_g", "upsample_net.upsample.up_layers.3.weight_v", "upsample_net.upsample.up_layers.5.weight_g", "upsample_net.upsample.up_layers.5.weight_v", "upsample_net.upsample.up_layers.7.weight_g", "upsample_net.upsample.up_layers.7.weight_v". 
	Unexpected key(s) in state_dict: "upsample_conv.0.bias", "upsample_conv.0.weight_g", "upsample_conv.0.weight_v", "upsample_conv.2.bias", "upsample_conv.2.weight_g", "upsample_conv.2.weight_v", "upsample_conv.4.bias", "upsample_conv.4.weight_g", "upsample_conv.4.weight_v", "upsample_conv.6.bias", "upsample_conv.6.weight_g", "upsample_conv.6.weight_v", "conv_layers.0.conv1x1c.bias", "conv_layers.1.conv1x1c.bias", "conv_layers.2.conv1x1c.bias", "conv_layers.3.conv1x1c.bias", "conv_layers.4.conv1x1c.bias", "conv_layers.5.conv1x1c.bias", "conv_layers.6.conv1x1c.bias", "conv_layers.7.conv1x1c.bias", "conv_layers.8.conv1x1c.bias", "conv_layers.9.conv1x1c.bias", "conv_layers.10.conv1x1c.bias", "conv_layers.11.conv1x1c.bias", "conv_layers.12.conv1x1c.bias", "conv_layers.13.conv1x1c.bias", "conv_layers.14.conv1x1c.bias", "conv_layers.15.conv1x1c.bias", "conv_layers.16.conv1x1c.bias", "conv_layers.17.conv1x1c.bias", "conv_layers.18.conv1x1c.bias", "conv_layers.19.conv1x1c.bias", "conv_layers.20.conv1x1c.bias", "conv_layers.21.conv1x1c.bias", "conv_layers.22.conv1x1c.bias", "conv_layers.23.conv1x1c.bias". 
	

I used to have size mismatches as well, but then I edited these lines inside the wavenet_vocoder repo:

residual_channels=512,
gate_channels=512,  # split into 2 groups internally for gated activation
skip_out_channels=256,

Maybe it's something obvious to you. Thank you for publishing your code, and of course for your time; much obliged.

bug issue

File "SpeechSplit/data_loader.py", line 108, in call
pdb.set_trace()
NameError: name 'pdb' is not defined

Where should pdb be defined or imported in the data_loader file?
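
pdb is Python's built-in debugger, so this NameError just means data_loader.py calls pdb.set_trace() without importing it. Either add the import at the top of data_loader.py or delete the leftover breakpoint:

import pdb   # add at the top of data_loader.py; or simply delete the pdb.set_trace() debug line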

Spectrogram db scaling

Hi, I am trying to replace the WaveNet vocoder with MelGAN; however, I noticed that the dB scaling of the spectrogram generated in demo.ipynb is different from what MelGAN generates.

[spectrogram comparison screenshot]

On the left is c.T from demo.ipynb, while on the right is the MelGAN-generated spectrogram from wav files. Note that the scales are different: one is positive, while the other is negative.

The demo.ipynb spectrogram synthesizes fine with WaveNet but produces garbage when fed into MelGAN. Scaling the dB values linearly to approximate MelGAN's range works OK, but is there a proper method to convert between the dB scalings?
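
A hedged sketch of undoing the normalization from make_spect_f0.py (quoted earlier on this page: the saved mel is (20*log10(max(min_level, mel)) - 16 + 100) / 100). This only recovers the dB / linear mel magnitude; whether MelGAN accepts the result still depends on its own feature-extraction settings (FFT/hop size, fmin/fmax, log base), which are not guaranteed to match:

import numpy as np

def denormalize_speechsplit_mel(S):
    """Invert SpeechSplit's (dB + 100) / 100 mel scaling."""
    D_db = S * 100 - 100                      # undo the (dB + 100) / 100 scaling
    log_mag_db = D_db + 16                    # undo the -16 dB reference shift -> 20*log10(mel magnitude)
    amp = np.power(10.0, log_mag_db / 20.0)   # linear mel magnitude
    return amp                                # apply the target vocoder's own log/compression to this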
