moonintheriver / neuralsvb

Learning the Beauty in Songs: Neural Singing Voice Beautifier; ACL 2022 (Main conference); Official code

License: GNU General Public License v3.0

Language: Python (100%)
Topics: singing-voice, acl2022, singing-synthesis, gan, singing-voice-synthesis, singing-voice-conversion

neuralsvb's Introduction

Learning the Beauty in Songs: Neural Singing Voice Beautifier



This repository is the official PyTorch implementation of our ACL-2022 paper.

0. Dataset (PopBuTFy) Acquisition

Audio samples

  • You can download the dataset from here. Please send us an email for registration (see apply_form).
  • Dataset preview.

Text labels

NeuralSVB itself does not take text as input, but the ASR model used to extract PPGs does. We therefore also provide the text labels for PopBuTFy.
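
For context, a PPG (phonetic posteriorgram) is a frame-level posterior distribution over phone classes produced by an ASR model, which NeuralSVB uses as a text-free content representation. A minimal PyTorch sketch of the idea, with hypothetical shapes rather than the actual extractor's interface:

import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration only: T mel frames, 88 phone tokens.
T, n_phones = 200, 88
asr_logits = torch.randn(T, n_phones)   # frame-level logits from an ASR encoder
ppg = F.softmax(asr_logits, dim=-1)     # each row is a distribution over phone classes
assert torch.allclose(ppg.sum(-1), torch.ones(T))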

1. Preparation

Environment Preparation

Most of the required packages are listed in https://github.com/NATSpeech/NATSpeech/blob/main/requirements.txt

Alternatively, you can set up the environment with the Requirements.txt file in the repository directory:

pip install -r Requirements.txt

Data Preparation

  1. Extract embeddings of vocal timbre (see the sketch after this list):
    CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config egs/datasets/audio/PopBuTFy/save_emb.yaml
  2. Pack the dataset:
    CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config egs/datasets/audio/PopBuTFy/para_bin.yaml
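
Purely as an illustration of what "embeddings of vocal timbre" in step 1 means, here is a hedged sketch using the third-party resemblyzer package; this is an assumed stand-in for demonstration, not necessarily the encoder the repo's binarizer actually uses:

# Illustrative only: resemblyzer is an assumed stand-in for the repo's own encoder.
from resemblyzer import VoiceEncoder, preprocess_wav

wav = preprocess_wav("data/processed/PopBuTFy/data/example.wav")  # hypothetical file
encoder = VoiceEncoder()
spk_embed = encoder.embed_utterance(wav)  # fixed-size vector summarizing vocal timbre
print(spk_embed.shape)  # (256,)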

Vocoder Preparation

We provide a pre-trained HifiGAN-Singing model, which is specially designed for SVS with the NSF mechanism.

Please unzip the pre-trained vocoder into checkpoints before training your acoustic model.

This singing vocoder is trained on 100+ hours of singing data (including Chinese and English songs).
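
The NSF (neural source-filter) mechanism conditions the vocoder on an explicit excitation signal derived from F0, which helps produce stable pitch for singing. A minimal sketch of a sine-based excitation (illustrative math only; the vocoder's actual source module is learned and more elaborate), using the sample rate and hop size from this repo's configs:

import numpy as np

def sine_excitation(f0, sr=22050, hop_size=128):
    """Build a sine-based source signal from a frame-level F0 contour in Hz.
    Unvoiced frames (f0 == 0) contribute only noise."""
    f0_sample = np.repeat(f0, hop_size)             # upsample F0 to the sample rate
    phase = 2 * np.pi * np.cumsum(f0_sample / sr)   # integrate instantaneous frequency
    sine = 0.1 * np.sin(phase)
    noise = 0.003 * np.random.randn(len(f0_sample))
    return np.where(f0_sample > 0, sine + noise, noise)

excitation = sine_excitation(np.full(100, 220.0))   # hypothetical contour: 100 frames at 220 Hz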

PPG Extractor Preparation

We provide a pre-trained PPG extractor.

Please unzip the pre-trained PPG extractor into checkpoints before training your acoustic model.

After the instructions above, the directory structure should be as follows:

.
|--data
    |--processed
        |--PopBuTFy (unzip PopBuTFy.zip)
            |--data
                |--directories containing wavs
    |--binary
        |--PopBuTFyENSpkEM
|--checkpoints
    |--1009_pretrain_asr_english
        |--
        |--config.yaml
    |--1012_hifigan_all_songs_nsf
        |--
        |--config.yaml
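
Before training, a small check script can confirm this layout; the paths below are exactly the ones listed above (the checkpoint filenames inside the two checkpoint folders are omitted in the tree, so only directories are checked):

import os

# Directories this README expects to exist before training.
expected = [
    "data/processed/PopBuTFy/data",
    "data/binary/PopBuTFyENSpkEM",
    "checkpoints/1009_pretrain_asr_english",
    "checkpoints/1012_hifigan_all_songs_nsf",
]
for path in expected:
    print(("ok     " if os.path.isdir(path) else "MISSING"), path)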

2. Training Example

CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name exp_name --reset

3. Inference

Inference from packed test set

CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name exp_name --reset --infer

Inference results will be saved in ./checkpoints/EXP_NAME/generated_ by default.

We provide pre-trained models; remember to put them in the checkpoints directory.

Inference from raw inputs

WIP.

Limitations

See Appendix D "Limitations and Solutions" in our paper.

Citation

If this repository helps your research, please cite:

@inproceedings{liu-etal-2022-learning-beauty,
  title     = "Learning the Beauty in Songs: Neural Singing Voice Beautifier",
  author    = "Liu, Jinglin and Li, Chengxi and Ren, Yi and Zhu, Zhiying and Zhao, Zhou",
  booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month     = may,
  year      = "2022",
  address   = "Dublin, Ireland",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2022.acl-long.549",
  pages     = "7970--7983",
}

Issues

  • Before raising an issue, please check our README and existing issues for possible solutions.
  • We will try to handle your problem in time, but we cannot guarantee a satisfactory solution.
  • Please be friendly.

Acknowledgements

The framework of this repository is based on DiffSinger, and is a predecessor of NATSpeech.

neuralsvb's People

Contributors: moonintheriver, mrzixi

neuralsvb's Issues

How can NSVB generalize to unseen singers?

NSVB is trained on PopBuTFy with 34 speakers. Even with the 30 hours of internal singing data described in the paper for Stage 1 training, I doubt that this amount of data lets it generalize to unseen singers; I believe it can only generalize to singers similar to those in the training set. I trained Stage 1 with the 50-hour OpenSinger data plus 3 other singers, and the resulting model generalizes only to singers similar to those in the training set, not to a very different singer. Has anyone achieved better generalization here?

I got an error when running inference. How can I fix it?

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name 1012_hifigan_all_songs_nsf --reset --infer
RuntimeError: Error(s) in loading state_dict for HifiGanGenerator:
Unexpected key(s) in state_dict: "m_source.l_linear.weight", "m_source.l_linear.bias", "noise_convs.0.weight", "noise_convs.0.bias", "noise_convs.1.weight", "noise_convs.1.bias", "noise_convs.2.weight", "noise_convs.2.bias", "noise_convs.3.weight", "noise_convs.3.bias".
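
The unexpected keys (m_source.*, noise_convs.*) are NSF-specific vocoder modules, so the checkpoint appears to contain more modules than the HifiGanGenerator being instantiated. A common generic workaround for such mismatches is to drop non-matching entries before loading; this is a hedged sketch, not a verified fix for this repo, and the checkpoint layout (a 'state_dict' key) is an assumption:

import torch

def load_compatible(model, ckpt_path):
    """Load only the checkpoint entries whose names and shapes match `model`."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)       # layout is an assumption
    model_state = model.state_dict()
    filtered = {k: v for k, v in state_dict.items()
                if k in model_state and v.shape == model_state[k].shape}
    return model.load_state_dict(filtered, strict=False)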

How to get the PopBuTFy dataset

Hi, how can I get the PopBuTFy dataset to train the model? I sent the application form to the official mailbox, but there has been no response.

NSVB checkpoint or data request

Hi, is it possible to make the trained SVB model available? Or could you describe what the input data should look like, so that I could train my own model?

Hi, request for datasets and source code

This work is very outstanding and we are interested in it. Are there any plans to make the dataset and associated pre-trained models public in the near future? Thank you.

Problem with proper data loading

Hi, I'd like to run your model myself, but I cannot find the proper way to load the dataset from the .mp3 files you provided. Is there any chance you could share the dataloader you used, or give some hints on how to process the .mp3 files into a valid dataset for your usage examples? I'd be very grateful!

The pre-trained model of NSVB

Hi! I really liked your DiffSinger project! NeuralSVB also looks very promising. You write that you provide the pre-trained NSVB model; I would like to try it on my data. How can I get the model?

Help! (救命啊! jiùmìng a!)

I got an error when running inference:
Exception has occurred: RuntimeError
Error(s) in loading state_dict for HifiGanGenerator:
Unexpected key(s) in state_dict: "m_source.l_linear.weight", "m_source.l_linear.bias", "noise_convs.0.weight", "noise_convs.0.bias", "noise_convs.1.weight", "noise_convs.1.bias", "noise_convs.2.weight", "noise_convs.2.bias", "noise_convs.3.weight", "noise_convs.3.bias".
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name 1012_hifigan_all_songs_nsf --reset --infer
| Hparams chains: ['egs/egs_bases/config_base.yaml', 'egs/egs_bases/tts/base.yaml', 'egs/egs_bases/tts/fs2.yaml', 'egs/egs_bases/tts/fs2_adv.yaml', 'egs/egs_bases/vc/vc_ppg.yaml', 'egs/egs_bases/tts/base_zh.yaml', 'egs/egs_bases/singing/base.yaml', 'egs/datasets/audio/PopBuTFy/base_text2mel.yaml', 'egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml']
| Hparams:
accumulate_grad_batches: 1, amp: False, asr_content_encoder: True, asr_dec_layers: 2, asr_enc_layers: 2,
asr_enc_type: conformer, asr_last_norm: False, asr_upsample_norm: bn, audio_num_mel_bins: 80, audio_sample_rate: 22050,
base_config: ['egs/egs_bases/vc/vc_ppg.yaml', './base_text2mel.yaml'], binarization_args: {'shuffle': False, 'with_txt': True, 'with_wav': False, 'with_align': False, 'with_spk_embed': False, 'with_spk_id': True, 'with_f0': True, 'with_f0cwt': False, 'with_linear': False, 'with_word': True, 'trim_eos_bos': False, 'reset_phone_dict': True, 'reset_word_dict': True}, binarizer_cls: data_gen.tts.singing.binarize.SingingBinarizer, binary_data_dir: data/binary/PopBuTFyENSpkEM_new, check_val_every_n_epoch: 10,
clip_grad_norm: 1, clip_grad_value: 0, concurrent_ways: , conv_use_pos: False, cross_way_no_disc_loss: False,
cross_way_no_recon_loss: False, cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, cwt_loss: l1,
cwt_std_scale: 0.8, datasets: [], debug: False, dec_dilations: [1, 1, 1, 1], dec_ffn_kernel_size: 9,
dec_inp_add_noise: False, dec_kernel_size: 5, dec_layers: 4, dec_num_heads: 2, decoder_rnn_dim: 0,
decoder_type: conv, dict_dir: , disable_map: False, disc_hidden_size: 128, disc_interval: 1,
disc_lr: 0.0001, disc_norm: in, disc_reduction: stack, disc_start_steps: 0, disc_win_num: 3,
discriminator_grad_norm: 1, discriminator_optimizer_params: {'eps': 1e-06, 'weight_decay': 0.0}, discriminator_scheduler_params: {'step_size': 60000, 'gamma': 0.5}, dropout: 0.05, ds_workers: 2,
dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], dur_loss: mse, dur_predictor_kernel: 3, dur_predictor_layers: 2, enc_dec_norm: ln,
enc_dilations: [1, 1, 1, 1], enc_ffn_kernel_size: 9, enc_kernel_size: 5, enc_layers: 4, encoder_K: 8,
encoder_type: rel_fft, endless_ds: True, exp_name: 1012_hifigan_all_songs_nsf, ffn_act: gelu, ffn_hidden_size: 1024,
ffn_padding: SAME, fft_size: 512, fmax: 11025, fmin: 50, frames_multiple: 4,
fvae_dec_n_layers: 4, fvae_enc_dec_hidden: 192, fvae_enc_n_layers: 8, fvae_kernel_size: 5, gen_dir_name: ,
generator_grad_norm: 5.0, griffin_lim_iters: 60, hidden_size: 256, hop_size: 128, infer: True,
lambda_commit: 0.25, lambda_energy: 0.0, lambda_f0: 0.0, lambda_kl: 0.001, lambda_mel_adv: 0.1,
lambda_mle: 1.0, lambda_ph_dur: 0.0, lambda_sent_dur: 0.0, lambda_uv: 0.0, lambda_word_dur: 0.0,
latent_size: 128, layers_in_block: 2, load_ckpt: , loud_norm: False, lr: 1.0,
map_lr: 0.001, map_scheduler_params: {'gamma': 0.5, 'step_size': 60000}, max_epochs: 100, max_frames: 5000, max_input_tokens: 1550,
max_sentences: 80, max_tokens: 40000, max_updates: 200000, max_valid_sentences: 1, max_valid_tokens: 60000,
mel_disc_hidden_size: 128, mel_disc_type: multi_window, mel_gan: True, mel_hidden_size: 256, mel_loss: ssim:0.5|l1:0.5,
mel_strides: [2, 1, 1], mel_vmax: 1.5, mel_vmin: -6, mfa_version: 2, min_frames: 0,
min_level_db: -100, normalize_pitch: False, num_ckpt_keep: 2, num_heads: 2, num_sanity_val_steps: 10,
num_spk: 100, num_techs: 3, num_test_samples: 0, num_valid_plots: 10, optimizer_adam_beta1: 0.5,
optimizer_adam_beta2: 0.999, out_wav_norm: False, phase_1_concurrent_ways: p2p, phase_1_steps: -1, phase_2_concurrent_ways: a2a,p2p,
phase_2_steps: 100000, phase_3_concurrent_ways: a2p, pitch_ar: False, pitch_embed_type: 0, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'],
pitch_extractor: parselmouth, pitch_loss: l1, pitch_norm: standard, pitch_ssim_win: 11, pitch_type: frame,
pre_align_args: {'nsample_per_mfa_group': 1000, 'txt_processor': 'zh', 'use_tone': False, 'sox_resample': True, 'sox_to_wav': False, 'allow_no_txt': False, 'trim_sil': False, 'denoise': False}, pre_align_cls: data_gen.tts.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.0, predictor_hidden: -1,
predictor_kernel: 5, predictor_layers: 2, pretrain_asr_ckpt: checkpoints/1009_pretrain_asr_english, pretrain_fs_ckpt: , print_nan_grads: False,
processed_data_dir: data/processed/popbutfy_0.75, profile_infer: False, raw_data_dir: data/raw/popbutfy_short_male_0.75, ref_attn: False, ref_enc_out: 256,
ref_hidden_stride_kernel: ['0,3,5', '0,3,5', '0,2,5', '0,2,5', '0,2,5'], ref_level_db: 20, ref_norm_layer: bn, rename_tmux: True, rerun_gen: False,
resume_from_checkpoint: 0, save_best: False, save_codes: [], save_f0: True, save_gt: True,
scheduler: rsqrt, seed: 1234, sort_by_len: True, task_cls: tasks.singing.svb_vae_task.SVBVAEMleTask, tb_log_interval: 100,
test_ids: [], test_input_dir: , test_num: 0, test_prefixes: [], test_set_name: test,
train_set_name: train, train_sets: , use_cond_disc: False, use_energy: False, use_energy_embed: False,
use_gt_dur: True, use_gt_f0: True, use_pitch_embed: True, use_pos_embed: True, use_ref_enc: False,
use_spk_embed: False, use_spk_id: False, use_split_spk_id: False, use_tech: True, use_uv: True,
use_var_enc: False, use_word_input: False, val_check_interval: 2000, valid_infer_interval: 10000, valid_mel_timbre_id: 100,
valid_monitor_key: val_loss, valid_monitor_mode: min, valid_set_name: valid, validate: False, var_enc_vq_codes: 64,
vocoder: hifigan, vocoder_ckpt: checkpoints/1012_hifigan_all_songs_nsf, vocoder_denoise_c: 0.0, warmup_updates: 2000, weight_decay: 0,
win_size: 512, word_size: 1000, work_dir: checkpoints/1012_hifigan_all_songs_nsf,
12/14 10:18:24 AM GPU available: True, GPU used: [0]
| Mel losses: {'ssim': 0.5, 'l1': 0.5}
12/14 10:18:24 AM load module from checkpoint: checkpoints/1009_pretrain_asr_english/model_ckpt_steps_136000.ckpt
| load 'model' from 'checkpoints/1009_pretrain_asr_english/model_ckpt_steps_136000.ckpt'.
| Generator Arch: MleSVBVAE(
(pitch_embed): Embedding(300, 256, padding_idx=0)
(pitch_encoder): ConvStacks(
(conv): ModuleList(
(0-2): 3 x ConvBlock(
(conv): ConvNorm(
(conv): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
)
(norm): GroupNorm(16, 256, eps=1e-05, affine=True)
(dropout): Dropout(p=0, inplace=False)
(relu): ReLU()
)
)
(in_proj): Linear(in_features=256, out_features=256, bias=True)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(vc_asr): VCASR(
(mel_prenet): Prenet(
(layers): ModuleList(
(0): Sequential(
(0): Conv1d(80, 256, kernel_size=(5,), stride=(2,), padding=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1-2): 2 x Sequential(
(0): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(content_encoder): ConformerLayers(
(layers): ModuleList()
(pos_embed): RelPositionalEncoding(
(dropout): Dropout(p=0.05, inplace=False)
)
(encoder_layers): ModuleList(
(0-1): 2 x EncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=256, out_features=256, bias=True)
(linear_k): Linear(in_features=256, out_features=256, bias=True)
(linear_v): Linear(in_features=256, out_features=256, bias=True)
(linear_out): Linear(in_features=256, out_features=256, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=256, out_features=256, bias=False)
)
(feed_forward): MultiLayeredConv1d(
(w_1): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
(w_2): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
(dropout): Dropout(p=0.05, inplace=False)
)
(feed_forward_macaron): MultiLayeredConv1d(
(w_1): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
(w_2): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
(dropout): Dropout(p=0.05, inplace=False)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(256, 512, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(256, 256, kernel_size=(31,), stride=(1,), padding=(15,), groups=256)
(norm): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pointwise_conv2): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
(activation): Swish()
)
(norm_ff): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_mha): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_conv): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_final): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.05, inplace=False)
)
)
(layer_norm): Linear(in_features=256, out_features=256, bias=True)
)
(token_embed): Embedding(88, 256, padding_idx=0)
(asr_decoder): TransformerASRDecoder(
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0-1): 2 x TransformerDecoderLayer(
(op): DecSALayer(
(layer_norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=False)
)
(layer_norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=False)
)
(layer_norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ffn): TransformerFFNLayer(
(ffn_1): Sequential(
(0): ConstantPad1d(padding=(8, 0), value=0.0)
(1): Conv1d(256, 1024, kernel_size=(9,), stride=(1,))
)
(ffn_2): Linear(in_features=1024, out_features=256, bias=True)
)
)
)
)
(layer_norm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(project_out_dim): Linear(in_features=256, out_features=88, bias=False)
)
)
(upsample_layer): Sequential(
(0): Sequential(
(0): Upsample(scale_factor=2.0, mode='nearest')
(1): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(2): ReLU()
(3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
)
(spk_embed_proj): Linear(in_features=256, out_features=256, bias=True)
(encoded_embed_proj): Linear(in_features=768, out_features=256, bias=True)
(vae_model): GlobalFVAE(
(g_pre_net): Sequential(
(0): Conv1d(256, 256, kernel_size=(8,), stride=(4,), padding=(2,))
)
(encoder): GlobalFVAEEncoder(
(pre_net): Sequential(
(0): Conv1d(80, 192, kernel_size=(8,), stride=(4,), padding=(2,))
)
(wn): WN(
(in_layers): ModuleList(
(0-7): 8 x Conv1d(192, 384, kernel_size=(5,), stride=(1,), padding=(2,))
)
(res_skip_layers): ModuleList(
(0-6): 7 x Conv1d(192, 384, kernel_size=(1,), stride=(1,))
(7): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
)
(drop): Dropout(p=0, inplace=False)
(cond_layer): Conv1d(256, 3072, kernel_size=(1,), stride=(1,))
)
(out_proj): Conv1d(192, 256, kernel_size=(1,), stride=(1,))
(poolings): Sequential(
(0): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(4): ReLU()
(5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
)
)
(decoder): GlobalFVAEDecoder(
(pre_net): Sequential(
(0): ConvTranspose1d(128, 192, kernel_size=(4,), stride=(4,))
)
(wn): WN(
(in_layers): ModuleList(
(0-3): 4 x Conv1d(192, 384, kernel_size=(5,), stride=(1,), padding=(2,))
)
(res_skip_layers): ModuleList(
(0-2): 3 x Conv1d(192, 384, kernel_size=(1,), stride=(1,))
(3): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
)
(drop): Dropout(p=0, inplace=False)
(cond_layer): Conv1d(256, 1536, kernel_size=(1,), stride=(1,))
)
(out_proj): Conv1d(192, 80, kernel_size=(1,), stride=(1,))
)
)
(z_mapping_function): GlobalLatentMap(
(convs): Sequential(
(0): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(4): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
)
(spk_proj): Sequential(
(0): Conv1d(256, 128, kernel_size=(1,), stride=(1,))
(1): ReLU(inplace=True)
(2): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
)
)
)
| Generator Trainable Parameters: 10.056M
12/14 10:18:25 AM load module from checkpoint: checkpoints/1012_hifigan_all_songs_nsf/model_ckpt_steps_1170000.ckpt
Traceback (most recent call last):
File "tasks/run.py", line 17, in
run_task()
File "tasks/run.py", line 12, in run_task
task_cls.start()
File "/home/datnt114/Videos/doanpc/NeuralSVB/tasks/base_task.py", line 352, in start
trainer.test(cls)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 92, in test
self.fit(task_cls)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 100, in fit
self.run_single_process(self.task)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 120, in run_single_process
self.restore_weights(checkpoint)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 355, in restore_weights
getattr(task_ref, k).load_state_dict(v)
File "/home/datnt114/anaconda3/envs/diffsinger/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'SVBVAEMleTask' object has no attribute 'model_gen'
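
The traceback shows the trainer restoring each top-level entry of the checkpoint onto an attribute of the task, and SVBVAEMleTask has no attribute named model_gen; note that the command passes --exp_name 1012_hifigan_all_songs_nsf, the vocoder checkpoint directory, rather than a trained acoustic-model experiment. To see which module names a checkpoint will try to restore, a quick inspection sketch (checkpoint layout is an assumption):

import torch

ckpt = torch.load("checkpoints/1012_hifigan_all_songs_nsf/model_ckpt_steps_1170000.ckpt",
                  map_location="cpu")
print(list(ckpt.get("state_dict", ckpt).keys()))  # module names the trainer will restore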

Issue by loading the pre-trained model

There may be an issue when loading the pre-trained model (line 61 of ckpt_utils.py):

  • size mismatch for token_embed.weight: copying a param with shape torch.Size([88, 256]) from checkpoint, the shape in current model is torch.Size([87, 256]).
  • size mismatch for asr_decoder.project_out_dim.weight: copying a param with shape torch.Size([88, 256]) from checkpoint, the shape in current model is torch.Size([87, 256])
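
An 88-vs-87 mismatch on token_embed and project_out_dim suggests the locally built phone dictionary has one token fewer than the one used to train the checkpoint. A quick way to list such shape mismatches (a hedged sketch, with the same checkpoint-layout assumption as above):

import torch

def report_shape_mismatches(model, ckpt_path):
    """Print parameters whose shapes differ between `model` and a checkpoint."""
    state_dict = torch.load(ckpt_path, map_location="cpu")
    state_dict = state_dict.get("state_dict", state_dict)  # layout is an assumption
    model_state = model.state_dict()
    for k, v in state_dict.items():
        if k in model_state and v.shape != model_state[k].shape:
            print(k, tuple(v.shape), "vs", tuple(model_state[k].shape))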

About NSVB

After listening to the demos, I have some questions:
1. For real-world use in beautifying singing, inference needs the original singer's pitch curve, right?
2. Although the test samples are not in the training set, the GT Professional and GT Amateur recordings were made by the same person. At inference time the GT Professional cannot be the singer themselves, so has generalization been tested in that setting?
