welkinyang / learn2sing2.0

Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher

Home Page: https://welkinyang.github.io/Learn2Sing2.0/

HTML 6.33% JavaScript 68.02% Python 25.55% Shell 0.10%

singing-voice singing-voice-synthesis tts

learn2sing2.0's Introduction

Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher

Official implementation of Learn2Sing 2.0. For full details, see our paper, accepted at Interspeech 2022, via this link.

Authors: Heyang Xue, Xinsheng Wang, Yongmao Zhang, Lei Xie, Pengcheng Zhu, Mengxiao Bi.

Abstract

Demo page: link.

Building a high-quality singing corpus for a person who is not good at singing is non-trivial, which makes it challenging to create a singing voice synthesizer for that person. Learn2Sing is dedicated to synthesizing the singing voice of a speaker without his or her singing data by learning from data recorded by others, i.e., the singing teacher. Inspired by the fact that pitch is the key style factor distinguishing singing from speaking, the proposed Learn2Sing 2.0 first generates a preliminary acoustic feature with pitch values averaged at the phone level, which allows training on different styles, i.e., speaking or singing, to share the same conditions except for the speaker information. Then, conditioned on the specific style, a diffusion decoder, accelerated by a fast sampling algorithm at inference, gradually restores the final acoustic feature. During training, to avoid confusion between the speaker embedding and the style embedding, mutual information is employed to constrain the learning of both embeddings. Experiments show that the proposed approach can synthesize high-quality singing voice for a target speaker without singing data using 10 decoding steps.
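The phone-level pitch averaging mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from this repository: the function name, the per-phone frame counts, and the convention that `0.0` marks an unvoiced frame are all assumptions made for the example.

```python
def average_pitch_per_phone(f0, durations):
    """Replace each frame's F0 with the mean voiced F0 of its phone.

    f0: per-frame pitch values in Hz; 0.0 marks an unvoiced frame (assumed convention).
    durations: number of frames in each phone; must sum to len(f0).
    """
    if sum(durations) != len(f0):
        raise ValueError("durations must sum to the number of F0 frames")
    averaged = []
    start = 0
    for n_frames in durations:
        segment = f0[start:start + n_frames]
        voiced = [v for v in segment if v > 0.0]
        # A fully unvoiced phone keeps 0.0 (assumption, not from the paper).
        mean = sum(voiced) / len(voiced) if voiced else 0.0
        averaged.extend([mean] * n_frames)
        start += n_frames
    return averaged


# Two phones of three frames each; the third frame is unvoiced.
f0 = [220.0, 224.0, 0.0, 330.0, 334.0, 332.0]
durations = [3, 3]
smoothed = average_pitch_per_phone(f0, durations)
```

After averaging, every frame inside a phone carries the same pitch value, so frame-level contours such as vibrato are no longer present in the conditioning signal.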

Training and inference:

Before you can use this implementation, you need to modify the following:

Replace the phoneset and pitchset in text/symbols.py with your own set.
Provide the path to the data in config.json. The testdata folder contains example files that demonstrate the format.
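As a rough illustration of what a custom phone set and pitch set might look like, here is a hypothetical sketch. The actual layout of text/symbols.py is not shown here, so the variable names, the example symbols, and the MIDI-style note labels below are all assumptions; check the file itself and the testdata folder for the real format.

```python
# Hypothetical inventories; replace with the symbols used in your own data.
phoneset = ["sil", "sp", "a", "i", "n", "sh"]
# Assumed MIDI-style note labels plus a rest symbol.
pitchset = ["rest"] + [str(midi) for midi in range(48, 73)]


def build_symbol_table(symbols):
    """Map each symbol to an integer ID, rejecting duplicates."""
    if len(set(symbols)) != len(symbols):
        raise ValueError("symbol list contains duplicates")
    return {symbol: index for index, symbol in enumerate(symbols)}


phone_to_id = build_symbol_table(phoneset)
pitch_to_id = build_symbol_table(pitchset)


def encode(phones, pitches):
    """Convert aligned phone and pitch label sequences to ID sequences."""
    if len(phones) != len(pitches):
        raise ValueError("phones and pitches must be aligned one-to-one")
    return ([phone_to_id[p] for p in phones],
            [pitch_to_id[p] for p in pitches])


phone_ids, pitch_ids = encode(["sh", "i"], ["60", "62"])
```

A lookup that raises `KeyError` for an unknown label is a quick way to find symbols in your data that are missing from the sets.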

Training
```
  bash run.sh
```

Inference
```
  bash syn.sh outputs target_speaker_id 0 decoding_steps cuda True
```

Acknowledgements:

The diffusion decoder is adapted from GradTTS;
Estimation of mutual information is modified from VQMIVC;
Vadim Popov performed a code review of the fast sampling algorithm part.

learn2sing2.0's Issues

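Roughly how long does training take?

Hello, and first of all thank you for sharing!

I have processed my own data and training is now running. My configuration is:
"memory_efficient_training": false,
batch_size=4
sample_rate=24000
Mels are fed to HiFi-GAN; f0 is not used.
Everything else uses the default configuration.

It can synthesize audio now, but the quality is not good yet.
In this setup, roughly how long does the learn2sing model need to train, i.e., up to which M_X.pth checkpoint, before it gives decent results?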

Demo quality

Listening to the demo, in the synthesized student samples some words are not pronounced accurately and some notes are off-pitch.

Regarding English dataset

Hi, great work here! I wanted to know whether this repo will work on an English speech dataset, and whether there are English examples for reference to judge the quality. If it does work on English data, what exactly should I do? For example, for "Replace the phoneset and pitchset in text/symbols.py with your own set", what would that look like for English? Also, "Provide the path to the data in config.json" is clear, but what would the format be?

Thanks in advance!

Will the matching vocoder be released?

Is the Opencpop dataset supported?

Sample dataset?

Hi -- this is a pretty amazing project. I understand if you can't share all of your data set but I wondered if you could provide a small sample with the precise format and structure that could help others wanting to create their own, and guidance on what tools you used to annotate the source recordings -- taking 5 hours of songs and marking them up at phone level is a huge task indeed.

About CLUB mutual information

Hi, thanks for your good work. I have two questions:

Is the tanh activation function on logvar required? Can it be removed or replaced with another activation?
During training, I encountered a problem: the logvar prediction network, whose last layer is tanh, always outputs -1 regardless of the input. The overall CLUB MI prediction network then seems to stop working (it fails to update) and gives negative MI estimates.
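For readers following this issue, the sketch below shows how a CLUB-style estimate with a Gaussian variational distribution q(y|x) is computed, and why it can turn negative. It is a simplified scalar version written for illustration, not this repository's code; the function name and the toy inputs are assumptions. Because tanh bounds its output to (-1, 1), a logvar pinned at -1 corresponds to the smallest variance the layer allows, about exp(-1) ≈ 0.37.

```python
import math


def club_estimate(mu, logvar, y):
    """CLUB-style MI upper-bound estimate for scalar y.

    mu[i], logvar[i]: parameters of q(y | x_i) = N(mu[i], exp(logvar[i])).
    y[i]: the sample actually paired with x_i.
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        inv_var = math.exp(-logvar[i])
        # Log-likelihood of the paired sample (constant terms cancel).
        positive = -((mu[i] - y[i]) ** 2) * inv_var / 2.0
        # Average log-likelihood over all samples, paired or not.
        mean_sq = sum((mu[i] - y[j]) ** 2 for j in range(n)) / n
        negative = -mean_sq * inv_var / 2.0
        total += positive - negative
    return total / n


y = [0.0, 1.0, 2.0, 3.0]
logvar = [-1.0] * 4  # the saturated tanh output described in the issue

# q predicts the paired sample well: the estimate is positive.
well_fit = club_estimate([0.0, 1.0, 2.0, 3.0], logvar, y)
# q predicts the paired sample badly: the estimate is negative.
poorly_fit = club_estimate([3.0, 2.0, 1.0, 0.0], logvar, y)
```

In this toy case the sign depends only on how well the predicted means match the paired samples, which suggests that a negative estimate points to a poorly fitted q rather than to the activation alone.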
