babysor / mockingbird Goto Github PK

View Code? Open in Web Editor NEW

33.9K 307.0 5.1K 127.72 MB

🚀AI拟声: 5秒内克隆您的声音并生成任意语音内容 Clone a voice in 5 seconds to generate arbitrary speech in real-time

License: Other

Python 99.54% Dockerfile 0.07% Shell 0.24% Cython 0.15%

ai speech pytorch deep-learning text-to-speech tts

mockingbird's Introduction

English | 中文| 中文Linux

Features

🌍 Chinese supported mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, and etc.

🤩 PyTorch worked for pytorch, tested in version of 1.9.0(latest in August 2021), with GPU Tesla T4 and GTX 2060

🌍 Windows + Linux run in both Windows OS and linux OS (even in M1 MACOS)

🤩 Easy & Awesome effect with only newly-trained synthesizer, by reusing the pretrained encoder/vocoder

🌍 Webserver Ready to serve your result with remote calling

Quick Start

1. Install Requirements

1.1 General Setup

Follow the original repo to test if you got all environment ready. **Python 3.7 or higher ** is needed to run the toolbox.

Install PyTorch.

If you get an ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu102 (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2 ) This error is probably due to a low version of python, try using 3.9 and it will install successfully

Install ffmpeg.
Run pip install -r requirements.txt to install the remaining necessary packages.
Install webrtcvad pip install webrtcvad-wheels(If you need)

install dependencies with conda or mamba

conda env create -n env_name -f env.yml

mamba env create -n env_name -f env.yml

will create a virtual environment where necessary dependencies are installed. Switch to the new environment by conda activate env_name and enjoy it.

env.yml only includes the necessary dependencies to run the project，temporarily without monotonic-align. You can check the official website to install the GPU version of pytorch.

1.2 Setup with a M1 Mac

The following steps are a workaround to directly use the original demo_toolbox.pywithout the changing of codes.

Since the major issue comes with the PyQt5 packages used in demo_toolbox.py not compatible with M1 chips, were one to attempt on training models with the M1 chip, either that person can forgo demo_toolbox.py, or one can try the web.py in the project.

1.2.1 Install `PyQt5`, with ref here.

Create and open a Rosetta Terminal, with ref here.

Use system Python to create a virtual environment for the project

/usr/bin/python3 -m venv /PathToMockingBird/venv
source /PathToMockingBird/venv/bin/activate

Upgrade pip and install PyQt5

pip install --upgrade pip
pip install pyqt5

1.2.2 Install `pyworld` and `ctc-segmentation`

Both packages seem to be unique to this project and are not seen in the original Real-Time Voice Cloning project. When installing with pip install, both packages lack wheels so the program tries to directly compile from c code and could not find Python.h.

Install pyworld
- brew install python Python.h can come with Python installed by brew
- export CPLUS_INCLUDE_PATH=/opt/homebrew/Frameworks/Python.framework/Headers The filepath of brew-installed Python.h is unique to M1 MacOS and listed above. One needs to manually add the path to the environment variables.
- pip install pyworld that should do.
Installctc-segmentation

Same method does not apply to ctc-segmentation, and one needs to compile it from the source code on github.
- git clone https://github.com/lumaku/ctc-segmentation.git
- cd ctc-segmentation
- source /PathToMockingBird/venv/bin/activate If the virtual environment hasn't been deployed, activate it.
- cythonize -3 ctc_segmentation/ctc_segmentation_dyn.pyx
- /usr/bin/arch -x86_64 python setup.py build Build with x86 architecture.
- /usr/bin/arch -x86_64 python setup.py install --optimize=1 --skip-buildInstall with x86 architecture.

1.2.3 Other dependencies

/usr/bin/arch -x86_64 pip install torch torchvision torchaudio Pip installing PyTorch as an example, articulate that it's installed with x86 architecture
pip install ffmpeg Install ffmpeg
pip install -r requirements.txt Install other requirements.

1.2.4 Run the Inference Time (with Toolbox)

To run the project on x86 architecture. ref.

vim /PathToMockingBird/venv/bin/pythonM1 Create an executable file pythonM1 to condition python interpreter at /PathToMockingBird/venv/bin.

Write in the following content:

#!/usr/bin/env zsh
mydir=${0:a:h}
/usr/bin/arch -x86_64 $mydir/python "$@"

chmod +x pythonM1 Set the file as executable.
If using PyCharm IDE, configure project interpreter to pythonM1(steps here), if using command line python, run /PathToMockingBird/venv/bin/pythonM1 demo_toolbox.py

2. Prepare your models

Note that we are using the pretrained encoder/vocoder but not synthesizer, since the original model is incompatible with the Chinese symbols. It means the demo_cli is not working at this moment, so additional synthesizer models are required.

You can either train your models or use existing ones:

2.1 Train encoder with your dataset (Optional)

Preprocess with the audios and the mel spectrograms: python encoder_preprocess.py <datasets_root> Allowing parameter --dataset {dataset} to support the datasets you want to preprocess. Only the train set of these datasets will be used. Possible names: librispeech_other, voxceleb1, voxceleb2. Use comma to sperate multiple datasets.
Train the encoder: python encoder_train.py my_run <datasets_root>/SV2TTS/encoder

For training, the encoder uses visdom. You can disable it with --no_visdom, but it's nice to have. Run "visdom" in a separate CLI/process to start your visdom server.

2.2 Train synthesizer with your dataset

Download dataset and unzip: make sure you can access all .wav in folder
Preprocess with the audios and the mel spectrograms: python pre.py <datasets_root> Allowing parameter --dataset {dataset} to support aidatatang_200zh, magicdata, aishell3, data_aishell, etc.If this parameter is not passed, the default dataset will be aidatatang_200zh.
Train the synthesizer: python train.py --type=synth mandarin <datasets_root>/SV2TTS/synthesizer
Go to next step when you see attention line show and loss meet your need in training folder synthesizer/saved_models/.

2.3 Use pretrained model of synthesizer

Thanks to the community, some models will be shared:

author	Download link	Preview Video	Info
@author	https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g Baidu 4j5d		75k steps trained by multiple datasets
@author	https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw Baidu code：om7f		25k steps trained by multiple datasets, only works under version 0.0.1
@FawenYo	https://yisiou-my.sharepoint.com/:u:/g/personal/lawrence_cheng_fawenyo_onmicrosoft_com/EWFWDHzee-NNg9TWdKckCc4BC7bK2j9cCbOWn0-_tK0nOg?e=n0gGgC	input output	200k steps with local accent of Taiwan, only works under version 0.0.1
@miven	https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 https://www.aliyundrive.com/s/AwPsbo8mcSP code: z2m0	https://www.bilibili.com/video/BV1uh411B7AD/	only works under version 0.0.1

2.4 Train vocoder (Optional)

note: vocoder has little difference in effect, so you may not need to train a new one.

Preprocess the data: python vocoder_preprocess.py <datasets_root> -m <synthesizer_model_path>

<datasets_root> replace with your dataset root，<synthesizer_model_path>replace with directory of your best trained models of sythensizer, e.g. sythensizer\saved_mode\xxx

Train the wavernn vocoder: python vocoder_train.py mandarin <datasets_root>
Train the hifigan vocoder python vocoder_train.py mandarin <datasets_root> hifigan

3. Launch

3.1 Using the web server

You can then try to run:python web.py and open it in browser, default as http://localhost:8080

3.2 Using the Toolbox

You can then try the toolbox: python demo_toolbox.py -d <datasets_root>

3.3 Using the command line

You can then try the command: python gen_voice.py <text_file.txt> your_wav_file.wav you may need to install cn2an by "pip install cn2an" for better digital number result.

Reference

This repository is forked from Real-Time-Voice-Cloning which only support English.

URL	Designation	Title	Implementation source
1803.09017	GlobalStyleToken (synthesizer)	Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis	This repo
2010.05646	HiFi-GAN (vocoder)	Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis	This repo
2106.02297	Fre-GAN (vocoder)	Fre-GAN: Adversarial Frequency-consistent Audio Synthesis	This repo
1806.04558	SV2TTS	Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis	This repo
1802.08435	WaveRNN (vocoder)	Efficient Neural Audio Synthesis	fatchord/WaveRNN
1703.10135	Tacotron (synthesizer)	Tacotron: Towards End-to-End Speech Synthesis	fatchord/WaveRNN
1710.10467	GE2E (encoder)	Generalized End-To-End Loss for Speaker Verification	This repo

F Q&A

1.Where can I download the dataset?

Dataset	Original Source	Alternative Sources
aidatatang_200zh	OpenSLR	Google Drive
magicdata	OpenSLR	Google Drive (Dev set)
aishell3	OpenSLR	Google Drive
data_aishell	OpenSLR

After unzip aidatatang_200zh, you need to unzip all the files under aidatatang_200zh\corpus\train

2.What is`<datasets_root>`?

If the dataset path is D:\data\aidatatang_200zh,then <datasets_root> isD:\data

3.Not enough VRAM

Train the synthesizer：adjust the batch_size in synthesizer/hparams.py

//Before
tts_schedule = [(2,  1e-3,  20_000,  12),   # Progressive training schedule
                (2,  5e-4,  40_000,  12),   # (r, lr, step, batch_size)
                (2,  2e-4,  80_000,  12),   #
                (2,  1e-4, 160_000,  12),   # r = reduction factor (# of mel frames
                (2,  3e-5, 320_000,  12),   #     synthesized for each decoder iteration)
                (2,  1e-5, 640_000,  12)],  # lr = learning rate
//After
tts_schedule = [(2,  1e-3,  20_000,  8),   # Progressive training schedule
                (2,  5e-4,  40_000,  8),   # (r, lr, step, batch_size)
                (2,  2e-4,  80_000,  8),   #
                (2,  1e-4, 160_000,  8),   # r = reduction factor (# of mel frames
                (2,  3e-5, 320_000,  8),   #     synthesized for each decoder iteration)
                (2,  1e-5, 640_000,  8)],  # lr = learning rate

Train Vocoder-Preprocess the data：adjust the batch_size in synthesizer/hparams.py

//Before
### Data Preprocessing
        max_mel_frames = 900,
        rescale = True,
        rescaling_max = 0.9,
        synthesis_batch_size = 16,                  # For vocoder preprocessing and inference.
//After
### Data Preprocessing
        max_mel_frames = 900,
        rescale = True,
        rescaling_max = 0.9,
        synthesis_batch_size = 8,                  # For vocoder preprocessing and inference.

Train Vocoder-Train the vocoder：adjust the batch_size in vocoder/wavernn/hparams.py

//Before
# Training
voc_batch_size = 100
voc_lr = 1e-4
voc_gen_at_checkpoint = 5
voc_pad = 2

//After
# Training
voc_batch_size = 6
voc_lr = 1e-4
voc_gen_at_checkpoint = 5
voc_pad =2

4.If it happens `RuntimeError: Error(s) in loading state_dict for Tacotron: size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70, 512]) from checkpoint, the shape in current model is torch.Size([75, 512]).`

Please refer to issue #37

5. How to improve CPU and GPU occupancy rate?

Adjust the batch_size as appropriate to improve

6. What if it happens `the page file is too small to complete the operation`

Please refer to this video and change the virtual memory to 100G (102400), for example : When the file is placed in the D disk, the virtual memory of the D disk is changed.

7. When should I stop during training?

FYI, my attention came after 18k steps and loss became lower than 0.4 after 50k steps.

mockingbird's People

Contributors

Stargazers

Watchers

Forkers

mockingbird's Issues

在 Preprocess the embeddings 時自動關機

有人有跟我一樣的問題嗎，剛執行 python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer 不久就自動關機，有遇過同樣問題的人用什麼辦法解決呢?

Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at vocoder/saved_models/pretrained/pretrained.pt
python: src/hostapi/alsa/pa_linux_alsa.c:3641: PaAlsaStreamComponent_BeginPolling: Assertion `ret == self->nfds' failed.
Aborted (core dumped)

LibriSpeech alignments?

(base) F:\Realtime-Voice-Clone-Chinese-main>python synthesizer_preprocess_audio.py "F:\Realtime-Voice-Clone-Chinese-main/data1"
Arguments:
datasets_root: F:\Realtime-Voice-Clone-Chinese-main\data1
out_dir: F:\Realtime-Voice-Clone-Chinese-main\data1\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
dataset: aidatatang_200zh

Using data from:
F:\Realtime-Voice-Clone-Chinese-main\data1\aidatatang_200zh\corpus\train
aidatatang_200zh: 100%|████████████████████████████████████████████████████████| 420/420 [02:47<00:00, 2.51speakers/s]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 64, in
preprocess_dataset(**vars(args))
File "F:\Realtime-Voice-Clone-Chinese-main\synthesizer\preprocess.py", line 76, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence

在此处发现同样问题:https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/486

Where can I download aidatatang_200zh dataset?

在运行 python synthesizer_preprocess_audio.py 时报错

Pre Trained Model

Hi, I am from outside China

is it possible to have the pre-trained model download from google drive?

sounddevice报错问题

在win10默认情况下系统编码格式为gbk，在运行demo_toolbox.py时会报错：

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 6: invalid continuation byte

打开D:\Env\anaconda3\Lib\site-packages\sounddevice.py移动到573行，有相关报错的issue，更改为mbcs后错误变成：

UnicodeDecodeError: 'mbcs' codec can't decode bytes in position 0--1:xxxxxxxxxxxxxxxxxxx

运行python -m sounddevice会报相同的错误

按照上述步骤更改系统编码格式后重启，再次运行python -m sounddevice就没报错了

C:\Users\LM>python -m sounddevice
   0 Microsoft Sound Mapper - Input, MME (2 in, 0 out)
>  1 mic (USBAudio2.0), MME (2 in, 0 out)
   2 麦克风阵列 (Realtek High Definition , MME (2 in, 0 out)
   3 立体声混音 (Realtek High Definition , MME (2 in, 0 out)
   4 Microsoft Sound Mapper - Output, MME (0 in, 2 out)
<  5 ear (15- Meizu HiFi DAC Headpho, MME (0 in, 2 out)
   6 Speaker (Realtek High Definitio, MME (0 in, 2 out)
   7 DELL U2414H (NVIDIA High Defini, MME (0 in, 2 out)
   8 主声音捕获驱动程序, Windows DirectSound (2 in, 0 out)
   9 mic (USBAudio2.0), Windows DirectSound (2 in, 0 out)
  10 麦克风阵列 (Realtek High Definition Audio), Windows DirectSound (2 in, 0 out)
  11 立体声混音 (Realtek High Definition Audio), Windows DirectSound (2 in, 0 out)
  12 主声音驱动程序, Windows DirectSound (0 in, 2 out)
  13 Speaker (Realtek High Definition Audio), Windows DirectSound (0 in, 2 out)
  14 DELL U2414H (NVIDIA High Definition Audio), Windows DirectSound (0 in, 2 out)
  15 DSD 转码器 (DoP/Native), ASIO (0 in, 2 out)
  16 ear (15- Meizu HiFi DAC Headphone Amplifier), Windows WASAPI (0 in, 2 out)
  17 Speaker (Realtek High Definition Audio), Windows WASAPI (0 in, 2 out)
  18 DELL U2414H (NVIDIA High Definition Audio), Windows WASAPI (0 in, 2 out)
  19 麦克风阵列 (Realtek High Definition Audio), Windows WASAPI (2 in, 0 out)
  20 立体声混音 (Realtek High Definition Audio), Windows WASAPI (2 in, 0 out)
  21 mic (USBAudio2.0), Windows WASAPI (2 in, 0 out)
  22 Output (), Windows WDM-KS (0 in, 2 out)
  23 耳机 (), Windows WDM-KS (0 in, 2 out)
  24 Headphones (Meizu HiFi DAC Headphone Amplifier), Windows WDM-KS (0 in, 2 out)
  25 Speakers (Realtek HD Audio output), Windows WDM-KS (0 in, 2 out)
  26 立体声混音 (Realtek HD Audio Stereo input), Windows WDM-KS (2 in, 0 out)
  27 麦克风阵列 (Realtek HD Audio Mic input), Windows WDM-KS (2 in, 0 out)
  28 耳机 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free AG Audio%0
;(LM’s AirPods Pro)), Windows WDM-KS (0 in, 1 out)
  29 耳机 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free AG Audio%0
;(LM’s AirPods Pro)), Windows WDM-KS (1 in, 0 out)
  30 麦克风 (USBAudio2.0), Windows WDM-KS (2 in, 0 out)

再次运行demo_toolbox.py就能正常打开

deploy as webservice

is there anyway to deploy it as http service ,we can call it remote
I have two computer~

如何解决运行python synthesizer_preprocess_audio.py时报错 DLL load failed:页面文件太小，无法完成操作

我在运行 python synthesizer_preprocess_audio.py时遇到如上错误，在CSDN上找到解决方法：1.如果python 运行环境不在C盘查看高级系统设置->高级->性能设置->高级->虚拟内存->更改 ->取消自动管理所有驱动器的分页文件大小-> 自定义大小 ->初始大小和最大值设为10240 2. 更改DateLoade 中的参数num_worker 改为0 但我现在不清楚具体怎样把参数设为0

这次训练一半会出现这个EOFError: Ran out of input，怎么回事 PermissionError: [WinError 5] 拒绝访问。

来看看，解决一下

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

训练一半后出现这个，谁能解决help

如何使用训练好的数据集呢

如题~
我将百度云下载好的训练结果放在E:\Voice\trainmodel，执行python demo_toolbox.py -d E:\Voice\trainmodelc好像并不能成功运行

使用百度云上的模型，训练播放后都是杂音

环境

Windows 10
Python 3.7

描述

百度云的pt模型放入synthesizer/saved_models/后，python .\demo_toolbox.py可执行，但产生结果都是杂音，中文和拼音都不太行

问题截图

本人纯小白，希望大佬有空给予指点。

使用百度网盘最新预训练模型，spectrogram不正常，只有两秒杂音

使用预训练模型获得了奇怪的mel spectrogram和杂音

voicepart1.mp3 是一段时长为10秒钟、含7个句子的录音片段

voicepart2.wav 是一段时长为5秒钟的类似片段

合成结果均为约2秒的背景杂音，无论输入内容长度。

200zh数据集解压后，第一步预处理报错

(RVCC) D:\Realtime-Voice-Clone-Chinese-main\Realtime-Voice-Clone-Chinese-main>python synthesizer_preprocess_audio.py D:\data
Arguments:
datasets_root: D:\data
out_dir: D:\data\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
dataset: aidatatang_200zh

Using data from:
D:\data\aidatatang_200zh\corpus\train
aidatatang_200zh: 100%|████████████████████████████████████████████████████████| 420/420 [01:02<00:00, 6.71speakers/s]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 64, in
preprocess_dataset(**vars(args))
File "D:\Realtime-Voice-Clone-Chinese-main\Realtime-Voice-Clone-Chinese-main\synthesizer\preprocess.py", line 76, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence

誤觸，新開到issue，不好意思。

我什么我使用首页提供的模型生成出来的音频都是杂音呢

输入一个mp3报错了，请问是啥原因？

torch.Size的问题

有个问题，他显示Exception:Error(s) in loading state_dict for Tacotron :
Size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70,512]) from checkpoint, the shape in current model is torch.Size([75,512])

关于该项目的一些想法。

目前来看，该项目在实际使用的时候远达不到“可用”的程度。包括以下几种问题：
1、合成的音频会出现不包含正常人声，而是噪声和残缺的声音。
2、合成的音色跟目标音色不一致，差别很大。

目前分析出现问题一的原因应该是因为
1、asr数据中有些数据存在明显过强底噪，音频和文本或者音素数据无法对齐。（加入一些数据清洗的手段）
2、目前的d-vector和vocoder部分都是使用的英文数据集上训练的universal的版本，在中文数据集上使用肯定会出现mismatch的问题。（我理解d-vector和vocoder应该需要在中文数据集上重新训练以获得更好的结果）
3、数据集中音色过少，导致很难找到跟目标音色较为一致的”参考音色“用于生成。（混合多种asr和tts数据集，构建一个大型的数据集，以提高对目标音色的适配程度）。

这块我应该也会着手做一些工作以尝试改进模型，希望有机会和作者合作。

請問有aadatatang_200zh數據集的下載網址嗎?

請問有aadatatang_200zh數據集的下載網址嗎

能出一个视频教程嘛

本人是一个小白，真的尝试去做了，好在一些安装下载配置别人有出教程，但不同人出的并不连贯，让我产生一种莫名其妙的感觉，很多东西在于细节，也许他所讲授的方法适用于这个特定的问题，但并不适用于项目，拜托了

kiwisolver是个什么东西。。。。

Traceback (most recent call last):
File "D:\code\Realtime-Voice-Clone-Chinese\demo_toolbox.py", line 2, in
from toolbox import Toolbox
File "D:\code\Realtime-Voice-Clone-Chinese\toolbox_init_.py", line 1, in
from toolbox.ui import UI
File "D:\code\Realtime-Voice-Clone-Chinese\toolbox\ui.py", line 1, in
import matplotlib.pyplot as plt
File "D:\software\install place\python3\lib\site-packages\matplotlib_init_.py", line 157, in
check_versions()
File "D:\software\install place\python3\lib\site-packages\matplotlib_init.py", line 151, in check_versions
module = importlib.import_module(modname)
File "D:\software\install place\python3\lib\importlib_init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named 'kiwisolver'

支持中文版toolbox，直接输入中文

训练模型时不调用GPU

3990x+两张3090
0.12-0.086step/s
CPU占用为70%
GPU无占用

想要支持更多数据集？在这里提建议

已支持的有 aidatatang（已验证200zh）, Magic Data(已验证open SLR68)
需要更多请在这里提建议，并+1投票，将为大家补充支持

關於 Train synthesizer 的問題，求指導 !

你好
我已經下載了aidatatang_200zh這個數據集，並且把 aidatatang_200zh\corpus\train 底下的檔案都解壓縮完畢
但是當我要開始執行 python synthesizer_preprocess_audio.py D:\google download(我把檔案放在 D:\google download 這個路徑下 )
卻發生以下狀況:
D:\python_demo\Realtime-Voice-Clone-Chinese>python synthesizer_preprocess_audio.py D:\google download\ D:\python_demo\Realtime-Voice-Clone-Chinese\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended. warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.") usage: synthesizer_preprocess_audio.py [-h] [-o OUT_DIR] [-n N_PROCESSES] [-s] [--hparams HPARAMS] [--no_trim] [--no_alignments] [--dataset DATASET] datasets_root synthesizer_preprocess_audio.py: error: unrecognized arguments: download\

請問我可以怎麼解決問題呢? 我有查看之前 issues 的討論並沒有發現有類似問題，以下是我想到可能有問題的地方，還請作者為我解答，謝謝！

1.我只有解壓縮 aidatatang_200zh\corpus\train 底下的檔案，是否其他資料夾下的檔案也要解壓縮?
2.是不是只需要將所有 wav 檔單獨拉出來放在 aidatatang_200zh\corpus\train 底下然後再執行python synthesizer_preprocess_audio.py D:\google download ?
3. 輸入的指令不對
4. wav 檔與 txt 檔是不是要預先處理，而我沒有進行處理?

在运行demo_cli.py时出错

我同时下载了原模型和你的模型，但是在运行demo_cli.py时出现以下错误：

RuntimeError: Error(s) in loading state_dict for Tacotron:
size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([66, 512]) from checkpoint, the shape in current model is torch.Size([70, 512]).

在文字框写了文字，可是出來的是其它聲音，是不是文字框bug了

谁能解决

Python

Backend Qt5Agg is interactive backend. Turning interactive mode on.

直接运行没有问题，但是debug demo_toolbox.py时报错：
Traceback (most recent call last):
File "D:\work\python\ide\pycharm\PyCharm 2020.1.2\plugins\python\helpers\pydev\pydevd.py", line 1438, in exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:\work\python\ide\pycharm\PyCharm 2020.1.2\plugins\python\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/instance/tts/Realtime-Voice-Clone-Chinese-main/demo_toolbox.py", line 43, in
Toolbox(**vars(args))
File "E:\instance\tts\Realtime-Voice-Clone-Chinese-main\toolbox_init.py", line 75, in init
self.ui = UI()
File "E:\instance\tts\Realtime-Voice-Clone-Chinese-main\toolbox\ui.py", line 450, in init
self.projections_layout.addWidget(FigureCanvas(fig))
TypeError: addWidget(self, QWidget, stretch: int = 0, alignment: Union[Qt.Alignment, Qt.AlignmentFlag] = Qt.Alignment()): argument 1 has unexpected type 'FigureCanvasQTAgg'
Backend Qt5Agg is interactive backend. Turning interactive mode on.

声音样本

大佬想问下若声音样本是歌曲的话，能不能克隆出其声音主人的声音出来？

训练模型时显存爆了

Variable._execution_engine.run_backward(RuntimeError: CUDA out of memory. Tried to allocate 88.00 MiB (GPU 0; 4.00 GiB totalcapacity; 2.68 GiB already allocated; 0 bytes free; 2.85 GiB reserved in total by PyTorch)

能不能提供一个调batch_size的参数? 我目前用的显卡显存只有4G(GTX1050Ti)，默认参数正常训练时经常爆掉显存....

bilibili 演示视频已消失

Bilibili

保姆级别教程（持续更新各类社区/非官方教程----

（作者借楼编辑ing
社区视频教程：
奶糖 https://www.bilibili.com/video/BV1dq4y137pH

请问如何训练呢

关于aidatatang_200zh的问题

我尝试从aidatatang_200zh的官网上下载，是要把aidatatang_200zh\aidatatang_200zh\aidatatang_200zh\corpus\train下的文件全部解压吗？

请问可以提供预训练的编码器/声码器吗？

python synthesizer_preprocess_embeds.py <path-to-datasets_root>/SV2TTS/synthesizer

Output:

Arguments:
    synthesizer_root:      <path-to-datasets_root>/SV2TTS/synthesizer
    encoder_model_fpath:   encoder/saved_models/pretrained.pt
    n_processes:           4

Embedding:   0% 0/25308 [00:00<?, ?utterances/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "<path-to-Realtime-Voice-Clone-Chinese>/synthesizer/preprocess.py", line 242, in embed_utterance
    encoder.load_model(encoder_model_fpath)
  File "<path-to-Realtime-Voice-Clone-Chinese>/encoder/inference.py", line 33, in load_model
    checkpoint = torch.load(weights_fpath, _device)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "synthesizer_preprocess_embeds.py", line 25, in <module>
    create_embeddings(**vars(args))    
  File "<path-to-Realtime-Voice-Clone-Chinese>/synthesizer/preprocess.py", line 268, in create_embeddings
    list(tqdm(job, "Embedding", len(fpaths), unit="utterances"))
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1104, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
Embedding:   0% 0/25308 [00:01<?, ?utterances/s]

安全问题，万一有人居心叵测。。。

speaker encoder的输出向量是什么样的？

SVT2TTS的评论区过来的，自己训练的speaker encoder，因为用的aishell3数据集，214个说话人，而输出的speaker embedding是256维的，这就导致每个说话人的向量很稀疏，大部分维度是0，几乎是one-hot形式的。所以用来训练synthesizer的话根本训练不了，loss是Nan。
你这个模型训练synthesizer时有注意到speaker embedding向量大概是什么样的吗？

用这里的模型跑出现这个RuntimeError: Error(s) in loading state_dict for Tacotron: size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70, 512]) from checkpoint, the shape in current model is torch.Size([75, 512]).