GithubHelp home page GithubHelp logo

babysor / mockingbird Goto Github PK

View Code? Open in Web Editor NEW
34.0K 306.0 5.1K 127.72 MB

🚀AI拟声: 5秒内克隆您的声音并生成任意语音内容 Clone a voice in 5 seconds to generate arbitrary speech in real-time

License: Other

Python 99.54% Dockerfile 0.07% Shell 0.24% Cython 0.15%
ai speech pytorch deep-learning text-to-speech tts

mockingbird's Issues

torch.Size的问题

有个问题,他显示Exception:Error(s) in loading state_dict for Tacotron :
Size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70,512]) from checkpoint, the shape in current model is torch.Size([75,512])

如何使用训练好的数据集呢

如题~
我将百度云下载好的训练结果放在E:\Voice\trainmodel,执行python demo_toolbox.py -d E:\Voice\trainmodelc好像并不能成功运行

Pre Trained Model

Hi, I am from outside China

is it possible to have the pre-trained model download from google drive?

deploy as webservice

is there anyway to deploy it as http service ,we can call it remote
I have two computer~

在生成录音时闪退

Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at vocoder/saved_models/pretrained/pretrained.pt
python: src/hostapi/alsa/pa_linux_alsa.c:3641: PaAlsaStreamComponent_BeginPolling: Assertion `ret == self->nfds' failed.
Aborted (core dumped)

使用百度云上的模型,训练播放后都是杂音

环境

Windows 10
Python 3.7

描述

百度云的pt模型放入synthesizer/saved_models/后,python .\demo_toolbox.py可执行,但产生结果都是杂音,中文和拼音都不太行

问题截图

image
image

本人纯小白,希望大佬有空给予指点。

关于aidatatang_200zh的问题

我尝试从aidatatang_200zh的官网上下载,是要把aidatatang_200zh\aidatatang_200zh\aidatatang_200zh\corpus\train下的文件全部解压吗?

声音样本

大佬想问下若声音样本是歌曲的话,能不能克隆出其声音主人的声音出来?

Backend Qt5Agg is interactive backend. Turning interactive mode on.

直接运行没有问题,但是debug demo_toolbox.py时 报错:
Traceback (most recent call last):
File "D:\work\python\ide\pycharm\PyCharm 2020.1.2\plugins\python\helpers\pydev\pydevd.py", line 1438, in exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:\work\python\ide\pycharm\PyCharm 2020.1.2\plugins\python\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/instance/tts/Realtime-Voice-Clone-Chinese-main/demo_toolbox.py", line 43, in
Toolbox(**vars(args))
File "E:\instance\tts\Realtime-Voice-Clone-Chinese-main\toolbox_init
.py", line 75, in init
self.ui = UI()
File "E:\instance\tts\Realtime-Voice-Clone-Chinese-main\toolbox\ui.py", line 450, in init
self.projections_layout.addWidget(FigureCanvas(fig))
TypeError: addWidget(self, QWidget, stretch: int = 0, alignment: Union[Qt.Alignment, Qt.AlignmentFlag] = Qt.Alignment()): argument 1 has unexpected type 'FigureCanvasQTAgg'
Backend Qt5Agg is interactive backend. Turning interactive mode on.

如何解决运行python synthesizer_preprocess_audio.py时报错 DLL load failed:页面文件太小,无法完成操作

我在运行 python synthesizer_preprocess_audio.py时遇到如上错误 ,在CSDN上找到解决方法:1.如果python 运行环境不在C盘 查看高级系统设置->高级->性能 设置->高级->虚拟内存->更改 ->取消自动管理所有驱动器的分页文件大小-> 自定义大小 ->初始大小和最大值设为10240 2. 更改DateLoade 中的参数num_worker 改为0 但我现在不清楚具体怎样把参数设为0

speaker encoder的输出向量是什么样的?

SVT2TTS的评论区过来的,自己训练的speaker encoder,因为用的aishell3数据集,214个说话人,而输出的speaker embedding是256维的,这就导致每个说话人的向量很稀疏,大部分维度是0,几乎是one-hot形式的。所以用来训练synthesizer的话根本训练不了,loss是Nan。
你这个模型训练synthesizer时有注意到speaker embedding向量大概是什么样的吗?

请问可以提供预训练的编码器/声码器吗?

python synthesizer_preprocess_embeds.py <path-to-datasets_root>/SV2TTS/synthesizer

Output:

Arguments:
    synthesizer_root:      <path-to-datasets_root>/SV2TTS/synthesizer
    encoder_model_fpath:   encoder/saved_models/pretrained.pt
    n_processes:           4

Embedding:   0% 0/25308 [00:00<?, ?utterances/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "<path-to-Realtime-Voice-Clone-Chinese>/synthesizer/preprocess.py", line 242, in embed_utterance
    encoder.load_model(encoder_model_fpath)
  File "<path-to-Realtime-Voice-Clone-Chinese>/encoder/inference.py", line 33, in load_model
    checkpoint = torch.load(weights_fpath, _device)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "synthesizer_preprocess_embeds.py", line 25, in <module>
    create_embeddings(**vars(args))    
  File "<path-to-Realtime-Voice-Clone-Chinese>/synthesizer/preprocess.py", line 268, in create_embeddings
    list(tqdm(job, "Embedding", len(fpaths), unit="utterances"))
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1104, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
Embedding:   0% 0/25308 [00:01<?, ?utterances/s]

能出一个视频教程嘛

本人是一个小白,真的尝试去做了,好在一些安装下载配置别人有出教程,但不同人出的并不连贯,让我产生一种莫名其妙的感觉,很多东西在于细节,也许他所讲授的方法适用于这个特定的问题,但并不适用于项目,拜托了

關於 Train synthesizer 的問題,求指導 !

你好
我已經下載了aidatatang_200zh這個數據集,並且把 aidatatang_200zh\corpus\train 底下的檔案都解壓縮完畢
但是當我要開始執行 python synthesizer_preprocess_audio.py D:\google download(我把檔案放在 D:\google download 這個路徑下 )
卻發生以下狀況:
D:\python_demo\Realtime-Voice-Clone-Chinese>python synthesizer_preprocess_audio.py D:\google download\ D:\python_demo\Realtime-Voice-Clone-Chinese\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended. warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.") usage: synthesizer_preprocess_audio.py [-h] [-o OUT_DIR] [-n N_PROCESSES] [-s] [--hparams HPARAMS] [--no_trim] [--no_alignments] [--dataset DATASET] datasets_root synthesizer_preprocess_audio.py: error: unrecognized arguments: download\

請問我可以怎麼解決問題呢? 我有查看之前 issues 的討論並沒有發現有類似問題,以下是我想到可能有問題的地方,還請作者為我解答,謝謝!

1.我只有解壓縮 aidatatang_200zh\corpus\train 底下的檔案,是否其他資料夾下的檔案也要解壓縮?
2.是不是只需要將所有 wav 檔單獨拉出來放在 aidatatang_200zh\corpus\train 底下然後再執行python synthesizer_preprocess_audio.py D:\google download ?
3. 輸入的指令不對
4. wav 檔 與 txt 檔是不是要預先處理,而我沒有進行處理?

在 Preprocess the embeddings 時自動關機

有人有跟我一樣的問題嗎,剛執行 python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer 不久就自動關機,有遇過同樣問題的人用什麼辦法解決呢?

LibriSpeech alignments?

(base) F:\Realtime-Voice-Clone-Chinese-main>python synthesizer_preprocess_audio.py "F:\Realtime-Voice-Clone-Chinese-main/data1"
Arguments:
datasets_root: F:\Realtime-Voice-Clone-Chinese-main\data1
out_dir: F:\Realtime-Voice-Clone-Chinese-main\data1\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
dataset: aidatatang_200zh

Using data from:
F:\Realtime-Voice-Clone-Chinese-main\data1\aidatatang_200zh\corpus\train
aidatatang_200zh: 100%|████████████████████████████████████████████████████████| 420/420 [02:47<00:00, 2.51speakers/s]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 64, in
preprocess_dataset(**vars(args))
File "F:\Realtime-Voice-Clone-Chinese-main\synthesizer\preprocess.py", line 76, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence
image

在此处发现同样问题:https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/486

关于该项目的一些想法。

目前来看,该项目在实际使用的时候远达不到“可用”的程度。包括以下几种问题:
1、合成的音频会出现不包含正常人声,而是噪声和残缺的声音。
2、合成的音色跟目标音色不一致,差别很大。

目前分析出现问题一的原因应该是因为
1、asr数据中有些数据存在明显过强底噪,音频和文本或者音素数据无法对齐。(加入一些数据清洗的手段)
2、目前的d-vector和vocoder部分都是使用的英文数据集上训练的universal的版本,在中文数据集上使用肯定会出现mismatch的问题。(我理解d-vector和vocoder应该需要在中文数据集上重新训练以获得更好的结果)
3、数据集中音色过少,导致很难找到跟目标音色较为一致的”参考音色“用于生成。(混合多种asr和tts数据集,构建一个大型的数据集,以提高对目标音色的适配程度)。

这块我应该也会着手做一些工作以尝试改进模型,希望有机会和作者合作。

关于训练和推理的疑问

据我了解,datatang和slr68数据集都是针对ASR的数据,所以没有标注phoneme,那训练的时候是直接使用文字token还是先将文字转换成phoneme在进行训练。另外在您的演示视频中,我貌似看到是使用phoneme作为输入,如果是使用文字训练,inference的时候用phoneme,这之间又有什么样的处理。

sounddevice报错问题

在win10默认情况下系统编码格式为gbk,在运行demo_toolbox.py时会报错:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 6: invalid continuation byte

打开D:\Env\anaconda3\Lib\site-packages\sounddevice.py移动到573行,有相关报错的issue,更改为mbcs后错误变成:

UnicodeDecodeError: 'mbcs' codec can't decode bytes in position 0--1:xxxxxxxxxxxxxxxxxxx

运行python -m sounddevice会报相同的错误
image
按照上述步骤更改系统编码格式后重启,再次运行python -m sounddevice就没报错了

C:\Users\LM>python -m sounddevice
   0 Microsoft Sound Mapper - Input, MME (2 in, 0 out)
>  1 mic (USBAudio2.0), MME (2 in, 0 out)
   2 麦克风阵列 (Realtek High Definition , MME (2 in, 0 out)
   3 立体声混音 (Realtek High Definition , MME (2 in, 0 out)
   4 Microsoft Sound Mapper - Output, MME (0 in, 2 out)
<  5 ear (15- Meizu HiFi DAC Headpho, MME (0 in, 2 out)
   6 Speaker (Realtek High Definitio, MME (0 in, 2 out)
   7 DELL U2414H (NVIDIA High Defini, MME (0 in, 2 out)
   8 主声音捕获驱动程序, Windows DirectSound (2 in, 0 out)
   9 mic (USBAudio2.0), Windows DirectSound (2 in, 0 out)
  10 麦克风阵列 (Realtek High Definition Audio), Windows DirectSound (2 in, 0 out)
  11 立体声混音 (Realtek High Definition Audio), Windows DirectSound (2 in, 0 out)
  12 主声音驱动程序, Windows DirectSound (0 in, 2 out)
  13 Speaker (Realtek High Definition Audio), Windows DirectSound (0 in, 2 out)
  14 DELL U2414H (NVIDIA High Definition Audio), Windows DirectSound (0 in, 2 out)
  15 DSD 转码器 (DoP/Native), ASIO (0 in, 2 out)
  16 ear (15- Meizu HiFi DAC Headphone Amplifier), Windows WASAPI (0 in, 2 out)
  17 Speaker (Realtek High Definition Audio), Windows WASAPI (0 in, 2 out)
  18 DELL U2414H (NVIDIA High Definition Audio), Windows WASAPI (0 in, 2 out)
  19 麦克风阵列 (Realtek High Definition Audio), Windows WASAPI (2 in, 0 out)
  20 立体声混音 (Realtek High Definition Audio), Windows WASAPI (2 in, 0 out)
  21 mic (USBAudio2.0), Windows WASAPI (2 in, 0 out)
  22 Output (), Windows WDM-KS (0 in, 2 out)
  23 耳机 (), Windows WDM-KS (0 in, 2 out)
  24 Headphones (Meizu HiFi DAC Headphone Amplifier), Windows WDM-KS (0 in, 2 out)
  25 Speakers (Realtek HD Audio output), Windows WDM-KS (0 in, 2 out)
  26 立体声混音 (Realtek HD Audio Stereo input), Windows WDM-KS (2 in, 0 out)
  27 麦克风阵列 (Realtek HD Audio Mic input), Windows WDM-KS (2 in, 0 out)
  28 耳机 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free AG Audio%0
;(LM’s AirPods Pro)), Windows WDM-KS (0 in, 1 out)
  29 耳机 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free AG Audio%0
;(LM’s AirPods Pro)), Windows WDM-KS (1 in, 0 out)
  30 麦克风 (USBAudio2.0), Windows WDM-KS (2 in, 0 out)

再次运行demo_toolbox.py就能正常打开

kiwisolver是个什么东西。。。。

Traceback (most recent call last):
File "D:\code\Realtime-Voice-Clone-Chinese\demo_toolbox.py", line 2, in
from toolbox import Toolbox
File "D:\code\Realtime-Voice-Clone-Chinese\toolbox_init_.py", line 1, in
from toolbox.ui import UI
File "D:\code\Realtime-Voice-Clone-Chinese\toolbox\ui.py", line 1, in
import matplotlib.pyplot as plt
File "D:\software\install place\python3\lib\site-packages\matplotlib_init_.py", line 157, in
check_versions()
File "D:\software\install place\python3\lib\site-packages\matplotlib_init
.py", line 151, in check_versions
module = importlib.import_module(modname)
File "D:\software\install place\python3\lib\importlib_init
.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named 'kiwisolver'

这个项目需要自己训练吗?

Pretrained-models下了放根目录不行 拷贝了到相对应的文件目录才能启动工具箱
只能load数据集的语音
无法使用解析功能
说明不详细不会用啊
出一个详细的步骤文档吧

在运行demo_cli.py时出错

我同时下载了原模型和你的模型,但是在运行demo_cli.py时出现以下错误:

RuntimeError: Error(s) in loading state_dict for Tacotron:
size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([66, 512]) from checkpoint, the shape in current model is torch.Size([70, 512]).

训练模型时显存爆了

Variable._execution_engine.run_backward(RuntimeError: CUDA out of memory. Tried to allocate 88.00 MiB (GPU 0; 4.00 GiB totalcapacity; 2.68 GiB already allocated; 0 bytes free; 2.85 GiB reserved in total by PyTorch)

能不能提供一个调batch_size的参数? 我目前用的显卡显存只有4G(GTX1050Ti),默认参数正常训练时经常爆掉显存....

200zh数据集解压后,第一步预处理报错

(RVCC) D:\Realtime-Voice-Clone-Chinese-main\Realtime-Voice-Clone-Chinese-main>python synthesizer_preprocess_audio.py D:\data
Arguments:
datasets_root: D:\data
out_dir: D:\data\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
dataset: aidatatang_200zh

Using data from:
D:\data\aidatatang_200zh\corpus\train
aidatatang_200zh: 100%|████████████████████████████████████████████████████████| 420/420 [01:02<00:00, 6.71speakers/s]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 64, in
preprocess_dataset(**vars(args))
File "D:\Realtime-Voice-Clone-Chinese-main\Realtime-Voice-Clone-Chinese-main\synthesizer\preprocess.py", line 76, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.