babysor / mockingbird

🚀 AI voice cloning: clone a voice in 5 seconds to generate arbitrary speech in real-time

License: Other

Python 99.54% Dockerfile 0.07% Shell 0.24% Cython 0.15%

ai speech pytorch deep-learning text-to-speech tts

mockingbird's Issues

A torch.Size problem

I have a problem. It shows Exception: Error(s) in loading state_dict for Tacotron:
size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70, 512]) from checkpoint, the shape in current model is torch.Size([75, 512])
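The error says the checkpoint stores an embedding table with 70 rows while the current code builds one with 75. The first dimension of an embedding weight is the number of symbols, so the checkpoint and the checked-out code probably disagree on the symbol set; that reading is an inference, not something confirmed in this thread. A minimal sketch for listing such mismatches is below. The helper name `find_shape_mismatches` is hypothetical, and the shape dictionaries are assumed to have been built beforehand, for example with `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint_shapes: Dict[str, Shape],
    model_shapes: Dict[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for every tensor whose shape differs."""
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Only report tensors present on both sides with different shapes.
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((name, ckpt_shape, model_shape))
    return mismatches


# Shapes copied from the error message above.
checkpoint_shapes = {"encoder.embedding.weight": (70, 512)}
model_shapes = {"encoder.embedding.weight": (75, 512)}

for name, ckpt, model in find_shape_mismatches(checkpoint_shapes, model_shapes):
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```

If only the embedding row count differs, using a checkpoint and a code revision that were published together is the likely way out.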

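How do I use the trained dataset?

As the title says.
I put the training results downloaded from Baidu Cloud in E:\Voice\trainmodel, but running python demo_toolbox.py -d E:\Voice\trainmodelc does not seem to work.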

Python

Pre Trained Model

Hi, I am from outside China.

Is it possible to make the pre-trained model available for download from Google Drive?

Is there a download URL for the aidatatang_200zh dataset?

Is there a download URL for the aidatatang_200zh dataset?

Deploy as a web service

Is there any way to deploy it as an HTTP service so that we can call it remotely?
I have two computers.
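One possible shape for such a service, using only the Python standard library, is sketched below. Everything here is an assumption for illustration: `synthesize` is a hypothetical placeholder that would have to be wired to the project's encoder, synthesizer and vocoder, and the JSON request format and port are made up.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def synthesize(text: str) -> bytes:
    """Hypothetical placeholder: a real version would return WAV bytes for `text`."""
    return text.encode("utf-8")  # stand-in payload, not real audio


def handle_request(body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into an (HTTP status, response body) pair."""
    try:
        text = json.loads(body.decode("utf-8"))["text"]
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "expected a JSON object with a 'text' field"}).encode()
    if not isinstance(text, str) or not text:
        return 400, json.dumps({"error": "'text' must be a non-empty string"}).encode()
    return 200, synthesize(text)


class SynthesisHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read exactly the number of bytes the client announced.
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "audio/wav" if status == 200 else "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(host: str = "0.0.0.0", port: int = 8000) -> None:
    """Block forever, answering POST requests on the given address."""
    HTTPServer((host, port), SynthesisHandler).serve_forever()
```

Calling `serve()` on the machine that holds the models would let the second computer POST `{"text": "..."}` to it. Given the security concern raised elsewhere in this tracker, exposing such an endpoint beyond a trusted network would need authentication on top.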

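Error when running python synthesizer_preprocess_audio.py

Does anyone have the model? The Baidu Netdisk download linked in the repo fails with a network error

Crash when generating a recording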

Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at vocoder/saved_models/pretrained/pretrained.pt
python: src/hostapi/alsa/pa_linux_alsa.c:3641: PaAlsaStreamComponent_BeginPolling: Assertion `ret == self->nfds' failed.
Aborted (core dumped)

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

This appears halfway through training. Can anyone help?

With the model from Baidu Cloud, playback after training is all noise

Environment

Windows 10
Python 3.7

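Description

After putting the .pt model from Baidu Cloud into synthesizer/saved_models/, python .\demo_toolbox.py runs, but every result is noise. Neither Chinese text nor pinyin input works well.

Screenshot of the problem

I am a complete beginner and hope an expert can give me some pointers when they have time.

Using the latest pre-trained model from Baidu Netdisk, the spectrogram is abnormal and the output is only two seconds of noise

Security concern: what if someone has malicious intent...

GPU is not used when training the model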

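3990x + two 3090s
0.12-0.086 step/s
CPU usage is 70%
No GPU usage

Got a strange mel spectrogram and noise using the pre-trained model

voicepart1.mp3 is a 10-second recording containing 7 sentences

voicepart2.wav is a similar 5-second clip

The synthesized results are all about 2 seconds of background noise, regardless of the input length.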

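Question about aidatatang_200zh

I tried downloading from the official aidatatang_200zh website. Do I need to extract all the files under aidatatang_200zh\aidatatang_200zh\aidatatang_200zh\corpus\train?

Voice samples

I would like to ask: if the voice sample is a song, can it clone the voice of the singer?

Step-by-step tutorials (continuously updated with community/unofficial tutorials)

(The author is borrowing this thread to edit)
Community video tutorials: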
奶糖 https://www.bilibili.com/video/BV1dq4y137pH

Backend Qt5Agg is interactive backend. Turning interactive mode on.

Running it directly works fine, but debugging demo_toolbox.py raises an error:
Traceback (most recent call last):
  File "D:\work\python\ide\pycharm\PyCharm 2020.1.2\plugins\python\helpers\pydev\pydevd.py", line 1438, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:\work\python\ide\pycharm\PyCharm 2020.1.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "E:/instance/tts/Realtime-Voice-Clone-Chinese-main/demo_toolbox.py", line 43, in <module>
    Toolbox(**vars(args))
  File "E:\instance\tts\Realtime-Voice-Clone-Chinese-main\toolbox\__init__.py", line 75, in __init__
    self.ui = UI()
  File "E:\instance\tts\Realtime-Voice-Clone-Chinese-main\toolbox\ui.py", line 450, in __init__
    self.projections_layout.addWidget(FigureCanvas(fig))
TypeError: addWidget(self, QWidget, stretch: int = 0, alignment: Union[Qt.Alignment, Qt.AlignmentFlag] = Qt.Alignment()): argument 1 has unexpected type 'FigureCanvasQTAgg'
Backend Qt5Agg is interactive backend. Turning interactive mode on.

Where can I download aidatatang_200zh dataset?

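How to fix the error "DLL load failed: The paging file is too small for this operation to complete" when running python synthesizer_preprocess_audio.py

I hit the error above when running python synthesizer_preprocess_audio.py and found a fix on CSDN:
1. If the Python environment is not on drive C, go to Advanced system settings -> Advanced -> Performance Settings -> Advanced -> Virtual memory -> Change -> uncheck "Automatically manage paging file size for all drives" -> Custom size -> set both the initial and maximum size to 10240.
2. Change the DataLoader parameter num_workers to 0.
But I am not sure exactly how to set that parameter to 0.

Why is all the audio I generate with the model from the home page just noise?

What does the speaker encoder's output vector look like?

I came over from the SV2TTS comment section. I trained my own speaker encoder. Because I used the aishell3 dataset, which has 214 speakers, while the output speaker embedding is 256-dimensional, each speaker's vector ends up very sparse: most dimensions are 0, almost one-hot. As a result the synthesizer cannot be trained with it at all; the loss is NaN.
When you trained the synthesizer with this model, did you notice roughly what the speaker embedding vectors look like?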
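To make "very sparse, almost one-hot" measurable, a small sketch like the following could be run over a few embeddings before synthesizer training. The function and field names are hypothetical, and the input is assumed to be a plain sequence of floats (for example an embedding converted with `.tolist()`).

```python
import math
from typing import NamedTuple, Sequence


class EmbeddingStats(NamedTuple):
    dim: int
    zero_fraction: float  # share of components that are (almost) zero
    l2_norm: float
    has_nan: bool


def embedding_stats(embedding: Sequence[float], eps: float = 1e-6) -> EmbeddingStats:
    """Summarise how sparse a single speaker embedding is."""
    if len(embedding) == 0:
        raise ValueError("embedding must not be empty")
    zeros = sum(1 for x in embedding if abs(x) <= eps)
    norm = math.sqrt(sum(x * x for x in embedding))
    has_nan = any(math.isnan(x) for x in embedding)
    return EmbeddingStats(len(embedding), zeros / len(embedding), norm, has_nan)


# Toy 8-dimensional example of a one-hot vector (not real model output).
print(embedding_stats([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```

Comparing `zero_fraction` between a self-trained encoder and a released one would show whether the sparsity described above is unusual. Checking `has_nan` on the embeddings themselves helps separate a bad encoder from a problem inside synthesizer training.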

Could you provide a pre-trained encoder/vocoder?

python synthesizer_preprocess_embeds.py <path-to-datasets_root>/SV2TTS/synthesizer

Output:

Arguments:
    synthesizer_root:      <path-to-datasets_root>/SV2TTS/synthesizer
    encoder_model_fpath:   encoder/saved_models/pretrained.pt
    n_processes:           4

Embedding:   0% 0/25308 [00:00<?, ?utterances/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "<path-to-Realtime-Voice-Clone-Chinese>/synthesizer/preprocess.py", line 242, in embed_utterance
    encoder.load_model(encoder_model_fpath)
  File "<path-to-Realtime-Voice-Clone-Chinese>/encoder/inference.py", line 33, in load_model
    checkpoint = torch.load(weights_fpath, _device)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "synthesizer_preprocess_embeds.py", line 25, in <module>
    create_embeddings(**vars(args))    
  File "<path-to-Realtime-Voice-Clone-Chinese>/synthesizer/preprocess.py", line 268, in create_embeddings
    list(tqdm(job, "Embedding", len(fpaths), unit="utterances"))
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1104, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
Embedding:   0% 0/25308 [00:01<?, ?utterances/s]
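The traceback shows a relative path, `encoder/saved_models/pretrained.pt`, which Python resolves against the current working directory. A quick pre-flight check such as the sketch below can tell whether the file is missing or the script was just started from another directory. The helper name `missing_files` is hypothetical; the two paths are taken from logs in this tracker and may differ in your checkout.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(root: Path, relative_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that do not exist as files under `root`."""
    return [p for p in relative_paths if not (root / p).is_file()]


# Paths seen in logs in this tracker; adjust them to your checkout.
EXPECTED = [
    "encoder/saved_models/pretrained.pt",
    "vocoder/saved_models/pretrained/pretrained.pt",
]

for path in missing_files(Path.cwd(), EXPECTED):
    print(f"missing relative to {Path.cwd()}: {path}")
```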

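Support a Chinese version of the toolbox with direct Chinese input

Could you make a video tutorial?

I am a beginner and I really did try. Fortunately other people have published tutorials for some of the installation, download and configuration steps, but tutorials from different people do not fit together, which left me confused. A lot depends on the details: a method someone teaches may work for one specific problem but not for the project as a whole. Please.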

Suggestion! Maybe you can list the basic hardware requirements of this project.

Just as the title.

Want support for more datasets? Make suggestions here

Currently supported: aidatatang (200zh verified), Magic Data (OpenSLR 68 verified)
If you need more, suggest them here and +1 to vote; support will be added.

How do I train?

Question about Train synthesizer, guidance requested!

Hello
I have downloaded the aidatatang_200zh dataset and extracted all the files under aidatatang_200zh\corpus\train
But when I try to run python synthesizer_preprocess_audio.py D:\google download (I put the files under the path D:\google download)
the following happens:
D:\python_demo\Realtime-Voice-Clone-Chinese>python synthesizer_preprocess_audio.py D:\google download\
D:\python_demo\Realtime-Voice-Clone-Chinese\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended.
  warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
usage: synthesizer_preprocess_audio.py [-h] [-o OUT_DIR] [-n N_PROCESSES] [-s] [--hparams HPARAMS] [--no_trim] [--no_alignments] [--dataset DATASET] datasets_root
synthesizer_preprocess_audio.py: error: unrecognized arguments: download\

How can I solve this? I looked through the earlier issue discussions and found nothing similar. Below are the places where I think the problem might be. I would appreciate the author's answer, thank you!

1. I only extracted the files under aidatatang_200zh\corpus\train. Do the files in the other folders also need to be extracted?
2. Do I just need to pull out all the wav files, put them directly under aidatatang_200zh\corpus\train, and then run python synthesizer_preprocess_audio.py D:\google download ?
3. The command I entered is wrong.
4. Do the wav and txt files need preprocessing beforehand that I have not done?
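The message "unrecognized arguments: download\" points at option 3: the shell splits the unquoted path at the space, so the script receives two arguments instead of one. The sketch below reproduces this with a stand-in parser that has only the `datasets_root` positional from the usage line above; it is not the project's real parser.

```python
import argparse
from typing import List, Optional

# Stand-in parser with the single positional shown in the usage line above.
parser = argparse.ArgumentParser(prog="synthesizer_preprocess_audio.py")
parser.add_argument("datasets_root")


def try_parse(argv: List[str]) -> Optional[str]:
    """Return datasets_root, or None when argparse rejects the arguments."""
    try:
        return parser.parse_args(argv).datasets_root
    except SystemExit:  # argparse prints the usage error and exits
        return None


# Unquoted: the shell hands over two separate arguments.
print(try_parse(["D:\\google", "download\\"]))
# Quoted: the whole path arrives as one argument.
print(try_parse(["D:\\google download"]))
```

On the command line this corresponds to wrapping the path in double quotes, as in python synthesizer_preprocess_audio.py "D:\google download". Dropping the trailing backslash is also safer on Windows, because a backslash directly before the closing quote can be read as an escaped quote.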

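Automatic shutdown during Preprocess the embeddings

Does anyone have the same problem as me? Shortly after I run python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer, the machine shuts down by itself. If you have hit the same problem, how did you solve it?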

LibriSpeech alignments?

(base) F:\Realtime-Voice-Clone-Chinese-main>python synthesizer_preprocess_audio.py "F:\Realtime-Voice-Clone-Chinese-main/data1"
Arguments:
datasets_root: F:\Realtime-Voice-Clone-Chinese-main\data1
out_dir: F:\Realtime-Voice-Clone-Chinese-main\data1\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
dataset: aidatatang_200zh

Using data from:
F:\Realtime-Voice-Clone-Chinese-main\data1\aidatatang_200zh\corpus\train
aidatatang_200zh: 100%|████████████████████████████████████████████████████████| 420/420 [02:47<00:00, 2.51speakers/s]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 64, in <module>
preprocess_dataset(**vars(args))
File "F:\Realtime-Voice-Clone-Chinese-main\synthesizer\preprocess.py", line 76, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence

Found the same problem here: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/486
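The log reports 420 speakers but 0 utterances, so the preprocessing step found speaker entries yet no usable audio, and `max()` then fails on the empty list. One thing worth checking, as an assumption rather than a confirmed diagnosis, is whether the per-speaker archives under `corpus\train` were actually extracted. The sketch below counts extracted `.wav` files and leftover archives; the helper name and the archive suffixes are hypothetical choices.

```python
from pathlib import Path
from typing import Dict


def summarize_train_dir(train_dir: Path) -> Dict[str, int]:
    """Count extracted audio files and leftover archives below `train_dir`."""
    counts = {"wav": 0, "archives": 0}
    for path in train_dir.rglob("*"):
        if not path.is_file():
            continue
        name = path.name.lower()
        if name.endswith(".wav"):
            counts["wav"] += 1
        elif name.endswith((".tar.gz", ".tgz", ".zip")):
            counts["archives"] += 1
    return counts
```

For the run above this would be called as `summarize_train_dir(Path(r"F:\Realtime-Voice-Clone-Chinese-main\data1\aidatatang_200zh\corpus\train"))`. A result with many archives and zero wav files would match the "0 utterances" line in the log.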

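Some thoughts on this project.

As things stand, the project is far from "usable" in practice. The problems include:
1. The synthesized audio sometimes contains no normal human voice, only noise and broken sounds.
2. The synthesized timbre does not match the target timbre; the difference is large.

My current analysis of the causes of problem 1:
1. Some of the ASR data has obviously strong background noise, so the audio cannot be aligned with the text or phoneme data. (Add some data-cleaning steps.)
2. The d-vector and vocoder parts currently use universal versions trained on English datasets, so a mismatch is bound to appear on Chinese datasets. (As I understand it, the d-vector and vocoder should be retrained on Chinese datasets to get better results.)
3. The dataset contains too few timbres, which makes it hard to find a "reference timbre" close to the target timbre for generation. (Mix several ASR and TTS datasets into one large dataset to improve the fit to the target timbre.)

I will probably also start some work to try to improve the model, and I hope to have a chance to collaborate with the author.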

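Running with the model from here gives this RuntimeError: Error(s) in loading state_dict for Tacotron: size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70, 512]) from checkpoint, the shape in current model is torch.Size([75, 512]).

Mis-click; I opened a new issue by accident, sorry.

Questions about training and inference

As far as I know, the datatang and SLR68 datasets are both meant for ASR, so they have no phoneme annotations. Does training use text tokens directly, or is the text first converted to phonemes? Also, in your demo video I seem to see phonemes used as input. If training uses text and inference uses phonemes, what processing happens in between?

This time, halfway through training, I get EOFError: Ran out of input. What is going on? PermissionError: [WinError 5] Access is denied.

Please take a look and help fix it

Feeding in an mp3 raises an error. What is the cause?

sounddevice error

On Windows 10 the system encoding defaults to GBK, and running demo_toolbox.py raises: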

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 6: invalid continuation byte

Open D:\Env\anaconda3\Lib\site-packages\sounddevice.py and go to line 573. There is an existing issue about this error; after changing the encoding to mbcs the error becomes:

UnicodeDecodeError: 'mbcs' codec can't decode bytes in position 0--1:xxxxxxxxxxxxxxxxxxx

Running python -m sounddevice gives the same error

After changing the system encoding as in the steps above and rebooting, python -m sounddevice no longer reports an error

C:\Users\LM>python -m sounddevice
   0 Microsoft Sound Mapper - Input, MME (2 in, 0 out)
>  1 mic (USBAudio2.0), MME (2 in, 0 out)
   2 麦克风阵列 (Realtek High Definition , MME (2 in, 0 out)
   3 立体声混音 (Realtek High Definition , MME (2 in, 0 out)
   4 Microsoft Sound Mapper - Output, MME (0 in, 2 out)
<  5 ear (15- Meizu HiFi DAC Headpho, MME (0 in, 2 out)
   6 Speaker (Realtek High Definitio, MME (0 in, 2 out)
   7 DELL U2414H (NVIDIA High Defini, MME (0 in, 2 out)
   8 主声音捕获驱动程序, Windows DirectSound (2 in, 0 out)
   9 mic (USBAudio2.0), Windows DirectSound (2 in, 0 out)
  10 麦克风阵列 (Realtek High Definition Audio), Windows DirectSound (2 in, 0 out)
  11 立体声混音 (Realtek High Definition Audio), Windows DirectSound (2 in, 0 out)
  12 主声音驱动程序, Windows DirectSound (0 in, 2 out)
  13 Speaker (Realtek High Definition Audio), Windows DirectSound (0 in, 2 out)
  14 DELL U2414H (NVIDIA High Definition Audio), Windows DirectSound (0 in, 2 out)
  15 DSD 转码器 (DoP/Native), ASIO (0 in, 2 out)
  16 ear (15- Meizu HiFi DAC Headphone Amplifier), Windows WASAPI (0 in, 2 out)
  17 Speaker (Realtek High Definition Audio), Windows WASAPI (0 in, 2 out)
  18 DELL U2414H (NVIDIA High Definition Audio), Windows WASAPI (0 in, 2 out)
  19 麦克风阵列 (Realtek High Definition Audio), Windows WASAPI (2 in, 0 out)
  20 立体声混音 (Realtek High Definition Audio), Windows WASAPI (2 in, 0 out)
  21 mic (USBAudio2.0), Windows WASAPI (2 in, 0 out)
  22 Output (), Windows WDM-KS (0 in, 2 out)
  23 耳机 (), Windows WDM-KS (0 in, 2 out)
  24 Headphones (Meizu HiFi DAC Headphone Amplifier), Windows WDM-KS (0 in, 2 out)
  25 Speakers (Realtek HD Audio output), Windows WDM-KS (0 in, 2 out)
  26 立体声混音 (Realtek HD Audio Stereo input), Windows WDM-KS (2 in, 0 out)
  27 麦克风阵列 (Realtek HD Audio Mic input), Windows WDM-KS (2 in, 0 out)
  28 耳机 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free AG Audio%0
;(LM’s AirPods Pro)), Windows WDM-KS (0 in, 1 out)
  29 耳机 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free AG Audio%0
;(LM’s AirPods Pro)), Windows WDM-KS (1 in, 0 out)
  30 麦克风 (USBAudio2.0), Windows WDM-KS (2 in, 0 out)

Running demo_toolbox.py again now opens normally
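The underlying mismatch can be reproduced without any audio hardware: bytes written in GBK are generally not valid UTF-8. The sketch below encodes one of the device names from the list above as GBK and then tries to decode it. The `decode_with_fallback` helper is a hypothetical illustration of a tolerant decode, not what sounddevice does internally.

```python
from typing import Sequence

device_name = "麦克风"  # "microphone", as in the device list above
raw = device_name.encode("gbk")  # bytes as a GBK-configured system might report them


def decode_with_fallback(data: bytes, encodings: Sequence[str] = ("utf-8", "gbk")) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters instead of crashing.
    return data.decode("utf-8", errors="replace")


try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 decode failed:", exc.reason)

print(decode_with_fallback(raw))
```

This shows why switching the system encoding makes the error go away: once the names are produced in the encoding the reader expects, the strict decode succeeds.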

What on earth is kiwisolver...

Traceback (most recent call last):
  File "D:\code\Realtime-Voice-Clone-Chinese\demo_toolbox.py", line 2, in <module>
    from toolbox import Toolbox
  File "D:\code\Realtime-Voice-Clone-Chinese\toolbox\__init__.py", line 1, in <module>
    from toolbox.ui import UI
  File "D:\code\Realtime-Voice-Clone-Chinese\toolbox\ui.py", line 1, in <module>
    import matplotlib.pyplot as plt
  File "D:\software\install place\python3\lib\site-packages\matplotlib\__init__.py", line 157, in <module>
    check_versions()
  File "D:\software\install place\python3\lib\site-packages\matplotlib\__init__.py", line 151, in check_versions
    module = importlib.import_module(modname)
  File "D:\software\install place\python3\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named 'kiwisolver'
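kiwisolver is a package that matplotlib depends on, and the traceback shows matplotlib failing its own dependency check at import time. A small sketch for spotting such gaps before launching the toolbox follows. The helper name is hypothetical, and the module list is only what appears in tracebacks in this tracker, not the project's full requirements.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level modules that appear in tracebacks in this tracker.
print(missing_modules(["kiwisolver", "matplotlib", "sounddevice", "webrtcvad"]))
```

Anything the function reports can usually be installed with pip into the same environment that runs demo_toolbox.py, for example pip install kiwisolver.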

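Does this project require training the model yourself?

Putting the downloaded pretrained models in the root directory does not work; the toolbox only starts after copying them into the corresponding directories
It can only load audio from the dataset
The parsing feature cannot be used
The instructions are not detailed enough and I do not know how to use it
Please publish a detailed step-by-step document

I typed text into the text box, but a different sound comes out. Is the text box bugged?

Can anyone solve this?

Error when loading the pre-trained model provided by the author.

The encoder.embedding.weight mismatch problem.

How do I properly tune CPU and GPU utilization?

GPU and CPU utilization are only about 13%. How should I adjust the training parameters?

Error when running demo_cli.py

I downloaded both the original model and your model, but running demo_cli.py gives the following error: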

RuntimeError: Error(s) in loading state_dict for Tacotron:
size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([66, 512]) from checkpoint, the shape in current model is torch.Size([70, 512]).

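Out of GPU memory when training the model

Variable._execution_engine.run_backward(RuntimeError: CUDA out of memory. Tried to allocate 88.00 MiB (GPU 0; 4.00 GiB total capacity; 2.68 GiB already allocated; 0 bytes free; 2.85 GiB reserved in total by PyTorch)

Could you provide a parameter for adjusting batch_size? My graphics card has only 4 GB of VRAM (GTX 1050 Ti), and with the default parameters training frequently runs out of memory.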
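Until such a parameter exists, lowering the batch size in the training schedule is the usual workaround. The sketch below assumes the schedule is a list of `(r, lr, max_step, batch_size)` tuples, which is how the upstream Real-Time-Voice-Cloning synthesizer hyperparameters appear to be laid out; that layout should be checked against synthesizer/hparams.py in your checkout. The helper name and the example values are hypothetical, not the project's defaults.

```python
from typing import List, Tuple

# Assumed layout of one schedule entry: (r, lr, max_step, batch_size).
ScheduleEntry = Tuple[int, float, int, int]


def shrink_batch_size(
    schedule: List[ScheduleEntry], factor: int = 2, minimum: int = 1
) -> List[ScheduleEntry]:
    """Return a copy of `schedule` with every batch size divided by `factor`."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [
        (r, lr, step, max(minimum, batch // factor))
        for r, lr, step, batch in schedule
    ]


# Illustrative values only, not the project's defaults.
example_schedule = [(2, 1e-3, 10_000, 12), (2, 5e-4, 20_000, 12)]
print(shrink_batch_size(example_schedule))
```

Halving the batch size roughly halves activation memory per step, at the cost of noisier gradients and more steps per epoch, so a 4 GB card may need more than one halving.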

After extracting the 200zh dataset, the first preprocessing step fails

(RVCC) D:\Realtime-Voice-Clone-Chinese-main\Realtime-Voice-Clone-Chinese-main>python synthesizer_preprocess_audio.py D:\data
Arguments:
datasets_root: D:\data
out_dir: D:\data\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
dataset: aidatatang_200zh

Using data from:
D:\data\aidatatang_200zh\corpus\train
aidatatang_200zh: 100%|████████████████████████████████████████████████████████| 420/420 [01:02<00:00, 6.71speakers/s]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 64, in <module>
preprocess_dataset(**vars(args))
File "D:\Realtime-Voice-Clone-Chinese-main\Realtime-Voice-Clone-Chinese-main\synthesizer\preprocess.py", line 76, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence
