playvoice / so-vits-svc-5.0

Core Engine of Singing Voice Conversion & Singing Voice Clone

Home Page: https://huggingface.co/spaces/maxmax20160403/sovits5.0

License: MIT License

Python 97.92% Jupyter Notebook 2.08%
sovits svc vits change voice singing-voice-conversion diff-svc diffusion diffusion-svc vits2

so-vits-svc-5.0's Introduction

Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS


中文文档 (Chinese documentation)

The bigvgan-mix-v2 branch has good audio quality

The RoFormer-HiFTNet branch has fast inference speed

No further upgrades planned

  • This project targets deep learning beginners; basic knowledge of Python and PyTorch is the only prerequisite;
  • This project aims to help deep learning beginners move past purely theoretical study and master the basics of deep learning through hands-on practice;
  • This project does not support real-time voice conversion (whisper would need to be replaced if real-time conversion is what you are looking for);
  • This project will not develop one-click packages for other purposes;

vits-5.0-frame

  • A minimum VRAM requirement of 6GB for training

  • Support for multiple speakers

  • Create unique speakers through speaker mixing

  • It can even convert voices with light accompaniment

  • You can edit F0 using Excel

AI_Elysia_LoveStory.mp4

Powered by @ShadowVap

Model properties

| Feature | From | Function |
| --- | --- | --- |
| whisper | OpenAI | strong noise immunity |
| bigvgan | NVIDIA | alias and snake |
| natural speech | Microsoft | reduce mispronunciation |
| neural source-filter | NII | solve the problem of audio F0 discontinuity |
| speaker encoder | Google | timbre encoding and clustering |
| GRL for speaker | Ubisoft | prevent the encoder from leaking timbre |
| SNAC | Samsung | one-shot clone of VITS |
| SCLN | Microsoft | improve clone |
| Diffusion | HuaWei | improve sound quality |
| PPG perturbation | this project | improved noise immunity and de-timbre |
| HuBERT perturbation | this project | improved noise immunity and de-timbre |
| VAE perturbation | this project | improve sound quality |
| MIX encoder | this project | improve conversion stability |
| USP infer | this project | improve conversion stability |
| HiFTNet | Columbia University | NSF-iSTFTNet for speed up |
| RoFormer | Zhuiyi Technology | rotary positional embeddings |

Due to the use of data perturbation, this project takes longer to train than comparable projects.

USP: Unvoiced and Silence segments keep a Pitch during inference (vits_svc_usp)

Why mix

mix_frame

Plug-In-Diffusion

plug-in-diffusion

Setup Environment

  1. Install PyTorch.

  2. Install project dependencies

    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt

    Note: whisper is already built in; do not install it again, otherwise it will cause conflicts and errors

  3. Download the Timbre Encoder: Speaker-Encoder by @mueller91, put best_model.pth.tar into speaker_pretrain/.

  4. Download the whisper model whisper-large-v2. Make sure to download large-v2.pt and put it into whisper_pretrain/.

  5. Download the hubert_soft model and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.

  6. Download the crepe full pitch extractor and put full.pth into crepe/assets.

    Note: crepe full.pth is 84.9 MB, not 6 KB

  7. Download the pretrained model sovits5.0.pretrain.pth and put it into vits_pretrain/.

    python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
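
As a quick check before moving on, the downloaded checkpoints can be verified from Python. The paths below are the ones used in the steps above; the minimum-size thresholds are loose guesses (except the crepe size noted in step 6), so adjust them as needed:

import os

# Expected checkpoint locations from the setup steps above.
# The minimum sizes are rough sanity thresholds, not exact file sizes.
expected = {
    "speaker_pretrain/best_model.pth.tar": 10 * 1024 * 1024,
    "whisper_pretrain/large-v2.pt": 1024 * 1024 * 1024,
    "hubert_pretrain/hubert-soft-0d54a1f4.pt": 10 * 1024 * 1024,
    "crepe/assets/full.pth": 80 * 1024 * 1024,  # step 6: ~84.9 MB, not 6 KB
    "vits_pretrain/sovits5.0.pretrain.pth": 10 * 1024 * 1024,
}

for path, min_size in expected.items():
    if not os.path.isfile(path):
        print(f"MISSING: {path}")
    elif os.path.getsize(path) < min_size:
        print(f"SUSPICIOUSLY SMALL ({os.path.getsize(path)} bytes): {path}")
    else:
        print(f"OK: {path}")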

Dataset preparation

Necessary pre-processing:

  1. Separate vocals from accompaniment with UVR (skip this if there is no accompaniment).
  2. Cut the audio into shorter segments with a slicer; whisper requires inputs shorter than 30 seconds.
  3. Manually check the generated segments and remove any shorter than 2 seconds or with obvious noise (see the sketch after the directory layout below).
  4. Adjust loudness if necessary; Adobe Audition is recommended.
  5. Put the dataset into the dataset_raw directory following the structure below.
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
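
For step 3 above, a small script can flag clips that are too short (or too long for whisper) instead of relying on listening alone. A minimal sketch; it assumes the dataset_raw layout shown above and uses the soundfile package (install it separately if it is not already present):

import os
import soundfile as sf  # assumed available; `pip install soundfile` if missing

DATASET = "dataset_raw"
MIN_SECONDS = 2.0   # step 3: remove clips shorter than 2 seconds
MAX_SECONDS = 30.0  # step 2: whisper requires inputs shorter than 30 seconds

for speaker in sorted(os.listdir(DATASET)):
    spk_dir = os.path.join(DATASET, speaker)
    if not os.path.isdir(spk_dir):
        continue
    for name in sorted(os.listdir(spk_dir)):
        if not name.endswith(".wav"):
            continue
        path = os.path.join(spk_dir, name)
        info = sf.info(path)
        duration = info.frames / info.samplerate
        if duration < MIN_SECONDS:
            print(f"too short ({duration:.2f}s): {path}")
        elif duration > MAX_SECONDS:
            print(f"longer than 30s, re-slice: {path}")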

Data preprocessing

python svc_preprocessing.py -t 2

-t: number of threads; it should not exceed the CPU core count, and 2 is usually enough. After preprocessing you will get output with the following structure.

data_svc/
└── waves-16k
│    └── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
└── waves-32k
│    └── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
└── pitch
│    └── speaker0
│    │      ├── 000001.pit.npy
│    │      └── 000xxx.pit.npy
│    └── speaker1
│           ├── 000001.pit.npy
│           └── 000xxx.pit.npy
└── hubert
│    └── speaker0
│    │      ├── 000001.vec.npy
│    │      └── 000xxx.vec.npy
│    └── speaker1
│           ├── 000001.vec.npy
│           └── 000xxx.vec.npy
└── whisper
│    └── speaker0
│    │      ├── 000001.ppg.npy
│    │      └── 000xxx.ppg.npy
│    └── speaker1
│           ├── 000001.ppg.npy
│           └── 000xxx.ppg.npy
└── speaker
│    └── speaker0
│    │      ├── 000001.spk.npy
│    │      └── 000xxx.spk.npy
│    └── speaker1
│           ├── 000001.spk.npy
│           └── 000xxx.spk.npy
└── singer
│   ├── speaker0.spk.npy
│   └── speaker1.spk.npy
|
└── indexes
    ├── speaker0
    │   ├── some_prefix_hubert.index
    │   └── some_prefix_whisper.index
    └── speaker1
        ├── hubert.index
        └── whisper.index
  1. Re-sampling

    • Generate audio with a sampling rate of 16000Hz in ./data_svc/waves-16k
    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
    
    • Generate audio with a sampling rate of 32000Hz in ./data_svc/waves-32k
    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
    
  2. Use 16K audio to extract pitch

    python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
    
  3. Use 16K audio to extract ppg

    python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
    
  4. Use 16K audio to extract hubert

    python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
    
  5. Use 16k audio to extract timbre code

    python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
    
  6. Extract the average of the timbre codes for inference; it can also replace the per-utterance timbre when generating the training index, serving as the speaker's unified timbre for training

    python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
    
  7. Use 32k audio to extract the linear spectrum

    python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
    
  8. Use 32k audio to generate training index

    python prepare/preprocess_train.py
    
  9. Training file debugging

    python prepare/preprocess_zzz.py
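
Besides prepare/preprocess_zzz.py, a quick way to sanity-check the preprocessing output is to load one .npy file per feature type and print its shape; missing or empty arrays usually point at a failed step. A minimal sketch, assuming the data_svc layout shown above:

import glob
import numpy as np

# One sample per feature type, following the data_svc layout above.
patterns = {
    "pitch":   "data_svc/pitch/*/*.pit.npy",
    "whisper": "data_svc/whisper/*/*.ppg.npy",
    "hubert":  "data_svc/hubert/*/*.vec.npy",
    "speaker": "data_svc/speaker/*/*.spk.npy",
    "singer":  "data_svc/singer/*.spk.npy",
}

for name, pattern in patterns.items():
    files = sorted(glob.glob(pattern))
    if not files:
        print(f"{name}: no files matched {pattern}")
        continue
    arr = np.load(files[0])
    print(f"{name}: {len(files)} files, example {files[0]} shape={arr.shape}")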
    

Train

  1. If fine-tuning from the pre-trained model, you need to download the pre-trained model sovits5.0.pretrain.pth, put it under the project root, and set this line

    pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
    

    in configs/base.yaml, and adjust the learning rate appropriately, e.g. 5e-5.

    batch_size: for a GPU with 6 GB of VRAM, 6 is the recommended value; 8 will work, but each step will be much slower.

  2. Start training

    python svc_trainer.py -c configs/base.yaml -n sovits5.0
    
  3. Resume training

    python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/sovits5.0_***.pt
    
  4. Log visualization

    tensorboard --logdir logs/
    

sovits5 0_base

sovits_spec

Inference

  1. Export the inference model: text encoder, flow network, and decoder network

    python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
    
  2. Inference

    • If there is no need to adjust F0, just run the following command.
    python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0

    • If F0 will be adjusted manually, follow these steps:
      1. Use whisper to extract the content encoding and generate test.ppg.npy.
      python whisper/inference.py -w test.wav -p test.ppg.npy

      2. Use hubert to extract the content vector; running it separately (rather than inside the one-click inference) reduces GPU memory usage.
      python hubert/inference.py -w test.wav -v test.vec.npy

      3. Extract the F0 parameters in CSV text format, open the CSV file in Excel, and manually correct wrong F0 values against Audition or SonicVisualiser.
      python pitch/inference.py -w test.wav -p test.csv

      4. Final inference.
      python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --vec test.vec.npy --pit test.csv --shift 0
      
  3. Notes

    • when --ppg is specified and the same audio is inferred multiple times, repeated extraction of the audio content encoding is avoided; if it is not specified, it will be extracted automatically;

    • when --vec is specified and the same audio is inferred multiple times, repeated extraction of the audio content vector is avoided; if it is not specified, it will be extracted automatically;

    • when --pit is specified, the manually tuned F0 parameters are loaded; if not specified, they will be extracted automatically;

    • the output file svc_out.wav is generated in the current directory

  4. Arguments ref

    | args | name |
    | --- | --- |
    | --config | config path |
    | --model | model path |
    | --spk | speaker |
    | --wave | wave input |
    | --ppg | wave ppg |
    | --vec | wave hubert |
    | --pit | wave pitch |
    | --shift | pitch shift |
  5. Post-processing by VAD

python svc_inference_post.py --ref test.wav --svc svc_out.wav --out svc_out_post.wav
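
When F0 is tuned by hand, the four commands of the manual workflow above are easy to chain from one script, leaving only the CSV edit as a manual step. A minimal sketch using subprocess; the file names match the examples above and can be changed freely:

import subprocess

WAVE = "test.wav"                                # input audio (placeholder)
SPK = "./data_svc/singer/your_singer.spk.npy"    # speaker file (placeholder)

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. content encoding (whisper) and content vector (hubert)
run(["python", "whisper/inference.py", "-w", WAVE, "-p", "test.ppg.npy"])
run(["python", "hubert/inference.py", "-w", WAVE, "-v", "test.vec.npy"])

# 2. F0 to CSV; pause here and fix wrong F0 values in Excel or a text editor
run(["python", "pitch/inference.py", "-w", WAVE, "-p", "test.csv"])
input("Edit test.csv now, then press Enter to continue...")

# 3. final inference with the pre-extracted features
run(["python", "svc_inference.py", "--config", "configs/base.yaml",
     "--model", "sovits5.0.pth", "--spk", SPK, "--wave", WAVE,
     "--ppg", "test.ppg.npy", "--vec", "test.vec.npy",
     "--pit", "test.csv", "--shift", "0"])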

Train Feature Retrieval Index (Optional)

To increase the stability of the generated timbre, you can use the method described in the Retrieval-based-Voice-Conversion repository. This method consists of 2 steps:

  1. Train the retrieval index on the hubert and whisper features. Run training with default settings:

    python svc_train_retrieval.py
    

    If the number of vectors is more than 200_000, they will be compressed to 10_000 centroids using the MiniBatchKMeans algorithm. You can change these settings using command line options:

    usage: crate faiss indexes for feature retrieval [-h] [--debug] [--prefix PREFIX] [--speakers SPEAKERS [SPEAKERS ...]] [--compress-features-after COMPRESS_FEATURES_AFTER]
                                                     [--n-clusters N_CLUSTERS] [--n-parallel N_PARALLEL]
    
    options:
      -h, --help            show this help message and exit
      --debug
      --prefix PREFIX       add prefix to index filename
      --speakers SPEAKERS [SPEAKERS ...]
                            speaker names to create an index. By default all speakers are from data_svc
      --compress-features-after COMPRESS_FEATURES_AFTER
                            If the number of features is greater than the value compress feature vectors using MiniBatchKMeans.
      --n-clusters N_CLUSTERS
                            Number of centroids to which features will be compressed
      --n-parallel N_PARALLEL
                            Nuber of parallel job of MinibatchKmeans. Default is cpus-1
    

    Compression of the training vectors can speed up index inference, but it reduces retrieval quality. Use compression only if you really have a lot of vectors.

    The resulting indexes will be stored in the "indexes" folder as:

    data_svc
    ...
    └── indexes
        ├── speaker0
        │   ├── some_prefix_hubert.index
        │   └── some_prefix_whisper.index
        └── speaker1
            ├── hubert.index
            └── whisper.index
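
    The compression step is, in essence, MiniBatchKMeans over the stored feature frames before a faiss index is built. A rough, illustrative sketch of the idea (not the project's actual code; the function and variable names here are made up):

    import numpy as np
    import faiss
    from sklearn.cluster import MiniBatchKMeans

    def build_index(features, compress_after=200_000, n_clusters=10_000):
        # Compress the feature vectors with MiniBatchKMeans when there are too many,
        # then index the (possibly compressed) vectors with an exact L2 faiss index.
        feats = np.asarray(features, dtype=np.float32)
        if len(feats) > compress_after:
            kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=4096)
            kmeans.fit(feats)
            feats = kmeans.cluster_centers_.astype(np.float32)
        index = faiss.IndexFlatL2(feats.shape[1])
        index.add(feats)
        return index

    # e.g. stack all of one speaker's hubert frames and write the index:
    # index = build_index(np.concatenate(vec_arrays, axis=0))
    # faiss.write_index(index, "data_svc/indexes/speaker0/hubert.index")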
    
  2. At the inference stage, the n closest features are blended into the VITS model output in a given proportion. Enable feature retrieval with:

    python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0 \
    --enable-retrieval \
    --retrieval-ratio 0.5 \
    --n-retrieval-vectors 3
    

    For a better retrieval effect, you can try cycling through different values of --retrieval-ratio and --n-retrieval-vectors (see the sketch at the end of this section).

    If you have multiple sets of indexes, you can specify a specific set via the parameter: --retrieval-index-prefix

    You can explicitly specify the paths to the hubert and whisper indexes using the parameters: --hubert-index-path and --whisper-index-path
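
    One simple way to act on the advice above is a small loop over parameter combinations, renaming each result so the outputs can be compared by ear. A sketch under the assumption that svc_inference.py writes svc_out.wav to the current directory, as noted in the Inference section:

    import shutil
    import subprocess

    # Hypothetical sweep values; adjust to taste.
    for ratio in [0.3, 0.5, 0.7]:
        for n in [2, 3, 5]:
            subprocess.run([
                "python", "svc_inference.py",
                "--config", "configs/base.yaml",
                "--model", "sovits5.0.pth",
                "--spk", "./data_svc/singer/your_singer.spk.npy",
                "--wave", "test.wav", "--shift", "0",
                "--enable-retrieval",
                "--retrieval-ratio", str(ratio),
                "--n-retrieval-vectors", str(n),
            ], check=True)
            # keep each result under a distinct name for comparison
            shutil.move("svc_out.wav", f"svc_out_r{ratio}_n{n}.wav")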

Create singer

Named by pure coincidence: average -> ave -> eva; Eve (Eva) represents conception and reproduction.

python svc_eva.py
eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}

The generated singer file will be eva.spk.npy.
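
For reference, the mix boils down to a weighted combination of the speaker embedding .npy files. An illustrative, hand-rolled equivalent (a sketch of the idea, not the actual svc_eva.py code, and it assumes the mix is a plain weighted sum of the embeddings):

import numpy as np

eva_conf = {
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}

mix = None
for path, weight in eva_conf.items():
    if weight == 0:
        continue  # zero-weight singers contribute nothing
    spk = np.load(path)
    mix = spk * weight if mix is None else mix + spk * weight

np.save("eva.spk.npy", mix)  # matches the output name mentioned above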

Data set

| Name | URL |
| --- | --- |
| KiSing | http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/ |
| PopCS | https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md |
| opencpop | https://wenet.org.cn/opencpop/download/ |
| Multi-Singer | https://github.com/Multi-Singer/Multi-Singer.github.io |
| M4Singer | https://github.com/M4Singer/M4Singer/blob/master/apply_form.md |
| CSD | https://zenodo.org/record/4785016#.YxqrTbaOMU4 |
| KSS | https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset |
| JVS MuSic | https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music |
| PJS | https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus |
| JSUT Song | https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song |
| MUSDB18 | https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems |
| DSD100 | https://sigsep.github.io/datasets/dsd100.html |
| Aishell-3 | http://www.aishelltech.com/aishell_3 |
| VCTK | https://datashare.ed.ac.uk/handle/10283/2651 |
| Korean Songs | http://urisori.co.kr/urisori-en/doku.php/ |

Code sources and references

https://github.com/facebookresearch/speech-resynthesis paper

https://github.com/jaywalnut310/vits paper

https://github.com/openai/whisper/ paper

https://github.com/NVIDIA/BigVGAN paper

https://github.com/mindslab-ai/univnet paper

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf

https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS

https://github.com/brentspell/hifi-gan-bwe

https://github.com/mozilla/TTS

https://github.com/bshall/soft-vc

https://github.com/maxrmorrison/torchcrepe

https://github.com/MoonInTheRiver/DiffSinger

https://github.com/OlaWod/FreeVC paper

https://github.com/yl4579/HiFTNet paper

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

AdaSpeech: Adaptive Text to Speech for Custom Voice

AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis

Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

Multilingual Speech Synthesis and Cross-Language Voice Cloning: GRL

RoFormer: Enhanced Transformer with rotary position embedding

Method of Preventing Timbre Leakage Based on Data Perturbation

https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py

https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py

https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py

https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py

https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py

Contributors

Thanks to

https://github.com/Francis-Komizu/Sovits

Relevant Projects

Original evidence

2022.04.12 https://mp.weixin.qq.com/s/autNBYCsG4_SvWt2-Ll_zA

2022.04.22 https://github.com/PlayVoice/VI-SVS

2022.07.26 https://mp.weixin.qq.com/s/qC4TJy-4EVdbpvK2cQb1TA

2022.09.08 https://github.com/PlayVoice/VI-SVC

Copied by svc-develop-team/so-vits-svc

coarse_f0_1

so-vits-svc-5.0's People

Contributors

archivoice, bfloat16, flottant, forsakenrei, futorio, fyphen1223, imedina7, innnky, maxmax2016, silvelter, stardust-minus, stillonearth, thestmitsuki, vatsalya-vyas, vidyaa18, vivekguruduttk28, zscharlie


so-vits-svc-5.0's Issues

wtf man

from whisper.model import Whisper, ModelDimensions
ModuleNotFoundError: No module named 'whisper'

issue with zip archive

How can I solve this at step 4 of data preprocessing?
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

incorrect audio shape

This error appears after running python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper

How can it be solved?

Warning

What is this, and how do I solve it?
prepare/preprocess_a.py:11: RuntimeWarning: invalid value encountered in true_divide
wav = wav / np.abs(wav).max() * 0.6

No voice-change effect with the preview model

Hello, when I test locally with the preview model there is no voice-change effect, but testing the same audio on Hugging Face gives a different result. Which step did I get wrong?
I followed the steps below:

  1. Download the preview model https://github.com/PlayVoice/so-vits-svc-5.0/releases/download/v5.2/sovits5.0.pretrain.pth
  2. Use svc_export.py to convert sovits5.0.pretrain.pth into sovits5.0.pth
  3. Test following the instructions in the readme
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./configs/singers/singer0051.npy --wave test.wav --ppg test.ppg.npy --pit test.csv

could you explain about whisper_ppg?

It seems you are using whisper's encoder output directly as the content information vectors; how is it better than the contentvec used in previous so-vits-svc?

Nvidia CUDA 10.1 user here! Can I run this program with pytorch 1.8.1? What is the minimum version requirement?

I am a music instructor and I would love to introduce this lovely AI software to our students to try out.

Here in my school we have several Windows 7 Pro 64-bit computers in our classrooms, running on Nvidia GeForce GTX 660M GPU. According to Nvidia, the highest version of graphic driver we can install is 425.31, and the highest CUDA Toolkit we can install would be 10.1.

According to pytorch dot org, with CUDA version 10.1, the highest torch we can install would be:
“torch-1.8.1+cu101-cp39-cp39-win_amd64.whl”.

Here, “cu101” in the file name, is referring to CUDA 10.1.

Any torch version higher than 1.8.1, will have a higher “cu” number attached in the whl file name, such as:
“torch-1.10.0+cu102-cp36-cp36m-win_amd64.whl”, or
“torch-1.13.0+cu116-cp310-cp310-win_amd64.whl”, etc.

In other words, our school can not install any torch higher than version 1.8.1.

In the non-fork so-vits-svc-4.0 program folder, there is a file called “requirements.txt”. We opened that file, and can see it says “torch==1.13.1”. Can we assume torch version 1.13.1 is the lowest minimum requirement for so-vits-svc program to run?

Does it mean we can not install your amazing software on our school’s computers, because our Nvidia GPU are too old, and can’t reach your required pytorch 1.13.1 version? Or maybe it doesn’t matter, a lower pytorch 1.8.1 version can still run?

Too bad! My colleagues have already trained several G_43200.pth models on their home computers, and they can just simply copy these models to our school’s computers and start the voice inference right away. We don’t need to train on the classroom’s computers, we just need to infer on existing models, to demonstrate to our students. Inferring takes an awful lot less of GPU powers to do.

Has anyone tested this program on CUDA 10.1?

Please let me know. So, should I give up? Is it a death penalty for our students to see this?

CUDA_10 1_User

ModuleNotFoundError: No module named 'whisper'

Running python whisper/inference.py -w test.wav -p test.ppg.npy fails with no whisper module.
After pip install whisper, it errors with:

Traceback (most recent call last):
File "H:\svc\sovits\so-vits-svc-5.0\whisper\inference.py", line 6, in
from whisper.model import Whisper, ModelDimensions
File "H:\svc\sovits\so-vits-svc-5.0\venv\lib\site-packages\whisper.py", line 65, in
libc = ctypes.CDLL(libc_name)
File "C:\Python310\lib\ctypes_init_.py", line 364, in init
if '/' in name or '\' in name:
TypeError: argument of type 'NoneType' is not iterable

F0

What's the f0 parameter?

vits_pretrained

The vits_pretrained folder asks for a .pt file; is it enough to put in a model trained with vits?
Also, where should the vits model be used?

Error when training

Why this happened when i run this command : python svc_trainer.py -c configs/base.yaml -n sovits5.0

File "svc_trainer.py", line 11, in
from vits_extend.train import train
File "/home/parisa/so-vits-svc-5.0__/vits_extend/train.py", line 16, in
from vits_extend.writer import MyWriter
File "/home/parisa/so-vits-svc-5.0__/vits_extend/writer.py", line 1, in
from torch.utils.tensorboard import SummaryWriter
File "/home/parisa/anaconda3/envs/voice/lib/python3.7/site-packages/torch/utils/tensorboard/init.py", line 12, in
from .writer import FileWriter, SummaryWriter # noqa: F401
File "/home/parisa/anaconda3/envs/voice/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 9, in
from tensorboard.compat.proto.event_pb2 import SessionLog
File "/home/parisa/anaconda3/envs/voice/lib/python3.7/site-packages/tensorboard/compat/proto/event_pb2.py", line 17, in
from tensorboard.compat.proto import summary_pb2 as tensorboard_dot_compat_dot_proto_dot_summary__pb2
File "/home/parisa/anaconda3/envs/voice/lib/python3.7/site-packages/tensorboard/compat/proto/summary_pb2.py", line 17, in
from tensorboard.compat.proto import histogram_pb2 as tensorboard_dot_compat_dot_proto_dot_histogram__pb2
File "/home/parisa/anaconda3/envs/voice/lib/python3.7/site-packages/tensorboard/compat/proto/histogram_pb2.py", line 42, in
serialized_options=None, file=DESCRIPTOR),
File "/home/parisa/anaconda3/envs/voice/lib/python3.7/site-packages/google/protobuf/descriptor.py", line 561, in new
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:

  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

Inference

I don't know which file to use for the --spk parameter when using my own wav file at the inference step.

No module named 'whisper', but it seems whisper is built in

(venv) PS D:\so-vits-svc-5.0> python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
Traceback (most recent call last):
File "D:\so-vits-svc-5.0\prepare\preprocess_ppg.py", line 6, in
from whisper.model import Whisper, ModelDimensions
ModuleNotFoundError: No module named 'whisper

I just cloned this project and installed the dependencies, then I ran the commands below

python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch
when I run "python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p"

the error happens,

windows 10
Python 3.10.7

help :(

issue in step 4 of Data preprocessing

i get this error when running this : python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper

Traceback (most recent call last):
File "prepare/preprocess_ppg.py", line 56, in
whisper = load_model(os.path.join("whisper_pretrain", "medium.pt"))
File "prepare/preprocess_ppg.py", line 25, in load_model
dims = ModelDimensions(**checkpoint["dims"])
TypeError: 'ModuleSpec' object is not callable

Error in step 5 of data preprocess

How to fix this?

Setting up Audio Processor...
| > sample_rate:16000
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.98
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:0
| > mel_fmax:8000.0
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > stats_path:None
| > base:10
| > hop_length:256
| > win_length:1024
49%|████████████████████████████████████████████████▌ | 200/408 [00:11<00:12, 17.05it/s]/home/parisa/so-vits-svc-5.0__/speaker/utils/audio.py:732: RuntimeWarning: invalid value encountered in true_divide
return x / abs(x).max() * 0.95
49%|████████████████████████████████████████████████▌ | 200/408 [00:11<00:11, 17.49it/s]
Traceback (most recent call last):
File "prepare/preprocess_speaker.py", line 79, in
spec = speaker_encoder_ap.melspectrogram(waveform)
File "/home/parisa/so-vits-svc-5.0__/speaker/utils/audio.py", line 564, in melspectrogram
D = self.stft(self.apply_preemphasis(y))
File "/home/parisa/so-vits-svc-5.0
_/speaker/utils/audio.py", line 624, in _stft
center=True,
File "/home/parisa/anaconda3/envs/voice/lib/python3.7/site-packages/librosa/util/decorators.py", line 88, in inner_f
return f(*args, **kwargs)
File "/home/parisa/anaconda3/envs/voice/lib/python3.7/site-packages/librosa/core/spectrum.py", line 202, in stft
util.valid_audio(y, mono=False)
File "/home/parisa/anaconda3/envs/voice/lib/python3.7/site-packages/librosa/util/decorators.py", line 88, in inner_f
return f(*args, **kwargs)
File "/home/parisa/anaconda3/envs/voice/lib/python3.7/site-packages/librosa/util/utils.py", line 294, in valid_audio
raise ParameterError("Audio buffer is not finite everywhere")
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere

error at step 7

preprocess_zzz.py imports vits, but preprocess_zzz.py itself is in a subfolder, so it can't locate it.

If I copy the vits folder into prepare, it still throws an error at step 7:

line 252, in iter
ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]
ZeroDivisionError: integer division or modulo by zero in \vits\data_utils.py

Is it because vits_pretrain.pt is missing? It does not seem to exist anywhere on the internet.

Super Slow Loss Calc On Google TPU

Recently I have been trying to port this project to Google TPU, but this code runs extremely slowly there, taking ten minutes to finish one computation. I suspect the custom loss function causes it to run on the CPU. Is there an alternative that works around this?
In train.py under vits_extend:
loss_kl_f = kl_loss(z_f, logs_q, m_p, logs_p, logdet_f, z_mask) * hp.train.c_kl
loss_kl_r = kl_loss(z_r, logs_p, m_q, logs_q, logdet_r, z_mask) * hp.train.c_kl
loss_g = score_loss + mel_loss + stft_loss + loss_kl_f
loss_g.backward()

hyper parameters

What are the best values for learning rate, epochs, and batch size?

import module

How should I solve this error?
ModuleNotFoundError: No module named 'whisper.model'; 'whisper' is not a package

Error [WinError 183] Cannot create a file when that file already exists.

Running the resampling code

Cut the audio into segments shorter than 30 seconds, as whisper requires

Generate audio with a 16000 Hz sampling rate, stored in ./data_svc/waves-16k

python prepare/preprocess_a.py -w ./data_raw -o ./data_svc/waves-16k -s 16000
then [WinError 3] The system cannot find the path specified appears.
But looking through the folders, the corresponding files already exist.
Running the code again gives [WinError 183] Cannot create a file when that file already exists.
Yet the folder contains no 16000 Hz audio segments at all; it is completely empty.

checkpoints

How can I create the checkpoints for each dataset of speakers separately in different folders?

Dataset naming

The README says:

Then put the dataset into the dataset_raw directory with the following file structure

dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav

I want to ask: do I have to use the speaker0 / speaker1 naming scheme here? And do names such as Lxx-0xx8 below need to strictly contain these digits (and the uppercase L)?

Conversion of non-singing, pitch-less speech

Whether using the demo on huggingface or the results of my own local fine-tuning, conversion of speech other than singing is unsatisfactory; more than half of it comes out as hoarse fragments. It looks like an F0 extraction problem, because the parts where extraction failed in svc_out_pit.wav turn into white noise.
Although my own goal is converting game dialogue rather than singing, for others who want to convert songs it would be convenient to automatically handle pitch-less audio such as rap, spoken word, the monologues common in intros, and live banter. The neighbouring so-vits-svc-fork can do this; I hope this repo can too.

Some questions

whisper requires less than 30 seconds

  1. Is there a minimum duration requirement for each clip?
  2. Is there a minimum total duration, i.e. how long does the audio set need to be overall before it is worth starting training?

Setting the working directory

  1. What is the recommended Python version?
  2. In set PYTHONPATH=%cd%, what does %cd% refer to? If the virtual environment was created with conda, does PYTHONPATH still need to be set?

speaker should be renamed to timbre to be accurate

Does this mean changing python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker to python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/timbre?

Set the configs/base.yaml parameter pretrain: "./5.0.epoch1200.full.pth" and reduce the learning rate appropriately

Roughly how much is "appropriately"?

Check the logs; the release page has the complete training logs

Which factors in the visualized charts should be used to judge whether the model has finished training or is overfitting?

Extract the F0 parameters in CSV text format, open the CSV file in Excel, and manually fix wrong F0 values against Audition or SonicVisualiser

Could you give slightly more detailed instructions? I didn't understand the image in the Readme.

how to make pretrain model?

vits_pretrain.pt

Does the pretrained model have only one speaker or multiple speakers? Will our dataset be trained on top of the pretrained first speaker?
Why is the pretrained model smaller than our own trained model, and is there a way to convert our model into a pretrained model?

sovits5.0-48k-debug.pth gives no result when inferring at revision 5d0c4b4

Inference worked with an earlier version, but this version only produces a 1 kB sys_out.wav.
sys_out_pit.wav has sound, but it is all beeping.
Using configs/singers_sample/47-wave-girl/025.wav

!python whisper/inference.py -w /content/so-vits-svc-5.0/configs/singers_sample/47-wave-girl/025.wav -p test.ppg.npy
!python svc_inference.py --config configs/base.yaml --model sovits5.0-48k-debug.pth --spk ./configs/singers/singer0023.npy --wave /content/so-vits-svc-5.0/configs/singers_sample/47-wave-girl/025.wav --ppg test.ppg.npy

Train Time

How much time does it take to train?

could you explain about training process?

Thanks for the awesome work! Since I cannot understand Chinese, I translated the readme to English. I understood the training process as below.

It seems there is a two-stage training process; training is quite complicated, especially stage 2.

For the first stage, train VITS (SynthesizerTrn) with whisper ppg, NSF-hifigan, and an external speaker encoder (d-vector).

For the second stage (SynthesizerTrnEx), apply GRL and SNAC to prevent speaker information leakage in the text encoder, and also apply the natural speech loss (bidirectional loss between prior and posterior).

Is this right? Also, I cannot find SynthesizerTrnEx's usage in this code base (maybe currently). Could you explain a bit more about the training process?

what is the reason for this?

I installed everything that was in the file requirements.txt , but it gives me this error at stage 4
I also exported PYTHONPATH (I got to stage 7, except stage 4, with no errors)

Traceback (most recent call last):
File "prepare/preprocess_ppg.py", line 54, in
pred_ppg(whisper, f"{wavPath}/{spks}/{file}.wav", f"{ppgPath}/{spks}/{file}.ppg")
File "prepare/preprocess_ppg.py", line 20, in pred_ppg
audio = load_audio(wavPath)
File "so-vits-svc-5.0-main\whisper\audio.py", line 42, in load_audio
ffmpeg.input(file, threads=0)
File "Python\Python38\lib\site-packages\ffmpeg_run.py", line 313, in run
process = run_async(
File "Python\Python38\lib\site-packages\ffmpeg_run.py", line 284, in run_async
return subprocess.Popen(
File "Python\Python38\lib\subprocess.py", line 854, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "Python\Python38\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,

FileNotFoundError: [WinError 2] The specified file cannot be found

During shallow-diffusion inference the diffusion model is not recognized; inference without the diffusion model works normally (Linux)

./raw/test.wav
test.ppg.npy
./raw/test.wav
test.csv
don't use pitch shift
/root/autodl-tmp/so-vits-svc5/diff_tool
No diffusion model or config found. Shallow diffusion mode will False
Traceback (most recent call last):
File "inference_main.py", line 121, in
main()
File "inference_main.py", line 83, in main
svc_model = Svc(args.model_path, args.config_path, args.device, args.cluster_model_path,enhance,diffusion_model_path,diffusion_config_path,shallow_diffusion,only_diffusion)
File "/root/autodl-tmp/so-vits-svc5/diff_tool/inference/infer_tool.py", line 159, in init
self.load_model()
File "/root/autodl-tmp/so-vits-svc5/diff_tool/inference/infer_tool.py", line 176, in load_model
self.hps_ms.data.filter_length // 2 + 1,
AttributeError: 'Svc' object has no attribute 'hps_ms'
/root/autodl-tmp/so-vits-svc5
mv: cannot stat './diff_tool/results/*': No such file or directory
推理结束

MacBook CUDA error

The error on an M1 MacBook is shown below. Does this mean only NVIDIA GPUs work?

raise AssertionError("Torch not compiled with CUDA enabled")

AssertionError: Torch not compiled with CUDA enabled

Setting the working directory on Windows

Could a Windows method for setting the working directory be added? The one in the readme only seems to apply to Linux.
set PYTHONPATH=%cd%
