danruta / xva-trainer

UI app for training TTS/VC machine learning models for xVASynth, with several audio pre-processing tools, and dataset creation/management.

Python 70.99% HTML 1.32% JavaScript 10.06% Batchfile 0.02% CSS 0.59% Cuda 6.54% C++ 10.43% Shell 0.06%

xva-trainer's Introduction

xVATrainer

xVATrainer is the companion app to xVASynth, the AI text-to-speech app using video game voices. xVATrainer is used for creating the voice models for xVASynth, and for curating and pre-processing the datasets used for training these models. With this tool, you can provide new voices for mod authors to use in their projects.

v1.0 Showcase/overview:

xVASynth YouTube demo

Links: Steam Nexus Discord Patreon

Check the descriptions on the Nexus page for the most up-to-date information.

--

There are three main components to xVATrainer:

  • Dataset annotation - where you can adjust the text transcripts of existing/finished datasets, or record new data with your microphone
  • Data preparation/pre-processing tools - Used for creating datasets of the correct format, from whatever audio data you may have
  • Model training - The bit where the models actually train on the datasets

Dataset annotation

The main screen of xVATrainer contains a dataset explorer, which gives you an easy way to view, analyse, and adjust the data samples in your dataset. It also provides recording capabilities, so if you need to record a dataset of your own voice, you can do it straight through the app, in the correct format.
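The "correct format" mentioned here is the one described under Tools below: 22050 Hz, mono, .wav. As a rough illustration of what that means in practice, a 16-bit mono WAV at that rate can be written with Python's standard wave module; the filename and the synthetic tone below are illustrative only, not what the app does internally.

```python
import math
import struct
import wave

def write_dataset_wav(path, samples, sample_rate=22050):
    """Write float samples in [-1, 1] as a 16-bit mono WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wav.writeframes(frames)

# Illustrative stand-in for real speech: one second of a 440 Hz tone
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 22050) for t in range(22050)]
write_dataset_wav("0001.wav", tone)
```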

Tools

There are several data pre-processing tools included in xVATrainer, to help with almost any data preparation work needed to get your datasets ready for training. There is no step-by-step order in which they need to be used, so long as your datasets end up as 22050Hz mono wav files of clean speech audio, up to about 10 seconds in length, with an associated transcript file containing each audio file's transcript. Depending on what sources your data comes from, pick whichever tools you need to reach that format. The included tools are:

  • Audio formatting - a tool to convert from most audio formats into the required 22050Hz mono .wav format
  • AI speaker diarization - an AI model that automatically extracts short slices of speech audio from otherwise longer audio samples (including feature-length, movie-sized audio clips). The audio slices are additionally separated automatically into individual speakers
  • AI source separation - an AI model that can remove background noise, music, and echo from an audio clip of speech
  • Audio Normalization - a tool which normalizes (EBU R128) audio to standard loudness
  • WEM to OGG - a tool to convert from a common audio format found in game files, to a playable .ogg format. Use the "Audio formatting" tool to convert this to the required .wav format
  • Cluster speakers - a tool which uses an AI model to encode audio files, and then clusters them into a known or unknown number of clusters, either separating multiple speakers, or single-speaker audio styles
  • Speaker similarity search - a tool which encodes a set of query files and a larger corpus of audio files, then re-orders the corpus according to each file's similarity to the query files
  • Speaker cluster similarity search - the same as the "Speaker similarity search" tool, but using clusters calculated via the "Cluster speakers" tool as data points in the corpus to sort
  • Transcribe - an AI model which automatically generates a text transcript for audio files
  • WER transcript evaluation - a tool which examines your dataset's transcript against one auto-generated via the "Transcribe" tool to check for quality. Useful when supplying your own transcript, and checking if there are any transcription errors.
  • Remove background noise - a more traditional noise removal tool, which uses a clip of just noise as reference to remove from a larger corpus of audio which consistently has matching background noise
  • Silence Split - A simple tool which splits long audio clips based on configurable silence detection
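The WER transcript evaluation above boils down to a word-level edit distance between your transcript and the auto-generated one. A minimal sketch of that metric (independent of the actual tool's implementation):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, over words rather than characters
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word in a four-word reference -> 0.25
print(word_error_rate("the quick brown fox", "the quick brown box"))
```

A high WER for a file is a hint that either the supplied transcript or the audio slice boundaries are wrong and worth inspecting by hand.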

Trainer

xVATrainer contains AI model training for the FastPitch1.1 (with a custom modified training set-up) and HiFi-GAN models (the xVASynth "v2" models). The training follows a multi-stage approach especially optimized for maximum transfer learning (fine-tuning) quality. The generated models are exported in the correct format required by xVASynth, ready to use for generating audio.

Batch training is also supported, allowing you to queue up any number of datasets to train, with cross-session persistence. The training panel shows a cmd-like textual log of the training progress, a tensorboard-like visual graph for the most relevant metrics, and a task manager-like set of system resources graphs.

You don't need any programming or machine learning experience. The only required input is starting/pausing/stopping the training sessions; everything else is automated.

Setting up the development environment

Note: these installation instructions are for Windows. Use the requirements_linux_py3_10_6.txt file for Linux installation.

  1. Create the environment using virtualenv and Python 3.10 (pre-requisite: Python 3.10):
     virtualenv envXVATrainerP310 --python=C:\Users\Dan\AppData\Local\Programs\Python\Python310\python.exe

  2. Activate your environment (do this every time you launch a new terminal to work with xVATrainer):
     envXVATrainerP310\Scripts\activate

  3. Install PyTorch v2.0.x with CUDA. Get the v2.0 link from the PyTorch website (pre-requisite: CUDA drivers from NVIDIA), e.g. (might be outdated):
     pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

  4. Install the dependencies through pip:
     pip install -r reqs_gpu.txt

  5. Copy the folders from ./lib/_dev into the environment (into envXVATrainerP310/Lib/site-packages). These are library files which needed custom modifications/bug fixes to integrate with everything else. Overwrite as necessary.

  6. Make sure that you're using librosa==0.8.1 (check with pip list; uninstall and re-install with this version if not).
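Since the librosa pin is easy to miss, here is a small sanity-check sketch for verifying pinned versions from inside the activated environment. The check_pins helper and the PINS table are illustrative, not part of the repo; importlib.metadata is standard library.

```python
from importlib import metadata

# Pins taken from the setup notes above; extend as needed
PINS = {"librosa": "0.8.1"}

def check_pins(pins, get_version=metadata.version):
    """Return a list of human-readable mismatch messages (empty = all good)."""
    problems = []
    for package, wanted in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed (want {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{package}=={installed}, want {wanted}")
    return problems

if __name__ == "__main__":
    for msg in check_pins(PINS):
        print("WARNING:", msg)
```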

Contributing

If you'd like to help improve xVATrainer, get in touch (eg an Issue, though best on Discord), and let me know. The main areas of interest for community contributions are (though let me know your ideas!):

  • Training optimizations (speed)
  • Model quality improvements
  • New tools
  • Bug fixes
  • Quality of life improvements

A current issue/bug is that I can't get the HiFi-GAN to train with num_workers>0, as the training always cuts out after a deterministic amount of time - maybe something to do with the training loop being inside a secondary thread (FastPitch works fine though). Help with this would be especially welcome.

In a similar vein (and maybe caused by the same issue), I can only set tools' maximum number of multi-processing workers to just under half the total CPU thread count; otherwise a tool can only be run once, after which the app needs restarting before it can be used again.
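The workaround described above can be expressed as a small helper. Note the cap is an empirical observation from this project, not a general rule, and the helper name is illustrative:

```python
import os

def safe_worker_count(cpu_threads=None):
    """Cap multiprocessing workers just under half the CPU thread count,
    per the stability issue described above; always allow at least one."""
    if cpu_threads is None:
        cpu_threads = os.cpu_count() or 1
    return max(1, cpu_threads // 2 - 1)
```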

xva-trainer's People

Contributors: bunglepaws, danruta, pendrokar

xva-trainer's Issues

Error While Training

Settings:
image

Output:

18:41:18 | New Session 
18:41:18 | No graphs.json file found. Starting anew. 
18:41:18 | Dataset: C:/Program Files (x86)/Steam/steamapps/common/xVATrainer/resources/app/datasets//rdfvd_paimon 
18:41:18 | Language: English 
18:41:18 | Checkpoint: ./resources/app/python/xvapitch/pretrained_models/xVAPitch_5820651.pt 
18:41:18 | CUDA device IDs: 0 
18:41:18 | FP16: Disabled 
18:41:18 | Batch size: 6 (Base: 6, GPUs mult: 1) | GAM: 67 -> (402) | Target: 400 
18:41:18 | Outputting model backups every 3 checkpoints 
18:41:19 | Loading model and optimizer state from ./resources/app/python/xvapitch/pretrained_models/xVAPitch_5820651.pt 
18:41:20 | New voice 
18:41:20 | Workers: 3 
18:41:38 | Fine-tune dataset files: 7 
18:45:00 | Priors datasets files: 179007 | Number of datasets: 28 

Error:

Traceback (most recent call last):
  File "server.py", line 227, in handleTrainingLoop
  File "python\xvapitch\xva_train.py", line 137, in handleTrainer
  File "python\xvapitch\xva_train.py", line 557, in start
  File "python\xvapitch\xva_train.py", line 604, in iteration
  File "python\xvapitch\xva_train.py", line 391, in init
  File "C:\Program Files (x86)\Steam\steamapps\common\xVATrainer\.\resources\app\python\xvapitch\get_dataset_emb.py", line 18, in get_emb
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embs)
  File "sklearn\cluster\_kmeans.py", line 1376, in fit
    self._check_params(X)
  File "sklearn\cluster\_kmeans.py", line 1307, in _check_params
    super()._check_params(X)
  File "sklearn\cluster\_kmeans.py", line 828, in _check_params
    raise ValueError(
ValueError: n_samples=7 should be >= n_clusters=10.
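The ValueError here is scikit-learn refusing to build more clusters than there are samples (7 fine-tune files vs. a hard-coded 10 clusters). One plausible guard, sketched here rather than taken from the repo, is to clamp the cluster count before fitting:

```python
def clamp_clusters(n_clusters, n_samples):
    """scikit-learn's KMeans requires n_samples >= n_clusters, so clamp the
    requested cluster count to the available sample count (at least 1)."""
    return max(1, min(n_clusters, n_samples))

# With the numbers from this report: 10 requested clusters, 7 embeddings.
# Hypothetical use at the failing call site:
#   kmeans = KMeans(n_clusters=clamp_clusters(10, len(embs)), random_state=0).fit(embs)
```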

Allow manually saving checkpoints

Given that it can take quite some time to train 2500 steps, it would be convenient to create manual checkpoints so training can resume from that.

Speaker diarization in 1.2.0+ does not work

UI error message:

ERROR:Traceback (most recent call last):
  File "server.py", line 200, in websocket_handler
  File "python\make_srt\model.py", line 91, in make_srt
KeyError: 'diarization'

server.log of version 1.2.0

Traceback (most recent call last):
  File "python\models_manager.py", line 37, in init_model
  File "python\speaker_diarization\model.py", line 26, in __init__
  File "python\speaker_diarization\model.py", line 129, in load_model
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pyannote\audio\features\__init__.py", line 33, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pyannote\audio\features\base.py", line 38, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pyannote\database\__init__.py", line 37, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pyannote\database\database.py", line 31, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pyannote\database\util.py", line 32, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pandas\__init__.py", line 22, in <module>
    from pandas.compat import (
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pandas\compat\__init__.py", line 15, in <module>
    from pandas.compat.numpy import (
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pandas\compat\numpy\__init__.py", line 7, in <module>
    from pandas.util.version import Version
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pandas\util\__init__.py", line 1, in <module>
    from pandas.util._decorators import (  # noqa
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pandas\util\_decorators.py", line 14, in <module>
    from pandas._libs.properties import cache_readonly  # noqa
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pandas\_libs\__init__.py", line 13, in <module>
    from pandas._libs.interval import Interval
  File "pandas\_libs\interval.pyx", line 1, in init pandas._libs.interval
  File "pandas\_libs\hashtable.pyx", line 1, in init pandas._libs.hashtable
  File "pandas\_libs\missing.pyx", line 1, in init pandas._libs.missing
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pandas\_libs\tslibs\__init__.py", line 31, in <module>
    from pandas._libs.tslibs.conversion import (
  File "pandas\_libs\tslibs\conversion.pyx", line 1, in init pandas._libs.tslibs.conversion
ModuleNotFoundError: No module named 'pandas._libs.tslibs.base'

server.log of version 1.2.1

Traceback (most recent call last):
  File "python\models_manager.py", line 37, in init_model
  File "python\speaker_diarization\model.py", line 26, in __init__
  File "python\speaker_diarization\model.py", line 129, in load_model
ImportError: cannot import name 'Pretrained' from 'pyannote.audio.features' (C:\Program Files (x86)\Steam\steamapps\common\xVATrainer\resources\app\cpython_gpu\pyannote\audio\features\__init__.pyc)

Steam release 1.2.0 seems to be broken

Multiple issues. Checking data files through Steam didn't show any error. Cleaning up the dataset didn't help.

  1. I've got this stacktrace after starting a new training from scratch:
Traceback (most recent call last):
  File "server.py", line 227, in handleTrainingLoop
  File "python\xvapitch\xva_train.py", line 137, in handleTrainer
  File "python\xvapitch\xva_train.py", line 554, in start
  File "python\xvapitch\xva_train.py", line 601, in iteration
  File "python\xvapitch\xva_train.py", line 377, in init
  File "python\xvapitch\xva_train.py", line 1206, in setup_dataloaders
  File "C:\Program Files (x86)\Steam\steamapps\common\xVATrainer\.\resources\app\python\xvapitch\util.py", line 410, in get_language_weighted_sampler
    return WeightedRandomSampler(dataset_samples_weight, len(dataset_samples_weight))
  File "torch\utils\data\sampler.py", line 186, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

The line numbers there don't match up with xva_train.py, and changes to that file to debug this are completely ignored, whereas changes to e.g. dataset.py work fine. Throwing an exception in read_datasets shows that, at least at one point, it is returning the correct dataset.

  2. The UI is still broken when adding new trainings; it seems the list of trainings must be cleared in order to be able to add a new training.

Improve training quality / synthesis

Hi, I see that the PRIORS datasets are synthetic. I plan to replace them with more natural-sounding datasets (either synthetic and/or human). Will this help achieve better, more natural pronunciation and intonation in voice synthesis?

I also plan to improve the Arpabet handling, since it has a problem with the double R ("RR") in Spanish, and does not read accents.
Is it possible to improve the dictionary?

I have a mod I made some time ago for Tacotron for the Spanish language, but I don't know if it can be implemented here. I leave the link to the files below. Thanks!

https://drive.google.com/file/d/19AGqgfWiMc8MYHuH_705phCbm8-GGIDa/view?usp=drive_link

Running Whisper from GPU

Hello! Is it possible to run Whisper on the GPU? When transcribing audio, I see that it uses my CPU and takes forever. I haven't seen an option to switch Whisper from CPU to GPU. I have an RTX 3060 Ti with CUDA installed. Thank you.
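For context on the question above: the open-source Whisper package itself accepts a device argument in whisper.load_model (e.g. whisper.load_model("base", device="cuda")), though whether xVATrainer's bundled transcribe tool exposes this is unclear. A device-selection sketch, with the CUDA probe injected so the logic can run without torch installed:

```python
def pick_device(cuda_available):
    """Choose "cuda" when a CUDA-capable GPU is usable, else fall back to CPU.
    In a real environment the probe would be torch.cuda.is_available()."""
    return "cuda" if cuda_available else "cpu"

# Hypothetical usage (requires the openai-whisper and torch packages):
#   import torch, whisper
#   model = whisper.load_model("base", device=pick_device(torch.cuda.is_available()))
```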
