Torch implementation of StyleDDPM for voice conversion
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion, Li et al., 2021. [arXiv:2107.10394]
- VDM: Variational diffusion models, Kingma et al., 2021. [arXiv:2107.00630]
- UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models, Sasaki et al., 2021. [arXiv:2104.05358]
- MAE: Masked Autoencoders Are Scalable Vision Learners, He et al., 2021. [arXiv:2111.06377]
- StyleGAN2: Analyzing and Improving the Image Quality of StyleGAN, Karras et al., 2019. [arXiv:1912.04958]
Tested in a Python 3.9.12 conda environment; see the requirements file.
Initialize the submodule and patch.
git submodule update --init
cd hifigan; patch -p0 < ../hifigan-diff
Download the LibriTTS dataset from OpenSLR.
Dump the preprocessed LibriTTS dataset.
python -m utils.vcdataset \
--data-dir /datasets/LibriTTS/train-clean-360 \
--out-dir /datasets/LibriTTS/train-clean-360-dump \
--num-proc 8 \
--chunksize 16 \
--device cuda
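The layout of the dump directory is not documented here; assuming each utterance is stored as a NumPy array of mel-spectrogram frames (a hypothetical format, not confirmed by the repository), a minimal round-trip sanity check could look like:

```python
import tempfile
from pathlib import Path

import numpy as np

# hypothetical dump layout: one .npy mel spectrogram per utterance,
# shaped [mel-bins, frames] -- an illustrative assumption only
dump_dir = Path(tempfile.mkdtemp())
mel = np.random.randn(80, 120).astype(np.float32)
np.save(dump_dir / '100_121669_000001_000000.npy', mel)

# every dumped file should reload with the expected mel-bin count
for path in dump_dir.glob('*.npy'):
    loaded = np.load(path)
    assert loaded.shape[0] == 80, path
```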
To train the model, run train.py.
python train.py \
--data-dir /datasets/LibriTTS/train-clean-360-dump \
--from-dump
To resume training from a previous checkpoint, pass --load-epoch.
python train.py \
--data-dir /datasets/LibriTTS/train-clean-360-dump \
--from-dump \
--load-epoch 20 \
--config ./ckpt/t1.json
Checkpoints are written to TrainConfig.ckpt and TensorBoard summaries to TrainConfig.log.
tensorboard --logdir ./log
[TODO] To run inference, use inference.py.
[TODO] Pretrained checkpoints are released on the releases page.
To use a pretrained model, download the files and unzip them. The following is a sample script.
import json

import torch

from config import Config
from styleddpm import StyleDDPMVC

# load the hyperparameter configuration
with open('t1.json') as f:
    config = Config.load(json.load(f))

# restore the pretrained weights
ckpt = torch.load('t1_200.ckpt', map_location='cpu')
vc = StyleDDPMVC(config.model)
vc.load(ckpt)
vc.eval()
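For context on what the model inverts at inference time: DDPM/VDM training corrupts the mel spectrogram with a closed-form Gaussian forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. A minimal NumPy sketch of that forward process (the beta schedule below is illustrative, not the one used by this repository):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) under a DDPM with the given beta schedule."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # illustrative linear schedule
mel = rng.standard_normal((80, 128))   # dummy mel spectrogram

x_early = forward_diffuse(mel, 10, betas, rng)   # nearly clean
x_late = forward_diffuse(mel, 999, betas, rng)   # nearly pure noise
```

At small t the sample stays close to the input; by the final step almost all signal is replaced by noise, which is the regime the reverse (denoising) process learns to undo.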