Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC). We leverage a memory-based face-voice alignment module for the capture of voice characteristics from face images. A mixed supervision strategy is also introduced to mitigate the long-standing issue of the inconsistency between training and inference phases for voice conversion tasks. To obtain speaker-independent content-related representations, we transfer the knowledge from a pretrained zero-shot voice conversion model VQMIVC to our zero-shot FaceVC model.

Paper Demo

Training

Step1. Data preparation & preprocessing

Put LRS3 corpus under directory "Dataset/LRS3"
Extract wav from LRS3 video

python Tools/preprocess/extract_wav_from_video.py
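
As a rough illustration of what this step involves, the sketch below builds an ffmpeg command that writes the audio track of one video to a mono WAV file. It is not the repository's script: the function names, the 16 kHz sample rate, and any paths passed in are assumptions chosen for illustration, and ffmpeg must be installed separately.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, wav_path, sample_rate=16000):
    """Return an ffmpeg argument list that writes the audio track as mono WAV.

    The 16 kHz default is an assumption, not a value taken from the repository.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(video_path),     # input video
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to one channel
        "-ar", str(sample_rate),   # resample the audio
        str(wav_path),
    ]


def extract_wav(video_path, wav_path, sample_rate=16000):
    """Run ffmpeg on one file; requires ffmpeg to be on PATH."""
    Path(wav_path).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, wav_path, sample_rate), check=True)
```

Keeping the command builder separate from the call that runs it makes it easy to print and inspect the command before processing the whole corpus.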

Extract mel and lf0 from wav

python Tools/preprocess/extract_wav_feature.py

Extract face feature

python Tools/Preprocess/extract_face_feature.py

Extract speaker embedding

python Tools/Preprocess/extract_spk_emb.py
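
The four preprocessing commands above can also be chained from a single driver, sketched below under a few assumptions: the helper names are hypothetical, the scripts are assumed to take no arguments (as in the commands above), and the driver is assumed to run from the repository root. The script paths are copied exactly as written in this README, including the differing capitalization of the preprocess directory, so adjust them to match your checkout.

```python
import subprocess
import sys

# Script paths copied verbatim from the steps above, in the order listed.
PREPROCESS_SCRIPTS = [
    "Tools/preprocess/extract_wav_from_video.py",
    "Tools/preprocess/extract_wav_feature.py",
    "Tools/Preprocess/extract_face_feature.py",
    "Tools/Preprocess/extract_spk_emb.py",
]


def preprocess_commands(python=sys.executable):
    """Return one command (argument list) per preprocessing script."""
    return [[python, script] for script in PREPROCESS_SCRIPTS]


def run_preprocessing(dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for cmd in preprocess_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_preprocessing(dry_run=True)
```

With dry_run=True the driver only prints the commands, which is a cheap way to confirm the order before starting a long preprocessing run.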

Step2. Model training

ParallelWaveGAN is used as the vocoder, so please install ParallelWaveGAN first
Download the pretrained VQMIVC model and place it in the "pretrained" folder
Train the model

./run_shell/train.sh

Step3. Inference

Preprocess the samples for inference following Step 1. The IDs of the preprocessed samples are listed in the files "test_src_speakers.txt" and "test_tar_speakers.txt".
The pretrained FVMVC model can be found here
Run inference

./run_shell/inference.sh
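
Before running inference it can help to check which source and target IDs will be used. The sketch below reads the two ID files and enumerates source/target pairs. It assumes one ID per line, which this README does not confirm, and pairing every source with every target is an illustrative choice rather than the behavior of inference.sh; the function names are hypothetical.

```python
from itertools import product
from pathlib import Path


def load_ids(path):
    """Read one ID per non-empty line (assumed file layout)."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def conversion_pairs(src_file="test_src_speakers.txt",
                     tar_file="test_tar_speakers.txt"):
    """Pair every source ID with every target ID (an illustrative choice)."""
    return list(product(load_ids(src_file), load_ids(tar_file)))
```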

Citation

If you use this code in your research, please star our repo and cite our paper:

@inproceedings{10.1145/3581783.3613825,
author = {Sheng, Zheng-Yan and Ai, Yang and Chen, Yan-Nian and Ling, Zhen-Hua},
title = {Face-Driven Zero-Shot Voice Conversion with Memory-Based Face-Voice Alignment},
year = {2023},
isbn = {9798400701085},
url = {https://doi.org/10.1145/3581783.3613825},
doi = {10.1145/3581783.3613825},
booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
pages = {8443–8452},
location = {Ottawa ON, Canada},
}

Acknowledgements:

The voice conversion backbone is borrowed from VQMIVC
The vocoder is borrowed from ParallelWaveGAN
