Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC). We use a memory-based face-voice alignment module to capture voice characteristics from face images. We also introduce a mixed supervision strategy to mitigate the long-standing inconsistency between the training and inference phases of voice conversion. To obtain speaker-independent, content-related representations, we transfer knowledge from a pretrained zero-shot voice conversion model, VQMIVC, to our zero-shot FaceVC model.
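
To make the idea of recalling voice characteristics from a face concrete, here is a minimal sketch of a generic key-value memory read. It is an illustrative assumption, not the authors' implementation: the function name `recall_voice`, the `key_memory` and `value_memory` slots, the cosine addressing, and the `temperature` parameter are all hypothetical. A face embedding addresses the key slots, and the resulting weights mix the value slots into a voice-like embedding.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_voice(face_embedding, key_memory, value_memory, temperature=0.1):
    """Hypothetical memory read: address key slots, then mix value slots.

    key_memory and value_memory hold one vector per slot, and slot i of the
    keys is paired with slot i of the values (an assumption of this sketch).
    """
    scores = [cosine(face_embedding, key) / temperature for key in key_memory]
    weights = softmax(scores)
    dim = len(value_memory[0])
    recalled = [
        sum(w * value[d] for w, value in zip(weights, value_memory))
        for d in range(dim)
    ]
    return recalled, weights
```

In this toy setup, a face embedding that matches one key slot closely puts almost all of its weight on that slot, so the recalled vector is close to the paired value slot. How the slots are trained so that face-side and voice-side addressing agree is described in the paper and is not modeled here.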

Training

  • Step 1. Data preparation & preprocessing
  1. Put the LRS3 corpus under the directory "Dataset/LRS3"
  2. Extract wav from the LRS3 videos
python Tools/preprocess/extract_wav_from_video.py
  3. Extract mel and lf0 from the wav files
python Tools/preprocess/extract_wav_feature.py
  4. Extract face features
python Tools/Preprocess/extract_face_feature.py
  5. Extract speech features (speaker embeddings)
python Tools/Preprocess/extract_spk_emb.py
  • Step 2. Model training
  1. ParallelWaveGAN is used as the vocoder, so install ParallelWaveGAN first

  2. Download the pretrained VQMIVC model and place it in the folder "pretrained"

  3. Train the model

./run_shell/train.sh
  • Step 3. Inference
  1. Preprocess the samples for inference following Step 1. The IDs of the preprocessed samples can be found in the files "test_src_speakers.txt" and "test_tar_speakers.txt".

  2. A pretrained FVMVC model can be found here

  3. Run inference

./run_shell/inference.sh
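
The four preprocessing commands in Step 1 have to run in order, so a small driver can help check a checkout before starting a long job. The sketch below is only a convenience wrapper under stated assumptions: the helper names `plan_preprocessing` and `run_preprocessing` are hypothetical, the script paths are copied verbatim from the steps above (note that the directory casing, "preprocess" versus "Preprocess", differs between them, so verify it against your checkout), and the scripts are assumed to take no command-line arguments.

```python
import shlex
import subprocess
import sys
from pathlib import Path

# Commands copied verbatim from Step 1. The directory casing differs between
# entries in the source, so confirm the real paths in your checkout.
PREPROCESS_STEPS = [
    ("Extract wav from LRS3 video", "python Tools/preprocess/extract_wav_from_video.py"),
    ("Extract mel and lf0 from wav", "python Tools/preprocess/extract_wav_feature.py"),
    ("Extract face feature", "python Tools/Preprocess/extract_face_feature.py"),
    ("Extract speech feature", "python Tools/Preprocess/extract_spk_emb.py"),
]


def plan_preprocessing(repo_root="."):
    """Return (description, argv, script_exists) for each Step 1 command."""
    root = Path(repo_root)
    plan = []
    for description, command in PREPROCESS_STEPS:
        argv = shlex.split(command)
        script = root / argv[1]
        plan.append((description, argv, script.is_file()))
    return plan


def run_preprocessing(repo_root=".", dry_run=True):
    """Print the plan; with dry_run=False, run each script and stop on failure."""
    for description, argv, exists in plan_preprocessing(repo_root):
        status = "found" if exists else "MISSING"
        print(f"[{status}] {description}: {' '.join(argv)}")
        if not dry_run:
            # Assumption: the scripts should run under the current interpreter.
            subprocess.run([sys.executable, *argv[1:]], cwd=repo_root, check=True)


if __name__ == "__main__":
    run_preprocessing(dry_run=True)
```

Running it with `dry_run=True` only lists the commands and flags any script path that does not exist, which is a quick way to catch the casing mismatch on case-sensitive file systems before any feature extraction starts.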

Citation

If the code is used in your research, please Star our repo and cite our paper:

@inproceedings{10.1145/3581783.3613825,
author = {Sheng, Zheng-Yan and Ai, Yang and Chen, Yan-Nian and Ling, Zhen-Hua},
title = {Face-Driven Zero-Shot Voice Conversion with Memory-Based Face-Voice Alignment},
year = {2023},
isbn = {9798400701085},
url = {https://doi.org/10.1145/3581783.3613825},
doi = {10.1145/3581783.3613825},
booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
pages = {8443–8452},
location = {Ottawa ON, Canada},
}

Acknowledgements:
