Gesture Generation from Trimodal Context

This is an official PyTorch implementation of Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity (SIGGRAPH Asia 2020). In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. By incorporating a multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match the speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models, the Fréchet Gesture Distance (FGD).

OVERVIEW

Environment

This repository is developed and tested on Ubuntu 18.04, Python 3.6+, and PyTorch 1.3+. On Windows, we only tested the synthesis step, and it worked fine. On PyTorch 1.5+, some warnings appear due to read-only entries in LMDB (see the related issue).

Quick Start

Installation

  1. Clone this repository:

    git clone https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.git
    
  2. Install required python packages:

    pip install -r requirements.txt
    
  3. Install Gentle for audio-transcript alignment. Download the source code from the Gentle GitHub repository and install the library via install.sh. You can then import the gentle library by specifying the path to it at script/synthesize.py line 27; a minimal path-setup sketch follows this list.
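
A minimal sketch of that path setup (the Gentle install directory below is a placeholder, not part of this repository):

    # Minimal sketch: make a locally installed Gentle library importable.
    # GENTLE_DIR is a placeholder -- point it to wherever you cloned Gentle.
    import sys

    GENTLE_DIR = '/path/to/gentle'
    sys.path.insert(0, GENTLE_DIR)

    import gentle  # now resolvable because GENTLE_DIR is on sys.path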

Preparation

  1. Download the trained model.

  2. Download the preprocessed TED dataset (16GB) and extract the ZIP file into data/ted_dataset.

  3. Set up Google Cloud TTS. You need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable; please see the manual here. You can skip this step if you are not going to synthesize gestures from custom text. A minimal credential check follows this list.
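
A minimal sketch for verifying that the credentials work, assuming the current google-cloud-texttospeech client API; this is not code from this repository, and the key-file path is a placeholder:

    # Minimal sketch to verify Google Cloud TTS credentials (not part of this repo).
    # The JSON key path is a placeholder.
    import os
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account-key.json'

    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text='Hello, this is a quick TTS test.'),
        voice=texttospeech.VoiceSelectionParams(
            language_code='en-US',
            ssml_gender=texttospeech.SsmlVoiceGender.FEMALE),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16),
    )
    with open('tts_test.wav', 'wb') as f:
        f.write(response.audio_content)  # LINEAR16 output is playable as WAV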

Synthesize from TED speech

Generate gestures from a clip in the TED testset:

python scripts/synthesize.py from_db_clip [trained model path] [number of samples to generate]

For example:

python scripts/synthesize.py from_db_clip output/train_multimodal_context/multimodal_context_checkpoint_best.bin 10

The first run takes several minutes to cache the dataset; after that, it runs quickly.
You can find the synthesized results in output/generation_results. There are MP4, WAV, and PKL files for the visualized output, the audio, and the pickled raw results, respectively (a loading sketch follows the sample below). Speaker IDs are randomly selected for each generation. The following shows a sample MP4 file.

Sample MP4
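
If you want to post-process the raw results yourself, the PKL files can be opened with the standard pickle module. The sketch below only loads and inspects one file; the file name is an example, and the structure of the pickled object is an assumption to verify against the repository's output code:

    # Sketch: load and inspect one pickled raw result.
    # The file name is an example; the object structure depends on the repo's output code.
    import pickle

    with open('output/generation_results/example_result.pkl', 'rb') as f:
        result = pickle.load(f)

    print(type(result))
    if isinstance(result, dict):
        print(list(result.keys()))  # see which fields are stored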

Synthesize from custom text

Generate gestures from speech text. Speech audio is synthesized by Google Cloud TTS.

python scripts/synthesize.py from_text [trained model path] {en-male, en-female}

You can select a sample text or input a new one. The input text can be plain text or SSML markup (an SSML example is shown below). The third argument in the above command selects the TTS voice. You can further tweak the TTS settings in utils/tts_help.py.
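
For reference, an SSML input could look like the following; this is standard SSML markup, not something specific to this repository:

    <speak>
      Hello everyone. <break time="500ms"/>
      Today I want to talk about <emphasis level="moderate">co-speech gestures</emphasis>.
    </speak>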

Training

Train the proposed model:

python scripts/train.py --config=config/multimodal_context.yml

You can also train the baseline models:

python scripts/train.py --config=config/seq2seq.yml
python scripts/train.py --config=config/speech2gesture.yml
python scripts/train.py --config=config/joint_embed.yml 

Caching the TED training set (lmdb_train) takes tens of minutes on your first run. Model checkpoints and sample results are saved in subdirectories of the output folder. Training the proposed model took about 8 hours on an RTX 2080 Ti.

Note on reproducibility:
Unfortunately, we did not fix a random seed, so you will not be able to reproduce the exact FGD reported in the paper. However, several runs with different random seeds mostly fell within a similar FGD range.
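
If you want more repeatable runs, you can fix the usual seeds yourself before training. This is a generic PyTorch sketch, not code from this repository, and full determinism on the GPU may additionally depend on cuDNN settings and data-loader workers:

    # Generic seed-fixing sketch (not part of this repo); call it before
    # building datasets and models to make training runs more repeatable.
    import random
    import numpy as np
    import torch

    def set_seed(seed=0):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade some speed for deterministic cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False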

Fréchet Gesture Distance (FGD)

To be updated.
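
Until the evaluation code is released here, note that FGD uses the same Fréchet distance formulation as FID, computed on feature vectors extracted from real and generated gesture sequences (the paper uses features from an autoencoder trained on human gestures). A generic sketch of that distance, with the feature-extraction step assumed and omitted:

    # Generic Fréchet distance sketch (same formulation as FID); the gesture
    # feature extractor itself is assumed and omitted here.
    import numpy as np
    from scipy import linalg

    def frechet_distance(real_feats, gen_feats):
        """real_feats, gen_feats: arrays of shape (num_samples, feature_dim)."""
        mu1, sigma1 = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
        mu2, sigma2 = gen_feats.mean(axis=0), np.cov(gen_feats, rowvar=False)

        diff = mu1 - mu2
        covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
        if np.iscomplexobj(covmean):
            covmean = covmean.real  # drop small imaginary parts from numerical error

        return float(diff.dot(diff) + np.trace(sigma1 + sigma2 - 2.0 * covmean))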

Blender Animation (from a generated PKL file)

To be updated.

License

Please see LICENSE.md

Citation

If you find our work useful in your research, please consider citing:

@article{Yoon2020Trimodal,
  title={Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity},
  author={Youngwoo Yoon and Bok Cha and Joo-Haeng Lee and Minsu Jang and Jaeyeon Lee and Jaehong Kim and Geehyuk Lee},
  journal={ACM Transactions on Graphics},
  year={2020},
  volume={39},
  number={6},
}

Please feel free to contact us ([email protected]) with any questions or concerns.
