Gesture Generation from Trimodal Context

This is an official PyTorch implementation of Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity (SIGGRAPH Asia 2020). In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. By incorporating a multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match the speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models, the Fréchet Gesture Distance (FGD).

OVERVIEW

Environment

This repository is developed and tested on Ubuntu 18.04, Python 3.6+, and PyTorch 1.3+. On Windows, we only tested the synthesis step, and it worked fine. On PyTorch 1.5+, some warnings appear due to read-only entries in LMDB (see the related issue).

Quick Start

Installation

  1. Clone this repository:

    git clone https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.git
    
  2. Install required python packages:

    pip install -r requirements.txt
    
  3. Install Gentle for audio-transcript alignment. Download the source code from the Gentle GitHub repository and install the library via install.sh. You can then import the gentle library by specifying the path to it at script/synthesize.py line 27; a minimal path-setup sketch follows this list.
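
A minimal sketch of that path setup (the Gentle install directory below is a placeholder, not part of this repository):

    # Minimal sketch: make a locally installed Gentle library importable.
    # GENTLE_DIR is a placeholder -- point it to wherever you cloned Gentle.
    import sys

    GENTLE_DIR = '/path/to/gentle'
    sys.path.insert(0, GENTLE_DIR)

    import gentle  # now resolvable because GENTLE_DIR is on sys.path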

Preparation

  1. Download the trained model.

  2. Download the preprocessed TED dataset (16GB) and extract the ZIP file into data/ted_dataset.

  3. Set up Google Cloud TTS. You need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable; please see the manual here. You can skip this step if you are not going to synthesize gestures from custom text. A minimal credential check follows this list.
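
A minimal sketch for verifying that the credentials work, assuming the current google-cloud-texttospeech client API; this is not code from this repository, and the key-file path is a placeholder:

    # Minimal sketch to verify Google Cloud TTS credentials (not part of this repo).
    # The JSON key path is a placeholder.
    import os
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account-key.json'

    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text='Hello, this is a quick TTS test.'),
        voice=texttospeech.VoiceSelectionParams(
            language_code='en-US',
            ssml_gender=texttospeech.SsmlVoiceGender.FEMALE),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16),
    )
    with open('tts_test.wav', 'wb') as f:
        f.write(response.audio_content)  # LINEAR16 output is playable as WAV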

Synthesize from TED speech

Generate gestures from a clip in the TED testset:

python scripts/synthesize.py from_db_clip [trained model path] [number of samples to generate]

For example:

python scripts/synthesize.py from_db_clip output/train_multimodal_context/multimodal_context_checkpoint_best.bin 10

The first run takes several minutes to cache the dataset; after that, it runs quickly.
You can find the synthesized results in output/generation_results. There are MP4, WAV, and PKL files for the visualized output, the audio, and the pickled raw results, respectively (a loading sketch follows the sample below). Speaker IDs are randomly selected for each generation. The following shows a sample MP4 file.

Sample MP4
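
If you want to post-process the raw results yourself, the PKL files can be opened with the standard pickle module. The sketch below only loads and inspects one file; the file name is an example, and the structure of the pickled object is an assumption to verify against the repository's output code:

    # Sketch: load and inspect one pickled raw result.
    # The file name is an example; the object structure depends on the repo's output code.
    import pickle

    with open('output/generation_results/example_result.pkl', 'rb') as f:
        result = pickle.load(f)

    print(type(result))
    if isinstance(result, dict):
        print(list(result.keys()))  # see which fields are stored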

Synthesize from custom text

Generate gestures from speech text. Speech audio is synthesized by Google Cloud TTS.

python scripts/synthesize.py from_text [trained model path] {en-male, en-female}

You can select a sample text or input a new one. The input text can be plain text or SSML markup (an SSML example is shown below). The third argument in the above command selects the TTS voice. You can further tweak the TTS settings in utils/tts_help.py.
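
For reference, an SSML input could look like the following; this is standard SSML markup, not something specific to this repository:

    <speak>
      Hello everyone. <break time="500ms"/>
      Today I want to talk about <emphasis level="moderate">co-speech gestures</emphasis>.
    </speak>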

Training

Train the proposed model:

python scripts/train.py --config=config/multimodal_context.yml

You can also train the baseline models:

python scripts/train.py --config=config/seq2seq.yml
python scripts/train.py --config=config/speech2gesture.yml
python scripts/train.py --config=config/joint_embed.yml 

Caching the TED training set (lmdb_train) takes tens of minutes on your first run. Model checkpoints and sample results are saved in subdirectories of the output folder. Training the proposed model took about 8 hours on an RTX 2080 Ti.

Note on reproducibility:
Unfortunately, we did not fix a random seed, so you will not be able to reproduce the exact FGD reported in the paper. However, several runs with different random seeds mostly fell within a similar FGD range.
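
If you want more repeatable runs, you can fix the usual seeds yourself before training. This is a generic PyTorch sketch, not code from this repository, and full determinism on the GPU may additionally depend on cuDNN settings and data-loader workers:

    # Generic seed-fixing sketch (not part of this repo); call it before
    # building datasets and models to make training runs more repeatable.
    import random
    import numpy as np
    import torch

    def set_seed(seed=0):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade some speed for deterministic cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False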

Fréchet Gesture Distance (FGD)

To be updated.
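
Until the evaluation code is released here, note that FGD uses the same Fréchet distance formulation as FID, computed on feature vectors extracted from real and generated gesture sequences (the paper uses features from an autoencoder trained on human gestures). A generic sketch of that distance, with the feature-extraction step assumed and omitted:

    # Generic Fréchet distance sketch (same formulation as FID); the gesture
    # feature extractor itself is assumed and omitted here.
    import numpy as np
    from scipy import linalg

    def frechet_distance(real_feats, gen_feats):
        """real_feats, gen_feats: arrays of shape (num_samples, feature_dim)."""
        mu1, sigma1 = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
        mu2, sigma2 = gen_feats.mean(axis=0), np.cov(gen_feats, rowvar=False)

        diff = mu1 - mu2
        covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
        if np.iscomplexobj(covmean):
            covmean = covmean.real  # drop small imaginary parts from numerical error

        return float(diff.dot(diff) + np.trace(sigma1 + sigma2 - 2.0 * covmean))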

Blender Animation (from a generated PKL file)

To be updated.

License

Please see LICENSE.md

Citation

If you find our work useful in your research, please consider citing:

@article{Yoon2020Trimodal,
  title={Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity},
  author={Youngwoo Yoon and Bok Cha and Joo-Haeng Lee and Minsu Jang and Jaeyeon Lee and Jaehong Kim and Geehyuk Lee},
  journal={ACM Transactions on Graphics},
  year={2020},
  volume={39},
  number={6},
}

Please feel free to contact us ([email protected]) with any questions or concerns.
