
hangz-nju-cuhk / talking-face-generation-davs


Code for Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019)

License: MIT License

Python 85.65% MATLAB 14.35%

talking-face-generation-davs's Introduction

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019)

In this work we propose the Disentangled Audio-Visual System (DAVS) for arbitrary-subject talking face generation: it synthesizes a sequence of face images that correspond to the given speech semantics, conditioned on either an unconstrained speech audio clip or a video.

[Project] [Paper] [Demo]

Recommendation of our CVPR 2021 repo

This repo is barely maintained since this version of the code is out of date. If you are interested in the topic of Talking Face Generation, feel free to try the CODE of our CVPR 2021 PAPER!

Requirements

Generating test results

Create the default folder "checkpoints" and put the checkpoint in it, or note its CHECKPOINT_PATH
  • Samples for testing can be found in the folder named 0572_0019_0003. This is a pre-processed sample from the VoxCeleb dataset.

  • Run the testing script to generate videos from video:

python test_all.py  --test_root ./0572_0019_0003/video --test_type video --test_audio_video_length 99 --test_resume_path CHECKPOINT_PATH
  • Run the testing script to generate videos from audio:
python test_all.py  --test_root ./0572_0019_0003/audio --test_type audio --test_audio_video_length 99 --test_resume_path CHECKPOINT_PATH
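
The testing script writes individual result frames to disk (the issues below report names like test_sample1_fake_audio_B_0_<i>.png in the results folder). Below is a minimal sketch for muxing such frames with the driving audio into a 25 fps video using ffmpeg from Python; the results directory, the frame-name pattern and the audio path are assumptions and may need adjusting to your actual output:

import os
import subprocess

def frames_to_video(frames_dir, audio_path, out_path, fps=25):
    # Assumed frame naming: test_sample1_fake_audio_B_0_0.png, _1.png, ...
    # Adjust the pattern to whatever test_all.py actually produced for you.
    pattern = os.path.join(frames_dir, "test_sample1_fake_audio_B_0_%d.png")
    subprocess.check_call([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", pattern,   # image-sequence input
        "-i", audio_path,                        # driving audio
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-shortest",                             # stop at the shorter stream
        out_path,
    ])

# Hypothetical usage:
# frames_to_video("results/test_sample1", "path/to/audio.wav", "out.mp4")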

Sample Results

  • Talking Effect on Human Characters

  • Talking Effect on Non-human Characters (Trained on Human Faces Only)

Create more samples

  • The face detection tool used in the demo videos can be found at RSA. It returns a MAT-file with 5 key-point locations in a row for each image. Other face alignment methods such as dlib are also applicable. The key points we use for face alignment are the two eye centers and the average point of the two mouth corners. With each image's PATH and the face POINTS, you can find our way of face alignment at preprocess/face_align.py (a minimal sketch of this step is given after this list).

  • Our preprocessing of the audio files is the same as, and borrowed from, the MATLAB code of SyncNet. We then save the MFCC features into .bin files.
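
For reference, here is a minimal sketch of the alignment step, assuming the five detected points come in the common (left eye, right eye, nose, left mouth corner, right mouth corner) order and using placeholder canonical coordinates; the exact constants and warping used by preprocess/face_align.py may differ, so treat this only as an illustration:

import cv2
import numpy as np

# Hypothetical canonical positions of (left eye, right eye, mouth center)
# in a 256x256 crop -- placeholders, not the authors' exact values.
TARGET_3PTS = np.float32([[85.0, 100.0], [171.0, 100.0], [128.0, 190.0]])

def align_face(image, five_points, size=256):
    # five_points: 5x2 array, assumed order
    # (left eye, right eye, nose, left mouth corner, right mouth corner).
    pts = np.float32(five_points).reshape(5, 2)
    src = np.float32([
        pts[0],                    # left-eye center
        pts[1],                    # right-eye center
        (pts[3] + pts[4]) / 2.0,   # average of the two mouth corners
    ])
    # Similarity transform (rotation + uniform scale + translation) to the
    # canonical points, then warp to a size x size crop.
    M, _ = cv2.estimateAffinePartial2D(src, TARGET_3PTS)
    return cv2.warpAffine(image, M, (size, size))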

Preparing Training Data

  • We used the LRW dataset for training.
  • The directories are arranged like this:
data
├── train, val, test
|	├── 0, 1, 2 ... 499 (one folder for each class)
|	│   ├── 0, 1, 2 ... #videos per class
|	│   │   ├── align_face256
|	│   │   |   ├── 0, 1, ... 28.jpg
|	│   |   ├── mfcc20
|	│   │   |   ├── 2, 3 ... 26.bin

where each video is extracted to frames and aligned using our protocol, and each audio track is processed and saved using MATLAB.
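
As a reference for iterating over this layout, here is a minimal sketch that collects the aligned frames and MFCC bins of one video folder; note, as the tree above shows, that the frame indices run 0-28 while the mfcc20 bins run 2-26, so the two lists have different lengths (the example path is illustrative):

import os

def list_sample(video_dir):
    # e.g. video_dir = "data/train/0/0" (illustrative path)
    frame_dir = os.path.join(video_dir, "align_face256")
    mfcc_dir = os.path.join(video_dir, "mfcc20")

    def numbered(folder, ext):
        # Sort files numerically: 0.jpg ... 28.jpg / 2.bin ... 26.bin
        names = [f for f in os.listdir(folder) if f.endswith(ext)]
        names.sort(key=lambda f: int(os.path.splitext(f)[0]))
        return [os.path.join(folder, f) for f in names]

    frames = numbered(frame_dir, ".jpg")
    mfccs = numbered(mfcc_dir, ".bin")
    return frames, mfccs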

Training

python train.py
  • This is still a beta version of the training code, which only disentangles the word-identity (wid) information from the person-identity (pid) space. Running train.py alone might not fully reproduce the paper; however, it can serve as a reference for how we implement the whole training process.
  • In our own implementation, the classification part (without generation and disentanglement) is pretrained first. The pretraining code is temporarily not provided.

Postprocessing Details (Optional)

  • The directly generated results may suffer from a "zoom-in-and-out" effect, which we assume is caused by our alignment of the training set. We address this instability with Subspace Video Stabilization in the demos.
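
The demos rely on Subspace Video Stabilization. As a rough, generic substitute (explicitly not the method used for the demos), one could run ffmpeg's vidstab filters over the assembled video; this assumes an ffmpeg build compiled with libvidstab:

import subprocess

def stabilize(in_path, out_path):
    # Pass 1: analyse shake and write the estimated transforms to a side file.
    subprocess.check_call([
        "ffmpeg", "-y", "-i", in_path,
        "-vf", "vidstabdetect=result=transforms.trf",
        "-f", "null", "-",
    ])
    # Pass 2: apply the smoothed transforms to produce the stabilized video.
    subprocess.check_call([
        "ffmpeg", "-y", "-i", in_path,
        "-vf", "vidstabtransform=input=transforms.trf:smoothing=10",
        out_path,
    ])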

License and Citation

The use of this software is RESTRICTED to non-commercial research and educational purposes.

@inproceedings{zhou2019talking,
  title     = {Talking Face Generation by Adversarially Disentangled Audio-Visual Representation},
  author    = {Zhou, Hang and Liu, Yu and Liu, Ziwei and Luo, Ping and Wang, Xiaogang},
  booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
  year      = {2019},
}

Acknowledgement

The structure of this codebase is borrowed from pix2pix.

talking-face-generation-davs's People

Contributors

hangz-nju-cuhk, liuziwei7


talking-face-generation-davs's Issues

Error in loading Checkpoint file : Missing keys error

I faced some problems when I tried loading the checkpoint file. The error I got is:

loading checkpoint '/home/gouriparvathymenon/Downloads/101_DAVS_checkpoint.pth'
missing keys in state_dict: set(['module.block33.conv33_a_bn.num_batches_tracked', 'module.block12.conv12_b_bn.num_batches_tracked', 'module.block13.conv13_a_bn.num_batches_tracked', 'module.block32.conv32_a_bn.num_batches_tracked', 'module.block32.conv32_b_bn.num_batches_tracked', 'module.block34.conv34_a_bn.num_batches_tracked', 'module.block31.conv31_a_bn.num_batches_tracked', 'module.block12.conv12_a_bn.num_batches_tracked', 'module.block1.conv01_a_bn.num_batches_tracked', 'module.block33.conv33_b_bn.num_batches_tracked', 'module.block11.conv11_a_bn.num_batches_tracked', 'module.block42.conv42_b_bn.num_batches_tracked', 'module.block13.conv13_b_bn.num_batches_tracked', 'module.block24.conv24_b_bn.num_batches_tracked', 'module.block23.conv23_a_bn.num_batches_tracked', 'module.block25.conv25_b_bn.num_batches_tracked', 'module.block22.conv22_a_bn.num_batches_tracked', 'module.block34.conv34_b_bn.num_batches_tracked', 'module.block11.conv11_b_bn.num_batches_tracked', 'module.block21.conv21_a_bn.num_batches_tracked', 'module.block23.conv23_b_bn.num_batches_tracked', 'module.block1.conv01_b_bn.num_batches_tracked', 'module.block31.conv31_b_bn.num_batches_tracked', 'module.block42.conv42_a_bn.num_batches_tracked', 'module.block14.conv14_b_bn.num_batches_tracked', 'module.block26.conv26_a_bn.num_batches_tracked', 'module.block25.conv25_a_bn.num_batches_tracked', 'module.block22.conv22_b_bn.num_batches_tracked', 'module.block14.conv14_a_bn.num_batches_tracked', 'module.block26.conv26_b_bn.num_batches_tracked', 'module.block41.conv41_a_bn.num_batches_tracked', 'module.block21.conv21_b_bn.num_batches_tracked', 'module.block24.conv24_a_bn.num_batches_tracked', 'module.block41.conv41_b_bn.num_batches_tracked'])
missing keys in state_dict: set(['module.convblock3.conv3_1_bn.num_batches_tracked', 'module.convblock5.conv5_0_bn.num_batches_tracked', 'module.convblock4.conv4_2_bn.num_batches_tracked', 'module.convblock2.conv2_0_bn.num_batches_tracked', 'module.convblock2.conv2_2_bn.num_batches_tracked', 'module.convblock3.conv3_3_bn.num_batches_tracked', 'module.convblock1.conv1_0_bn.num_batches_tracked', 'module.convblock1.conv1_1_bn.num_batches_tracked', 'module.convblock5.conv5_2_bn.num_batches_tracked', 'module.convblock3.conv3_0_bn.num_batches_tracked', 'module.convblock6.conv6_1_bn.num_batches_tracked', 'module.convblock3.conv3_2_bn.num_batches_tracked', 'module.deconv1_1_bn.num_batches_tracked', 'module.convblock4.conv4_0_bn.num_batches_tracked', 'module.convblock4.conv4_1_bn.num_batches_tracked', 'module.convblock2.conv2_1_bn.num_batches_tracked', 'module.conv7_1_bn.num_batches_tracked', 'module.convblock5.conv5_1_bn.num_batches_tracked', 'module.convblock4.conv4_3_bn.num_batches_tracked', 'module.convblock6.conv6_0_bn.num_batches_tracked'])
missing keys in state_dict: set(['module.model1.bn3.num_batches_tracked', 'module.model1.bn1.num_batches_tracked', 'module.model1.bn2.num_batches_tracked', 'module.model1.bn5.num_batches_tracked', 'module.model2.bn1.num_batches_tracked', 'module.model2.bn2.num_batches_tracked'])
missing keys in state_dict: set(['module.model.conv4.bn2.num_batches_tracked', 'module.model.top_m_0.bn2.num_batches_tracked', 'module.model.m0.b1_2.bn2.num_batches_tracked', 'module.model.m0.b1_4.bn2.num_batches_tracked', 'module.model.m0.b1_2.bn1.num_batches_tracked', 'module.model.conv3.bn2.num_batches_tracked', 'module.model.m0.b1_4.bn1.num_batches_tracked', 'module.model.m0.b1_3.bn1.num_batches_tracked', 'module.model.conv2.bn2.num_batches_tracked', 'module.model.m0.b3_2.bn2.num_batches_tracked', 'module.model.m0.b1_4.bn3.num_batches_tracked', 'module.model.conv2.bn1.num_batches_tracked', 'module.model.m0.b2_plus_1.bn2.num_batches_tracked', 'module.model.m0.b3_3.bn1.num_batches_tracked', 'module.model.m0.b2_3.bn2.num_batches_tracked', 'module.model.m0.b2_2.bn2.num_batches_tracked', 'module.model.m0.b1_1.bn3.num_batches_tracked', 'module.model.m0.b2_1.bn2.num_batches_tracked', 'module.model.conv2.bn3.num_batches_tracked', 'module.model.conv4.bn1.num_batches_tracked', 'module.model.m0.b2_4.bn2.num_batches_tracked', 'module.model.m0.b3_3.bn3.num_batches_tracked', 'module.model.conv4.bn3.num_batches_tracked', 'module.model.m0.b3_1.bn3.num_batches_tracked', 'module.model.m0.b2_2.bn3.num_batches_tracked', 'module.model.conv3.bn1.num_batches_tracked', 'module.model.m0.b3_4.bn2.num_batches_tracked', 'module.model.m0.b2_4.bn1.num_batches_tracked', 'module.model.m0.b3_2.bn1.num_batches_tracked', 'module.model.m0.b1_1.bn1.num_batches_tracked', 'module.model.m0.b1_1.bn2.num_batches_tracked', 'module.model.m0.b2_1.bn3.num_batches_tracked', 'module.model.top_m_0.bn1.num_batches_tracked', 'module.model.m0.b3_2.bn3.num_batches_tracked', 'module.model.m0.b1_2.bn3.num_batches_tracked', 'module.bn1.num_batches_tracked', 'module.model.conv2.downsample.0.num_batches_tracked', 'module.model.m0.b2_1.bn1.num_batches_tracked', 'module.model.m0.b3_1.bn1.num_batches_tracked', 'module.model.m0.b2_3.bn3.num_batches_tracked', 'module.model.bn1.num_batches_tracked', 'module.model.m0.b1_3.bn3.num_batches_tracked', 'module.model.bn_end0.num_batches_tracked', 'module.model.m0.b2_2.bn1.num_batches_tracked', 'module.model.m0.b3_1.bn2.num_batches_tracked', 'module.model.conv4.downsample.0.num_batches_tracked', 'module.model.m0.b3_4.bn1.num_batches_tracked', 'module.model.m0.b2_4.bn3.num_batches_tracked', 'module.model.m0.b2_plus_1.bn1.num_batches_tracked', 'module.model.m0.b2_3.bn1.num_batches_tracked', 'module.model.top_m_0.bn3.num_batches_tracked', 'module.model.m0.b3_3.bn2.num_batches_tracked', 'module.model.m0.b2_plus_1.bn3.num_batches_tracked', 'module.model.conv3.bn3.num_batches_tracked', 'module.model.m0.b3_4.bn3.num_batches_tracked', 'module.model.m0.b1_3.bn2.num_batches_tracked'])
=> loaded checkpoint '/home/gouriparvathymenon/Downloads/101_DAVS_checkpoint.pth' (step 21145000)
Traceback (most recent call last):
File "test_all.py", line 41, in
for i2, data in enumerate(test_dataloader):
File "/home/gouriparvathymenon/.local/share/virtualenvs/gouriparvathymenon-vLhDFwnm/lib/avatar/local/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 637, in next
return self._process_next_batch(batch)
File "/home/gouriparvathymenon/.local/share/virtualenvs/gouriparvathymenon-vLhDFwnm/lib/avatar/local/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
KeyError: 'Traceback (most recent call last):\n File "/home/gouriparvathymenon/.local/share/virtualenvs/gouriparvathymenon-vLhDFwnm/lib/avatar/local/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop\n samples = collate_fn([dataset[i] for i in batch_indices])\n File "/home/gouriparvathymenon/PycharmProjects/avatar/Talking-Face-Generation-DAVS/Dataloader/Test_load_audio.py", line 103, in getitem\n loader['A'] = self.vid['A']\nKeyError: 'A'\n'

The command I used was this:

python test_all.py --test_root 001.wav --test_type audio --test_audio_video_length 99 --test_resume_path 101_DAVS_checkpoint.pth.tar

Can someone help me find where I went wrong? Thanks in advance.

checkpoint

I cannot download the model checkpoint. How can I deal with this problem?

Code running without CUDA and pytorch version

Hi,
I am trying to run the test with Python 2.7, OpenCV 2.4.11 and PyTorch 0.4.1 without CUDA (I tried with PyTorch 0.2.0 but I had conflicts and couldn't install it).

With this setup I receive " ValueError: expected 2D or 3D input (got 4D input)"

  • I would like to know if anybody managed to get the test code running WITHOUT using CUDA and, if so, which PyTorch version was used, please.

  • Can PyTorch 0.2.0 run without CUDA?

Thanks!

Train Code

Will the training code be made available?

Strange filterbank parameter value

In the README it is suggested that you use audio pre-processing similar to Zisserman et al.'s SyncNet. However, they use 40 filterbank channels throughout their code (e.g. in the yousaidthat repository https://github.com/joonson/yousaidthat/blob/98b51812894497cb6c2b65a7ae147067609fc6ca/run_demo.m#L22).
I was wondering if there was a reason for choosing 13, or if it had just been mixed up with the number of cepstral coefficients.

Thanks,

code

Is this code complete?

ValueError: expected 2D or 3D input (got 4D input)

Hi.

I run
python test_all.py --test_root ./0572_0019_0003/audio --test_type audio --test_audio_video_length 99 --test_resume_path .\checkpoints\101_DAVS_checkpoint.pth.tar

and see error

Traceback (most recent call last):
File "test_all.py", line 47, in
model.test_train()
File "E:\ml\face_animation\Talking-Face-Generation-DAVS\Test_Gen_Models\Test_Audio_Model.py", line 86, in test_train
self.audio_embeddings = self.mfcc_encoder.forward(self.audios)
File "C:\anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 121, in forward
return self.module(*inputs[0], **kwargs[0])
File "C:\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "E:\ml\face_animation\Talking-Face-Generation-DAVS\network\mfcc_networks.py", line 83, in forward
net = self._forward(x0)
File "E:\ml\face_animation\Talking-Face-Generation-DAVS\network\mfcc_networks.py", line 75, in _forward
net2 = self.model2.forward(x)
File "E:\ml\face_animation\Talking-Face-Generation-DAVS\network\mfcc_networks.py", line 56, in forward
net = self.relu(self.bn1(net))
File "C:\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "C:\anaconda3\lib\site-packages\torch\nn\modules\batchnorm.py", line 52, in forward
self._check_input_dim(input)
File "C:\anaconda3\lib\site-packages\torch\nn\modules\batchnorm.py", line 156, in _check_input_dim
.format(input.dim()))
ValueError: expected 2D or 3D input (got 4D input)

What can be fixed here ?

using 'audio' generates static video

Hi~ @Hangz-nju-cuhk
When I use 'audio' as the test_type, I get a static video every time: all the generated images are the same. However, when I use 'video' as the test_type, it works fine. Do you have any idea why? Thank you!

SyntaxError: invalid token

E:\Users\Raytine\Anaconda3\python.exe F:/future/Talking-Face-Generation-DAVS-master/ptest_all.py --test_root './shuju/audio' --test_type 'audio' --test_audio_video_length 99 --test_resume_path CHECKPOINT_PATH
Traceback (most recent call last):
File "F:/future/Talking-Face-Generation-DAVS-master/ptest_all.py", line 11, in
import Test_Gen_Models.Test_Video_Model as Gen_Model
File "F:\future\Talking-Face-Generation-DAVS-master\Test_Gen_Models\Test_Video_Model.py", line 12, in
import network.IdentityEncoder as IdentityEncoder
File "F:\future\Talking-Face-Generation-DAVS-master\network\IdentityEncoder.py", line 46
self.add_module('block' + str(01), BasicBlock(3, 32, name="01", conv_std=0.025253814, kernel_size=7, stride=2, padding=3))
^
SyntaxError: invalid token

What dataset and data IDs do you use to generate the demo?

Hi,
I want to follow and match your demo results; could you list the dataset and the IDs you used to generate the demo?
In your paper, R@1, R@10, Med R and SVM are used to measure the quality of the different supervisions; can you release the evaluation code?
Meanwhile, have you quantitatively evaluated your results against other authors' talking-face methods, rather than only qualitatively?
Thanks a lot

How to generate a continuous mouth-to-audio matching video

Hi, first of all thank you for sharing. I have encountered a problem when running this project.
I took a piece of audio and a sample image as input and generated a series of mouth-shape images by running test_all.py. I found that by simply splicing these images and the audio together, I couldn't get a video where the mouth matches the sound. How do you combine these images and audio into a video with good mouth-audio matching, or can you only use video plus audio as input to get a matched video?

no checkpoint found ?

Hi, I'm trying to run the repo on Colab; this is the command line that I entered:

! python test_all.py --test_root '/content/gdrive/My Drive/Talking-Face-Generation-DAVS-master/video' --test_type video --test_audio_video_length 848 --test_resume_path '/content/gdrive/My Drive/Talking-Face-Generation-DAVS-master/checkpoints/101_DAVS_checkpoint.pth.tar'

Note that I have downloaded and placed the checkpoints inside that folder (they are definitely present).

Any help would be greatly appreciated, thank you.

Corrupted checkpoint file

Hi~ @Hangz-nju-cuhk
Thanks for your paper and code, but I encountered some problems when downloading the checkpoint file. It seems the checkpoint tar file has been corrupted; would you mind repairing it?
Thanks

Pre-Processing Data

Hey @Hangz-nju-cuhk @liuziwei7 @liuyuisanai!
I am trying to understand and reproduce the results of this repository end-to-end, so that I can create a Dockerfile and contribute. I have already read the paper thoroughly and am analyzing the code now, but I am having a problem with pre-processing the data. Could you please give step-by-step guidance on how to do this?
Looking forward to you.
Thank you.

Docker for this project

Hi! I have a problem during installation. Do you have a Docker image for this project, and if not, could you create one?
Thanks!

No change

I run the following command: "python test_all.py --test_root ./0572_0019_0003/audio --test_type audio --test_audio_video_length 98 --test_resume_path ./101_DAVS_checkpoint.pth.tar"

But in the end, the pictures generated in the "results" folder are exactly the same as those in the demo images. The .bin files I used are the original ones; I did not generate the .bin files myself.

Some questions about the speech feature extraction

[attached screenshot]
What is the basis for subtracting 600, and for taking 2:end and 2:26 here? Also, looking at the dataset layout in the README, the number of face images differs from the number of MFCC .bin files?

Use the sample wav and pretrained model, get weird result

hi,
I ran the sample you offered and got an appropriate result;

but when I want to generate the MFCC-feature bin files myself, I get wrong results even with the same wav you use, 0572_0019_0003.wav;

my Python code to generate the MFCC features (trying to get a 25 fps result) is as follows:

import numpy as np
import sys
import python_speech_features
from scipy import signal
from scipy.io import wavfile
import subprocess
base_dir = sys.argv[2]
#audiotmp = os.path.join(opt.tmp_dir,'audio.wav')
audiotmp = 'tmp.wav'
videofile = sys.argv[1]
command = ("ffmpeg -y -i %s -async 1 -ac 1 -vn -acodec pcm_s16le -ar 16000 %s" % (videofile,audiotmp))
output = subprocess.call(command, shell=True, stdout=None)
sample_rate, audio = wavfile.read(audiotmp)
mfcc = zip(*python_speech_features.mfcc(audio,sample_rate))
mfcc = np.stack([np.array(i) for i in mfcc])
mfcc = np.transpose(mfcc[1:], (1,0))
lenn = mfcc.shape[0]//4

for i in np.arange(lenn-6):
    tmp_data = mfcc[i*4:i*4+20].reshape(240)
    tmp_data.tofile('%s/%d.bin' % (base_dir, i))

print('over.....')

The results look like the attached frame (test_sample1_fake_audio_b_0_15).

The Python version is 3.6; could you help me find where the bug is?

Run testing script result not as good as demo

Thank you for the great work! I run
python test_all.py --test_root ./0572_0019_0003/video --test_type video --test_audio_video_length 99

but my results are not the same as the demo video and jitter a lot. Do you know what the problem is?

Chinese characters are spoken faster than English words, will this model work on Chinese?

I want to build a dataset of Chinese characters to train this model.
I applied speech recognition on some Chinese news videos (by CCTV).
The recognition part was fine, but I found that Chinese characters are too short in terms of pronunciation time, because each of them has only one syllable.
The average number of video frames it takes to show the lip movement of a single Chinese character is only 5 (at 25 fps), and it can even be as low as 2 frames. This is much less than the required 29 frames. Obviously, interpolation won't work well in this case.
So I would like to know if you guys have considered Chinese? Will this model work? Is there any workaround?

Stabilization of output

As you know, the images generated by the model are rather shaky and definitely require stabilization. Could you please elaborate on which stabilization techniques you have applied?

Undefined function or variable 'vec2frames'.

Hello, I use savemfcc.m to generate the .bin files, but when I execute the code an error occurs.

savemfcc('/talkingface//20180619_1_M.wav','/talkingface/tlkface/wav')
Undefined function or variable 'vec2frames'.

Error in mfcc (line 151)
frames = vec2frames( speech, Nw, Ns, 'cols', window, false );

Error in runmfcc (line 5)
[ CC, FBE, frames ] = mfcc( speech, opt.fs, opt.Tw, opt.Ts, opt.alpha, hamming, opt.R, opt.M, N, opt.L );

Error in savemfcc (line 17)
[ MFCCs, ~, ~ ] = runmfcc( Speech, opt );

Could you please tell me where to find 'vec2frames'?

result file

I got two kinds of images as results: test_sample1_fake_audio_B_0_x.png and test_sample1_real_A_x.png. What do they mean, and how do I use them to get the final video?

not good on Chinese words audio

Hi, I tried the model on a piece of audio with the words "广大党员干部正在积极学习。。。",
but the result is not good. The images below are from test_sample3_fake_audio_B_0_0.png to test_sample3_fake_audio_B_0_17.png:
[attached frames test_sample3_fake_audio_b_0_0 through test_sample3_fake_audio_b_0_17]

I think the sequence does not correspond to "广大"; I don't know where the problem is.

train code

@Hangz-nju-cuhk Hi, This work is amazing and I run the code. But I have two questions:

  1. Could you release the training code? And how should we prepare our training data? Could you describe these steps in detail?
  2. After testing, I find the generated video jitters, but the demo you released is steady. Video stabilization is important for this task; I would like to know how you handle this.

Thanks a lot!

Use the pretrained model ,but got the wrong test result

My PC environment is Python 3.6 + PyTorch 0.4.0.
After I changed (and only changed) str(01) to "01" in IdentityEncoder.py, the code (test_all.py) runs fine, with some deprecation UserWarnings, but the output is far from the actual result. Is that caused by an environment problem?
The following is the test result and real sample.
[attached: test_sample4_fake_image_b_0_31 and test_sample4_real_a_1]

Generation from audio

Hi~
Thanks for your code, but I encountered some problems when I run the testing script to generate videos from audio.

  1. In Test_load_audio.py, it seems the config has no require_audio attribute: when I run python test_all.py --test_root './0572_0019_0003/audio' --test_type 'audio' --test_audio_video_length 99 --test_resume_path, I get the error AttributeError: 'Namespace' object has no attribute 'require_audio'
  2. When will you release the complete code?

issues about train and some train error

Hello, I am an undergraduate at ** University of Science and Technology. Recently some friends and I wanted to learn about talking face generation, and we chose to reproduce your code. The preprocessing went fairly smoothly following your **, but the training code had many errors; we finally got train.py to run after modifying several places. We then noticed that you submitted a change on October 5 that is identical to the modifications we made. Have you also been revising the training code recently? Do you have a newer version of the code? We would very much like to discuss it with you.
If you are willing, my email is [email protected].
We would be very grateful.

can I add new images into the demo_images folder for testing

Hi Hang Zhou, I just added some new images into demo_images for testing, and found that the variation of the fake frames in the result is not like that of the four demo images. Does this repo's code support testing on other images, or should I do some preprocessing work on my own images?

'GPU' problem?

When I use test_all.py, it raises "AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver". I think the test code should automatically choose between GPU and CPU.

Questions about pretraining process and small errors in train.py

Hi, firstly I want to thank you for sharing such a great project. However, I noticed that you wrote 'The pretraining training code is temporarily not provided.' in README.md, so I was wondering whether my understanding of the classification part is right. Here is my assumption:

  1. Use the subset of the MS-Celeb-1M dataset to train the ID_encoder part.
  2. Use the optimize_parameters_no_generation() function in Gen_final_v1.py and LRW dataset to train the lip_feature_encoder, mfcc_encoder and model_fusion part.
    Moreover, when I read and try to train the model using train.py, I find some small errors. For example, opt.isTrain and opt.eval_freq are not defined in Options.py, and pair in lip_reading_loader() should be (2, 25), since there are only 24 files in /mfcc20. So I would like to know whether you will update the project later, which would be of great help to me.

RuntimeError: cuda runtime error (2) : out of memory

Hi
I faced some problems when running this project.

.......... Networks initialized ..........
==> Loading checkpoint './checkpoints/101_DAVS_checkpoint.pth.tar'
==> Loaded checkpoint './checkpoints/101_DAVS_checkpoint.pth.tar' (step 21145000)
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1511304568725/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
File "test_all.py", line 43, in
model.test_train()
File "/home/downloads/gp/Talkingface/Test_Gen_Models/Test_Video_Model.py",
line 88, in test_train self.optimizer_G_test.step()
File "/home/anaconda2/lib/python2.7/site-packages/torch/optim/adam.py", line 68, in step denom = exp_avg_sq.sqrt().add_(group['eps'])

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1511304568725/work/torch/lib/THC/generic/THCStorage.cu:66

Can you tell me the model of your graphics card and the versions of CUDA and cuDNN?
thank you very much.

Unable to convert pretrained model into tensorflow model.

We are trying to deploy this project in an Android application. In order to do so, we need to convert the pretrained PyTorch model (checkpoint.pth.tar) into TensorFlow, but it shows an error about 'state_dict' unavailability, as shown in the attached picture.
[attached screenshot]

A_select in training code

Hi, thank you so much for sharing your codes! When I was going through the code, there were several things that confuse me a lot:

  1. I found this line that doesn't really make sense to me:

    A_select = random.randint(0, 28)

    It seems that here the code is picking the "input" frame, but why is the upper limit of the random function set to be 28? It seems that each training sample should only have 25 frames... Is this a typo?

  2. How do you actually process the audio & video inputs? More precisely, does each video frame correspond to 1/25 s of audio MFCC features?

  3. Also, videos in the LRW dataset contain not only the labeled word but also some other words. Is there any pre-processing you perform so that the network only focuses on that single word, or do you use the entire video clip?

Thanks a lot!!

Unable to save the model using Train.py

To save the network, we are using a new Python file net.py which contains the code of Train.py up until the line:
model = Gen_Model.GenModel(opt)
after which we try to save the model using torch.save(), but the class imports Gen_final_v1, which imports embedding_utils, which causes the following error. Kindly help.
[attached screenshot]

Is audio video offset considered in LRW?

@Hangz-nju-cuhk LRW is used to train the model according to your paper, but there are audio-video offsets in LRW videos, and [11] used SyncNet when pre-processing the dataset to correct the offset.
Did you consider this problem when preparing the dataset?
Thank you!

[11] Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. You said that? arXiv preprint arXiv:1705.02966, 2017.
