talklip's Issues
Where does task state.pt come from?
Is it a typo or a bug?
In the paper, the implementation details state that
Audio waveforms are preprocessed to mel-spectrograms with hop length, window length, and number of mel bins of 12.5 ms, 50 ms, and 80.
But the hop length, window length, and number of mel bins are 10 ms, 25 ms, and 26 in the function 'def fre_audio' of "inf_demo.py" and in "class Talklipdata".
# train.py
L231: audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32) # [T, F]
# inf_demo.py
L160: audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32) # [T, F]
The code uses the default values of logfbank.
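To make the mismatch concrete, the hop and window lengths can be converted into samples and frames per second. This is only a sketch: it assumes a 16 kHz sample rate, and the helper name `frame_params` is hypothetical. The second set of values (10 ms hop, 25 ms window) corresponds to the `winstep=0.01` and `winlen=0.025` defaults of `logfbank` in python_speech_features.

```python
def frame_params(sample_rate, hop_ms, win_ms):
    """Convert hop/window lengths in milliseconds to samples and a frame rate."""
    hop_samples = round(sample_rate * hop_ms / 1000)
    win_samples = round(sample_rate * win_ms / 1000)
    frames_per_second = 1000 / hop_ms
    return hop_samples, win_samples, frames_per_second


SAMPLE_RATE = 16000  # assumption: 16 kHz audio

# Values quoted from the paper: 12.5 ms hop, 50 ms window.
paper = frame_params(SAMPLE_RATE, hop_ms=12.5, win_ms=50)
# Values implied by calling logfbank with its defaults: 10 ms hop, 25 ms window.
code = frame_params(SAMPLE_RATE, hop_ms=10, win_ms=25)

print("paper (hop, window, fps):", paper)
print("code  (hop, window, fps):", code)
```

Under these assumptions the two settings produce different frame rates (80 versus 100 audio frames per second), so they are not interchangeable without retraining or resampling the features.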
About the output files tmp.avi and tmp.mp4
Good morning,
This is a good project. I gave it a try and ran the following command:
python3 inf_demo.py --video_path ./input.mp4 --wav_path ./voice.wav --ckpt_path ./global_contrastive.pth --avhubert_root ./avhubert
It ran without error, but the output file tmp.avi is only 21 KB and does not play properly, and tmp.mp4 is 0 KB. Could you suggest what I did wrong or what I missed? Thanks.
The generated results are not aligned. Why do you need to adjust the bbx when post-processing the fused face?
In line 178 of the file utils/data_avhubert.py, adjusting the bbx causes the results to be misaligned.
omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'
I followed the instructions in the README but got the above error. Which version of fairseq did you use?
Demo code failed to run
lip_loss too large
Why is my lip_loss > 90 even for ground-truth videos?
Shouldn't the loss from LabelSmoothedCrossEntropyCriterion in fairseq be near -log(1/n)?
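As a point of reference for this question, the cross-entropy of a uniform prediction over n classes is log(n) per token. The sketch below computes that baseline and shows how a loss that is summed over tokens, rather than averaged, grows with sequence length. Whether the reported lip_loss is summed or averaged is an assumption here and is not confirmed in the thread; the vocabulary size and token count are made-up illustration values.

```python
import math


def uniform_ce(n_classes):
    """Cross-entropy (in nats) of a uniform prediction over n_classes."""
    return -math.log(1.0 / n_classes)


def summed_loss(per_token_loss, n_tokens):
    """Total loss when per-token losses are summed instead of averaged."""
    return per_token_loss * n_tokens


n_classes = 1000  # hypothetical vocabulary size
n_tokens = 20     # hypothetical number of target tokens in a batch

per_token = uniform_ce(n_classes)
total = summed_loss(per_token, n_tokens)
print(f"per-token baseline: {per_token:.3f}")
print(f"summed over {n_tokens} tokens: {total:.3f}")
```

If the logged value is a sum, dividing it by the number of target tokens gives a figure that can be compared against the -log(1/n) baseline.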
[Question] How to output a long-sequence video demo?
Have you tried long-sequence inference? I mean a video input with a resolution of 96x96 and a frame rate of 25 fps that is longer than 30 seconds. I ran your demo code and got abnormal results, which confused me. Do I need to modify the code? Looking forward to your reply!
The face in the output video is blurred
Hi, thanks for your great work! I tested talklip with my own video, but the generated face in the output video is blurred and shows a clear border against the background. The resolution of my test video is 1600x900.
Hello, how do I create the data for --word_root?
Request for sharing the pre-trained discriminator weights
Hi there,
Congrats on this awesome work!
I have been trying to fine-tune the model on a custom dataset, which needs to access the pre-trained discriminator model. Could you please share its weights?
Also, do I need the lip observer models during training?
Thanks in advance!
Average Confidence value range 1~2?
I used the LSE-C implementation in your code. I got 1.88 for my result, 2.00 for the wav2lip inference result, and 1.89 for the audio2head inference result. But the scores reported in other papers are in the range 6~7. What is wrong with my LSE-C results?
After executing an epoch during training, this error will appear. Has anyone encountered it?
Traceback (most recent call last):
File "/data/wwp/TalkLip-main/train.py", line 531, in train
average_sync_loss, valid_log = eval_model(data_loader['test'], avhubert, criterion, global_step, device, model['gen'], model['disc'], args.cont_w, recon_loss)
File "/data/wwp/TalkLip-main/train.py", line 592, in eval_model
lip_loss, sample_size, logs, enc_out = criterion(avhubert, sample)
File "/data/anaconda3/envs/wwp_talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/wwp/TalkLip-main/fairseq/criterions/label_smoothed_cross_entropy.py", line 79, in forward
net_output = model(**sample["net_input"])
File "/data/anaconda3/envs/wwp_talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/wwp/TalkLip-main/avhubert/hubert_asr.py", line 494, in forward
ft = self.freeze_finetune_updates <= self.num_updates
File "/data/anaconda3/envs/wwp_talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AVHubertSeq2Seq' object has no attribute 'num_updates'
Could you upload the ckpt of TalkLip_disc_qual?
Hi, thanks for your wonderful work.
Recently, I wanted to fine-tune the network on my dataset. Hence, I need the co-trained model['gen'] and model['disc'] checkpoints.
Why is the WER 66% when I use your checkpoint to train the model?
Hello! Thank you for your outstanding efforts.
I ran into some difficulties.
Why is the WER 66% when I use your checkpoints to train the model? Is it because I didn't use your lip observer ckpt?
How should I use the lip observer ckpt you provided?
Project Page ?
Hello, will there be a project page for this repo showing the visual quality of this method's results, so that it can be compared with other methods such as wav2lip? If not, could you upload one of the results so I can see the quality of this method?
Calculating WER using AV_hubert
Hi,
I am trying to calculate your model's WER performance on VoxCeleb2 using your provided AV_Hubert evaluation scripts. However, I do not understand what the '../datalist/test_partial.tsv' file in toavhform.py refers to.
Is 'datalist/' a directory in LRS2?
If so, can you please tell me what the expected output should be from toavhform.py so I can write it for voxceleb?
Also, I'm assuming the ground truth will have to be provided by us to calculate WER?
a
IndexError: list index out of range
The number of output video frames increases unexpectedly
I tried inf_demo.py, but I found that the frame count of the output video was doubled.
The input video file is 10s/25fps/250frames, but the output video file is 20s/25fps/501frames.
I found that the length of the audio features array is 501.
Maybe the audio and video frames are not aligned in my case. I am not sure whether your project has any fps or sample rate constraints.
Waiting for your reply, thank you.
You can find my input/output video/audio files at the following link.
I ran inf_demo.py with the following command:
python inf_demo.py --video_path ./input.mp4 --wav_path ./input.wav --ckpt_path ./checkpoints/global_contrastive.pth --avhubert_root /root/workspace/av_hubert
There is a bug in the wav data processing that has been overlooked.
If the input audio is multi-channel, the loaded wav data has shape [16k*t, X], where X is the number of channels.
Extracting the spectrogram at L160 then makes the temporal length about X times longer than expected (T*X instead of T).
https://github.com/Sxjdwang/TalkLip/blob/main/inf_demo.py#L160
So users need to ensure that the input audio has only one channel,
ffmpeg -i input.wav -ac 1 -ar 16000 output.wav # -ac sets the number of audio channels
or revise L160 as follows.
import numpy as np
from python_speech_features import logfbank
# wav_data and sample_rate come from the surrounding code in inf_demo.py
if len(wav_data.shape) > 1:
    # multi-channel input: keep only the first channel
    audio_feats = logfbank(wav_data[:, 0], samplerate=sample_rate).astype(np.float32) # [T, F]
else:
    audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32) # [T, F]
Some other issues hit the same bug, such as #7.
Additionally, train.py does not handle this situation either.
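Before running inference, the channel count and sample rate of the input can be checked up front. The sketch below uses Python's standard `wave` module, which only reads uncompressed PCM WAV files; the helper names `wav_info` and `is_mono_16k` are hypothetical and not part of the repository.

```python
import wave


def wav_info(path):
    """Return (channels, sample_rate, n_frames) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getframerate(), wf.getnframes()


def is_mono_16k(path):
    """True if the file has exactly one channel sampled at 16 kHz."""
    channels, sample_rate, _ = wav_info(path)
    return channels == 1 and sample_rate == 16000
```

For example, `is_mono_16k("./input.wav")` returning False would suggest converting the file with the ffmpeg command above before passing it to the demo script.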
Discriminator Forward Pass
I'd like to thank the authors for their incredible effort.
I have a question regarding the discriminator forward pass, specifically the get_lower_half() function. Shouldn't this function return the lower half along both width and height dimensions? It seems to be doing it along one axis only, so the returned tensor would be of shape: [N, 3, 48, 96]. I would appreciate your clarification on this!
Severe Blur in the mouth area
Dear Sir or Madam,
Thanks for open-sourcing this project. I appreciate that.
But I cannot get a sensible result. Most of the time there is severe blur in the mouth area, as the following video shows.
learn-english-00083.mp4
I assume this is because only one reference identity image is given as input. It shows either an open mouth or a closed mouth, so within a single generation pass the network cannot get the identity features of both the open-mouth and closed-mouth face, which leads to heavy blur.
Please correct me if I am wrong.
size mismatch for audio_encoder.w2v_model.encoder.layers: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
May I ask whether decoder_embed_dim was 768 when lip_reading_expert.pt was fine-tuned? In the original avhubert fine-tuning setup, decoder_embed_dim is 1024.
Output frames are not synced with the audio
Why is the output video not synced with the audio? In my case its fps is also 1/3 of the original video's fps. I ran it on a video whose fps is 24.0.
checkpoint
Sorry, I'm slightly confused by the wording of the document. Just to be clear: no pre-trained checkpoint is available, correct?
About lip-reading expert
Thank you very much for your work.
I would like to ask how you trained the lip-reading expert, which I could not find in your code or paper.
Did you fine-tune avhubert on LRS2, and is the resulting weight the lip-reading expert weight?
And what is your lip-reading observer weight responsible for?
Finally, thank you again for your work.
Does anyone have a demo video for the demonstration?
Does anyone have a demo video for the demonstration?
LRS2 and LRW permission request
It seems my university email fails to deliver the request to the overseas address. I would appreciate it if someone could kindly share the dataset, or a few samples from it, with me. Thank you very much!
Contact email:[email protected]
[Bug] TypeError: 'NoneType' object is not subscriptable in "utils/data_avhubert.py"
Hi! I encountered the following error when using the demo for inference:
Traceback (most recent call last):
File "inf_demo.py", line 284, in <module>
synt_demo(fa, device, model, args)
File "inf_demo.py", line 248, in synt_demo
for j, im in enumerate(processed_img[0]):
TypeError: 'NoneType' object is not subscriptable
I checked that processed_img comes from the emb_roi2im function in utils/data_avhubert.py, which I suspect is missing its return value.
def emb_roi2im(pickedimg, imgs, bbxs, pre, device):
    trackid = 0
    height, width, _ = imgs[0][0].shape
    for i in range(len(pickedimg)):
        idimg = pickedimg[i]
        imgs[i] = imgs[i].float().to(device)
        for j in range(len(idimg)):
            bbx = bbxs[i][idimg[j]]
            # clamp the box to the frame size
            if bbx[2] > width: bbx[2] = width
            if bbx[3] > height: bbx[3] = height
            resize2ori = transforms.Resize([bbx[3] - bbx[1], bbx[2] - bbx[0]])
            try:
                resized = resize2ori(pre[trackid + j] * 255.).permute(1, 2, 0)
                imgs[i][idimg[j]][bbx[1]:bbx[3], bbx[0]:bbx[2], :] = resized
            except:
                print(bbx, resized.shape)
                import sys
                sys.exit()
        trackid += len(idimg)
    # no return statement here, so the caller receives None
On a hunch, I added return imgs at the end of the function, and the inference code then ran normally.
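The failure mode described here is general Python behavior: a function that only mutates its arguments and has no return statement evaluates to None, so indexing the result raises this TypeError. A minimal sketch with hypothetical helper names and plain lists in place of image tensors:

```python
def paste_in_place(frames, patch):
    """Mutates each frame but has no return statement, so the call yields None."""
    for frame in frames:
        frame[0] = patch


def paste_and_return(frames, patch):
    """Same mutation, but hands the frames back to the caller."""
    for frame in frames:
        frame[0] = patch
    return frames


frames = [[0, 0], [0, 0]]
result = paste_in_place(frames, 9)       # frames is modified, result is None
fixed = paste_and_return(frames, 7)      # fixed refers to the modified frames

try:
    result[0]
except TypeError as err:
    print("indexing the result fails:", err)
print("with the return statement:", fixed[0])
```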
Effect of FaceFormer in the paper
Hi, I noticed that the qualitative results shown in the paper include FaceFormer. But as far as I know, that is a 3D-mesh animation algorithm.
May I ask which code base is the one you used in the paper? Did you directly use the official FaceFormer code and adjust video encoder and decoder and retrained it?
Looking forward to your reply.
Best.
Hello, I trained a digital person, and there is a square patch covering their face. How can I remove it?
50001.mp4
The paper does not report quantitative results for TalkLip (l + g + c)
Inconsistency Between Input and Output Video
Hi, great work!
I have a question. I managed to run the inference script that you provided.
However, I observed that the output dubbed video and the input video are no longer synced.
That is, if I combine these two videos with ffmpeg or any other package, the output dubbed video lags behind the original video.
For reference, I am attaching the input video and the output dubbed video (top:input, bottom:dubbed) concatenated with the input video.
Input Video
https://github.com/Sxjdwang/TalkLip/assets/26086758/588107c0-ac33-4e4a-9bc2-41e06eb3699f
Output Dubbed Video
https://github.com/Sxjdwang/TalkLip/assets/26086758/aa83e095-72be-4f45-9618-62913dddd362
Do you have any idea about what would be the reason for that?
Thx in advance!
We have created a Chinese-language discussion group; if you are interested, add me on WeChat: douzijun1999
1705126444.mp4
how to install avhubert?
When I try to run the demo, it shows the error "no module avhubert", but I don't know how to install avhubert.
training script
Hello,
Thank you for the great work! Any news on when the training script will be released?
Task state has no factory for attribute target_dictionary
My avhubert runs without any problem, but when I use the fairseq inside it to run the full training, it reports an error: AttributeError: Task state has no factory for attribute target_dictionary. May I ask which version of fairseq you are using? My fairseq version is 1.0.0a0+afc77bd
[BUG] A bug I found in the function audio_visual_pad
Excellent work, but I found some exceptions when running the demo. The original code is as follows:
def audio_visual_pad(audio_feats, video_feats):
    diff = len(audio_feats) - len(video_feats)
    repeat = 1
    if diff > 0:
        repeat = math.ceil(len(audio_feats) / len(video_feats))
        video_feats = torch.repeat_interleave(video_feats, repeat, dim=0)
    diff = len(audio_feats) - len(video_feats)
    video_feats = video_feats[:diff]
    return video_feats, repeat, diff
As I understand it, this code makes the audio features and video features equal in length. It first checks diff: if diff is greater than 0, the audio features are longer than the video features, so the video features are repeated. It then recomputes the length difference and slices the video features down to the audio length, because after repetition the video features may be longer than the audio features. However, the final slice should only be applied when the video features are actually longer than the audio features. Otherwise, if the two lengths are exactly equal, diff is 0 and video_feats[:0] is empty, so meaningless results are returned.
This is the result after my modification:
def audio_visual_pad(audio_feats, video_feats):
    diff = len(audio_feats) - len(video_feats)
    repeat = 1
    if diff > 0:
        repeat = math.ceil(len(audio_feats) / len(video_feats))
        video_feats = torch.repeat_interleave(video_feats, repeat, dim=0)
    diff = len(audio_feats) - len(video_feats)
    if diff < 0:
        video_feats = video_feats[:diff]
    return video_feats, repeat, diff
Looking forward to your reply!
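The edge case can be reproduced without torch. The sketch below mirrors the modified function using plain Python lists; the list comprehension is a stand-in for `torch.repeat_interleave(video_feats, repeat, dim=0)`, and the name `pad_video_to_audio` is hypothetical.

```python
import math


def pad_video_to_audio(audio_feats, video_feats):
    """List-based sketch of the guarded version of audio_visual_pad."""
    diff = len(audio_feats) - len(video_feats)
    repeat = 1
    if diff > 0:
        repeat = math.ceil(len(audio_feats) / len(video_feats))
        # Repeat each video feature `repeat` times, keeping the order.
        video_feats = [v for v in video_feats for _ in range(repeat)]
        diff = len(audio_feats) - len(video_feats)
    if diff < 0:
        # Only trim when the video features are longer than the audio features.
        video_feats = video_feats[:diff]
    return video_feats, repeat, diff


# Equal lengths: an unguarded slice with diff == 0 would be video[:0], i.e. empty.
video, repeat, diff = pad_video_to_audio([0] * 4, ["a", "b", "c", "d"])
print(len(video), repeat, diff)
```

The same guard also covers the case where repetition happens to produce exactly the audio length, for example 4 audio frames and 2 video frames.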
File list of LRS2
Could you please provide me with the file_list of the LRS2 dataset? I've found it a bit challenging to create it myself.
Runtime error with long videos >30s
Short videos work fine, but videos longer than ~20 s produce an error:
Video - 480x840 30fps Windows 11, 1080ti
RuntimeError: CUDA out of memory. Tried to allocate 3.29 GiB (GPU 0; 11.00 GiB total capacity; 7.10 GiB already allocated; 1.09 GiB free; 8.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
Is there a way to reduce the use of GPU memory?
I have reduced the batch size for face_detection, but that does not seem to be enough when running av_hubert. Is there any way to fix this?
Traceback (most recent call last):
File "inf_demo.py", line 280, in <module>
synt_demo(fa, device, model, args)
File "inf_demo.py", line 237, in synt_demo
processed_img = emb_roi2im([idAudio], imgs, bbxs, prediction, device)
File "/data/home/ss/TalkLip/utils/data_avhubert.py", line 174, in emb_roi2im
imgs[i] = imgs[i].float().to(device)
RuntimeError: CUDA out of memory. Tried to allocate 23.75 GiB (GPU 0; 14.76 GiB total capacity; 960.37 MiB already allocated; 4.07 GiB free; 9.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
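One general way to bound peak memory is to process frames in fixed-size chunks instead of moving the whole video to the GPU at once. The sketch below only produces the index ranges; the helper name `chunk_ranges` and the chunk size are hypothetical, and this is not how the repository's script works. Whether the model gives the same output when run on chunks has not been verified.

```python
def chunk_ranges(n_items, chunk_size):
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


# Hypothetical usage: 900 frames handled 128 at a time.
for start, end in chunk_ranges(900, 128):
    # frames[start:end] would be moved to the device, processed, and
    # copied back to the CPU here before the next chunk is loaded.
    pass
```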
High resolution video cause cuda out of memory
Dear authors, thank you for your wonderful project!
When trying to run inf_demo.py on a 10s (30fps) video with a resolution of 1920 * 1080, I encountered a CUDA out of memory error. I was running the script on an NVIDIA RTX A6000 with 48GB of memory, and I thought that would be enough to run inference on a short 1080p video. Could you tell me what I am missing?
Has anybody come across this error?
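A rough estimate helps explain why full-resolution frames are costly. The traceback in a related issue shows the frames being converted with `imgs[i].float().to(device)`, and float32 storage takes four times the space of uint8. The sketch below only counts the raw frame buffer for the video described here; model activations, the face detector, and temporary copies are not included, so actual usage will be higher.

```python
def frames_bytes(n_frames, height, width, channels=3, bytes_per_value=4):
    """Bytes needed to hold all frames as one dense array."""
    return n_frames * height * width * channels * bytes_per_value


GIB = 1024 ** 3
n_frames = 10 * 30  # 10 s at 30 fps

as_uint8 = frames_bytes(n_frames, 1080, 1920, bytes_per_value=1)
as_float32 = frames_bytes(n_frames, 1080, 1920, bytes_per_value=4)

print(f"uint8 frames:   {as_uint8 / GIB:.2f} GiB")
print(f"float32 frames: {as_float32 / GIB:.2f} GiB")
```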
@Sxjdwang
python3 inf_demo.py --video_path ./data/jtest.mp4 --wav_path ./data/jtest.wav --ckpt_path ./global_contrastive.pth --avhubert_root ./av_hubert/
Traceback (most recent call last):
File "inf_demo.py", line 280, in <module>
synt_demo(fa, device, model, args)
File "inf_demo.py", line 234, in synt_demo
prediction, _ = model(sample, inps, idAudio, spectrogram.shape[0])
File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/ai/lzhh/digital_human/TalkLip/models/talklip.py", line 103, in forward
enc_out = self.audio_encoder(**sample["net_input"])
File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "./av_hubert/avhubert/hubert_asr.py", line 386, in forward
x, padding_mask = self.w2v_model.extract_finetune(**w2v_args)
File "./av_hubert/avhubert/hubert.py", line 704, in extract_finetune
features_audio = self.forward_features(src_audio, modality='audio') # features: [B, F, T]
File "./av_hubert/avhubert/hubert.py", line 541, in forward_features
features = extractor(source)
File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "./av_hubert/avhubert/hubert.py", line 327, in forward
x = self.proj(x.transpose(1, 2))
File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (250x160 and 104x768)
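The 104 in the weight shape matches four stacked frames of 26-dimensional log filterbank features (4 x 26 = 104), and 160 on the input side would correspond to 40 features per frame if four frames were stacked. Both readings are inferences from the shapes in the error message and are not confirmed in the thread. The sketch below illustrates frame stacking with plain lists; the function name `stack_frames` is hypothetical.

```python
def stack_frames(feats, stack_order):
    """Concatenate every `stack_order` consecutive frames into one longer frame.

    feats is a list of equally sized frames (lists of floats). The tail is
    zero-padded so the number of frames is a multiple of stack_order.
    """
    if not feats:
        return []
    feat_dim = len(feats[0])
    remainder = len(feats) % stack_order
    if remainder:
        feats = feats + [[0.0] * feat_dim] * (stack_order - remainder)
    return [
        [x for frame in feats[i:i + stack_order] for x in frame]
        for i in range(0, len(feats), stack_order)
    ]


# 10 s of audio at 100 frames per second with 26 features per frame.
frames = [[0.0] * 26 for _ in range(1000)]
stacked = stack_frames(frames, 4)
print(len(stacked), len(stacked[0]))
```

Under these assumptions, checking the second dimension of the audio features before calling the model would show whether the feature extraction settings match what the checkpoint expects.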
IndexError: index 0 is out of bounds for dimension 0 with size 0
/TalkLip/utils/data_avhubert.py", line 172, in emb_roi2im
width = imgs[0][0].shape[1]
IndexError: index 0 is out of bounds for dimension 0 with size 0
It shows up when I try to test my own data. Can anyone help with that? Thanks!
AttributeError: 'AVHubertSeq2Seq' object has no attribute 'num_updates'
I ran the training script according to readme.md and got the following error:
Traceback (most recent call last):
File "train.py", line 740, in <module>
train(device, {'gen': imGen, 'disc': imDisc}, avhubert, criterion, {'train': train_data_loader, 'test': test_data_loader},
File "train.py", line 531, in train
average_sync_loss, valid_log = eval_model(data_loader['test'], avhubert, criterion, global_step, device, model['gen'], model['disc'], args.cont_w, recon_loss)
File "train.py", line 592, in eval_model
lip_loss, sample_size, logs, enc_out = criterion(avhubert, sample)
File "/data/anaconda3/envs/talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/ljy/TalkLip/av_hubert/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 79, in forward
net_output = model(**sample["net_input"])
File "/data/anaconda3/envs/talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/ljy/TalkLip/av_hubert/avhubert/hubert_asr.py", line 494, in forward
ft = self.freeze_finetune_updates <= self.num_updates
File "/data/anaconda3/envs/talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AVHubertSeq2Seq' object has no attribute 'num_updates'
Can someone help to see where the problem is? Thanks!
The command used to run the script is as follows:
python train.py --file_dir /data/wwp/dataset/LRS2 --video_root /data/wwp/dataset/LRS2/mvlrs_v1/main --audio_root /data/wwp/dataset/LRS2/valid_audio \
--bbx_root /data/wwp/dataset/LRS2/valid_bbx --word_root /data/wwp/dataset/LRS2/mvlrs_v1/main --avhubert_root ./av_hubert/avhubert --avhubert_path /data/ljy/checkpoints/TalkLip/lip_reading_expert.pt \
--checkpoint_dir ./checkpoints/ --log_name log_talklip_01 --n_epoch 10 --ckpt_interval 50