sxjdwang / talklip Goto Github PK

Python 92.42% Shell 1.21% Perl 6.38%

talklip's Issues

Where does task state.pt come from?

about output file tmp.avi and tmp.mp4

Good morning,

This is a good project. I gave it a try and ran the following command:

python3 inf_demo.py --video_path ./input.mp4 --wav_path ./voice.wav --ckpt_path ./global_contrastive.pth --avhubert_root ./avhubert

It ran without errors, but the output file tmp.avi is only 21 KB and does not play properly, and tmp.mp4 is 0 KB. Could you suggest what I did wrong or what I missed? Thanks.

The results of the generation are not aligned. Why do you need to adjust the bbx when post-processing the fused face?

In line 178 of the file utils/data_avhubert.py, adjusting bbx causes the results to be misaligned.

omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'

I followed the instructions in the README but got the above error. Which version of fairseq did you use?

Demo code failed to run

After loading the pretrained weights in the demo, an error message appears indicating that the pretrained parameters do not match the model. How can I solve this problem?

lip_loss too large

Why is my lip_loss > 90 even for ground-truth videos?
Shouldn't the LabelSmoothedCrossEntropyCriterion loss in fairseq be near -log(1/n)?
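
One way to sanity-check the numbers in this question, sketched under assumptions: for a uniform prediction over n classes, the cross-entropy per target token is -log(1/n) = log(n). Stock fairseq's label-smoothed cross-entropy criterion returns the loss summed over target tokens, together with a sample_size that is used for normalisation later; whether the criterion used in this repository behaves the same way is an assumption here. The vocabulary size and token count below are hypothetical.

```python
import math

def uniform_ce_per_token(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over vocab_size classes."""
    return -math.log(1.0 / vocab_size)

def per_token_loss(summed_loss, sample_size):
    """Normalise a loss that was summed over tokens by the token count."""
    return summed_loss / sample_size

# Hypothetical numbers, for illustration only.
vocab_size = 1000
num_target_tokens = 15
summed = uniform_ce_per_token(vocab_size) * num_target_tokens

print(f"per-token baseline: {uniform_ce_per_token(vocab_size):.2f}")
print(f"summed over {num_target_tokens} tokens: {summed:.2f}")
print(f"after normalisation: {per_token_loss(summed, num_target_tokens):.2f}")
```

With these illustrative numbers the summed value exceeds 90 while the per-token value stays at log(n), so dividing lip_loss by sample_size before comparing it with -log(1/n) may be worth trying.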

[Question] How to output a long-sequence video demo?

Have you tried long-sequence inference? I mean a video input with a resolution of 96x96 and a frame rate of 25 fps that is longer than 30 seconds. I ran your demo code and got abnormal results, which confused me. Do I need to modify anything? Looking forward to your reply!

the face in output video is blurred

Hi, thanks for your great work! I tested talklip with my own video, but the generated face in the output video is blurred and shows a clear border against the background. The resolution of my test video is 1600x900.

Hello, how do I create the data for --word_root?

Request for sharing the pre-trained discriminator weights

Hi there,

Congrats on this awesome work!

I have been trying to fine-tune the model on a custom dataset, which needs to access the pre-trained discriminator model. Could you please share its weights?

Also, do I need the lip observer models during training?

Thanks in advance!

Average Confidence value range 1~2?

I used the LSE-C code from your repository. I got 1.88 for my result, 2.00 for the wav2lip inference result, and 1.89 for the audio2head inference result. However, the scores reported in other papers are in the 6~7 range. What is wrong with my LSE-C result?

This error appears after one epoch of training. Has anyone encountered it?

Traceback (most recent call last):
  File "/data/wwp/TalkLip-main/train.py", line 531, in train
    average_sync_loss, valid_log = eval_model(data_loader['test'], avhubert, criterion, global_step, device, model['gen'], model['disc'], args.cont_w, recon_loss)
  File "/data/wwp/TalkLip-main/train.py", line 592, in eval_model
    lip_loss, sample_size, logs, enc_out = criterion(avhubert, sample)
  File "/data/anaconda3/envs/wwp_talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/wwp/TalkLip-main/fairseq/criterions/label_smoothed_cross_entropy.py", line 79, in forward
    net_output = model(**sample["net_input"])
  File "/data/anaconda3/envs/wwp_talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/wwp/TalkLip-main/avhubert/hubert_asr.py", line 494, in forward
    ft = self.freeze_finetune_updates <= self.num_updates
  File "/data/anaconda3/envs/wwp_talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AVHubertSeq2Seq' object has no attribute 'num_updates'

Could you upload the ckpt of TalkLip_disc_qual?

Hi, thanks for your wonderful work.
Recently, I wanted to fine-tune the network on my dataset. Hence, I need the co-trained model['gen'] and model['disc'] checkpoints.

Why is the WER 66% when I use your checkpoint to train the model?

Hello! Thank you for your outstanding efforts.
I have run into some difficulties.
Why is the WER 66% when I use your checkpoints to train the model? Is it because I didn't use your lip observer ckpt?
How should I use the lip observer ckpt you provided?

Project page?

Hello, will there be a project page for this repo showing the visual quality of this method's results, so that it can be compared with other methods such as wav2lip? If not, could you upload one of the results so I can see the quality of this method?

For example you can upload either one of those videos

Calculating WER using AV_hubert

Hi,
I am trying to calculate your model's WER performance on VoxCeleb2 using your provided AV_Hubert evaluation scripts. However, I do not understand what the '../datalist/test_partial.tsv' file in toavhform.py refers to.

Is 'datalist/' a directory in LRS2?
If so, can you please tell me what the expected output should be from toavhform.py so I can write it for voxceleb?

Also, I'm assuming the ground truth will have to be provided by us to calculate WER?

a

IndexError: list index out of range

After training starts, this error is reported. It sometimes occurs after more than 2000 training steps and sometimes after only a few hundred.

The number of output video frames increases unexpectedly

I have tried with inf_demo.py, but I found that the frame count of the output video was doubled.

The input video file is 10s/25fps/250frames, but I found the duration of the output video file is 20s/25fps/501frames.

I found that the length of the audio features array is 501.

Maybe the audio and video frames are not aligned in my case. I am not sure whether your project has any fps or sample-rate constraints.

Waiting for your reply, thank you.

You can find my input/output video/audio files at the following link.

talklip-issue.zip

I run the inf_demo.py with the following command:
python inf_demo.py --video_path ./input.mp4 --wav_path ./input.wav --ckpt_path ./checkpoints/global_contrastive.pth --avhubert_root /root/workspace/av_hubert

ffmpeg version is 4.2.3:

some debug logs:

There is a bug in the wav data processing that has been overlooked.

If the input audio is multi-channel, the loaded wav data will have shape [16k*t, X], where X is the number of channels.
Using L160 to extract the spectrogram then makes the result X times longer along the time axis.

https://github.com/Sxjdwang/TalkLip/blob/main/inf_demo.py#L160

So users need to ensure that the input audio has only one channel, for example with:

ffmpeg -i input.wav -ac 1 -ar 16000 output.wav  # -ac sets the number of audio channels

or revise L160 as follows.

from python_speech_features import logfbank
if len(wav_data.shape)>1:
    audio_feats = logfbank(wav_data[:,0], samplerate=sample_rate).astype(np.float32)  # [T, F]
else:
    audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32)  # [T, F]

Some other issues hit the same bug, such as #7.

Additionally, train.py does not handle this situation either.
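
A small pre-flight check along these lines could catch the problem before running inf_demo.py. It is a sketch using only Python's standard wave module (which reads PCM WAV files); the helper names, the 16 kHz target rate taken from the ffmpeg command above, and the 25 fps default are assumptions for illustration, not part of the repository.

```python
import wave

def wav_info(path):
    """Return (channels, sample_rate, frames_per_channel) from a WAV header."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getframerate(), wf.getnframes()

def check_wav(path, expected_rate=16000):
    """List the ways the file differs from mono audio at expected_rate."""
    channels, rate, _ = wav_info(path)
    problems = []
    if channels != 1:
        problems.append(f"{channels} channels (expected 1)")
    if rate != expected_rate:
        problems.append(f"sample rate {rate} (expected {expected_rate})")
    return problems

def approx_video_frames(frames_per_channel, channels, sample_rate, fps=25):
    """Rough frame count if every extra channel were treated as extra time."""
    return round(frames_per_channel * channels / sample_rate * fps)
```

Calling check_wav("input.wav") returns an empty list when the file is already mono at 16 kHz. For a 10 s, 16 kHz clip, approx_video_frames gives 250 frames for mono input and 500 for stereo, which is roughly in line with the doubled frame count reported in the other issue.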

Discriminator Forward Pass

I'd like to thank the authors for their incredible effort.

I have a question regarding the discriminator forward pass, specifically the get_lower_half() function. Shouldn't this function return the lower half along both width and height dimensions? It seems to be doing it along one axis only, so the returned tensor would be of shape: [N, 3, 48, 96]. I would appreciate your clarification on this!
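
For reference, a crop along the height axis alone is what would produce the [N, 3, 48, 96] shape mentioned above: it keeps the bottom rows (where the mouth sits) at full width. The sketch below shows this on a single-channel image stored as nested lists; the function name lower_half is hypothetical, and whether this matches the intent of the repository's get_lower_half() is for the authors to confirm.

```python
def lower_half(image_rows):
    """Keep the bottom half of the rows; every row keeps its full width."""
    return image_rows[len(image_rows) // 2:]

# A 96x96 single-channel image where each pixel stores its row index.
image = [[row] * 96 for row in range(96)]
cropped = lower_half(image)

print(len(cropped), len(cropped[0]))  # height and width after cropping
print(cropped[0][0])                  # row index of the first kept row
```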

Severe Blur in the mouth area

Dear Sir or Madam,

Thanks for open-sourcing this project; I appreciate it.

However, I cannot get a sensible result. Most of the time there is severe blur in the mouth area, as the following video shows.

learn-english-00083.mp4

I assume this is because only one reference identity image is given as input, and its mouth is either open or closed. Within a single generation pass, the network therefore cannot capture the identity features of both the open-mouth and the closed-mouth face, which leads to heavy blur.

Please correct me if I am wrong.

size mismatch for audio_encoder.w2v_model.encoder.layers: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).

May I ask whether decoder_embed_dim was 768 when lip_reading_expert.pt was fine-tuned? The original avhubert fine-tuning uses a decoder_embed_dim of 1024.

Output frames are not synced with audio

Why is the output video not synced with the audio? In my case it plays at 1/3 of the fps of the original video. I ran it on a video whose fps is 24.0.

checkpoint

Sorry, I'm slightly confused by the wording of the document. Just to be clear: no pre-trained checkpoint is available, correct?

About lip-reading expert

Thank you very much for your work.
I would like to ask how you trained the lip-reading expert, since I could not find this in your code or paper.
Did you fine-tune avhubert on LRS2 and use the resulting weights as the lip-reading expert?
And what is the lip-reading observer weight responsible for?
Finally, thank you again for your work.

Does anyone have a demo video?

LRS2 and LRW permission request

It seems that my university email fails to deliver the request to the overseas address. I would appreciate it if someone could share the dataset, or a few samples from it, with me. Thank you very much!

Contact email: [email protected]

[Bug] TypeError: 'NoneType' object is not subscriptable in "utils/data_avhubert.py"

Hi! I encountered the following error when using the demo for inference:

Traceback (most recent call last):
  File "inf_demo.py", line 284, in <module>
    synt_demo(fa, device, model, args)
  File "inf_demo.py", line 248, in synt_demo
    for j, im in enumerate(processed_img[0]):
TypeError: 'NoneType' object is not subscriptable

I checked that processed_img comes from the emb_roi2im function in utils/data_avhubert.py, which I suspect is missing a return value.

def emb_roi2im(pickedimg, imgs, bbxs, pre, device):
    trackid = 0
    height, width, _ = imgs[0][0].shape
    for i in range(len(pickedimg)):
        idimg = pickedimg[i]
        imgs[i] = imgs[i].float().to(device)
        for j in range(len(idimg)):
            bbx = bbxs[i][idimg[j]]
            if bbx[2] > width: bbx[2] = width
            if bbx[3] > height: bbx[3] = height
            resize2ori = transforms.Resize([bbx[3] - bbx[1], bbx[2] - bbx[0]])
            try:
                resized = resize2ori(pre[trackid + j] * 255.).permute(1, 2, 0)
                imgs[i][idimg[j]][bbx[1]:bbx[3], bbx[0]:bbx[2], :] = resized
            except:
                print(bbx, resized.shape)
                import sys
                sys.exit()
        trackid += len(idimg)

Following my intuition, I added "return imgs" at the end of the function, and the inference code then ran normally.
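
The failure mode described here can be reproduced in isolation. In Python, a function that modifies its argument in place but has no return statement returns None, so indexing its result raises the same TypeError. The sketch below uses hypothetical stand-in functions operating on plain lists, not the real emb_roi2im.

```python
def paste_without_return(frames):
    """Edits frames in place but forgets to return them."""
    for i in range(len(frames)):
        frames[i] = frames[i] * 2

def paste_with_return(frames):
    """Same in-place edit, but hands the list back to the caller."""
    for i in range(len(frames)):
        frames[i] = frames[i] * 2
    return frames

result = paste_without_return([1, 2, 3])  # implicitly None
try:
    result[0]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

fixed = paste_with_return([1, 2, 3])
print(result, error_message)
print(fixed)
```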

Effect of FaceFormer in the paper

Hi, I saw that the qualitative results in the paper include FaceFormer. But as far as I know, FaceFormer is a 3D-mesh animation algorithm.

May I ask which code base you used in the paper? Did you take the official FaceFormer code, adjust the video encoder and decoder, and retrain it?

Looking forward to your reply.
Best.

Hello, I trained a digital human, and a square patch covers its face. How can I remove it?

50001.mp4

The paper does not report quantitative results for TalkLip (l + g + c)

Inconsistency Between Input and Output Video

Hi, great work!
I have a question. I managed to run the inference script that you provided.
However, I observed that the output dubbed video and the input video are no longer synced.
That is, if I combine these two videos with ffmpeg or any other package, the output dubbed video lags behind the original video.
For reference, I am attaching the input video and the output dubbed video (top:input, bottom:dubbed) concatenated with the input video.

Input Video
https://github.com/Sxjdwang/TalkLip/assets/26086758/588107c0-ac33-4e4a-9bc2-41e06eb3699f

Output Dubbed Video
https://github.com/Sxjdwang/TalkLip/assets/26086758/aa83e095-72be-4f45-9618-62913dddd362

Do you have any idea about what would be the reason for that?

Thx in advance!

We have created a Chinese-language discussion group; if you are interested, add me on WeChat: douzijun1999

1705126444.mp4

how to install avhubert?

When I try to run the demo, it shows the error "no module avhubert", but I don't know how to install avhubert.

training script

Hello,

Thank you for the great work! Any news on when the training script will be released?

Task state has no factory for attribute target_dictionary

My avhubert runs without any problem, but when I use the fairseq bundled with it to run the full training, it reports an error: AttributeError: Task state has no factory for attribute target_dictionary. May I ask which version of fairseq you are using? My fairseq version is 1.0.0a0+afc77bd.

[BUG] A bug I found in the function audio_visual_pad

Excellent work, but I found some exceptions when running the demo. The original code is as follows:

def audio_visual_pad(audio_feats, video_feats):
    diff = len(audio_feats) - len(video_feats)
    repeat = 1
    if diff > 0:
        repeat = math.ceil(len(audio_feats) / len(video_feats))
        video_feats = torch.repeat_interleave(video_feats, repeat, dim=0)
    diff = len(audio_feats) - len(video_feats)
    video_feats = video_feats[:diff]
    return video_feats, repeat, diff

As I understand it, this function makes the audio features and video features equal in length. It first checks diff: if diff is greater than 0, the audio features are longer than the video features, so the video features are repeated. It then recomputes the length difference and slices the video features down to the audio length, because after repetition the video features may be longer than the audio features. However, this final slice should only run when the video features are actually longer than the audio features. Otherwise, when the input video and audio features are exactly equal in length, diff is 0 and video_feats[:0] returns an empty result.
This is my modified version:

def audio_visual_pad(audio_feats, video_feats):
    diff = len(audio_feats) - len(video_feats)
    repeat = 1
    if diff > 0:
        repeat = math.ceil(len(audio_feats) / len(video_feats))
        video_feats = torch.repeat_interleave(video_feats, repeat, dim=0)
    diff = len(audio_feats) - len(video_feats)
    if diff < 0:
        video_feats = video_feats[:diff]
    return video_feats, repeat, diff

Looking forward to your reply!
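
The edge case can be checked without a GPU or PyTorch. The sketch below mirrors the modified function on plain Python lists; the repeat_interleave helper is a hypothetical stand-in for torch.repeat_interleave along dim 0, so this only illustrates the length logic, not the real tensor code.

```python
import math

def repeat_interleave(items, repeat):
    """List stand-in for torch.repeat_interleave(x, repeat, dim=0)."""
    return [item for item in items for _ in range(repeat)]

def audio_visual_pad(audio_feats, video_feats):
    diff = len(audio_feats) - len(video_feats)
    repeat = 1
    if diff > 0:
        repeat = math.ceil(len(audio_feats) / len(video_feats))
        video_feats = repeat_interleave(video_feats, repeat)
    diff = len(audio_feats) - len(video_feats)
    if diff < 0:  # trim only when video is longer; [:0] would drop everything
        video_feats = video_feats[:diff]
    return video_feats, repeat, diff

# Equal lengths: nothing should be repeated or trimmed.
equal = audio_visual_pad(list(range(4)), ["a", "b", "c", "d"])
# Video shorter than audio: repeat each frame, then trim the overshoot.
shorter = audio_visual_pad(list(range(5)), ["a", "b"])
print(equal)
print(shorter)
```

Without the "if diff < 0" guard, the equal-length call would slice with [:0] and return an empty list.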

File list of LRS2

Could you please provide me with the file_list of the LRS2 dataset? I've found it a bit challenging to create it myself

Runtime error with long videos >30s

Short videos work fine, but videos longer than ~20 s produce an error:
Video: 480x840, 30 fps; Windows 11; 1080 Ti
RuntimeError: CUDA out of memory. Tried to allocate 3.29 GiB (GPU 0; 11.00 GiB total capacity; 7.10 GiB already allocated; 1.09 GiB free; 8.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Is there a way to reduce the use of GPU memory?

I have reduced the batch size for face_detection, but that does not seem to be enough when running av_hubert. Is there any way to fix it?
Traceback (most recent call last):
  File "inf_demo.py", line 280, in <module>
    synt_demo(fa, device, model, args)
  File "inf_demo.py", line 237, in synt_demo
    processed_img = emb_roi2im([idAudio], imgs, bbxs, prediction, device)
  File "/data/home/ss/TalkLip/utils/data_avhubert.py", line 174, in emb_roi2im
    imgs[i] = imgs[i].float().to(device)
RuntimeError: CUDA out of memory. Tried to allocate 23.75 GiB (GPU 0; 14.76 GiB total capacity; 960.37 MiB already allocated; 4.07 GiB free; 9.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

High resolution video cause cuda out of memory

Dear authors, thank you for your wonderful project!
When trying to run inf_demo.py on a 10 s (30 fps) video with a resolution of 1920 x 1080, I encountered a CUDA out of memory error. I was running the script on an NVIDIA RTX A6000 with 48 GB of memory, which I thought was enough for inference on a short 1080p video. Could you tell me what I am missing?
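
A back-of-the-envelope estimate may help explain reports like this one. The traceback in the previous issue fails at imgs[i].float().to(device), which suggests (this is an inference from that traceback, not something confirmed by the authors) that all full-resolution frames of a clip are moved to the GPU as float32 at once. Under that assumption, the frame tensor alone costs frames x height x width x 3 channels x 4 bytes, before any model activations or temporaries. The helper names below are hypothetical, and the chunking generator is only a generic pattern, not code from the repository.

```python
def float32_frames_gib(num_frames, height, width, channels=3):
    """Memory needed to hold the frames as float32, in GiB."""
    return num_frames * height * width * channels * 4 / 1024 ** 3

def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# The 10 s, 30 fps, 1920x1080 clip described above.
full_clip = float32_frames_gib(10 * 30, 1080, 1920)
per_chunk = float32_frames_gib(25, 1080, 1920)
print(f"whole clip: {full_clip:.2f} GiB, 25-frame chunk: {per_chunk:.2f} GiB")

# Splitting 300 frame indices into groups of 25 bounds the per-step footprint.
batches = list(chunks(list(range(300)), 25))
print(len(batches), len(batches[-1]))
```

Any intermediate copies of the same size multiply this figure, so processing frames in chunks or downscaling the input are two options that might reduce peak memory.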

Has anybody come across this error?

@Sxjdwang
python3 inf_demo.py --video_path ./data/jtest.mp4 --wav_path ./data/jtest.wav --ckpt_path ./global_contrastive.pth --avhubert_root ./av_hubert/
Traceback (most recent call last):
  File "inf_demo.py", line 280, in <module>
    synt_demo(fa, device, model, args)
  File "inf_demo.py", line 234, in synt_demo
    prediction, _ = model(sample, inps, idAudio, spectrogram.shape[0])
  File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/ai/lzhh/digital_human/TalkLip/models/talklip.py", line 103, in forward
    enc_out = self.audio_encoder(**sample["net_input"])
  File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "./av_hubert/avhubert/hubert_asr.py", line 386, in forward
    x, padding_mask = self.w2v_model.extract_finetune(**w2v_args)
  File "./av_hubert/avhubert/hubert.py", line 704, in extract_finetune
    features_audio = self.forward_features(src_audio, modality='audio') # features: [B, F, T]
  File "./av_hubert/avhubert/hubert.py", line 541, in forward_features
    features = extractor(source)
  File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "./av_hubert/avhubert/hubert.py", line 327, in forward
    x = self.proj(x.transpose(1, 2))
  File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (250x160 and 104x768)

IndexError: index 0 is out of bounds for dimension 0 with size 0

/TalkLip/utils/data_avhubert.py", line 172, in emb_roi2im
    width = imgs[0][0].shape[1]
IndexError: index 0 is out of bounds for dimension 0 with size 0

It shows up when I try to test my own data. Can anyone help with that? Thanks!

AttributeError: 'AVHubertSeq2Seq' object has no attribute 'num_updates'

I ran the training script according to readme.md; the error is as follows:

Traceback (most recent call last):
  File "train.py", line 740, in <module>
    train(device, {'gen': imGen, 'disc': imDisc}, avhubert, criterion, {'train': train_data_loader, 'test': test_data_loader},
  File "train.py", line 531, in train
    average_sync_loss, valid_log = eval_model(data_loader['test'], avhubert, criterion, global_step, device, model['gen'], model['disc'], args.cont_w, recon_loss)
  File "train.py", line 592, in eval_model
    lip_loss, sample_size, logs, enc_out = criterion(avhubert, sample)
  File "/data/anaconda3/envs/talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/ljy/TalkLip/av_hubert/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 79, in forward
    net_output = model(**sample["net_input"])
  File "/data/anaconda3/envs/talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/ljy/TalkLip/av_hubert/avhubert/hubert_asr.py", line 494, in forward
    ft = self.freeze_finetune_updates <= self.num_updates
  File "/data/anaconda3/envs/talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AVHubertSeq2Seq' object has no attribute 'num_updates'

Can someone help me find where the problem is? Thanks!
The command used to run the script is as follows:
python train.py --file_dir /data/wwp/dataset/LRS2 --video_root /data/wwp/dataset/LRS2/mvlrs_v1/main --audio_root /data/wwp/dataset/LRS2/valid_audio \
    --bbx_root /data/wwp/dataset/LRS2/valid_bbx --word_root /data/wwp/dataset/LRS2/mvlrs_v1/main --avhubert_root ./av_hubert/avhubert --avhubert_path /data/ljy/checkpoints/TalkLip/lip_reading_expert.pt \
    --checkpoint_dir ./checkpoints/ --log_name log_talklip_01 --n_epoch 10 --ckpt_interval 50
