primepake / wav2lip_288x288

License: MIT License

Python 99.86% Shell 0.14%
deep-learning generation generative talking-head video face-talking audio-driven-talking-face deep-fake deep-fakes image-animation talking-face talking-face-generation

wav2lip_288x288's Introduction

An improved, higher-resolution version of the Wav2Lip model.

Original repo: https://github.com/Rudrabha/Wav2Lip

Each line in the filelist should be a full path.
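
For reference, a minimal sketch of generating such filelists (the directory layout, the filelists/ output folder, and the 95/5 split are assumptions, following the original Wav2Lip convention of one preprocessed folder per clip):

import glob
import os
import random

preprocessed_root = "/data/preprocessed"  # assumption: adjust to your own data root
clip_dirs = sorted(glob.glob(os.path.join(preprocessed_root, "*")))
random.seed(0)
random.shuffle(clip_dirs)

os.makedirs("filelists", exist_ok=True)
split = int(0.95 * len(clip_dirs))  # assumed 95/5 train/val split
with open("filelists/train.txt", "w") as f:
    f.write("\n".join(clip_dirs[:split]))  # one full path per line
with open("filelists/val.txt", "w") as f:
    f.write("\n".join(clip_dirs[split:]))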

First, train SyncNet:

python3 train_syncnet_sam.py

Second, train Wav2Lip-SAM:

python3 hq_wav2lip_sam_train.py

Some demos from Chinese users: #89 (comment)

New Features: DINet full pipeline training

Original repo: https://github.com/MRzzm/DINet

  • SyncNet training using DeepSpeech
  • DINet frame training using DeepSpeech
  • DINet clip training using DeepSpeech

Citing

To cite this repository:

@misc{Wav2Lip,
  author={Rudrabha},
  title={Wav2Lip: Accurately Lip-syncing Videos In The Wild},
  year={2020},
  url={https://github.com/Rudrabha/Wav2Lip}
}

wav2lip_288x288's People

Contributors

sm-nocapinc


wav2lip_288x288's Issues

Access troubles on Linux

Hello, what OS did you use while working on this project?
Everything works fine on Windows, but on Linux I have some troubles: hparams.py doesn't see the filelists (even though they exist in the right directory), etc.
With the original Wav2Lip repo everything works fine on both OSes.
I think there is some problem with the files/directories in the repo: access rights, owners, or attributes. Did you set any special attributes or permissions on files/directories in your project?
Thanks

Running color_syncnet_train.py gives an error

my command:
python color_syncnet_train.py --data_root /home//dataset/myvideo_dataset/preprocess --checkpoint_dir /home//vir-person/wav2lip_288x288/myvideo_chekpoint

and then error:

Loss: 1.1169637313910894: : 14it [01:43,  2.70s/it]/pytorch/aten/src/THCUNN/BCECriterion.cu:42: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference<float>, thrust::device_reference<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [46,0,0] Assertion `input >= 0. && input <= 1.` failed.
Loss: 1.1169637313910894: : 14it [01:44,  7.47s/it]
Traceback (most recent call last):
  File "color_syncnet_train.py", line 279, in <module>
    nepochs=hparams.nepochs)
  File "color_syncnet_train.py", line 161, in train
    loss = cosine_loss(a, v, y)
  File "color_syncnet_train.py", line 136, in cosine_loss
    loss = logloss(d.unsqueeze(1), y)
  File "/root/anaconda3/envs/wav2lip/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/wav2lip/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 512, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/root/anaconda3/envs/wav2lip/lib/python3.6/site-packages/torch/nn/functional.py", line 2113, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: reduce failed to synchronize: device-side assert triggered
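
A likely cause, judging from the original Wav2Lip cosine loss (an assumption, not a confirmed diagnosis): F.cosine_similarity returns values in [-1, 1], while nn.BCELoss asserts that its input lies in [0, 1], so any negative similarity triggers the device-side assert above. A minimal sketch of a clamped variant of cosine_loss:

import torch
import torch.nn as nn
import torch.nn.functional as F

logloss = nn.BCELoss()

def cosine_loss(a, v, y):
    # cosine_similarity is in [-1, 1]; BCELoss requires probabilities in [0, 1],
    # so clamp (or rescale) the similarity before taking the log loss.
    d = F.cosine_similarity(a, v)
    d = torch.clamp(d, min=0.0, max=1.0)  # assumption: treat negatives as 0
    return logloss(d.unsqueeze(1), y)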

custom dataset

Hello @primepake, please help me with dataset correction:
my dataset is 30 FPS and I need to change it to 25. So far I have done the face detection, and the preprocessed data is frames and audio. What should I do to change the FPS?
thank you
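
One possible approach (a suggestion, not the repo's documented pipeline): convert the source videos to 25 fps with ffmpeg before face detection and re-run the preprocessing, rather than trying to retime frames that were already extracted at 30 fps. A rough sketch:

import glob
import os
import subprocess

src_dir = "raw_videos"      # hypothetical folder of 30 fps source videos
dst_dir = "videos_25fps"    # hypothetical output folder
os.makedirs(dst_dir, exist_ok=True)

for path in glob.glob(os.path.join(src_dir, "*.mp4")):
    out_path = os.path.join(dst_dir, os.path.basename(path))
    # Re-encode video to 25 fps and resample audio to 16 kHz in one pass.
    subprocess.run([
        "ffmpeg", "-y", "-i", path,
        "-r", "25", "-c:v", "libx264",
        "-ar", "16000", "-c:a", "aac",
        out_path,
    ], check=True)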

Kernel size can't be greater than actual input size

Thanks for your sharing!

I tried your model and found some problems. Here are the details:

1. When loading wav2lip.pth, a "Missing key(s) in state_dict" error occurred. I made some changes in inference and it was solved. Can you please check whether it's all right?
def load_model(path):
    model = Wav2Lip()
    print("Load checkpoint from: {}".format(path))
    checkpoint = _load(path)
    # s = checkpoint["state_dict"]
    # new_s = {}
    # for k, v in s.items():
    #     new_s[k.replace('module.', '')] = v
    model.load_state_dict(checkpoint, False)
    model = model.to(device)
    return model.eval()

2. New error:
Using cuda for inference.
Reading video frames...
Number of frames available for inference: 128
(80, 377)
Length of mel chunks: 115
Recovering from OOM error; New batch size: 8 | 0/1 [00:00<?, ?it/s]
Load checkpoint from: checkpoints/wav2lip_gan.pth | 0/8 [00:00<?, ?it/s]
Model loaded####################################| 15/15 [00:23<00:00, 1.59s/it]

Traceback (most recent call last):
  File "inference.py", line 373, in <module>
    main()
  File "inference.py", line 354, in main
    pred = model(mel_batch, img_batch)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/wav2lip_288x288-ckpt-mismatch/models/wav2lipv2.py", line 117, in forward
    x = f(x)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/wav2lip_288x288-ckpt-mismatch/models/conv2.py", line 16, in forward
    out = self.conv_block(x)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

Do you have any suggestions?
Thanks in advance.
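
One plausible cause of this particular error (an educated guess, not confirmed by the maintainer): the inference script still prepares face crops at the original 96x96 Wav2Lip resolution, while the 288x288 model has more downsampling stages, so the feature map collapses to 2x2 before a 3x3 convolution is applied. A minimal sketch of the idea, assuming an img_size setting that should match the training resolution:

import cv2

IMG_SIZE = 288  # assumption: the 288x288 checkpoint expects 288x288 face crops

def prepare_face(face_bgr):
    # Resize the detected face crop to the resolution the model was trained on.
    # Feeding 96x96 crops into the deeper 288x288 encoder shrinks the feature
    # map below the 3x3 kernel size and raises the RuntimeError above.
    return cv2.resize(face_bgr, (IMG_SIZE, IMG_SIZE))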

about the inference error

Hi, I used your code to run the training without any change to the model. The training is normal, but when I run the inference code the error below occurs. Have you ever met this error? Thank you.

Model loaded
0%| | 0/2 [00:21<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 277, in <module>
    main()
  File "inference.py", line 262, in main
    pred = model(mel_batch, img_batch)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "wav2lip/HD_wav2lip/wav2lip_288x288/models/wav2lipv2.py", line 117, in forward
    x = f(x)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "wav2lip/HD_wav2lip/wav2lip_288x288/models/conv2.py", line 16, in forward
    out = self.conv_block(x)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

About the AVSpeech dataset

The full AVSpeech dataset is roughly 1500 GB, which is too large for me to download; would you be willing to share your copy with me? Thanks a lot.

modification about Wasserstein

Hi, sorry to trouble you again.
I noticed that some modifications related to Wasserstein loss are mentioned compared to the original version. As far as I know, if Wasserstein loss is used, the final sigmoid layer of the discriminator needs to be removed, and each time the discriminator's parameters are updated they should be clipped to a constant, and so on.
But I can't find these modifications in the code. Am I misunderstanding the Wasserstein setup? Looking forward to your help.
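
For reference, the standard WGAN recipe the question refers to looks roughly like this (a generic sketch, not this repository's actual discriminator update; disc, disc_opt, gt, and fake_img are placeholders):

import torch

CLIP_VALUE = 0.01  # weight-clipping constant from the original WGAN paper

def critic_step(disc, disc_opt, gt, fake_img):
    # WGAN critic: no sigmoid on the output; maximize D(real) - D(fake).
    disc_opt.zero_grad()
    loss = -(disc(gt).mean() - disc(fake_img.detach()).mean())
    loss.backward()
    disc_opt.step()
    # Clip critic weights after each update to enforce the Lipschitz constraint.
    for p in disc.parameters():
        p.data.clamp_(-CLIP_VALUE, CLIP_VALUE)
    return loss.item()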

Are you providing model weights?

I see that your dataset is private, but are you providing your model weights? Are you only providing training files? It looks like all documentation and links are for the original Rudrabha implementation, so it's a little hard to tell.

Thanks

What's the difference

Hi, thanks for this work.
May I ask what the differences are between this work and Wav2Lip, other than the network structure?

RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

I have trained wloss_hq_wav2lip_train.py and used checkpoint checkpoint_step000003000.pth for inference

python inference.py --checkpoint_path "/content/gdrive/MyDrive/wav2lip_288x288/checkpoints/checkpoint.pth"  --face "/content/gdrive/MyDrive/Wav2Lip/video.mp4" --audio "/content/gdrive/MyDrive/Wav2Lip/input_audio.wav" 
Using cuda for inference.
Reading video frames...
Number of frames available for inference: 5760
/usr/local/lib/python3.7/dist-packages/librosa/core/audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
(80, 222)
Length of mel chunks: 157
  0% 0/2 [00:00<?, ?it/s]
  0% 0/10 [00:00<?, ?it/s]
 10% 1/10 [00:06<00:56,  6.25s/it]
 20% 2/10 [00:07<00:26,  3.37s/it]
 30% 3/10 [00:08<00:17,  2.45s/it]
 40% 4/10 [00:10<00:12,  2.02s/it]
 50% 5/10 [00:11<00:08,  1.77s/it]
 60% 6/10 [00:13<00:06,  1.63s/it]
 70% 7/10 [00:14<00:04,  1.54s/it]
 80% 8/10 [00:15<00:02,  1.48s/it]
 90% 9/10 [00:17<00:01,  1.44s/it]
100% 10/10 [00:21<00:00,  2.10s/it]
Load checkpoint from: /content/gdrive/MyDrive/wav2lip_288x288/checkpoints/checkpoint.pth
Model loaded
  0% 0/2 [00:24<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 280, in <module>
    main()
  File "inference.py", line 263, in main
    pred = model(mel_batch, img_batch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/wav2lip_288x288/models/wav2lipv2.py", line 117, in forward
    x = f(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/wav2lip_288x288/models/conv2.py", line 16, in forward
    out = self.conv_block(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

color_syncnet_train loss is too high for a real video

I recorded a video with my phone and tested the loss with color_syncnet_train.py by loading the pretrained model lipsync_expert.pth; it shows a loss of around 1.3, not the well-trained loss of around 0.3. I also tested a video generated by Wav2Lip, but it still shows a value that is too high, around 1.0. Why is that? I tried these because I also can't train my dataset down to a good loss of around 0.3; it stays pinned at 0.69. I have cleaned my dataset to 25 fps and 16000 Hz and filtered it with syncnet_python.

inference error

hi, may I ask a question?

I ran color_syncnet_train.py and wav2lip_train.py.

When I use the checkpoint from wav2lip_train.py in inference.py, the error is as follows:

RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size.

Why does my wav2lip_train.py always keep sync_loss = 0.0?

the command is :
$ python wav2lip_train.py --data_root enhanced/ --checkpoint_dir checkpoints/ --syncnet_checkpoint_path
checkpoints/syncnet_checkpoint_step000120000.pth --checkpoint_path checkpoints/checkpoint_step000075000.pth

The training output is:
L1: 0.02834209179919627, Sync Loss: 0.0: : 540it [09:59, 1.05s/it]

Why is the Sync Loss always 0.0?
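
For context, in the original Wav2Lip training script the sync term only enters the total loss once the expert's evaluation sync loss drops below a threshold; until then syncnet_wt stays at 0 and the logged Sync Loss is 0.0. A paraphrase of that upstream logic (the exact values in this repo may differ):

def combined_loss(l1_loss, sync_loss, syncnet_wt):
    # With syncnet_wt == 0 the sync term contributes nothing to the total loss.
    return syncnet_wt * sync_loss + (1.0 - syncnet_wt) * l1_loss

def maybe_enable_sync_loss(average_sync_loss, hparams):
    # Upstream gate: switch the sync weight on only after the expert's
    # evaluation loss drops below 0.75.
    if average_sync_loss < 0.75:
        hparams.set_hparam('syncnet_wt', 0.01)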

help

Hello, could you please send me your weights?

Can you share the pretrained weights from AVSpeech?

Thanks for your nice work!
I am trying to use wav2lip_288x288 on my own dataset, but my dataset is small, so I need to pretrain on a large dataset (e.g. AVSpeech). It has to be said that such training has a large cost in time and compute.
So, could you share your pretrained weights? Thank you.

Avspeech train/valid split

Hello, how did you split your train/valid dataset?
If each video is divided into 5-second clips, can clips from the same video end up in both the train set and the valid set?

Sample outputs

Hello!
Can you please share some samples of your model's inference?
Any videos with your high-quality 288x288 px lip sync would do.
I just want to find out whether the HQ Wav2Lip training produces reasonable results.
Thanks

(not the model, only samples of its work)

Can you share a demo?

Your work is amazing. I'm interested in this field. Could you share the demo with me? I would appreciate it very much.

Have you tried to fine-tune into a single-person model?

I want to build a personalized Wav2Lip, but fine-tuning it on a single person (a 5-minute video) does not give usable results. I am wondering whether a longer video, maybe 1 hour, would give better results. I just want to see if you have any experience with this.

Preprocessing issue for fps conversion

Hi, if I change the fps to 25 as recommended for the Wav2Lip model, the audio sync also changes.
How do I get the audio sync right again?

The following command is used to change the fps:
"ffmpeg -y -i {input_path} -r 25 {out_path} -hide_banner -loglevel error"

To fix the sync, after I got the offset of the video using SyncNet, I also used the following sync-shifting command, but its output does not match the correct synchronization:

f"ffmpeg -y -i {input_path} -itsoffset {shift} -i {input_path} -ss {shift} -t {all duration of video - abs(shift)} -map 0:v -map 1:a {out_path} -hide_banner -loglevel error"

About the PReLU & LeakyReLU improvement

Thanks for your great work!

I have used the original Wav2Lip model to dub in-the-wild videos, and found that there is occasionally some abnormal color in the mouth.

I think the reason is that the original Wav2Lip model lacks a proper softmax/ReLU step when generating color.

I read in your README that the 288x288 model uses more powerful ReLU variants in the convnet.

I have 2 questions about the improvement:
Q1: Why did you choose to switch to PReLU & LeakyReLU? Can you give some typical scenarios where PReLU/LeakyReLU works better?
Q2: Will the 288x288 model eliminate the abnormal color in the mouth?

Some bad cases from the original Wav2Lip model:

[images: examples of abnormal mouth color]

color_syncnet_train.py

So, I have preprocessed a part of the AVSpeech dataset and ran the command for color_syncnet_train.py. The filelists (train.txt and val.txt) are set up properly as well, but the output looks like this:

python color_syncnet_train.py --data_root preprocess --checkpoint_dir checkpoints/
use_cuda: True
total trainable params 19881059
0it [00:00, ?it/s]Saved checkpoint: checkpoints/checkpoint_step000000001.pth
Loss: 1.0145502090454102: : 1it [16:59, 1019.31s/it]
0it [00:00, ?it/s]

Is this going fine? Why does it run for only 1 iteration?
There are around 49 entries in train.txt and 26 in val.txt (all of them preprocessed). The folder paths look fine to me as well; I'm not sure what I am messing up.

Very slow SyncNet training process

Hello!
Did you modify the data loader? This implementation seems to have some speed troubles:
after 48 h of training, the loss has dropped only from 0.7 to 0.6.
That is 94 epochs (~29.5k steps). It's kinda slow... I think the problem is in the dataloader.
This is on a Tesla V100 GPU.

My params:
lr 1e-5
batch size 128
loss func nn.BCEWithLogitsLoss()

The dataset has 40k videos.
All videos are filtered and sync-corrected, and each has only one face at a decent resolution.

Can you please give any suggestions?
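
A generic first pass at speeding up the SyncNet data pipeline (an illustrative sketch, not a profile of this repo; train_dataset stands for the repo's Dataset instance): increase num_workers, pin host memory, and keep workers alive across epochs so image/audio decoding happens off the main process.

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,            # assumed: the repo's SyncNet Dataset instance
    batch_size=128,
    shuffle=True,
    num_workers=16,           # parallel JPEG/mel decoding; tune to CPU core count
    pin_memory=True,          # faster host-to-GPU transfers
    persistent_workers=True,  # avoid re-spawning workers every epoch
    drop_last=True,
)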

MTCNN

Did you write this MTCNN yourself, or did you use an existing package? Can you share it? I'm working on clear_data.py.

Preprocessing of AvSpeech dataset

Hi,
I want to train the Wav2Lip model on the AVSpeech dataset, but I am stuck on preprocessing AVSpeech for Wav2Lip. I have downloaded the data, which is a CSV file, and I am trying yt-dlp to fetch the YouTube videos, but I am not able to. Do you have any script for downloading the data and processing it for training?
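
A rough download-and-cut sketch, assuming the AVSpeech CSV columns are youtube_id, start_sec, end_sec, face_x, face_y and that yt-dlp and ffmpeg are installed (both the layout and the tooling are assumptions, not a script from this repo):

import csv
import os
import subprocess

def download_clips(csv_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path) as f:
        for i, row in enumerate(csv.reader(f)):
            yt_id, start, end = row[0], float(row[1]), float(row[2])
            full = os.path.join(out_dir, f"{yt_id}.mp4")
            clip = os.path.join(out_dir, f"{yt_id}_{i}.mp4")
            # Download the whole video once, then cut the labelled segment
            # with ffmpeg and force 25 fps for Wav2Lip preprocessing.
            if not os.path.exists(full):
                subprocess.run(["yt-dlp", "-f", "mp4", "-o", full,
                                f"https://www.youtube.com/watch?v={yt_id}"], check=True)
            subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-to", str(end),
                            "-i", full, "-r", "25", "-c:v", "libx264",
                            "-c:a", "aac", clip], check=True)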

wloss_hq_wav2lip_train.py

Hello, I have a question about line 267 in the mentioned file:

interpolates = alpha * gt + ((1 - alpha) * fake_img)

What is fake_img? It isn't defined anywhere; from which variable or function should I get it?
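
In the usual WGAN-GP formulation that this line follows, fake_img is simply the generator's output for the same batch as gt (detached before interpolation). A generic gradient-penalty sketch, not this repo's exact code:

import torch

def gradient_penalty(disc, gt, fake_img, device):
    # fake_img: the generator's predicted frames for the same batch as gt.
    alpha = torch.rand(gt.size(0), 1, 1, 1, device=device)  # assumes 4D image tensors
    interpolates = alpha * gt + (1 - alpha) * fake_img.detach()
    interpolates.requires_grad_(True)
    d_out = disc(interpolates)
    grads = torch.autograd.grad(
        outputs=d_out, inputs=interpolates,
        grad_outputs=torch.ones_like(d_out),
        create_graph=True, retain_graph=True,
    )[0]
    return ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()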

AVSpeech SyncNet training stuck around loss ~0.5 for a long time

I used AVSpeech data to train SyncNet for 285 hours, and the training loss dropped from 0.69 to 0.5. It has now been stuck around 0.5 for a long time. What could be the reason for this? I filtered AVSpeech with syncnet_python, keeping offsets between -1 and 1.

Evaluating for 1400 steps 0.6834709048271179

Loss: 0.5212869264862754: : 11it [00:06, 1.59it/s]
Loss: 0.516353040933609: : 11it [00:06, 1.66it/s]
Loss: 0.506797579201785: : 11it [00:06, 1.58it/s]
Loss: 0.5059969479387457: : 11it [00:06, 1.64it/s]
Loss: 0.5300133119929921: : 11it [00:06, 1.63it/s]
Loss: 0.5310295034538616: : 11it [00:06, 1.58it/s]
Loss: 0.517237208106301: : 11it [00:07, 1.56it/s]
Loss: 0.534422142939134: : 11it [00:06, 1.65it/s]
Loss: 0.5230060680346056: : 11it [00:06, 1.58it/s]
Loss: 0.5333795303648169: : 11it [00:06, 1.62it/s]
Loss: 0.5387652733109214: : 11it [00:06, 1.59it/s]
Loss: 0.5229526162147522: : 11it [00:07, 1.51it/s]
Loss: 0.5356160998344421: : 11it [00:06, 1.61it/s]
Loss: 0.5272310484539379: : 11it [00:06, 1.61it/s]
Loss: 0.5080998133529316: : 11it [00:06, 1.61it/s]
Loss: 0.5103642046451569: : 11it [00:06, 1.65it/s]
Loss: 0.5402759123932231: : 11it [00:06, 1.60it/s]
Loss: 0.5128899162465875: : 11it [00:06, 1.58it/s]
Loss: 0.5354925204407085: : 11it [00:07, 1.57it/s]
Loss: 0.5273735089735552: : 11it [00:06, 1.61it/s]
Loss: 0.5145161287351088: : 11it [00:06, 1.65it/s]
Loss: 0.5218090279535814: : 11it [00:07, 1.52it/s]
Loss: 0.5303930558941581: : 11it [00:06, 1.63it/s]
Loss: 0.5121880417520349: : 11it [00:06, 1.59it/s]
Loss: 0.5163682943040674: : 11it [00:06, 1.64it/s]
Loss: 0.5399262742562727: : 11it [00:06, 1.61it/s]
Loss: 0.5241495452143929: : 11it [00:06, 1.61it/s]
Loss: 0.5095025599002838: : 11it [00:06, 1.61it/s]
Loss: 0.5246310071511702: : 11it [00:07, 1.47it/s]
Loss: 0.5254473957148466: : 11it [00:06, 1.61it/s]
Loss: 0.5240989273244684: : 11it [00:06, 1.58it/s]
Loss: 0.5143423215909437: : 11it [00:07, 1.54it/s]
Loss: 0.5383796854452654: : 11it [00:06, 1.62it/s]
Loss: 0.5364107299934734: : 11it [00:06, 1.64it/s]
Loss: 0.535735699263486: : 11it [00:06, 1.63it/s]
Loss: 0.5260600068352439: : 11it [00:07, 1.53it/s]
Loss: 0.5264727879654277: : 11it [00:06, 1.65it/s]
Loss: 0.5269337269392881: : 11it [00:06, 1.58it/s]
Loss: 0.5506553758274425: : 11it [00:07, 1.41it/s]
Loss: 0.5356145446950739: : 11it [00:06, 1.68it/s]
Loss: 0.53057600422339: : 11it [00:06, 1.65it/s]
Loss: 0.5183943212032318: : 11it [00:07, 1.55it/s]
Loss: 0.5093956183303486: : 11it [00:07, 1.55it/s]
Loss: 0.5288541885939512: : 11it [00:07, 1.46it/s]
Loss: 0.5268966989083723: : 11it [00:06, 1.62it/s]
Loss: 0.5296651070768182: : 11it [00:07, 1.57it/s]
Loss: 0.5330738831650127: : 11it [00:07, 1.57it/s]
Loss: 0.5453376011414961: : 11it [00:07, 1.43it/s]
Loss: 0.5334051115946337: : 11it [00:06, 1.62it/s]
Loss: 0.5332772217013619: : 11it [00:06, 1.58it/s]
Loss: 0.5456769385121085: : 11it [00:06, 1.59it/s]
Loss: 0.5276198793541301: : 11it [00:06, 1.61it/s]
Loss: 0.5252100364728407: : 11it [00:06, 1.59it/s]
Loss: 0.547866637056524: : 11it [00:06, 1.59it/s]
Loss: 0.534071144732562: : 11it [00:07, 1.50it/s]
Loss: 0.5364801423116163: : 11it [00:06, 1.63it/s]
Loss: 0.5198463797569275: : 11it [00:06, 1.58it/s]
Loss: 0.5080715119838715: : 11it [00:06, 1.61it/s]
Loss: 0.5239014219154011: : 11it [00:07, 1.54it/s]
Loss: 0.5345901250839233: : 11it [00:06, 1.63it/s]
Loss: 0.5356171185320074: : 11it [00:07, 1.43it/s]
Loss: 0.5243599035523154: : 11it [00:06, 1.61it/s]
Loss: 0.5395512635057623: : 11it [00:06, 1.58it/s]
Loss: 0.526209909807552: : 11it [00:07, 1.56it/s]
Loss: 0.5344010916623202: : 11it [00:08, 1.34it/s]
Loss: 0.5253535590388558: : 11it [00:06, 1.58it/s]
Loss: 0.5191849280487407: : 11it [00:06, 1.59it/s]
Loss: 0.5137904584407806: : 11it [00:07, 1.54it/s]
Loss: 0.5361907888542522: : 11it [00:07, 1.55it/s]
Loss: 0.5215925763953816: : 11it [00:07, 1.56it/s]
Loss: 0.5253916762091897: : 11it [00:07, 1.56it/s]
Loss: 0.5226842273365367: : 11it [00:06, 1.61it/s]
Loss: 0.5378378033638: : 11it [00:07, 1.56it/s]
Loss: 0.5154962214556608: : 11it [00:06, 1.59it/s]
Loss: 0.5151732536879453: : 11it [00:06, 1.64it/s]
Loss: 0.5253660435026343: : 11it [00:06, 1.59it/s]
Loss: 0.5318919772451575: : 11it [00:07, 1.57it/s]
Loss: 0.5285821773789146: : 11it [00:07, 1.52it/s]
Loss: 0.5203951353376562: : 11it [00:06, 1.59it/s]
Loss: 0.5248001217842102: : 11it [00:06, 1.64it/s]
Loss: 0.5474389899860729: : 11it [00:06, 1.59it/s]
Loss: 0.5239813354882327: : 11it [00:06, 1.59it/s]
Loss: 0.510086715221405: : 11it [00:07, 1.53it/s]
Loss: 0.5268921526995572: : 11it [00:06, 1.59it/s]
Loss: 0.5242643247951161: : 11it [00:08, 1.35it/s]
Loss: 0.5328544540838762: : 11it [00:06, 1.61it/s]
Loss: 0.5278959978710521: : 11it [00:07, 1.51it/s]
Loss: 0.5087671984325756: : 11it [00:06, 1.64it/s]
Loss: 0.5189093026247892: : 11it [00:06, 1.58it/s]
Loss: 0.5501838326454163: : 11it [00:08, 1.37it/s]
Loss: 0.5115621387958527: : 11it [00:07, 1.55it/s]
Loss: 0.5253327245062048: : 11it [00:06, 1.61it/s]
Loss: 0.5272630588574843: : 11it [00:06, 1.61it/s]
Loss: 0.5148500204086304: : 11it [00:07, 1.54it/s]
Loss: 0.5340689122676849: : 11it [00:06, 1.60it/s]
Loss: 0.5341784683140841: : 11it [00:07, 1.54it/s]
Loss: 0.521747889843854: : 11it [00:06, 1.62it/s]
Loss: 0.5119168541648171: : 11it [00:06, 1.63it/s]
Loss: 0.5264650989662517: : 11it [00:07, 1.54it/s]
Loss: 0.5414948653091084: : 11it [00:06, 1.58it/s]
Loss: 0.5448070005937056: : 11it [00:06, 1.63it/s]
Loss: 0.510159890760075: : 11it [00:06, 1.59it/s]
Loss: 0.5324155363169584: : 11it [00:07, 1.57it/s]
Loss: 0.5363136529922485: : 11it [00:06, 1.59it/s]
Loss: 0.527454985813661: : 11it [00:06, 1.61it/s]
Loss: 0.5277877937663685: : 11it [00:07, 1.41it/s]
Loss: 0.524616306478327: : 11it [00:06, 1.58it/s]
Loss: 0.5203610306436365: : 11it [00:07, 1.55it/s]
Loss: 0.5298510220917788: : 11it [00:07, 1.57it/s]
Loss: 0.5173395330255682: : 11it [00:07, 1.56it/s]
Loss: 0.533916874365373: : 11it [00:06, 1.66it/s]
Loss: 0.5159734921021895: : 11it [00:06, 1.63it/s]
Loss: 0.5441486456177451: : 11it [00:06, 1.60it/s]
Loss: 0.5260709740898826: : 11it [00:06, 1.64it/s]
Loss: 0.5441469766876914: : 11it [00:06, 1.60it/s]
Loss: 0.5286740227179094: : 11it [00:06, 1.65it/s]
Loss: 0.5092522854154761: : 11it [00:06, 1.58it/s]
Loss: 0.512138466943394: : 11it [00:06, 1.61it/s]
Loss: 0.5216747386889025: : 11it [00:07, 1.57it/s]
Loss: 0.5291207080537622: : 11it [00:07, 1.49it/s]
Loss: 0.5329246493903074: : 11it [00:06, 1.61it/s]
Loss: 0.5120911191810261: : 11it [00:07, 1.55it/s]
Loss: 0.5228245149959218: : 11it [00:07, 1.57it/s]
Loss: 0.5253384844823317: : 11it [00:06, 1.64it/s]
Loss: 0.5360355079174042: : 11it [00:07, 1.52it/s]
Loss: 0.5080934898419813: : 11it [00:06, 1.57it/s]
Loss: 0.5472726388411089: : 11it [00:07, 1.53it/s]
Loss: 0.5124477256428112: : 11it [00:07, 1.52it/s]
Loss: 0.5321660935878754: : 11it [00:06, 1.58it/s]
Loss: 0.5292721851305529: : 11it [00:06, 1.59it/s]
Loss: 0.5342415164817463: : 11it [00:06, 1.64it/s]
Loss: 0.5322149070826444: : 11it [00:07, 1.56it/s]
Loss: 0.5233434005217119: : 11it [00:06, 1.60it/s]
Loss: 0.5123658369887959: : 11it [00:07, 1.53it/s]
Loss: 0.5206985040144487: : 11it [00:07, 1.41it/s]
Loss: 0.5344440151344646: : 11it [00:06, 1.63it/s]
Loss: 0.5306854437698018: : 11it [00:06, 1.57it/s]
Loss: 0.5189629088748585: : 11it [00:06, 1.62it/s]
Loss: 0.551388068632646: : 11it [00:07, 1.53it/s]
Loss: 0.5213939059864391: : 11it [00:06, 1.61it/s]
Loss: 0.5267486572265625: : 11it [00:06, 1.61it/s]
Loss: 0.5126107416369698: : 11it [00:06, 1.64it/s]
Loss: 0.5210787735202096: : 11it [00:07, 1.52it/s]
Loss: 0.5091477849266746: : 11it [00:06, 1.62it/s]
Loss: 0.5227019597183574: : 11it [00:07, 1.50it/s]
Loss: 0.5210425935008309: : 11it [00:06, 1.59it/s]
Loss: 0.537912601774389: : 11it [00:07, 1.52it/s]
Loss: 0.5248223380608992: : 11it [00:07, 1.57it/s]
Loss: 0.517209218306975: : 11it [00:07, 1.44it/s]
Loss: 0.505943544886329: : 11it [00:06, 1.61it/s]
Loss: 0.5258536582643335: : 11it [00:06, 1.59it/s]
Loss: 0.5366070893677798: : 11it [00:06, 1.63it/s]
Loss: 0.5362473644993522: : 11it [00:07, 1.54it/s]

Did you retrain the sync_expert network?

Hi, did you retrain the lipsync_expert? I used your training code and reloaded the original pretrained lipsync_expert.pth (downloaded from https://github.com/Rudrabha/Wav2Lip), and I get a shape mismatch error as follows:

Missing key(s) in state_dict: "face_encoder.0.act.weight", "face_encoder.1.act.weight", "face_encoder.2.act.weight", "face_encoder.3.act.weight", "face_encoder.4.act.weight", "face_encoder.5.act.weight", "face_encoder.6.act.weight", "face_encoder.7.act.weight", "face_encoder.8.act.weight", "face_encoder.9.act.weight", "face_encoder.10.act.weight", "face_encoder.11.act.weight", "face_encoder.12.act.weight", "face_encoder.13.act.weight", "face_encoder.14.act.weight", "face_encoder.15.act.weight", "face_encoder.16.act.weight", "face_encoder.17.conv_block.0.weight", "face_encoder.17.conv_block.0.bias", "face_encoder.17.conv_block.1.weight", "face_encoder.17.conv_block.1.bias", "face_encoder.17.conv_block.1.running_mean", "face_encoder.17.conv_block.1.running_var", "face_encoder.17.act.weight", "face_encoder.18.conv_block.0.weight", "face_encoder.18.conv_block.0.bias", "face_encoder.18.conv_block.1.weight", "face_encoder.18.conv_block.1.bias", "face_encoder.18.conv_block.1.running_mean", "face_encoder.18.conv_block.1.running_var", "face_encoder.18.act.weight", "face_encoder.19.conv_block.0.weight", "face_encoder.19.conv_block.0.bias", "face_encoder.19.conv_block.1.weight", "face_encoder.19.conv_block.1.bias", "face_encoder.19.conv_block.1.running_mean", "face_encoder.19.conv_block.1.running_var", "face_encoder.19.act.weight", "face_encoder.20.conv_block.0.weight", "face_encoder.20.conv_block.0.bias", "face_encoder.20.conv_block.1.weight", "face_encoder.20.conv_block.1.bias", "face_encoder.20.conv_block.1.running_mean", "face_encoder.20.conv_block.1.running_var", "face_encoder.20.act.weight", "audio_encoder.0.act.weight", "audio_encoder.1.act.weight", "audio_encoder.2.act.weight", "audio_encoder.3.act.weight", "audio_encoder.4.act.weight", "audio_encoder.5.act.weight", "audio_encoder.6.act.weight", "audio_encoder.7.act.weight", "audio_encoder.8.act.weight", "audio_encoder.9.act.weight", "audio_encoder.10.act.weight", "audio_encoder.11.act.weight", "audio_encoder.12.act.weight", "audio_encoder.13.act.weight". 
size mismatch for face_encoder.1.conv_block.0.weight: copying a param with shape torch.Size([64, 32, 5, 5]) from checkpoint, the shape in current model is torch.Size([32, 32, 5, 5]).
size mismatch for face_encoder.1.conv_block.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.1.conv_block.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.1.conv_block.1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.1.conv_block.1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.1.conv_block.1.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.2.conv_block.0.weight: copying a param with shape torch.Size([64, 64, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 32, 3, 3]).
size mismatch for face_encoder.2.conv_block.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.2.conv_block.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]

Percep: 0.0 | Fake: 100.0, Real: 0.0

@primepake Hello, thanks for your nice work. I have recently encountered some difficulties training on my own dataset (following your data preparation suggestions) with your shared code.
When I run python hq_wav2lip_train.py, the training log is:

use_cuda: True
Load 2687 audio feats.
use_cuda: True, MULTI_GPU: True
total trainable params 48520755
total DISC trainable params 18210561
Load checkpoint from: checkpoint_syn/checkpoint_step000171000.pth
Starting Epoch: 0
Saved checkpoint: checkpoint_step000000001.pth
Saved checkpoint: disc_checkpoint_step000000001.pth
L1: 0.2313518226146698, Sync: 0.0, Percep: 0.711134135723114 | Fake: 0.6754781603813171, Real: 0.711134135723114
L1: 0.21765484660863876, Sync: 0.0, Percep: 0.709110289812088 | Fake: 0.677438884973526, Real: 0.709110289812088
L1: 0.2188651313384374, Sync: 0.0, Percep: 0.7070923844973246 | Fake: 0.6794042587280273, Real: 0.7070923844973246
L1: 0.2144552432000637, Sync: 0.0, Percep: 0.7049891352653503 | Fake: 0.6814645230770111, Real: 0.7049891352653503
L1: 0.21138261258602142, Sync: 0.0, Percep: 0.7029658198356629 | Fake: 0.6834565877914429, Real: 0.702965784072876
L1: 0.20817621052265167, Sync: 0.0, Percep: 0.7010621925195059 | Fake: 0.6853393216927847, Real: 0.7010620832443237
L1: 0.20210737415722438, Sync: 0.0, Percep: 0.6996434501239231 | Fake: 0.6867433360644749, Real: 0.6996431180409023
L1: 0.19812600128352642, Sync: 0.0, Percep: 0.6987411752343178 | Fake: 0.6876341179013252, Real: 0.6987397372722626
L1: 0.19437309437327915, Sync: 0.0, Percep: 0.6981025603082445 | Fake: 0.6882637408044603, Real: 0.6980936461024814
...
L1: 0.12470868316135908, Sync: 0.0, Percep: 0.703860961136065 | Fake: 0.7219482924593122, Real: 0.69152611988952
L1: 0.12432418142755826, Sync: 0.0, Percep: 0.7042167019098997 | Fake: 0.7212010175765803, Real: 0.6919726772157446
L1: 0.1240154696033173, Sync: 0.0, Percep: 0.7046547470633516 | Fake: 0.7203877433827243, Real: 0.6924839418497868
L1: 0.12360538116523198, Sync: 0.0, Percep: 0.7050436321569948 | Fake: 0.7196274266711303, Real: 0.6929315188026521
L1: 0.12324049579675751, Sync: 0.0, Percep: 0.7055533739051434 | Fake: 0.7187670722904832, Real: 0.6934557074776173
Evaluating for 300 steps
L1: 0.08894559927284718, Sync: 7.633765455087026, Percep: 0.8102987110614777 | Fake: 0.5884036968151728, Real: 0.7851572235425314
L1: 0.12352411426603795, Sync: 0.0, Percep: 0.7059701490402222 | Fake: 0.7179982627183199, Real: 0.6938216637895676
L1: 0.12316158214713087, Sync: 0.0, Percep: 0.7069740667201505 | Fake: 0.7167386501879975, Real: 0.6947587119988258
L1: 0.12274648358716685, Sync: 0.0, Percep: 0.7086389707584008 | Fake: 0.7149891035960001, Real: 0.6961143705702852
L1: 0.122441600682666, Sync: 0.0, Percep: 0.7101659479650478 | Fake: 0.7133496456499239, Real: 0.6971959297446922
L1: 0.12237166498716061, Sync: 0.0, Percep: 0.7136020838068082 | Fake: 0.7105454275957667, Real: 0.6988249019290939
L1: 0.12226668624650865, Sync: 0.0, Percep: 0.7165913383165995 | Fake: 0.7080283074861481, Real: 0.6995955426850179
...
L1: 0.10978432702055822, Sync: 0.0, Percep: 0.7658024462613563 | Fake: 0.8152413980393286, Real: 0.6154363644432882
L1: 0.10972586760557995, Sync: 0.0, Percep: 0.7692448822380323 | Fake: 0.8124342786812339, Real: 0.6124296340577328
L1: 0.10953026241862897, Sync: 0.0, Percep: 0.7743397405519289 | Fake: 0.8092221762695191, Real: 0.610396875484025
L1: 0.10939421248741639, Sync: 0.0, Percep: 0.7824273446941963 | Fake: 0.8055855450166116, Real: 0.6093728619629893
L1: 0.1091919630309757, Sync: 0.0, Percep: 0.7908333182386574 | Fake: 0.8019456527194126, Real: 0.6076706493660488
L1: 0.10912049508790679, Sync: 0.0, Percep: 0.7942769879668473 | Fake: 0.7993013052296446, Real: 0.6045908347347031
L1: 0.10908047136182737, Sync: 0.0, Percep: 0.7925851992034527 | Fake: 0.8665490227044159, Real: 0.6015381057315442
L1: 0.10930284824053846, Sync: 0.0, Percep: 0.7917964988368816 | Fake: 0.8661491764380034, Real: 0.6038172583345233
Evaluating for 300 steps
L1: 0.5731244529287021, Sync: 9.090377567211787, Percep: 1.654488068819046 | Fake: 0.21219109917680423, Real: 1.4065884272257487
L1: 0.10955245170742273, Sync: 0.0, Percep: 0.9697534098973847 | Fake: 0.8618184333870491, Real: 0.707599309967045
L1: 0.10959959211782437, Sync: 0.0, Percep: 0.9730155654561438 | Fake: 0.8586212793076295, Real: 0.7107085656626347
L1: 0.1096739404936238, Sync: 0.0, Percep: 0.9733813980183366 | Fake: 0.8565126432325659, Real: 0.7121740724053752
L1: 0.10970133425566951, Sync: 0.0, Percep: 0.9722086560893506 | Fake: 0.8555087234860062, Real: 0.7122941125653687
L1: 0.10975395110161866, Sync: 0.0, Percep: 0.9709689952921027 | Fake: 0.8545878388406093, Real: 0.7123311641033997
L1: 0.10966527403854742, Sync: 0.0, Percep: 0.9697017977636012 | Fake: 0.8537138932196765, Real: 0.7123305465982057
...
L1: 0.10242784435932453, Sync: 0.0, Percep: 0.8304814692491738 | Fake: 10.244699917241833, Real: 0.6199986526972722
L1: 0.10239110164847111, Sync: 0.0, Percep: 0.8279339800796978 | Fake: 10.520022923630663, Real: 0.618096816339305
L1: 0.10235720289591985, Sync: 0.0, Percep: 0.8254020718837354 | Fake: 10.793661997258704, Real: 0.6162066120079922
L1: 0.10230096286480747, Sync: 0.0, Percep: 0.8228856021523825 | Fake: 11.065632539949988, Real: 0.6143279333128459
L1: 0.10232478230738712, Sync: 0.0, Percep: 0.8203844301093661 | Fake: 11.335949766272329, Real: 0.6124606751568799
Starting Epoch: 1
L1: 0.11442705243825912, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.09390852972865105, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.09028899172941844, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08729504607617855, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.0875200405716896, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08466161414980888, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.0851562459553991, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
...
L1: 0.08385955898174599, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08394778782830518, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08393304471088492, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.0836903992508139, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
Evaluating for 300 steps
L1: 0.062434629226724304, Sync: 7.282701448599497, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08369095140779523, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.0834334861073229, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08353378695167907, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08348599619962074, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
...

Is the training process normal? If not, could you please give me some suggestions? I sincerely need your help.

CUDA error while training syncnet

Hello!
I didn't make any changes to the code, but I have trouble with SyncNet training.
The filelists are available, and the data is available too.
This error occurs on the first checkpoint save:

Saved checkpoint: check/checkpoint_step000000001.pth
Traceback (most recent call last):
  File "color_syncnet_train.py", line 279, in <module>
    nepochs=hparams.nepochs)
  File "color_syncnet_train.py", line 161, in train
    loss = cosine_loss(a, v, y)
  File "color_syncnet_train.py", line 136, in cosine_loss
    loss = logloss(d.unsqueeze(1), y)
  File "/home/kadochnikova/.conda/envs/lipsync/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kadochnikova/.conda/envs/lipsync/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 612, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/kadochnikova/.conda/envs/lipsync/lib/python3.6/site-packages/torch/nn/functional.py", line 2893, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
srun: error: hpe: task 0: Exited with exit code 1

I can't find the error. Can you please suggest what the problem is? Thanks!

Main wav2lip training expected loss behavior (wav2lip_train.py)

Hi @primepake!
Could you describe briefly how your main wav2lip training behaved?
My situation is the following: SyncNet is trained to 0.29 eval loss (higher than 0.25, but I'd like to try with that). I start wav2lip_train.py, which goes okay; the eval sync loss is decreasing and reaches 0.75, then syncnet_wt gets set to 0.01. This happens at roughly 280000 steps.

After that, the train sync loss starts to rise from 0 (this is expected behavior) and the eval sync loss starts to decrease fast. At 340000 steps both of these values stabilize at 0.34. After that, no matter how much I train, the train sync loss keeps decreasing (I trained it down to 0.24 at 1200000 steps) while the eval sync loss decreases much more slowly, and from some point (roughly 600000 steps) it simply stays at 0.3 and fluctuates there. It's a pity I don't have a plot of the losses :( but it seems like the train loss first reaches some "point of balance" and then decreases steadily, while the eval loss decreases to 0.3 and then stays there.

So, my questions are:

  • Did you train your model down to 0.2 eval sync loss, as advised in the original repo?
  • What were the values of your train sync loss at that point?
  • How many steps did it take? (I know this can differ depending on the training parameters, but still.)

I use the LRS2 dataset to train the wav2lip model, but the mouth movement is too small

I tried the author's Wav2Lip training pipeline with the whole LRS2 dataset, as the author did, but I got very small mouth movements. In contrast, I downloaded the author's wav2lip_gan.pth and tested it: it shows regular mouth movement that is bigger than mine.
So I want to ask whether you have any suggestions about this.

Looking forward to your reply, thanks!

Overfitting

Hello, can you please suggest any tricks to prevent SyncNet overfitting? Decreasing the learning rate doesn't help...
[image]

share video demos

Hi, thanks for this work.
Can you share some lip-sync video results of your work?
thanks

Scale up audio block

Hello, can you please suggest what layers I need to add to scale up the audio encoder? Thanks.
[image]
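
One generic way to add capacity to a convolutional audio encoder (an illustrative sketch only; the channel widths are assumptions and this is not this repo's audio_encoder): stack extra 3x3 conv + BatchNorm + PReLU blocks, or widen the channel counts, while keeping the overall downsampling of the 80x16 mel input unchanged.

import torch.nn as nn

def conv_block(cin, cout, stride=1):
    # Basic unit to stack: 3x3 conv + BatchNorm + PReLU.
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(cout),
        nn.PReLU(cout),
    )

# Example: two extra blocks appended at the end of an existing encoder
# (512 is an assumed channel width; match it to the layer you extend).
extra_stage = nn.Sequential(
    conv_block(512, 512),
    conv_block(512, 512),
)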

Training with new dataset

Hi, Happy Christmas!

I'd like to train the model with HQ images (320x320) and need some help.

There is a loss error at line 133 of "color_syncnet_train.py": the input to the loss is out of (0, 1).

It seems to work when I change the loss function
from
logloss = nn.BCELoss()
to
logloss = nn.BCEWithLogitsLoss()

Is it OK? Do I have to make other changes?

The training has now lasted more than 4 days, with the loss around 0.54 after 940,000 steps.

Thanks in advance.
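
Switching to nn.BCEWithLogitsLoss does avoid the (0, 1) assertion, because the sigmoid is applied inside the loss; the cosine similarity is then treated as a logit, which slightly changes the meaning of the loss but is a common workaround. A sketch of the adjusted cosine_loss under that assumption:

import torch.nn as nn
import torch.nn.functional as F

logloss = nn.BCEWithLogitsLoss()

def cosine_loss(a, v, y):
    # d is in [-1, 1]; BCEWithLogitsLoss applies sigmoid internally, so no
    # clamping is needed, but d is now interpreted as a logit rather than
    # a probability.
    d = F.cosine_similarity(a, v)
    return logloss(d.unsqueeze(1), y)

Keep in mind that any evaluation code which thresholds the raw cosine similarity should then be kept consistent with this change.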

Can you help me to create a budget to train a model?

Hello,

First of all, thanks for your contribution and sharing this very valuable code. You have done a fantastic job.

I have two questions:

1- How long did it take to train a model with 10 x A6000 GPUs (assuming each one was 48GB)?
2- How much did it cost to rent the GPU service from the time you started until the model training was finished?

I understand that you can't share a pre-trained model. For you that is a lot of hours of dedication and money.

It would help me to know those two points so I can make a rough budget and train my own model.

I know I could make estimated calculations, although the reality is always different in practice.

Thank you!

David

Inferencing after training wloss_hq_wav2lip_train.py

Hello, the pre-trained model named wav2lip_gan is used for inference. After my own training, there are two checkpoints: checkpoint_stepxxx and disc_checkpoint_stepxxx.
So, which checkpoint should I use for inference (when training with the visual quality disc)?

How to train with a custom dataset?

Hi, I am quite new to this. I am looking for a step-by-step guide to train on a custom dataset, OR to train on the AVSpeech dataset and fine-tune on other videos. The steps could be:

  • Download the dataset
  • Clean and convert to 25 fps [if the source is 30 fps, what should be done?]
  • Train
  • Finetune on custom videos
  • Test

I think such a guide will help a lot of people and avoid confusion.

Thank You.
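
A very rough end-to-end sketch of those steps, using the original Wav2Lip preprocessing script as a stand-in (the paths and the use of preprocess.py are assumptions; the two training commands mirror the README above):

import subprocess

# 1) Convert the raw videos to 25 fps first (see the ffmpeg sketches above).
# 2) Face-detect and crop with the original Wav2Lip preprocessing script.
subprocess.run(["python3", "preprocess.py",
                "--data_root", "raw_videos_25fps",
                "--preprocessed_root", "preprocessed"], check=True)

# 3) Write filelists/train.txt and filelists/val.txt with one full path per line.
# 4) Train the expert SyncNet, then the generator, as in the README.
subprocess.run(["python3", "train_syncnet_sam.py"], check=True)
subprocess.run(["python3", "hq_wav2lip_sam_train.py"], check=True)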
