primepake / wav2lip_288x288

License: MIT License

Python 99.86% Shell 0.14%
deep-learning generation generative talking-head video face-talking audio-driven-talking-face deep-fake deep-fakes image-animation talking-face talking-face-generation

wav2lip_288x288's Introduction

An improved, higher-resolution version of the Wav2Lip model.

Original repo: https://github.com/Rudrabha/Wav2Lip

Each line in the filelist should be a full path.
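
For reference, a minimal sketch of generating such filelists (the directory layout, the filelists/ output folder, and the 95/5 split are assumptions, following the original Wav2Lip convention of one preprocessed folder per clip):

import glob
import os
import random

preprocessed_root = "/data/preprocessed"  # assumption: adjust to your own data root
clip_dirs = sorted(glob.glob(os.path.join(preprocessed_root, "*")))
random.seed(0)
random.shuffle(clip_dirs)

os.makedirs("filelists", exist_ok=True)
split = int(0.95 * len(clip_dirs))  # assumed 95/5 train/val split
with open("filelists/train.txt", "w") as f:
    f.write("\n".join(clip_dirs[:split]))  # one full path per line
with open("filelists/val.txt", "w") as f:
    f.write("\n".join(clip_dirs[split:]))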

First, train SyncNet:

python3 train_syncnet_sam.py

Second, train Wav2Lip-SAM:

python3 hq_wav2lip_sam_train.py

Some demos from Chinese users: #89 (comment)

New Features: DINet full pipeline training

Original repo: https://github.com/MRzzm/DINet

  • SyncNet training using DeepSpeech
  • DINet frame training using DeepSpeech
  • DINet clip training using DeepSpeech

Citing

To cite this repository:

@misc{Wav2Lip,
  author={Rudrabha},
  title={Wav2Lip: Accurately Lip-syncing Videos In The Wild},
  year={2020},
  url={https://github.com/Rudrabha/Wav2Lip}
}

wav2lip_288x288's People

Contributors

sm-nocapinc


wav2lip_288x288's Issues

Access troubles on Linux

Hello, what OS did you use while working on this project?
Everything works fine on Windows, but on Linux I have some troubles: hparams.py doesn't see the filelists (even though they exist in the right directory), etc.
With the original Wav2Lip repo everything works fine on both OSes.
I think there is some problem with the files/directories in the repo: access rights, owners, or attributes. Did you set any special attributes or permissions on files/directories in your project?
Thanks

Running color_syncnet_train.py gives an error

my command:
python color_syncnet_train.py --data_root /home//dataset/myvideo_dataset/preprocess --checkpoint_dir /home//vir-person/wav2lip_288x288/myvideo_chekpoint

and then error:

Loss: 1.1169637313910894: : 14it [01:43,  2.70s/it]/pytorch/aten/src/THCUNN/BCECriterion.cu:42: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference<float>, thrust::device_reference<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [46,0,0] Assertion `input >= 0. && input <= 1.` failed.
Loss: 1.1169637313910894: : 14it [01:44,  7.47s/it]
Traceback (most recent call last):
  File "color_syncnet_train.py", line 279, in <module>
    nepochs=hparams.nepochs)
  File "color_syncnet_train.py", line 161, in train
    loss = cosine_loss(a, v, y)
  File "color_syncnet_train.py", line 136, in cosine_loss
    loss = logloss(d.unsqueeze(1), y)
  File "/root/anaconda3/envs/wav2lip/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/wav2lip/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 512, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/root/anaconda3/envs/wav2lip/lib/python3.6/site-packages/torch/nn/functional.py", line 2113, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: reduce failed to synchronize: device-side assert triggered
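
A likely cause, judging from the original Wav2Lip cosine loss (an assumption, not a confirmed diagnosis): F.cosine_similarity returns values in [-1, 1], while nn.BCELoss asserts that its input lies in [0, 1], so any negative similarity triggers the device-side assert above. A minimal sketch of a clamped variant of cosine_loss:

import torch
import torch.nn as nn
import torch.nn.functional as F

logloss = nn.BCELoss()

def cosine_loss(a, v, y):
    # cosine_similarity is in [-1, 1]; BCELoss requires probabilities in [0, 1],
    # so clamp (or rescale) the similarity before taking the log loss.
    d = F.cosine_similarity(a, v)
    d = torch.clamp(d, min=0.0, max=1.0)  # assumption: treat negatives as 0
    return logloss(d.unsqueeze(1), y)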

custom dataset

Hello @primepake, please help me with dataset correction:
my dataset is 30 FPS and I need to change it to 25. So far I have done the face detection, and the preprocessed data is frames and audio. What should I do to change the FPS?
thank you
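
One possible approach (a suggestion, not the repo's documented pipeline): convert the source videos to 25 fps with ffmpeg before face detection and re-run the preprocessing, rather than trying to retime frames that were already extracted at 30 fps. A rough sketch:

import glob
import os
import subprocess

src_dir = "raw_videos"      # hypothetical folder of 30 fps source videos
dst_dir = "videos_25fps"    # hypothetical output folder
os.makedirs(dst_dir, exist_ok=True)

for path in glob.glob(os.path.join(src_dir, "*.mp4")):
    out_path = os.path.join(dst_dir, os.path.basename(path))
    # Re-encode video to 25 fps and resample audio to 16 kHz in one pass.
    subprocess.run([
        "ffmpeg", "-y", "-i", path,
        "-r", "25", "-c:v", "libx264",
        "-ar", "16000", "-c:a", "aac",
        out_path,
    ], check=True)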

Kernel size can't be greater than actual input size

Thanks for your sharing!

I tried your model and found some problems. Here are the details:

1. When loading wav2lip.pth, a "Missing key(s) in state_dict" error occurred. I made some changes in inference and it was solved. Can you please check whether it's all right?
def load_model(path):
    model = Wav2Lip()
    print("Load checkpoint from: {}".format(path))
    checkpoint = _load(path)
    # s = checkpoint["state_dict"]
    # new_s = {}
    # for k, v in s.items():
    #     new_s[k.replace('module.', '')] = v
    model.load_state_dict(checkpoint, False)
    model = model.to(device)
    return model.eval()

2. New error:
Using cuda for inference.
Reading video frames...
Number of frames available for inference: 128
(80, 377)
Length of mel chunks: 115
Recovering from OOM error; New batch size: 8 | 0/1 [00:00<?, ?it/s]
Load checkpoint from: checkpoints/wav2lip_gan.pth | 0/8 [00:00<?, ?it/s]
Model loaded####################################| 15/15 [00:23<00:00, 1.59s/it]

Traceback (most recent call last):
  File "inference.py", line 373, in <module>
    main()
  File "inference.py", line 354, in main
    pred = model(mel_batch, img_batch)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/wav2lip_288x288-ckpt-mismatch/models/wav2lipv2.py", line 117, in forward
    x = f(x)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/wav2lip_288x288-ckpt-mismatch/models/conv2.py", line 16, in forward
    out = self.conv_block(x)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

Do you have any suggestions?
Thanks in advance.
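
One plausible cause of this particular error (an educated guess, not confirmed by the maintainer): the inference script still prepares face crops at the original 96x96 Wav2Lip resolution, while the 288x288 model has more downsampling stages, so the feature map collapses to 2x2 before a 3x3 convolution is applied. A minimal sketch of the idea, assuming an img_size setting that should match the training resolution:

import cv2

IMG_SIZE = 288  # assumption: the 288x288 checkpoint expects 288x288 face crops

def prepare_face(face_bgr):
    # Resize the detected face crop to the resolution the model was trained on.
    # Feeding 96x96 crops into the deeper 288x288 encoder shrinks the feature
    # map below the 3x3 kernel size and raises the RuntimeError above.
    return cv2.resize(face_bgr, (IMG_SIZE, IMG_SIZE))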

about the inference error

Hi, I used your code to run the training without any change to the model. The training is normal, but when I run the inference code the error below occurs. Have you ever met this error? Thank you.

Model loaded
0%| | 0/2 [00:21<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 277, in <module>
    main()
  File "inference.py", line 262, in main
    pred = model(mel_batch, img_batch)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "wav2lip/HD_wav2lip/wav2lip_288x288/models/wav2lipv2.py", line 117, in forward
    x = f(x)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "wav2lip/HD_wav2lip/wav2lip_288x288/models/conv2.py", line 16, in forward
    out = self.conv_block(x)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

About the AVSpeech dataset

The full AVSpeech dataset is roughly 1500 GB, which is too large for me to download; would you be willing to share your copy with me? Thanks a lot.

modification about Wasserstein

Hi, sorry to trouble you again.
I noticed that some modifications related to Wasserstein loss are mentioned compared to the original version. As far as I know, if Wasserstein loss is used, the final sigmoid layer of the discriminator needs to be removed, and each time the discriminator's parameters are updated they should be clipped to a constant, and so on.
But I can't find these modifications in the code. Am I misunderstanding the Wasserstein setup? Looking forward to your help.
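
For reference, the standard WGAN recipe the question refers to looks roughly like this (a generic sketch, not this repository's actual discriminator update; disc, disc_opt, gt, and fake_img are placeholders):

import torch

CLIP_VALUE = 0.01  # weight-clipping constant from the original WGAN paper

def critic_step(disc, disc_opt, gt, fake_img):
    # WGAN critic: no sigmoid on the output; maximize D(real) - D(fake).
    disc_opt.zero_grad()
    loss = -(disc(gt).mean() - disc(fake_img.detach()).mean())
    loss.backward()
    disc_opt.step()
    # Clip critic weights after each update to enforce the Lipschitz constraint.
    for p in disc.parameters():
        p.data.clamp_(-CLIP_VALUE, CLIP_VALUE)
    return loss.item()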

Are you providing model weights?

I see that your dataset is private, but are you providing your model weights? Are you only providing training files? It looks like all documentation and links are for the original Rudrabha implementation, so it's a little hard to tell.

Thanks

What's the difference

Hi, thanks for this work.
May I ask what the differences are between this work and Wav2Lip, other than the network structure?

RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

I have trained wloss_hq_wav2lip_train.py and used checkpoint checkpoint_step000003000.pth for inference

python inference.py --checkpoint_path "/content/gdrive/MyDrive/wav2lip_288x288/checkpoints/checkpoint.pth"  --face "/content/gdrive/MyDrive/Wav2Lip/video.mp4" --audio "/content/gdrive/MyDrive/Wav2Lip/input_audio.wav" 
Using cuda for inference.
Reading video frames...
Number of frames available for inference: 5760
/usr/local/lib/python3.7/dist-packages/librosa/core/audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
(80, 222)
Length of mel chunks: 157
  0% 0/2 [00:00<?, ?it/s]
  0% 0/10 [00:00<?, ?it/s]
 10% 1/10 [00:06<00:56,  6.25s/it]
 20% 2/10 [00:07<00:26,  3.37s/it]
 30% 3/10 [00:08<00:17,  2.45s/it]
 40% 4/10 [00:10<00:12,  2.02s/it]
 50% 5/10 [00:11<00:08,  1.77s/it]
 60% 6/10 [00:13<00:06,  1.63s/it]
 70% 7/10 [00:14<00:04,  1.54s/it]
 80% 8/10 [00:15<00:02,  1.48s/it]
 90% 9/10 [00:17<00:01,  1.44s/it]
100% 10/10 [00:21<00:00,  2.10s/it]
Load checkpoint from: /content/gdrive/MyDrive/wav2lip_288x288/checkpoints/checkpoint.pth
Model loaded
  0% 0/2 [00:24<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 280, in <module>
    main()
  File "inference.py", line 263, in main
    pred = model(mel_batch, img_batch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/wav2lip_288x288/models/wav2lipv2.py", line 117, in forward
    x = f(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/wav2lip_288x288/models/conv2.py", line 16, in forward
    out = self.conv_block(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

color_syncnet_train loss is too high for a real video

I recorded a video with my phone and tested the loss with color_syncnet_train.py by loading the pretrained model lipsync_expert.pth; it shows a loss of around 1.3, not the well-trained loss of around 0.3. I also tested a video generated by Wav2Lip, but it still shows a value that is too high, around 1.0. Why is that? I tried these because I also can't train my dataset down to a good loss of around 0.3; it stays pinned at 0.69. I have cleaned my dataset to 25 fps and 16000 Hz and filtered it with syncnet_python.

inference error

hi, may I ask a question?

I ran color_syncnet_train.py and wav2lip_train.py.

When I use the checkpoint from wav2lip_train.py in inference.py, the error is as follows:

RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size.

Why does my wav2lip_train.py always keep sync_loss = 0.0?

the command is :
$ python wav2lip_train.py --data_root enhanced/ --checkpoint_dir checkpoints/ --syncnet_checkpoint_path
checkpoints/syncnet_checkpoint_step000120000.pth --checkpoint_path checkpoints/checkpoint_step000075000.pth

The training output is:
L1: 0.02834209179919627, Sync Loss: 0.0: : 540it [09:59, 1.05s/it]

Why is the Sync Loss always 0.0?
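
For context, in the original Wav2Lip training script the sync term only enters the total loss once the expert's evaluation sync loss drops below a threshold; until then syncnet_wt stays at 0 and the logged Sync Loss is 0.0. A paraphrase of that upstream logic (the exact values in this repo may differ):

def combined_loss(l1_loss, sync_loss, syncnet_wt):
    # With syncnet_wt == 0 the sync term contributes nothing to the total loss.
    return syncnet_wt * sync_loss + (1.0 - syncnet_wt) * l1_loss

def maybe_enable_sync_loss(average_sync_loss, hparams):
    # Upstream gate: switch the sync weight on only after the expert's
    # evaluation loss drops below 0.75.
    if average_sync_loss < 0.75:
        hparams.set_hparam('syncnet_wt', 0.01)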

help

Hello, could you please send me your weights?

Can you share the pretrained weights from AVSpeech?

Thanks for your nice work!
I am trying to use wav2lip_288x288 on my own dataset, but my dataset is small, so I need to pretrain on a large dataset (e.g. AVSpeech). It has to be said that such training has a large cost in time and compute.
So, could you share your pretrained weights? Thank you.

Avspeech train/valid split

Hello, how did you split your train/valid dataset?
If each video is divided into 5-second clips, can clips from the same video end up in both the train set and the valid set?

Sample outputs

Hello!
Can you please share some samples of your model's inference?
Any videos with your high-quality 288x288 px lip sync would do.
I just want to find out whether the HQ Wav2Lip training produces reasonable results.
Thanks

(not the model, only samples of its work)

Can you share a demo?

Your work is amazing. I'm interested in this field. Could you share the demo with me? I would appreciate it very much.

Have you tried to fine-tune into a single-person model?

I want to build a personalized Wav2Lip, but fine-tuning it on a single person (a 5-minute video) does not give usable results. I am wondering whether a longer video, maybe 1 hour, would give better results. I just want to see if you have any experience with this.

Preprocessing issue for fps conversion

Hi, if I change the fps to 25 as recommended for the Wav2Lip model, the audio sync also changes.
How do I get the audio sync right again?

The following command is used to change the fps:
"ffmpeg -y -i {input_path} -r 25 {out_path} -hide_banner -loglevel error"

To fix the sync, after I got the offset of the video using SyncNet, I also used the following sync-shifting command, but its output does not match the correct synchronization:

f"ffmpeg -y -i {input_path} -itsoffset {shift} -i {input_path} -ss {shift} -t {all duration of video - abs(shift)} -map 0:v -map 1:a {out_path} -hide_banner -loglevel error"

About the PReLU & LeakyReLU improvement

Thanks for your great work!

I have used the original Wav2Lip model to dub in-the-wild videos, and found that there is occasionally some abnormal color in the mouth.

I think the reason is that the original Wav2Lip model lacks a proper softmax/ReLU step when generating color.

I read in your README that the 288x288 model uses more powerful ReLU variants in the convnet.

I have 2 questions about the improvement:
Q1: Why did you choose to switch to PReLU & LeakyReLU? Can you give some typical scenarios where PReLU/LeakyReLU works better?
Q2: Will the 288x288 model eliminate the abnormal color in the mouth?

Some bad cases from the original Wav2Lip model:

[images: examples of abnormal mouth color]

color_syncnet_train.py

So, I have preprocessed a part of the AVSpeech dataset and ran the command for color_syncnet_train.py. The filelists (train.txt and val.txt) are set up properly as well, but the output looks like this:

python color_syncnet_train.py --data_root preprocess --checkpoint_dir checkpoints/
use_cuda: True
total trainable params 19881059
0it [00:00, ?it/s]Saved checkpoint: checkpoints/checkpoint_step000000001.pth
Loss: 1.0145502090454102: : 1it [16:59, 1019.31s/it]
0it [00:00, ?it/s]

Is this going fine? Why does it run for only 1 iteration?
There are around 49 entries in train.txt and 26 in val.txt (all of them preprocessed). The folder paths look fine to me as well; I'm not sure what I am messing up.

Very slow SyncNet training process

Hello!
Did you modify the data loader? This implementation seems to have some speed troubles:
after 48 h of training, the loss has dropped only from 0.7 to 0.6.
That is 94 epochs (~29.5k steps). It's kinda slow... I think the problem is in the dataloader.
This is on a Tesla V100 GPU.

My params:
lr 1e-5
batch size 128
loss func nn.BCEWithLogitsLoss()

The dataset has 40k videos.
All videos are filtered and sync-corrected, and each has only one face at a decent resolution.

Can you please give any suggestions?
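
A generic first pass at speeding up the SyncNet data pipeline (an illustrative sketch, not a profile of this repo; train_dataset stands for the repo's Dataset instance): increase num_workers, pin host memory, and keep workers alive across epochs so image/audio decoding happens off the main process.

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,            # assumed: the repo's SyncNet Dataset instance
    batch_size=128,
    shuffle=True,
    num_workers=16,           # parallel JPEG/mel decoding; tune to CPU core count
    pin_memory=True,          # faster host-to-GPU transfers
    persistent_workers=True,  # avoid re-spawning workers every epoch
    drop_last=True,
)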

MTCNN

Did you write this MTCNN yourself, or did you use an existing package? Can you share it? I'm working on clear_data.py.

Preprocessing of AvSpeech dataset

Hi,
I want to train the Wav2Lip model on the AVSpeech dataset, but I am stuck on preprocessing AVSpeech for Wav2Lip. I have downloaded the data, which is a CSV file, and I am trying yt-dlp to fetch the YouTube videos, but I am not able to. Do you have any script for downloading the data and processing it for training?
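
A rough download-and-cut sketch, assuming the AVSpeech CSV columns are youtube_id, start_sec, end_sec, face_x, face_y and that yt-dlp and ffmpeg are installed (both the layout and the tooling are assumptions, not a script from this repo):

import csv
import os
import subprocess

def download_clips(csv_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path) as f:
        for i, row in enumerate(csv.reader(f)):
            yt_id, start, end = row[0], float(row[1]), float(row[2])
            full = os.path.join(out_dir, f"{yt_id}.mp4")
            clip = os.path.join(out_dir, f"{yt_id}_{i}.mp4")
            # Download the whole video once, then cut the labelled segment
            # with ffmpeg and force 25 fps for Wav2Lip preprocessing.
            if not os.path.exists(full):
                subprocess.run(["yt-dlp", "-f", "mp4", "-o", full,
                                f"https://www.youtube.com/watch?v={yt_id}"], check=True)
            subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-to", str(end),
                            "-i", full, "-r", "25", "-c:v", "libx264",
                            "-c:a", "aac", clip], check=True)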

wloss_hq_wav2lip_train.py

Hello, I have a question about line 267 in the mentioned file:

interpolates = alpha * gt + ((1 - alpha) * fake_img)

What is fake_img? It isn't defined anywhere; from which variable or function should I get it?
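
In the usual WGAN-GP formulation that this line follows, fake_img is simply the generator's output for the same batch as gt (detached before interpolation). A generic gradient-penalty sketch, not this repo's exact code:

import torch

def gradient_penalty(disc, gt, fake_img, device):
    # fake_img: the generator's predicted frames for the same batch as gt.
    alpha = torch.rand(gt.size(0), 1, 1, 1, device=device)  # assumes 4D image tensors
    interpolates = alpha * gt + (1 - alpha) * fake_img.detach()
    interpolates.requires_grad_(True)
    d_out = disc(interpolates)
    grads = torch.autograd.grad(
        outputs=d_out, inputs=interpolates,
        grad_outputs=torch.ones_like(d_out),
        create_graph=True, retain_graph=True,
    )[0]
    return ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()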

AVSpeech SyncNet training stuck around loss ~0.5 for a long time

I used AVSpeech data to train SyncNet for 285 hours, and the training loss dropped from 0.69 to 0.5. It has now been stuck around 0.5 for a long time. What could be the reason for this? I filtered AVSpeech with syncnet_python, keeping offsets between -1 and 1.

Evaluating for 1400 steps 0.6834709048271179

Loss: 0.5212869264862754: : 11it [00:06, 1.59it/s]
Loss: 0.516353040933609: : 11it [00:06, 1.66it/s]
Loss: 0.506797579201785: : 11it [00:06, 1.58it/s]
Loss: 0.5059969479387457: : 11it [00:06, 1.64it/s]
Loss: 0.5300133119929921: : 11it [00:06, 1.63it/s]
Loss: 0.5310295034538616: : 11it [00:06, 1.58it/s]
Loss: 0.517237208106301: : 11it [00:07, 1.56it/s]
Loss: 0.534422142939134: : 11it [00:06, 1.65it/s]
Loss: 0.5230060680346056: : 11it [00:06, 1.58it/s]
Loss: 0.5333795303648169: : 11it [00:06, 1.62it/s]
Loss: 0.5387652733109214: : 11it [00:06, 1.59it/s]
Loss: 0.5229526162147522: : 11it [00:07, 1.51it/s]
Loss: 0.5356160998344421: : 11it [00:06, 1.61it/s]
Loss: 0.5272310484539379: : 11it [00:06, 1.61it/s]
Loss: 0.5080998133529316: : 11it [00:06, 1.61it/s]
Loss: 0.5103642046451569: : 11it [00:06, 1.65it/s]
Loss: 0.5402759123932231: : 11it [00:06, 1.60it/s]
Loss: 0.5128899162465875: : 11it [00:06, 1.58it/s]
Loss: 0.5354925204407085: : 11it [00:07, 1.57it/s]
Loss: 0.5273735089735552: : 11it [00:06, 1.61it/s]
Loss: 0.5145161287351088: : 11it [00:06, 1.65it/s]
Loss: 0.5218090279535814: : 11it [00:07, 1.52it/s]
Loss: 0.5303930558941581: : 11it [00:06, 1.63it/s]
Loss: 0.5121880417520349: : 11it [00:06, 1.59it/s]
Loss: 0.5163682943040674: : 11it [00:06, 1.64it/s]
Loss: 0.5399262742562727: : 11it [00:06, 1.61it/s]
Loss: 0.5241495452143929: : 11it [00:06, 1.61it/s]
Loss: 0.5095025599002838: : 11it [00:06, 1.61it/s]
Loss: 0.5246310071511702: : 11it [00:07, 1.47it/s]
Loss: 0.5254473957148466: : 11it [00:06, 1.61it/s]
Loss: 0.5240989273244684: : 11it [00:06, 1.58it/s]
Loss: 0.5143423215909437: : 11it [00:07, 1.54it/s]
Loss: 0.5383796854452654: : 11it [00:06, 1.62it/s]
Loss: 0.5364107299934734: : 11it [00:06, 1.64it/s]
Loss: 0.535735699263486: : 11it [00:06, 1.63it/s]
Loss: 0.5260600068352439: : 11it [00:07, 1.53it/s]
Loss: 0.5264727879654277: : 11it [00:06, 1.65it/s]
Loss: 0.5269337269392881: : 11it [00:06, 1.58it/s]
Loss: 0.5506553758274425: : 11it [00:07, 1.41it/s]
Loss: 0.5356145446950739: : 11it [00:06, 1.68it/s]
Loss: 0.53057600422339: : 11it [00:06, 1.65it/s]
Loss: 0.5183943212032318: : 11it [00:07, 1.55it/s]
Loss: 0.5093956183303486: : 11it [00:07, 1.55it/s]
Loss: 0.5288541885939512: : 11it [00:07, 1.46it/s]
Loss: 0.5268966989083723: : 11it [00:06, 1.62it/s]
Loss: 0.5296651070768182: : 11it [00:07, 1.57it/s]
Loss: 0.5330738831650127: : 11it [00:07, 1.57it/s]
Loss: 0.5453376011414961: : 11it [00:07, 1.43it/s]
Loss: 0.5334051115946337: : 11it [00:06, 1.62it/s]
Loss: 0.5332772217013619: : 11it [00:06, 1.58it/s]
Loss: 0.5456769385121085: : 11it [00:06, 1.59it/s]
Loss: 0.5276198793541301: : 11it [00:06, 1.61it/s]
Loss: 0.5252100364728407: : 11it [00:06, 1.59it/s]
Loss: 0.547866637056524: : 11it [00:06, 1.59it/s]
Loss: 0.534071144732562: : 11it [00:07, 1.50it/s]
Loss: 0.5364801423116163: : 11it [00:06, 1.63it/s]
Loss: 0.5198463797569275: : 11it [00:06, 1.58it/s]
Loss: 0.5080715119838715: : 11it [00:06, 1.61it/s]
Loss: 0.5239014219154011: : 11it [00:07, 1.54it/s]
Loss: 0.5345901250839233: : 11it [00:06, 1.63it/s]
Loss: 0.5356171185320074: : 11it [00:07, 1.43it/s]
Loss: 0.5243599035523154: : 11it [00:06, 1.61it/s]
Loss: 0.5395512635057623: : 11it [00:06, 1.58it/s]
Loss: 0.526209909807552: : 11it [00:07, 1.56it/s]
Loss: 0.5344010916623202: : 11it [00:08, 1.34it/s]
Loss: 0.5253535590388558: : 11it [00:06, 1.58it/s]
Loss: 0.5191849280487407: : 11it [00:06, 1.59it/s]
Loss: 0.5137904584407806: : 11it [00:07, 1.54it/s]
Loss: 0.5361907888542522: : 11it [00:07, 1.55it/s]
Loss: 0.5215925763953816: : 11it [00:07, 1.56it/s]
Loss: 0.5253916762091897: : 11it [00:07, 1.56it/s]
Loss: 0.5226842273365367: : 11it [00:06, 1.61it/s]
Loss: 0.5378378033638: : 11it [00:07, 1.56it/s]
Loss: 0.5154962214556608: : 11it [00:06, 1.59it/s]
Loss: 0.5151732536879453: : 11it [00:06, 1.64it/s]
Loss: 0.5253660435026343: : 11it [00:06, 1.59it/s]
Loss: 0.5318919772451575: : 11it [00:07, 1.57it/s]
Loss: 0.5285821773789146: : 11it [00:07, 1.52it/s]
Loss: 0.5203951353376562: : 11it [00:06, 1.59it/s]
Loss: 0.5248001217842102: : 11it [00:06, 1.64it/s]
Loss: 0.5474389899860729: : 11it [00:06, 1.59it/s]
Loss: 0.5239813354882327: : 11it [00:06, 1.59it/s]
Loss: 0.510086715221405: : 11it [00:07, 1.53it/s]
Loss: 0.5268921526995572: : 11it [00:06, 1.59it/s]
Loss: 0.5242643247951161: : 11it [00:08, 1.35it/s]
Loss: 0.5328544540838762: : 11it [00:06, 1.61it/s]
Loss: 0.5278959978710521: : 11it [00:07, 1.51it/s]
Loss: 0.5087671984325756: : 11it [00:06, 1.64it/s]
Loss: 0.5189093026247892: : 11it [00:06, 1.58it/s]
Loss: 0.5501838326454163: : 11it [00:08, 1.37it/s]
Loss: 0.5115621387958527: : 11it [00:07, 1.55it/s]
Loss: 0.5253327245062048: : 11it [00:06, 1.61it/s]
Loss: 0.5272630588574843: : 11it [00:06, 1.61it/s]
Loss: 0.5148500204086304: : 11it [00:07, 1.54it/s]
Loss: 0.5340689122676849: : 11it [00:06, 1.60it/s]
Loss: 0.5341784683140841: : 11it [00:07, 1.54it/s]
Loss: 0.521747889843854: : 11it [00:06, 1.62it/s]
Loss: 0.5119168541648171: : 11it [00:06, 1.63it/s]
Loss: 0.5264650989662517: : 11it [00:07, 1.54it/s]
Loss: 0.5414948653091084: : 11it [00:06, 1.58it/s]
Loss: 0.5448070005937056: : 11it [00:06, 1.63it/s]
Loss: 0.510159890760075: : 11it [00:06, 1.59it/s]
Loss: 0.5324155363169584: : 11it [00:07, 1.57it/s]
Loss: 0.5363136529922485: : 11it [00:06, 1.59it/s]
Loss: 0.527454985813661: : 11it [00:06, 1.61it/s]
Loss: 0.5277877937663685: : 11it [00:07, 1.41it/s]
Loss: 0.524616306478327: : 11it [00:06, 1.58it/s]
Loss: 0.5203610306436365: : 11it [00:07, 1.55it/s]
Loss: 0.5298510220917788: : 11it [00:07, 1.57it/s]
Loss: 0.5173395330255682: : 11it [00:07, 1.56it/s]
Loss: 0.533916874365373: : 11it [00:06, 1.66it/s]
Loss: 0.5159734921021895: : 11it [00:06, 1.63it/s]
Loss: 0.5441486456177451: : 11it [00:06, 1.60it/s]
Loss: 0.5260709740898826: : 11it [00:06, 1.64it/s]
Loss: 0.5441469766876914: : 11it [00:06, 1.60it/s]
Loss: 0.5286740227179094: : 11it [00:06, 1.65it/s]
Loss: 0.5092522854154761: : 11it [00:06, 1.58it/s]
Loss: 0.512138466943394: : 11it [00:06, 1.61it/s]
Loss: 0.5216747386889025: : 11it [00:07, 1.57it/s]
Loss: 0.5291207080537622: : 11it [00:07, 1.49it/s]
Loss: 0.5329246493903074: : 11it [00:06, 1.61it/s]
Loss: 0.5120911191810261: : 11it [00:07, 1.55it/s]
Loss: 0.5228245149959218: : 11it [00:07, 1.57it/s]
Loss: 0.5253384844823317: : 11it [00:06, 1.64it/s]
Loss: 0.5360355079174042: : 11it [00:07, 1.52it/s]
Loss: 0.5080934898419813: : 11it [00:06, 1.57it/s]
Loss: 0.5472726388411089: : 11it [00:07, 1.53it/s]
Loss: 0.5124477256428112: : 11it [00:07, 1.52it/s]
Loss: 0.5321660935878754: : 11it [00:06, 1.58it/s]
Loss: 0.5292721851305529: : 11it [00:06, 1.59it/s]
Loss: 0.5342415164817463: : 11it [00:06, 1.64it/s]
Loss: 0.5322149070826444: : 11it [00:07, 1.56it/s]
Loss: 0.5233434005217119: : 11it [00:06, 1.60it/s]
Loss: 0.5123658369887959: : 11it [00:07, 1.53it/s]
Loss: 0.5206985040144487: : 11it [00:07, 1.41it/s]
Loss: 0.5344440151344646: : 11it [00:06, 1.63it/s]
Loss: 0.5306854437698018: : 11it [00:06, 1.57it/s]
Loss: 0.5189629088748585: : 11it [00:06, 1.62it/s]
Loss: 0.551388068632646: : 11it [00:07, 1.53it/s]
Loss: 0.5213939059864391: : 11it [00:06, 1.61it/s]
Loss: 0.5267486572265625: : 11it [00:06, 1.61it/s]
Loss: 0.5126107416369698: : 11it [00:06, 1.64it/s]
Loss: 0.5210787735202096: : 11it [00:07, 1.52it/s]
Loss: 0.5091477849266746: : 11it [00:06, 1.62it/s]
Loss: 0.5227019597183574: : 11it [00:07, 1.50it/s]
Loss: 0.5210425935008309: : 11it [00:06, 1.59it/s]
Loss: 0.537912601774389: : 11it [00:07, 1.52it/s]
Loss: 0.5248223380608992: : 11it [00:07, 1.57it/s]
Loss: 0.517209218306975: : 11it [00:07, 1.44it/s]
Loss: 0.505943544886329: : 11it [00:06, 1.61it/s]
Loss: 0.5258536582643335: : 11it [00:06, 1.59it/s]
Loss: 0.5366070893677798: : 11it [00:06, 1.63it/s]
Loss: 0.5362473644993522: : 11it [00:07, 1.54it/s]

Did you retrain the sync_expert network?

Hi, did you retrain the lipsync_expert? I used your training code and reloaded the original pretrained lipsync_expert.pth (downloaded from https://github.com/Rudrabha/Wav2Lip), and I get a shape mismatch error as follows:

Missing key(s) in state_dict: "face_encoder.0.act.weight", "face_encoder.1.act.weight", "face_encoder.2.act.weight", "face_encoder.3.act.weight", "face_encoder.4.act.weight", "face_encoder.5.act.weight", "face_encoder.6.act.weight", "face_encoder.7.act.weight", "face_encoder.8.act.weight", "face_encoder.9.act.weight", "face_encoder.10.act.weight", "face_encoder.11.act.weight", "face_encoder.12.act.weight", "face_encoder.13.act.weight", "face_encoder.14.act.weight", "face_encoder.15.act.weight", "face_encoder.16.act.weight", "face_encoder.17.conv_block.0.weight", "face_encoder.17.conv_block.0.bias", "face_encoder.17.conv_block.1.weight", "face_encoder.17.conv_block.1.bias", "face_encoder.17.conv_block.1.running_mean", "face_encoder.17.conv_block.1.running_var", "face_encoder.17.act.weight", "face_encoder.18.conv_block.0.weight", "face_encoder.18.conv_block.0.bias", "face_encoder.18.conv_block.1.weight", "face_encoder.18.conv_block.1.bias", "face_encoder.18.conv_block.1.running_mean", "face_encoder.18.conv_block.1.running_var", "face_encoder.18.act.weight", "face_encoder.19.conv_block.0.weight", "face_encoder.19.conv_block.0.bias", "face_encoder.19.conv_block.1.weight", "face_encoder.19.conv_block.1.bias", "face_encoder.19.conv_block.1.running_mean", "face_encoder.19.conv_block.1.running_var", "face_encoder.19.act.weight", "face_encoder.20.conv_block.0.weight", "face_encoder.20.conv_block.0.bias", "face_encoder.20.conv_block.1.weight", "face_encoder.20.conv_block.1.bias", "face_encoder.20.conv_block.1.running_mean", "face_encoder.20.conv_block.1.running_var", "face_encoder.20.act.weight", "audio_encoder.0.act.weight", "audio_encoder.1.act.weight", "audio_encoder.2.act.weight", "audio_encoder.3.act.weight", "audio_encoder.4.act.weight", "audio_encoder.5.act.weight", "audio_encoder.6.act.weight", "audio_encoder.7.act.weight", "audio_encoder.8.act.weight", "audio_encoder.9.act.weight", "audio_encoder.10.act.weight", "audio_encoder.11.act.weight", "audio_encoder.12.act.weight", "audio_encoder.13.act.weight". 
size mismatch for face_encoder.1.conv_block.0.weight: copying a param with shape torch.Size([64, 32, 5, 5]) from checkpoint, the shape in current model is torch.Size([32, 32, 5, 5]).
size mismatch for face_encoder.1.conv_block.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.1.conv_block.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.1.conv_block.1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.1.conv_block.1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.1.conv_block.1.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.2.conv_block.0.weight: copying a param with shape torch.Size([64, 64, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 32, 3, 3]).
size mismatch for face_encoder.2.conv_block.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for face_encoder.2.conv_block.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]

Percep: 0.0 | Fake: 100.0, Real: 0.0

@primepake Hello, thanks for your nice work. I have recently encountered some difficulties training on my own dataset (following your data preparation suggestions) with your shared code.
When I run python hq_wav2lip_train.py, the training log is:

use_cuda: True
Load 2687 audio feats.
use_cuda: True, MULTI_GPU: True
total trainable params 48520755
total DISC trainable params 18210561
Load checkpoint from: checkpoint_syn/checkpoint_step000171000.pth
Starting Epoch: 0
Saved checkpoint: checkpoint_step000000001.pth
Saved checkpoint: disc_checkpoint_step000000001.pth
L1: 0.2313518226146698, Sync: 0.0, Percep: 0.711134135723114 | Fake: 0.6754781603813171, Real: 0.711134135723114
L1: 0.21765484660863876, Sync: 0.0, Percep: 0.709110289812088 | Fake: 0.677438884973526, Real: 0.709110289812088
L1: 0.2188651313384374, Sync: 0.0, Percep: 0.7070923844973246 | Fake: 0.6794042587280273, Real: 0.7070923844973246
L1: 0.2144552432000637, Sync: 0.0, Percep: 0.7049891352653503 | Fake: 0.6814645230770111, Real: 0.7049891352653503
L1: 0.21138261258602142, Sync: 0.0, Percep: 0.7029658198356629 | Fake: 0.6834565877914429, Real: 0.702965784072876
L1: 0.20817621052265167, Sync: 0.0, Percep: 0.7010621925195059 | Fake: 0.6853393216927847, Real: 0.7010620832443237
L1: 0.20210737415722438, Sync: 0.0, Percep: 0.6996434501239231 | Fake: 0.6867433360644749, Real: 0.6996431180409023
L1: 0.19812600128352642, Sync: 0.0, Percep: 0.6987411752343178 | Fake: 0.6876341179013252, Real: 0.6987397372722626
L1: 0.19437309437327915, Sync: 0.0, Percep: 0.6981025603082445 | Fake: 0.6882637408044603, Real: 0.6980936461024814
...
L1: 0.12470868316135908, Sync: 0.0, Percep: 0.703860961136065 | Fake: 0.7219482924593122, Real: 0.69152611988952
L1: 0.12432418142755826, Sync: 0.0, Percep: 0.7042167019098997 | Fake: 0.7212010175765803, Real: 0.6919726772157446
L1: 0.1240154696033173, Sync: 0.0, Percep: 0.7046547470633516 | Fake: 0.7203877433827243, Real: 0.6924839418497868
L1: 0.12360538116523198, Sync: 0.0, Percep: 0.7050436321569948 | Fake: 0.7196274266711303, Real: 0.6929315188026521
L1: 0.12324049579675751, Sync: 0.0, Percep: 0.7055533739051434 | Fake: 0.7187670722904832, Real: 0.6934557074776173
Evaluating for 300 steps
L1: 0.08894559927284718, Sync: 7.633765455087026, Percep: 0.8102987110614777 | Fake: 0.5884036968151728, Real: 0.7851572235425314
L1: 0.12352411426603795, Sync: 0.0, Percep: 0.7059701490402222 | Fake: 0.7179982627183199, Real: 0.6938216637895676
L1: 0.12316158214713087, Sync: 0.0, Percep: 0.7069740667201505 | Fake: 0.7167386501879975, Real: 0.6947587119988258
L1: 0.12274648358716685, Sync: 0.0, Percep: 0.7086389707584008 | Fake: 0.7149891035960001, Real: 0.6961143705702852
L1: 0.122441600682666, Sync: 0.0, Percep: 0.7101659479650478 | Fake: 0.7133496456499239, Real: 0.6971959297446922
L1: 0.12237166498716061, Sync: 0.0, Percep: 0.7136020838068082 | Fake: 0.7105454275957667, Real: 0.6988249019290939
L1: 0.12226668624650865, Sync: 0.0, Percep: 0.7165913383165995 | Fake: 0.7080283074861481, Real: 0.6995955426850179
...
L1: 0.10978432702055822, Sync: 0.0, Percep: 0.7658024462613563 | Fake: 0.8152413980393286, Real: 0.6154363644432882
L1: 0.10972586760557995, Sync: 0.0, Percep: 0.7692448822380323 | Fake: 0.8124342786812339, Real: 0.6124296340577328
L1: 0.10953026241862897, Sync: 0.0, Percep: 0.7743397405519289 | Fake: 0.8092221762695191, Real: 0.610396875484025
L1: 0.10939421248741639, Sync: 0.0, Percep: 0.7824273446941963 | Fake: 0.8055855450166116, Real: 0.6093728619629893
L1: 0.1091919630309757, Sync: 0.0, Percep: 0.7908333182386574 | Fake: 0.8019456527194126, Real: 0.6076706493660488
L1: 0.10912049508790679, Sync: 0.0, Percep: 0.7942769879668473 | Fake: 0.7993013052296446, Real: 0.6045908347347031
L1: 0.10908047136182737, Sync: 0.0, Percep: 0.7925851992034527 | Fake: 0.8665490227044159, Real: 0.6015381057315442
L1: 0.10930284824053846, Sync: 0.0, Percep: 0.7917964988368816 | Fake: 0.8661491764380034, Real: 0.6038172583345233
Evaluating for 300 steps
L1: 0.5731244529287021, Sync: 9.090377567211787, Percep: 1.654488068819046 | Fake: 0.21219109917680423, Real: 1.4065884272257487
L1: 0.10955245170742273, Sync: 0.0, Percep: 0.9697534098973847 | Fake: 0.8618184333870491, Real: 0.707599309967045
L1: 0.10959959211782437, Sync: 0.0, Percep: 0.9730155654561438 | Fake: 0.8586212793076295, Real: 0.7107085656626347
L1: 0.1096739404936238, Sync: 0.0, Percep: 0.9733813980183366 | Fake: 0.8565126432325659, Real: 0.7121740724053752
L1: 0.10970133425566951, Sync: 0.0, Percep: 0.9722086560893506 | Fake: 0.8555087234860062, Real: 0.7122941125653687
L1: 0.10975395110161866, Sync: 0.0, Percep: 0.9709689952921027 | Fake: 0.8545878388406093, Real: 0.7123311641033997
L1: 0.10966527403854742, Sync: 0.0, Percep: 0.9697017977636012 | Fake: 0.8537138932196765, Real: 0.7123305465982057
...
L1: 0.10242784435932453, Sync: 0.0, Percep: 0.8304814692491738 | Fake: 10.244699917241833, Real: 0.6199986526972722
L1: 0.10239110164847111, Sync: 0.0, Percep: 0.8279339800796978 | Fake: 10.520022923630663, Real: 0.618096816339305
L1: 0.10235720289591985, Sync: 0.0, Percep: 0.8254020718837354 | Fake: 10.793661997258704, Real: 0.6162066120079922
L1: 0.10230096286480747, Sync: 0.0, Percep: 0.8228856021523825 | Fake: 11.065632539949988, Real: 0.6143279333128459
L1: 0.10232478230738712, Sync: 0.0, Percep: 0.8203844301093661 | Fake: 11.335949766272329, Real: 0.6124606751568799
Starting Epoch: 1
L1: 0.11442705243825912, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.09390852972865105, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.09028899172941844, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08729504607617855, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.0875200405716896, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08466161414980888, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.0851562459553991, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
...
L1: 0.08385955898174599, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08394778782830518, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08393304471088492, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.0836903992508139, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
Evaluating for 300 steps
L1: 0.062434629226724304, Sync: 7.282701448599497, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08369095140779523, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.0834334861073229, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08353378695167907, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
L1: 0.08348599619962074, Sync: 0.0, Percep: 0.0 | Fake: 100.0, Real: 0.0
...

Is the training process normal? If not, could you please give me some suggestions? I sincerely need your help.

CUDA error while training syncnet

Hello!
I didn't make any changes to the code, but I have trouble with SyncNet training.
The filelists are available, and the data is available too.
This error occurs on the first checkpoint save:

Saved checkpoint: check/checkpoint_step000000001.pth
Traceback (most recent call last):
  File "color_syncnet_train.py", line 279, in <module>
    nepochs=hparams.nepochs)
  File "color_syncnet_train.py", line 161, in train
    loss = cosine_loss(a, v, y)
  File "color_syncnet_train.py", line 136, in cosine_loss
    loss = logloss(d.unsqueeze(1), y)
  File "/home/kadochnikova/.conda/envs/lipsync/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kadochnikova/.conda/envs/lipsync/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 612, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/kadochnikova/.conda/envs/lipsync/lib/python3.6/site-packages/torch/nn/functional.py", line 2893, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
srun: error: hpe: task 0: Exited with exit code 1

I can't find the error. Can you please suggest what the problem is? Thanks!

Main wav2lip training expected loss behavior (wav2lip_train.py)

Hi @primepake!
Could you describe briefly how your main wav2lip training behaved?
My situation is the following: SyncNet is trained to 0.29 eval loss (higher than 0.25, but I'd like to try with that). I start wav2lip_train.py, which goes okay; the eval sync loss is decreasing and reaches 0.75, then syncnet_wt gets set to 0.01. This happens at roughly 280000 steps.

After that, the train sync loss starts to rise from 0 (this is expected behavior) and the eval sync loss starts to decrease fast. At 340000 steps both of these values stabilize at 0.34. After that, no matter how much I train, the train sync loss keeps decreasing (I trained it down to 0.24 at 1200000 steps) while the eval sync loss decreases much more slowly, and from some point (roughly 600000 steps) it simply stays at 0.3 and fluctuates there. It's a pity I don't have a plot of the losses :( but it seems like the train loss first reaches some "point of balance" and then decreases steadily, while the eval loss decreases to 0.3 and then stays there.

So, my questions are:

  • Did you train your model down to 0.2 eval sync loss, as advised in the original repo?
  • What were the values of your train sync loss at that point?
  • How many steps did it take? (I know this can differ depending on the training parameters, but still.)

I use the LRS2 dataset to train the wav2lip model, but the mouth movement is too small

I tried the author's Wav2Lip training pipeline with the whole LRS2 dataset, as the author did, but I got very small mouth movements. In contrast, I downloaded the author's wav2lip_gan.pth and tested it: it shows regular mouth movement that is bigger than mine.
So I want to ask whether you have any suggestions about this.

Looking forward to your reply, thanks!

Overfitting

Hello, can you please suggest any tricks to prevent SyncNet overfitting? Decreasing the learning rate doesn't help...
[image]

share video demos

Hi, thanks for this work.
Can you share some lip-sync video results of your work?
thanks

Scale up audio block

Hello, can you please suggest what layers I need to add to scale up the audio encoder? Thanks.
[image]
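
One generic way to add capacity to a convolutional audio encoder (an illustrative sketch only; the channel widths are assumptions and this is not this repo's audio_encoder): stack extra 3x3 conv + BatchNorm + PReLU blocks, or widen the channel counts, while keeping the overall downsampling of the 80x16 mel input unchanged.

import torch.nn as nn

def conv_block(cin, cout, stride=1):
    # Basic unit to stack: 3x3 conv + BatchNorm + PReLU.
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(cout),
        nn.PReLU(cout),
    )

# Example: two extra blocks appended at the end of an existing encoder
# (512 is an assumed channel width; match it to the layer you extend).
extra_stage = nn.Sequential(
    conv_block(512, 512),
    conv_block(512, 512),
)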

Training with new dataset

Hi, Happy Christmas!

I'd like to train the model with HQ images (320x320) and need some help.

There is a loss error at line 133 of "color_syncnet_train.py": the input to the loss is out of (0, 1).

It seems to work when I change the loss function
from
logloss = nn.BCELoss()
to
logloss = nn.BCEWithLogitsLoss()

Is it OK? Do I have to make other changes?

The training has now lasted more than 4 days, with the loss around 0.54 after 940,000 steps.

Thanks in advance.
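
Switching to nn.BCEWithLogitsLoss does avoid the (0, 1) assertion, because the sigmoid is applied inside the loss; the cosine similarity is then treated as a logit, which slightly changes the meaning of the loss but is a common workaround. A sketch of the adjusted cosine_loss under that assumption:

import torch.nn as nn
import torch.nn.functional as F

logloss = nn.BCEWithLogitsLoss()

def cosine_loss(a, v, y):
    # d is in [-1, 1]; BCEWithLogitsLoss applies sigmoid internally, so no
    # clamping is needed, but d is now interpreted as a logit rather than
    # a probability.
    d = F.cosine_similarity(a, v)
    return logloss(d.unsqueeze(1), y)

Keep in mind that any evaluation code which thresholds the raw cosine similarity should then be kept consistent with this change.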

Can you help me to create a budget to train a model?

Hello,

First of all, thanks for your contribution and sharing this very valuable code. You have done a fantastic job.

I have two questions:

1- How long did it take to train a model with 10 x A6000 GPUs (assuming each one was 48GB)?
2- How much did it cost to rent the GPU service from the time you started until the model training was finished?

I understand that you can't share a pre-trained model. For you that is a lot of hours of dedication and money.

It would help me to know those two points so I can make a rough budget and train my own model.

I know I could make estimated calculations, although the reality is always different in practice.

Thank you!

David

Inferencing after training wloss_hq_wav2lip_train.py

Hello, the pre-trained model named wav2lip_gan is used for inference. After my own training, there are two checkpoints: checkpoint_stepxxx and disc_checkpoint_stepxxx.
So, which checkpoint should I use for inference (when training with the visual quality disc)?

How to train with a custom dataset?

Hi, I am quite new to this. I am looking for a step-by-step guide to train on a custom dataset, OR to train on the AVSpeech dataset and fine-tune on other videos. The steps could be:

  • Download the dataset
  • Clean and convert to 25 fps [if the source is 30 fps, what should be done?]
  • Train
  • Finetune on custom videos
  • Test

I think such a guide will help a lot of people and avoid confusion.

Thank You.
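
A very rough end-to-end sketch of those steps, using the original Wav2Lip preprocessing script as a stand-in (the paths and the use of preprocess.py are assumptions; the two training commands mirror the README above):

import subprocess

# 1) Convert the raw videos to 25 fps first (see the ffmpeg sketches above).
# 2) Face-detect and crop with the original Wav2Lip preprocessing script.
subprocess.run(["python3", "preprocess.py",
                "--data_root", "raw_videos_25fps",
                "--preprocessed_root", "preprocessed"], check=True)

# 3) Write filelists/train.txt and filelists/val.txt with one full path per line.
# 4) Train the expert SyncNet, then the generator, as in the README.
subprocess.run(["python3", "train_syncnet_sam.py"], check=True)
subprocess.run(["python3", "hq_wav2lip_sam_train.py"], check=True)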
