
sihyun-yu / pvdm

Official PyTorch implementation of Video Probabilistic Diffusion Models in Projected Latent Space (CVPR 2023).

Home Page: https://sihyun.me/PVDM

License: MIT License

Language: Python 100.00%
Topics: diffusion-models, video-generation


pvdm's People

Contributors

sihyun-yu, subin-kim-cv


pvdm's Issues

Batch size

For training the autoencoder: is it really possible to train it with a batch size of 7-8? And how can we reach the batch size of 24 mentioned in the paper?

I am using an A6000 (48 GB), and memory usage already exceeds 22 GB with a batch size of 1.
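Not the authors' recipe, but one generic way to reach an effective batch size of 24 when only a couple of clips fit in memory is gradient accumulation. A minimal sketch with hypothetical model/criterion signatures:

    # Generic gradient accumulation (not the repo's trainer; signatures are hypothetical).
    # Several micro-batches share one optimizer step, so the effective batch size
    # reaches 24 even if only 2 samples fit on the GPU at once.
    import torch

    def train_epoch(model, criterion, opt, loader, device, target_batch=24, micro_batch=2):
        accum_steps = target_batch // micro_batch          # e.g. 24 // 2 = 12
        opt.zero_grad(set_to_none=True)
        for step, x in enumerate(loader):                  # loader yields micro-batches
            x = x.to(device)
            recon, vq_loss = model(x)                      # assumed: model returns (x_tilde, vq_loss)
            loss = (criterion(recon, x) + vq_loss) / accum_steps
            loss.backward()
            if (step + 1) % accum_steps == 0:
                opt.step()
                opt.zero_grad(set_to_none=True)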

About inference

How can we run inference with the checkpoints?

self._selector.poll(timeout)

Hello, I encountered the following problem when training on the UCF101 dataset. I use n_gpus=2, but training keeps getting stuck at "self._selector.poll(timeout)". How did you deal with this during training?
Also, how can the code be modified to train on a single GPU?

pytorch lightning?

Thanks for releasing this very interesting package. Would you consider refactoring the code to use PyTorch Lightning? It would make my life much easier :-) Based on your code structure, I think this is feasible, right?

Code can't adapt to a different number of timesteps

The repo has a few hardcoded values that make it difficult to use with a different setting, such as a different resolution or number of timesteps.
I think I managed to solve the resolution problem, also thanks to this issue.
Now I'm really struggling with the timesteps (number of frames in a video) parameter.

Apparently, using a number that's not a power of two (like 8, 16, or 32) causes problems in the UNet (when concatenating residuals with the newly upsampled dimension).

I managed to train the AE with timesteps 8 and resolution 128, so it now produces an embedding of shape [1, 4, 1536], one for the noisy frames and one for the conditioning frames.
I also had to change the code in the UNet that is marked with a TODO:

# TODO: treat 32 and 16 as variables
h_xy = h[:, :, 0:32*32].view(h.size(0), h.size(1), 32, 32)
h_yt = h[:, :, 32*32:32*(32+16)].view(h.size(0), h.size(1), 16, 32)
h_xt = h[:, :, 32*(32+16):32*(32+16+16)].view(h.size(0), h.size(1), 16, 32)

So I defined a variable n2 = 32 and n = n2 // 2 to replace the raw numbers.
To use timesteps 8, I set n2 to 16; I'm not sure that's correct, but if n2 was 32 for timesteps 16, the same relationship should hold.
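For reference, a sketch of that parameterization as I read it (my own names; n2 is the spatial side of the latent plane and n the temporal length, with n2 = 32, n = 16 in the original setting):

    # Hypothetical generalization of the hardcoded slicing above; h is the same
    # [B, C, n2*n2 + 2*n*n2] tensor as in the quoted snippet.
    def split_planes(h, n2=32, n=None):
        n = n2 // 2 if n is None else n      # the n = n2 // 2 convention described above
        h_xy = h[:, :, :n2 * n2].view(h.size(0), h.size(1), n2, n2)
        h_yt = h[:, :, n2 * n2:n2 * (n2 + n)].view(h.size(0), h.size(1), n, n2)
        h_xt = h[:, :, n2 * (n2 + n):n2 * (n2 + n + n)].view(h.size(0), h.size(1), n, n2)
        return h_xy, h_yt, h_xt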

The problem now is that the forward pass of the UNet produces a tensor of shape [1, 4, 512], so there's a dimension mismatch when trying to compute the loss.
I'm referring to the code in this function:

    def p_losses(self, x_start, cond, t, noise=None):
        noise = default(noise, lambda: torch.randn_like(x_start))
        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
        model_out = self.model(x_noisy, cond, t)
        ...
        loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2])

Which causes the following error:

RuntimeError: The size of tensor a (1536) must match the size of tensor b (512) at non-singleton dimension 2

@sihyun-yu Did I miss anything else that should be changed to make this code "timestep-adaptive"?

How to test encoder?

I'm sorry, I'm a novice who has just started scientific research. After I have trained the encoder, how can I test it?
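Not from the authors, but a common sanity check is to run held-out clips through the trained autoencoder and inspect the reconstructions. A minimal sketch, assuming the forward signature x_tilde, vq_loss = model(x) that appears in tools/trainer.py:

    # Minimal reconstruction check (my own sketch, not the repo's evaluation code).
    import torch

    @torch.no_grad()
    def reconstruction_mse(model, clip):
        model.eval()
        recon, _ = model(clip)                 # assumed: x_tilde, vq_loss = model(x)
        return torch.mean((recon - clip) ** 2).item()

Saving the input and reconstructed clips side by side, or reporting PSNR on a test split, are the usual next steps.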

Hand + Face of Human Pose

Hi,
Is it possible to generate a single, consistent character from a pose sequence for about 5 seconds?

I have a pose video (OpenPose + hands + face), and I was wondering whether it is possible to generate a 5-second output video with a consistent character/avatar that performs a dance or other motion driven by the controlled pose input. It doesn't matter what the character looks like, as long as it stays consistent.
Sample Video

Thanks
Best regards

Regarding training issue and artifact issue

Hello, sorry to bother you. I've been studying your work recently and have a few questions.
① Taking UCF101 as an example, how many steps did you train the diffusion model for? What hardware was used, and how long did it take?
② I trained the diffusion model on 4 A100 GPUs (40 GB) for six days, about 250,000 steps. The results on the UCF101 dataset are very good. I then used this model on videos from the WebVid-10M dataset and found many artifacts in the generated videos, which look very similar to other people's actions. I don't understand why these artifacts appear. Is it because the training was insufficient? (I only trained for 250,000 steps.)
Below is the resulting predicted video. The second example is a man holding a remote control; there are obvious artifacts in the generated part, which may be characteristics or actions of other people.

predicted_easy.mp4

More single channel video inquiry

Hi, I'm using single-channel 128x128 video. The ae_loss calculation threw an error; I think somewhere in the loss function it expects 3-channel frames. Can you provide some advice on dealing with single-channel videos? Since you expose in/out channels as parameters, I assume you have probably tested with single-channel input before, no?

File "/home/work/PVDM/tools/trainer.py", line 188, in first_stage_train
    ae_loss = criterion(vq_loss, x,
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/work/PVDM/losses/perceptual.py", line 118, in forward
    logits_real_2d, pred_real_2d = self.discriminator_2d(inputs_2d)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/work/PVDM/losses/perceptual.py", line 207, in forward
    res.append(model(res[-1]))
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [64, 3, 4, 4], expected input[2, 1, 128, 128] to have 3 channels, but got 1 channels instead 
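Not an answer from the authors, but since the 2D discriminator (and the LPIPS network) here is built for 3-channel frames, one common workaround is to tile the single channel to three before the perceptual/GAN loss; the alternative is to rebuild the discriminator with 1 input channel. A minimal sketch of the tiling option:

    # Hypothetical workaround: repeat a grayscale channel to 3 channels so the
    # 3-channel discriminator/LPIPS weights can be applied unchanged.
    import torch

    def to_three_channels(x):
        # x: [B, 1, H, W] -> [B, 3, H, W]; leaves 3-channel input untouched
        return x.repeat(1, 3, 1, 1) if x.size(1) == 1 else x

    frames = torch.randn(2, 1, 128, 128)        # shape from the error message
    print(to_three_channels(frames).shape)      # torch.Size([2, 3, 128, 128])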

Clarifying Evaluation Results

Firstly, thank you for your great work. I have a question about calculating FVD score.

(1) In Appendix B.2 (Metrics) of the PVDM paper, it is mentioned: "We sample 2,048 samples (or the size of the real data if it is smaller) for calculating real statistics and 2,048 samples for evaluating fake statistics."

For the SKY-Timelapse dataset, it was noted that there are only 196 real samples available, so real statistics were calculated using these 196 samples. However, if we assume that 2,048 fake samples are generated as mentioned, there will be a difference in the number of real and fake samples. Is there any potential issue when comparing statistics between real and fake samples due to this difference in quantity?
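For reference, a minimal sketch of the Fréchet distance underlying FVD, assuming I3D features have already been extracted; the formula itself does not require equal sample counts, though a small real set such as 196 clips does make the estimated mean and covariance noisier:

    # Minimal Fréchet distance between two Gaussian fits (the core of FVD).
    # feat_real: [N_real, D], feat_fake: [N_fake, D] feature arrays (assumed given).
    import numpy as np
    from scipy import linalg

    def frechet_distance(feat_real, feat_fake):
        mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
        sigma_r = np.cov(feat_real, rowvar=False)
        sigma_f = np.cov(feat_fake, rowvar=False)
        covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        diff = mu_r - mu_f
        return float(diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean))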

vqloss

The d_loss was always zero when I ran the code. I noticed that in forward() of LPIPSWithDiscriminator, codebook_loss is never used. Is that related?

error with dataloader

File "/PVDM/tools/dataloader.py", line 103, in _select_fold
with open(f, "r") as fid:
FileNotFoundError: [Errno 2] No such file or directory: '../../datasets/UCF-101/ucfTrainTestlist/trainlist01.txt'

I use the dataset structure you mentioned, but I get this error. Is there any other preprocessing required on the dataset?
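For what it's worth, trainlist01.txt and testlist01.txt come from the separate "UCF101 Train/Test Splits for Action Recognition" archive on the UCF101 site, not from the video archive itself, so they have to be unpacked into the ucfTrainTestlist/ directory. A quick check of the path the dataloader expects:

    # Quick sanity check (my own sketch) that the official UCF101 split files
    # are where tools/dataloader.py looks for them.
    from pathlib import Path

    split_dir = Path("../../datasets/UCF-101/ucfTrainTestlist")   # path from the error
    for name in ("trainlist01.txt", "testlist01.txt"):
        status = "found" if (split_dir / name).exists() else "MISSING"
        print(f"{split_dir / name}: {status}")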

Missing ddconfig parameters in configs/latent-diffusion/base.yaml

Hi, I get errors when running

python main.py \
 --exp ddpm \
 --id main \
 --pretrain_config configs/latent-diffusion/base.yaml \
 --data UCF101 \
 --first_model 'results/first_stage_main_gan_UCF101_42/model_last.pth' \
 --diffusion_config configs/latent-diffusion/base.yaml \
 --batch_size 48

It says the key model.params.ddconfig is missing, and I found that it is indeed not included in base.yaml. Could you help fix this issue?

Fail to load scalers. Start from initial point.

Hi, I get this message when running with the following launch.json args:

launch.json
"args":[
"--exp","first_stage",
"--id","main",
"--pretrain_config", "configs/autoencoder/base.yaml",
"--data","UCF101",
"--batch_size","8"
],

loaded pretrained LPIPS loss from ./losses/vgg.pth
Fail to load scalers. Start from initial point.Fail to load scalers. Start from initial point.

Thanks.

Some questions about changing this work to a text-to-video generation work

Sorry to bother you. My current project is text-to-video generation, and I am making some modifications based on your open-source code.

Upon reviewing your code, I noticed that you randomly sample 32 frames from the video and divide them into two sections. The first 16 frames are used as conditions to generate the last 16 frames. I would like to ask, have you ever experimented with text-to-video before? How was the code modified?

I've modified some of your code so far, but it doesn't work after training: no matter what text is input, the generated video is almost the same, and the loss does not converge during training; it barely decreases.

Modified details:
After the text is encoded with BERT, it is injected into the UNet via cross-attention, and the loss call is changed from the original '(loss, t), loss_dict = criterion(z.float(), c.float())' to '(loss, t), loss_dict = criterion(z.float(), encoded_texts.float())'.
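For reference, a minimal sketch of the kind of cross-attention conditioning described above; this is my own illustration with made-up dimensions, not the repo's UNet code:

    # Minimal cross-attention block: latent tokens attend to BERT text tokens.
    import torch
    import torch.nn as nn

    class CrossAttention(nn.Module):
        def __init__(self, dim, ctx_dim, heads=8):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim,
                                              batch_first=True)

        def forward(self, x, ctx):
            # x:   [B, N, dim]      latent tokens from the UNet
            # ctx: [B, L, ctx_dim]  BERT token embeddings
            out, _ = self.attn(self.norm(x), ctx, ctx, need_weights=False)
            return x + out                     # residual connection

    x = torch.randn(2, 1536, 256)              # hypothetical latent token sequence
    text = torch.randn(2, 77, 768)             # hypothetical BERT output
    print(CrossAttention(256, 768)(x, text).shape)   # torch.Size([2, 1536, 256])

When the output ignores the text entirely, it is worth verifying that the conditioning path actually changes the UNet output (e.g., that its projections receive gradients and the text is not being masked out every step).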

I'm a beginner; any answer from you will help me a lot. Thanks!

How to replicate the efficient memory usage on a 24GB GPU with batch size 2?

I am trying to reproduce your results. However, when I run the default configuration autoencoder/base.yaml, my memory cost is super high.

Specifically, the memory cost of each batch when running with the default configuration is 27591 MiB. I would like to know how to modify the configuration or the training procedure in order to achieve a similar level of efficiency as reported in the paper, where a batch size of 2 was used on a 24GB GPU.

Can you provide guidance on how to modify the configuration or the training procedure to achieve this? Or, if this is not possible, can you explain why this is the case and suggest alternative approaches to reproducing the reported results?

Thank you in advance for your help.

Training set used as eval/test ???

Is there a reason behind using the training set for evaluation, or is it just a mistake?

PVDM/tools/dataloader.py

Lines 305 to 311 in 793172f

trainset_sampler = InfiniteSampler(dataset=trainset, rank=rank, num_replicas=n_gpus, seed=seed)
trainloader = DataLoader(trainset, sampler=trainset_sampler, batch_size=batch_size // n_gpus, pin_memory=False, num_workers=4, prefetch_factor=2)
testset_sampler = InfiniteSampler(testset, num_replicas=n_gpus, rank=rank, seed=seed)
testloader = DataLoader(testset, sampler=testset_sampler, batch_size=batch_size // n_gpus, pin_memory=False, num_workers=4, prefetch_factor=2)
return trainloader, trainloader, testloader

train_loader, test_loader, total_vid = get_loaders(rank, args.data, args.res, args.timesteps, args.skip, args.batch_size, args.n_gpus, args.seed, cond=False)

train_loader, test_loader, total_vid = get_loaders(rank, args.data, args.res, args.timesteps, args.skip, args.batch_size, args.n_gpus, args.seed, args.cond_model)

How long does it take to train?

This is very interesting work. I noticed that the resources required for training are not mentioned in the paper. How long does training take? 🥳

missing ddconfig

Hi, I ran your code and found that PVDM/configs/latent_diffusion/base.yaml seems to be missing a section called ddconfig. Could you please share the complete file with ddconfig? Thanks.

Release of pretrained autoencoder

Hi,

Thank you very much for releasing the code, the paper is super interesting.
Do you plan to also release the weights of the autoencoder?

Best,

.yaml file missing

While running the script for the diffusion model, I get the error "No such file or directory: '/home/........./PVDM/configs/latent-diffusion/ucf101-ldm-kl-3_res128.yaml'". This happens when I pass the argument --first_model 'results/first_stage_main_gan_UCF101_42/model_last.pth' to the diffusion model script. Does this mean ucf101-ldm-kl-3_res128.yaml is missing from configs/latent-diffusion? Help with this would be really appreciated.

torch.multiprocessing.spawn hangs

When I tried to run first_stage, the code hangs at
torch.multiprocessing.spawn(fn=first_stage, args=(args, ), nprocs=args.n_gpus)
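A hedged workaround, not the repo's code: when only one GPU is requested, the spawn path can be bypassed entirely, assuming first_stage can run as rank 0 without a multi-process group (small changes may still be needed around the distributed setup):

    # Hypothetical guard: skip multiprocessing when a single GPU is used.
    if args.n_gpus > 1:
        torch.multiprocessing.spawn(fn=first_stage, args=(args,), nprocs=args.n_gpus)
    else:
        first_stage(0, args)   # spawn would call fn(rank, *args); emulate rank 0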

After I switched to a single GPU, the code ran; however, I kept getting out-of-memory errors even after reducing channels from 384 to 48. The paper says the model can fit on a single card.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 23.70 GiB total capacity; 20.26 GiB already allocated; 1.34 GiB free; 21.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here's my model configuration,

model:
  resume: False
  amp: True
  base_learning_rate: 1.0e-4
  params:
    embed_dim: 4
    lossconfig:
      params:
        disc_start: 100000000

    ddconfig: 
      double_z: False
      channels: 48 
      resolution: 128 
      timesteps: 8 
      skip: 1
      in_channels: 1 
      out_ch: 1 
      num_res_blocks: 2
      attn_resolutions: []
      splits: 1

Environment Setting

Hi, the default environment requires a fairly recent CUDA version. Can I use a lower setting for training and evaluation, e.g., CUDA 11.3 with PyTorch 1.10? Thanks.

Need advice on the training of autoencoder

Hey, I trained the autoencoder with the default configuration on 5 GPUs with a batch size of 1 per GPU, but it has not converged even after 190k iterations (over 1 day). I would like to know the expected number of iterations for the autoencoder to converge with the default settings.

Also, should I continue training the model, or are there any other suggestions to help it converge faster?

(screenshot attached)

Thank you,
Jiankun

a general question regarding videogpt

Hi, your paper shows PVDM beats VideoGPT by a large margin. I wonder if you can offer more insight. VideoGPT also uses a two-stage process: first training a VQ-VAE, then an autoregressive model. Do you think the main difference lies in the diffusion part? Thanks.

Increasing memory

There may be a memory-reclamation issue in first_stage_train, resulting in gradual memory growth.
(screenshot attached)

does this work for 128x128

I changed the configuration to 128x128 and got the following error. Please help.

  first_stage_train(rank, model, opt, d_opt, criterion, train_loader, test_loader, args.first_model, fp, logger)
  File "/home/work/PVDM/tools/trainer.py", line 185, in first_stage_train
    x_tilde, vq_loss  = model(x)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/work/PVDM/models/autoencoder/autoencoder_vit.py", line 206, in forward
    z = self.encode(input)
  File "/home/work/PVDM/models/autoencoder/autoencoder_vit.py", line 153, in encode
    h = self.encoder(x)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/work/PVDM/models/autoencoder/vit_modules.py", line 220, in forward
    x = self.to_patch_embedding(video)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (65536x16 and 48x192)

Here's the yaml I used,

model:
  resume: False
  amp: True
  base_learning_rate: 1.0e-4
  params:
    embed_dim: 4
    lossconfig:
      params:
        disc_start: 100000000

    ddconfig: 
      double_z: False
      channels: 192 
      resolution: 128 
      timesteps: 8 
      skip: 1
      in_channels: 1 
      out_ch: 1 
      num_res_blocks: 2
      attn_resolutions: []
      splits: 1
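My reading of the error above, not a confirmed fix: the traceback ends inside an nn.Linear called from to_patch_embedding, and its input width scales with the number of input channels. The mismatch "16 vs 48" is exactly a factor of 3, so the patch-embedding layer appears to have been built for 3-channel frames while in_channels: 1 produces only one-third of the expected features; it presumably has to be constructed from ddconfig's in_channels, or the frames tiled to 3 channels as in the earlier single-channel issue.

    # Back-of-the-envelope check of "(65536x16 and 48x192)":
    # per-patch feature width = in_channels * (per-channel patch volume).
    per_channel_patch = 16                      # what a 1-channel video produced
    expected_width = 48                         # what the Linear layer was built for
    assert expected_width == 3 * per_channel_patch
    print("the layer assumes", expected_width // per_channel_patch, "input channels")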

torch.nn.parallel.DistributedDataParallel

Excellent work! :)
But I hit a bug: when I run the first_stage code on multiple GPUs, it blocks at the line below. I found the issue is caused by model desynchronization.
criterion = torch.nn.parallel.DistributedDataParallel(criterion, device_ids=[device], broadcast_buffers=False, find_unused_parameters=True)
