
sihyun-yu / pvdm

Official PyTorch implementation of Video Probabilistic Diffusion Models in Projected Latent Space (CVPR 2023).

Home Page: https://sihyun.me/PVDM

License: MIT License

Language: Python 100.00%
Topics: diffusion-models, video-generation


pvdm's People

Contributors

sihyun-yu, subin-kim-cv


pvdm's Issues

Batch size

For training the autoencoder: is it really possible to train it with a batch size of 7-8? And how can we reach the batch size of 24 mentioned in the paper?

I am using an A6000 (48 GB), and memory usage already exceeds 22 GB with a batch size of 1.
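Not the authors' recipe, but one generic way to reach an effective batch size of 24 when only a couple of clips fit in memory is gradient accumulation. A minimal sketch with hypothetical model/criterion signatures:

    # Generic gradient accumulation (not the repo's trainer; signatures are hypothetical).
    # Several micro-batches share one optimizer step, so the effective batch size
    # reaches 24 even if only 2 samples fit on the GPU at once.
    import torch

    def train_epoch(model, criterion, opt, loader, device, target_batch=24, micro_batch=2):
        accum_steps = target_batch // micro_batch          # e.g. 24 // 2 = 12
        opt.zero_grad(set_to_none=True)
        for step, x in enumerate(loader):                  # loader yields micro-batches
            x = x.to(device)
            recon, vq_loss = model(x)                      # assumed: model returns (x_tilde, vq_loss)
            loss = (criterion(recon, x) + vq_loss) / accum_steps
            loss.backward()
            if (step + 1) % accum_steps == 0:
                opt.step()
                opt.zero_grad(set_to_none=True)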

About inference

How can we run inference with the checkpoints?

self._selector.poll(timeout)

Hello, I encountered the following problem when training on the UCF101 dataset. I use n_gpus=2, but training keeps getting stuck at "self._selector.poll(timeout)". How did you deal with this during training?
Also, how can the code be modified to train on a single GPU?

pytorch lightning?

Thanks for releasing this very interesting package. Would you consider refactoring the code to use PyTorch Lightning? It would make my life much easier :-) Based on your code structure, I think this is feasible, right?

Code can't adapt to a different number of timesteps

The repo has a few hardcoded values that make it difficult to use with a different setting, such as a different resolution or number of timesteps.
I think I managed to solve the resolution problem, also thanks to this issue.
Now I'm really struggling with the timesteps (number of frames in a video) parameter.

Apparently, using a number that's not a power of two (like 8, 16, or 32) causes problems in the UNet (when concatenating residuals with the newly upsampled dimension).

I managed to train the AE with timesteps 8 and resolution 128, so it now produces an embedding of shape [1, 4, 1536], one for the noisy frames and one for the conditioning frames.
I also had to change the code in the UNet that is marked with a TODO:

# TODO: treat 32 and 16 as variables
h_xy = h[:, :, 0:32*32].view(h.size(0), h.size(1), 32, 32)
h_yt = h[:, :, 32*32:32*(32+16)].view(h.size(0), h.size(1), 16, 32)
h_xt = h[:, :, 32*(32+16):32*(32+16+16)].view(h.size(0), h.size(1), 16, 32)

So I defined a variable n2 = 32 and n = n2 // 2 to replace the raw numbers.
To use timesteps 8, I set n2 to 16; I'm not sure that's correct, but if n2 was 32 for timesteps 16, the same relationship should hold.
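For reference, a sketch of that parameterization as I read it (my own names; n2 is the spatial side of the latent plane and n the temporal length, with n2 = 32, n = 16 in the original setting):

    # Hypothetical generalization of the hardcoded slicing above; h is the same
    # [B, C, n2*n2 + 2*n*n2] tensor as in the quoted snippet.
    def split_planes(h, n2=32, n=None):
        n = n2 // 2 if n is None else n      # the n = n2 // 2 convention described above
        h_xy = h[:, :, :n2 * n2].view(h.size(0), h.size(1), n2, n2)
        h_yt = h[:, :, n2 * n2:n2 * (n2 + n)].view(h.size(0), h.size(1), n, n2)
        h_xt = h[:, :, n2 * (n2 + n):n2 * (n2 + n + n)].view(h.size(0), h.size(1), n, n2)
        return h_xy, h_yt, h_xt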

The problem now is that the forward pass of the UNet produces a tensor of shape [1, 4, 512], so there's a dimension mismatch when trying to compute the loss.
I'm referring to the code in this function:

    def p_losses(self, x_start, cond, t, noise=None):
        noise = default(noise, lambda: torch.randn_like(x_start))
        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
        model_out = self.model(x_noisy, cond, t)
        ...
        loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2])

Which causes the following error:

RuntimeError: The size of tensor a (1536) must match the size of tensor b (512) at non-singleton dimension 2

@sihyun-yu Did I miss anything else that should be changed to make this code "timestep-adaptive"?

How to test encoder?

I'm sorry, I'm a novice who has just started scientific research. After I have trained the encoder, how can I test it?
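Not from the authors, but a common sanity check is to run held-out clips through the trained autoencoder and inspect the reconstructions. A minimal sketch, assuming the forward signature x_tilde, vq_loss = model(x) that appears in tools/trainer.py:

    # Minimal reconstruction check (my own sketch, not the repo's evaluation code).
    import torch

    @torch.no_grad()
    def reconstruction_mse(model, clip):
        model.eval()
        recon, _ = model(clip)                 # assumed: x_tilde, vq_loss = model(x)
        return torch.mean((recon - clip) ** 2).item()

Saving the input and reconstructed clips side by side, or reporting PSNR on a test split, are the usual next steps.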

Hand + Face of Human Pose

Hi,
Is it possible to generate a single, consistent character from a pose sequence for about 5 seconds?

I have a pose video (OpenPose + hands + face), and I was wondering whether it is possible to generate a 5-second output video with a consistent character/avatar that performs a dance or other motion driven by the controlled pose input. It doesn't matter what the character looks like, as long as it stays consistent.
Sample Video

Thanks
Best regards

Regarding training issue and artifact issue

Hello, sorry to bother you. I've been studying your work recently and have a few questions.
① Taking UCF101 as an example, how many steps did you train the diffusion model for? What hardware was used, and how long did it take?
② I trained the diffusion model on 4 A100 GPUs (40 GB) for six days, about 250,000 steps. The results on the UCF101 dataset are very good. I then used this model on videos from the WebVid-10M dataset and found many artifacts in the generated videos, which look very similar to other people's actions. I don't understand why these artifacts appear. Is it because the training was insufficient? (I only trained for 250,000 steps.)
Below is the resulting predicted video. The second example is a man holding a remote control; there are obvious artifacts in the generated part, which may be characteristics or actions of other people.

predicted_easy.mp4

More single channel video inquiry

Hi, I'm using single-channel 128x128 video. The ae_loss calculation threw an error; I think somewhere in the loss function it expects 3-channel frames. Can you provide some advice on dealing with single-channel videos? Since you expose in/out channels as parameters, I assume you have probably tested with single-channel input before, no?

File "/home/work/PVDM/tools/trainer.py", line 188, in first_stage_train
    ae_loss = criterion(vq_loss, x,
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/work/PVDM/losses/perceptual.py", line 118, in forward
    logits_real_2d, pred_real_2d = self.discriminator_2d(inputs_2d)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/work/PVDM/losses/perceptual.py", line 207, in forward
    res.append(model(res[-1]))
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [64, 3, 4, 4], expected input[2, 1, 128, 128] to have 3 channels, but got 1 channels instead 
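Not an answer from the authors, but since the 2D discriminator (and the LPIPS network) here is built for 3-channel frames, one common workaround is to tile the single channel to three before the perceptual/GAN loss; the alternative is to rebuild the discriminator with 1 input channel. A minimal sketch of the tiling option:

    # Hypothetical workaround: repeat a grayscale channel to 3 channels so the
    # 3-channel discriminator/LPIPS weights can be applied unchanged.
    import torch

    def to_three_channels(x):
        # x: [B, 1, H, W] -> [B, 3, H, W]; leaves 3-channel input untouched
        return x.repeat(1, 3, 1, 1) if x.size(1) == 1 else x

    frames = torch.randn(2, 1, 128, 128)        # shape from the error message
    print(to_three_channels(frames).shape)      # torch.Size([2, 3, 128, 128])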

Clarifying Evaluation Results

Firstly, thank you for your great work. I have a question about calculating FVD score.

(1) In Appendix B.2 (Metrics) of the PVDM paper, it is mentioned: "We sample 2,048 samples (or the size of the real data if it is smaller) for calculating real statistics and 2,048 samples for evaluating fake statistics."

For the SKY-Timelapse dataset, it was noted that there are only 196 real samples available, so real statistics were calculated using these 196 samples. However, if we assume that 2,048 fake samples are generated as mentioned, there will be a difference in the number of real and fake samples. Is there any potential issue when comparing statistics between real and fake samples due to this difference in quantity?
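For reference, a minimal sketch of the Fréchet distance underlying FVD, assuming I3D features have already been extracted; the formula itself does not require equal sample counts, though a small real set such as 196 clips does make the estimated mean and covariance noisier:

    # Minimal Fréchet distance between two Gaussian fits (the core of FVD).
    # feat_real: [N_real, D], feat_fake: [N_fake, D] feature arrays (assumed given).
    import numpy as np
    from scipy import linalg

    def frechet_distance(feat_real, feat_fake):
        mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
        sigma_r = np.cov(feat_real, rowvar=False)
        sigma_f = np.cov(feat_fake, rowvar=False)
        covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        diff = mu_r - mu_f
        return float(diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean))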

vqloss

The d_loss was always zero when I ran the code. I noticed that in forward() of LPIPSWithDiscriminator, codebook_loss is never used. Is that related?

error with dataloader

File "/PVDM/tools/dataloader.py", line 103, in _select_fold
with open(f, "r") as fid:
FileNotFoundError: [Errno 2] No such file or directory: '../../datasets/UCF-101/ucfTrainTestlist/trainlist01.txt'

I use the dataset structure you mentioned, but I get this error. Is there any other preprocessing required on the dataset?
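For what it's worth, trainlist01.txt and testlist01.txt come from the separate "UCF101 Train/Test Splits for Action Recognition" archive on the UCF101 site, not from the video archive itself, so they have to be unpacked into the ucfTrainTestlist/ directory. A quick check of the path the dataloader expects:

    # Quick sanity check (my own sketch) that the official UCF101 split files
    # are where tools/dataloader.py looks for them.
    from pathlib import Path

    split_dir = Path("../../datasets/UCF-101/ucfTrainTestlist")   # path from the error
    for name in ("trainlist01.txt", "testlist01.txt"):
        status = "found" if (split_dir / name).exists() else "MISSING"
        print(f"{split_dir / name}: {status}")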

Missing ddconfig parameters in configs/latent-diffusion/base.yaml

Hi, I get errors when running

python main.py \
 --exp ddpm \
 --id main \
 --pretrain_config configs/latent-diffusion/base.yaml \
 --data UCF101 \
 --first_model 'results/first_stage_main_gan_UCF101_42/model_last.pth' \
 --diffusion_config configs/latent-diffusion/base.yaml \
 --batch_size 48

It says the key model.params.ddconfig is missing, and I found that it is indeed not included in base.yaml. Could you help fix this issue?

Fail to load scalers. Start from initial point.

Hi, I get this message when running with the following launch.json args:

launch.json
"args":[
"--exp","first_stage",
"--id","main",
"--pretrain_config", "configs/autoencoder/base.yaml",
"--data","UCF101",
"--batch_size","8"
],

loaded pretrained LPIPS loss from ./losses/vgg.pth
Fail to load scalers. Start from initial point.Fail to load scalers. Start from initial point.

Thanks.

Some questions about changing this work to a text-to-video generation work

Sorry to bother you. My current project is text-to-video generation, and I am making some modifications based on your open-source code.

Upon reviewing your code, I noticed that you randomly sample 32 frames from the video and divide them into two sections. The first 16 frames are used as conditions to generate the last 16 frames. I would like to ask, have you ever experimented with text-to-video before? How was the code modified?

I've modified some of your code so far, but it doesn't work after training: no matter what text is input, the generated video is almost the same, and the loss does not converge during training; it barely decreases.

Modified details:
After the text is encoded with BERT, it is injected into the UNet via cross-attention, and the loss call is changed from the original '(loss, t), loss_dict = criterion(z.float(), c.float())' to '(loss, t), loss_dict = criterion(z.float(), encoded_texts.float())'.
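For reference, a minimal sketch of the kind of cross-attention conditioning described above; this is my own illustration with made-up dimensions, not the repo's UNet code:

    # Minimal cross-attention block: latent tokens attend to BERT text tokens.
    import torch
    import torch.nn as nn

    class CrossAttention(nn.Module):
        def __init__(self, dim, ctx_dim, heads=8):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim,
                                              batch_first=True)

        def forward(self, x, ctx):
            # x:   [B, N, dim]      latent tokens from the UNet
            # ctx: [B, L, ctx_dim]  BERT token embeddings
            out, _ = self.attn(self.norm(x), ctx, ctx, need_weights=False)
            return x + out                     # residual connection

    x = torch.randn(2, 1536, 256)              # hypothetical latent token sequence
    text = torch.randn(2, 77, 768)             # hypothetical BERT output
    print(CrossAttention(256, 768)(x, text).shape)   # torch.Size([2, 1536, 256])

When the output ignores the text entirely, it is worth verifying that the conditioning path actually changes the UNet output (e.g., that its projections receive gradients and the text is not being masked out every step).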

I'm a beginner; any answer from you will help me a lot. Thanks!

How to replicate the efficient memory usage on a 24GB GPU with batch size 2?

I am trying to reproduce your results. However, when I run the default configuration autoencoder/base.yaml, my memory cost is super high.

Specifically, the memory cost of each batch when running with the default configuration is 27591 MiB. I would like to know how to modify the configuration or the training procedure in order to achieve a similar level of efficiency as reported in the paper, where a batch size of 2 was used on a 24GB GPU.

Can you provide guidance on how to modify the configuration or the training procedure to achieve this? Or, if this is not possible, can you explain why this is the case and suggest alternative approaches to reproducing the reported results?

Thank you in advance for your help.

Training set used as eval/test ???

Is there a reason behind using the training set for evaluation, or is it just a mistake?

PVDM/tools/dataloader.py

Lines 305 to 311 in 793172f

trainset_sampler = InfiniteSampler(dataset=trainset, rank=rank, num_replicas=n_gpus, seed=seed)
trainloader = DataLoader(trainset, sampler=trainset_sampler, batch_size=batch_size // n_gpus, pin_memory=False, num_workers=4, prefetch_factor=2)
testset_sampler = InfiniteSampler(testset, num_replicas=n_gpus, rank=rank, seed=seed)
testloader = DataLoader(testset, sampler=testset_sampler, batch_size=batch_size // n_gpus, pin_memory=False, num_workers=4, prefetch_factor=2)
return trainloader, trainloader, testloader

train_loader, test_loader, total_vid = get_loaders(rank, args.data, args.res, args.timesteps, args.skip, args.batch_size, args.n_gpus, args.seed, cond=False)

train_loader, test_loader, total_vid = get_loaders(rank, args.data, args.res, args.timesteps, args.skip, args.batch_size, args.n_gpus, args.seed, args.cond_model)

How long does it take to train?

This is very interesting work. I noticed that the resources required for training are not mentioned in the paper. How long does training take? 🥳

missing ddconfig

Hi, I ran your code and found that PVDM/configs/latent_diffusion/base.yaml seems to be missing a section called ddconfig. Could you please share the complete file with ddconfig? Thanks.

Release of pretrained autoencoder

Hi,

Thank you very much for releasing the code, the paper is super interesting.
Do you plan to also release the weights of the autoencoder?

Best,

.yaml file missing

While running the script for the diffusion model, I get the error "No such file or directory: '/home/........./PVDM/configs/latent-diffusion/ucf101-ldm-kl-3_res128.yaml'". This happens when I pass the argument --first_model 'results/first_stage_main_gan_UCF101_42/model_last.pth' to the diffusion model script. Does this mean ucf101-ldm-kl-3_res128.yaml is missing from configs/latent-diffusion? Help with this would be really appreciated.

torch.multiprocessing.spawn hangs

When I tried to run first_stage, the code hangs at
torch.multiprocessing.spawn(fn=first_stage, args=(args, ), nprocs=args.n_gpus)
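A hedged workaround, not the repo's code: when only one GPU is requested, the spawn path can be bypassed entirely, assuming first_stage can run as rank 0 without a multi-process group (small changes may still be needed around the distributed setup):

    # Hypothetical guard: skip multiprocessing when a single GPU is used.
    if args.n_gpus > 1:
        torch.multiprocessing.spawn(fn=first_stage, args=(args,), nprocs=args.n_gpus)
    else:
        first_stage(0, args)   # spawn would call fn(rank, *args); emulate rank 0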

After I switched to a single GPU, the code ran; however, I kept getting out-of-memory errors even after reducing channels from 384 to 48. The paper says the model can fit on a single card.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 23.70 GiB total capacity; 20.26 GiB already allocated; 1.34 GiB free; 21.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here's my model configuration,

model:
  resume: False
  amp: True
  base_learning_rate: 1.0e-4
  params:
    embed_dim: 4
    lossconfig:
      params:
        disc_start: 100000000

    ddconfig: 
      double_z: False
      channels: 48 
      resolution: 128 
      timesteps: 8 
      skip: 1
      in_channels: 1 
      out_ch: 1 
      num_res_blocks: 2
      attn_resolutions: []
      splits: 1

Environment Setting

Hi, the default environment requires a fairly recent CUDA version. Can I use a lower setting for training and evaluation, e.g., CUDA 11.3 with PyTorch 1.10? Thanks.

Need advice on the training of autoencoder

Hey, I trained the autoencoder with the default configuration on 5 GPUs with a batch size of 1 per GPU, but it has not converged even after 190k iterations (over 1 day). I would like to know the expected number of iterations for the autoencoder to converge with the default settings.

Also, should I continue training the model, or are there any other suggestions to help it converge faster?

(screenshot attached)

Thank you,
Jiankun

a general question regarding videogpt

Hi, your paper shows PVDM beats VideoGPT by a large margin. I wonder if you can offer more insight. VideoGPT also uses a two-stage process: first training a VQ-VAE, then an autoregressive model. Do you think the main difference lies in the diffusion part? Thanks.

Increasing memory

There may be a memory-reclamation issue in first_stage_train, resulting in gradual memory growth.
(screenshot attached)

does this work for 128x128

I changed the configuration to 128x128 and got the following error. Please help.

  first_stage_train(rank, model, opt, d_opt, criterion, train_loader, test_loader, args.first_model, fp, logger)
  File "/home/work/PVDM/tools/trainer.py", line 185, in first_stage_train
    x_tilde, vq_loss  = model(x)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/work/PVDM/models/autoencoder/autoencoder_vit.py", line 206, in forward
    z = self.encode(input)
  File "/home/work/PVDM/models/autoencoder/autoencoder_vit.py", line 153, in encode
    h = self.encoder(x)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/work/PVDM/models/autoencoder/vit_modules.py", line 220, in forward
    x = self.to_patch_embedding(video)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/envs/mympi/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (65536x16 and 48x192)

Here's the yaml I used,

model:
  resume: False
  amp: True
  base_learning_rate: 1.0e-4
  params:
    embed_dim: 4
    lossconfig:
      params:
        disc_start: 100000000

    ddconfig: 
      double_z: False
      channels: 192 
      resolution: 128 
      timesteps: 8 
      skip: 1
      in_channels: 1 
      out_ch: 1 
      num_res_blocks: 2
      attn_resolutions: []
      splits: 1
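My reading of the error above, not a confirmed fix: the traceback ends inside an nn.Linear called from to_patch_embedding, and its input width scales with the number of input channels. The mismatch "16 vs 48" is exactly a factor of 3, so the patch-embedding layer appears to have been built for 3-channel frames while in_channels: 1 produces only one-third of the expected features; it presumably has to be constructed from ddconfig's in_channels, or the frames tiled to 3 channels as in the earlier single-channel issue.

    # Back-of-the-envelope check of "(65536x16 and 48x192)":
    # per-patch feature width = in_channels * (per-channel patch volume).
    per_channel_patch = 16                      # what a 1-channel video produced
    expected_width = 48                         # what the Linear layer was built for
    assert expected_width == 3 * per_channel_patch
    print("the layer assumes", expected_width // per_channel_patch, "input channels")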

torch.nn.parallel.DistributedDataParallel

Excellent work! :)
But I hit a bug: when I run the first_stage code on multiple GPUs, it blocks at the line below. I found the issue is caused by model desynchronization.
criterion = torch.nn.parallel.DistributedDataParallel(criterion, device_ids=[device], broadcast_buffers=False, find_unused_parameters=True)
