voletiv / mcvd-pytorch

Official implementation of MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation (https://arxiv.org/abs/2205.09853)

License: MIT License

Python 68.60% Shell 4.78% C++ 0.20% Cuda 1.64% Jupyter Notebook 24.78%

mcvd-pytorch's Introduction

MCVD: Masked Conditional Video Diffusion
for Prediction, Generation, and Interpolation

NeurIPS 2022

This is the official implementation of the NeurIPS 2022 paper MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation. In this paper, we devise a general-purpose model for video prediction (forward and backward), unconditional generation, and interpolation with Masked Conditional Video Diffusion (MCVD) models. Please see our website for more details. This repo is based on the code from https://github.com/ermongroup/ncsnv2.

If you find the code/idea useful for your research, please cite:

@inproceedings{voleti2022MCVD,
 author = {Voleti, Vikram and Jolicoeur-Martineau, Alexia and Pal, Christopher},
 title = {MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation},
 url = {https://arxiv.org/abs/2205.09853},
 booktitle = {(NeurIPS) Advances in Neural Information Processing Systems},
 year = {2022}
}

Scaling

The models from our paper were trained with 1 to 4 GPUs (requiring from 32GB to 160GB of RAM). Models can be scaled down or up for fewer or more GPUs by changing the following parameters (an example command follows the list):

  • model.ngf and model.n_head_channels (doubling ngf and n_head_channels approximately doubles the memory demand)
  • model.num_res_blocks (number of sequential residual layers per block)
  • model.ch_mult=[1,2,3,4,4,4] will use 6 resolution blocks instead of the default 4 (model.ch_mult=[1,2,3,4])
  • training.batch_size (doubling the batch size approximately increases the memory demand by 50%)
  • SPATIN models can be scaled through model.spade_dim (128 -> 512 increases memory demand by 2x, 128 -> 1024 increases it by 4x); it should be scaled proportionally to the number of past+future frames for best results. In practice, we find that SPATIN models often need a very large spade_dim to be competitive, so we recommend that regular users stick to concatenation.
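For example, to fit a smaller GPU, the same parameters can be scaled down through --config_mod (the values below are purely illustrative, not the paper's settings):

--config_mod model.ngf=96 model.n_head_channels=96 model.num_res_blocks=1 training.batch_size=32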

Installation

# if using conda (ignore otherwise)
conda create --name vid python=3.8
# # (Optional) If your machine has a GCC/G++ version < 5:
# conda install -c conda-forge gxx=8.5.0    # (should be executed before the installation of pytorch, torchvision, and torchaudio)
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

pip install -r requirements.txt # install all requirements
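After installing, a quick sanity check (illustrative, not part of this repo) confirms that PyTorch sees the GPU and the expected CUDA toolkit:

# Quick environment sanity check (illustrative, not part of this repo)
import torch

print(torch.__version__)          # PyTorch version
print(torch.version.cuda)         # should report 11.3 for the install above
print(torch.cuda.is_available())  # True if a GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))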

Experiments

The experiments to reproduce the paper can be found in /example_scripts/final/training_scripts.sh and /example_scripts/final/sampling_scripts.sh.

We also provide a small notebook demo for sampling from SMMNIST: https://github.com/voletiv/mcvd-pytorch/blob/master/MCVD_demo_SMMNIST.ipynb.

Pretrained Checkpoints and results

The checkpoints used for the experiments and their results can be found here: https://drive.google.com/drive/u/1/folders/15pDq2ziTv3n5SlrGhGM0GVqwIZXgebyD

Configurations

The model configurations are available in /configs. To override any existing configuration option from a config file, simply use the --config_mod argument on the command line. For example:

--config_mod training.snapshot_freq=50000 sampling.subsample=100 sampling.clip_before=True sampling.max_data_iter=1 model.version=DDPM model.arch=unetmore model.num_res_blocks=2

The important config options are:

training.batch_size=64 # training batch size

sampling.batch_size=200 # sampling batch size
sampling.subsample=100 # how many diffusion steps to take (1000 is best but is slower, 100 is faster)
sampling.max_data_iter=1000 # maximum number of mini-batches of the test set to go through (set to 1 for training and a large value for sampling)

model.ngf=192 # number of channels (controls model size)
model.n_head_channels=192 # number of channels per self-attention head (should ideally be larger than or equal to model.ngf, otherwise you may get a size-mismatch error)
model.spade=True # if True uses space-time adaptive normalization instead of concatenation
model.spade_dim=128 # number of channels in space-time adaptive normalization; worth increasing, especially if conditioning on a large number of frames

sampling.num_frames_pred=16 # number of frames to predict (autoregressively)
data.num_frames=4 # number of current frames
data.num_frames_cond=4 # number of previous frames
data.num_frames_future=4 # number of future frames

data.prob_mask_cond=0.50 # probability of masking the previous frames (allows predicting current frames with no past frames)
data.prob_mask_future=0.50 # probability of masking the future frames (allows predicting current frames with no future frames)

When data.num_frames_future > 0, data.num_frames_cond > 0, data.prob_mask_cond=0.50, and data.prob_mask_future=0.50, one can do video prediction (forward and backward), generation, and interpolation.
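For intuition, this masking amounts to randomly zeroing out the past and/or future conditioning frames during training, so that a single model learns prediction (forward and backward), unconditional generation, and interpolation. A minimal illustrative sketch of the idea (not the repo's exact implementation):

# Illustrative sketch of probabilistic frame masking (not the repo's exact code)
import torch

def mask_conditioning(past, future, prob_mask_cond=0.5, prob_mask_future=0.5):
    # past:   (B, num_frames_cond * C, H, W); future: (B, num_frames_future * C, H, W)
    B = past.shape[0]
    keep_past = (torch.rand(B, 1, 1, 1, device=past.device) >= prob_mask_cond).float()
    keep_future = (torch.rand(B, 1, 1, 1, device=future.device) >= prob_mask_future).float()
    return past * keep_past, future * keep_future

# With both probabilities at 0.5, a batch mixes all four cases: past+future kept
# (interpolation), past only (prediction), future only (backward prediction),
# and neither (unconditional generation).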

Training

You can train on Stochastic Moving MNIST with 1 GPU (if you run into memory issues, use model.ngf=64) using:

CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /my/data/path/to/datasets --exp smmnist_cat --ni

Log files will be saved in <exp>/logs/smmnist_cat. This folder contains stdout, metric plots, and video samples over time.

You can train on Cityscapes with 4 GPUs using:

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/cityscapes_big_spade.yml --data_path /my/data/path/to/datasets --exp exp_city_spade --ni

Sampling

You can look at stdout or the metric plots in <exp>/logs/smmnist_cat to determine which checkpoint provides the best metrics. Then, you can sample 25 predicted frames using the chosen checkpoint (e.g., 250k) of the SMMNIST model above by running main.py with the --video_gen option:

CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /my/data/path/to/datasets --exp smmnist_cat --ni --config_mod sampling.max_data_iter=1000 sampling.num_frames_pred=25 sampling.preds_per_test=10 sampling.subsample=100 model.version=DDPM --ckpt 250000 --video_gen -v videos_250k_DDPM_1000_nfp_pred25

Results will be saved in <exp>/video_samples/videos_250k_DDPM_1000_nfp_pred25.

You can use the above option to sample videos from any pretrained MCVD model.

Esoteric options

We tried a few options that did not help, but we left them in the code. Some of these options might be broken; we make no guarantees, so use them at your own risk.

model.gamma=True # Gamma noise from https://arxiv.org/abs/2106.07582
training.L1=True # L1 loss
model.cond_emb=True # Embedding for whether we mask (1) or don't mask (0)
output_all_frames=True # Option to output/predict all frames, not just current frames
noise_in_cond=True # Diffusion noise also in conditioning frames
one_frame_at_a_time=True # Autoregressive one image at a time instead of blockwise
model.version=FPNDM # F-PNDM from https://arxiv.org/abs/2202.09778

Note that this code can be used to generate images by setting data.num_frames=0, data.num_frames_cond=0, data.num_frames_future=0.
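For example, using the --config_mod syntax shown above:

--config_mod data.num_frames=0 data.num_frames_cond=0 data.num_frames_future=0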

Many unused options from the original code at https://github.com/ermongroup/ncsnv2 also remain; most of them are applicable only to images.

For LPIPS

The code will do it for you!

Code will download https://download.pytorch.org/models/alexnet-owt-7be5be79.pth and move it into: models/weights/v0.1/alex.pth

For FVD

The code will do it for you!

The code will download the I3D model pretrained on Kinetics-400 from "https://onedrive.live.com/download?cid=78EEF3EB6AE7DBCB&resid=78EEF3EB6AE7DBCB%21199&authkey=AApKdFHPXzWLNyI". Use models/fvd/convert_tf_pretrained.py to make i3d_pretrained_400.pt.
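For reference, FVD is the Fréchet distance between Gaussians fitted to I3D embeddings of real and generated videos; the repo's pipeline extracts the embeddings in models/fvd/ and computes the distance in fvd_stuff (see runners/ncsn_runner.py). A minimal standalone sketch of the final distance computation, assuming the embeddings are already extracted:

# Fréchet distance between two sets of I3D embeddings (illustrative sketch)
import numpy as np
from scipy import linalg

def frechet_distance(feats_fake, feats_real):
    # feats_*: (N, D) arrays of I3D features
    mu1, mu2 = feats_fake.mean(axis=0), feats_real.mean(axis=0)
    sigma1 = np.cov(feats_fake, rowvar=False)
    sigma2 = np.cov(feats_real, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))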

Datasets

Stochastic Moving MNIST (64x64, ch1)

The script will automatically download the PyTorch MNIST dataset, which will be used to generate Stochastic Moving MNIST dynamically.

KTH (64x64, ch1)

Download the hdf5 dataset:

gdown --fuzzy https://drive.google.com/file/d/1d2UfHV6RhSrwdDAlCFY3GymtFPpmh_8X/view?usp=sharing

How the data was processed:

  1. Download KTH dataset to /path/to/KTH:
    sh kth_download.sh /path/to/KTH
  2. Convert 64x64 images to HDF5 format:
    python datasets/kth_convert.py --kth_dir '/path/to/KTH' --image_size 64 --out_dir '/path/to/KTH64_h5' --force_h5 False
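To sanity-check a converted or downloaded shard, the HDF5 layout used by the dataset classes stores per-video frame counts under 'len' and frames under string indices (see the KTHDataset __getitem__ excerpted in the issues below). A hedged example, assuming a shard file such as /path/to/KTH64_h5/shard_0001.hdf5 (the actual shard file names may differ):

# Inspect one frame of a converted HDF5 shard (illustrative; the shard path is an assumption)
import h5py

with h5py.File('/path/to/KTH64_h5/shard_0001.hdf5', 'r') as f:
    video_len = f['len']['0'][()]  # number of frames in video 0 of this shard
    frame = f['0']['0'][()]        # first frame of video 0
    print(video_len, frame.shape, frame.dtype)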

BAIR (64x64, ch3)

Download the hdf5 dataset:

gdown --fuzzy https://drive.google.com/file/d/1-R_srAOy5ZcylGXVernqE4WLCe6N4_wq/view?usp=sharing

How the data was processed:

  1. Download BAIR Robotic Push dataset to /path/to/BAIR:
    sh bair_dowload.sh /path/to/BAIR
  2. Convert it to HDF5 format, and save in /path/to/BAIR_h5:
    python datasets/bair_convert.py --bair_dir '/path/to/BAIR' --out_dir '/path/to/BAIR_h5'

Cityscapes (64x64, ch3)

Download the hdf5 dataset:

gdown --fuzzy https://drive.google.com/file/d/1oP7n-FUfa9ifsMn6JHNS9depZfftvrXx/view?usp=sharing

How the data was processed:
MAKE SURE YOU HAVE ~657GB SPACE! 324GB for the zip file, and 333GB for the unzipped image files

  1. Download Cityscapes video dataset (leftImg8bit_sequence_trainvaltest.zip (324GB)):
    sh cityscapes_download.sh username password
             using your username and password that you created on https://www.cityscapes-dataset.com/
  2. Convert it to HDF5 format, and save in /path/to/Cityscapes<image_size>_h5:
    python datasets/cityscapes_convert.py --leftImg8bit_sequence_dir '/path/to/Cityscapes/leftImg8bit_sequence' --image_size 64 --out_dir '/path/to/Cityscapes64_h5'

Cityscapes (128x128, ch3)

Download the hdf5 dataset:

gdown --fuzzy https://drive.google.com/file/d/13yaJkKtmDsgtaEvuXKSvbix5usea6TJy/view?usp=sharing

How the data was processed:
MAKE SURE YOU HAVE ~657GB SPACE! 324GB for the zip file, and 333GB for the unzipped image files

  1. Download Cityscapes video dataset (leftImg8bit_sequence_trainvaltest.zip (324GB)):
    sh cityscapes_download.sh /path/to/download/to username password
             using your username and password that you created on https://www.cityscapes-dataset.com/
  2. Convert it to HDF5 format, and save in /path/to/Cityscapes<image_size>_h5:
    python datasets/cityscapes_convert.py --leftImg8bit_sequence_dir '/path/to/Cityscapes/leftImg8bit_sequence' --image_size 128 --out_dir '/path/to/Cityscapes128_h5'

UCF-101 (orig:320x240, ch3)

Download the hdf5 dataset:

gdown --fuzzy https://drive.google.com/file/d/1bDqhhfKYrdbIIOZeJcWHWjSyFQwmO1t-/view?usp=sharing

How the data was processed:
MAKE SURE YOU HAVE ~20GB SPACE! 6.5GB for the zip file, and 8GB for the unzipped image files

  1. Download UCF-101 video dataset (UCF101.rar (6.5GB)):
    sh cityscapes_download.sh /download/dir
  2. Convert it to HDF5 format, and save in /path/to/UCF101_h5:
    python datasets/ucf101_convert.py --out_dir /path/to/UCF101_h5 --ucf_dir /download/dir/UCF-101 --splits_dir /download/dir/ucfTrainTestlist

mcvd-pytorch's People

Contributors

alexiajm, himangim, voletiv


mcvd-pytorch's Issues

Error in training on MNIST

Hi!

When I was training on MNIST with command:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /cluster/51/dichang/datasets/mcvd --exp smmnist_cat --ni

I received the following error: smmnist_cat/logs/meters.pkl does not exist! Returning.
ERROR - main.py - 2022-06-16 21:39:49,313 - Traceback (most recent call last):
File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
subprocess.run(
File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

I checked the class NCSNRunner and load_meters(); it seems it's trying to load from "meters_pkl = os.path.join(self.args.log_path, 'meters.pkl')". What is the meters.pkl here, and how can I solve the error?

Thanks!

UCF101 dataset

Does the code to sample videos also work on Windows?

KTH dataset content question

Hi, I used gdown --fuzzy https://drive.google.com/file/d/1d2UfHV6RhSrwdDAlCFY3GymtFPpmh_8X/view?usp=sharing to download the KTH64 dataset. When I run dataset, test_dataset = get_dataset(DATA_PATH, config, video_frames_pred=config.data.num_frames), I get a KeyError.

[screenshot of the KeyError]

I found that the file persons.pkl is empty (5 bytes, containing only {}). I would like to confirm with you whether the dataset is all right. Thanks!

Multi-gpu training slower than single gpu

Hi, I just wanted to know whether anyone else faced the same issue.
I'm training on Cityscapes; after a few tests I decided to scale it up, but I noticed something weird.
When training with 4 GPUs, it's way slower than training on 1 GPU only.

I tried different model settings (smaller and bigger); it looks like 100 training steps take 4x the time.

EDIT: Also tried with 2 GPUs on my PC; the training time is 2x that of single-GPU training.

Code looks fine to me.

How to download Moving MNIST dataset for training?

Dear authors,

The README says that The script will automatically download the PyTorch MNIST dataset, which will be used to generate Stochastic Moving MNIST dynamically but doesn't mention which script is supposed to do this. Can you please edit the README with the information?

Thank you.

Error training for KTH64

Dear authors,

I am unable to run the provided training script for KTH64. I downloaded the dataset from gdown --fuzzy https://drive.google.com/file/d/1d2UfHV6RhSrwdDAlCFY3GymtFPpmh_8X/view?usp=sharing as instructed and unzipped it. I tried running training with the following command: CUDA_VISIBLE_DEVICES=0 python main.py --config configs/kth64.yml --data_path Data/KTH64/KTH64_h5 --exp smmnist_cat --ni. However, I end up with the error shown in the attached screenshot. Can you please suggest what the issue is and how to fix it?

Thank you.

[screenshot of the error]

how much is your sampling.batch_size?

Hi, I notice that you set sampling.batch_size=100/200 in sampling_scripts.sh for the SMMNIST/KTH datasets, but your paper says "averaged across 256 test videos" and "on 256 test videos". Does that mean batch_size should be 256? Thanks again!

export config_mod="sampling.batch_size=200 sampling.max_data_iter=1000 model.arch=unetmore model.num_res_blocks=2"

export config_mod="data.num_frames=5 data.num_frames_future=0 data.num_frames_cond=10 training.batch_size=64 sampling.batch_size=100 sampling.max_data_iter=1000 model.arch=unetmore"

About FVD usage: how are conditional frames added into the calculation of FVD?

Question

Hi, I'm back again haha... Our team wants to follow up on your nice work, so I want to figure out your FVD metric settings.

For example, I use your pre-trained KTH model to do video prediction task, and use the setting of kth64_big.yaml to predict videos. As the config says:

num_frames: 5
num_frames_cond: 10
num_frames_future: 0
prob_mask_cond: 0.0
prob_mask_future: 0.0
prob_mask_sync: false

num_frames_pred: 20

It means the model should use 10 conditional frames (past frames) to generate 5 predicted frames (future frames). If we want to predict 20 frames, when we calculate FVD do we compute it over 30 frames (10 conditional frames plus 20 predicted frames), or only over the 20 predicted frames?

When I read other papers, they usually calculate FVD on the predicted frames together with the conditional frames. If that is true, what are the considerations behind including the conditional frames in the calculation, while SSIM, PSNR, and LPIPS are not computed like this because they are image metrics rather than video metrics?

Hoping for your early reply! Thank you again for your nice contribution!

Experiment Results

I (1) used your code to re-train a new model for 200,000 steps and (2) directly used your pre-trained model, to (3) compare these metrics with your paper. It seems the FVD metric is computed on the concatenation of conditional frames and predicted frames.


Code Reading

In the validation stage, you sample from the model by calling video_gen():

# Sample from model
if (step % self.config.training.snapshot_freq == 0 or step % self.config.training.sample_freq == 0) and self.config.training.snapshot_sampling:
    logging.info(f"Saving images in {self.args.log_sample_path}")
    # Calc video metrics with max_data_iter=1
    if conditional and step % self.config.training.snapshot_freq == 0 and self.config.training.snapshot_sampling: # only at snapshot_freq, not at sample_freq
        vid_metrics = self.video_gen(scorenet=test_scorenet, ckpt=step, train=True)

There, real_fvd is the concatenation of cond_original and real, and likewise for fake_fvd. We then get real_embeddings and fake_embeddings and use them to calculate FVD.

if (calc_fvd1 or (calc_fvd3 and not second_calc)) and real.shape[1] >= pred.shape[1]:
    # real
    if future == 0:
        real_fvd = torch.cat([
            cond_original[:, :self.config.data.num_frames_cond*self.config.data.channels],
            real
        ], dim=1)[::preds_per_test] # Ignore the repeated ones
    else:
        real_fvd = torch.cat([
            cond_original[:, :self.config.data.num_frames_cond*self.config.data.channels],
            real,
            cond_original[:, -future*self.config.data.channels:]
        ], dim=1)[::preds_per_test] # Ignore the repeated ones
    real_fvd = to_i3d(real_fvd)
    real_embeddings.append(get_fvd_feats(real_fvd, i3d=i3d, device=self.config.device))
    # fake
    if future == 0:
        fake_fvd = torch.cat([
            cond_original[:, :self.config.data.num_frames_cond*self.config.data.channels], pred], dim=1)
    else:
        fake_fvd = torch.cat([
            cond_original[:, :self.config.data.num_frames_cond*self.config.data.channels],
            pred,
            cond_original[:, -future*self.config.data.channels:]
        ], dim=1)
    fake_fvd = to_i3d(fake_fvd)
    fake_embeddings.append(get_fvd_feats(fake_fvd, i3d=i3d, device=self.config.device))

if calc_fvd1:
    # (1) Video Pred/Interp
    real_embeddings = np.concatenate(real_embeddings)
    fake_embeddings = np.concatenate(fake_embeddings)
    avg_fvd, fvd_traj_mean, fvd_traj_std, fvd_traj_conf95 = fvd_stuff(fake_embeddings, real_embeddings)
    vid_metrics.update({'fvd': avg_fvd, 'fvd_traj_mean': fvd_traj_mean, 'fvd_traj_std': fvd_traj_std, 'fvd_traj_conf95': fvd_traj_conf95})

Reference

SLAMP
https://github.com/kaanakan/slamp/blob/4f5fc0707a4843d34dd1cb98f4939f1357e05183/calculate_fvd.py#L25-L31

SRVP
https://github.com/edouardelasalles/srvp/blob/3e90a748db04d182290132163fea5b0410ea2452/test.py#L295-L302

FVD

[error screenshot]
Hello, may I ask why the error above occurs when calculating the metrics? The weights of the I3D model were downloaded from the link you gave. Looking forward to your answer.

The problem of "targets" parameter

Thanks for the author's work; the experiments are very good!
I want to ask about a parameter problem in the code:

[screenshot] This is the targets parameter of stochastic_moving_mnist.

[screenshot] This is the targets parameter of cityscapes.

The targets parameter seems to return different values in different dataset code. What is the function of this parameter? I want to use my own dataset; how should it be set?
Looking forward to your reply, thank you!

Error when downloading alexnet

My server cannot connect to the network. When I run main.py, I copy models/weights/v0.1/alex.pth to the default AlexNet download path, but the following error occurs. What is the difference between this checkpoint and the model PyTorch downloads by default?
RuntimeError: Error(s) in loading state_dict for AlexNet:
Missing key(s) in state_dict: "features.0.weight", "features.0.bias", "features.3.weight", "features.3.bias", "features.6.weight", "features.6.bias", "features.8.weight", "features.8.bias", "features.10.weight", "features.10.bias", "classifier.1.weight", "classifier.1.bias", "classifier.4.weight", "classifier.4.bias", "classifier.6.weight", "classifier.6.bias".
Unexpected key(s) in state_dict: "lin0.model.1.weight", "lin1.model.1.weight", "lin2.model.1.weight", "lin3.model.1.weight", "lin4.model.1.weight".

Question about DDPM and DDIM sampling.

Hi, thanks for sharing your excellent work!

I just walked through the codebase and noticed that during sampling you use timestep t from 0 to 999 (see here). I think in the reverse pass we should go from 999 down to 0. I'm a little confused about this.

Another question: what does the denoise option mean for the last sampling step? Please check here.

These two questions can be raised either for the DDPM or DDIM sampler. Really appreciate your explanation.

UCF101 Unconditional Generation FVD Result (16 frames vs 20 frames)

Hello. I want to confirm the calculation method for unconditional-generation FVD.
In your paper, you generate 16 frames.


#############
## UCF-101 ##
#############
export config="ucf101"
export data="${data_folder}"
export devices="0"
export nfp="16"

And you calculate FVD between the 16-frame predicted result and the 20-frame original video, right?

For the 20-frame original video:

# real
if future == 0:
    real_fvd = torch.cat([
        cond_original[:, :self.config.data.num_frames_cond*self.config.data.channels],
        real
    ], dim=1)[::preds_per_test] # Ignore the repeated ones

real_fvd = to_i3d(real_fvd)
real_embeddings.append(get_fvd_feats(real_fvd, i3d=i3d, device=self.config.device))

# (3) fake 3: uncond
if calc_fvd3:
    # real uncond
    real_embeddings_uncond.append(real_embeddings2[-1] if second_calc else real_embeddings[-1])

For the 16-frame predicted result:

pred_uncond = torch.cat(pred_samples, dim=1)[:, :self.config.data.channels*num_frames_pred]
pred_uncond = inverse_data_transform(self.config, pred_uncond)

# fake uncond
fake_fvd_uncond = torch.cat([pred_uncond], dim=1) # We don't want to input the zero-mask
fake_fvd_uncond = to_i3d(fake_fvd_uncond)
fake_embeddings_uncond.append(get_fvd_feats(fake_fvd_uncond, i3d=i3d, device=self.config.device))

Calculating the unconditional FVD result:

# (3) uncond
if calc_fvd3:
    real_embeddings_uncond = np.concatenate(real_embeddings_uncond)
    fake_embeddings_uncond = np.concatenate(fake_embeddings_uncond)
    avg_fvd3, fvd3_traj_mean, fvd3_traj_std, fvd3_traj_conf95 = fvd_stuff(fake_embeddings_uncond, real_embeddings_uncond)
    vid_metrics.update({'fvd3': avg_fvd3, 'fvd3_traj_mean': fvd3_traj_mean, 'fvd3_traj_std': fvd3_traj_std, 'fvd3_traj_conf95': fvd3_traj_conf95})

RuntimeError: Trying to resize storage that is not resizable

Hi!

When I was training on ucf101 with command:

python main.py --config configs/ucf101.yml --data_path datasets/download_ucf_101/UCF101_h5 --exp ucf101 --ni

I received following error:
ucf101/logs/meters.pkl does not exist! Returning.
ERROR - main.py - 2023-08-05 11:15:09,283 - Traceback (most recent call last):
File "main.py", line 404, in main
runner.train()
File "/home/jsun/PycharmProjects/mcvd-pytorch-master/runners/ncsn_runner.py", line 374, in train
for batch, (X, y) in enumerate(dataloader):
File "/opt/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in next
data = self._next_data()
File "/opt/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
return self._process_data(data)
File "/opt/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/opt/anaconda3/envs/vid/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/opt/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
return self.collate_fn(data)
File "/opt/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 174, in default_collate
return [default_collate(samples) for samples in transposed] # Backwards compatibility.
File "/opt/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 174, in
return [default_collate(samples) for samples in transposed] # Backwards compatibility.
File "/opt/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/data/utils/collate.py", line 139, in default_collate
out = elem.new(storage).resize
(len(batch), *list(elem.size()))
RuntimeError: Trying to resize storage that is not resizable

I looked this up and it says the data dimensions are not uniform; what should I do?
Thank you

fvd compute question

In StyleGAN-V, they resize the input image to 128x128 to compute the FVD metric.
But the official FVD metric uses 224x224 as input.
What is the input image size in your work? It feels like everyone's treatment is different. Thank you!

Are the metrics from your 2D-convolution MCVD model?

Hi, dear author!

Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner.

You mention that you use 2D convolutions, and you report the model's FVD. But your code has many model modes, like ncsn, unet, unetmore, and unetmore3d. I want to know which mode produces the FVD reported in your paper.

def get_model(config):
    version = getattr(config.model, 'version', 'SMLD').upper()
    arch = getattr(config.model, 'arch', 'ncsn')
    depth = getattr(config.model, 'depth', 'deep')
    if arch == 'unetmore':
        from models.better.ncsnpp_more import UNetMore_DDPM # This lets the code run on CPU when 'unetmore' is not used
        return UNetMore_DDPM(config).to(config.device)#.to(memory_format=torch.channels_last).to(config.device)
    elif arch in ['unetmore3d', 'unetmorepseudo3d']:
        from models.better.ncsnpp_more import UNetMore_DDPM # This lets the code run on CPU when 'unetmore' is not used
        # return UNetMore_DDPM(config).to(memory_format=torch.channels_last_3d).to(config.device) # channels_last_3d doesn't work!
        return UNetMore_DDPM(config).to(config.device)#.to(memory_format=torch.channels_last).to(config.device)
    else:
        Exception("arch is not valid [ncsn, unet, unetmore, unetmore3d]")

In your code /example_scripts/final/training_scripts.sh, I find that the mode is always arch='unetmore', which uses the NCSNpp model, so I am not sure whether your results use 2D or 3D convolutions.

For example:

# Video prediction non-spade
export exp="smmnist_big_5c5_unetm_b2"
export config_mod="training.snapshot_freq=50000 sampling.subsample=100 sampling.clip_before=True sampling.max_data_iter=1 model.version=DDPM model.arch=unetmore model.num_res_blocks=2"
sh ./example_scripts/final/base_1f.sh

And in NCSNpp you use layers3d, which is based on Conv3d.

class NCSNpp(nn.Module):
  """NCSN++ model"""
  def __init__(self, config):
    super().__init__()
    self.config = config
    self.act = act = get_act(config)
    self.register_buffer('sigmas', get_sigmas(config))
    self.is3d = (config.model.arch in ["unetmore3d", "unetmorepseudo3d"])
    self.pseudo3d = (config.model.arch == "unetmorepseudo3d")
    if self.is3d:
      from . import layers3d

what is the `average FVD`?

In Table 1 of your paper, what is the average FVD? Is it avg_fvd or fvd_traj_mean?


https://github.com/voletiv/mcvd-pytorch/blob/451da2eb635bad50da6a7c03b443a34c6eb08b3a/runners/ncsn_runner.py#L2217-L2229C20

        def fvd_stuff(fake_embeddings, real_embeddings):
            avg_fvd = frechet_distance(fake_embeddings, real_embeddings)
            if preds_per_test > 1:
                fvds_list = []
                # Calc FVD for 5 random trajs (each), and average that FVD
                trajs = np.random.choice(np.arange(preds_per_test), (preds_per_test,), replace=False)
                for traj in trajs:
                    fvds_list.append(frechet_distance(fake_embeddings[traj::preds_per_test], real_embeddings))
                fvd_traj_mean, fvd_traj_std  = float(np.mean(fvds_list)), float(np.std(fvds_list))
                fvd_traj_conf95 = fvd_traj_mean - float(st.norm.interval(alpha=0.95, loc=fvd_traj_mean, scale=st.sem(fvds_list))[0])
            else:
                fvd_traj_mean, fvd_traj_std, fvd_traj_conf95 = -1, -1, -1
            return avg_fvd, fvd_traj_mean, fvd_traj_std, fvd_traj_conf95

Adding class condition to time embeddings in resnet block

The referenced paper by Dhariwal et al. 2021 suggests using AdaGN(h, y) = y_s GroupNorm(h) + y_b to combine the time embedding y_s and class embedding y_b with the ResNet block activations h. I am having some trouble understanding how to implement this in the MCVD code in this repo, since the class-conditioning lines are commented out. It seems the time and class embeddings are meant to be concatenated together (based on the commented code) and fed to the ResNet block as "emb".

# resnetblock
def forward(self, x, temb=None, yemb=None, cond=None):
    if temb is not None and yemb is not None:
        emb = torch.cat([temb, yemb], dim=1) # Combine time and class embeddings
        emb_out = self.Dense_0(self.act_emb(emb))[:, :, None, None]  # Linear projection
        scale, shift = torch.chunk(emb_out, 2, dim=1)
        [ ... ]
        emb_norm = self.Norm_0(x)
        x = emb_norm * (1 + scale) + shift

My confusion: how does splitting the linear projection of the combined embeddings into 2 chunks give us scale and shift? How should these two values be interpreted in relation to the time and class embeddings? It seems scale might be analogous to temb and shift to yemb, but that's not what the code suggests.
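For reference, a minimal illustrative sketch of this AdaGN-style modulation (not the repo's exact code): a single linear layer maps the combined embedding to 2*C values, which are split into a scale and a shift; both halves are learned functions of the full combined embedding rather than corresponding separately to temb and yemb.

# Illustrative AdaGN-style modulation (not the repo's exact code)
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    def __init__(self, emb_dim, channels, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.proj = nn.Linear(emb_dim, 2 * channels)  # one projection -> scale and shift

    def forward(self, h, emb):
        # emb: combined (time + class) embedding, shape (B, emb_dim); h: (B, C, H, W)
        scale, shift = self.proj(emb)[:, :, None, None].chunk(2, dim=1)
        return self.norm(h) * (1 + scale) + shift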

PS: Getting some really good results for prediction tasks, thanks for making your code available!

Finetuning model from pretrained checkpoint leads to size mismatch

There seems to be a discrepancy between the BAIR config files and the parameter shapes in pretrained checkpoints from the provided links. I have tried all BAIR configs, and used the checkpoint from bair64_big192_5c1_pmask50_unetm-20230211T213918Z-002. I run this training command: "CUDA_VISIBLE_DEVICES=0 python main.py --config configs/bair.yml --data_path datasets/echo_h5 --resume_training --exp bair --ni". This is the terminal output:

size mismatch for module.unet.all_modules.39.Conv_1.bias: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for module.unet.all_modules.39.Conv_2.weight: copying a param with shape torch.Size([192, 384, 1, 1]) from checkpoint, the shape in current model is torch.Size([64, 128, 1, 1]).
size mismatch for module.unet.all_modules.39.Conv_2.bias: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([64]).

Unconditional generation

Hello thanks for the great repository!

I'm wondering how the unconditional video generation results from the paper can be reproduced.
In the runner.video_gen() function there seems to be an assert that only allows conditional generation.
Is there an example for sampling unconditionally elsewhere?

nvrtc error

Dataset length: 60000
Dataset length: 256
Setting up Perceptual loss...
Downloading: "https://download.pytorch.org/models/alexnet-owt-4df8aa71.pth" to /cluster/home/tangha/.cache/t
orch/hub/checkpoints/alexnet-owt-4df8aa71.pth
100%|█████████████████████████████████████████████████████████████████████| 233M/233M [00:02<00:00, 106MB/s$
Loading model from: /cluster/work/mcvd-pytorch/models/weights/v0.1/alex.pth
...[net-lin [alex]] initialized
...Done

video_gen dataloader: 0%| | 0/1 [00:00<?, ?it/s]
INFO - ncsn_runner.py - 2022-09-03 16:48:22,970 - (1) Video Pred
INFO - ncsn_runner.py - 2022-09-03 16:48:22,971 - PREDICTING 20 frames, using a 5 frame model conditioned on 5 frames, subsample=1000, preds_per_test=1

Generating video frames: 100%|███████████████████████████████████████████████| 4/4 [16:49<00:00, 252.40s/it]
INFO - ncsn_runner.py - 2022-09-03 17:05:21,209 - fvd1 True, fvd2 False, fvd3 False[16:49<00:00, 252.36s/it]

video_gen dataloader: 0%| | 0/1 [17:01<?, ?it/s]
ERROR - main.py - 2022-09-03 17:05:24,564 - Traceback (most recent call last):
File "main.py", line 404, in main
runner.train()
File "/cluster/work/mcvd-pytorch/runners/ncsn_runner.py", line 497, in train
vid_metrics = self.video_gen(scorenet=test_scorenet, ckpt=step, train=True)
File "/cluster/home/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in de
corate_context
return func(args, **kwargs)
File "/cluster/work/mcvd-pytorch/runners/ncsn_runner.py", line 1940, in video_gen
real_embeddings.append(get_fvd_feats(real_fvd, i3d=i3d, device=self.config.device))
File "/cluster/work/mcvd-pytorch/models/fvd/fvd.py", line 55, in get_fvd_feats
embeddings = get_feats(videos, i3d, device, bs)
File "/cluster/work/mcvd-pytorch/models/fvd/fvd.py", line 48, in get_feats
feats = np.vstack([feats, detector(torch.stack([preprocess_single(video) for video in videos[i*bs:(i+1)*bs]]).to(device), **detector_kwargs).detach().cpu().numpy()])
File "/cluster/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _c
all_impl
result = self.forward(*input, **kwargs)
File "/cluster/home/.local/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 16
5, in forward
return self.module(*inputs[0], **kwargs[0])
File "/cluster/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _c
all_impl
result = self.forward(*input, **kwargs)
RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.11.1.
Make sure that libnvrtc-builtins.so.11.1 is installed correctly.
nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template <typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

Why set KTH/Cityscapes test dataset length to 256?

  1. I see you set the test dataset length to 256. Is it to make calculating FVD easier, or for another reason? Do you follow another researcher's setting? This setting is different from other models like SimVP (for SMMNIST the train dataset length is 10000 and the test dataset length is 10000, and for KTH the train and test datasets are also different).

test_dataset = StochasticMovingMNIST(data_path, train=False, seq_len=seq_len, num_digits=getattr(config.data, "num_digits", 2),
step_length=config.data.step_length, with_target=True, total_videos=256)

test_dataset = KTHDataset(data_path, frames_per_sample=frames_per_sample, train=False,
random_time=True, random_horizontal_flip=False, total_videos=256, start_at=start_at)

test_dataset = CityscapesDataset(os.path.join(data_path, "test"), frames_per_sample=frames_per_sample, random_time=True,
random_horizontal_flip=False, color_jitter=0.0, total_videos=256)


  2. In SimVP, for the KTH dataset, their method is to clip the videos into small clips offline. I notice your code gets a clip randomly, online, when the user fetches an item by index. Also, your test dataset is actually smaller than 256 (5 people x 4 scenes x 6 actions = 120 videos), and you use a mapping to get more data (e.g., index 200 => round(200*(119/255)) = 93 => the randomly clipped video at index 93 of the 120 original test videos). Has this method been used in other models?
video_index = round(index / (self.__len__() - 1) * (self.max_index() - 1)) = round(index*(119/255))
# a projection from [0, 255] to [0, 119]
shard_idx, idx_in_shard = self.videos_ds.get_indices(video_index)
# get the video from the 120-video dataset

def __getitem__(self, index, time_idx=0):
    # Use `index` to select the video, and then
    # randomly choose a `frames_per_sample` window of frames in the video
    video_index = round(index / (self.__len__() - 1) * (self.max_index() - 1))
    shard_idx, idx_in_shard = self.videos_ds.get_indices(video_index)
    prefinals = []
    flip_p = np.random.randint(2) == 0 if self.random_horizontal_flip else 0
    with self.videos_ds.opener(self.videos_ds.shard_paths[shard_idx]) as f:
        video_len = f['len'][str(idx_in_shard)][()]
        if self.random_time and video_len > self.frames_per_sample:
            time_idx = np.random.choice(video_len - self.frames_per_sample)
        for i in range(time_idx, min(time_idx + self.frames_per_sample, video_len)):
            # byte_str = f[str(idx_in_shard)][str(i)][()]
            # img = Image.frombytes('RGB', (64, 64), byte_str)
            # arr = np.expand_dims(np.array(img.getdata()).reshape(img.size[1], img.size[0], 3), 0)
            img = f[str(idx_in_shard)][str(i)][()]
            arr = transforms.RandomHorizontalFlip(flip_p)(transforms.ToTensor()(img))
            prefinals.append(arr)
    data = torch.stack(prefinals)
    data = self.jitter(data)
    if self.with_target:
        return data, torch.tensor(1)
    else:
        return data
