
guided-diffusion's Introduction

guided-diffusion

This is the codebase for Diffusion Models Beat GANs on Image Synthesis.

This repository is based on openai/improved-diffusion, with modifications for classifier conditioning and architecture improvements.

Download pre-trained models

We have released checkpoints for the main models in the paper. Before using these models, please review the corresponding model card to understand the intended use and limitations of these models.

Sampling from pre-trained models

To sample from these models, you can use the classifier_sample.py, image_sample.py, and super_res_sample.py scripts. Here, we provide flags for sampling from all of these models. We assume that you have downloaded the relevant model checkpoints into a folder called models/.

For these examples, we will generate 100 samples with batch size 4. Feel free to change these values.

SAMPLE_FLAGS="--batch_size 4 --num_samples 100 --timestep_respacing 250"

Classifier guidance

Note that for these sampling runs you can set --classifier_scale 0 to sample from the base diffusion model; in that case, you may also use the image_sample.py script instead of classifier_sample.py.
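
For context, classifier guidance (Algorithm 1 in the paper) samples each reverse transition from a Gaussian whose mean is shifted by the scaled classifier gradient:

    x_{t-1} \sim \mathcal{N}\big(\mu_\theta(x_t) + s\,\Sigma_\theta(x_t)\,\nabla_{x_t} \log p_\phi(y \mid x_t),\ \Sigma_\theta(x_t)\big)

where s is the value of --classifier_scale, so s = 0 recovers the unguided reverse process.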

  • 64x64 model:
MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --dropout 0.1 --image_size 64 --learn_sigma True --noise_schedule cosine --num_channels 192 --num_head_channels 64 --num_res_blocks 3 --resblock_updown True --use_new_attention_order True --use_fp16 True --use_scale_shift_norm True"
python classifier_sample.py $MODEL_FLAGS --classifier_scale 1.0 --classifier_path models/64x64_classifier.pt --classifier_depth 4 --model_path models/64x64_diffusion.pt $SAMPLE_FLAGS
  • 128x128 model:
MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --noise_schedule linear --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
python classifier_sample.py $MODEL_FLAGS --classifier_scale 0.5 --classifier_path models/128x128_classifier.pt --model_path models/128x128_diffusion.pt $SAMPLE_FLAGS
  • 256x256 model:
MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 256 --learn_sigma True --noise_schedule linear --num_channels 256 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
python classifier_sample.py $MODEL_FLAGS --classifier_scale 1.0 --classifier_path models/256x256_classifier.pt --model_path models/256x256_diffusion.pt $SAMPLE_FLAGS
  • 256x256 model (unconditional):
MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond False --diffusion_steps 1000 --image_size 256 --learn_sigma True --noise_schedule linear --num_channels 256 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
python classifier_sample.py $MODEL_FLAGS --classifier_scale 10.0 --classifier_path models/256x256_classifier.pt --model_path models/256x256_diffusion_uncond.pt $SAMPLE_FLAGS
  • 512x512 model:
MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 512 --learn_sigma True --noise_schedule linear --num_channels 256 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 False --use_scale_shift_norm True"
python classifier_sample.py $MODEL_FLAGS --classifier_scale 4.0 --classifier_path models/512x512_classifier.pt --model_path models/512x512_diffusion.pt $SAMPLE_FLAGS
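
The sampling scripts above write their outputs to an .npz file. Here is a minimal sketch for inspecting one; the key names assume numpy's default np.savez naming, with arr_0 holding the uint8 images and arr_1 the class labels when class-conditional, and "samples.npz" is a placeholder for the filename the script reports:

    import numpy as np
    from PIL import Image

    data = np.load("samples.npz")
    images = data["arr_0"]  # uint8 images, shape [num_samples, H, W, 3]
    print(images.shape, images.dtype)
    Image.fromarray(images[0]).save("sample_0.png")  # quick visual check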

Upsampling

For these runs, we assume you have some base samples in a file 64_samples.npz or 128_samples.npz for the two respective models.

  • 64 -> 256:
MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --large_size 256  --small_size 64 --learn_sigma True --noise_schedule linear --num_channels 192 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
python super_res_sample.py $MODEL_FLAGS --model_path models/64_256_upsampler.pt --base_samples 64_samples.npz $SAMPLE_FLAGS
  • 128 -> 512:
MODEL_FLAGS="--attention_resolutions 32,16 --class_cond True --diffusion_steps 1000 --large_size 512 --small_size 128 --learn_sigma True --noise_schedule linear --num_channels 192 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
python super_res_sample.py $MODEL_FLAGS --model_path models/128_512_upsampler.pt $SAMPLE_FLAGS --base_samples 128_samples.npz
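
If you need to build a base-samples file yourself, here is a hedged sketch of the layout super_res_sample.py is assumed to expect, mirroring what the sampling scripts above produce (uint8 images under arr_0 and, for class-conditional upsampling, integer labels under arr_1; the arrays below are random placeholders):

    import numpy as np

    # Random placeholder data standing in for real 64x64 base samples.
    images = np.random.randint(0, 256, size=(100, 64, 64, 3), dtype=np.uint8)
    labels = np.random.randint(0, 1000, size=(100,), dtype=np.int64)

    # np.savez stores positional arguments under arr_0, arr_1, ...
    np.savez("64_samples.npz", images, labels)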

LSUN models

These models are class-unconditional and correspond to a single LSUN class. Here, we show how to sample from lsun_bedroom.pt, but the other two LSUN checkpoints should work as well:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond False --diffusion_steps 1000 --dropout 0.1 --image_size 256 --learn_sigma True --noise_schedule linear --num_channels 256 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
python image_sample.py $MODEL_FLAGS --model_path models/lsun_bedroom.pt $SAMPLE_FLAGS

You can sample from lsun_horse_nodropout.pt by changing the dropout flag:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond False --diffusion_steps 1000 --dropout 0.0 --image_size 256 --learn_sigma True --noise_schedule linear --num_channels 256 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
python image_sample.py $MODEL_FLAGS --model_path models/lsun_horse_nodropout.pt $SAMPLE_FLAGS

Note that for these models, the best samples result from using 1000 timesteps:

SAMPLE_FLAGS="--batch_size 4 --num_samples 100 --timestep_respacing 1000"

Results

This table summarizes our ImageNet results for pure guided diffusion models:

Dataset FID Precision Recall
ImageNet 64x64 2.07 0.74 0.63
ImageNet 128x128 2.97 0.78 0.59
ImageNet 256x256 4.59 0.82 0.52
ImageNet 512x512 7.72 0.87 0.42

This table shows the best results for high resolutions when using upsampling and guidance together:

Dataset FID Precision Recall
ImageNet 256x256 3.94 0.83 0.53
ImageNet 512x512 3.85 0.84 0.53

Finally, here are the unguided results on individual LSUN classes:

Dataset FID Precision Recall
LSUN Bedroom 1.90 0.66 0.51
LSUN Cat 5.57 0.63 0.52
LSUN Horse 2.57 0.71 0.55

Training models

Training diffusion models is described in the parent repository. Training a classifier is similar. We assume you have put training hyperparameters into a TRAIN_FLAGS variable, and classifier hyperparameters into a CLASSIFIER_FLAGS variable. Then you can run:

mpiexec -n N python scripts/classifier_train.py --data_dir path/to/imagenet $TRAIN_FLAGS $CLASSIFIER_FLAGS

Make sure to divide the batch size in TRAIN_FLAGS by the number of MPI processes you are using.
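
As a concrete example of that bookkeeping (the numbers here are illustrative):

    # --batch_size in TRAIN_FLAGS is per MPI process, so split the global
    # batch evenly across the N processes given to "mpiexec -n N".
    global_batch = 256   # total batch size across all processes
    num_processes = 4    # illustrative N
    per_process = global_batch // num_processes
    assert per_process * num_processes == global_batch
    print(f"--batch_size {per_process}")  # --batch_size 64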

Here are flags for training the 128x128 classifier. You can modify these for training classifiers at other resolutions:

TRAIN_FLAGS="--iterations 300000 --anneal_lr True --batch_size 256 --lr 3e-4 --save_interval 10000 --weight_decay 0.05"
CLASSIFIER_FLAGS="--image_size 128 --classifier_attention_resolutions 32,16,8 --classifier_depth 2 --classifier_width 128 --classifier_pool attention --classifier_resblock_updown True --classifier_use_scale_shift_norm True"

For sampling from a 128x128 classifier-guided model, 25 step DDIM:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
CLASSIFIER_FLAGS="--image_size 128 --classifier_attention_resolutions 32,16,8 --classifier_depth 2 --classifier_width 128 --classifier_pool attention --classifier_resblock_updown True --classifier_use_scale_shift_norm True --classifier_scale 1.0 --classifier_use_fp16 True"
SAMPLE_FLAGS="--batch_size 4 --num_samples 50000 --timestep_respacing ddim25 --use_ddim True"
mpiexec -n N python scripts/classifier_sample.py \
    --model_path /path/to/model.pt \
    --classifier_path path/to/classifier.pt \
    $MODEL_FLAGS $CLASSIFIER_FLAGS $SAMPLE_FLAGS

To sample for 250 timesteps without DDIM, replace --timestep_respacing ddim25 with --timestep_respacing 250, and replace --use_ddim True with --use_ddim False.
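
Conceptually, --timestep_respacing selects a smaller, roughly evenly spaced subsequence of the 1000 training timesteps to run at sampling time. A minimal sketch of that idea (not the repo's exact implementation):

    # Choose num_sample_steps timesteps, roughly evenly spaced, out of the
    # num_train_steps the model was trained with.
    def respace_timesteps(num_train_steps, num_sample_steps):
        stride = num_train_steps / num_sample_steps
        return sorted({round(i * stride) for i in range(num_sample_steps)})

    steps = respace_timesteps(1000, 250)
    print(len(steps), steps[:4])  # 250 [0, 4, 8, 12]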

guided-diffusion's People

Contributors

erinbeesley, leedoyup, liujianzhi, prafullasd, unixpickle

guided-diffusion's Issues

Issue with Attention Resolution Preprocessing

First of all, thanks for open-sourcing your implementation. Wherever you look, the base implementation is OpenAI's guided diffusion, which is great!

I was going over the code for a personal project, and I understood that the model config is preprocessed using the following code in script_util.py:

    for res in attention_resolutions.split(","):
        attention_ds.append(image_size // int(res))

However, in unet.py, attention_resolutions is defined as:

a collection of downsample rates at which attention will take place. May be a set, list, or tuple. For example, if this contains 4, then at 4x downsampling, attention will be used.

This means the implementation is independent of the image resolution, which totally makes sense.

The only thing that needs to be changed to fix this discrepancy is to change the code snippet above to:

    for res in attention_resolutions.split(","):
        attention_ds.append(int(res))

I would be more than happy to submit a PR, but first wanted to bring this to your attention and seek your opinion.
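
A small self-contained illustration of the discrepancy, using the 64x64 config above (variable names are mine):

    image_size = 64
    attention_resolutions = "32,16,8"

    # Current preprocessing: flag values are treated as feature-map
    # resolutions and divided into downsample rates.
    via_division = [image_size // int(r) for r in attention_resolutions.split(",")]
    print(via_division)  # [2, 4, 8] -> attention at the 32x32, 16x16, 8x8 maps

    # Reading the same flag as downsample rates directly, as the unet.py
    # docstring describes, selects very different layers.
    as_rates = [int(r) for r in attention_resolutions.split(",")]
    print(as_rates)      # [32, 16, 8] -> attention at the 2x2, 4x4, 8x8 maps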

Adjusting `cond_fn` for enabling `nn.DataParallel` to use Multi-GPUs

Hello there,
I hope you are all doing well. Sorry if my question looks silly, but I want to know whether it is possible to wrap the diffusion model in nn.DataParallel to parallelize the guided generation across multiple GPUs.

Or, alternatively, whether guided-diffusion can be adjusted to run across multiple GPUs.

Regards,

[Bugs] Classifier is conditioning on x_{t+1} instead of x_{t} and grad calculation

Hi @unixpickle @prafullasd @erinbeesley,
I think I found 2 bugs:

  1. Shouldn't we pass out["mean"] (x_{t}) instead of x (x_{t+1}) here (similarly t-1 instead of t):
    https://github.com/openai/guided-diffusion/blob/main/guided_diffusion/gaussian_diffusion.py#L435

  2. Shouldn't we separate grad calculation here?
    https://github.com/openai/guided-diffusion/blob/main/scripts/classifier_sample.py#L61
    We need the grad of the i-th image in the batch w.r.t. its own log prob, not w.r.t. the sum of log probs, right?
    It makes no sense to optimize the sum, as we want each image to be directed by its own class.

I might have misunderstood something; please let me know if so! :)
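
Regarding point 2, here is a quick self-contained check (with a stand-in per-image score, since the classifier processes batch elements independently) of how gradients of a summed objective decompose:

    import torch

    x = torch.randn(4, 3, 8, 8, requires_grad=True)
    # Stand-in for the selected per-image log probs: each entry depends
    # only on its own image, as with a per-sample classifier score.
    scores = (x ** 2).flatten(1).sum(dim=1)

    grad_of_sum = torch.autograd.grad(scores.sum(), x, retain_graph=True)[0]
    grad_of_first = torch.autograd.grad(scores[0], x)[0]

    # The sum's gradient decomposes per image; no mixing across the batch.
    assert torch.allclose(grad_of_sum[0], grad_of_first[0])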

about model size

Hello, thanks for your code. I want to ask why my 512x512 diffusion model checkpoint is more than 400 MB while yours is about 2 GB. My dataset is about 50 percent the size of yours, and I trained on your code following the parent repository's directions. Looking forward to your reply, thanks!

Sampling at 64x64 - Missing key(s) in state_dict / size mismatch - segfault

I want to sample images from the pretrained 64x64_diffusion model but am hitting a segfault with the suggested run configuration. I've downloaded the 64x64 checkpoints to a models folder and am running with the following flags.

!SAMPLE_FLAGS="--batch_size 4 --num_samples 100 --timestep_respacing 250"

!MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --dropout 0.1 --image_size 64 --learn_sigma True --noise_schedule cosine --num_channels 192 --num_head_channels 64 --num_res_blocks 3 --resblock_updown True --use_new_attention_order True --use_fp16 True --use_scale_shift_norm True"

!python image_sample.py $MODEL_FLAGS --model_path models/64x64_diffusion.pt $SAMPLE_FLAGS

At runtime, I get a slew of warnings about missing and unexpected keys before the code crashes with a segfault:

Missing key(s) in state_dict: "input_blocks.3.0.op.weight", "input_blocks.3.0.op.bias", "input_blocks.4.0.skip_connection.weight", ..., "output_blocks.8.1.conv.bias".

Unexpected key(s) in state_dict: "label_emb.weight", "input_blocks.12.0.in_layers.0.weight", "input_blocks.12.0.in_layers.0.bias", ..., "output_blocks.11.2.out_layers.3.bias".

size mismatch for time_embed.0.weight: copying a param with shape torch.Size([768, 192]) from checkpoint, the shape in current model is torch.Size([512, 128]). ... size mismatch for out.2.bias: copying a param with shape torch.Size([6]) from checkpoint, the shape in current model is torch.Size([3]).
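
Note that each Jupyter ! line runs in its own shell, so variables assigned in one ! line are not visible to the next. One way to rule out configuration drift is to rebuild the model from explicit keyword arguments and compare parameter counts against the checkpoint before loading. A hedged sketch, assuming the guided_diffusion package from this repo is importable (function names per its script_util; treat as illustrative):

    import torch
    from guided_diffusion.script_util import (
        model_and_diffusion_defaults,
        create_model_and_diffusion,
    )

    # Build the 64x64 model from explicit keyword overrides, bypassing the
    # shell-variable plumbing entirely.
    args = model_and_diffusion_defaults()
    args.update(dict(
        image_size=64, num_channels=192, num_res_blocks=3,
        num_head_channels=64, attention_resolutions="32,16,8",
        class_cond=True, learn_sigma=True, noise_schedule="cosine",
        resblock_updown=True, use_new_attention_order=True,
        use_scale_shift_norm=True, dropout=0.1,
    ))
    model, diffusion = create_model_and_diffusion(**args)

    ckpt = torch.load("models/64x64_diffusion.pt", map_location="cpu")
    # Parameter counts should agree before load_state_dict can succeed.
    print(sum(p.numel() for p in model.parameters()),
          sum(v.numel() for v in ckpt.values()))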

Error when training with 1 gpu: RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Settings:
Win10 Pro
python 3.7.9
pytorch 1.8.1+cu111
1 GPU
GLOO backend
jupyter notebook

I can run the other scripts fine (classifier_sample.py, super_res_sample.py), but when I tried to run classifier_train.py I got a runtime error.

...\torch\distributed\distributed_c10d.py in broadcast(tensor, src, group, async_op)
1027 return work
1028 else:
-> 1029 work.wait()
1030
1031

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

These are the arguments I used and the training command:

TRAIN_FLAGS="--iterations 300000 --anneal_lr True --batch_size 256 --lr 3e-4 --save_interval 10000 --weight_decay 0.05"
CLASSIFIER_FLAGS="--image_size 128 --classifier_attention_resolutions 32,16,8 --classifier_depth 2 --classifier_width 128 --classifier_pool attention --classifier_resblock_updown True --classifier_use_scale_shift_norm True"

%run scripts/classifier_train.py --data_dir r"G:\data_set\imagenette2-160\train" $TRAIN_FLAGS $CLASSIFIER_FLAGS

Thanks for any comments and assistance in advance.

Any plan to release "optimized version" according to Appendix A?

I believe the current version is the "naive" implementation, as the throughput I measured in benchmarking is close to the paper's number for it.

I tried FusedAdam from Apex, but it didn't improve the throughput much, so either that's not what you used, or fused GroupNorm-Swish has a bigger benefit.

Do you have any plan to release the optimized version, or any code snippet one could use to improve the throughput?

Sampling details for FID evaluation

While it's mentioned that the full training set was used as the reference for FID, I couldn't find details about the evaluation itself, so I wonder if I have missed them. I noticed that 10K samples (instead of 50K) were used in the ablations, but no other sampling specifics. Also, for guided sampling, details such as the number of samples per class would be useful. Any insight is appreciated, from the authors or readers.

Compute Comparison

Hi, big fan of your work here, the results are amazing!
Also, thank you for including the table on the compute comparison; these are rare to see :)

I wondered if you also have the compute numbers for ImageNet64x64 since you ran those experiments, but they do not appear in A.3 (Table 10). I want to compare to your model and your baselines, and it would be super helpful to have those numbers and not have to rerun the experiments.

Thanks!

How to train on images that have multiple labels?

Thank you so much for releasing this code!

I am wondering how to train the model on images that have multiple labels?
In my understanding, formulation (2) in the paper, p_{θ,φ}(x_t | x_{t+1}, y) = Z p_θ(x_t | x_{t+1}) p_φ(y | x_t), shows that only one label y is incorporated into the conditional reverse noising process. If an image has multiple labels, should the label y be the mean of those labels, or should p_φ(y | x_t) become ∏_{i=1}^{k} p_φ(y_i | x_t) for k labels?
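
For what it's worth, under the product factorization the guidance gradient would simply be the sum of the per-label gradients, since

    \nabla_{x_t} \log \prod_{i=1}^{k} p_\phi(y_i \mid x_t) = \sum_{i=1}^{k} \nabla_{x_t} \log p_\phi(y_i \mid x_t)

so each label would contribute its own gradient term.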

Thanks again.

Question about the guide process

Hi all!
I'm having trouble understanding the guide process using this repo.

I've trained a model (~10K random dog images) using unlabeled data with the following script:
python3 scripts/image_train.py --data_dir ./dogDB --image_size 256 --num_channels 64 --num_res_blocks 3 --diffusion_steps 2000 --noise_schedule linear --lr 1e-4 --batch_size 8
This resulted in the three checkpoint files: "model_XXXX.pt", "ema_XXXX.pt", and "opt_XXXX.pt".

Then, I trained the classifier on labeled images (10 classes, dog breeds, folders and images renamed according to the repo 'improved-diffusion') using the following script:
python3 scripts/classifier_train.py --data_dir ./dogBreeds --lr 1e-4 --batch_size 8 --image_size 256 --classifier_attention_resolutions 32,16,8 --classifier_depth 2 --classifier_width 128 --classifier_pool attention --classifier_resblock_updown True --classifier_use_scale_shift_norm True --classifier_use_fp16 True

Finally, I'm sampling with classifier guidance using:
python3 scripts/classifier_sample.py --attention_resolutions 32,16,8 --class_cond True --diffusion_steps 2000 --image_size 256 --learn_sigma True --noise_schedule linear --num_channels 64 --num_head_channels 64 --num_res_blocks 3 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --classifier_scale 1.0 --classifier_path ./classifierRes/model050000.pt --model_path ./trainRes/model700000.pt --batch_size 10 --num_samples 20

Doing this I get only noisy images.
My goal is to sample as in Figure 3 of the paper.

Any help to solve this?

Thanks!

Bug in attention?

The scale is being calculated here as 1 / math.sqrt(math.sqrt(ch)). The comment says it was adapted from the attention implementation here, where the scale is int(C) ** (-0.5), which is 1 / math.sqrt(ch), not 1 / math.sqrt(math.sqrt(ch)).

Is this change to use 2 square roots intentional?
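
As far as I can tell, the scale is applied to both q and k before the matmul, so here is a hedged numerical check of the usual rationale: scaling each side by ch ** -0.25 is mathematically equivalent to a single 1 / sqrt(ch) scale on the dot products, while keeping intermediate magnitudes smaller (which matters in fp16):

    import math
    import torch

    ch = 64
    q = torch.randn(8, ch)
    k = torch.randn(8, ch)

    # Scale q and k separately by ch ** -0.25 each...
    scale = 1 / math.sqrt(math.sqrt(ch))
    weights_split = (q * scale) @ (k * scale).t()

    # ...versus the conventional single ch ** -0.5 scaling.
    weights_plain = (q @ k.t()) / math.sqrt(ch)

    assert torch.allclose(weights_split, weights_plain, atol=1e-5)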

network architecture (UNet)

Dear Guided-diffusion team,

Thank you for sharing this great work; I really enjoy it.

Is there a figure showing the network architecture described in Section 3, Architecture Improvements? It is very hard to understand the final architecture without one.

Thank you for your help and support.

Best Wishes,

Zongze

enumerate errors

I seem to keep getting enumerate errors.
IndexError: list index out of range

The code where the problem occurs is the following:

    for j, sample in enumerate(samples):
        cur_t -= 1
        if j % 100 == 0 or cur_t == -1:
            print()
            for k, image in enumerate(sample['pred_xstart']):
                filename = f'progress_{i * batch_size + k:1}.png'
                TF.to_pil_image(image.add(1).div(2).clamp(0, 1)).save(filename)
                tqdm.write(f'Batch {i}, step {j}, output {k}:')
                display.display(display.Image(filename))

That first line is usually where the error is highlighted. I'm new to this, so is there anywhere I should be looking to see what's wrong?

Error when loading pretrained weights

Thank you for releasing pretrained weights.
I tried to use some of your pretrained weights as described in the README, but there is a mismatch between the checkpoint weights and the model.

Logging to /tmp/openai-2021-07-22-14-28-52-986510
creating model and diffusion...
Traceback (most recent call last):
File "scripts/image_sample.py", line 108, in
main()
File "scripts/image_sample.py", line 33, in main
model.load_state_dict(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UNetModel:
Missing key(s) in state_dict: "input_blocks.3.0.op.weight", "input_blocks.3.0.op.bias", "input_blocks.4.0.skip_connection.weight", "input_blocks.4.0.skip_connection.bias", "input_blocks.6.0.op.weight", "input_blocks.6.0.op.bias", "input_blocks.7.1.norm.weight", "input_blocks.7.1.norm.bias", "input_blocks.7.1.qkv.weight", "input_blocks.7.1.qkv.bias", "input_blocks.7.1.proj_out.weight", "input_blocks.7.1.proj_out.bias", "input_blocks.8.1.norm.weight", "input_blocks.8.1.norm.bias", "input_blocks.8.1.qkv.weight", "input_blocks.8.1.qkv.bias", "input_blocks.8.1.proj_out.weight", "input_blocks.8.1.proj_out.bias", "input_blocks.9.0.op.weight", "input_blocks.9.0.op.bias", "input_blocks.10.0.skip_connection.weight", "input_blocks.10.0.skip_connection.bias", "output_blocks.2.2.conv.weight", "output_blocks.2.2.conv.bias", "output_blocks.5.2.conv.weight", "output_blocks.5.2.conv.bias", "output_blocks.8.1.conv.weight", "output_blocks.8.1.conv.bias".
Unexpected key(s) in state_dict: "input_blocks.12.0.in_layers.0.weight", "input_blocks.12.0.in_layers.0.bias", "input_blocks.12.0.in_layers.2.weight", ..., "output_blocks.11.1.out_layers.3.bias".
size mismatch for time_embed.0.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([512, 128]).
size mismatch for time_embed.0.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for input_blocks.0.0.weight: copying a param with shape torch.Size([256, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 3, 3, 3]).
... (many more size-mismatch lines in the same pattern, covering the remaining input, middle, and output blocks)
size mismatch for output_blocks.4.1.norm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.4.1.qkv.weight: copying a param with shape torch.Size([3072, 1024, 1]) from checkpoint, the shape in current model is torch.Size([1152, 384, 1]).
size mismatch for output_blocks.4.1.qkv.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([1152]).
size mismatch for output_blocks.4.1.proj_out.weight: copying a param with shape torch.Size([1024, 1024, 1]) from checkpoint, the shape in current model is torch.Size([384, 384, 1]).
size mismatch for output_blocks.4.1.proj_out.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.5.0.in_layers.0.weight: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([640]).
size mismatch for output_blocks.5.0.in_layers.0.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([640]).
size mismatch for output_blocks.5.0.in_layers.2.weight: copying a param with shape torch.Size([1024, 1536, 3, 3]) from checkpoint, the shape in current model is torch.Size([384, 640, 3, 3]).
size mismatch for output_blocks.5.0.in_layers.2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.5.0.emb_layers.1.weight: copying a param with shape torch.Size([2048, 1024]) from checkpoint, the shape in current model is torch.Size([768, 512]).
size mismatch for output_blocks.5.0.emb_layers.1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for output_blocks.5.0.out_layers.0.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.5.0.out_layers.0.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.5.0.out_layers.3.weight: copying a param with shape torch.Size([1024, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([384, 384, 3, 3]).
size mismatch for output_blocks.5.0.out_layers.3.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.5.0.skip_connection.weight: copying a param with shape torch.Size([1024, 1536, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 640, 1, 1]).
size mismatch for output_blocks.5.0.skip_connection.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.5.1.norm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.5.1.norm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.5.1.qkv.weight: copying a param with shape torch.Size([3072, 1024, 1]) from checkpoint, the shape in current model is torch.Size([1152, 384, 1]).
size mismatch for output_blocks.5.1.qkv.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([1152]).
size mismatch for output_blocks.5.1.proj_out.weight: copying a param with shape torch.Size([1024, 1024, 1]) from checkpoint, the shape in current model is torch.Size([384, 384, 1]).
size mismatch for output_blocks.5.1.proj_out.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.6.0.in_layers.0.weight: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([640]).
size mismatch for output_blocks.6.0.in_layers.0.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([640]).
size mismatch for output_blocks.6.0.in_layers.2.weight: copying a param with shape torch.Size([512, 1536, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 640, 3, 3]).
size mismatch for output_blocks.6.0.in_layers.2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.6.0.emb_layers.1.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for output_blocks.6.0.emb_layers.1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for output_blocks.6.0.out_layers.0.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.6.0.out_layers.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.6.0.out_layers.3.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for output_blocks.6.0.out_layers.3.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.6.0.skip_connection.weight: copying a param with shape torch.Size([512, 1536, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 640, 1, 1]).
size mismatch for output_blocks.6.0.skip_connection.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.7.0.in_layers.0.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for output_blocks.7.0.in_layers.0.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for output_blocks.7.0.in_layers.2.weight: copying a param with shape torch.Size([512, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 512, 3, 3]).
size mismatch for output_blocks.7.0.in_layers.2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.7.0.emb_layers.1.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for output_blocks.7.0.emb_layers.1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for output_blocks.7.0.out_layers.0.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.7.0.out_layers.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.7.0.out_layers.3.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for output_blocks.7.0.out_layers.3.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.7.0.skip_connection.weight: copying a param with shape torch.Size([512, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 512, 1, 1]).
size mismatch for output_blocks.7.0.skip_connection.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.8.0.in_layers.0.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.8.0.in_layers.0.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.8.0.in_layers.2.weight: copying a param with shape torch.Size([512, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 384, 3, 3]).
size mismatch for output_blocks.8.0.in_layers.2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.8.0.emb_layers.1.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for output_blocks.8.0.emb_layers.1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for output_blocks.8.0.out_layers.0.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.8.0.out_layers.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.8.0.out_layers.3.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for output_blocks.8.0.out_layers.3.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.8.0.skip_connection.weight: copying a param with shape torch.Size([512, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 384, 1, 1]).
size mismatch for output_blocks.8.0.skip_connection.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.9.0.in_layers.0.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.9.0.in_layers.0.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for output_blocks.9.0.in_layers.2.weight: copying a param with shape torch.Size([512, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 384, 3, 3]).
size mismatch for output_blocks.9.0.in_layers.2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.9.0.emb_layers.1.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([256, 512]).
size mismatch for output_blocks.9.0.emb_layers.1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.9.0.out_layers.0.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.9.0.out_layers.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.9.0.out_layers.3.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for output_blocks.9.0.out_layers.3.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.9.0.skip_connection.weight: copying a param with shape torch.Size([512, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 384, 1, 1]).
size mismatch for output_blocks.9.0.skip_connection.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.10.0.in_layers.0.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.10.0.in_layers.0.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.10.0.in_layers.2.weight: copying a param with shape torch.Size([512, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 256, 3, 3]).
size mismatch for output_blocks.10.0.in_layers.2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.10.0.emb_layers.1.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([256, 512]).
size mismatch for output_blocks.10.0.emb_layers.1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.10.0.out_layers.0.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.10.0.out_layers.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.10.0.out_layers.3.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for output_blocks.10.0.out_layers.3.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.10.0.skip_connection.weight: copying a param with shape torch.Size([512, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 256, 1, 1]).
size mismatch for output_blocks.10.0.skip_connection.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.11.0.in_layers.0.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.11.0.in_layers.0.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.11.0.in_layers.2.weight: copying a param with shape torch.Size([512, 768, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 256, 3, 3]).
size mismatch for output_blocks.11.0.in_layers.2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.11.0.emb_layers.1.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([256, 512]).
size mismatch for output_blocks.11.0.emb_layers.1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for output_blocks.11.0.out_layers.0.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.11.0.out_layers.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.11.0.out_layers.3.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for output_blocks.11.0.out_layers.3.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for output_blocks.11.0.skip_connection.weight: copying a param with shape torch.Size([512, 768, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 256, 1, 1]).
size mismatch for output_blocks.11.0.skip_connection.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for out.0.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for out.0.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for out.2.weight: copying a param with shape torch.Size([6, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([3, 128, 3, 3]).
size mismatch for out.2.bias: copying a param with shape torch.Size([6]) from checkpoint, the shape in current model is torch.Size([3]).

What is the 'zero_module' used for? [Question]

First of all, thanks for an extraordinary paper - so many interesting details!! Also, thanks for open sourcing the code.

I have a few ideas I want to test, and I'm trying to understand all the parts of the code. Most of it is clear and well commented, but I can't seem to figure out the reasoning behind the 'zero_module' you use in a few places in guided-diffusion/guided_diffusion/unet.py:

def zero_module(module):
    """
    Zero out the parameters of a module and return it.
    """
    for p in module.parameters():
        # detach() so the in-place zero_() is not recorded by autograd
        p.detach().zero_()
    return module

I couldn't find anything in the paper or online to explain why this is used.
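For context, here is a minimal sketch of how the repo applies it (paraphrasing unet.py; the channel sizes here are illustrative): the last convolution of a residual branch is wrapped in zero_module, so at initialization the branch outputs zeros and each block starts out as an identity mapping.

import torch.nn as nn

def zero_module(module):
    for p in module.parameters():
        p.detach().zero_()
    return module

ch = 128
out_layers = nn.Sequential(
    nn.GroupNorm(32, ch),
    nn.SiLU(),
    # zero-initialized: the residual branch contributes nothing at step 0
    zero_module(nn.Conv2d(ch, ch, 3, padding=1)),
)
# the block's forward is then roughly: x + out_layers(h)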

I'm also curious why you implemented custom mixed-precision training instead of using PyTorch's built-in mixed precision (torch.cuda.amp.autocast)?

FID score on the ImageNet-256 dataset

It's really great work! However, on the ImageNet-256 dataset, I computed the FID between 50k images from the validation set and 50k images from the training set and got about 7.6 (using two popular FID implementations, https://github.com/mseitzer/pytorch-fid and https://github.com/tsc2017/Frechet-Inception-Distance, which give very similar results). So I am confused about how the FID score of 4.59 in your paper is achieved. Could you share the code you used to calculate FID? I just want to make sure my way of calculating it is not wrong.
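For reference, the first package mentioned above can be run directly between two image folders (the paths here are placeholders):

python -m pytorch_fid path/to/training_images path/to/validation_images

Also, as far as I can tell, the paper's numbers come from the evaluation pipeline in this repository's evaluations/ directory, which computes FID against fixed reference batches, so some deviation from third-party FID code is expected.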

Latent Space Projection and Interpolation Code

Hi.

Thanks for releasing the code for your interesting paper.
I looked into the repo but couldn't find the latent-space projection and interpolation code behind the Figure 10 results.
Could you please provide it?

Thanks.

Query: Diffusion models vs GANs

@prafullasd @unixpickle
Thank you for your work. While catching up with the research on diffusion models, I stumbled upon a paper arguing for a comeback of GANs, since they require fewer parameters, so scaling models like BigGAN could be a good approach. Which approach should one now adopt to help diffusion models beat GANs again?
reference: StudioGAN (benchmark)
https://arxiv.org/abs/2206.09479

How do I train 128->512 upsampling on a custom dataset?

Hello,
how do I train a 128->512 upsampling model on a custom dataset?
Is this possible on an RTX 3090, and if so, any idea how long it would take?
If anyone knows, please let me know, because training instructions for a custom dataset don't seem to be present.
Thanks.
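In case it helps anyone searching: by analogy with the sampling flags above and the parent repo improved-diffusion's README, a plausible starting point is the super_res_train.py script with --large_size/--small_size. This is an untested sketch; the flag values are assumptions, not an official recipe.

MODEL_FLAGS="--large_size 512 --small_size 128 --class_cond False --learn_sigma True --noise_schedule linear --num_channels 192 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
TRAIN_FLAGS="--lr 1e-4 --batch_size 1 --save_interval 10000"
python scripts/super_res_train.py --data_dir path/to/images $MODEL_FLAGS $TRAIN_FLAGS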

Does max_beta=0.999 in the cosine schedule make any sense?

Hello,

elif schedule_name == "cosine":
    return betas_for_alpha_bar(
        num_diffusion_timesteps,
        lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
    )
else:
    raise NotImplementedError(f"unknown beta schedule: {schedule_name}")

def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):

I get that this is a faithful implementation of the paper https://arxiv.org/pdf/2102.09672.pdf, but I don't understand how max_beta=0.999 could not be a bug.

In my personal loose implementation of this paper, I had to set max_beta = 0.02, which is the end point of the linear schedule, to get working results.

In equation (13) of the Improved Diffusion paper,

mu(x_t, t) = 1/sqrt(alpha[t]) * (x_t - beta[t] / sqrt(1 - alphabar[t]) * eps_theta(x_t, t))

At the start of the reverse diffusion process, when t = T:

x_t is Normal(0, 1),
eps_theta aims to be Normal(0, 1),
beta[T] = clipped value = 0.999,
alpha[T] = 1 - beta[T] = 0.001,
1/sqrt(alpha[T]) ~ 31.6,
alphabar[T] ~ 0, because x_T has forgotten the initial x0,
beta[T] / sqrt(1 - alphabar[T]) ~ 1.

This means that variance(mu(x_t, t)) ~ 30, and therefore the variance of x[t-1] ~ 30.

In equation (9) of the paper, the neural network inputs are trained with:

x_t = sqrt(alphabar[t]) * x0 + sqrt(1 - alphabar[t]) * eps, which has variance roughly ~1.

This means that sampling proceeds from samples with variance ~30, while the network was trained on inputs with variance around 1. Even if the model normalizes its input internally, this skews the predicted variance, so the diffusion process is dominated by the first few steps, because the network will predict a variance ~30 times too small.
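A quick numeric check of the clipping and of the amplification factor discussed above (a standalone sketch paraphrasing betas_for_alpha_bar, not the repo's exact code):

import math

def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
    # derive each beta from the ratio of consecutive alpha_bar values,
    # clipping at max_beta
    betas = []
    for i in range(num_diffusion_timesteps):
        t1 = i / num_diffusion_timesteps
        t2 = (i + 1) / num_diffusion_timesteps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas

alpha_bar = lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
betas = betas_for_alpha_bar(1000, alpha_bar)
print(betas[-3:])                    # the final betas are clipped at 0.999
print(1 / math.sqrt(1 - betas[-1]))  # ~31.6, the amplification in question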

In my personal loose implementation I predict the noise (Ho-style) instead of predicting mu, as you seem to do here, and therefore I am much more sensitive to this bug.

But even when predicting mu directly: if you predict mu correctly, you will leave the training zone during the diffusion process (which you seem to mitigate with (dubious?) clipping), and if you predict it incorrectly because its weight is low (by sheer luck?), it just adds noise to the training process.

In the paper you explain that max_beta should be < 1 to avoid singularities, but can you clarify the reasoning for choosing max_beta=0.999 within the range [0.02, 0.999]?

Thanks

Training doesn't really work

When running

python scripts/image_train.py --data_dir path/to/images $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
the code runs for about 2 seconds and exits without creating the /tmp log folder. There isn't any error either.
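One thing worth checking (an assumption about the setup, not a confirmed diagnosis): the logger writes to a temporary directory unless OPENAI_LOGDIR is set, and unset $MODEL_FLAGS/$DIFFUSION_FLAGS/$TRAIN_FLAGS variables expand to nothing, silently falling back to defaults. Pinning the log directory makes the output location explicit:

export OPENAI_LOGDIR=/path/to/logs
python scripts/image_train.py --data_dir path/to/images $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS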

Training across multiple nodes does not work

If the number of GPUs is greater than 8 (each node has 8 GPUs), then I have to train across several nodes.

In this case, running mpiexec -n 16 python scripts/image_train.py does not work.

It reports an NCCL error.
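A generic first step for NCCL failures (standard NCCL debugging, nothing specific to this repo) is to rerun with NCCL's logging enabled so the failing stage becomes visible:

NCCL_DEBUG=INFO mpiexec -n 16 python scripts/image_train.py ...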

Is there a feature similar to max_to_keep

Is there a feature similar to TensorFlow's max_to_keep to limit how many checkpoints are saved?

It would be nice to be able to limit it to, say, 5, and delete all earlier checkpoints to save space.
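Not that I can find in the repo, but a small hypothetical helper run alongside training could approximate it (the model*.pt pattern and the OPENAI_LOGDIR default are assumptions; you would want to prune the ema*/opt* files the same way):

import glob
import os

def prune_checkpoints(log_dir, keep=5, pattern="model*.pt"):
    # hypothetical helper: delete all but the `keep` newest checkpoints
    ckpts = sorted(glob.glob(os.path.join(log_dir, pattern)), key=os.path.getmtime)
    for path in ckpts[:-keep]:
        os.remove(path)

prune_checkpoints(os.environ.get("OPENAI_LOGDIR", "."), keep=5)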

NaN occurs when training ImageNet 128x128

Hi, @unixpickle

Thanks for your awesome work and open source.

I ran into a NaN issue when training on ImageNet 128x128:

-----------------------------
| lg_loss_scale | -1.62e+04 |
| loss          | nan       |
| loss_q0       | nan       |
| loss_q1       | nan       |
| loss_q2       | nan       |
| loss_q3       | nan       |
| mse           | nan       |
| mse_q0        | nan       |
| mse_q1        | nan       |
| mse_q2        | nan       |
| mse_q3        | nan       |
| samples       | 3.92e+07  |
| step          | 1.53e+05  |
| vb            | nan       |
| vb_q0         | nan       |
| vb_q1         | nan       |
| vb_q2         | nan       |
| vb_q3         | nan       |
-----------------------------
Found NaN, decreased lg_loss_scale to -16199.354
Found NaN, decreased lg_loss_scale to -16200.354
Found NaN, decreased lg_loss_scale to -16201.354
Found NaN, decreased lg_loss_scale to -16202.354
Found NaN, decreased lg_loss_scale to -16203.354
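For context on what these log lines mean, here is a minimal standalone sketch of the dynamic loss scaling (paraphrasing the behavior of fp16_util.py; the tiny model is illustrative): the loss is multiplied by 2**lg_loss_scale before backward, and any non-finite gradient lowers lg_loss_scale by 1, so a value around -1.62e+04 means NaNs kept appearing at every scale, i.e. this is not ordinary fp16 overflow.

import torch

model = torch.nn.Linear(4, 4)
lg_loss_scale = 20.0
loss = model(torch.randn(2, 4)).pow(2).mean()
(loss * 2 ** lg_loss_scale).backward()  # scaled backward pass
overflow = any(not p.grad.isfinite().all() for p in model.parameters())
if overflow:
    lg_loss_scale -= 1.0  # logged as "Found NaN, decreased lg_loss_scale ..."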

I used fp16. Did you run into similar issues?

Thanks in advance.

Assertion error on the `betas` variable when using the 256x256 model with classifier guidance

When I use the recommended command

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 256 --learn_sigma True --noise_schedule linear --num_channels 256 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
python classifier_sample.py $MODEL_FLAGS --classifier_scale 1.0 --classifier_path models/256x256_classifier.pt --model_path models/256x256_diffusion.pt $SAMPLE_FLAGS

There is an assertion error:

Logging to results/openai-2022-04-22-15-28-42-019927
creating model and diffusion...
Traceback (most recent call last):
File "scripts/classifier_sample.py", line 152, in
main()
File "scripts/classifier_sample.py", line 35, in main
model, diffusion = create_model_and_diffusion(
File "/home/lujiahao/research/guided-diffusion-main/guided_diffusion/script_util.py", line 117, in create_model_and_diffusion
diffusion = create_gaussian_diffusion(
File "/home/lujiahao/research/guided-diffusion-main/guided_diffusion/script_util.py", line 407, in create_gaussian_diffusion
return SpacedDiffusion(
File "/home/lujiahao/research/guided-diffusion-main/guided_diffusion/respace.py", line 77, in init
base_diffusion = GaussianDiffusion(**kwargs) # pylint: disable=missing-kwoa
File "/home/lujiahao/research/guided-diffusion-main/guided_diffusion/gaussian_diffusion.py", line 136, in init
assert (betas > 0).all() and (betas <= 1).all()
AssertionError

When I print the betas values, I get:

[0.01 0.23111111 0.45222222 0.67333333 0.89444444 1.11555556
1.33666667 1.55777778 1.77888889 2. ]

How can I solve this?
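For what it's worth, the printed values are exactly what the linear schedule produces when the diffusion is constructed with only 10 timesteps (a sketch paraphrasing get_named_beta_schedule in gaussian_diffusion.py), which suggests the effective --diffusion_steps was 10 rather than 1000:

import numpy as np

def linear_betas(num_diffusion_timesteps):
    # the linear schedule is rescaled to stay equivalent to 1000 steps
    scale = 1000 / num_diffusion_timesteps
    return np.linspace(scale * 0.0001, scale * 0.02, num_diffusion_timesteps)

print(linear_betas(10))
# [0.01 0.23111111 ... 2.0] -- matching the printed values above; any
# beta > 1 trips the `(betas <= 1).all()` assertion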

Unable to train/sample using mpiexec on multiple GPUs

Thanks for providing the code implementation.

I am able to train and use the model on 1 GPU, but I am having issues with multiple GPUs.

I am creating multiple processes using mpiexec as suggested in the repo (I tried mpiexec from both OpenMPI and MPICH and had the same issue).

Issue: for both sampling and training, multiple processes are created and the models load onto the GPUs, but I am not able to sample/train. I see no progress at all (it looks like a deadlock).

A) Below is an example of the command I am running for inference/sampling (as suggested in this repo, openai/guided-diffusion):

mpiexec -n 8 python classifier_sample.py --attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 256 --learn_sigma True --noise_schedule linear --num_channels 256 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --classifier_scale 1.0 --classifier_path "models/256x256_classifier.pt" --model_path "models/256x256_diffusion.pt" --batch_size 1 --num_samples 4 --timestep_respacing 250

Problem A: the program stops at line 93 of classifier_sample.py, i.e. all_images.extend([sample.cpu().numpy() for sample in gathered_samples])

B) Below is an example of the command I am running for training (as suggested in the parent repo, openai/improved-diffusion):

mpiexec -n 8 python image_train.py --data_dir ./data_dir --image_size 256 --class_cond False --learn_sigma True --num_channels 256 --num_res_blocks 2 --num_head_channels 64 --attention_resolutions 32,16,8 --dropout 0.1 --diffusion_steps 1000 --noise_schedule linear --use_checkpoint True --use_scale_shift_norm True --resblock_updown True --use_fp16 True --use_new_attention_order True --lr 1e-4 --batch_size 32

Problem B: the program stops in the TrainLoop init function, where the DistributedDataParallel (DDP) constructor is called, i.e.

self.ddp_model = DDP(
    self.model,
    device_ids=[dist_util.dev()],
    output_device=dist_util.dev(),
    broadcast_buffers=False,
    bucket_cap_mb=128,
    find_unused_parameters=False,
)

I have waited approximately 24 hours to see whether the code would run, but it did not. I have also tried other ways of creating multiple processes, such as python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 and multiprocessing.spawn. They did not work either.

With this issue:

A) If possible, could you please provide the version details of all the dependencies, such as PyTorch, CUDA, cuDNN, Python, OpenMPI/MPICH, mpi4py, and so on? My problems may be due to a dependency version incompatibility.

I also built PyTorch from source with CUDA 11.2 and had the same issues.

B) Do you have any suggestions/insights for training? Did you see any such behavior? Could you please suggest a training strategy for an ablation study?

Below are the dependency versions I am currently using (the issue is reproducible with these versions):

conda 4.10.3
Python 3.9.7
PyTorch 1.9.1 (py3.9_cuda11.1_cudnn8.0.5_0)
cudatoolkit 11.1.74
mpich 3.4.2
mpi4py 3.1.1

I will be happy to provide any other details related to the dependencies I am using.

Why not use "pred_xstart"?

A quick question:

What's the difference between "pred_xstart" and "sample" in the p_sample function?
I found the results from "pred_xstart" are much better than those from "sample".

Thanks.
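For context, a minimal sketch of how the two outputs relate (paraphrasing gaussian_diffusion.p_sample; `diffusion`, `model`, `x`, and `t` are assumed to come from the usual create_model_and_diffusion setup, so this is not self-contained):

import torch as th

# "pred_xstart" is the model's deterministic estimate of the clean image x_0
# given x_t, while "sample" is one stochastic ancestral step x_{t-1}. At large
# t, pred_xstart is a smoothed denoised guess, which can look "better" than
# the still-noisy intermediate sample; the two only agree once t reaches 0.
out = diffusion.p_mean_variance(model, x, t, clip_denoised=True)
noise = th.randn_like(x)
nonzero_mask = (t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
pred_xstart = out["pred_xstart"]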

Figures in Appendix C of the paper

Hello,

Congrats on the great work and thanks for sharing the code!

I wonder how you generated Figure 8 of Appendix C in your paper.

How are the images so similar, even when the scale is reduced so much? Is there any conditioning beyond classifier guidance?

Many thanks!

Missing optimizer checkpoints

I tried to continue training from the published pre-trained models, but this requires the checkpoints of the AdamW optimizer state.

Are you planning on publishing them soon?

Many thanks

Add dimension as input to create_model

Hi, thank you so much for open-sourcing the code. I was planning to use this repo for 3D images, and although the UNet class allows 3D inputs, the "dims" keyword is not passed through the create_model function in script_util.py.
If it is alright with the owners of the repo, I would like to push a change that allows this. It would keep the default dimension at 2 so that, hopefully, no breaking changes result.
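A minimal sketch of the proposed change (a hypothetical diff against script_util.py; the other arguments are elided in comments for brevity, and the call would of course keep them all):

def create_model(image_size, num_channels, num_res_blocks, dims=2, **kwargs):
    # existing arguments and logic unchanged; only `dims` is new, defaulting
    # to 2 so the current 2D behavior is untouched
    return UNetModel(
        # ... existing UNetModel arguments ...
        dims=dims,  # 1, 2, or 3; UNetModel already supports all of these
    )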

What is the minimum amount of VRAM needed to train the 512 or 256 model?

I used a Tesla T4 on Google Colab with batch size 1 but still get a CUDA out-of-memory error.
Is 16 GB of VRAM not enough to train the 512 model?
(I also tried the 256 uncond model with batch size 1 and still ran out of memory.)

These are the flags I used:

MODEL_512_FLAGS = "--attention_resolutions 32,16,8 --class_cond False --image_size 512 --learn_sigma True --num_channels 256 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --resume_checkpoint /content/models/512x512_diffusion_uncond_finetune_008100.pt"
MODEL_256_FLAGS = "--attention_resolutions 32,16,8 --class_cond False --image_size 256 --learn_sigma True --num_channels 256 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --resume_checkpoint /content/drive/MyDrive/models/256x256_diffusion_uncond.pt"
DIFFUSION_FLAGS = "--diffusion_steps 4000 --noise_schedule linear"
TRAIN_FLAGS = "--lr 1e-4 --batch_size 1 --save_interval 10000 --log_interval 1000"

script:

!python scripts/image_train.py --data_dir /content/drive/MyDrive/datasets/animals/animals_10/images $MODEL_512_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
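Two existing flags in this codebase can reduce memory, though whether they make the 512 model fit in 16 GB is untested here: --use_checkpoint True enables gradient checkpointing (trading compute for memory), and --microbatch N splits each batch into smaller forward/backward passes when batch_size > 1. At batch size 1, the checkpointing flag is the main lever, e.g. appended to the model flags above:

MODEL_512_FLAGS="--attention_resolutions 32,16,8 --class_cond False --image_size 512 --learn_sigma True --num_channels 256 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --use_checkpoint True --resume_checkpoint /content/models/512x512_diffusion_uncond_finetune_008100.pt"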

Classifier accuracy

Hi! Thanks for your work on guided-diffusion. Has anyone calculated the accuracy of the classifier on the ImageNet test set?

Resume training does not work for multi-GPU training

I added --resume_checkpoint $path_to_checkpoint$ to continue training; it works on a single GPU but does not work with multiple GPUs.

The code gets stuck here:

Logging to /proj/ihorse_2021/users/x_manni/guided-diffusion/log9
creating model and diffusion...
creating data loader...
start training...
loading model from checkpoint: /proj/ihorse_2021/users/x_manni/guided-diffusion/log9/model200000.pt...
