foundationvision / llamagen

Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation

Home Page: https://arxiv.org/abs/2406.06525

License: MIT License

Python 99.21% Shell 0.79%
auto-regressive-model diffusion diffusion-models image-generation llama llm text2image

llamagen's Introduction

Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation

demo  arXiv  project page 

This repo contains pre-trained model weights and training/sampling PyTorch (torch>=2.1.0) code used in

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan
HKU, ByteDance

You can find more visualizations on the project page.

🔥 Update

  • [2024.06.28] Image tokenizers and AR models for text-conditional image generation are released! Try it!
  • [2024.06.15] All models ranging from 100M to 3B parameters are supported by vLLM!
  • [2024.06.11] Image tokenizers and AR models for class-conditional image generation are released!
  • [2024.06.11] Code and demo are released!

🌿 Introduction

We introduce LlamaGen, a new family of image generation models that apply the original next-token prediction paradigm of large language models to the visual generation domain. It is an affirmative answer to the question of whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals, can achieve state-of-the-art image generation performance if scaled properly. We reexamine the design spaces of image tokenizers, the scalability properties of image generation models, and their training data quality.
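For readers new to the paradigm, here is a minimal, hypothetical sketch (not this repo's API) of what class-conditional next-token prediction over image tokens amounts to: a VQ tokenizer turns an image into a flattened grid of discrete codes, and a Llama-style decoder is trained with plain cross-entropy on the shifted sequence.

import torch
import torch.nn.functional as F

def ar_training_loss(transformer, class_ids, image_tokens):
    # class_ids: (B, 1) condition tokens; image_tokens: (B, 256) codes from a 16x16 VQ grid.
    inputs = torch.cat([class_ids, image_tokens[:, :-1]], dim=1)   # shift right by one position
    logits = transformer(inputs)                                   # (B, 256, codebook_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), image_tokens.reshape(-1))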

In this repo, we release:

  • Two image tokenizers with downsample ratios of 16 and 8 (see the token-count arithmetic after this list).
  • Seven class-conditional generation models ranging from 100M to 3B parameters.
  • Two text-conditional generation models of 700M parameters.
  • Online demos in Hugging Face Spaces for running the pre-trained models.
  • Support for the vLLM serving framework, enabling a 300%-400% speedup.
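As a quick sanity check on the grid sizes listed in the tables below, a downsample ratio r maps an HxW image to an (H/r)x(W/r) grid of discrete tokens:

for image_size, ratio in [(256, 16), (384, 16), (512, 16), (256, 8)]:
    side = image_size // ratio
    print(f"{image_size}px / ds{ratio} -> {side}x{side} = {side * side} tokens")
# 256px/ds16 -> 16x16, 384px/ds16 -> 24x24, 512px/ds16 -> 32x32, 256px/ds8 -> 32x32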

🦄 Class-conditional image generation on ImageNet

VQ-VAE models

Method | params | tokens | rFID (256x256) | weight
vq_ds16_c2i | 72M | 16x16 | 2.19 | vq_ds16_c2i.pt
vq_ds16_c2i | 72M | 24x24 | 0.94 | above
vq_ds16_c2i | 72M | 32x32 | 0.70 | above
vq_ds8_c2i | 70M | 32x32 | 0.59 | vq_ds8_c2i.pt

AR models

Method | params | training | tokens | FID (256x256) | weight
LlamaGen-B | 111M | DDP | 16x16 | 5.46 | c2i_B_256.pt
LlamaGen-B | 111M | DDP | 24x24 | 6.09 | c2i_B_384.pt
LlamaGen-L | 343M | DDP | 16x16 | 3.80 | c2i_L_256.pt
LlamaGen-L | 343M | DDP | 24x24 | 3.07 | c2i_L_384.pt
LlamaGen-XL | 775M | DDP | 24x24 | 2.62 | c2i_XL_384.pt
LlamaGen-XXL | 1.4B | FSDP | 24x24 | 2.34 | c2i_XXL_384.pt
LlamaGen-3B | 3.1B | FSDP | 24x24 | 2.18 | c2i_3B_384.pt

Demo

Please download the models, put them in the folder ./pretrained_models, and run:

python3 autoregressive/sample/sample_c2i.py --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_L_384.pt --gpt-model GPT-L --image-size 384
# or
python3 autoregressive/sample/sample_c2i.py --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_XXL_384.pt --gpt-model GPT-XXL --from-fsdp --image-size 384

The generated images will be saved to sample_c2i.png.

Gradio Demo

You can use our online Gradio demo on Hugging Face Spaces or run Gradio locally:

python app.py

🚀 Text-conditional image generation

VQ-VAE models

Method | params | tokens | data | weight
vq_ds16_t2i | 72M | 16x16 | LAION COCO (50M) + internal data (10M) | vq_ds16_t2i.pt

AR models

Method | params | tokens | data | weight
LlamaGen-XL | 775M | 16x16 | LAION COCO (50M) | t2i_XL_stage1_256.pt
LlamaGen-XL | 775M | 32x32 | internal data (10M) | t2i_XL_stage2_512.pt

Demo

Before running the demo, please refer to the language readme to install the required packages and language models.

Please download the models, put them in the folder ./pretrained_models, and run:

python3 autoregressive/sample/sample_t2i.py --vq-ckpt ./pretrained_models/vq_ds16_t2i.pt --gpt-ckpt ./pretrained_models/t2i_XL_stage1_256.pt --gpt-model GPT-XL --image-size 256
# or
python3 autoregressive/sample/sample_t2i.py --vq-ckpt ./pretrained_models/vq_ds16_t2i.pt --gpt-ckpt ./pretrained_models/t2i_XL_stage2_512.pt --gpt-model GPT-XL --image-size 512

The generated images will be saved to sample_t2i.png.

Local Gradio Demo

⚡ Serving

We use the vLLM serving framework to achieve higher throughput. Please refer to the serving readme to install the required packages, then run:

python3 autoregressive/serve/sample_c2i.py --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_XXL_384.pt --gpt-model GPT-XXL --from-fsdp --image-size 384

The generated images will be saved to sample_c2i_vllm.png.

Getting Started

See Getting Started for installation, training and evaluation.

License

The majority of this project is licensed under the MIT License. Portions of the project are available under the separate licenses of the referenced projects, as detailed in the corresponding files.

BibTeX

@article{sun2024autoregressive,
  title={Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation},
  author={Sun, Peize and Jiang, Yi and Chen, Shoufa and Zhang, Shilong and Peng, Bingyue and Luo, Ping and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2406.06525},
  year={2024}
}

llamagen's People

Contributors

eltociear, jshilong, nielsrogge, peizesun, shoufachen


llamagen's Issues

Difficulty in reproducing results with pre-trained weights

I was trying to run https://github.com/FoundationVision/LlamaGen/blob/main/autoregressive/sample/sample_t2i.py across different seeds and also tried playing around with the parameters. I have been unsuccessful in reproducing similar-looking outputs with the 512 x 512 model, which produces these outputs:

[image]

for these prompts

"A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grassin front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!",
"A blue Porsche 356 parked in front of a yellow brick wall.",
"A photo of an astronaut riding a horse in the forest. There is a river in front of them with water lilies.",
"A map of the United States made out of sushi. It is on a table next to a glass of red wine."

I was wondering if you had any tips on reproducing inference results?

VQ-VAE ckpt optimizer states?

Hello! Thank you for the clean + user friendly codebase!

I'm trying to finetune the VQ-VAE tokenizer and noticed some keys might be missing from the pretrained checkpoint listed on huggingface: "optimizer", "discriminator", and "optimizer_disc". See here:

command:

torchrun --nnodes=1 --nproc_per_node=1 -m tokenizer.tokenizer_image.vq_train --finetune --disc-start 0 --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --dataset imagenet --data-path /home/julian/images --cloud-save-path ./training-save-dir --global-batch-size 8

output:

| distributed init (rank 0): env://
[2024-06-13 09:12:10] Experiment directory created at results_tokenizer_image/000-VQ-16
[2024-06-13 09:12:10] Experiment directory created in cloud at ./training-save-dir/2024-06-13-09-12-10/000-VQ-16/checkpoints
[2024-06-13 09:12:10] Namespace(data_path='/home/julian/images', data_face_path=None, cloud_save_path='./training-save-dir', no_local_save=False, vq_model='VQ-16', vq_ckpt='./pretrained_models/vq_ds16_c2i.pt', finetune=True, ema=False, codebook_size=16384, codebook_embed_dim=8, codebook_l2_norm=True, codebook_weight=1.0, entropy_loss_ratio=0.0, commit_loss_beta=0.25, reconstruction_weight=1.0, reconstruction_loss='l2', perceptual_weight=1.0, disc_weight=0.5, disc_start=0, disc_type='patchgan', disc_loss='hinge', gen_loss='hinge', compile=False, dropout_p=0.0, results_dir='results_tokenizer_image', dataset='imagenet', image_size=256, epochs=40, lr=0.0001, weight_decay=0.05, beta1=0.9, beta2=0.95, max_grad_norm=1.0, global_batch_size=8, global_seed=0, num_workers=16, log_every=100, ckpt_every=5000, gradient_accumulation_steps=1, mixed_precision='bf16', rank=0, world_size=1, gpu=0, dist_url='env://', distributed=True, dist_backend='nccl')
[2024-06-13 09:12:10] Starting rank=0, seed=0, world_size=1.
[2024-06-13 09:12:12] VQ Model Parameters: 71,883,403
loaded pretrained LPIPS loss from /home/julian/LlamaGen/tokenizer/tokenizer_image/cache/vgg.pth
[2024-06-13 09:12:22] Discriminator Parameters: 2,765,633
[2024-06-13 09:12:32] Dataset contains 691,040 images (/home/julian/images)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/home/julian/LlamaGen/tokenizer/tokenizer_image/vq_train.py", line 316, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/julian/LlamaGen/tokenizer/tokenizer_image/vq_train.py", line 146, in main
[rank0]:     optimizer.load_state_dict(checkpoint["optimizer"])
[rank0]: KeyError: 'optimizer'

Should the huggingface ckpts be updated to include these?
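(Not an official fix, but a possible workaround while the released checkpoints carry only model weights: guard each state restore in the resume logic. The surrounding variable names below echo vq_train.py's traceback but are otherwise my guesses, as is the "model" wrapper key.)

checkpoint = torch.load(args.vq_ckpt, map_location="cpu")
# Weight key name is a guess; fall back to the raw dict if there is no wrapper key.
vq_model.load_state_dict(checkpoint.get("model", checkpoint))
# Restore optimizer/discriminator states only when the checkpoint actually has them;
# otherwise keep the freshly initialized ones.
if "optimizer" in checkpoint:
    optimizer.load_state_dict(checkpoint["optimizer"])
if "discriminator" in checkpoint:
    vq_loss.discriminator.load_state_dict(checkpoint["discriminator"])
if "optimizer_disc" in checkpoint:
    optimizer_disc.load_state_dict(checkpoint["optimizer_disc"])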

Thanks again

FID results of GPT-L and GPT-1B on 256*256 images

Hi, thanks for the excellent work. I'm trying to reproduce the results on 256*256 images. The VQGAN model is reproduced successfully, achieving 2.10 rFID. However, the AR part shows a significant performance gap. More specifically, I use 8 A100-80G GPUs to run the following scripts:

bash scripts/autoregressive/train_c2i.sh --cloud-save-path xxx --code-path xxx --gpt-model GPT-L --epochs 50
bash scripts/autoregressive/train_c2i.sh --cloud-save-path xxx --code-path xxx --gpt-model GPT-1B --epochs 50

The training results are as follows

Model | Final Loss | FID | Expected FID
GPT-L | 7.86 | 4.62 | 4.22
GPT-1B | 7.33 | 4.13 | 3.09

Is the final loss reasonable? Do you have any idea what the reason might be?

Thanks!

FID Evaluation not matching paper results for VQ-16 checkpoint

Hi! Thanks for the great repo. I've tried reproducing some of your numbers on ImageNet val (256x256) and specifically the rFID isn't matching, both for your checkpoint and for a tokenizer I've trained with your settings.

With your VQ-16 I get:

PSNR: 20.793026, SSIM: 0.675290 (this matches your paper exactly)

Inception Score: 172.32923889160156
FID: 4.284650117003025
sFID: 5.144700494258814
Precision: 0.73054
Recall: 0.6533

(the model-based evals are systematically worse than the results in your paper)

After re-running your training script and performing eval, I get:

PSNR: 20.625670, SSIM: 0.664614

Inception Score: 174.96481323242188
FID: 4.243560592613846
sFID: 5.425596037757714
Precision: 0.72864
Recall: 0.6552

(very similar to your results)

Given that the PSNR/SSIM match exactly, I believe I'm producing the reconstructions and npz files correctly. For running the evaluator, my command looks like:

python evaluator.py ~/assets/VIRTUAL_imagenet256_labeled.npz VQ-16-flatdataset-size-256-size-256-codebook-size-16384-dim-8-seed-0.npz 

Could you advise where I've gone wrong? I'm just using the OpenAI evaluation code provided in this repository. Thanks!

[Feature] ControlNet support via process similar to PixArt's ControlNet-Transformer

Hi!

I have found your work very interesting and inspiring ever since the first VAR release. However, it would be nice for such a project to support the widely used image-conditional generation in a manner similar to ControlNet.

Transformer architectures, especially autoregressive ones, differ significantly from UNet-based controls; however, successful efforts have been made.

The PixArt project has achieved this effect by making a mirror transformer of half-depth, delivering the processed conditions with zero-initialized linear layers.

https://github.com/PixArt-alpha/PixArt-alpha/blob/a9f400f09614e212f5f4ace0162d0106d10a5ec8/asset/docs/pixart_controlnet.md

You can see some of the results in the PixArt-δ technical report; they look quite promising: https://arxiv.org/pdf/2401.05252

I think it would be useful for the popularity, adoption, and community around the concept.

Issues about the 3B model

Thanks for your fascinating work!

I'm now trying on the 3B model and encountered two issues:

  1. The JSON config of the 3B model is missing. I tried to modify the JSON of the XXL version to match the checkpoint and statistics in the paper, but met another issue;
  2. ValueError: Head size 100 is not supported by PagedAttention. Supported head sizes are: [64, 80, 96, 112, 128, 256]. from xformers.

Mask guidance, inpainting and outpainting

Thanks for the awesome paper. Even the codebase is very easy to use.
Can you please do some initial experiments on mask-guided image generation, inpainting, and outpainting? It would really help the community.

Error in FID evaluation

Hi, I'm running FID evaluation code by following command

bash scripts/autoregressive/sample_c2i.sh --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_B.pt --gpt-model GPT-B --image-size 384 --image-size-eval 256 --cfg-scale 2.0

This code will raise following error

torch._dynamo.exc.Unsupported: dynamic shape operator: aten.repeat_interleave.Tensor                                                                                                                                                     
                                                                                                                                                                                                                                         
from user code:                                                                                                                                                                                                                          
   File "/data1/qlt/LlamaGen/autoregressive/models/gpt.py", line 255, in forward                                                                                                                                                         
    h = x + self.drop_path(self.attention(self.attention_norm(x), freqs_cis, start_pos, mask))                                                                                                                                           
  File "/data1/qlt/LlamaGen/autoregressive/models/gpt.py", line 229, in forward                                                                                                                                                          
    keys = keys.repeat_interleave(self.n_head // self.n_kv_head, dim=1) 

I notice this error is caused by the default args in autoregressive/sample/sample_c2i_ddp.py, where torch compile is set to True by default:

       parser.add_argument("--compile", action='store_true', default=True)

The script works after setting --compile to False. I'm wondering if this is due to my environment or a bug in the code.
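(A possible fix on the user side, not necessarily the intended default: make the flag a genuine opt-in, or expose an explicit off switch. Sketch below, independent of the repo's code.)

import argparse

parser = argparse.ArgumentParser()
# Opt-in: defaults to False, compilation is enabled only when --compile is passed.
parser.add_argument("--compile", action="store_true")
# Alternative (Python >= 3.9): paired --compile / --no-compile flags.
# parser.add_argument("--compile", action=argparse.BooleanOptionalAction, default=False)
args = parser.parse_args([])
print(args.compile)  # False unless --compile is given on the command line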

Train script

Hi. The work you did is really cool and I want to reproduce it. Do you have any plans to release the training scripts?

KeyError: 'optimizer'

@PeizeSun hi, I met the same problem when fine-tuning the t2i model:

bash scripts/tokenizer/train_vq_finetune_continue.sh --cloud-save-path /data/vjuicefs_ai_camera_llm/11170092/00proj/LlamaGen/cloud_save --data-path /data/vjuicefs_ai_camera_llm/public_data/vivo_internal_data/AIPortrait/crop_imgs_ffhqcrop_png/20230703_fourth_fix-1_625 --image-size 256 --vq-model VQ-16 --dataset coco --global-batch-size 32

Does vq_ds16_t2i.pt also need to be updated?

Traceback (most recent call last):
  File "/data/vjuicefs_ai_camera_llm/11170092/00proj/LlamaGen/tokenizer/tokenizer_image/vq_train.py", line 320, in <module>
    main(args)
  File "/data/vjuicefs_ai_camera_llm/11170092/00proj/LlamaGen/tokenizer/tokenizer_image/vq_train.py", line 150, in main
    optimizer.load_state_dict(checkpoint["optimizer"])
KeyError: 'optimizer'
Traceback (most recent call last):
  File "/data/vjuicefs_ai_camera_llm/11170092/00proj/LlamaGen/tokenizer/tokenizer_image/vq_train.py", line 320, in <module>
    main(args)
  File "/data/vjuicefs_ai_camera_llm/11170092/00proj/LlamaGen/tokenizer/tokenizer_image/vq_train.py", line 150, in main
    optimizer.load_state_dict(checkpoint["optimizer"])
KeyError: 'optimizer'

Training Details?

Hi, thank you for your excellent work in open-sourcing the code. I have several questions.

  1. I notice the paper says all models use the same learning rate of 1e-4 (the code agrees) and a batch size of 256. However, #8 suggests otherwise (the lr for the XXL model is 2e-4). Which one is correct?
  2. I notice there are two options for image cropping: crop ranges of 1.1 and 1.05 are both provided in the source code. Which one is used for the main experiments in the paper?

Question about cannot reproduce FID results

Hi, thanks for the great repo. I tried to reproduce the results in the paper with the model weights you provided, but the results are much worse than those in the paper. The commands and reproduced results are shown below:

val:

val.sh:

# !/bin/bash
set -x
export NCCL_P2P_LEVEL=NVL

torchrun \
--nnodes=1 --nproc_per_node=2 --node_rank=0 \
--master_port=12343 \
tokenizer/validation/val_ddp.py \
--data-path /mnt/ShareDB_1TB/datasets/imagenet-1k/val \
"$@"

command:

sh scripts/tokenizer/val.sh

GPT-XL :

command:

bash scripts/autoregressive/sample_c2i.sh --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_XL_384.pt --gpt-model GPT-XL --image-size 384 --image-size-eval 256 --cfg-scale 1.75

python3 evaluations/c2i/evaluator.py mywork/LlamaGen/reconstructions/val_imagenet.npz mywork/LlamaGen_ours/samples/GPT-XL-c2i_XL_384-size-384-size-256-VQ-16-topk-0-topp-1.0-temperature-1.0-cfg-1.75-seed-0.npz

reproduced results:

Inception Score: 245.02481079101562
FID: 3.6842287284328563
sFID: 8.495801038066816
Precision: 0.70964
Recall: 0.57578

We also reproduced the results of GPT-B and GPT-L; the results are similar to #48. I followed the commands you provided as closely as possible in my reproduction, except for the number of GPUs, and I'm curious if the difference in results is due to the number of GPUs. Any assistance would be greatly appreciated!

[Feature] Inpainting script

In your VAR repository I noticed the script for image inpainting, which is a beautiful demonstration with a lot of practical applications; however, I didn't find it in either repo. I'd be very grateful if you could include it in this repo as well.

[images]

RuntimeError: shape '[1, 512, 1, 32, 2]' is invalid for input of size 16448

Hi, thanks for the interesting work.

I'm playing a bit with the code on a simple single-class dataset of 256x256 images, and I've modified basic things (imagenet hardcoded numbers, etc...).

I'm hitting the error above on the rope embedding:

freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2) # (1, seq_len, 1, head_dim//2, 2)

Went chasing the issue, and it seems this is due to a mismatch between the precomputed freqs_cis and the reshaping of the attention vectors. This mismatch appears mostly to be due to the number of augmentations (I went from 10 to 2 while debugging).

If this error rings a bell, I'd appreciate any hint :) I can see how to fix it with a hack (reducing augmentations to none), but I believe something else is wrong; otherwise it wouldn't work at all.
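(A quick arithmetic check, my reading only, not a confirmed diagnosis: the sizes in the error suggest the rotary table was precomputed for fewer positions than the sequence actually fed to attention.)

# The target view (1, 512, 1, 32, 2) needs 512*32*2 elements, but the tensor only has
# 16448, i.e. enough for 16448 // (32*2) = 257 positions (which would be a 16x16 grid
# plus one condition token), so the grid size / condition-token count used when
# precomputing freqs_cis no longer matches the modified dataset's token sequence.
seq_len, head_half = 512, 32
needed = 1 * seq_len * 1 * head_half * 2
have = 16448
print(needed, have, have // (head_half * 2))   # 32768 16448 257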

Thanks!

T2I Data

Hi, do you have any plans to open source some of the t2i data?

Question about text-conditional generation.

Great work!

I would like to know how you fine-tune the image tokenizer before the two-stage training for text conditional generation. Could you please provide some details, such as the image resolution and any other relevant information?

Questions about the discriminator

Following issue #27: I would like to ask about the discriminator. Its logits_real and logits_fake are always similar, and when they are both positive, the discriminator_adv_loss will be negative. Is such a discriminator beneficial to the generator?
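(For reference, a sketch of the standard hinge GAN losses; this is my own reference code, which I believe corresponds to disc_loss='hinge' / gen_loss='hinge' in the training args quoted in an earlier issue. Note that this discriminator loss is non-negative by construction, whereas the hinge generator loss is negative whenever logits_fake is positive.)

import torch
import torch.nn.functional as F

def hinge_d_loss(logits_real, logits_fake):
    # Each ReLU term is >= 0, so this loss never goes negative.
    return 0.5 * (F.relu(1.0 - logits_real).mean() + F.relu(1.0 + logits_fake).mean())

def hinge_g_loss(logits_fake):
    # Negative only when the discriminator scores fakes above zero.
    return -logits_fake.mean()

logits_real = torch.full((8,), -0.13)   # small, nearly equal logits, as in a later issue's log
logits_fake = torch.full((8,), -0.14)
print(hinge_d_loss(logits_real, logits_fake))  # ~1.0 (logged values may additionally fold in disc_weight)
print(hinge_g_loss(logits_fake))               # ~0.14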

Why use a constant lr?

Hi authors! Thanks for your excellent work. I just wonder why you use a constant lr rather than other lr schedulers, which are often preferred.

Looking forward to your reply!

HF model

Hello, any plans to adapt the model so we can import it through Hugging Face?

Unexpected error when replicating experiments

Hi, I have a problem when reproducing experiments according to the readme instructions. Could you give me any clue about the potential reasons?

Using all official code and the official tokenizer and only training a new GPT-B model on 256*256 images, I get the following sampled images, ALL looking like this (texture-style):
[images]

The only difference from the official instructions is that I use 8 40GB A100s instead of 8 80GB A100s. In addition, I extract codes for both the ImageNet train and eval subsets.

I'm basically certain there are no problems with the image code extraction (I verified the codes using check_image_codes.py and found no problems). There should also be no problem with image sampling; I sampled from the officially released model and it is okay.

Specifically, my experiment scripts are:

  1. extract codes for imagenet (both train and eval subsets) using provided image encoder:
    bash scripts/autoregressive/extract_codes_c2i.sh --vq-ckpt ./vq_ds16_c2i.pt --data-path /workspace/home/LargeData/Large/ImageNet/ --code-path ./imagenet_train_code_c2i_flip_ten_crop --ten-crop --crop-range 1.1 --image-size 256
  2. Train GPT-B AR model:
nnodes=1
nproc_per_node=8
node_rank=0
master_addr="127.0.0.1"
master_port=29502

torchrun \
--nnodes=$nnodes --nproc_per_node=$nproc_per_node --node_rank=$node_rank \
--master_addr=$master_addr --master_port=$master_port \
autoregressive/train/train_c2i.py --global-batch-size 256 --cloud-save-path /raid/data/huayu/LlamaGen_models --code-path /raid/data/huayu/imagenet_code_c2i_flip_ten_crop/ --image-size 256 --gpt-model GPT-B

  3. Sample images using the provided scripts.

Thank you for your help!

Text embedding injection

In your code, you simply concatenate the text embedding with the image token embeddings. My question is: why did you choose this instead of cross-attention? Are there any major differences between these two methods?
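(For context, a toy, hypothetical contrast between the two conditioning styles being discussed, not the repo's code: prefix conditioning feeds one concatenated sequence to a single causal decoder, while cross-attention keeps text in a separate stream that image tokens attend to.)

import torch
import torch.nn as nn

text_emb  = torch.randn(1, 120, 1024)   # (B, T_text, D) text encoder outputs
image_emb = torch.randn(1, 256, 1024)   # (B, T_image, D) image token embeddings

# (a) Prefix conditioning: one sequence, a single causal decoder stack attends over both.
prefix_seq = torch.cat([text_emb, image_emb], dim=1)        # (B, 376, D)

# (b) Cross-attention: image tokens query a separate text stream.
cross_attn = nn.MultiheadAttention(embed_dim=1024, num_heads=16, batch_first=True)
out, _ = cross_attn(query=image_emb, key=text_emb, value=text_emb)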

Questions about the results of your experiment.

Hi, I was wondering about the results of your experiments. In Section 3.1, in the codebook size ablation study, you show that the usage increases from 75% to 97% when the size increases from 8192 to 16384. It's very interesting that at a codebook size of 4096 the usage is 100%, at 8192 it drops to 75%, and then it rises to 97% at 16384. Do you have any insights or conclusions on this result?
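(For anyone trying to reproduce the usage numbers, here is how I would measure codebook usage; the paper may compute the statistic over a different window, so treat this as a sketch.)

import torch

def codebook_usage(indices: torch.Tensor, codebook_size: int) -> float:
    # Fraction of codebook entries hit at least once by the given token indices.
    return torch.unique(indices).numel() / codebook_size

indices = torch.randint(0, 16384, (64, 256))   # e.g. a batch of 16x16 token maps
print(codebook_usage(indices, 16384))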

Tokenizer with 4-dim codebook

Thanks for your work! Would you release the .pth of more VQ-Model versions, especially one with a hidden dim of 4? I also wonder how the tokenizer performs with LDM. Thanks for answering!

Training Results

Dear authors,

Thanks for your excellent work in autoregressive image generation!

I tried to reproduce the training of GPT-B-256 following the instructions provided here. The specific command I used is:

torchrun \
--nnodes=1 --nproc_per_node=8 --node_rank=0 \
--master_addr=127.0.0.1 --master_port=26667 \
train_c2i.py --cloud-save-path ckpt/GPT_B --code-path dataset/imagenet_code_c2i_flip_256_ten_crop/ --image-size 256 --gpt-model GPT-B 

However, after training for ~150 epochs on ImageNet-1k, it seems that the generated results are still meaningless:

[screenshot]

My environment is 8xA5000 GPUs, which is different from yours (8xA100). I wonder whether the results are sensitive to such a difference, and whether the problem would be alleviated after full training (300 epochs).

Thanks for your help in advance :)

Lacks Detailed Documentation

I am not able to figure out how to enable text prompts, check how good the images are, or use this in my own projects. There could also be an option to support higher VRAM usage, as well as a mechanism to train ControlNets and LoRA adapters with these models, since there is no Hugging Face support either.

Training cost

Thanks for the amazing work. Could you share the training cost for each model, such as the training GPU hours and the minimum number of GPUs needed?

About evaluation on private dataset

Hello, I'm a little confused about how to make a .npz file from my private dataset, like the one you show in evaluations/c2i/README.md.
Could you give us a simple example?
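(Not an official answer, but a minimal sketch of how I would package samples, assuming the evaluator expects an (N, H, W, 3) uint8 array saved as the first array of the .npz, which is how the ADM/guided-diffusion reference batches are laid out; the folder name is just an example.)

import numpy as np
from PIL import Image
from pathlib import Path

images = []
for path in sorted(Path("my_private_dataset").glob("*.png")):
    img = Image.open(path).convert("RGB").resize((256, 256))
    images.append(np.asarray(img, dtype=np.uint8))

arr = np.stack(images)           # (N, 256, 256, 3), uint8
np.savez("my_dataset.npz", arr)  # stored under the default key 'arr_0'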

T2I VQVAE Training Details

Was the T2I VQVAE checkpoint trained at 256x256 or 512x512 (or both)?

I didn't see it specified in the paper and noticed that the 256x256 quality is very good but the 512x512 quality is a bit worse, and I wanted to know if it was already fine-tuned at 512x512.

Thanks!

Mismatched model weights document

Hi, I'm currently testing the official checkpoints. I found the model names and configs in autoregressive/models/gpt.py:

### text-conditional
def GPT_7B(**kwargs):
    return Transformer(ModelArgs(n_layer=32, n_head=32, dim=4096, **kwargs)) # 6.6B

def GPT_3B(**kwargs):
    return Transformer(ModelArgs(n_layer=24, n_head=32, dim=3200, **kwargs)) # 3.1B

def GPT_1B(**kwargs):
    return Transformer(ModelArgs(n_layer=22, n_head=32, dim=2048, **kwargs)) # 1.2B

### class-conditional
def GPT_XXXL(**kwargs):
    return Transformer(ModelArgs(n_layer=48, n_head=40, dim=2560, **kwargs)) # 3.9B

def GPT_XXL(**kwargs):
    return Transformer(ModelArgs(n_layer=48, n_head=24, dim=1536, **kwargs)) # 1.4B

def GPT_XL(**kwargs):
    return Transformer(ModelArgs(n_layer=36, n_head=20, dim=1280, **kwargs)) # 775M

def GPT_L(**kwargs):
    return Transformer(ModelArgs(n_layer=24, n_head=16, dim=1024, **kwargs)) # 343M

def GPT_B(**kwargs):
    return Transformer(ModelArgs(n_layer=12, n_head=12, dim=768, **kwargs)) # 111M

In Readme.md, I can successfully load c2i_3B_384.pt as GPT_3B from the row "LlamaGen-3B | 3.1B | FSDP | 24x24 | 2.18 | c2i_3B_384.pt". However, GPT_3B is marked as text-conditional in the code above.

FSDP training gets stuck after saving the weights

Thank you for your excellent work, I am very interested in this paper. However, when I try to reproduce the FSDP training, the training gets stuck after saving the weights, even though the GPU utilization seems normal. Could you please let me know what the issue might be?
[image]

Discriminator is not training properly?

Hi Peize, I tried to train VQGAN with your default config, where the discriminator starts training after 20k iterations. However, I notice that logits_real and logits_fake are very close all the time. For example:
Beginning epoch 7... (Generator) rec_loss: 0.0441, perceptual_loss: 0.2432, vq_loss: 0.0107, commit_loss: 0.0027, entropy_loss: -0.0000, codebook_usage: 0.9761, generator_adv_loss: 0.0698, disc_adaptive_weight: 1.0000, disc_weight: 0.5000 (Discriminator) discriminator_adv_loss: 0.4980, disc_weight: 0.5000, logits_real: -0.1318, logits_fake: -0.1396

To be a good discriminator, logits_real should be close to 1 while logits_fake should be close to -1, right? Can you share your training log for these logits?

About ROPE in sample process

Hi, thanks for the interesting work. I want to know why the positional embedding of the text tokens is set to zero during the generation process?
[image]

Cannot Reproduce LlamaGen-B or L numbers using provided models

Hi, thanks for the great repo. I'm trying to reproduce some of your paper's numbers using the provided models. After using val.sh to produce the ImageNet val npz file, my reconstruction FID matches yours perfectly (VQ-16), 2.19. However, after sampling stage 2, my numbers disagree with yours by wide margins, except on Inception Score which matches exactly on GPT-B.

Here are the numbers I get:

GPT-B (16x16):

Inception Score: 193.6122283935547                                                                                   
FID: 6.133130640203092                                                                                               
sFID: 10.453294743799916                                                                                             
Precision: 0.75066                                                                                                   
Recall: 0.44764

GPT-L (16x16):

Inception Score: 288.1690368652344
FID: 6.226299024953619
sFID: 12.101763646433483
Precision: 0.7687
Recall: 0.47984

Commands (GPT-B):

torchrun --nproc_per_node=8 autoregressive/sample/sample_c2i_ddp.py --vq-ckpt /data/home/vkramanuj/assets/vq_ds16_c2i.pt --gpt-ckpt /data/home/vkramanuj/assets/c2i_B_256.pt --gpt-model GPT-B --image-size 256 --sample-dir /tmp/samples_v7_B --cfg-scale=2.0 

python3 evaluations/c2i/evaluator.py \
    /tmp/val_flatdataset.npz /tmp/samples_v7_B/GPT-B-c2i_B_256-size-256-size-256-VQ-16-topk-0-topp-1.0-temperature-1.0-cfg-2.0-seed-0.npz

Commands (GPT-L):

torchrun --nproc_per_node=8 autoregressive/sample/sample_c2i_ddp.py --vq-ckpt /data/home/vkramanuj/assets/vq_ds16_c2i.pt --gpt-ckpt /data/home/vkramanuj/assets/c2i_L_256.pt --gpt-model GPT-L --image-size 256 --sample-dir /tmp/samples_v7_L --cfg-scale=2.0 

python3 evaluations/c2i/evaluator.py \
    /tmp/val_flatdataset.npz /tmp/samples_v7_L/GPT-L-c2i_L_256-size-256-size-256-VQ-16-topk-0-topp-1.0-temperature-1.0-cfg-2.0-seed-0.npz

Any assistance would be greatly appreciated!
