kakaobrain / magvlt

The official implementation of MAGVLT: Masked Generative Vision-and-Language Transformer (CVPR'23)

License: MIT License

MAGVLT: Masked Generative Vision-and-Language Transformer


The official PyTorch implementation of Masked Generative Vision-and-Language Transformer, CVPR 2023

MAGVLT is a unified, non-autoregressive generative vision-and-language (VL) model. It is trained with (1) three multimodal masked token prediction tasks and two sub-tasks, (2) step-unrolled masked prediction, and (3) MixSel.
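
As a rough illustration of what masked token prediction means in general, the sketch below replaces a random subset of token ids with a mask id and keeps the original ids as prediction targets. It is a generic toy example, not code from this repository: the function name `mask_tokens`, the `MASK_ID` value, and the fixed mask ratio are all assumptions made for the example, and MAGVLT's actual masking schedule may differ.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the mask token (assumption)


def mask_tokens(tokens, ratio, rng):
    """Return (masked_tokens, targets) for one masked-prediction example.

    Masked positions hold MASK_ID in the input. Targets keep the original id
    at those positions and None elsewhere (positions a loss would ignore).
    """
    # Mask at least one token, but never more than the sequence contains.
    n_mask = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_ID if i in positions else t for i, t in enumerate(tokens)]
    targets = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, targets


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is repeatable
    toy_tokens = [101, 57, 892, 14, 333, 7, 650, 42]  # made-up token ids
    masked, targets = mask_tokens(toy_tokens, ratio=0.5, rng=rng)
    print(masked)
    print(targets)
```

A model trained this way sees `masked` as input and is scored only on the positions where `targets` is not `None`.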

Requirements

We have tested our code in the following environment:

PyTorch 1.10.0
Python 3.7.11
Ubuntu 18.04

Run the following command to install the remaining dependencies:

pip install -r requirements.txt
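
Before running anything, it can help to compare the local setup with the tested versions listed above. The following is a minimal sketch, not part of the repository: the helper names `installed_versions` and `report` are hypothetical, and a version mismatch does not necessarily mean the code will fail.

```python
import sys

# Versions the README reports as tested.
TESTED = {"python": "3.7.11", "torch": "1.10.0"}


def installed_versions():
    """Collect the local Python and PyTorch versions (None if torch is missing)."""
    versions = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    try:
        import torch
        versions["torch"] = str(torch.__version__)
    except ImportError:
        versions["torch"] = None
    return versions


def report(versions, tested=TESTED):
    """Return one status line per component, ignoring build suffixes like '+cu113'."""
    lines = []
    for name, want in tested.items():
        have = versions.get(name)
        same = have is not None and have.split("+")[0] == want
        status = "ok" if same else "differs"
        lines.append(f"{name}: installed={have} tested={want} [{status}]")
    return lines


if __name__ == "__main__":
    for line in report(installed_versions()):
        print(line)
```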

Coverage of Released Code

  • Implementation of MAGVLT
  • Pretrained checkpoints of MAGVLT-base and MAGVLT-large
  • Sampling pipelines of MAGVLT:
    • Generate image from text
    • Generate text from image
    • Generate image from text and image (inpainting)
    • Generate text from text and image (infilling)
    • Generate text and image (unconditional generation)
  • Evaluation pipelines of MAGVLT on downstream tasks
  • Training pipeline with data preparation example

Pretrained Checkpoints

MAGVLT uses VQGAN (vqgan_imagenet_f16_16384) as the image encoder, which can be downloaded from this repo.

Model         #Parameters  CIDEr (↑, COCO)  CIDEr (↑, NoCaps)  FID (↓, COCO)
MAGVLT-base   371M         60.4             46.3               12.08
MAGVLT-large  840M         68.1             55.8               10.14

Sampling

We provide the following sampling scripts for text-to-image, image-to-text, and text-guided image inpainting:

python sampling_t2i.py  --prompt="[YOUR PROMPT]" \
                        --config_path=configs/magvlt-it2it-base-sampling.yaml \
                        --model_path=[MAGVLT_MODEL_PATH] \
                        --stage1_model_path=[VQGAN_MODEL_PATH]

python sampling_i2t.py  --source_img_path=[YOUR_IMAGE_PATH] \
                        --config_path=configs/magvlt-it2it-base-sampling.yaml \
                        --model_path=[MAGVLT_MODEL_PATH] \
                        --stage1_model_path=[VQGAN_MODEL_PATH]

python sampling_it2i.py --prompt="[YOUR PROMPT]" \
                        --source_img_path=[YOUR_IMAGE_PATH] \
                        --config_path=configs/magvlt-it2it-base-sampling.yaml \
                        --model_path=[MAGVLT_MODEL_PATH] \
                        --stage1_model_path=[VQGAN_MODEL_PATH]
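
To run the text-to-image command above for several prompts, one option is a small wrapper that assembles the argument list and calls the script once per prompt. This is a sketch rather than part of the repository: `build_t2i_command` and `run_prompts` are hypothetical helpers, the checkpoint paths are placeholders, and only the flags shown above are used.

```python
import subprocess
import sys


def build_t2i_command(prompt, model_path, stage1_model_path,
                      config_path="configs/magvlt-it2it-base-sampling.yaml"):
    """Build the argument list for one sampling_t2i.py call.

    Passing a list (not a shell string) means prompts with spaces or quotes
    need no extra escaping.
    """
    return [
        sys.executable, "sampling_t2i.py",
        f"--prompt={prompt}",
        f"--config_path={config_path}",
        f"--model_path={model_path}",
        f"--stage1_model_path={stage1_model_path}",
    ]


def run_prompts(prompts, model_path, stage1_model_path, dry_run=True):
    """Build one command per prompt; execute them only when dry_run is False."""
    commands = [build_t2i_command(p, model_path, stage1_model_path)
                for p in prompts]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing run
    return commands


if __name__ == "__main__":
    # Placeholder paths; replace with the downloaded checkpoints.
    for cmd in run_prompts(["a red bus on a street", "a bowl of fruit"],
                           "MAGVLT_MODEL_PATH", "VQGAN_MODEL_PATH"):
        print(" ".join(cmd))
```

Each call starts a new process and therefore reloads the checkpoints, so this approach trades speed for simplicity.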

Citation

@InProceedings{Kim_2023_CVPR,
    author    = {Kim, Sungwoong and Jo, Daejin and Lee, Donghoon and Kim, Jongmin},
    title     = {MAGVLT: Masked Generative Vision-and-Language Transformer},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {23338-23348}
}

Contact

Donghoon Lee, [email protected]
Jongmin Kim, [email protected]

License

This project is released under the MIT license.
