microsoft / bridgetower

Open source code for AAAI 2023 Paper "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning"

Home Page: https://arxiv.org/abs/2206.08657

License: MIT License

Python 95.91% Shell 4.09%

bridgetower's Introduction

BridgeTower

This repo is the official PyTorch implementation of "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning".

Updates

Abstract

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.

Architecture

[Figure: BridgeTower model architecture]

Main Results

[Figures: main results on downstream tasks]

Deployment

  • Run setup.sh to set up the environment.
  • [Optional] We use wandb to track experiments! Please remember to run wandb login and paste your token before running the scripts (a Python alternative is sketched below).
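
If you prefer to authenticate from Python instead of the shell, something like the following should also work. This is only a sketch, not part of the repo: it assumes wandb is installed by setup.sh and that the token is available in the WANDB_API_KEY environment variable.

import os

import wandb

# Uses WANDB_API_KEY if it is set; otherwise wandb.login() falls back to an
# interactive prompt for the token.
wandb.login(key=os.environ.get("WANDB_API_KEY"))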

Dataset Preparation

Checkpoints

  • Pre-trained checkpoints on 4M data: BASE and LARGE

  • Fine-tuned checkpoints for downstream tasks

  • Here is an example of downloading a checkpoint.

    # download azcopy
    wget https://aka.ms/downloadazcopy-v10-linux
    tar -xvf downloadazcopy-v10-linux
    sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
    sudo chmod -R 777 /usr/bin/azcopy
    # azcopy copy [remote path] [local path]
    azcopy copy "https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_pt_base.ckpt?sv=2020-10-02&st=2022-11-24T12%3A18%3A49Z&se=2027-11-25T12%3A18%3A00Z&sr=b&sp=r&sig=BJigddAMHfNUtQuTGH8bJUrzAO3LfaeSm48AXUqZngY%3D" "./BridgeTower_pt_base.ckpt"

Pre-training on Image-Text Datasets

# Pre-train BridgeTower Base Model
bash scripts/pre_train.sh
# Pre-train BridgeTower Large Model
bash scripts/pre_train_large.sh

Fine-tuning on Downstream VL Tasks

  • VQAv2 evaluation requires submitting the JSON file in the logs/ directory to the eval.ai evaluation server to get the test-dev and/or test-std scores (the expected submission format is sketched after the commands below).
# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh

# Large Model on VQAv2 without VLP
bash scripts/ftfs_large_vqa.sh

# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh

# Large Model on VQAv2 with VLP
bash scripts/ftfpt_large_vqa.sh

# Base Model on IRTR-Flickr30K with VLP (directly use ITM with multiple false texts)
bash scripts/ftfpt_base_irtr_f30k.sh

# Base Model on IRTR-Flickr30K with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_f30k.sh

# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh

# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh

# Base Model on IRTR-MSCOCO with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_coco.sh
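
The eval.ai submission file is a JSON list of question_id/answer records. A minimal sketch of writing one (field names follow the standard VQA results format; the predictions and the output path are purely illustrative):

import json

# Hypothetical predictions: question_id -> predicted answer string.
predictions = {262148000: "yes", 262148001: "2"}

# eval.ai expects a list of {"question_id": int, "answer": str} records.
submission = [{"question_id": qid, "answer": ans} for qid, ans in predictions.items()]

with open("logs/vqa_submit.json", "w") as f:  # illustrative path
    json.dump(submission, f)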

Fine-tuning on Uni-Modal Tasks

# Base Model on CIFAR with VLP
bash scripts/ftfpt_base_cifar.sh

# Base Model on GLUE with VLP
bash scripts/ftfpt_base_glue.sh

Citation

@article{xu2022bridge,
  title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
  author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
  journal={arXiv preprint arXiv:2206.08657},
  year={2022}
}

Acknowledgement

We are highly grateful for the public code of the following papers; our code is partly based on them:

bridgetower's People

Contributors

chenfei-wu, looperxx, microsoftopensource


bridgetower's Issues

A few questions about the implementation of BridgeTower

Thanks for the insightful research; BridgeTower seems to show a promising new way to fuse text and image. We'd like to test this model in our local environment by tweaking the code from METER, but I am not clear on some details of the model.

  1. METER uses an 11-layer encoder in clip16 by default; does BridgeTower follow this setting as well?

  2. In the original CLIP code, the output embedding of each encoder block is not layer-normed (the LayerNorm is instead applied at the start of the block). Does the BridgeLayer take the embeddings before the LayerNorm as input, or should I make sure that layer-normed inputs go into the BridgeLayer block? (A sketch of extracting these intermediate outputs follows this issue.)

  3. The equation in your paper seems to share the linear projection and modal-type embeddings across all cross-modal layers. Am I understanding this right? Do they share the LayerNorm weights too?

  4. There isn't any mention of how Z_0^T and Z_0^V are initialized for the first bridge layer. Should it be V_7 @ W^T + V^(type) (so that x == y)?

Thanks in advance!
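
For concreteness, here is one way to expose the last few intermediate outputs of the CLIP visual transformer with forward hooks. This is an editorial sketch, not the authors' code: the attribute names follow the OpenAI CLIP implementation, and whether to apply ln_post to these outputs is exactly question 2 above.

import torch
import clip  # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

model, preprocess = clip.load("ViT-B/16", device="cpu")
visual = model.visual

# Collect the outputs of the last 6 residual attention blocks via forward hooks.
feats = []
def save_output(_module, _inputs, output):
    feats.append(output)  # CLIP keeps a sequence-first layout: (L, N, D)

handles = [blk.register_forward_hook(save_output)
           for blk in visual.transformer.resblocks[-6:]]

dummy = torch.randn(1, 3, 224, 224)  # in practice, use preprocess(image)
with torch.no_grad():
    visual(dummy.type(model.dtype))

for h in handles:
    h.remove()

# As described in the next issue, each output can then be permuted LND -> NLD
# and passed through visual.ln_post; whether the bridge layers should instead
# consume the un-normalized block outputs is the open question above.
feats = [visual.ln_post(f.permute(1, 0, 2)) for f in feats]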

Pretraining Result of BridgeTower

Hello, I have implemented the BridgeTower architecture according to the paper and this issue, building on the METER GitHub code.

However, I was not able to get results that match the paper. Below are the validation epoch loss curves for BridgeTower (blue) and METER (orange), for MLM and ITM respectively.

[Figures: validation MLM and ITM loss curves for BridgeTower and METER]

The training curves for both models are similar, and even the downstream results on VQAv2 are similar:

Model         VQAv2 test-dev
METER         77.65
BridgeTower   77.64

This is how I implemented BridgeTower (a sketch of the bridge-layer update follows after this list):

  1. For the ImageEncoder (CLIP) and the TextEncoder (RoBERTa), change forward() so that it returns the last 6 intermediate outputs instead of only the last one, so we have [V0, V1, V2, V3, V4, V5] and [T0, T1, T2, T3, T4, T5].
  2. For CLIP, these intermediate outputs are permuted from LND to NLD and normalized with self.ln_post.
  3. The newly added layers are BridgeLayers, which are 12 LayerNorms (6 for each modality).
  4. Starting with $Z^T_0 = Z^V_0 = 0$, compute $\tilde{Z}^V_l = \mathrm{LayerNorm}(Z^V_l + V_l W_V + V_{type})$ and $\tilde{Z}^T_l = \mathrm{LayerNorm}(Z^T_l + T_l W_T + T_{type})$, where the LayerNorm is different for each layer but the projections $W_V$, $W_T$ and the type embeddings $V_{type}$, $T_{type}$ are shared.
  5. Then $Z^V_l, Z^T_l = \mathrm{Encoder}^Z_l(\tilde{Z}^V_l, \tilde{Z}^T_l)$, just as in METER.
  6. The learning rate for the new LayerNorms is 5 times the base learning rate, with no weight decay.
  7. The rest of the hyperparameters are the same as in METER.

Is there anything wrong, or is there anything that I missed in my implementation? Thanks in advance.
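
For reference, a minimal PyTorch sketch of the bridge-layer update described in steps 3-5. This is an illustrative reconstruction, not the repo's actual module: names and dimensions are assumptions, and a plain TransformerEncoderLayer stands in for the METER-style cross-modal layer.

import torch
import torch.nn as nn

class BridgeTowerFusion(nn.Module):
    """Sketch of the bridge-layer update; names and shapes are hypothetical."""

    def __init__(self, num_layers=6, vis_dim=768, txt_dim=768, hidden=768):
        # vis_dim/txt_dim assume clip16 and RoBERTa-base widths; adjust as needed.
        super().__init__()
        # Shared across all layers: linear projections and modal-type embeddings.
        self.proj_v = nn.Linear(vis_dim, hidden)
        self.proj_t = nn.Linear(txt_dim, hidden)
        self.type_v = nn.Parameter(torch.zeros(hidden))
        self.type_t = nn.Parameter(torch.zeros(hidden))
        # Per-layer: one LayerNorm per modality (12 in total for 6 layers).
        self.ln_v = nn.ModuleList([nn.LayerNorm(hidden) for _ in range(num_layers)])
        self.ln_t = nn.ModuleList([nn.LayerNorm(hidden) for _ in range(num_layers)])
        # Stand-in for the cross-modal encoder layers (METER uses co-attention).
        self.cross = nn.ModuleList([
            nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, vis_feats, txt_feats):
        # vis_feats / txt_feats: lists of the last num_layers uni-modal outputs,
        # each (batch, seq_len, dim), already permuted and normalized as above.
        z_v = torch.zeros_like(self.proj_v(vis_feats[0]))  # Z^V_0 = 0
        z_t = torch.zeros_like(self.proj_t(txt_feats[0]))  # Z^T_0 = 0
        for l, (v_l, t_l) in enumerate(zip(vis_feats, txt_feats)):
            # \tilde{Z}^V_l = LayerNorm_l(Z^V_l + V_l W_V + V_type), same for text.
            z_v_in = self.ln_v[l](z_v + self.proj_v(v_l) + self.type_v)
            z_t_in = self.ln_t[l](z_t + self.proj_t(t_l) + self.type_t)
            # The real model runs a cross-modal layer over both streams; here they
            # are concatenated along the sequence axis and re-split for brevity.
            fused = self.cross[l](torch.cat([z_v_in, z_t_in], dim=1))
            z_v, z_t = fused[:, :z_v_in.size(1)], fused[:, z_v_in.size(1):]
        return z_v, z_t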

Processor only accepts 3-channel images

Hello,

I have been trying to fine-tune this model using my own data, which consists of single-channel images and some text describing features of the images.

The processor only seems to handle 3-channel images. To get around this, I have stacked the images onto themselves to fill 3 channels. However, this seems like it will have a significant impact on the amount of data I will train the model on.

Is there currently support for single-channel input in a pre-trained BridgeTower checkpoint? If not, are there plans to include it? Is stacking images the best approach when working with single-channel images for this model?
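
For reference, a minimal sketch of the channel-stacking workaround mentioned above (the file path is illustrative, and this does not speak to whether stacking is the best approach):

import numpy as np
from PIL import Image

# Load a single-channel (grayscale) image; the path is illustrative.
gray = Image.open("example_grayscale.png").convert("L")

# Option 1: let PIL replicate the channel into a 3-channel RGB image.
rgb = gray.convert("RGB")

# Option 2: stack the array explicitly (same pixel values, as a NumPy array).
arr = np.asarray(gray)                           # shape (H, W)
stacked = np.repeat(arr[:, :, None], 3, axis=2)  # shape (H, W, 3)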

about code

Congratulations on the great work! I wonder when the code will be released. Thanks a lot for sharing the code!

Pre-trained visual encoder, 4M pretraining?

Hi, thanks for this valuable work; it's a good insight into fusing vision and text, with impressive results :).

Regarding the initialization of the visual encoder: is a pre-trained CLIP model, trained on 400M images, used for the visual encoder?

It's somewhat confusing because the table states that 4M images are used for pre-training.

