This is the official PyTorch implementation of the paper "Towards Adversarial Attack on Vision-Language Pre-training Models" (ACM Multimedia 2022).
To compute the attack success rate (ASR), first run with `--adv 0` to get the clean accuracy, then run with `--adv 4` to get the adversarial accuracy; ASR = clean accuracy - adversarial accuracy.
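For example, the arithmetic is simply (a minimal sketch; the accuracy values below are placeholders, to be replaced with the numbers printed by your own runs):

```python
# Minimal sketch of the ASR computation. The two accuracies are
# placeholders; substitute the numbers printed by the "--adv 0" and
# "--adv 4" evaluation runs.
clean_acc = 95.0  # accuracy with --adv 0 (no attack)
adv_acc = 40.0    # accuracy with --adv 4 (Co-Attack)

asr = clean_acc - adv_acc  # in percentage points
print(f"ASR = {asr:.1f}")  # -> ASR = 55.0
```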
We have released fine-tuned checkpoints (Baidu, password: iqvp) for the VE task on ALBEF and TCL; these serve as the attacked models in this paper and may also be useful for other studies.
- pytorch 1.10.2
- transformers 4.8.1
- timm 0.4.9
- bert_score 0.3.11
- Dataset json files for downstream tasks [ALBEF github]
- Finetuned checkpoint for ALBEF [ALBEF github]
- Finetuned checkpoint for TCL [TCL github]
`--adv` | Attack Mode |
---|---|
0 | No Attack |
1 | Attack Text |
2 | Attack Image |
3 | Attack Both (vanilla) |
4 | Co-Attack |
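The mapping in this table could be expressed as the dispatch sketched below; `attack_text`, `attack_image`, and `co_attack` are hypothetical identity stubs standing in for the real perturbation routines, not functions from this repository:

```python
# Hypothetical sketch of the dispatch implied by the --adv table above.
# The three stubs are placeholders so the snippet runs on its own.
attack_text = lambda text, model: text
attack_image = lambda image, model: image
co_attack = lambda image, text, model: (image, text)

def run_attack(adv, image, text, model=None):
    if adv == 0:  # No Attack
        return image, text
    if adv == 1:  # Attack Text
        return image, attack_text(text, model)
    if adv == 2:  # Attack Image
        return attack_image(image, model), text
    if adv == 3:  # Attack Both (vanilla): two independent unimodal attacks
        return attack_image(image, model), attack_text(text, model)
    if adv == 4:  # Co-Attack: perturb both modalities jointly
        return co_attack(image, text, model)
    raise ValueError(f"unknown --adv value: {adv}")
```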
When attacking the unimodal embeddings, using `--adv 4` without `--cls` raises an expected error, because the image embedding and the text embedding have different sequence lengths.
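A toy illustration of that shape mismatch (the token counts and embedding dimension below are made up for illustration, not the models' actual values):

```python
import torch

# Full-sequence embeddings have different token counts across modalities.
img_emb = torch.randn(1, 577, 768)  # e.g. ViT tokens: 576 patches + [CLS]
txt_emb = torch.randn(1, 30, 768)   # e.g. 30 text tokens

# Comparing the full sequences fails: 577 != 30 on the token axis.
try:
    torch.nn.functional.mse_loss(img_emb, txt_emb)
except RuntimeError as err:
    print("expected error:", err)

# With --cls, only the [CLS] embeddings are compared, so the shapes agree.
loss = torch.nn.functional.mse_loss(img_emb[:, 0], txt_emb[:, 0])
print(loss)  # a scalar
```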
Download the MSCOCO or Flickr30k datasets from their original websites.
```
# Attack Unimodal Embedding
python RetrievalEval.py --adv 4 --gpu 0 --cls \
    --config configs/Retrieval_flickr.yaml \
    --output_dir output/Retrieval_flickr \
    --checkpoint [Finetuned checkpoint]

# Attack Multimodal Embedding
python RetrievalFusionEval.py ...

# Attack Clip Model
python RetrievalCLIPEval.py --adv 4 --gpu 0 --image_encoder ViT-B/16 ...
```
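For intuition, below is a minimal, generic PGD-style sketch of an image-side embedding attack on CLIP. It assumes OpenAI's `clip` package (`pip install git+https://github.com/openai/CLIP.git`) and is only an illustration, not the Co-Attack implementation behind `RetrievalCLIPEval.py`; `eps`, `alpha`, and `steps` are arbitrary example values.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
model = model.float()  # fp32 weights so gradients flow cleanly

def attack_image(image, text_tokens, eps=8 / 255, alpha=2 / 255, steps=10):
    """Push the image embedding away from the paired text embedding,
    keeping the perturbation inside an L-inf ball of radius eps."""
    txt = model.encode_text(text_tokens).detach()
    txt = txt / txt.norm(dim=-1, keepdim=True)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        img = model.encode_image(image + delta)
        img = img / img.norm(dim=-1, keepdim=True)
        loss = -(img * txt).sum()  # negative cosine similarity
        loss.backward()
        # gradient-ascent step on the loss, projected back into the eps-ball
        # (pixel-range clamping is omitted since preprocess normalizes inputs)
        delta.data = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.grad.zero_()
    return (image + delta).detach()

# usage (illustrative):
#   image = preprocess(Image.open("x.jpg")).unsqueeze(0).to(device)
#   adv = attack_image(image, clip.tokenize(["a photo of a dog"]).to(device))
```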
Download the SNLI-VE dataset from its original website.
```
# Attack Unimodal Embedding
python VEEval.py --adv 4 --gpu 0 --cls \
    --config configs/VE.yaml \
    --output_dir output/VE \
    --checkpoint [Finetuned checkpoint]

# Attack Multimodal Embedding
python VEFusionEval.py ...
```
Download the MSCOCO dataset from its original website.
```
# Attack Unimodal Embedding
python GroundingEval.py --adv 4 --gpu 0 --cls \
    --config configs/Grounding.yaml \
    --output_dir output/Grounding \
    --checkpoint [Finetuned checkpoint]

# Attack Multimodal Embedding
python GroundingFusionEval.py ...
```
```
python visualization.py --adv 4 --gpu 0
```
If you find this code useful for your research, please consider citing:
```
@inproceedings{zhang2022towards,
  title={Towards Adversarial Attack on Vision-Language Pre-training Models},
  author={Zhang, Jiaming and Yi, Qi and Sang, Jitao},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  year={2022}
}
```