
Oscar and VinVL

License: MIT License

Python 100.00%
vision-and-language pre-training image-captioning vqa image-text-search oscar vinvl

oscar's Introduction

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

04/17/2023: Visual instruction tuning with GPT-4 is released! Please check out the multimodal model LLaVA: [Project Page] [Paper] [Demo] [Data] [Model]

04/13/2021: Our Scene Graph Benchmark repo has been released. You are welcome to use its code to extract image features with the VinVL pretrained models.
03/08/2021: Oscar+ pretraining code released; please check the last section of VinVL_MODEL_ZOO.md. All image features and model checkpoints in VinVL have also been released; please check VinVL for details.
01/13/2021: Our new work VinVL proposed Oscar+, an improved version of Oscar, and provides a better object-attribute detection model to extract features for V+L tasks. VinVL achieved SOTA performance on all seven V+L tasks listed here. Please stay tuned for the model and code release.
05/28/2020: Released finetuned models on downstream tasks; please check MODEL_ZOO.md.
05/15/2020: Released pretrained models, datasets, and code for fine-tuning on downstream tasks.

Introduction

This repository contains the source code necessary to reproduce the results presented in the paper Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. We propose a new cross-modal pre-training method, Oscar (Object-Semantics Aligned Pre-training), which leverages object tags detected in images as anchor points to significantly ease the learning of image-text alignments. We pre-train Oscar on a public corpus of 6.5 million text-image pairs and fine-tune it on downstream tasks, setting new state-of-the-art results on six well-established vision-language understanding and generation tasks. For more on this project, see the Microsoft Research Blog post.
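To make the idea concrete, here is a minimal sketch of the Word-Tag-Image input triple that Oscar feeds to a BERT-style encoder. This is an illustration only, not the repository's actual preprocessing; the tokenizer choice and feature shapes are assumptions.

```python
# Sketch of Oscar's (word tokens, object tags, region features) input triple.
# Illustrative only; see the oscar/ package in this repo for the real preprocessing.
import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

caption = "a dog sitting on a couch"
object_tags = ["dog", "couch"]          # detected tags serve as anchor points
region_feats = np.random.rand(2, 2054)  # placeholder: one 2054-d vector per detected region

# Text side: [CLS] caption [SEP] object tags [SEP], encoded as a single sequence.
encoding = tokenizer(caption, " ".join(object_tags), return_tensors="pt")

# The image side (region_feats) is projected and appended after the text tokens
# inside the model, so self-attention can align words, tags, and regions.
print(encoding["input_ids"].shape, region_feats.shape)
```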

Performance

| Task    | t2i  | t2i  | i2t  | i2t  | IC   | IC   | IC    | IC   | NoCaps | NoCaps | VQA      | NLVR2  | GQA      |
|---------|------|------|------|------|------|------|-------|------|--------|--------|----------|--------|----------|
| Metric  | R@1  | R@5  | R@1  | R@5  | B@4  | M    | C     | S    | C      | S      | test-std | test-P | test-std |
| SoTA_S  | 39.2 | 68.0 | 56.6 | 84.5 | 38.9 | 29.2 | 129.8 | 22.4 | 61.5   | 9.2    | 70.92    | 58.80  | 63.17    |
| SoTA_B  | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58  | 12.38  | 73.67    | 79.30  | -        |
| SoTA_L  | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | -      | -      | 74.93    | 81.47  | -        |
| Oscar_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 78.8   | 11.7   | 73.44    | 78.36  | 61.62    |
| Oscar_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | 80.9   | 11.3   | 73.82    | 80.05  | -        |
| VinVL_B | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46  | 13.07  | 76.12    | 83.08  | 64.65    |
| VinVL_L | 58.8 | 83.5 | 75.4 | 92.9 | 41.0 | 31.1 | 140.9 | 25.2 | -      | -      | 76.62    | 83.98  | -        |
| gain    | 1.3  | 0.7  | 1.9  | 0.6  | -0.7 | 0.5  | 0.9   | 0.7  | 5.9    | 0.7    | 1.69     | 2.51   | 1.48     |

t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO.

Download

We have released pre-trained models, datasets, VinVL image features, and the Oscar+ pretraining corpus for downstream tasks. Please check VinVL_DOWNLOAD.md for details.

To download checkpoints for vanilla Oscar, please check DOWNLOAD.md for details.

Installation

Check INSTALL.md for installation instructions.

Model Zoo

Check MODEL_ZOO.md for scripts to run Oscar downstream fine-tuning.

Check VinVL_MODEL_ZOO.md for scripts to run Oscar+ pretraining and downstream fine-tuning.

Citations

Please consider citing the following papers if you use the code:

@article{li2020oscar,
  title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
  author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
  journal={ECCV 2020},
  year={2020}
}

@article{zhang2021vinvl,
  title={VinVL: Making Visual Representations Matter in Vision-Language Models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  journal={CVPR 2021},
  year={2021}
}

License

Oscar is released under the MIT license. See LICENSE for details.

oscar's People

Contributors

chunyuanli, eaidova, pzzhang, xiyinmsu, xjli


oscar's Issues

Question about t2i retrieval task

Hi, thank you very much for open-sourcing the project! I tried to reproduce the text-to-image retrieval task. However, it appears that only the image-to-text retrieval code has been released. May I ask if it is possible to release the text-to-image retrieval code and model as well? Thank you very much for your help!

The result of IR/TR from BERT base without pre-training

Hi there, nice work!

I tried to reproduce the result you provided in Table 3 of the paper, i.e., IR and TR on COCO 1K with the model initialized from BERT base without pre-training.
My results (default setting with all attentions) are far below what you reported:
TR: 0.6820 @ R1, 0.9180 @ R5, 0.9620 @ R10
IR: 0.5676 @ R1, 0.8748 @ R5, 0.9466 @ R10

I followed the script, but only changed --model_name_or_path to 'bert-base-uncased'.

Did I miss something important, or is another set of hyper-parameters needed for fine-tuning without pre-training?

Thank you!

The image of 2D visualization using t-SNE

Hello, I tried to reduce the dimension of the text and image features with t-SNE, but the resulting text and image points do not fall in the same range, and matching text and images do not cluster together. Did you process the features, or apply another dimensionality reduction, before visualization?
Would you mind sharing the code for the 2D t-SNE visualization? Thanks!
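Not an official answer, but one common pitfall is running t-SNE separately on the two modalities, which puts them in unrelated coordinate systems. A minimal sketch that fits a single t-SNE on the concatenation of both feature sets (the feature arrays here are placeholders):

```python
# Joint 2-D t-SNE of text and image features (generic sketch, not the authors' code).
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

text_feats = np.random.rand(200, 768)   # placeholder: pooled text features
image_feats = np.random.rand(200, 768)  # placeholder: pooled image features

# Fit ONE t-SNE on both modalities together so they share the embedding space.
joint = np.concatenate([text_feats, image_feats], axis=0)
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(joint)

n = len(text_feats)
plt.scatter(emb[:n, 0], emb[:n, 1], s=8, label="text")
plt.scatter(emb[n:, 0], emb[n:, 1], s=8, label="image")
plt.legend()
plt.savefig("tsne_text_image.png", dpi=150)
```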

Release the fine-tuned model for Image-Text Retrieval?

May I ask how much time fine-tuning takes for the Image-Text Retrieval task? And is it possible to release the fine-tuned model so we can run inference directly on the COCO dataset? Training with 4 V100s (16 GB) or 8 V100s (32 GB) is rather expensive...

I have a "cannot allocate memory" error!

I got this error:

(oscar) ailab@ailab:~/oscar/Oscar/oscar$ python run_vqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 50 --data_label_type mask --img_feature_type faster_r-cnn --data_dir /media/ailab/jaeyun/oscar/datasets/vqa/2k/ --model_type bert --model_name_or_path /media/ailab/jaeyun/oscar/models/base-vg-labels/ep_107_1192087/ --task_name vqa_text --do_train --do_lower_case --max_seq_length 128 --per_gpu_eval_batch_size 1 --per_gpu_train_batch_size 1 --learning_rate 5e-05 --num_train_epochs 25 --output_dir results --label_file /media/ailab/jaeyun/oscar/datasets/vqa/cache/trainval_ans2label.pkl --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 --drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce --img_feat_format pt --classifier linear --cls_hidden_scale 3 --txt_data_dir /media/ailab/jaeyun/oscar/datasets/vqa/2k/
07/06/2020 12:17:14 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 2, distributed training: False, 16-bits training: False
07/06/2020 12:17:14 - INFO - __main__ - Task Name: vqa_text, #Labels: 3129
07/06/2020 12:17:14 - INFO - transformers.pytorch_transformers.modeling_utils - loading configuration file /media/ailab/jaeyun/oscar/models/base-vg-labels/ep_107_1192087/config.json
07/06/2020 12:17:14 - INFO - transformers.pytorch_transformers.modeling_utils - Model config {
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": "vqa_text",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "img_feature_dim": 2054,
  "img_feature_type": "faster_r-cnn",
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 3129,
  "output_attentions": false,
  "output_hidden_states": false,
  "torchscript": false,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

07/06/2020 12:17:14 - INFO - transformers.pytorch_transformers.tokenization_utils - Model name '/media/ailab/jaeyun/oscar/models/base-vg-labels/ep_107_1192087/' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). Assuming '/media/ailab/jaeyun/oscar/models/base-vg-labels/ep_107_1192087/' is a path or url to a directory containing tokenizer files.
07/06/2020 12:17:14 - INFO - transformers.pytorch_transformers.tokenization_utils - loading file /media/ailab/jaeyun/oscar/models/base-vg-labels/ep_107_1192087/added_tokens.json
07/06/2020 12:17:14 - INFO - transformers.pytorch_transformers.tokenization_utils - loading file /media/ailab/jaeyun/oscar/models/base-vg-labels/ep_107_1192087/special_tokens_map.json
07/06/2020 12:17:14 - INFO - transformers.pytorch_transformers.tokenization_utils - loading file /media/ailab/jaeyun/oscar/models/base-vg-labels/ep_107_1192087/vocab.txt
07/06/2020 12:17:14 - INFO - transformers.pytorch_transformers.modeling_utils - loading weights file /media/ailab/jaeyun/oscar/models/base-vg-labels/ep_107_1192087/pytorch_model.bin
07/06/2020 12:17:15 - INFO - oscar.modeling.modeling_bert - BertImgModel Image Dimension: 2054
07/06/2020 12:17:16 - INFO - transformers.pytorch_transformers.modeling_utils - Weights of ImageBertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
07/06/2020 12:17:16 - INFO - transformers.pytorch_transformers.modeling_utils - Weights from pretrained model not used in ImageBertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
07/06/2020 12:17:17 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, adjust_dp=False, adjust_loss=False, adjust_loss_epoch=-1, cache_dir='', classifier='linear', cls_hidden_scale=3, code_level='top', code_voc=512, config_name='', data_dir='/media/ailab/jaeyun/oscar/datasets/vqa/2k/', data_label_type='mask', device=device(type='cuda'), do_eval=False, do_lower_case=True, do_test=False, do_test_dev=False, do_train=True, do_train_val=False, drop_out=0.3, eval_all_checkpoints=False, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, hard_label=False, img_feat_dir=None, img_feat_format='pt', img_feature_dim=2054, img_feature_type='faster_r-cnn', label2ans_file=None, label_file='/media/ailab/jaeyun/oscar/datasets/vqa/cache/trainval_ans2label.pkl', learning_rate=5e-05, load_fast=False, local_rank=-1, logging_steps=4000, loss_type='bce', max_grad_norm=1.0, max_img_seq_length=50, max_seq_length=128, max_steps=-1, model_name_or_path='/media/ailab/jaeyun/oscar/models/base-vg-labels/ep_107_1192087/', model_type='bert', n_gpu=2, no_cuda=False, num_train_epochs=25.0, output_dir='results', output_mode='classification', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=1, per_gpu_train_batch_size=1, philly=False, save_after_epoch=-1, save_epoch=1, save_steps=-1, scheduler='linear', seed=88, server_ip='', server_port='', task_name='vqa_text', tokenizer_name='', txt_data_dir='/media/ailab/jaeyun/oscar/datasets/vqa/2k/', use_vg=False, use_vg_dev=False, warmup_steps=0, weight_decay=0.05, workers=4)
07/06/2020 12:17:18 - INFO - __main__ - Info: loading val features using 0.13 secs
07/06/2020 12:17:18 - INFO - __main__ - val Data Examples: 10402
07/06/2020 12:17:33 - INFO - __main__ - Info: loading train features using 15.48 secs
07/06/2020 12:17:37 - INFO - __main__ - train Data Examples: 634516
07/06/2020 12:17:37 - INFO - __main__ - ***** Running training *****
07/06/2020 12:17:37 - INFO - __main__ -   Num examples = 634516
07/06/2020 12:17:37 - INFO - __main__ -   Num Epochs = 25
07/06/2020 12:17:37 - INFO - __main__ -   Instantaneous batch size per GPU = 1
07/06/2020 12:17:37 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
07/06/2020 12:17:37 - INFO - __main__ -   Gradient Accumulation steps = 1
07/06/2020 12:17:37 - INFO - __main__ -   Total optimization steps = 7931450
Traceback (most recent call last):
  File "run_vqa.py", line 1222, in <module>
    main()
  File "run_vqa.py", line 1145, in main
    global_step, tr_loss = train(args, train_dataset, eval_dataset, model, tokenizer)
  File "run_vqa.py", line 554, in train
    for step, batch in enumerate(train_dataloader):
  File "/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 278, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 682, in __init__
    w.start()
  File "/home/ailab/anaconda3/envs/oscar/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/ailab/anaconda3/envs/oscar/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ailab/anaconda3/envs/oscar/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/ailab/anaconda3/envs/oscar/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/ailab/anaconda3/envs/oscar/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

I think it is due to a lack of GPU memory.
My GPUs are 1080 Tis, and I use two of them.
Which GPUs do you use?
Thank you!
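For what it's worth, the traceback above ends in os.fork() inside _MultiProcessingDataLoaderIter, so the failure is host RAM being exhausted while spawning DataLoader worker processes (the -j/--workers setting), rather than GPU memory. A generic PyTorch illustration of the knob involved (not the repo's exact loader):

```python
# OSError: [Errno 12] at os.fork() means the host ran out of memory while
# forking DataLoader workers; fewer workers (or 0) reduces the pressure.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))

# num_workers=0 loads batches in the main process and avoids os.fork() entirely;
# each extra worker is another forked copy of the process holding the dataset.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

for features, labels in loader:
    pass  # training step would go here
```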

Object Detector Trained on OID

An object detector trained on OID-V5 was used in your paper. Do you mind sharing this pre-trained object detector?

Thanks!

Generating inputs to Oscar model

Hi Oscar Team,

Thanks for the interesting paper and open-sourcing your model.

On your download page, you mention that images are fed into Oscar through the outputs of a "Faster R-CNN with ResNet-101, using object and attribute annotations from Visual Genome". Have you made this model available too? It would be great if you could give a link to this pre-trained model, as it is necessary to run Oscar on my own images (I'm interested in image captioning and VQA).

I have tried to look for it myself, and the closest thing I could find was the R101-FPN from the Detectron2 model zoo (PyTorch model). However, this was trained on the COCO dataset of object tags, and I understand that the Visual Genome has significantly more labels. So surely this one would fail to produce the image features that Oscar expects?

I'd be grateful if you could let me know if my thinking is correct and if there is a link to the appropriate PyTorch model for generating inputs that Oscar can use.

Thanks in advance!

Generating label.lineidx and feature.lineidx for my own images

Hey guys, great work!! I am trying to run the model on my own images. I followed other issues and was able to generate my own feature.tsv and label.tsv files, but I am not sure how to generate the feature.lineidx and label.lineidx files for my own images. I am not sure if I am missing something; it would be great if you could help me with this issue.

Thanks
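Not an official answer, but .lineidx files in TSV pipelines like this one are usually just the byte offset of the start of each line of the .tsv, one offset per line, so readers can seek to a row directly. Assuming this repo's loader follows that convention (please verify against the TSV utilities in the code), a sketch:

```python
# Write a .lineidx file containing the starting byte offset of every line in a .tsv,
# assuming that is the format expected by the line-indexed TSV reader.
import os

def build_lineidx(tsv_path: str) -> str:
    idx_path = os.path.splitext(tsv_path)[0] + ".lineidx"
    with open(tsv_path, "rb") as fin, open(idx_path, "w") as fout:
        offset = 0
        for line in fin:
            fout.write(f"{offset}\n")
            offset += len(line)
    return idx_path

build_lineidx("feature.tsv")
build_lineidx("label.tsv")
```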

ERROR 404: The specified blob does not exist

When I run the command wget https://biglmdiag.blob.core.windows.net/oscar/datasets/$coco_ir.zip, the following error occurs:
--2020-10-01 19:55:19-- https://biglmdiag.blob.core.windows.net/oscar/datasets/.zip
Resolving biglmdiag.blob.core.windows.net (biglmdiag.blob.core.windows.net)... 52.239.247.100
Connecting to biglmdiag.blob.core.windows.net (biglmdiag.blob.core.windows.net)|52.239.247.100|:443... connected.
HTTP request sent, awaiting response... 404 The specified blob does not exist.
2020-10-01 19:55:20 ERROR 404: The specified blob does not exist..
How can I solve it?

Few questions about the paper.

Our group is currently reviewing your paper. It's awesome :D.

We have a few questions about the model.

  1. Which Faster R-CNN version is used in Oscar? Is it the one from Ross Girshick’s GitHub, or did your group reproduce it?
  2. For image captioning, the paper says the process repeats until the [STOP] token is detected. Is the [STOP] token the same as [SEP] in BERT?
  3. During image captioning fine-tuning, the paper says “We randomly mask out 15% of the caption tokens…”. Are exactly 15% of the caption tokens masked, or does each token have a 15% probability of being masked?

Looking forward to reviewing the source code. :D
Cheers

Pre-training for image captioning

Hello, and congrats on your brilliant work!
I’d like to ask: for image captioning, you mention in the appendix:

we directly fine-tune Oscar for image captioning on COCO without additional pre-training on Conceptual Captions

Does that mean you only use the COCO dataset for pre-training, and not the rest (SBU, Flickr, GQA)? And is the CIDEr score of 1.4 achieved after fine-tuning the COCO-only pre-trained model?

COCO caption pretrained model output results are not good

Hello,

Thank you for your great work!

I used your pretrained model for COCO image captioning. Here is the command I used:

   python oscar/run_captioning.py \
--do_test \
--do_eval \
--test_yaml test.yaml \
--per_gpu_eval_batch_size 64 \
--num_beams 5 \
--max_gen_length 20 \
--eval_model_dir image_caption/Oscarrepo/Oscar/checkpoint-29-132780/

where checkpoint-29-132780 is the uncompressed pretrained COCO model folder. But the outputs are not good.
Some examples are the following:

caption claire libraries libraries libraries libraries libraries robbery libraries libraries libraries libraries libraries libraries libraries librariesletsletslets
caption demanded adoptedrredrred libraries libraries libraries libraries librariessteadsteadsteadsteadsteadstead libraries libraries libraries
caption typing curvature curvature libraries curvature curvature curvature curvature curvature curvature curvature curvature curvature curvature curvature curvature curvature

Did I miss some important steps? Thank you for your help!
Also, where is test.yaml? Thanks.

How to test VQA?

Thanks for your great work!
I want to ask how to test on VQA v2.
Do you upload to the EvalAI website to test, or do you test with your own code?
Can you offer me the script for testing? Thanks a lot!
Looking forward to your reply!

training_args.bin not included in the downloaded base-vg-labels or large-vg-labels models

I am trying to run this project for COCO Captioning.

I downloaded the pretrained base and large vg-models as instructed in the DOWNLOAD.

These were the respective folders:

+-- base-vg-labels
| +-- ep_67_588997
| +-- ep_107_1192087
+-- large-vg-labels
| +-- ep_7_816000
| +-- ep_20_590000
| +-- ep_34_999600
| +-- ep_55_1617000

I tried to get the performance of those checkpoints, but after executing:

python oscar/run_captioning.py \
  --do_test \
  --do_eval \
  --data_dir ../Data/coco_caption \
  --test_yaml test.yaml \
  --per_gpu_eval_batch_size 64 \
  --max_gen_length 20 \
  --num_beams 5 \
  --eval_model_dir ../Models/base-vg-labels/ep_107_1192087

an error occurred pointing out that the file training_args.bin was not found inside the model's directory (base-vg-labels/ep_107_1192087).

I also downloaded the Checkpoint available in the MODEL_ZOO, under Image Captioning on COCO.
This checkpoint corresponds to checkpoint-29-66420, which includes a file training_args.bin.

These are the files included in each folder:

| checkpoint-29-66420     | large-vg-labels/ep_55_1617000 |
|-------------------------|-------------------------------|
| added_tokens.json       | added_tokens.json             |
| config.json             | config.json                   |
| pytorch_model.bin       | pytorch_model.bin             |
| special_tokens_map.json | special_tokens_map.json       |
| training_args.bin       | ???                           |
| vocab.txt               | vocab.txt                     |

It seems that the only file missing is training_args.bin. After fine-tuning the provided checkpoint, the generated checkpoints also include that file. Maybe you forgot to include it in the downloadable models?

Could you please provide those files?

Or am I missing something?

I also noted that the Checkpoint/checkpoint-29-66420 corresponds to training base-vg-labels with cross-entropy loss (deduced from the provided training logs). So I assume its training_args.bin file is probably used across the entire base-vg-labels training. I am now copying the missing file into base-vg-labels/ep_107_1192087 to test its performance. Does that make sense?

Edit:

The performance of base-vg-labels/ep_107_1192087 with the args.bin borrowed from checkpoint-29-66420 was a failure.

 {'SPICE': 0.00043991859734872146, 
  'Bleu_1': 4.759355107382894e-05, 
  'Bleu_2': 7.759555665291071e-13, 
  'Bleu_3': 2.0103762808759086e-15, 
  'Bleu_4': 1.0403605447565029e-16, 
  'ROUGE_L': 6.248437014771456e-05, 
  'CIDEr': 1.3248802263844757e-06}
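For anyone inspecting the same file: in transformers-style fine-tuning scripts like the ones here, training_args.bin is typically just the argparse.Namespace serialized with torch.save. That is an assumption about this repo rather than a confirmed fact, but under it the file can be inspected (or copied) as below; note that copying the args does not change the weights themselves, so a pretraining checkpoint still behaves like one.

```python
# Inspect (or copy) a training_args.bin, assuming it is an argparse.Namespace
# saved with torch.save, as in the standard transformers example scripts.
import torch

args = torch.load("checkpoint-29-66420/training_args.bin")
print(type(args))   # typically argparse.Namespace
print(vars(args))   # the flags the checkpoint was trained with

torch.save(args, "base-vg-labels/ep_107_1192087/training_args.bin")
```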

Azcopy fail

I failed to download your dataset by executing

azcopy copy https://biglmdiag.blob.core.windows.net/oscar/pretrained_models/coco_caption.zip .

coco_ir.zip also cannot be downloaded with azcopy, but the fine-tuned models you released are available through azcopy.

About the image features dimensions

Hello,
Thank you for your great work!
When I extract image features, the dimension of the result is 2048, but the input dimension of your model is 2054. Where does the difference come from?
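Not an official answer, but per the Oscar paper (also quoted in a later issue below), each region feature is the 2048-d detector feature concatenated with a 6-d box position vector, which gives 2054. A sketch of that concatenation follows; the exact 6-d layout (normalized corners plus box width/height) is an assumption, so check it against the released feature files.

```python
# Build a 2054-d region feature: 2048-d visual feature + 6-d box position vector.
# The 6-d layout here (normalized corners plus width/height) is an assumption.
import numpy as np

def region_feature(feat_2048: np.ndarray, box_xyxy, img_w: float, img_h: float) -> np.ndarray:
    x1, y1, x2, y2 = box_xyxy
    pos = np.array([x1 / img_w, y1 / img_h,
                    x2 / img_w, y2 / img_h,
                    (x2 - x1) / img_w, (y2 - y1) / img_h], dtype=np.float32)
    return np.concatenate([feat_2048.astype(np.float32), pos])

feat = region_feature(np.random.rand(2048), (30, 40, 200, 180), img_w=640, img_h=480)
print(feat.shape)  # (2054,)
```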

gcc version

I get the error "command 'gcc' failed with exit status 1" when running the line of INSTALL.md "python setup.py install --cuda_ext --cpp_ext"

What gcc version do you use?

Trained features?

For image captioning on COCO, I am trying to obtain image features from a trained model instead of generating the caption. In DOWNLOAD.md, under Datasets, are the image region features (e.g., train.feature.tsv) extracted before or after training the model on downstream tasks (e.g., image captioning on COCO)? If before, how can I obtain image features from a trained model?
One more question: in MODEL_ZOO.md, under Image Captioning on COCO, is the model checkpoint (checkpoint.zip) already trained and fine-tuned, or do we still need to train with cross-entropy loss and fine-tune with CIDEr optimization?

export INSTALL_DIR=$PWD

I'm a beginner programmer.

I'm following the installation steps, but I don't understand what I should do.

export INSTALL_DIR=$PWD << can i pass this code?

VQA custom dataset

Hi, first of all, thank you for making this work public.

I am quite new to this field, but I would like to use this model for VQA on custom data. I found some .pkl files in your dataset, but I can't find any code associated with the creation of these files.

Would you be so kind as to provide me with that code?
If that is not possible, could you at least tell me how these files were created?

Did you use peteanderson80/bottom-up-attention, as you did for image captioning, or some other public code for extracting image features?

Thank you.
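Not an official answer, but the file name trainval_ans2label.pkl (and the "#Labels: 3129" line in the VQA training log earlier on this page) suggests a pickled dict mapping each answer string to an integer class index. Assuming that format, here is a sketch of how such files could be built for custom data; the annotation layout below is made up for illustration.

```python
# Build answer<->label mappings for a custom VQA-style dataset and pickle them,
# assuming trainval_ans2label.pkl is simply {answer_string: class_index}.
import pickle
from collections import Counter

annotations = [  # hypothetical annotation records
    {"question": "what color is the car", "answer": "red"},
    {"question": "how many dogs", "answer": "2"},
    {"question": "what color is the sky", "answer": "blue"},
]

# Keep the most frequent answers as classes (VQA v2 uses 3129 of them).
counts = Counter(a["answer"] for a in annotations)
ans2label = {ans: idx for idx, (ans, _) in enumerate(counts.most_common())}
label2ans = {idx: ans for ans, idx in ans2label.items()}

with open("trainval_ans2label.pkl", "wb") as f:
    pickle.dump(ans2label, f)
with open("trainval_label2ans.pkl", "wb") as f:
    pickle.dump(label2ans, f)
```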

How long did you train and fine-tune the image-captioning model?

I have downloaded your coco_caption zip and tried to train and fine-tune the model, but it seems to take a long time; I don't know if that is expected. If convenient, could I know how long training and fine-tuning took on your 8 V100s?

The order of tag labels and image features

First, thanks for sharing the code of this nice work!

I have a question about the dataset you provide.
In the case where the number of tag labels and image features is the same, are they mapped one-to-one in the same order, or are they just randomly ordered? In other words, do the first/second/third labels correspond to the first/second/third image features, respectively? If not, can I get the mapping from labels to features?

Thanks :)

Pretrained Model Release

Hi,

In Table 1, the paper says the assembled dataset is used to pre-train the model weights, but among the released pretrained models, both the base and large models use either Visual Genome or Open Images labels. Have you released the models pretrained on the assembled dataset?

I got an error during apex installation

I got this error while installing apex.
I think it is because of the gcc version.
What is your gcc version?

Thank you:)

~~~~~~~~~~~~~~~
csrc/mlp.cpp: In lambda function:
csrc/mlp.cpp:125:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (int i = 0; i < num_layers; i++) {
                       ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:126:54: error: expected primary-expression before ‘>’ token
       w_ptr.push_back(inputs[i + 1].data_ptr<scalar_t>());
                                                      ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:126:56: error: expected primary-expression before ‘)’ token
       w_ptr.push_back(inputs[i + 1].data_ptr<scalar_t>());
                                                        ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:129:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (int i = 0; i < inputs.size(); i++) {
                       ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:130:57: error: expected primary-expression before ‘>’ token
       outputs_ptr.push_back(outputs[i].data_ptr<scalar_t>());
                                                         ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:130:59: error: expected primary-expression before ‘)’ token
       outputs_ptr.push_back(outputs[i].data_ptr<scalar_t>());
                                                           ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:137:44: warning: narrowing conversion of ‘(work_size / sizeof (scalar_t))’ from ‘long unsigned int’ to ‘long int’ inside { } [-Wnarrowing]
     auto work_space = at::empty({work_size / sizeof(scalar_t)}, inputs[0].type());
                                            ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:137:44: warning: narrowing conversion of ‘(work_size / sizeof (scalar_t))’ from ‘long unsigned int’ to ‘long int’ inside { } [-Wnarrowing]
     auto work_space = at::empty({work_size / sizeof(scalar_t)}, inputs[0].type());
                                            ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:140:36: error: expected primary-expression before ‘>’ token
         inputs[0].data_ptr<scalar_t>(),
                                    ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:140:38: error: expected primary-expression before ‘)’ token
         inputs[0].data_ptr<scalar_t>(),
                                      ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:141:43: error: expected primary-expression before ‘>’ token
         fprop_outputs[0].data_ptr<scalar_t>(),
                                           ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:141:45: error: expected primary-expression before ‘)’ token
         fprop_outputs[0].data_ptr<scalar_t>(),
                                             ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:147:46: error: expected primary-expression before ‘>’ token
         grad_o.contiguous().data_ptr<scalar_t>(),
                                              ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:147:48: error: expected primary-expression before ‘)’ token
         grad_o.contiguous().data_ptr<scalar_t>(),
                                                ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:148:43: error: expected primary-expression before ‘>’ token
         fprop_outputs[1].data_ptr<scalar_t>(),
                                           ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:148:45: error: expected primary-expression before ‘)’ token
         fprop_outputs[1].data_ptr<scalar_t>(),
                                             ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:149:37: error: expected primary-expression before ‘>’ token
         work_space.data_ptr<scalar_t>(),
                                     ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
csrc/mlp.cpp:149:39: error: expected primary-expression before ‘)’ token
         work_space.data_ptr<scalar_t>(),
                                       ^
/home/ailab/anaconda3/envs/oscar/lib/python3.7/site-packages/torch/include/ATen/Dispatch.h:12:12: note: in definition of macro ‘AT_PRIVATE_CASE_TYPE’
     return __VA_ARGS__();                          \
            ^
csrc/mlp.cpp:123:3: note: in expansion of macro ‘AT_DISPATCH_FLOATING_TYPES_AND_HALF’
   AT_DISPATCH_FLOATING_TYPES_AND_HALF(inputs[0].type(), "mlp_backward", [&] {
   ^
error: command 'gcc' failed with exit status 1

Fails in INSTALL.md

I cannot successfully run git clone --recursive git@github.com:xjli/Oscar.git; it fails with the following error message:

Submodule 'coco_caption' (git@github.com:LuoweiZhou/coco-caption.git) registered for path 'coco_caption'
Submodule 'transformers' (git@github.com:huggingface/transformers.git) registered for path 'transformers'
Cloning into '/Github/Oscar/coco_caption'...
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:LuoweiZhou/coco-caption.git' into submodule path '/Github/Oscar/coco_caption' failed
Failed to clone 'coco_caption'. Retry scheduled
Cloning into '/Github/Oscar/transformers'...
Permission denied (publickey).
fatal: Could not read from remote repository.

Extracted feature for VQA test-dev set

Thank you for making this excellent work public!
I hope to reproduce your results on the VQA task, but I ran into a problem with the dataset.
I downloaded the VQA dataset following this instruction: https://github.com/microsoft/Oscar/blob/master/DOWNLOAD.md#datasets, and I did not find the Faster R-CNN image features for test-dev. I'm not sure whether something went wrong during my download, or whether this part simply wasn't provided.
If it is not possible to share the Faster R-CNN features for test-dev, could you please provide some code and basic information about how to extract the features myself, so I can reproduce the work correctly? For example:

  • which version of Faster R-CNN was used to extract the features?
  • what is the correct structure for saving these features? (i.e., for each image, how to organize all the ROI features and locations and bind them to the image id or question id)

Really thank you for your kind help!

How can I generate a caption for any (my own) image?

In the coco_caption dataset, the train.yaml file shows that train.img.tsv holds the images, but I couldn't find train.img.tsv.

  1. Where can I find the train (val or test) .img.tsv files?

feature: train.feature.tsv

391895 {"num_boxes": 37, "features": "W6aDPlMKLj6FySc9zdycPyewQj7zsqw/8FjLQE+ABUEspTk+AAAAAEg0Dz8FzHo

  2. Can you explain how you converted the original images into features?

What I want to do is look at image-caption examples, like Fig. 5 in your paper.
  3. How can I generate a caption for any sample image?
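Not an official answer, but the feature row shown above looks like a tab-separated image id plus a JSON payload whose "features" field is a base64-encoded float array with num_boxes rows. A sketch of decoding one row under that assumption (float32 storage is also an assumption, so verify against the TSV utilities in this repo):

```python
# Decode one row of feature.tsv, assuming the "features" field is a
# base64-encoded float32 buffer of shape (num_boxes, feature_dim).
import base64
import json
import numpy as np

def decode_feature_row(tsv_line: str):
    image_id, payload = tsv_line.rstrip("\n").split("\t", 1)
    record = json.loads(payload)
    buf = base64.b64decode(record["features"])
    feats = np.frombuffer(buf, dtype=np.float32).reshape(record["num_boxes"], -1)
    return image_id, feats

with open("train.feature.tsv") as f:
    image_id, feats = decode_feature_row(next(f))
print(image_id, feats.shape)  # e.g. 391895 (37, 2054)
```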

Generating label.tsv and feature.tsv from image

Hi guys, I am trying to generate my own features.tsv and labels.tsv for my dataset, but I am stuck at the following:

  1. I am slightly confused about what exactly these features are. From the "Oscar" paper, I understand that each bounding box has a feature vector of the form (v', z), where v' is P-dimensional (2048) and z is 6-dimensional (position).
     I have difficulty understanding where these 2048 features come from. Initially, I thought they came from the FC layer of Faster R-CNN, but upon checking, the FC layer size in Faster R-CNN is 4096.

  2. The Oscar paper mentions: "Specifically, v and q are generated as follows. Given an image with K regions of objects (normally over-sampled and noisy), Faster R-CNN [28] is used to extract the visual semantics of each region." I am slightly confused about how these K regions are determined. Are these K image regions the bounding boxes output by Faster R-CNN?

I am relatively new to this area. Any help would be appreciated.

Unable to Reproduce the Baseline results for NLVR2 task

We tried to reproduce the baselines for the NLVR2 task. But our result was off by a visible margin.

Hardware Specifications

Graphic Card : GeForce RTX 208
CUDA version : 10.2

Command Given

CUDA_VISIBLE_DEVICES=0 python run_nlvr.py -j 4 --img_feature_dim 2054 --max_img_seq_length 40 --data_dir dataset/nlvr2/ft_corpus --model_type bert --model_name_or_path model/base-vg-labels/ep_107_1192087 --task_name nlvr --do_lower_case --max_seq_length 55 --per_gpu_eval_batch_size 8 --per_gpu_train_batch_size 9 --gradient_accumulation_steps 8 --learning_rate 3e-05 --num_train_epochs 20 --output_dir results2 --img_feature_type faster_r-cnn --data_label_type all --train_data_type all --eval_data_type all --loss_type xe --save_epoch -1 --seed 88 --evaluate_during_training --logging_steps -1 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 10000 --classifier mlp --cls_hidden_scale 3 --num_choice 2 --use_pair

Evaluation Result

[{"epoch": 0, "eval_score": 0.5138928673732455, "best_score": 0.5138928673732455}, {"epoch": 1, "eval_score": 0.624462904611859, "best_score": 0.624462904611859}, {"epoch": 2, "eval_score": 0.6764537381839014, "best_score": 0.6764537381839014}, {"epoch": 3, "eval_score": 0.6975078773990261, "best_score": 0.6975078773990261}, {"epoch": 4, "eval_score": 0.7033801203093669, "best_score": 0.7033801203093669}, {"epoch": 5, "eval_score": 0.7413348610713263, "best_score": 0.7413348610713263}, {"epoch": 6, "eval_score": 0.7463477513606417, "best_score": 0.7463477513606417}, {"epoch": 7, "eval_score": 0.7472071039816671, "best_score": 0.7472071039816671}, {"epoch": 8, "eval_score": 0.7446290461185907, "best_score": 0.7472071039816671}, {"epoch": 9, "eval_score": 0.7464909767974792, "best_score": 0.7472071039816671}, {"epoch": 10, "eval_score": 0.7414780865081638, "best_score": 0.7472071039816671}, {"epoch": 11, "eval_score": 0.7593812661128616, "best_score": 0.7593812661128616}, {"epoch": 12, "eval_score": 0.764394156402177, "best_score": 0.764394156402177}, {"epoch": 13, "eval_score": 0.7691205958178172, "best_score": 0.7691205958178172}, {"epoch": 14, "eval_score": 0.7641077055285018, "best_score": 0.7691205958178172}, {"epoch": 15, "eval_score": 0.7656831853337153, "best_score": 0.7691205958178172}, {"epoch": 16, "eval_score": 0.7593812661128616, "best_score": 0.7691205958178172}, {"epoch": 17, "eval_score": 0.7583786880549985, "best_score": 0.7691205958178172}, {"epoch": 18, "eval_score": 0.7621025494127757, "best_score": 0.7691205958178172}, {"epoch": 19, "eval_score": 0.7653967344600401, "best_score": 0.7691205958178172}]

We get a best score of 0.7691205958178172, while the baseline for this task itself gives 0.7807218562016615.

Another issue we faced was a difference in the total number of parameters: the given code reports 114611714 total parameters, but we observed 114606338.

Thanks in advance!
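As a generic sanity check when comparing parameter counts (plain PyTorch, nothing specific to this repository; the model below is a placeholder):

```python
# Count total and trainable parameters of a PyTorch model, to compare against
# the 114611714 reported in the code versus the 114606338 observed above.
import torch

def count_parameters(model: torch.nn.Module):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

model = torch.nn.Linear(768, 3129)  # placeholder; use the instantiated NLVR2 model here
print(count_parameters(model))
```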

Pretraining Process

Thanks for your great work!
Could you please provide the code for the pretraining process?
What should I follow if I want to pre-train the model on another dataset?
Thanks again!

Oscar+

Hello, are there any papers or code published on Oscar+?

Installation Failure

Failing to clone the repo and its submodules, please help.

$ git clone --recursive git@github.com:microsoft/Oscar.git
Cloning into 'Oscar'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Additionally, when cloning via HTTPS ("https://github.com/microsoft/Oscar.git"), the submodules fail to install, giving the same error:

git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.
