dandelin / ViLT
Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
License: Apache License 2.0
Saving latest checkpoint...
INFO - lightning - Saving latest checkpoint...
ERROR - ViLT - Failed after 1:05:38!
Traceback (most recent calls WITHOUT Sacred internals):
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 524, in train
self.train_loop.run_training_epoch()
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 572, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
self.trainer.hiddens)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 818, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 339, in training_step
training_step_output = self.trainer.accelerator_backend.training_step(args)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in training_step
return self._step(args)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 170, in _step
output = self.trainer.model(*args)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 179, in forward
output = self.module.training_step(*inputs[0], **kwargs[0])
File "/data/workspace/ViLT/vilt/modules/vilt_module.py", line 219, in training_step
vilt_utils.set_task(self)
File "/data/workspace/ViLT/vilt/modules/vilt_utils.py", line 177, in set_task
picked = all_gather(current_tasks)
File "/data/workspace/ViLT/vilt/modules/dist_utils.py", line 165, in all_gather
size_list, tensor = _pad_to_largest_tensor(tensor, group)
File "/data/workspace/ViLT/vilt/modules/dist_utils.py", line 129, in _pad_to_largest_tensor
dist.all_gather(size_list, local_size, group=group)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1870, in all_gather
work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete
During handling of the above exception, another exception occurred:
Traceback (most recent calls WITHOUT Sacred internals):
File "/data/workspace/ViLT/run.py", line 72, in main
trainer.fit(model, datamodule=dm)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
results = self.accelerator_backend.train()
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 305, in ddp_train
results = self.train_or_test()
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
results = self.trainer.train()
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 555, in train
self.train_loop.on_train_end()
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 200, in on_train_end
self.check_checkpoint_callback(should_save=True, is_last=True)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 234, in check_checkpoint_callback
callback.on_validation_end(self.trainer, model)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 203, in on_validation_end
self.save_checkpoint(trainer, pl_module)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 238, in save_checkpoint
self._validate_monitor_key(trainer)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 516, in _validate_monitor_key
raise MisconfigurationException(m)
pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val/the_metric') not found in the returned metrics: ['irtr/train/irtr_loss', 'itm/train/loss', 'itm/train/wpa_loss', 'itm/train/accuracy']. HINT: Did you call self.log('val/the_metric', tensor) in the LightningModule?
Epoch 0: 0%| | 24/9691 [30:14<202:59:08, 75.59s/it, loss=0.579, v_num=0]
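(Note: in the log above the gloo all_gather timed out first; the MisconfigurationException is a secondary failure that happens while saving the latest checkpoint because no 'val/the_metric' value had been logged yet. For what it's worth, a minimal hedged sketch of how such a monitored key is normally logged from a LightningModule; the names below are placeholders, not ViLT's actual code.)

import torch
import pytorch_lightning as pl

class MyLitModule(pl.LightningModule):
    # ... model definition elided ...

    def validation_step(self, batch, batch_idx):
        val_loss = self.compute_loss(batch)  # placeholder for whatever validation metric you track
        return {"val_loss": val_loss}

    def validation_epoch_end(self, outputs):
        the_metric = torch.stack([o["val_loss"] for o in outputs]).mean()
        # This is the key that ModelCheckpoint(monitor="val/the_metric") looks for.
        self.log("val/the_metric", the_metric)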
root
├── images_train
│ ├── 0000 # First four letters of the image name
│ │ ├── 0000000 # Image Binary
│ │ ├── 0000001
│ │ └── ...
│ ├── 0001
│ │ ├── 0001000
│ │ ├── 0001001
│ │ └── ...
Hello, please forgive my stupid question. I don't understand what you mean by "0000 # First four letters of the image name" and "0000000 # Image Binary" in your DATA.md. Can you explain what the "Image Binary" files are and what "first four letters of the image name" refers to? Thanks
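(For illustration only, not the authors' actual script: the layout above could be produced roughly as below, assuming each record gives the image file name and its raw bytes; the "Image Binary" is simply the undecoded JPEG/PNG bytes written to a file named after the image, grouped by the first four characters of that name.)

import os

def dump_image_binaries(records, out_dir="root/images_train"):
    # records: iterable of (image_name, image_bytes) pairs.
    for image_name, image_bytes in records:
        # "0000", "0001", ...: subdirectory named by the first four letters of the image name
        subdir = os.path.join(out_dir, image_name[:4])
        os.makedirs(subdir, exist_ok=True)
        # "0000000", "0000001", ...: the raw binary content of the image, written as-is
        with open(os.path.join(subdir, image_name), "wb") as f:
            f.write(image_bytes)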
Hello,
The link of NLVR2 in DATA.md seems incorrect.
Hello, I didn't use the pre-trained weights you provided and got the following error:
INFO - timm.models.helpers - Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p32_384-830016f5.pth)
Downloading: "https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p32_384-830016f5.pth"
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
Hi, @dandelin
Thank you for your interesting work. I am running demo_vqa.py and it failed at line 38, because the link https://dl.dropboxusercontent.com/s/otya4i5sagt4f5p/vqa_dict.json is unavailable now.
Do you still have this file available for download?
Also, the vilt_200k_mlm_itm.ckpt link is unavailable as well...
Looking forward to your reply!
hi,
thanks for releasing your code!
I am wondering how much time the pre-training took on 64 V100 GPUs.
I am encountering this error:
WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
ERROR - ViLT - Failed after 0:00:12!
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 17, in main
model = ViLTransformerSS(_config)
File "/others/cs16b114/ViLT/vilt/modules/vilt_module.py", line 61, in init
ckpt = torch.load(self.hparams.config["load_path"], map_location="cpu")
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 527, in load
with _open_zipfile_reader(f) as opened_zipfile:
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 224, in init
super(_open_zipfile_reader, self).init(torch.C.PyTorchFileReader(name_or_buffer))
RuntimeError: version <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb11cc1a193 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7fafcd29cafb in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7fafcd29dd14 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x6c6296 (0x7fb0ad870296 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: + 0x2957d4 (0x7fb0ad43f7d4 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #6: python() [0x4d8067]
frame #8: python() [0x58f850]
frame #10: python() [0x54aa51]
frame #14: python() [0x58f6a3]
frame #19: python() [0x54a880]
frame #23: python() [0x58f6a3]
frame #30: python() [0x5db3e4]
frame #32: + 0x431b (0x7fb0e0b9531b in /usr/local/lib/python3.7/dist-packages/wrapt/_wrappers.cpython-37m-x86_64-linux-gnu.so)
frame #37: python() [0x59412b]
frame #42: python() [0x54a880]
frame #51: python() [0x6308e2]
frame #54: python() [0x65450e]
frame #56: __libc_start_main + 0xe7 (0x7fb12285abf7 in /lib/x86_64-linux-gnu/libc.so.6)
Hi @dandelin , thanks for this great repo and work! Could you please say what COCO split was used for pre-training? (was it 2014, 2017, Karpathy, or something else?) Thanks!
Hi, I am very interested in your work! I am wondering why you use 15 texts as negative samples instead of 1 text during fine-tuning. And what do you think about training the model from scratch using only the Flickr30K dataset?
In image retrieval, my R@1 is 68.4, which is higher than the 61.9 in the paper.
In text retrieval, my R@1 is 73.5, which is lower than the 81.4 in the paper.
So I want to check whether the input format in my code is wrong.
For images, I use the pixelbert_transform function with size=384. For text, I use the BERT base tokenizer with max length 40, which includes [CLS] and the word tokens but no [SEP]. For Flickr30K, I use dataset_flickr30k.json to get the test set, and I chose the first of the five captions for each image.
Thanks very much for your help!
Hi, Thanks for your fantastic work
I've been trying to reproduce the fine-tuning results from the pre-trained model. It takes me only about 6 hours to fine-tune on the VQAv2 dataset, but when I try to fine-tune on the COCO dataset, it takes about 20 hours to run only 1 epoch. I wonder if this is the expected result.
Also, I find that after finishing one epoch of fine-tuning on the COCO dataset, the model is not saved automatically. Is there a problem with my settings?
My experiment was run on 8 V100 GPUs.
Hello,
I have read through your code, but haven't run it yet. One question about the dataloader implementation. According to
https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L43
you load all the arrow files into memory. The pre-training data are hundreds of gigabytes. Is it possible that this may cause out-of-memory issues? Or does this implementation assume a machine with large memory?
Thanks,
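(For what it's worth, my reading of the linked base_dataset.py is that the Arrow files are memory-mapped rather than copied into RAM, roughly as in the sketch below; the file and column names are assumptions for illustration.)

import pyarrow as pa

# Memory-map the .arrow file; read_all() then builds a Table backed by the mapped
# buffers, so hundreds of GB on disk do not have to fit in RAM, although the OS
# page cache still grows as rows are actually touched.
source = pa.memory_map("/path/to/coco_caption_karpathy_train.arrow", "r")
table = pa.ipc.open_file(source).read_all()

print(table.num_rows)
caption = table["caption"][0].as_py()  # assumed column name
print(caption)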
Command
$PYTHONBIN run.py with data_root=dataset \
num_gpus=1 num_nodes=1 task_finetune_vqa \
per_gpu_batchsize=64 load_path="weights/vilt_200k_mlm_itm.ckpt"
And the error is in trainer.fit()
File "/data/home/lyuchen/miniconda/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 184, in setu[52/1882]
g
self.trainer.checkpoint_connector.restore_weights(model)
File "/data/home/lyuchen/miniconda/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 63,
in restore_weights
self.hpc_load(checkpoint_path, self.trainer.on_gpu)
File "/data/home/lyuchen/miniconda/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 336
, in hpc_load
self.restore_model_state(model, checkpoint)
File "/data/home/lyuchen/miniconda/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 119
, in restore_model_state
model.load_state_dict(checkpoint['state_dict'])
File "/data/home/lyuchen/miniconda/envs/vilt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ViLTransformerSS:
Missing key(s) in state_dict: "vqa_classifier.0.weight", "vqa_classifier.0.bias", "vqa_classifier.1.weight", "vqa_classifier.1.bias", "vqa_classifier.3.weight", "vqa_classifier.3.bias".
Unexpected key(s) in state_dict: "mlm_score.bias", "mlm_score.transform.dense.weight", "mlm_score.transform.dense.bias", "mlm_score.transform.LayerNorm.weight", "mlm_score.transform.LayerNorm.bias", "mlm_score.decoder.weight", "itm_score.fc.weight", "itm_score.fc.bias".
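(A hedged workaround sketch, not an official fix: as far as I can tell, vilt_module.py itself loads load_path checkpoints with strict=False, which keeps the freshly initialized task head and ignores the pre-training heads listed above. The snippet below only illustrates that behavior; _config is assumed to be the Sacred config for the fine-tuning task.)

import torch
from vilt.modules import ViLTransformerSS

model = ViLTransformerSS(_config)  # assumed: already built with the VQA fine-tuning config

ckpt = torch.load("weights/vilt_200k_mlm_itm.ckpt", map_location="cpu")
state_dict = ckpt["state_dict"]

# strict=False: vqa_classifier.* stays randomly initialized, mlm_score.* / itm_score.* are ignored.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing)
print("unexpected:", unexpected)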
Hello and so happy to see you use Pytorch-Lightning! 🎉
Just wondering if you have already heard about the new PyTorch Lightning (PL) ecosystem CI, which we would like to invite you to. You can check out our blog post about it: Stay Ahead of Breaking Changes with the New Lightning Ecosystem CI ⚡
As you use the PL framework for your cool project, we would like to enhance your experience and offer you safe updates with our future releases. At the moment you run tests against a particular PL version, but it may accidentally happen that the next version is incompatible with your project... 😕 We do not intend to change anything on our project side, but we have a solution: an ecosystem CI that tests both your latest development head and ours, so incompatibilities are found very early and a broken release can be prevented. 👍
What needs to be done?
What will you get?
cc: @Borda
Hi,
When you compute the FLOPs in Table 6 for baseline models such as ViLBERT, do you also include the FLOPs of the feature extraction models?
Hi,
Is it necessary to manually add PyTorch's DistributedSampler to the dataloader?
https://github.com/dandelin/ViLT/blob/master/vilt/datamodules/multitask_datamodule.py#L46
It seems that PyTorch Lightning adds PyTorch's DistributedSampler automatically.
Hello again @dandelin ,
I was trying to reproduce the steps from https://github.com/dandelin/ViLT/blob/master/EVAL.md
to get the results from Flickr30k T2IR.
First I did what is suggested in https://github.com/dandelin/ViLT/blob/master/DATA.md.
So I have the following structure in the folder /content/flickr30k:
/content/flickr30k
├── flickr30k_images
│ ├── ....jpg
| ├── ....jpg
├── karpathy
├── dataset_flickr30k.json
Then I do the transformation:
from vilt.utils.write_f30k_karpathy import make_arrow
make_arrow( '/content/flickr30k', '/content/arrow')
But when I run:
python run.py with data_root='/content/arrow' num_gpus=1 num_nodes=1 per_gpu_batchsize=4 task_finetune_irtr_f30k_randaug test_only=True load_path="/content/TFM_Sparse_Embeddings/vilt_irtr_f30k.ckpt"
I get the error:
ERROR - ViLT - Failed after 0:00:06!
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 73, in main
trainer.test(model, datamodule=dm)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 755, in test
results = self.__test_given_model(model, test_dataloaders)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 820, in __test_given_model
results = self.fit(model)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
results = self.accelerator_backend.train()
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 305, in ddp_train
results = self.train_or_test()
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 67, in train_or_test
results = self.trainer.run_test()
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 662, in run_test
eval_loop_results, _ = self.run_evaluation(test_mode=True)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 566, in run_evaluation
dataloaders, max_batches = self.evaluation_loop.get_evaluation_dataloaders(max_batches)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 56, in get_evaluation_dataloaders
self.trainer.reset_test_dataloader(model)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/data_loading.py", line 299, in reset_test_dataloader
self._reset_eval_dataloader(model, 'test')
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/data_loading.py", line 249, in _reset_eval_dataloader
num_batches = len(dataloader) if has_len(dataloader) else float('inf')
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/data.py", line 33, in has_len
raise ValueError('`Dataloader` returned 0 length.'
ValueError: `Dataloader` returned 0 length. Please make sure that your Dataloader at least returns 1 batch
Hi,
I've been reading the ViLT paper and was impressed by the simplicity, as it only adds text embeddings to a ViT.
As ViT is already available in HuggingFace Transformers, adding ViLT should be relatively easy.
I've currently implemented the model (see here for my current implementation). It includes a conversion script (convert_vilt_original_to_pytorch.py) to convert the weights from this repository (the PyTorch Lightning module) to its HuggingFace counterpart, for all models (base one + the ones with a head on top).
However, I'm facing some issues when performing a forward pass with the original implementation in Google Colab (when just doing pip install -r requirements.txt and running the demo_vqa.py script, you get the following):
Traceback (most recent call last):
File "demo_vqa.py", line 17, in <module>
from vilt.modules import ViLTransformerSS
File "/content/ViLT/vilt/modules/__init__.py", line 1, in <module>
from .vilt_module import ViLTransformerSS
File "/content/ViLT/vilt/modules/vilt_module.py", line 3, in <module>
import pytorch_lightning as pl
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/__init__.py", line 62, in <module>
from pytorch_lightning import metrics
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/__init__.py", line 14, in <module>
from pytorch_lightning.metrics.metric import Metric
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/metric.py", line 23, in <module>
from pytorch_lightning.metrics.utils import _flatten, dim_zero_cat, dim_zero_mean, dim_zero_sum
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/utils.py", line 18, in <module>
from pytorch_lightning.utilities import rank_zero_warn
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/__init__.py", line 24, in <module>
from pytorch_lightning.utilities.apply_func import move_data_to_device
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py", line 25, in <module>
from torchtext.data import Batch
ImportError: cannot import name 'Batch' from 'torchtext.data' (/usr/local/lib/python3.7/dist-packages/torchtext/data/__init__.py)
If you suspect this is an IPython bug, please report it at:
https://github.com/ipython/ipython/issues
or send an email to the mailing list at [email protected]
You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.
Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
%config Application.verbose_crash=True
Upgrading PyTorch Lightning to the latest version also returns an error:
Traceback (most recent call last):
File "demo_vqa.py", line 17, in <module>
from vilt.modules import ViLTransformerSS
File "/content/ViLT/vilt/modules/__init__.py", line 1, in <module>
from .vilt_module import ViLTransformerSS
File "/content/ViLT/vilt/modules/vilt_module.py", line 7, in <module>
from vilt.modules import heads, objectives, vilt_utils
File "/content/ViLT/vilt/modules/vilt_utils.py", line 11, in <module>
from vilt.gadgets.my_metrics import Accuracy, VQAScore, Scalar
File "/content/ViLT/vilt/gadgets/my_metrics.py", line 2, in <module>
from pytorch_lightning.metrics import Metric
ModuleNotFoundError: No module named 'pytorch_lightning.metrics'
This is because PL deprecated the metrics module.
Are you able to provide a simple Colab notebook to perform inference on an image+text pair?
Thanks!
I carefully processed the pre-training datasets, but the loss is unstable during pre-training; it is not monotonically decreasing.
Is this expected? I worry that there are some mistakes in the complicated data preprocessing.
For example:
Epoch 0: 34%|███▍ | 12412/36673 [4:26:06<8:40:09, 1.29s/it, loss=2.58, v_num=0]
Epoch 0: 35%|███▍ | 12665/36673 [4:36:41<8:44:29, 1.31s/it, loss=3.12, v_num=0]
Epoch 0: 39%|███▉ | 14368/36673 [5:43:41<8:53:33, 1.44s/it, loss=2.57, v_num=0]
Epoch 0: 40%|███▉ | 14583/36673 [5:56:04<8:59:23, 1.47s/it, loss=3.36, v_num=0]
Hey @dandelin,
I just want to share the results I reproduced with my own recall implementation. Here is my ViltModel:
from typing import List, Dict
import torch
from transformers import BertTokenizer
from vilt.modules import ViLTransformerSS
class ViltModel(ViLTransformerSS):
    def __init__(
        self,
        config,
        *args,
        **kwargs,
    ):
        super().__init__(config)
        self._config = config
        if torch.cuda.is_available():
            dev = "cuda:0"
        else:
            dev = "cpu"
        self._device = torch.device(dev)
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.eval()

    @property
    def in_cuda(self):
        return next(self.parameters()).is_cuda

    def rank_query_vs_images(self, query: str, images: List):
        rank_scores = []
        encoded_input = self.tokenizer(query, return_tensors='pt')
        input_ids = encoded_input['input_ids'][:, :self._config['max_text_len']]
        mask = encoded_input['attention_mask'][:, :self._config['max_text_len']]
        in_cuda = self.in_cuda
        if in_cuda:
            input_ids = input_ids.to(self._device)
            mask = mask.to(self._device)
        batch = {'text_ids': input_ids, 'text_masks': mask, 'text_labels': None}
        # no masking
        for image in images:
            if in_cuda:
                image = image.to(self._device)
            batch['image'] = [image.unsqueeze(0)]
            score = self.rank_output(self.infer(batch)['cls_feats'])[:, 0]
            rank_scores.append(score.detach().cpu().item())
        return rank_scores
The compute recall method:
def compute_recall():
    import copy
    import time
    from vilt import config
    from vilt.transforms.pixelbert import pixelbert_transform
    from src.dataset.dataset import get_image_data_loader, get_captions_data_loader
    from src.evaluate import evaluate

    # Sacred config is an immutable object, so you need to deepcopy it.
    conf = copy.deepcopy(config.config())
    conf['load_path'] = VILT_BASE_MODEL_LOAD_PATH
    conf['test_only'] = True
    conf['max_text_len'] = 40
    conf['data_root'] = '/hdd/master/tfm/arrow'
    conf['datasets'] = ['f30k']
    conf['batch_size'] = 1
    conf['per_gpu_batchsize'] = 1
    conf['draw_false_image'] = 0
    conf['num_workers'] = 1

    # You need to properly configure loss_names to initialize heads
    # (0.5 means it initializes the head, but ignores the loss during training)
    loss_names = {
        'itm': 0.5,
        'mlm': 0,
        'mpp': 0,
        'vqa': 0,
        'imgcls': 0,
        'nlvr2': 0,
        'irtr': 1,
        'arc': 0,
    }
    conf['loss_names'] = loss_names

    if torch.cuda.is_available():
        dev = 'cuda:0'
    else:
        dev = 'cpu'
    device = torch.device(dev)

    print(f' conf for ViltModel {conf}')
    vilt_model = ViltModel(conf)
    vilt_model.to(device)

    image_dataset = get_image_data_loader(root=DATASET_ROOT_PATH,
                                          split_root=DATASET_SPLIT_ROOT_PATH,
                                          split='test',
                                          transform=pixelbert_transform(384),
                                          batch_size=1)  # loading the images with the pixelBert transformation
    text_dataset = get_captions_data_loader(root=DATASET_ROOT_PATH,
                                            split_root=DATASET_SPLIT_ROOT_PATH,
                                            split='test',
                                            batch_size=1)  # loading the captions

    images = []
    filenames = []
    for filenames_batch, images_batch in image_dataset:
        filenames.extend(filenames_batch)
        images.extend(images_batch)

    retrieved_image_filenames = []
    groundtruth_expected_image_filenames = []
    print(f' number of queries {len(text_dataset)}, against {len(images)}')  # this leads to 5000 captions against 1000 images
    for matching_filename, query in text_dataset:
        filename = matching_filename[0]
        groundtruth_expected_image_filenames.append([filename])
        q = query[0]
        start = time.time()
        scores = vilt_model.rank_query_vs_images(q, images)
        print(f' time to rank a single query {time.time() - start}s')
        retrieved_image_filenames.append([f for _, f in sorted(zip(scores, filenames), reverse=True)])

    evaluate(['recall', 'reciprocal_rank'], retrieved_image_filenames,
             groundtruth_expected_image_filenames,
             [1, 5, 10, 20, 100, 200, 500, None],
             {}, print_results=True)
The obtained results are:
Mean Recall@1 0.7584
Mean Recall@5 0.9554
Mean Recall@10 0.9826
Mean Recall@20 0.9932
Mean Recall@100 0.9998
Mean Recall@200 1.0
Mean Recall@500 1.0
Mean Recall@None 1.0
Mean Reciprocal rank 0.8449181004476226
I know that the results could differ from those in the paper, but this seems like an extremely good result? Is there something I am obviously doing wrong?
Thanks in advance
Hi,
Do you remember your final pre-training ITM and MLM training accuracy? I just want to compare my pre-training performance with yours.
We would really appreciate it if you could share those numbers. Thanks
Thanks for your great code!
In your paper, running the pre-training experiments requires 64 V100 GPUs.
For research purposes, that is too heavy.
If a smaller batch size is used, how much does the performance drop? Can you provide any empirical results?
Thanks for your great code!
I carefully read your paper.
(in your paper) We resize the shorter edge of input images to 384 and limit the longer edge to under 640 while preserving the aspect ratio. This resizing scheme is also used during object detection in other VLP models, but with a larger size of the shorter edge (800). Patch projection of ViLT-B/32 yields 12 × 20 = 240 patches for an image with a resolution of 384*640.
However, I find that "image_size=384" is used for all downstream tasks in this code.
Would this have an effect on the performance of downstream tasks? At the very least, a shorter edge of 800 would greatly increase the sequence length, so a smaller batch size would be needed when using a shorter edge of 800.
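(To make the sequence-length point concrete, a quick patch-count calculation; the 800 x 1333 resolution is only an illustrative stand-in for the region-feature-style resizing, not a setting from this repo.)

def num_patches(h, w, patch=32):
    # ViT-B/32 patch projection: one token per non-overlapping 32x32 patch.
    return (h // patch) * (w // patch)

print(num_patches(384, 640))   # 12 * 20 = 240 patches, as stated in the paper
print(num_patches(800, 1333))  # 25 * 41 = 1025 patches if the shorter edge were 800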
Is this model useful for visual grounding purposes? If so, how should I change it?
Hi authors,
I took the provided pre-trained 200k checkpoint and fine-tuned on Flickr30K. The resulting IR and TR scores are 64.5 and 81.7. The TR score is lower than the one in the paper. My fine-tuning command is
$PYTHONBIN run.py with data_root=vilt_dataset/ \
num_gpus=8 num_nodes=1 task_finetune_irtr_f30k \
per_gpu_batchsize=4 load_path="weights/vilt_200k.ckpt" \
exp_name="f30k/finetune_official"
I also tested the provided vilt_irtr_f30k.ckpt and the results are good, with IR=65.3 and TR=83.5. Can I ask what the process of getting vilt_irtr_f30k.ckpt was?
The training process uses DDP, so I cannot debug with pdb. Could you please offer a version of the code without DDP, or without pytorch_lightning?
I run into an error
File "run.py", line 6, in <module>
from vilt.modules import ViLTransformerSS
File "ViLT/vilt/modules/__init__.py", line 1, in <module>
from .vilt_module import ViLTransformerSS
File "ViLT/vilt/modules/vilt_module.py", line 4, in <module>
import vilt.modules.vision_transformer as vit
AttributeError: module 'vilt' has no attribute 'modules'
when I run the "Evaluate VQAv2" command
hello, @dandelin
I have tracked the code at vilt_module.py
from the training_step function -> set_task function
pl_module.current_tasks = [
    k for k, v in pl_module.hparams.config["loss_names"].items() if v >= 1
]
The ITM task will be enabled only when v >= 1.
However, no matter which pre-training task I choose, the ITM loss weight is set to 0.5 in config.py.
Is there a problem with my understanding?
thank you!
First of all, thanks for the great work.
Can you tell us how long the pre-training took on your machine with 64 V100s?
Thank you in advance
When fine-tuning on F30K IR/TR, I found the loss set to loss_names = _loss_names({'itm': 0.5, 'irtr': 1}).
Wouldn't it be a good idea to use irtr separately here?
I'm very sorry for my stupid question.
The datasets from the websites come as '.tsv' files or similar.
Before building the arrow files, some files such as '.json' are required.
If it is convenient for you, could you share your scripts for downloading the images and converting the tsv files into json?
Sorry to disturb you.
Dear Authors,
Thanks for open-sourcing the code. I tried pre-training for 100k steps and fine-tuning on VQAv2, but my test-dev score is about 65, unlike the 70.8 in the paper.
Here are my pre-training and fine-tuning commands
python run.py with data_root=vilt_dataset/ \
num_gpus=8 num_nodes=8 task_mlm_itm whole_word_masking=True step100k \
per_gpu_batchsize=64 exp_name=pretrain
python run.py with data_root=vilt_dataset/ \
num_gpus=8 num_nodes=1 task_finetune_vqa_randaug \
per_gpu_batchsize=32 load_path="result/pretrain_seed0_from_/version_0/checkpoints/last.ckpt" \
exp_name=vqa_finetune
Generate JSON with
python run.py with data_root=vilt_dataset/ \
num_gpus=4 num_nodes=1 task_finetune_vqa \
per_gpu_batchsize=256 load_path="result/vqa_finetune_seed0_from_last/version_0/checkpoints/last.ckpt" \
test_only=True exp_name="test_vqa"
Hello,
I am trying to fine-tune ViLT on the VQAv2 task. I created the arrow_root directory as instructed, and then ran:
python run.py with data_root=<PROJECT_DIR>/arrow_root/vqav2/ num_gpus=1 num_nodes=1 task_finetune_vqa per_gpu_batchsize=64 load_path="weights/vilt_200k_mlm_itm.ckpt"
However, once the model begins training, I get the following error:
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 71, in main
trainer.fit(model, datamodule=dm)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
self._run(model)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
self.dispatch()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
self.accelerator.start_training(self)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
return self.run_train()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
self.train_loop.run_training_epoch()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 493, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 711, in run_training_batch
split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 817, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 304, in training_step
closure_loss = training_step_output.minimize / self.trainer.accumulate_grad_batches
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
I printed the value of training_step_output right before the error: {'extra': {}, 'minimize': None}. I am not too familiar with PyTorch Lightning, but this doesn't seem to be the correct output.
Am I missing any steps here, apart from creating the arrow data and running the model?
Thanks for your great code!
In your paper, running the pre-training experiments requires 64 V100 GPUs.
How long did the training take on 64 V100 GPUs?
Thank you!
Hi, I'm really impressed with your work and it's helping me a lot!
But there is an error in vilt/modules/vision_transformer.py when mask_it == True.
The error occurs because self.mask_token, used at line 553 of vision_transformer.py, is not initialized.
So I wonder, what does self.mask_token mean?
Thanks.
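(For reference, other masked-patch-prediction implementations, e.g. BEiT/SimMIM-style code, define a learnable mask token roughly as in the sketch below. I am not certain this is what the authors intended for mask_it=True, so treat the names and usage as assumptions.)

import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_

class PatchMasker(nn.Module):
    # Sketch only: a learnable mask token that replaces the embeddings of masked patches.
    def __init__(self, embed_dim=768):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        trunc_normal_(self.mask_token, std=0.02)

    def forward(self, x, mask):
        # x: (B, N, D) patch embeddings; mask: (B, N) with 1 where a patch is masked.
        mask = mask.unsqueeze(-1).type_as(x)
        return x * (1 - mask) + self.mask_token.expand(x.shape[0], x.shape[1], -1) * mask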
I encounter this when I pre-train with COCO:
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
INFO - timm.models.helpers - Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p32_384-830016f5.pth)
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank ().
INFO - lightning - Using environment variable NODE_RANK for node rank ().
ERROR - ViLT - Failed after 0:00:06!
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 67, in main
val_check_interval=_config["val_check_interval"],
File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
return fn(self, **kwargs)
File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 359, in init
deterministic,
File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 127, in on_trainer_init
self.trainer.node_rank = self.determine_ddp_node_rank()
File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 415, in determine_ddp_node_rank
return int(rank)
ValueError: invalid literal for int() with base 10: ''
Hi, @dandelin
When I turn on the mpp (Masked Patch Prediction), I get this error:
AttributeError: 'VisionTransformer' object has no attribute 'mask_token'
The above error appears in vision_transformer.py. Could you please tell me how to address it?
Thank you for your help.
Best regards,
Ge-Peng.
Hello @dandelin,
I am trying to understand the interface expected by ViLTransformerSS.
As far as I can see, the infer signature is as follows:
def infer(
    self,
    batch,
    mask_text=False,
    mask_image=False,
    image_token_type_idx=1,
    image_embeds=None,
    image_masks=None,
):
If my understanding is correct, if I want to do retrieval and let ViLT compute the embeddings before entering the co-attention layers, I should leave the default values untouched.
But as for the batch parameter, what is its exact format? Reading the code, it seems that batch is a dictionary of lists with the following keys, which I would like to clarify:
text_ids = batch[f"text_ids"] (I guess this is the ids after tokenization, what is the type of this?)
text_labels = batch[f"text_labels"] (What if I do not have a label? Is it okey to be None?)
text_masks = batch[f"text_masks"] (Is this okey if it is None?)
img = batch["image"][0] (I guess this is the image (but in what format and with what preprocessing)?
Another thing I observed is that this inference method seems to work with a single image at a time, so I guess it takes a single text and a single image at a time. Is there an inference mode where this can be run with a real batch larger than 1?
Also the output of the function does not seem so clear to me:
ret = {
    "text_feats": text_feats,
    "image_feats": image_feats,
    "cls_feats": cls_feats,
    "raw_cls_feats": x[:, 0],
    "image_labels": image_labels,
    "image_masks": image_masks,
    "text_labels": text_labels,
    "text_ids": text_ids,
    "text_masks": text_masks,
    "patch_index": patch_index,
}
Which of these keys can be considered the similarity metric? I guess cls_feats, or what should I look for?
Thank you very much in advance
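(In case it helps others, here is my current understanding as a hedged sketch of building a batch of size 1 for infer; the tokenizer usage, transform size, and the rank_output head are my assumptions based on the configs, demo.py, and the ViltModel snippet earlier in this thread, not something confirmed by the authors.)

import torch
from PIL import Image
from transformers import BertTokenizer
from vilt.transforms.pixelbert import pixelbert_transform

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(["a dog playing in the snow"], return_tensors="pt")
img = pixelbert_transform(size=384)(Image.open("example.jpg").convert("RGB"))  # example image path

batch = {
    "text_ids": encoded["input_ids"],         # token ids after tokenization, shape (1, L)
    "text_masks": encoded["attention_mask"],  # 1 for real tokens, 0 for padding
    "text_labels": None,                      # only needed for MLM; None appears fine for inference
    "image": [img.unsqueeze(0)],              # list containing one (1, 3, H, W) tensor
}

with torch.no_grad():
    out = model.infer(batch)  # assumed: `model` is a loaded ViLTransformerSS in eval mode
score = model.rank_output(out["cls_feats"])[:, 0]  # irtr/ITM rank head on the pooled cls_feats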
export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm whole_word_masking=True step200k per_gpu_batchsize=<BS_FITS_YOUR_GPU>
ex)
python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_mlm_itm whole_word_masking=True step200k per_gpu_batchsize=64
How do I set up these environment variables and commands, and what other operations need to be done?
Hi,
Thank you for releasing the code!
I recently wanted to try reproducing the result of fine-tuning on VQAv2.
The result generated by the provided vilt_vqa.ckpt (ViLT-B/32 200k finetuned on VQAv2) is [{"test-dev": {"yes/no": 87.44, "number": 50.2, "other": 62.38, "overall": 71.32}}].
Then I used "num_gpus=4 num_nodes=1 task_finetune_vqa_randaug per_gpu_batchsize=16 load_path=".../vilt_200k_mlm_itm.ckpt"" to fine-tune on VQAv2 with my GPUs and uploaded the predictions to VQA Challenge 2021 for evaluation. The result was [{"test-dev": {"yes/no": 85.28, "number": 46.63, "other": 58.87, "overall": 68.36}}].
Should I expect the same result with the same settings?
Thanks for your paper and code, they help me a lot.
There is a small point that confuses me. In Section 3.1 of your paper, the text embedding consists of a word embedding, a position embedding, and a modal-type embedding,
while in the source code of vilt/modules/vilt_module.py, the text embedding is implemented by:
from transformers.models.bert.modeling_bert import BertConfig, BertEmbeddings
...
self.text_embeddings = BertEmbeddings(bert_config)
and an extra token_type embedding
self.token_type_embeddings = nn.Embedding(2, config["hidden_size"])
As far as I know, BertEmbeddings() already contains a token-type embedding operation inside, so there are actually two token-type embeddings applied to the text input and one token-type embedding applied to the image input.
I know that self.token_type_embeddings is used as the modal-type embedding to distinguish between image and text.
Is this a mistake? Is it okay not to remove the token-type embedding inside BertEmbeddings(bert_config)? Will it cause any difference?
Hoping for your reply, thanks!
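(For what it's worth, my reading of vilt_module.py is roughly the sketch below; treat it as my interpretation rather than the authors' statement. BertEmbeddings is called without explicit token_type_ids, so its internal token-type embedding only adds a constant type-0 vector to every text token, while the separate nn.Embedding(2, hidden_size) is what actually distinguishes the two modalities.)

import torch
import torch.nn as nn
from transformers.models.bert.modeling_bert import BertConfig, BertEmbeddings

bert_config = BertConfig(vocab_size=30522, hidden_size=768, max_position_embeddings=40)
text_embeddings = BertEmbeddings(bert_config)   # word + position + (all-zero) token-type embeddings
token_type_embeddings = nn.Embedding(2, 768)    # modality embedding: index 0 = text, index 1 = image

text_ids = torch.randint(0, 30522, (1, 40))
text_embeds = text_embeddings(text_ids)  # token_type_ids default to zeros inside BertEmbeddings
text_embeds = text_embeds + token_type_embeddings(torch.zeros(1, 40, dtype=torch.long))
# image embeddings would instead get token_type_embeddings(torch.ones(...)) added.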
Can anyone put together an image-captioning demo with ViLT, please? I'm a newbie in the NLP field <3
When I try to train the model, there are some problems with the dataloader. I get many errors such as
'Error while read file idx 433 in conceptual_caption_val_0 -> cannot identify image file <_io.BytesIO object at 0x7f36766d9bd0>'.
Many images cannot be loaded and I don't know why. Do you have any suggestions? Or can you share the scripts for downloading the GCC and SBU datasets? Thank you very much! :)
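(A possible workaround, my own rather than the authors': validate every downloaded image with PIL before it goes into the arrow files and drop the ones that cannot be decoded, for example:)

import io
from PIL import Image

def is_valid_image(binary: bytes) -> bool:
    # Returns True only if PIL can identify the bytes; truncated GCC/SBU downloads return False.
    try:
        Image.open(io.BytesIO(binary)).verify()
        return True
    except Exception:
        return False

# e.g. filter the (name, bytes) records before running the write_* arrow scripts:
# records = [(name, b) for name, b in records if is_valid_image(b)]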
Can you please explain what the Masked Patch Prediction (MPP) objective means? I tried reading the paper but did not find any useful information.
Hi, thank you for the great work! Could you upload a license for this repo?
Hi, I find that self.mask_token has not been defined, so I can't run the pre-training code. Can you tell me how to solve this?
Thanks!
Hi,
I want to produce visualizations for VQA like the ones you show in demo.py.
Can you please help with what needs to be changed in demo.py to make it work for VQA?
For instance, I see loss_names has a value of 0.5 for 'mlm'. What value should be used for VQA?
Looking forward to your help.
Best,
a
I read your impressive paper and am now trying to reproduce your algorithm.
I have a similar issue to [(https://github.com//issues/4)] in the pre-training step.
I read your comments on the previous issue (and the fixed version of the code), but the problem still occurs for me.
(I checked that no other processes occupy the GPUs, and there is no problem when only one GPU is used.)
I think the main reason is a difference in the running environment.
I'm not familiar with the PyTorch distributed package, so it is hard for me to fix this issue.
Could you give me some suggestions?
My running environment is as follows:
GPU: 2 x ( Quadro RTX 6000)
Cudnn: 450.57
CUDA 10.2
python: 3.7.4
Other packages are the same as in your requirements.txt
Thanks,
Error log:
#####################################################
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 74, in main
trainer.fit(model, datamodule=dm)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
results = self.accelerator_backend.train()
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 305, in ddp_train
results = self.train_or_test()
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
results = self.trainer.train()
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 524, in train
self.train_loop.run_training_epoch()
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 572, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
self.trainer.hiddens)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 818, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 339, in training_step
training_step_output = self.trainer.accelerator_backend.training_step(args)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in training_step
return self._step(args)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 170, in _step
output = self.trainer.model(*args)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 179, in forward
output = self.module.training_step(*inputs[0], **kwargs[0])
File "/home/byun/VLP/VLP_MS/ViLT/vilt/modules/vilt_module.py", line 219, in training_step
vilt_utils.set_task(self)
File "/home/byun/VLP/VLP_MS/ViLT/vilt/modules/vilt_utils.py", line 179, in set_task
picked = all_gather(current_tasks)
File "/home/byun/VLP/VLP_MS/ViLT/vilt/modules/dist_utils.py", line 169, in all_gather
size_list, tensor = _pad_to_largest_tensor(tensor, group)
File "/home/byun/VLP/VLP_MS/ViLT/vilt/modules/dist_utils.py", line 133, in _pad_to_largest_tensor
dist.all_gather(size_list, local_size, group=group)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1870, in all_gather
work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete
Hi, it seems the value for key 125 is missing from the vqa_dict.json file:
"124": "car", "126": "cargo",
so there are only 3128 labels for VQA instead of 3129.
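(A quick way to verify this, assuming vqa_dict.json maps stringified label indices to answer strings; the file name is taken from the issue above.)

import json

with open("vqa_dict.json") as f:
    label2ans = json.load(f)

keys = sorted(int(k) for k in label2ans)
missing = sorted(set(range(keys[-1] + 1)) - set(keys))
print(len(label2ans), "labels; missing indices:", missing)  # expected here: 3128 labels, [125]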