
davidnvq / grit

Stars: 177 | Watchers: 3 | Forks: 27 | Size: 86.18 MB

GRIT: Faster and Better Image-captioning Transformer (ECCV 2022)

Languages: Makefile 0.02%, Python 84.58%, C++ 1.18%, Cuda 11.87%, Jupyter Notebook 2.35%
Topics: image-cap, coco-captions, detr, eccv2022, nocaps, region-based-method, swin-transformer, transformer-models, image-captioning, object-detection

grit's Introduction

GRIT: Faster and Better Image captioning Transformer (ECCV 2022)

This is the code implementation for the paper "GRIT: Faster and Better Image-captioning Transformer Using Dual Visual Features" (accepted to ECCV 2022) [arXiv].

Introduction

This paper proposes a Transformer neural architecture, dubbed GRIT (Grid- and Region-based Image captioning Transformer), that effectively utilizes two types of visual features, grid features and region features, to generate better captions. GRIT replaces the CNN-based detector employed in previous methods with a DETR-based one, making it computationally faster.

Model Zoo

| Model | Task | Checkpoint |
|---|---|---|
| Pretrained object detector (A) on Visual Genome | Object Detection | GG Drive link |
| Pretrained object detector (B) on 4 OD datasets | Object Detection | GG Drive link |
| GRIT (using the object detector A) | Image Captioning | GG Drive link |
| GRIT (using the object detector B) | Image Captioning | GG Drive link |

Installation

Requirements

  • Python >= 3.9, CUDA >= 11.3

  • PyTorch >= 1.12.0, torchvision >= 0.6.1

  • Other packages: pycocotools, tensorboard, tqdm, h5py, nltk, einops, hydra, spacy, and timm

  • First, clone the repository locally:

git clone https://github.com/davidnvq/grit.git
cd grit
  • Then, create an environment and install PyTorch and torchvision:
conda create -n grit python=3.9
conda activate grit
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113
# ^ adjust the CUDA suffix (cu113) if it does not match your system; visit pytorch.org for compatible versions.
  • Install other requirements:
pip install -r requirements.txt
python -m spacy download en
  • Install Deformable Attention:
cd models/ops/
python setup.py build develop
python test.py
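
After these steps, a quick sanity check (a minimal snippet of our own, not part of the repository) can confirm that a CUDA-enabled PyTorch build is active:

import torch

# Print the installed version, the CUDA build it was compiled against,
# and whether a GPU is currently visible.
print('torch:', torch.__version__)
print('CUDA build:', torch.version.cuda)
print('GPU available:', torch.cuda.is_available())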

Usage

Data preparation

Download and extract COCO 2014 for image captioning, including train, val, and test images with annotations, from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco_caption/
├── annotations/  # annotation json files and Karpathy files
├── train2014/    # train images
├── val2014/      # val images
└── test2014/     # test images
  • Copy the files in data/ to the above annotations folder. These include vocab.json and some files containing Karpathy ids.
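
Before training, a short script can verify the expected layout (a minimal sketch of our own; adjust the root path, and note that the exact annotation file names depend on what you copied from data/):

from pathlib import Path

data_root = Path('path/to/coco_caption')  # adjust to your dataset root

# The four directories from the structure shown above.
for name in ['annotations', 'train2014', 'val2014', 'test2014']:
    assert (data_root / name).is_dir(), f'missing directory: {name}'

# vocab.json should have been copied from the repository's data/ folder.
assert (data_root / 'annotations' / 'vocab.json').is_file(), 'vocab.json not found'
print('COCO caption directory layout looks OK')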

Training

The model is trained with the default settings given in the configuration file configs/caption/coco_config.yaml. The training process takes around 16 hours on a machine with 8 A100 GPUs. We also provide code for extracting pretrained features (with a frozen object detector), which speeds up training significantly.

  • With the default configuration (e.g., 'parallel attention', pretrained detectors on VG or 4DS, etc.):
export DATA_ROOT=path/to/coco_dataset
# with pretrained object detector on 4 datasets
python train_caption.py exp.name=caption_4ds model.detector.checkpoint=4ds_detector_path

# with pretrained object detector on Visual Genome
python train_caption.py exp.name=caption_vg model.detector.checkpoint=vg_detector_path
  • To freeze the backbone and detector, we can first extract the region features and initial grid features, saving them to the dataset.hdf5_path specified in the config file.

Note that this additional strategy only achieves about 134 CIDEr (as reported by some researchers). To obtain 139.2 CIDEr, please train the model with a frozen backbone/detector (in PyTorch, setting p.requires_grad = False for every parameter whose name contains 'backbone' or 'detector'; see the sketch below) while applying image augmentation at every iteration. This means we read and process every image during training rather than loading extracted features from HDF5.
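
The freezing step can be illustrated with a minimal sketch (an illustration only, not the repository's exact training code; model stands for the caption model built by train_caption.py):

import torch.nn as nn

def freeze_backbone_and_detector(model: nn.Module) -> None:
    """Freeze every parameter whose name contains 'backbone' or 'detector'."""
    for name, param in model.named_parameters():
        if 'backbone' in name or 'detector' in name:
            param.requires_grad = False

# Only the remaining trainable parameters should be handed to the optimizer, e.g.:
# trainable_params = [p for p in model.parameters() if p.requires_grad]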

Then we can run the following script to train the model:

export DATA_ROOT=path/to/coco_dataset
# with pretrained object detector on 4 datasets
python train_caption.py exp.name=caption_4ds model.detector.checkpoint=4ds_detector_path \
optimizer.freezing_xe_epochs=10 \
optimizer.freezing_sc_epochs=10 \
optimizer.finetune_xe_epochs=0 \
optimizer.finetune_sc_epochs=0 \
optimizer.freeze_backbone=True \
optimizer.freeze_detector=True

Evaluation

The evaluation will be run on a single GPU.

  • Evaluation on the Karpathy splits:
export DATA_ROOT=path/to/coco_caption
# evaluate on the validation split
python eval_caption.py +split='valid' exp.checkpoint=path_to_caption_checkpoint

# evaluate on the test split
python eval_caption.py +split='test' exp.checkpoint=path_to_caption_checkpoint
  • Evaluation on the online splits:
export DATA_ROOT=path/to/coco_caption
# evaluate on the validation split
python eval_caption_online.py +split='valid' exp.checkpoint=path_to_caption_checkpoint

# evaluate on the test split
python eval_caption_online.py +split='test' exp.checkpoint=path_to_caption_checkpoint

Inference on RGB Image

  • Perform inference on a single image using the script inference_caption.py:
python inference_caption.py +img_path='notebooks/COCO_val2014_000000000772.jpg' \
+vocab_path='data/vocab.json' \
exp.checkpoint='path_to_caption_checkpoint'
  • Perform inference on a single image using the Jupyter notebook notebooks/Inference.ipynb
# Requires installing Jupyter(lab)
pip install jupyterlab

cd notebooks
# Open jupyter notebook
jupyter lab

Finetune / Retrain GRIT on your own Dataset

We provide an example of how to finetune GRIT on a custom dataset (here, Vietnamese image captioning). Interestingly, the results show that the GRIT checkpoint trained on COCO (English) benefits a captioning task in another language. You only need to modify a few files. For example, we prepare 3 files in the vicap branch:

Citation

If you find this code useful, please cite the paper with the following BibTeX:

@inproceedings{nguyen2022grit,
  title={Grit: Faster and better image captioning transformer using dual visual features},
  author={Nguyen, Van-Quang and Suganuma, Masanori and Okatani, Takayuki},
  booktitle={Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXXVI},
  pages={167--184},
  year={2022},
  organization={Springer}
}

Acknowledgement

Our implementation builds on several open-source projects: i) the implementation of Swin Transformer, ii) the implementation of Deformable DETR, and iii) the image-captioning base from M2-Transformer. We thank the authors of these open-source projects.

grit's People

Contributors

davidnvq


grit's Issues

Replacement for MSDeformAttn?

Hi, thanks for your work!
Is there a drop-in replacement in pure PyTorch for MSDeformAttn that we can use instead, or an alternative you can recommend, since it is not implemented for CPU usage?

The COCO dataset

Dear Author,
The COCO dataset has 82,783 training images, but the training set in the code has 566,435 samples, which greatly increases training time. Why is this?

Reproduce accuracy

Hello! I'm trying to reproduce your exciting work with the latest version of the code. I followed the README exactly:

1. Create the conda env
2. Install torch 1.13 and torchvision 0.14.0
3. pip install -r requirements.txt
4. python -m spacy download en
5. Install Deformable Attention (test.py passed)
6. mkdir for coco2014 and copy the annotations in data/
7. Download detector_4ds.pth
8. Train with defaults:
  export DATA_ROOT=path/to/coco_dataset
  python train_caption.py exp.name=caption_4ds model.detector.checkpoint=4ds_detector_path

I run with 8x V100. Then I get a result.txt like:

Epoch 0: test scores: {'BLEU': [0.4826740064607102, 0.2825123386249825, 0.17192786207638042, 0.11124324587035776], 'METEOR': 0.12460705088131721, 'ROUGE': 0.3603731167452261, 'CIDEr': 0.21130171676509463}
Epoch 0: valid scores: {'BLEU': [0.4809478825363297, 0.28075657670130794, 0.1701937593188972, 0.11024780845475272], 'METEOR': 0.12345426307813455, 'ROUGE': 0.36123064067989835, 'CIDEr': 0.20620570719068415}
Epoch 1: test scores: {'BLEU': [0.5248353703853736, 0.33167281308033464, 0.21797301072361094, 0.15186914047291508], 'METEOR': 0.14933552212290435, 'ROUGE': 0.3929921108156746, 'CIDEr': 0.33549387368957595}
Epoch 1: valid scores: {'BLEU': [0.5198158763794527, 0.3266221218440906, 0.21083615382007403, 0.14466446939702118], 'METEOR': 0.14857412665329203, 'ROUGE': 0.3901667719177014, 'CIDEr': 0.3238176195085862}
Epoch 2: test scores: {'BLEU': [0.5363390891046735, 0.3433352194110444, 0.22972261169705058, 0.161702404326391], 'METEOR': 0.15889498191446866, 'ROUGE': 0.40265999677039577, 'CIDEr': 0.3918358848411488}
Epoch 2: valid scores: {'BLEU': [0.5344971969609377, 0.3429908159683274, 0.22816690874613718, 0.15954900504768488], 'METEOR': 0.15884193274368288, 'ROUGE': 0.40092146958640706, 'CIDEr': 0.38624709418187864}
Epoch 0: test scores: {'BLEU': [0.4690450822202827, 0.2740674618595161, 0.16612616248488782, 0.10678866508543383], 'METEOR': 0.12288610247402906, 'ROUGE': 0.3544446897287239, 'CIDEr': 0.20692863119165858}
Epoch 0: valid scores: {'BLEU': [0.4686423286424349, 0.273403803935602, 0.16527346300876197, 0.10698574270873559], 'METEOR': 0.12199810341449624, 'ROUGE': 0.3551028368719895, 'CIDEr': 0.19943273111507906}
Epoch 1: test scores: {'BLEU': [0.5234146265494272, 0.33013144260381155, 0.21599108394900485, 0.1502956212753436], 'METEOR': 0.14913752586558116, 'ROUGE': 0.3926932894298893, 'CIDEr': 0.33945321026393854}
Epoch 1: valid scores: {'BLEU': [0.5212400103773532, 0.3260280885519397, 0.20978370480878183, 0.14325363587559073], 'METEOR': 0.14863569235301125, 'ROUGE': 0.38937877098153806, 'CIDEr': 0.3285524742290841}
Epoch 2: test scores: {'BLEU': [0.5392462990493665, 0.34574838094101, 0.23177218610868325, 0.1634024087207341], 'METEOR': 0.15928696514321405, 'ROUGE': 0.40350642053255725, 'CIDEr': 0.39954665532511835}
Epoch 2: valid scores: {'BLEU': [0.5384198160798076, 0.34729459175965427, 0.23149505946368845, 0.16200475114204513], 'METEOR': 0.15905405773409495, 'ROUGE': 0.4019090675184717, 'CIDEr': 0.38586817830420467}
Epoch 3: test scores: {'BLEU': [0.5576247853374767, 0.3681748259876384, 0.2503556363905668, 0.1775286945740459], 'METEOR': 0.16678130307858485, 'ROUGE': 0.41578602791187025, 'CIDEr': 0.4493921391880315}
Epoch 3: valid scores: {'BLEU': [0.5587686732113375, 0.36847411566621185, 0.24938174533741023, 0.17672797938397283], 'METEOR': 0.1669871796107814, 'ROUGE': 0.41688115695334615, 'CIDEr': 0.4420037453281138}
Epoch 4: test scores: {'BLEU': [0.5592815336438278, 0.37090939503203824, 0.2544594923800747, 0.18252102735598846], 'METEOR': 0.1695217901226202, 'ROUGE': 0.418990780522334, 'CIDEr': 0.4532828270207605}
Epoch 4: valid scores: {'BLEU': [0.561585946157828, 0.3739822313654224, 0.2568688613781989, 0.1835018180694584], 'METEOR': 0.17109845207704907, 'ROUGE': 0.4209797796986528, 'CIDEr': 0.45742805246218415}
Epoch 5: valid scores: {'BLEU': [0.5747958533802986, 0.3837748429662631, 0.26201906296908256, 0.18690747103220676], 'METEOR': 0.17389373960984333, 'ROUGE': 0.4246595217074625, 'CIDEr': 0.48530243964837516}
Epoch 5: test scores: {'BLEU': [0.5724500348071474, 0.3829436244456247, 0.2641280181317395, 0.1893176137588956], 'METEOR': 0.17483616426523868, 'ROUGE': 0.42583893623562136, 'CIDEr': 0.48865713989532106}
Epoch 6: test scores: {'BLEU': [0.574939321582381, 0.38852217356066643, 0.2706095127847032, 0.19493552662884184], 'METEOR': 0.18099364454018793, 'ROUGE': 0.42937929401910385, 'CIDEr': 0.5084514915519348}
Epoch 6: valid scores: {'BLEU': [0.5765404516785712, 0.3890258699624733, 0.27032996824320976, 0.19495512670880544], 'METEOR': 0.17978292898933637, 'ROUGE': 0.4300951449464768, 'CIDEr': 0.5014036231019827}
Epoch 7: valid scores: {'BLEU': [0.5932745014973693, 0.4056438567376153, 0.28216531593478134, 0.20300411435726073], 'METEOR': 0.18339969708988216, 'ROUGE': 0.4388847069290379, 'CIDEr': 0.5257250560479939}
Epoch 7: test scores: {'BLEU': [0.5891616972024817, 0.4041254615688441, 0.28296731501137296, 0.20569355273165876], 'METEOR': 0.1830829671835631, 'ROUGE': 0.4385432962005867, 'CIDEr': 0.5369374656759178}
Epoch 8: test scores: {'BLEU': [0.5967015091611156, 0.41173264320381114, 0.2876675790613734, 0.2081232947977035], 'METEOR': 0.18702007297932405, 'ROUGE': 0.4422395356742119, 'CIDEr': 0.5558009883289353}
Epoch 8: valid scores: {'BLEU': [0.597914406493104, 0.41062948504182106, 0.2839210574499419, 0.20276999539028312], 'METEOR': 0.18575055658208975, 'ROUGE': 0.44045625983744213, 'CIDEr': 0.5482074695599521}

This is different from the training log you provided in my previous issue.
The differences from your original environment:

  • My Python version is 3.8.13
  • My torchvision version is 0.13, which is much higher than yours
  • I can't write to the conda directory on our server, which the installation needs, so I just compiled Deformable Attention and imported it by adding the lib to sys.path (the sanity check in model/ops/test.py passes)
  • I replaced one broken image in the coco2014 train set, as suggested on the web, because the program crashes when reading it

problem about GridFeatureNetwork code

class GridFeatureNetwork(nn.Module):

    def __init__(
        self,
        n_layers,
        pad_idx,
        d_in=1024,
        d_model=512,
        n_heads=8,
        d_ff=2048,
        dropout=0.1,
        attn_dropout=0.0,
        attention_module=None,
        **kwargs,
    ):
        super().__init__()
        self.fc = nn.Linear(d_in, d_model)
        self.dropout = nn.Dropout(p=dropout)
        self.layer_norm = nn.LayerNorm(d_model)
        self.layers = nn.ModuleList([
            TransformerLayer(
                d_model,
                n_heads,
                d_ff,
                dropout,
                attn_dropout=attn_dropout,
                attention_module=attention_module,
                **kwargs,
            ) for _ in range(n_layers)
        ])

    def forward(self, input, attention_mask=None, attention_weights=None):
        out = F.relu(self.fc(input))
        out = self.dropout(out)
        out = self.layer_norm(out)

        if attention_mask is None:
            attention_mask = (torch.sum(out, dim=-1) == self.padding_idx)
            attention_mask = repeat(attention_mask, 'B N -> B 1 1 N')  # [B Head Nq N]

        outs = []
        for l in self.layers:
            out = l(out, out, out, attention_mask, attention_weights)
            outs.append(out.unsqueeze(1))

        outs = torch.cat(outs, 1)
        return outs, attention_mask

There is a problem with GridFeatureNetwork in models/caption/grid_net.py: 'GridFeatureNetwork' object has no attribute 'padding_idx', yet self.padding_idx is used in attention_mask = (torch.sum(out, dim=-1) == self.padding_idx).
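
A minimal fix sketch (our assumption about the intended behavior, not an official patch) is to store the constructor argument so that forward() can build the mask:

# Sketch: inside GridFeatureNetwork.__init__, after super().__init__()
self.padding_idx = pad_idx  # forward() compares summed features against this index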

hydra version (1.2.0) issues - downgrade to hydra 1.1.0 to fix issues

FYI, the latest hydra version (1.2.0) causes unexpected problems during training/evaluation. To be precise, it saves files in undesired locations.

For those who are new to this project, please skip this.

If you don't want to reinstall the environment, please downgrade hydra to version 1.1.0 to make it run properly:

conda activate grit
pip uninstall hydra-core
pip install hydra-core==1.1.0

# I made new commits to handle the hydra issues; please pull the project again (or clone the new code).
git pull origin main

# ^ check this new commit: https://github.com/davidnvq/grit/commit/c6fac8db4cbe7d60246e86f98d1c07220caa03b3

Sorry for the inconvenience.

all_splits.h5

Dear author, can you share how to generate the all_splits.h5 file?

Training time

Excuse me, may I ask how long training this model takes?

accuracy of training_caption with freezing detector

Following the suggestion about accumulation_steps in #15, I modified the train_xe function in caption_engine.py:

    with tqdm(desc=f'Epoch {epoch} - train', unit='it', total=len(dataloaders['train'])) as pbar:
        for it, batch in enumerate(dataloaders['train']):
            out = model(batch['samples'], batch['captions'])

            captions_gt = batch['captions'][:, 1:].contiguous()
            out = out[:, :-1].contiguous()
            loss = loss_fn(out.view(-1, len(text_field.vocab)), captions_gt.view(-1))
            loss = loss / config.optimizer.accumulation_steps
            loss.backward()

            loss = gather_result(loss)
            running_loss += loss.item()

            pbar.set_postfix(loss=running_loss / (it + 1))
            pbar.update()

            if scheduler is not None:
                # accumulate
                if (it + 1) % config.optimizer.accumulation_steps == 0:
                    optimizers['model'].step()
                    optimizers['backbone'].step()

                    lr = scheduler.step()
                    assert optimizers['model'].param_groups[0]['lr'] == lr, "LR scheduler doesn't work properly."

                    optimizers['model'].zero_grad()
                    optimizers['backbone'].zero_grad()


but the CIDEr score after the first epoch is very low:

Epoch 0 - train: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 4425/4425 [13:18<00:00,  5.01it/s, loss=1.25][W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Epoch 0 - train: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 4425/4425 [13:21<00:00,  5.52it/s, loss=1.25]
Epoch 0 - validation:  99%|████████████████████████████████████████████████████████████████████████████████████████████████▌| 195/196 [00:25<00:00, 12.46it/s, loss=3.97][W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Epoch 0 - validation: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 196/196 [00:25<00:00,  7.59it/s, loss=3.97]
Epoch 0 - evaluation on valid:   0%|                                                                                                             | 0/157 [00:00<?, ?it/s]Number of iterations: 1, batch_size=32, Total time per 1 batch: 0.22200s
Epoch 0 - evaluation on valid:  64%|███████████████████████████████████████████████████████████████                                    | 100/157 [00:25<00:11,  4.81it/s]Number of iterations: 101, batch_size=32, Total time per 1 batch: 0.20595s
Epoch 0 - evaluation on valid:  99%|██████████████████████████████████████████████████████████████████████████████████████████████████▎| 156/157 [00:37<00:00,  4.76it/s][W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Epoch 0 - evaluation on valid: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 157/157 [00:37<00:00,  4.16it/s]
Epoch: 0 iters: 157
Total time per 1 batch: 0.20528s
Epoch 0: valid scores: {'BLEU': [0.4669783296661464, 0.24582240555971746, 0.13479587494094583, 0.07239531663650367], 'METEOR': 0.10954924428703441, 'ROUGE': 0.34872282632850365, 'CIDEr': 0.08790172488631927}

caption_4ds_20220906, B-IM, 384_640, maxwh, True, 0, valid, 8.79, 46.70, 7.24, 34.87, 10.95, 24.58, 13.48, 1.25, 0.00, 0.00, fr_xe, 3.97
Epoch 0 - evaluation on test:   0%|                                                                                                              | 0/157 [00:00<?, ?it/s]Number of iterations: 1, batch_size=32, Total time per 1 batch: 0.21069s
Epoch 0 - evaluation on test:  64%|███████████████████████████████████████████████████████████████▋                                    | 100/157 [00:25<00:11,  4.82it/s]Number of iterations: 101, batch_size=32, Total time per 1 batch: 0.20469s
Epoch 0 - evaluation on test:  99%|███████████████████████████████████████████████████████████████████████████████████████████████████▎| 156/157 [00:37<00:00,  4.81it/s][W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Epoch 0 - evaluation on test: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 157/157 [00:37<00:00,  4.17it/s]
Epoch: 0 iters: 157
Total time per 1 batch: 0.20416s
Epoch 0: test scores: {'BLEU': [0.4664962002208445, 0.24325046136142325, 0.1341976162794273, 0.0742708785109443], 'METEOR': 0.10937207416996544, 'ROUGE': 0.34777669352106455, 'CIDEr': 0.08940798205091863}

caption_4ds_20220906, B-IM, 384_640, maxwh, True, 0, test , 8.94, 46.65, 7.43, 34.78, 10.94, 24.33, 13.42, 1.25, 0.00, 0.00, fr_xe, 3.97

So the CIDEr score is not in the range of [1.05 - 1.29] (#17 (comment)).

training with my own dataset

Hello author, thank you for your excellent work!

I want to train on my own dataset, which has been annotated in COCO format. How can I obtain a file similar to coco_train_ids.npy?

init_process_group stuck

Dear author:
I followed the training tutorial like this:

export DATA_ROOT=path/to/coco_dataset 
# with pretrained object detector on 4 datasets 
python train_caption.py exp.name=caption_4ds model.detector.checkpoint=4ds_detector_path

and ran the train_caption.py script. Then I found that the code got stuck. After debugging, it turns out that the code is stuck at one line (see the attached screenshot).
I use only one 2080 Ti; what should I do to run the training script?
Looking forward to your reply, thank you!

the data sets

Dear Author, can you share the four datasets after consolidation (i.e., Visual Genome (VG), COCO, OpenImages, and Objects365)?

Question about Training log


Dear Author, you give the script for training the model end-to-end. I want to know whether you recorded the training results after each epoch when using the 4-dataset checkpoint.

vocab.json

Is it possible to get the vocab.json for use with the pretrained checkpoints that are downloadable? Thanks!

The code without detector

Hi Dr. Nguyen, thanks for your great work. I ran into a problem when reproducing your code. I have two 3090 GPUs with about 48 GB of memory in total, and I cannot run this code. If I drop the detector and use only grid features, how much GPU memory do I need?

what is the function of the config when dataset.overfit = True

in configs/caption/coco_config.yaml :

dataset:
  overfit: True
  ann_root: '${oc.env:DATA_ROOT}/annotations'
  img_root: '${oc.env:DATA_ROOT}'
  hdf5_path: '${oc.env:DATA_ROOT}/all_splits.h5' # this is used for freezed extractor; fast to train.
  vocab_path: '${oc.env:DATA_ROOT}/annotations/vocab.json'
  use_gri_feat: ${model.use_gri_feat}
  use_reg_feat: ${model.use_reg_feat}

What is the function of the parameter dataset.overfit = True?

In another file:

class CPairedDataset:

    def __init__(self, examples, image_field, overfit=False):
        self.examples = examples
        self.image_field = image_field
        self.overfit = overfit

    def __getitem__(self, idx):
        example = self.examples[idx]
        img_path, caption = example.image, example.tokens
        image_id = example.image_id
        img = self.image_field.preprocess(img_path)
        return img, caption, image_id

    def __len__(self):
        if self.overfit:
            return OVERFIT_SIZE
        return len(self.examples)

When overfit is True, why is the length of CPairedDataset 64?

Finetuning

It is a great job, thank you for sharing. I just want to ask how I can fine-tune your model with my own dataset. I saw you already added vicap.

I am using this code for fine-tuning with the same vocab file that you provide. Naturally, some tokens from my dataset are not included in the vocab file. At first I thought I could simply add the tokens to the vocab file, but there is a parameter named vocab_size, which I updated accordingly. When I try to use your pretrained model, I get a size mismatch error. Is there any way to fine-tune without retraining the entire model?
Thank you.

The Visual Genome dataset

Dear Author,
The Visual Genome dataset I downloaded was not partitioned. How do I divide the Visual Genome dataset into training, testing, and validation sets? How do I form the annotations folder?

ArtEmis: dataloader & training

Dear Author,

  1. Can you please share some of the data-loading code you used for ArtEmis?
  2. For ArtEmis, is it true that you trained without freezing the backbone/detector?

Thanks for the excellent work!

inference error

I followed the installation tutorial, downloaded the dataset and checkpoints, and tried to perform inference on a single image using inference_caption.py, but some errors occurred (see the attached screenshot).
I don't know how to fix this; can somebody help me?

The missing files

Dear author, the object detector is missing some important files referenced for training in train_config.yaml. Can you provide those files? Thanks.
${oc.env:DATA_ROOT}/vg/annotations/train_ann_lmdb
${oc.env:HOME}/datasets/vg/annotations/train_objects.json
${oc.env:HOME}/datasets/vg/annotations/attribute2ind.json
${oc.env:HOME}/datasets/vg/annotations/oid2attr.json
${oc.env:HOME}/datasets/vg/annotations/test_objects.json
${oc.env:HOME}/datasets/vg/annotations/test_coco.pkl
${oc.env:HOME}/datasets/vg/annotations/val_objects.json
${oc.env:HOME}/datasets/vg/annotations/val_coco.pkl
${oc.env:HOME}/datasets/coco/annotations/anno_1848_val2017.json
${oc.env:HOME}/datasets/coco/annotations/coco_vgoiv6_class2ind.json

How to generate more captions

Hello,

I would like to generate multiple candidate captions for one image when doing image captioning. How could I do this? Is there any parameter I can set?

Thanks!

Grid Feature Network

Dear Author,
I read the paper and don't understand how the Grid Feature Network works. You said: "We intend to extract contextual information hidden in the input image by modeling the spatial interaction between the grid features". Does that mean it uses a scene graph? If not, how can you extract contextual information? Thanks.

training with my own dataset

Hello author, thank you for your excellent work.
I want to train on my own dataset; how do I generate a coco_train_ids.npy file for it?


L in extract_features.py

Dear authors,
I extracted features and saved them into an HDF5 file, but the number of img_ids does not equal the number of reg_feat and gri_feat entries.
In extract_features.py, why does the batch_size in the DataLoader equal BATCH_SIZE - 1, and why is a random tensor appended to the images from the batch in the dataloader?
Thank you, authors.

The help for the evaluation

Hi, thank you for sharing this great work. I'm trying to test the effects of training, but I found that some code related to Table 7 (the 'Object, Attr., Relation, Color, Count, Size, CLIP' breakdown) is missing. How do I get it?

some questions about freezing training

Hello author, I have some questions about freezing training and hope to get your reply.
I noticed that you mentioned freezing the backbone network and the detector. I want to ask what the specific purpose of this is, and whether you could give some specific instructions for freezing the backbone and detector. When reading the code I found that the backbone is included in the model, so what is the purpose of freezing the backbone in this case?

TypeError: 'NoneType' object does not support item deletion

Hello, when I train the model, the following error occurs:

File "train_demo.py", line 316, in run_main
main(config)
File "train_demo.py", line 121, in main
dataloaders, samplers = build_coco_dataloaders(config, mode='finetune', device=device)
File "J:\GRIT\datasets\caption\coco.py", line 310, in build_coco_dataloaders
text_field = TextField(vocab_path=config.dataset.vocab_path)
File "J:\GRIT\datasets\caption\field.py", line 137, in init
self.vocab = Vocab(vocab_path=vocab_path)
File "J:\GRIT\datasets\caption\vocab.py", line 70, in init
del counter[tok]
TypeError: 'NoneType' object does not support item deletion

Is there any way to solve it?

Time to release the code.

Hello, this is good work. I read your paper and wonder whether and when the code will be released.
Waiting for your answer.

Frozen sc training error issue

Hello! When I try your frozen mode as:

python train_caption.py exp.name=caption_4ds model.detector.checkpoint=ckpts/detector_checkpoint_vg.pth \
exp.ngpus_per_node=1 \
exp.world_size=1 \
optimizer.freezing_xe_epochs=10 \
optimizer.freezing_sc_epochs=10 \
optimizer.finetune_xe_epochs=0 \
optimizer.finetune_sc_epochs=0

The XE stage runs well, but the error occurs in the frozen SC stage.

caption_engine.py:train_sc
with tqdm(desc='Epoch %d - train' % epoch, unit='it', total=len(dataloaders['train_dict'])) as pbar:
    for it, batch in enumerate(dataloaders['train_dict']):
        if 'samples' in batch:
            b_s = batch['samples'].tensors.shape[0]
        elif 'vis_feat' in batch:
            b_s = batch['vis_feat'].shape[0]

The exception says: batch['samples']: 'dict' object has no attribute 'tensors'.
It seems that batch['samples'] is expected to be a NestedTensor.

But actually, in frozen SC mode, the dataloader acts like this:

coco.py:
if self.img_field.use_hdf5_feat:
    samples = {}
    if self.img_field.use_gri_feat:
        samples['gri_feat'] = torch.stack([im['gri_feat'] for im in imgs]).to(self.device)
        samples['gri_mask'] = torch.stack([im['gri_mask'] for im in imgs]).to(self.device)
    if self.img_field.use_reg_feat:
        samples['reg_feat'] = torch.stack([im['reg_feat'] for im in imgs]).to(self.device)
        samples['reg_mask'] = torch.stack([im['reg_mask'] for im in imgs]).to(self.device)
    outputs['samples'] = samples
else:
    outputs['samples'] = nested_tensor_from_tensor_list(imgs).to(self.device)

So the dataloader returns a dict of tensors in HDF5 mode, but the frozen SC stage expects a NestedTensor?
Or am I doing something wrong when running the code?
Thank you for your attention.
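
For reference, one possible workaround (a sketch based only on the snippets above, not the maintainer's fix) is to make the batch-size lookup tolerate both cases:

from typing import Any

def batch_size_of(samples: Any) -> int:
    """Batch size whether samples is a NestedTensor or a dict of HDF5 features."""
    if isinstance(samples, dict):
        # Frozen/HDF5 mode: any stored feature tensor carries the batch dimension.
        return next(iter(samples.values())).shape[0]
    # End-to-end mode: samples is a NestedTensor with a .tensors attribute.
    return samples.tensors.shape[0]

# In train_sc, b_s = batch_size_of(batch['samples']) would then cover both modes.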

inference

I ran the inference notebook and realized that every time I run the Inference and Decode cells (from torch.no_grad to the end), different captions are generated. For example, with the same image, 'three sheep standing next to a fence in a field', 'two sheep standing next to a fence in a field', or 'three sheep standing next to a fence in the grass' is generated.
Can you check my issue?
Thank you.

Some issues regarding generating vocab.json files

Example of how you previously answered other people's questions:

Suppose that it is similar to the English tokenizer; you can obtain a vocab.json file by:

from datasets.caption.field import TextField

text_field = TextField(vocab_path="path_to_save_vocab.json", build_vocab=True)

given a list of captions

source = [
    "This is a first caption",
    "This is a second caption",
    ....
]

text_field.build_vocab(source)
That's how it works.

Hello author, according to your example, the vocab.json file I generated contains "freqs" and "itos", but I cannot obtain "stoi". Could you tell me why? Or is 'stoi' not needed for GRIT training?
I would greatly appreciate it if you could reply as soon as possible.

The inference script is not generating a complete caption.

Hi, thank you for sharing this great work.

I'm trying to reproduce the paper's result on the 5k Karpathy test split using the inference script, but I'm getting lower scores:

Bleu_1: 0.810
Bleu_2: 0.655
Bleu_3: 0.510
Bleu_4: 0.388
METEOR: 0.295
ROUGE_L: 0.587
CIDEr: 1.333
SPICE: 0.230

After some digging, I found that the caption is not fully generated.
I managed to reproduce the problem in Colab as well:

https://colab.research.google.com/drive/1BvtscubSujlxOFhOchVGNB79KkKYoMiH?usp=sharing

Request for Label to index file

Hey there, first of all, great work. We're exploring your project for our research and wanted to know if you can provide the object-label-to-index file.

It seems you've created 1,849 classes by aggregating the 4 datasets, and we're unable to regenerate this mapping. We would really appreciate it.

Train your own coco format data

Hi, this is really great work!
I want to train on my own dataset now, but I have some questions, such as how to generate the vocab.json. Can you give the exact script to process the dataset? If so, I'd appreciate it!

Missing detection dataset

Dear author, the object detector is missing some important files referenced for training in train_config.yaml. Can you provide those files? Thanks.
${oc.env:DATA_ROOT}/vg/annotations/train_ann_lmdb
${oc.env:HOME}/datasets/vg/annotations/train_objects.json
${oc.env:HOME}/datasets/vg/annotations/attribute2ind.json
${oc.env:HOME}/datasets/vg/annotations/oid2attr.json

Some problems on freezing training

Dear author, I used the code you provided to freeze the backbone and detector and train the model. I trained on four GTX 1080s; only the parameters related to distributed training were changed, and all other parameters remained unchanged. However, after 10 epochs of XE and 10 epochs of SC, the final CIDEr score was 132.0. What might be the cause of this, and what should I do to improve the model's performance?

Question about freeze training

Dear Author, I used the code you provided to freeze the backbone and detector and train the model, but after 10 epochs of XE and 10 epochs of SC, the final CIDEr score is 38.0. Have you met the same problem, or do you have any idea what the possible reason is?

Number of epochs

How do I reduce the number of epochs? And how many epochs is the model set to train for originally?

Error executing job with overrides

Dear Author,
I have encountered such an error:

Epoch 0 - train: 100%|████████████████████| 566435/566435 [40:02:24<00:00, 3.93it/s, loss=3.45]
Epoch 0 - validation: 0%| | 2/1563 [00:04<55:13, 2.12s/it, loss=3.78]
Error executing job with overrides: []
Traceback (most recent call last):
File "train_demo.py", line 324, in run_main
main(config)
File "train_demo.py", line 174, in main
train_res = train_xe(
File "J:\GRIT\engine\caption_engine.py", line 383, in train_xe
val_loss = evaluate_loss(model, dataloaders['valid'], loss_fn, text_field, epoch, writer)
File "J:\GRIT\engine\caption_engine.py", line 298, in evaluate_loss
out = model(batch['samples'], batch['captions'])
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\caption\transformer.py", line 89, in forward
vis_inputs = self.detector(images)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\caption\detector.py", line 53, in forward
features = self.backbone(x)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 662, in forward
x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 448, in forward
x = blk(x, attn_mask)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 279, in forward
attn_windows = self.attn(x_windows, mask=attn_mask)  # nW*B, window_size*window_size, C
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 171, in forward
attn = attn + relative_position_bias.unsqueeze(0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 568.00 MiB (GPU 0; 6.00 GiB total capacity; 3.99 GiB already allocated; 0 bytes free; 4.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Is there any way to solve it?

Some problems about load pretrain model

Dear author @davidnvq,
It is a great work! When I use it, some problems occur; how can I solve them? I use the "detector_checkpoint_vg.pth" pretrained model, but I get the error shown in the attached screenshot.

Could I change the code at line 69?
Before the change:
detector=detector.module,
After the change:
detector=detector.det_module,

multi gpus training cause out of memory error

When I try to run the train_caption.py script like this:

export DATA_ROOT=path/to/coco_dataset
python train_caption.py exp.name=caption_rds model.detector.checkpoint=4ds_detector_path

I encountered some errors like this (see the attached screenshot).

Below are the changes I made in coco_config.yml:
ngpus_per_node: 2
world_size: 2
batch_size: 4
num_workers: 2

However, when I set ngpus_per_node: 1 and world_size: 1, it runs properly (see the attached screenshot).

Can anyone help? Thanks a lot!
