
davidnvq / grit

Stars: 177 | Watchers: 3 | Forks: 27 | Size: 86.18 MB

GRIT: Faster and Better Image-captioning Transformer (ECCV 2022)

Languages: Makefile 0.02%, Python 84.58%, C++ 1.18%, Cuda 11.87%, Jupyter Notebook 2.35%
Topics: image-cap, coco-captions, detr, eccv2022, nocaps, region-based-method, swin-transformer, transformer-models, image-captioning, object-detection

grit's Introduction

GRIT: Faster and Better Image captioning Transformer (ECCV 2022)

This is the code implementation for the paper "GRIT: Faster and Better Image-captioning Transformer Using Dual Visual Features" (accepted to ECCV 2022) [arXiv].

Introduction

This paper proposes a Transformer neural architecture, dubbed GRIT (Grid- and Region-based Image captioning Transformer), that effectively utilizes two types of visual features, grid features and region features, to generate better captions. GRIT replaces the CNN-based detector employed in previous methods with a DETR-based one, making it computationally faster.

Model Zoo

| Model | Task | Checkpoint |
|---|---|---|
| Pretrained object detector (A) on Visual Genome | Object Detection | GG Drive link |
| Pretrained object detector (B) on 4 OD datasets | Object Detection | GG Drive link |
| GRIT (using the object detector A) | Image Captioning | GG Drive link |
| GRIT (using the object detector B) | Image Captioning | GG Drive link |

Installation

Requirements

  • Python >= 3.9, CUDA >= 11.3

  • PyTorch >= 1.12.0, torchvision >= 0.6.1

  • Other packages: pycocotools, tensorboard, tqdm, h5py, nltk, einops, hydra, spacy, and timm

  • First, clone the repository locally:

git clone https://github.com/davidnvq/grit.git
cd grit
  • Then, create an environment and install PyTorch and torchvision:
conda create -n grit python=3.9
conda activate grit
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113
# ^ adjust the CUDA suffix (cu113) if it does not match your system; visit pytorch.org for compatible versions.
  • Install other requirements:
pip install -r requirements.txt
python -m spacy download en
  • Install Deformable Attention:
cd models/ops/
python setup.py build develop
python test.py
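
After these steps, a quick sanity check (a minimal snippet of our own, not part of the repository) can confirm that a CUDA-enabled PyTorch build is active:

import torch

# Print the installed version, the CUDA build it was compiled against,
# and whether a GPU is currently visible.
print('torch:', torch.__version__)
print('CUDA build:', torch.version.cuda)
print('GPU available:', torch.cuda.is_available())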

Usage

Data preparation

Download and extract COCO 2014 for image captioning, including train, val, and test images with annotations, from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco_caption/
├── annotations/  # annotation json files and Karpathy files
├── train2014/    # train images
├── val2014/      # val images
└── test2014/     # test images
  • Copy the files in data/ to the above annotations folder. These include vocab.json and some files containing Karpathy ids.
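
Before training, a short script can verify the expected layout (a minimal sketch of our own; adjust the root path, and note that the exact annotation file names depend on what you copied from data/):

from pathlib import Path

data_root = Path('path/to/coco_caption')  # adjust to your dataset root

# The four directories from the structure shown above.
for name in ['annotations', 'train2014', 'val2014', 'test2014']:
    assert (data_root / name).is_dir(), f'missing directory: {name}'

# vocab.json should have been copied from the repository's data/ folder.
assert (data_root / 'annotations' / 'vocab.json').is_file(), 'vocab.json not found'
print('COCO caption directory layout looks OK')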

Training

The model is trained with the default settings given in the configuration file configs/caption/coco_config.yaml. The training process takes around 16 hours on a machine with 8 A100 GPUs. We also provide code for extracting pretrained features (with a frozen object detector), which speeds up training significantly.

  • With the default configuration (e.g., 'parallel attention', pretrained detectors on VG or 4DS, etc.):
export DATA_ROOT=path/to/coco_dataset
# with pretrained object detector on 4 datasets
python train_caption.py exp.name=caption_4ds model.detector.checkpoint=4ds_detector_path

# with pretrained object detector on Visual Genome
python train_caption.py exp.name=caption_vg model.detector.checkpoint=vg_detector_path
  • To freeze the backbone and detector, we can first extract the region features and initial grid features, saving them to the dataset.hdf5_path specified in the config file.

Note that this additional strategy only achieves about 134 CIDEr (as reported by some researchers). To obtain 139.2 CIDEr, please train the model with a frozen backbone/detector (in PyTorch, setting p.requires_grad = False for every parameter whose name contains 'backbone' or 'detector'; see the sketch below) while applying image augmentation at every iteration. This means we read and process every image during training rather than loading extracted features from HDF5.
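
The freezing step can be illustrated with a minimal sketch (an illustration only, not the repository's exact training code; model stands for the caption model built by train_caption.py):

import torch.nn as nn

def freeze_backbone_and_detector(model: nn.Module) -> None:
    """Freeze every parameter whose name contains 'backbone' or 'detector'."""
    for name, param in model.named_parameters():
        if 'backbone' in name or 'detector' in name:
            param.requires_grad = False

# Only the remaining trainable parameters should be handed to the optimizer, e.g.:
# trainable_params = [p for p in model.parameters() if p.requires_grad]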

Then we can run the following script to train the model:

export DATA_ROOT=path/to/coco_dataset
# with pretrained object detector on 4 datasets
python train_caption.py exp.name=caption_4ds model.detector.checkpoint=4ds_detector_path \
optimizer.freezing_xe_epochs=10 \
optimizer.freezing_sc_epochs=10 \
optimizer.finetune_xe_epochs=0 \
optimizer.finetune_sc_epochs=0 \
optimizer.freeze_backbone=True \
optimizer.freeze_detector=True

Evaluation

The evaluation will be run on a single GPU.

  • Evaluation on the Karpathy splits:
export DATA_ROOT=path/to/coco_caption
# evaluate on the validation split
python eval_caption.py +split='valid' exp.checkpoint=path_to_caption_checkpoint

# evaluate on the test split
python eval_caption.py +split='test' exp.checkpoint=path_to_caption_checkpoint
  • Evaluation on the online splits:
export DATA_ROOT=path/to/coco_caption
# evaluate on the validation split
python eval_caption_online.py +split='valid' exp.checkpoint=path_to_caption_checkpoint

# evaluate on the test split
python eval_caption_online.py +split='test' exp.checkpoint=path_to_caption_checkpoint

Inference on RGB Image

  • Perform inference on a single image using the script inference_caption.py:
python inference_caption.py +img_path='notebooks/COCO_val2014_000000000772.jpg' \
+vocab_path='data/vocab.json' \
exp.checkpoint='path_to_caption_checkpoint'
  • Perform inference on a single image using the Jupyter notebook notebooks/Inference.ipynb
# Requires installing Jupyter(lab)
pip install jupyterlab

cd notebooks
# Open jupyter notebook
jupyter lab

Finetune / Retrain GRIT on your own Dataset

We provide an example of how to finetune GRIT on a custom dataset (here, Vietnamese image captioning). Interestingly, the results show that the GRIT checkpoint trained on COCO (English) benefits a captioning task in another language. You only need to modify a few files. For example, we prepare 3 files in the vicap branch:

Citation

If you find this code useful, please cite the paper with the following BibTeX:

@inproceedings{nguyen2022grit,
  title={Grit: Faster and better image captioning transformer using dual visual features},
  author={Nguyen, Van-Quang and Suganuma, Masanori and Okatani, Takayuki},
  booktitle={Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXXVI},
  pages={167--184},
  year={2022},
  organization={Springer}
}

Acknowledgement

Our implementation builds on several open-source projects: i) the implementation of Swin Transformer, ii) the implementation of Deformable DETR, and iii) the image-captioning base from M2-Transformer. We thank the authors of these open-source projects.

grit's People

Contributors

davidnvq


grit's Issues

Replacement for MSDeformAttn?

Hi, thanks for your work!
Is there a drop-in replacement in pure PyTorch for MSDeformAttn that we can use instead, or an alternative you can recommend, since it is not implemented for CPU usage?

The COCO dataset

Dear Author,
The COCO dataset has 82,783 training images, but the training set in the code has 566,435 samples, which greatly increases training time. Why is this?

Reproduce accuracy

Hello! I'm trying to reproduce your exciting work with the latest version of the code. I followed the README exactly:

1. Create the conda env
2. Install torch 1.13 and torchvision 0.14.0
3. pip install -r requirements.txt
4. python -m spacy download en
5. Install Deformable Attention (test.py passed)
6. mkdir for coco2014 and copy the annotations in data/
7. Download detector_4ds.pth
8. Train with defaults:
  export DATA_ROOT=path/to/coco_dataset
  python train_caption.py exp.name=caption_4ds model.detector.checkpoint=4ds_detector_path

I run with 8x V100. Then I get a result.txt like:

Epoch 0: test scores: {'BLEU': [0.4826740064607102, 0.2825123386249825, 0.17192786207638042, 0.11124324587035776], 'METEOR': 0.12460705088131721, 'ROUGE': 0.3603731167452261, 'CIDEr': 0.21130171676509463}
Epoch 0: valid scores: {'BLEU': [0.4809478825363297, 0.28075657670130794, 0.1701937593188972, 0.11024780845475272], 'METEOR': 0.12345426307813455, 'ROUGE': 0.36123064067989835, 'CIDEr': 0.20620570719068415}
Epoch 1: test scores: {'BLEU': [0.5248353703853736, 0.33167281308033464, 0.21797301072361094, 0.15186914047291508], 'METEOR': 0.14933552212290435, 'ROUGE': 0.3929921108156746, 'CIDEr': 0.33549387368957595}
Epoch 1: valid scores: {'BLEU': [0.5198158763794527, 0.3266221218440906, 0.21083615382007403, 0.14466446939702118], 'METEOR': 0.14857412665329203, 'ROUGE': 0.3901667719177014, 'CIDEr': 0.3238176195085862}
Epoch 2: test scores: {'BLEU': [0.5363390891046735, 0.3433352194110444, 0.22972261169705058, 0.161702404326391], 'METEOR': 0.15889498191446866, 'ROUGE': 0.40265999677039577, 'CIDEr': 0.3918358848411488}
Epoch 2: valid scores: {'BLEU': [0.5344971969609377, 0.3429908159683274, 0.22816690874613718, 0.15954900504768488], 'METEOR': 0.15884193274368288, 'ROUGE': 0.40092146958640706, 'CIDEr': 0.38624709418187864}
Epoch 0: test scores: {'BLEU': [0.4690450822202827, 0.2740674618595161, 0.16612616248488782, 0.10678866508543383], 'METEOR': 0.12288610247402906, 'ROUGE': 0.3544446897287239, 'CIDEr': 0.20692863119165858}
Epoch 0: valid scores: {'BLEU': [0.4686423286424349, 0.273403803935602, 0.16527346300876197, 0.10698574270873559], 'METEOR': 0.12199810341449624, 'ROUGE': 0.3551028368719895, 'CIDEr': 0.19943273111507906}
Epoch 1: test scores: {'BLEU': [0.5234146265494272, 0.33013144260381155, 0.21599108394900485, 0.1502956212753436], 'METEOR': 0.14913752586558116, 'ROUGE': 0.3926932894298893, 'CIDEr': 0.33945321026393854}
Epoch 1: valid scores: {'BLEU': [0.5212400103773532, 0.3260280885519397, 0.20978370480878183, 0.14325363587559073], 'METEOR': 0.14863569235301125, 'ROUGE': 0.38937877098153806, 'CIDEr': 0.3285524742290841}
Epoch 2: test scores: {'BLEU': [0.5392462990493665, 0.34574838094101, 0.23177218610868325, 0.1634024087207341], 'METEOR': 0.15928696514321405, 'ROUGE': 0.40350642053255725, 'CIDEr': 0.39954665532511835}
Epoch 2: valid scores: {'BLEU': [0.5384198160798076, 0.34729459175965427, 0.23149505946368845, 0.16200475114204513], 'METEOR': 0.15905405773409495, 'ROUGE': 0.4019090675184717, 'CIDEr': 0.38586817830420467}
Epoch 3: test scores: {'BLEU': [0.5576247853374767, 0.3681748259876384, 0.2503556363905668, 0.1775286945740459], 'METEOR': 0.16678130307858485, 'ROUGE': 0.41578602791187025, 'CIDEr': 0.4493921391880315}
Epoch 3: valid scores: {'BLEU': [0.5587686732113375, 0.36847411566621185, 0.24938174533741023, 0.17672797938397283], 'METEOR': 0.1669871796107814, 'ROUGE': 0.41688115695334615, 'CIDEr': 0.4420037453281138}
Epoch 4: test scores: {'BLEU': [0.5592815336438278, 0.37090939503203824, 0.2544594923800747, 0.18252102735598846], 'METEOR': 0.1695217901226202, 'ROUGE': 0.418990780522334, 'CIDEr': 0.4532828270207605}
Epoch 4: valid scores: {'BLEU': [0.561585946157828, 0.3739822313654224, 0.2568688613781989, 0.1835018180694584], 'METEOR': 0.17109845207704907, 'ROUGE': 0.4209797796986528, 'CIDEr': 0.45742805246218415}
Epoch 5: valid scores: {'BLEU': [0.5747958533802986, 0.3837748429662631, 0.26201906296908256, 0.18690747103220676], 'METEOR': 0.17389373960984333, 'ROUGE': 0.4246595217074625, 'CIDEr': 0.48530243964837516}
Epoch 5: test scores: {'BLEU': [0.5724500348071474, 0.3829436244456247, 0.2641280181317395, 0.1893176137588956], 'METEOR': 0.17483616426523868, 'ROUGE': 0.42583893623562136, 'CIDEr': 0.48865713989532106}
Epoch 6: test scores: {'BLEU': [0.574939321582381, 0.38852217356066643, 0.2706095127847032, 0.19493552662884184], 'METEOR': 0.18099364454018793, 'ROUGE': 0.42937929401910385, 'CIDEr': 0.5084514915519348}
Epoch 6: valid scores: {'BLEU': [0.5765404516785712, 0.3890258699624733, 0.27032996824320976, 0.19495512670880544], 'METEOR': 0.17978292898933637, 'ROUGE': 0.4300951449464768, 'CIDEr': 0.5014036231019827}
Epoch 7: valid scores: {'BLEU': [0.5932745014973693, 0.4056438567376153, 0.28216531593478134, 0.20300411435726073], 'METEOR': 0.18339969708988216, 'ROUGE': 0.4388847069290379, 'CIDEr': 0.5257250560479939}
Epoch 7: test scores: {'BLEU': [0.5891616972024817, 0.4041254615688441, 0.28296731501137296, 0.20569355273165876], 'METEOR': 0.1830829671835631, 'ROUGE': 0.4385432962005867, 'CIDEr': 0.5369374656759178}
Epoch 8: test scores: {'BLEU': [0.5967015091611156, 0.41173264320381114, 0.2876675790613734, 0.2081232947977035], 'METEOR': 0.18702007297932405, 'ROUGE': 0.4422395356742119, 'CIDEr': 0.5558009883289353}
Epoch 8: valid scores: {'BLEU': [0.597914406493104, 0.41062948504182106, 0.2839210574499419, 0.20276999539028312], 'METEOR': 0.18575055658208975, 'ROUGE': 0.44045625983744213, 'CIDEr': 0.5482074695599521}

This is different from the training log you provided in my previous issue.
The differences from your original environment:

  • My Python version is 3.8.13
  • My torchvision version is 0.13, which is much higher than yours
  • I can't write to the conda directory on our server, which the installation needs, so I just compiled Deformable Attention and imported it by adding the lib to sys.path (the sanity check in model/ops/test.py passes)
  • I replaced one broken image in the coco2014 train set, as suggested on the web, because the program crashes when reading it

problem about GridFeatureNetwork code

class GridFeatureNetwork(nn.Module):

    def __init__(
        self,
        n_layers,
        pad_idx,
        d_in=1024,
        d_model=512,
        n_heads=8,
        d_ff=2048,
        dropout=0.1,
        attn_dropout=0.0,
        attention_module=None,
        **kwargs,
    ):
        super().__init__()
        self.fc = nn.Linear(d_in, d_model)
        self.dropout = nn.Dropout(p=dropout)
        self.layer_norm = nn.LayerNorm(d_model)
        self.layers = nn.ModuleList([
            TransformerLayer(
                d_model,
                n_heads,
                d_ff,
                dropout,
                attn_dropout=attn_dropout,
                attention_module=attention_module,
                **kwargs,
            ) for _ in range(n_layers)
        ])

    def forward(self, input, attention_mask=None, attention_weights=None):
        out = F.relu(self.fc(input))
        out = self.dropout(out)
        out = self.layer_norm(out)

        if attention_mask is None:
            attention_mask = (torch.sum(out, dim=-1) == self.padding_idx)
            attention_mask = repeat(attention_mask, 'B N -> B 1 1 N')  # [B Head Nq N]

        outs = []
        for l in self.layers:
            out = l(out, out, out, attention_mask, attention_weights)
            outs.append(out.unsqueeze(1))

        outs = torch.cat(outs, 1)
        return outs, attention_mask

There is a problem with GridFeatureNetwork in models/caption/grid_net.py: 'GridFeatureNetwork' object has no attribute 'padding_idx', yet self.padding_idx is used in attention_mask = (torch.sum(out, dim=-1) == self.padding_idx).
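
A minimal fix sketch (our assumption about the intended behavior, not an official patch) is to store the constructor argument so that forward() can build the mask:

# Sketch: inside GridFeatureNetwork.__init__, after super().__init__()
self.padding_idx = pad_idx  # forward() compares summed features against this index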

hydra version (1.2.0) issues - downgrade to hydra 1.1.0 to fix issues

FYI, the latest hydra version (1.2.0) causes unexpected problems during training/evaluation. To be precise, it saves files in undesired locations.

For those who are new to this project, please skip this.

If you don't want to reinstall the environment, please downgrade hydra to version 1.1.0 to make it run properly:

conda activate grit
pip uninstall hydra-core
pip install hydra-core==1.1.0

# I made new commits to handle the hydra issues; please pull the project again (or clone the new code).
git pull origin main

# ^ check this new commit: https://github.com/davidnvq/grit/commit/c6fac8db4cbe7d60246e86f98d1c07220caa03b3

Sorry for the inconvenience.

all_splits.h5

Dear author, can you share how to generate the all_splits.h5 file?

Training time

Excuse me, may I ask how long training this model takes?

accuracy of training_caption with freezing detector

Following the suggestion about accumulation_steps in #15, I modified the train_xe function in caption_engine.py:

    with tqdm(desc=f'Epoch {epoch} - train', unit='it', total=len(dataloaders['train'])) as pbar:
        for it, batch in enumerate(dataloaders['train']):
            out = model(batch['samples'], batch['captions'])

            captions_gt = batch['captions'][:, 1:].contiguous()
            out = out[:, :-1].contiguous()
            loss = loss_fn(out.view(-1, len(text_field.vocab)), captions_gt.view(-1))
            loss = loss / config.optimizer.accumulation_steps
            loss.backward()

            loss = gather_result(loss)
            running_loss += loss.item()

            pbar.set_postfix(loss=running_loss / (it + 1))
            pbar.update()

            if scheduler is not None:
                # accumulate
                if (it + 1) % config.optimizer.accumulation_steps == 0:
                    optimizers['model'].step()
                    optimizers['backbone'].step()

                    lr = scheduler.step()
                    assert optimizers['model'].param_groups[0]['lr'] == lr, "LR scheduler doesn't work properly."

                    optimizers['model'].zero_grad()
                    optimizers['backbone'].zero_grad()


but the CIDEr score after the first epoch is very low:

Epoch 0 - train: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 4425/4425 [13:18<00:00,  5.01it/s, loss=1.25][W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Epoch 0 - train: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 4425/4425 [13:21<00:00,  5.52it/s, loss=1.25]
Epoch 0 - validation:  99%|████████████████████████████████████████████████████████████████████████████████████████████████▌| 195/196 [00:25<00:00, 12.46it/s, loss=3.97][W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Epoch 0 - validation: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 196/196 [00:25<00:00,  7.59it/s, loss=3.97]
Epoch 0 - evaluation on valid:   0%|                                                                                                             | 0/157 [00:00<?, ?it/s]Number of iterations: 1, batch_size=32, Total time per 1 batch: 0.22200s
Epoch 0 - evaluation on valid:  64%|███████████████████████████████████████████████████████████████                                    | 100/157 [00:25<00:11,  4.81it/s]Number of iterations: 101, batch_size=32, Total time per 1 batch: 0.20595s
Epoch 0 - evaluation on valid:  99%|██████████████████████████████████████████████████████████████████████████████████████████████████▎| 156/157 [00:37<00:00,  4.76it/s][W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Epoch 0 - evaluation on valid: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 157/157 [00:37<00:00,  4.16it/s]
Epoch: 0 iters: 157
Total time per 1 batch: 0.20528s
Epoch 0: valid scores: {'BLEU': [0.4669783296661464, 0.24582240555971746, 0.13479587494094583, 0.07239531663650367], 'METEOR': 0.10954924428703441, 'ROUGE': 0.34872282632850365, 'CIDEr': 0.08790172488631927}

caption_4ds_20220906, B-IM, 384_640, maxwh, True, 0, valid, 8.79, 46.70, 7.24, 34.87, 10.95, 24.58, 13.48, 1.25, 0.00, 0.00, fr_xe, 3.97
Epoch 0 - evaluation on test:   0%|                                                                                                              | 0/157 [00:00<?, ?it/s]Number of iterations: 1, batch_size=32, Total time per 1 batch: 0.21069s
Epoch 0 - evaluation on test:  64%|███████████████████████████████████████████████████████████████▋                                    | 100/157 [00:25<00:11,  4.82it/s]Number of iterations: 101, batch_size=32, Total time per 1 batch: 0.20469s
Epoch 0 - evaluation on test:  99%|███████████████████████████████████████████████████████████████████████████████████████████████████▎| 156/157 [00:37<00:00,  4.81it/s][W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Epoch 0 - evaluation on test: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 157/157 [00:37<00:00,  4.17it/s]
Epoch: 0 iters: 157
Total time per 1 batch: 0.20416s
Epoch 0: test scores: {'BLEU': [0.4664962002208445, 0.24325046136142325, 0.1341976162794273, 0.0742708785109443], 'METEOR': 0.10937207416996544, 'ROUGE': 0.34777669352106455, 'CIDEr': 0.08940798205091863}

caption_4ds_20220906, B-IM, 384_640, maxwh, True, 0, test , 8.94, 46.65, 7.43, 34.78, 10.94, 24.33, 13.42, 1.25, 0.00, 0.00, fr_xe, 3.97

So the CIDEr score is not in the range of [1.05 - 1.29] (#17 (comment)).

training with my own dataset

Hello author, thank you for your excellent work!

I want to train on my own dataset, which has been annotated in COCO format. How can I obtain a file similar to coco_train_ids.npy?

init_process_group stuck

Dear author:
I followed the training tutorial like this:

export DATA_ROOT=path/to/coco_dataset 
# with pretrained object detector on 4 datasets 
python train_caption.py exp.name=caption_4ds model.detector.checkpoint=4ds_detector_path

and ran the train_caption.py script. Then I found that the code got stuck. After debugging, it turns out that the code is stuck at one line (see the attached screenshot).
I use only one 2080 Ti; what should I do to run the training script?
Looking forward to your reply, thank you!

the data sets

Dear Author, can you share the four datasets after consolidation (i.e., Visual Genome (VG), COCO, OpenImages, and Objects365)?

Question about Training log


Dear Author, you give the script for training the model end-to-end. I want to know whether you recorded the training results after each epoch when using the 4-dataset checkpoint.

vocab.json

Is it possible to get the vocab.json for use with the pretrained checkpoints that are downloadable? Thanks!

The code without detector

Hi Dr. Nguyen, thanks for your great work. I ran into a problem when reproducing your code. I have two 3090 GPUs with about 48 GB of memory in total, and I cannot run this code. If I drop the detector and use only grid features, how much GPU memory do I need?

what is the function of the config when dataset.overfit = True

in configs/caption/coco_config.yaml :

dataset:
  overfit: True
  ann_root: '${oc.env:DATA_ROOT}/annotations'
  img_root: '${oc.env:DATA_ROOT}'
  hdf5_path: '${oc.env:DATA_ROOT}/all_splits.h5' # this is used for freezed extractor; fast to train.
  vocab_path: '${oc.env:DATA_ROOT}/annotations/vocab.json'
  use_gri_feat: ${model.use_gri_feat}
  use_reg_feat: ${model.use_reg_feat}

What is the function of the parameter dataset.overfit = True?

In another file:

class CPairedDataset:

    def __init__(self, examples, image_field, overfit=False):
        self.examples = examples
        self.image_field = image_field
        self.overfit = overfit

    def __getitem__(self, idx):
        example = self.examples[idx]
        img_path, caption = example.image, example.tokens
        image_id = example.image_id
        img = self.image_field.preprocess(img_path)
        return img, caption, image_id

    def __len__(self):
        if self.overfit:
            return OVERFIT_SIZE
        return len(self.examples)

When overfit is True, why is the length of CPairedDataset 64?

Finetuning

It is a great job, thank you for sharing. I just want to ask how I can fine-tune your model with my own dataset. I saw you already added vicap.

I am using this code for fine-tuning with the same vocab file that you provide. Naturally, some tokens from my dataset are not included in the vocab file. At first I thought I could simply add the tokens to the vocab file, but there is a parameter named vocab_size, which I updated accordingly. When I try to use your pretrained model, I get a size mismatch error. Is there any way to fine-tune without retraining the entire model?
Thank you.

The Visual Genome dataset

Dear Author,
The Visual Genome dataset I downloaded was not partitioned. How do I divide the Visual Genome dataset into training, testing, and validation sets? How do I form the annotations folder?

ArtEmis: dataloader & training

Dear Author,

  1. Can you please share some of the data-loading code you used for ArtEmis?
  2. For ArtEmis, is it true that you trained without freezing the backbone/detector?

Thanks for the excellent work!

inference error

I followed the installation tutorial, downloaded the dataset and checkpoints, and tried to perform inference on a single image using inference_caption.py, but some errors occurred (see the attached screenshot).
I don't know how to fix this; can somebody help me?

The missing files

Dear author, the object detector is missing some important files referenced for training in train_config.yaml. Can you provide those files? Thanks.
${oc.env:DATA_ROOT}/vg/annotations/train_ann_lmdb
${oc.env:HOME}/datasets/vg/annotations/train_objects.json
${oc.env:HOME}/datasets/vg/annotations/attribute2ind.json
${oc.env:HOME}/datasets/vg/annotations/oid2attr.json
${oc.env:HOME}/datasets/vg/annotations/test_objects.json
${oc.env:HOME}/datasets/vg/annotations/test_coco.pkl
${oc.env:HOME}/datasets/vg/annotations/val_objects.json
${oc.env:HOME}/datasets/vg/annotations/val_coco.pkl
${oc.env:HOME}/datasets/coco/annotations/anno_1848_val2017.json
${oc.env:HOME}/datasets/coco/annotations/coco_vgoiv6_class2ind.json

How to generate more captions

Hello,

I would like to generate multiple candidate captions for one image when doing image captioning. How could I do this? Is there any parameter I can set?

Thanks!

Grid Feature Network

Dear Author,
I read the paper and don't understand how the Grid Feature Network works. You said: "We intend to extract contextual information hidden in the input image by modeling the spatial interaction between the grid features". Does that mean it uses a scene graph? If not, how can you extract contextual information? Thanks.

training with my own dataset

Hello author, thank you for your excellent work.
I want to train on my own dataset; how do I generate a coco_train_ids.npy file for it?


L in extract_features.py

Dear authors,
I extracted features and saved them into an HDF5 file, but the number of img_ids does not equal the number of reg_feat and gri_feat entries.
In extract_features.py, why does the batch_size in the DataLoader equal BATCH_SIZE - 1, and why is a random tensor appended to the images from the batch in the dataloader?
Thank you, authors.

The help for the evaluation

Hi, thank you for sharing this great work. I'm trying to test the effects of training, but I found that some code related to Table 7 (the 'Object, Attr., Relation, Color, Count, Size, CLIP' breakdown) is missing. How do I get it?

some questions about freezing training

Hello author, I have some questions about freezing training and hope to get your reply.
I noticed that you mentioned freezing the backbone network and the detector. I want to ask what the specific purpose of this is, and whether you could give some specific instructions for freezing the backbone and detector. When reading the code I found that the backbone is included in the model, so what is the purpose of freezing the backbone in this case?

TypeError: 'NoneType' object does not support item deletion

Hello, when I train the model, the following error occurs:

File "train_demo.py", line 316, in run_main
main(config)
File "train_demo.py", line 121, in main
dataloaders, samplers = build_coco_dataloaders(config, mode='finetune', device=device)
File "J:\GRIT\datasets\caption\coco.py", line 310, in build_coco_dataloaders
text_field = TextField(vocab_path=config.dataset.vocab_path)
File "J:\GRIT\datasets\caption\field.py", line 137, in init
self.vocab = Vocab(vocab_path=vocab_path)
File "J:\GRIT\datasets\caption\vocab.py", line 70, in init
del counter[tok]
TypeError: 'NoneType' object does not support item deletion

Is there any way to solve it?

Time to release the code.

Hello, this is good work. I read your paper and wonder whether and when the code will be released.
Waiting for your answer.

Frozen sc training error issue

Hello! When I try your frozen mode as:

python train_caption.py exp.name=caption_4ds model.detector.checkpoint=ckpts/detector_checkpoint_vg.pth \
exp.ngpus_per_node=1 \
exp.world_size=1 \
optimizer.freezing_xe_epochs=10 \
optimizer.freezing_sc_epochs=10 \
optimizer.finetune_xe_epochs=0 \
optimizer.finetune_sc_epochs=0

The XE stage runs well, but the error occurs in the frozen SC stage.

caption_engine.py:train_sc
with tqdm(desc='Epoch %d - train' % epoch, unit='it', total=len(dataloaders['train_dict'])) as pbar:
    for it, batch in enumerate(dataloaders['train_dict']):
        if 'samples' in batch:
            b_s = batch['samples'].tensors.shape[0]
        elif 'vis_feat' in batch:
            b_s = batch['vis_feat'].shape[0]

The exception says: batch['samples']: 'dict' object has no attribute 'tensors'.
It seems that batch['samples'] is expected to be a NestedTensor.

But actually, in frozen SC mode, the dataloader acts like this:

coco.py:
if self.img_field.use_hdf5_feat:
    samples = {}
    if self.img_field.use_gri_feat:
        samples['gri_feat'] = torch.stack([im['gri_feat'] for im in imgs]).to(self.device)
        samples['gri_mask'] = torch.stack([im['gri_mask'] for im in imgs]).to(self.device)
    if self.img_field.use_reg_feat:
        samples['reg_feat'] = torch.stack([im['reg_feat'] for im in imgs]).to(self.device)
        samples['reg_mask'] = torch.stack([im['reg_mask'] for im in imgs]).to(self.device)
    outputs['samples'] = samples
else:
    outputs['samples'] = nested_tensor_from_tensor_list(imgs).to(self.device)

So the dataloader returns a dict of tensors in HDF5 mode, but the frozen SC stage expects a NestedTensor?
Or am I doing something wrong when running the code?
Thank you for your attention.
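
For reference, one possible workaround (a sketch based only on the snippets above, not the maintainer's fix) is to make the batch-size lookup tolerate both cases:

from typing import Any

def batch_size_of(samples: Any) -> int:
    """Batch size whether samples is a NestedTensor or a dict of HDF5 features."""
    if isinstance(samples, dict):
        # Frozen/HDF5 mode: any stored feature tensor carries the batch dimension.
        return next(iter(samples.values())).shape[0]
    # End-to-end mode: samples is a NestedTensor with a .tensors attribute.
    return samples.tensors.shape[0]

# In train_sc, b_s = batch_size_of(batch['samples']) would then cover both modes.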

inference

I ran the inference notebook and realized that every time I run the Inference and Decode cells (from torch.no_grad to the end), different captions are generated. For example, with the same image, 'three sheep standing next to a fence in a field', 'two sheep standing next to a fence in a field', or 'three sheep standing next to a fence in the grass' is generated.
Can you check my issue?
Thank you.

Some issues regarding generating vocab.json files

Example of how you previously answered other people's questions:

Suppose that it is similar to the English tokenizer; you can obtain a vocab.json file by:

from datasets.caption.field import TextField

text_field = TextField(vocab_path="path_to_save_vocab.json", build_vocab=True)

given a list of captions

source = [
    "This is a first caption",
    "This is a second caption",
    ....
]

text_field.build_vocab(source)
That's how it works.

Hello author, according to your example, the vocab.json file I generated contains "freqs" and "itos", but I cannot obtain "stoi". Could you tell me why? Or is 'stoi' not needed for GRIT training?
I would greatly appreciate it if you could reply as soon as possible.

The inference script is not generating a complete caption.

Hi, thank you for sharing this great work.

I'm trying to reproduce the paper's result on the 5k Karpathy test split using the inference script, but I'm getting lower scores:

Bleu_1: 0.810
Bleu_2: 0.655
Bleu_3: 0.510
Bleu_4: 0.388
METEOR: 0.295
ROUGE_L: 0.587
CIDEr: 1.333
SPICE: 0.230

After some digging, I found that the caption is not fully generated.
I managed to reproduce the problem in Colab as well:

https://colab.research.google.com/drive/1BvtscubSujlxOFhOchVGNB79KkKYoMiH?usp=sharing

Request for Label to index file

Hey there, first of all, great work. We're exploring your project for our research and wanted to know if you can provide the object-label-to-index file.

It seems you've created 1,849 classes by aggregating the 4 datasets, and we're unable to regenerate this mapping. We would really appreciate it.

Train your own coco format data

Hi, this is really great work!
I want to train on my own dataset now, but I have some questions, such as how to generate the vocab.json. Can you give the exact script to process the dataset? If so, I'd appreciate it!

Missing detection dataset

Dear author, the object detector is missing some important files referenced for training in train_config.yaml. Can you provide those files? Thanks.
${oc.env:DATA_ROOT}/vg/annotations/train_ann_lmdb
${oc.env:HOME}/datasets/vg/annotations/train_objects.json
${oc.env:HOME}/datasets/vg/annotations/attribute2ind.json
${oc.env:HOME}/datasets/vg/annotations/oid2attr.json

Some problems on freezing training

Dear author, I used the code you provided to freeze the backbone and detector and train the model. I trained on four GTX 1080s; only the parameters related to distributed training were changed, and all other parameters remained unchanged. However, after 10 epochs of XE and 10 epochs of SC, the final CIDEr score was 132.0. What might be the cause of this, and what should I do to improve the model's performance?

Question about freeze training

Dear Author, I used the code you provided to freeze the backbone and detector and train the model, but after 10 epochs of XE and 10 epochs of SC, the final CIDEr score is 38.0. Have you met the same problem, or do you have any idea what the possible reason is?

Number of epochs

How do I reduce the number of epochs? And how many epochs is the model set to train for originally?

Error executing job with overrides

Dear Author,
I have encountered such an error:

Epoch 0 - train: 100%|████████████████████| 566435/566435 [40:02:24<00:00, 3.93it/s, loss=3.45]
Epoch 0 - validation: 0%| | 2/1563 [00:04<55:13, 2.12s/it, loss=3.78]
Error executing job with overrides: []
Traceback (most recent call last):
File "train_demo.py", line 324, in run_main
main(config)
File "train_demo.py", line 174, in main
train_res = train_xe(
File "J:\GRIT\engine\caption_engine.py", line 383, in train_xe
val_loss = evaluate_loss(model, dataloaders['valid'], loss_fn, text_field, epoch, writer)
File "J:\GRIT\engine\caption_engine.py", line 298, in evaluate_loss
out = model(batch['samples'], batch['captions'])
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\caption\transformer.py", line 89, in forward
vis_inputs = self.detector(images)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\caption\detector.py", line 53, in forward
features = self.backbone(x)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 662, in forward
x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 448, in forward
x = blk(x, attn_mask)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 279, in forward
attn_windows = self.attn(x_windows, mask=attn_mask)  # nW*B, window_size*window_size, C
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 171, in forward
attn = attn + relative_position_bias.unsqueeze(0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 568.00 MiB (GPU 0; 6.00 GiB total capacity; 3.99 GiB already allocated; 0 bytes free; 4.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Is there any way to solve it?

Some problems about load pretrain model

Dear author @davidnvq,
It is a great work! When I use it, some problems occur; how can I solve them? I use the "detector_checkpoint_vg.pth" pretrained model, but I get the error shown in the attached screenshot.

Could I change the code at line 69?
Before the change:
detector=detector.module,
After the change:
detector=detector.det_module,

multi gpus training cause out of memory error

When I try to run the train_caption.py script like this:

export DATA_ROOT=path/to/coco_dataset
python train_caption.py exp.name=caption_rds model.detector.checkpoint=4ds_detector_path

I encountered some errors like this (see the attached screenshot).

Below are the changes I made in coco_config.yml:
ngpus_per_node: 2
world_size: 2
batch_size: 4
num_workers: 2

However, when I set ngpus_per_node: 1 and world_size: 1, it runs properly (see the attached screenshot).

Can anyone help? Thanks a lot!
