Official code for CVPR 2022 (Oral) paper "Deep Visual Geo-localization Benchmark"

License: MIT License

benchmark computer-vision datasets deep-learning geolocalization localization place-recognition pytorch descriptors image-retrieval

deep-visual-geo-localization-benchmark's Introduction

Hello, I'm Gabriele Berton, a PhD student from the Vandal Lab at Polytechnic of Turin!

deep-visual-geo-localization-benchmark's Issues

[Questions] Input image size

Hi, thanks for your great work! 😄

I have a question about the size of the input image.

I found that in the MSLS dataset the image size is not consistent: some images are large and some are small, unlike in the Pitts-250k dataset.

So, I want to ask: how should input images of different sizes be handled during training?

I would like to use ViT or one of its variants as the backbone. ViT needs a uniform input size (224×224), so how should large inputs (e.g. 600×1200) be handled? Directly resizing the images to a uniform size loses information and leads to poor results.
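One option sometimes used with fixed-input backbones is to scale the image so it fits inside the target size while keeping its aspect ratio, then pad the remainder. The sketch below only computes the sizes involved; the function name and the 1200×600 (width × height) example are assumptions for illustration and are not part of this repository.

```python
def letterbox_size(width, height, target_w, target_h):
    """Compute the resized size and padding needed to fit an image
    into a (target_w, target_h) canvas without changing its aspect ratio.

    Returns (new_w, new_h, pad_left, pad_top).
    """
    # Use the smaller scale factor so that both sides fit in the canvas.
    scale = min(target_w / width, target_h / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Center the resized image; the rest of the canvas is padding.
    pad_left = (target_w - new_w) // 2
    pad_top = (target_h - new_h) // 2
    return new_w, new_h, pad_left, pad_top


# Hypothetical example: a 1200x600 image placed on a 224x224 canvas.
print(letterbox_size(1200, 600, 224, 224))
```

Padding keeps the whole field of view at the cost of resolution. Whether this works better than a plain resize (for example through the --resize flag that appears in a training command further down this page) would have to be checked experimentally.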

Visualize the predictions

Hello, @gmberton @ga1i13o, I really appreciate your great work. I would like to obtain the list of the top N images retrieved for each query of the test dataset at test time. A txt file containing the prediction results would be enough, so that I can verify the retrieval ability of the algorithms mentioned in the paper.
Is it possible to add a piece of code to this benchmark to achieve this? What should I do? Could you give me some suggestions?
I look forward to your kind response.
Best regards.
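A minimal sketch of such an export, assuming the test code already produces, for each query, a list of database indices sorted from most to least similar. The names predictions, query_paths and database_paths are hypothetical placeholders and not necessarily the variables used in this repository.

```python
def format_predictions(predictions, query_paths, database_paths, top_n=5):
    """Build one text line per query: the query path followed by the
    paths of its top_n retrieved database images, tab-separated."""
    lines = []
    for query_index, database_indices in enumerate(predictions):
        retrieved = [database_paths[i] for i in database_indices[:top_n]]
        lines.append("\t".join([query_paths[query_index]] + retrieved))
    return lines


def write_predictions(out_path, predictions, query_paths, database_paths, top_n=5):
    """Write the lines produced by format_predictions to a txt file."""
    lines = format_predictions(predictions, query_paths, database_paths, top_n)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

Each line of the resulting file can then be inspected by hand or loaded by a separate visualization script.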

Advice on adjusting output dimension in Resnet101conv4 for Deep Visual Geo-localization Benchmark

Hello, @gmberton!

I am currently working on a project involving deep visual geo-localization benchmark using a ResNet101conv4 model, which by default has an output dimension of 1024. For my specific application, I need to adjust the model's output dimension to 2048. I have identified two potential approaches to achieve this, and I would greatly appreciate your insights on which method might be more suitable or if there's another recommended strategy.

1. Utilizing a command-line argument to set the output dimension directly in the parser with --fc_output_dim=2048.

2. Modifying the network.py file to manually insert an additional Convolutional Layer into the existing CNN architecture, using the following code:
layers.append(nn.Conv2d(1024, 2048, kernel_size=(1,1), stride=(1,1), bias=False)).

Could you please provide guidance on the advantages or disadvantages of these approaches in the context of Deep Visual Geo-localization Benchmark? Which method would you recommend for effectively changing the output dimension while maintaining or enhancing model performance?

Looking forward to your reply. Thank you in advance!

Error raised when downloading the pretrained model

Hi, while reproducing the result of cct+netvlad, I used the command:
python train.py --dataset_name=msls --backbone=cct384 --aggregation=netvlad --mining=partial --trunc_te=8 --freeze_te=1 --resize 384 384 --negs_num_per_query=5
and then this error was raised:
urllib.error.HTTPError: HTTP Error 404: Condition Intercepted
It seems that I cannot reach the website hosting the pretrained model. When I open the URL directly, it shows:

thank you for your request. Unfortunately, the page you requested /~alih/compact transformers/checkpoints/finetuned/cct_14_7x2_384_imagenet.pth does not exist.

Issue with Saving Checkpoint in Training Loop

Hello, @gmberton!

I've encountered an issue when using the --resume option to load the best_model.pth file. Specifically, while the "model_state_dict" and "recalls" in the checkpoint file correctly store the weights and recall values for the best-performing epoch, the "epoch_num", "best_r5", and "not_improved_num" are from the epoch immediately before the best epoch.

For example, consider the following training log where epoch 77 achieves the highest R@5 of 79.1, after which the training stops:

2024-05-20 14:38:01   Start training epoch: 77
2024-05-20 14:38:01   Cache: 0 / 5
2024-05-20 14:40:31   Epoch[77](0/5): current batch triplet loss = 0.0058, average epoch triplet loss = 0.0059
2024-05-20 14:40:31   Cache: 1 / 5
2024-05-20 14:42:58   Epoch[77](1/5): current batch triplet loss = 0.0030, average epoch triplet loss = 0.0066
2024-05-20 14:42:58   Cache: 2 / 5
2024-05-20 14:45:25   Epoch[77](2/5): current batch triplet loss = 0.0096, average epoch triplet loss = 0.0065
2024-05-20 14:45:25   Cache: 3 / 5
2024-05-20 14:47:54   Epoch[77](3/5): current batch triplet loss = 0.0000, average epoch triplet loss = 0.0066
2024-05-20 14:47:54   Cache: 4 / 5
2024-05-20 14:50:23   Epoch[77](4/5): current batch triplet loss = 0.0039, average epoch triplet loss = 0.0065
2024-05-20 14:50:23   Finished epoch 77 in 0:12:22, average epoch triplet loss = 0.0065
2024-05-20 14:50:23   Extracting database features for evaluation/testing
2024-05-20 14:51:36   Extracting queries features for evaluation/testing
2024-05-20 14:52:17   Calculating recalls
2024-05-20 14:52:19   Recalls on val set < BaseDataset, msls - #database: 18871; #queries: 11084 >: R@1: 65.5, R@5: 79.1, R@10: 82.7, R@20: 86.0
2024-05-20 14:52:20   Improved: previous best R@5 = 78.4, current R@5 = 79.1

The best_model.pth file then contains:

epoch_num: <class 'int'>
  value: 77
model_state_dict: <class 'collections.OrderedDict'>
optimizer_state_dict: <class 'dict'>
recalls: <class 'numpy.ndarray'>
  value: [65.53590761 79.12306027 82.70479971 86.02490076]
best_r5: <class 'numpy.float64'>
  value: 78.44640923854205
not_improved_num: <class 'int'>
  value: 0

When I resume training with this checkpoint, the log shows:

2024-05-21 01:40:17   Loaded checkpoint: start_epoch_num = 77, current_best_R@5 = 78.4
2024-05-21 01:40:17   Resuming from epoch 77 with best recall@5 78.4

It appears that the checkpoint is saved before best_r5 and related variables are updated.
I think this issue can be resolved by updating these variables before saving the checkpoint.

Current train.py code:

...

is_best = recalls[1] > best_r5

# Save checkpoint, which contains all training parameters
util.save_checkpoint(args, {
    "epoch_num": epoch_num, "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(), "recalls": recalls, "best_r5": best_r5,
    "not_improved_num": not_improved_num
}, is_best, filename="last_model.pth")

# If recall@5 did not improve for "many" epochs, stop training
if is_best:
    logging.info(f"Improved: previous best R@5 = {best_r5:.1f}, current R@5 = {recalls[1]:.1f}")
    best_r5 = recalls[1]
    best_epoch = epoch_num
    not_improved_num = 0
else:
    not_improved_num += 1
    logging.info(f"Not improved: {not_improved_num} / {args.patience}: best R@5 = {best_r5:.1f} at epoch: {best_epoch:.1f}, current R@5 = {recalls[1]:.1f}")
    if not_improved_num >= args.patience:
        logging.info(f"Performance did not improve for {not_improved_num} epochs. Stop training.")
        break

Proposed change:

is_best = recalls[1] > best_r5

# If recall@5 did not improve for "many" epochs, stop training
if is_best:
    logging.info(f"Improved: previous best R@5 = {best_r5:.1f}, current R@5 = {recalls[1]:.1f}")
    best_r5 = recalls[1]
    best_epoch = epoch_num
    not_improved_num = 0
else:
    not_improved_num += 1
    logging.info(f"Not improved: {not_improved_num} / {args.patience}: best R@5 = {best_r5:.1f} at epoch: {best_epoch:.1f}, current R@5 = {recalls[1]:.1f}")
    if not_improved_num >= args.patience:
        logging.info(f"Performance did not improve for {not_improved_num} epochs. Stop training.")
        break

# Save checkpoint, which contains all training parameters
util.save_checkpoint(args, {
    "epoch_num": epoch_num+1, "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(), "recalls": recalls, "best_r5": best_r5,
    "not_improved_num": not_improved_num
}, is_best, filename="last_model.pth")

Would this change be a proper solution to ensure that the best_model.pth file correctly reflects the best-performing epoch?

Looking forward to your reply. Thank you in advance!
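The ordering issue can be reproduced without any model. The sketch below is a simplified stand-in for the training loop, not the repository's code: run_epochs and the plain-dict checkpoints are hypothetical, and the R@5 values are made up apart from 78.4 and 79.1, which come from the log above. It updates the tracking variables first, then builds the checkpoint, and only then checks the patience limit, so the final epoch is still saved.

```python
def run_epochs(r5_per_epoch, patience=3):
    """Simulate early stopping and return (best_checkpoint, last_checkpoint)."""
    best_r5 = 0.0
    not_improved_num = 0
    best_checkpoint = None
    last_checkpoint = None
    for epoch_num, r5 in enumerate(r5_per_epoch):
        is_best = r5 > best_r5
        # Update the tracking variables BEFORE building the checkpoint.
        if is_best:
            best_r5 = r5
            not_improved_num = 0
        else:
            not_improved_num += 1
        checkpoint = {
            "epoch_num": epoch_num + 1,  # epoch to resume from
            "recall_r5": r5,
            "best_r5": best_r5,
            "not_improved_num": not_improved_num,
        }
        last_checkpoint = checkpoint
        if is_best:
            best_checkpoint = checkpoint
        # Check patience only after saving, so the last epoch is not lost.
        if not_improved_num >= patience:
            break
    return best_checkpoint, last_checkpoint


best, last = run_epochs([70.0, 78.4, 79.1, 79.0])
print(best)
print(last)
```

With this ordering, best_r5 stored in the best checkpoint equals the recall of the epoch that produced it, which is the consistency the issue asks for.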

How to reproduce the ViT-NetVLAD results

Hi, thanks for your great work on the VPR benchmark!
I am trying to reproduce the ViT-NetVLAD results, but I don't know how to proceed. Could you show me the steps needed to reproduce them?
Another question: I checked the ViT model in network.py, but I cannot find its structure, such as the number of blocks or other details. Can I plug a different ViT model into your code structure?

Bug when training my own model on the pitts30k dataset

When I train my model, this error occurs.
I checked my code and found that the bug occurs in this section of the code.

When images.shape is (2,3,xxx,xxx), features.shape becomes (2048), not (2,1024). I checked my model carefully and found no issues.

A problem during training

Hello! The following problem occurred during my training: there was no response for a long time, about a few hours. What could cause this?

Use of CCT-NetVLAD

Hello, I really appreciate your great work. I'm having some trouble with the model you provided. For example, I want to use it (CCT-NetVLAD) to extract a 1D descriptor vector from an image. How should I proceed?
I'm sorry to bother you.
Best wishes!
Thank you in advance.

torchscan: list index out of range

Hi,

torchscan does not seem to support CUDA 11.

question about aggregation

Hi, @gmberton @ga1i13o
If I don't want to use any aggregation and only want to use the features output by the backbone as the image representation for matching, how should I modify the program? Do you have any suggestions for running without aggregation?

Producing SARE loss results on vanilla NetVLAD (Vgg16 based)

I can see the loss functions (sare-joint and sare-ind) in the code. How can I reproduce the results of their paper?

Liu Liu, Hongdong Li, and Yuchao Dai. Stochastic Attraction-Repulsion Embedding for Large Scale Image Localization. In IEEE International Conference on Computer Vision, 2019.

I tried the following command, but without success:

python3 train.py --dataset_name=pitts30k --backbone=vgg16 --criterion=sare_joint

Could you please suggest something?

Pretrained Model Links are not valid

Hi,

The model links are no longer valid.

Pretrained models on Google Landmarks v2 and Places 365

PRETRAINED_MODELS = {
    'resnet18_places': '1DnEQXhmPxtBUrRc81nAvT8z17bk-GBj5',
    'resnet50_places': '1zsY4mN4jJ-AsmV3h4hjbT72CBfJsgSGC',
    'resnet101_places': '1E1ibXQcg7qkmmmyYgmwMTh7Xf1cDNQXa',
    'vgg16_places': '1UWl1uz6rZ6Nqmp1K5z3GHAIZJmDh4bDu',
    'resnet18_gldv2': '1wkUeUXFXuPHuEvGTXVpuP5BMB-JJ1xke',
    'resnet50_gldv2': '1UDUv6mszlXNC1lv6McLdeBNMq9-kaA70',
    'resnet101_gldv2': '1apiRxMJpDlV0XmKlC5Na_Drg2jtGL-uE',
    'vgg16_gldv2': '10Ov9JdO7gbyz6mB5x0v_VSAUMj91Ta4o',
}

How to understand the data in Table 3 of the article?

In section 4.1, you mention the following: "Moreover, results considerably depend on the training data: as an example, training the same network on Pitts30k or MSLS yields a 30% gap testing the model on St. Lucia, as well as a noticeable difference on other datasets too. This effect demonstrates that comparing models trained on different datasets, as done in [85], can be misleading."

In the fourth row of Table 3, the model was trained on MSLS and then tested on MSLS, where it performs far worse than when tested on St. Lucia. How should this be understood? Generally speaking, a model trained on dataset A should perform better on dataset A than on other datasets.

About the reproduced results on the tokyo 247 dataset

Hi, gmberton!
I trained a model using the pitts30k dataset and evaluated it on the tokyo 247 dataset. The output is as follows:
2022-09-15 09:26:21 Calculating recalls
2022-09-15 09:27:12 Recalls on < BaseDataset, tokyo247 - #database: 75984; #queries: 315 >: R@1: 55.2, R@5: 69.5, R@10: 75.2, R@20: 77.5
2022-09-15 09:27:12 Finished in 0:04:37

Here are some key parameters I use:
aggregation='netvlad', backbone='resnet18conv4', mining='partial', train_batch_size=16
I guess the difference in the results is caused by my change to the batch size.
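For reference when comparing such numbers, recall@N is usually computed as the percentage of queries for which at least one of the top-N retrieved database images is a correct match. The sketch below follows that common definition; the function name and both arguments are hypothetical, and how positives are determined (for example, a distance threshold) is left outside the sketch.

```python
def compute_recalls(predictions, positives_per_query, recall_values=(1, 5, 10, 20)):
    """Return recall@N (in percent) for each N in recall_values.

    predictions: per query, database indices sorted by decreasing similarity.
    positives_per_query: per query, the database indices that count as correct.
    """
    hits = [0] * len(recall_values)
    for retrieved, positives in zip(predictions, positives_per_query):
        positives = set(positives)
        for i, n in enumerate(recall_values):
            # A query is a hit at N if any of its top-N results is a positive.
            if any(index in positives for index in retrieved[:n]):
                hits[i] += 1
    num_queries = len(predictions)
    return [100.0 * h / num_queries for h in hits]
```

With only 315 queries, as in the log above, a single query changes each recall value by roughly 0.3 points, so small differences between runs should be read with care.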

So I downloaded the model you provided and ran the following command:
python3 eval.py --backbone=resnet18conv4 --aggregation=netvlad --resume=logs/pretrained/pitt_r18l3_netvlad_full.pth --dataset_name=tokyo247
Then I got this error:
Traceback (most recent call last):
  File "eval.py", line 89, in <module>
    state_dict = torch.load(args.resume)["model_state_dict"]
KeyError: 'model_state_dict'
Am I typing something wrong in the terminal?
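One possible cause, stated here as an assumption rather than a confirmed diagnosis, is that the downloaded file stores the weights directly instead of wrapping them under a "model_state_dict" key. A small helper that accepts both layouts could look like this (extract_state_dict is a hypothetical name, not a function of this repository):

```python
def extract_state_dict(checkpoint, key="model_state_dict"):
    """Return the model weights from a loaded checkpoint.

    Handles two layouts:
    - a training checkpoint: {"model_state_dict": {...}, "epoch_num": ..., ...}
    - a bare state dict: {"layer.weight": ..., ...}
    """
    if isinstance(checkpoint, dict) and key in checkpoint:
        return checkpoint[key]
    # Assume the object already is the state dict.
    return checkpoint
```

In eval.py this would be used as state_dict = extract_state_dict(torch.load(args.resume)). If the key names still do not match the model afterwards, the checkpoint was probably saved for a different architecture or wrapper.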

I also found a small error in "README.md".
In the sections "Pretained networks employing different backbones" and "Pretrained models with different mining methods", some of the results seem to be mismatched.

Some questions about VITWrapper in network.py

Hi there, I noticed the VitWrapper in network.py. Could you explain its purpose? Why should the ViT backbone drop the class token, via self.vit_model(x).last_hidden_state[:, 1:, :], when it is connected to a NetVLAD or GeM aggregation layer? ^ ^
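A plausible reading, offered as an assumption and not as the authors' stated reason: NetVLAD and GeM pool over local descriptors laid out on a spatial grid, while the class token at index 0 of the ViT output sequence is not tied to any image location. The slice [:, 1:, :] keeps only the patch tokens. The pure-Python sketch below shows the same operation for a single image, with a hypothetical helper name and nested lists in place of tensors.

```python
def patch_tokens_to_grid(tokens, grid_h, grid_w):
    """Drop the class token and arrange the patch tokens as a grid.

    tokens: list of 1 + grid_h * grid_w token vectors, class token first.
    Returns a grid_h x grid_w nested list of token vectors.
    """
    patch_tokens = tokens[1:]  # same idea as last_hidden_state[:, 1:, :]
    if len(patch_tokens) != grid_h * grid_w:
        raise ValueError("number of patch tokens does not match the grid size")
    return [
        patch_tokens[row * grid_w:(row + 1) * grid_w]
        for row in range(grid_h)
    ]


# Toy example: one class token followed by a 2x2 grid of patch tokens.
tokens = [["cls"], ["p0"], ["p1"], ["p2"], ["p3"]]
print(patch_tokens_to_grid(tokens, 2, 2))
```

After this step the patch tokens can be treated like a convolutional feature map, which is the input format that spatial aggregation layers expect.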

WEBSITE COULD NOT BE REACHED

Hi, the website posted in README cannot be reached.

reproduce the result of CCT-NetVLAD

Hi, I really appreciate the work you've done. I wonder if the CCT-NetVLAD model will be released. How should I set the parameters (trunc_te/freeze_te) to reproduce the result of CCT-NetVLAD on the MSLS dataset?

Why apply L2Norm before GeM?

Hi, thanks for this nice work!
I am confused by one thing: why do you use L2Norm before GeM?
I also studied the architecture proposed in the original GeM paper, in which the authors normalize the final vector instead of normalizing before the pooling layer. Have you ever compared the performance of applying L2Norm before and after the pooling layer?
Looking forward to your reply!
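To make the two orderings concrete: GeM raises each activation to a power p, averages over spatial locations, and takes the p-th root. The pure-Python sketch below compares normalizing each local descriptor before pooling with normalizing the pooled vector afterwards. It assumes that "L2Norm before GeM" means per-location normalization across channels; the function names and the toy feature map are illustrative only.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling over spatial locations.

    feature_map: list of local descriptors (one per location),
    each a list of C activations. Returns a C-dimensional vector.
    """
    num_locations = len(feature_map)
    num_channels = len(feature_map[0])
    pooled = []
    for c in range(num_channels):
        # Clamp to eps so that fractional powers stay well defined.
        mean_pow = sum(max(d[c], eps) ** p for d in feature_map) / num_locations
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


feature_map = [[3.0, 4.0], [0.0, 1.0]]  # 2 locations, 2 channels
norm_before = gem_pool([l2_normalize(d) for d in feature_map])
norm_after = l2_normalize(gem_pool(feature_map))
print(norm_before, norm_after)
```

Normalizing first gives every location the same weight regardless of its activation magnitude, whereas normalizing afterwards lets high-magnitude locations dominate the pooled vector. Which ordering retrieves better is an empirical question that this sketch does not answer.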

Modelzoo

Great paper!
Where can I find the trained models?
Thanks

Is this website (https://deep-vg-bench.herokuapp.com/) unavailable?

Is this website (https://deep-vg-bench.herokuapp.com/) unavailable?
It is great work, but unfortunately I can't open this website.

Network structure of CCT-NetVLAD

Thanks for your great work!
I have some questions about the network structure of CCT-NetVLAD. I'm not sure whether its structure includes a sequence pooling (SeqPool) layer. Which layer of the CCT is connected to the NetVLAD layer?
I'm sorry to bother you.
Looking forward to your reply!

ViT model

Thanks for your great work! Can you provide a download of your ViT model and the training configurations?

Hi, I have selected gl18-tl-resnet50-gem-w, how can I convert the vector dimension to 512?

By default, gl18-tl-resnet50-gem-w outputs a vector dimension of 2048, but what if I want to convert it to 512 dimensions? What should I do?
