Hello, I'm Gabriele Berton, a PhD student from the Vandal Lab at Polytechnic of Turin!
Languages and Tools:
Official code for CVPR 2022 (Oral) paper "Deep Visual Geo-localization Benchmark"
License: MIT License
Hi, thanks for your great work! 😄
I have a question about the size of the input images.
I found that the image sizes in the MSLS dataset are not consistent: some images are large and some are small, and they do not match the image size of the Pitts-250k dataset.
So I want to ask: how should input images of different sizes be handled during training?
I would like to use ViT or one of its variants as the backbone. As we all know, ViT needs a uniform input size (224×224), so how should large inputs (e.g. 600×1200) be handled? Directly resizing every image to a uniform size loses information and leads to poor results.
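One common way to handle mixed image sizes, sketched here as an assumption and not as what the benchmark itself does, is to scale each image so that it fits the target resolution while keeping its aspect ratio, and then pad the remaining area. The helper name fit_with_padding is hypothetical; it only computes the resized shape and the padding, so it can be paired with any image library.

```python
def fit_with_padding(height, width, target_h, target_w):
    """Return (new_h, new_w, pad_h, pad_w) for an aspect-preserving resize.

    The image is scaled by the largest factor that still fits inside
    (target_h, target_w); pad_h and pad_w are the pixels left to fill.
    """
    scale = min(target_h / height, target_w / width)
    # Clamp to guard against floating-point rounding past the target size.
    new_h = min(target_h, max(1, round(height * scale)))
    new_w = min(target_w, max(1, round(width * scale)))
    return new_h, new_w, target_h - new_h, target_w - new_w


# A 600x1200 image squeezed into a 224x224 input keeps its 1:2 shape
# and leaves 112 rows to pad.
print(fit_with_padding(600, 1200, 224, 224))  # (112, 224, 112, 0)
```

Padding keeps the geometry intact at the cost of unused input area, so whether it beats a direct resize is something to check empirically.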
Hello @gmberton @ga1i13o, I really appreciate your great work. At test time, I would like to obtain a list of the top N images retrieved for each query in the test dataset. A txt file containing the prediction results would be enough, so that I can verify the retrieval ability of the algorithms mentioned in the paper.
Is it possible to add a piece of code to this benchmark to achieve this? What should I do? Could you give me some suggestions?
I look forward to your kind response.
Best regards.
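A minimal sketch of such an export, assuming the test routine already produces, for every query, a ranked list of database indices (for example from a nearest-neighbour search over the descriptors). The function name write_top_n and its arguments are hypothetical and are not part of the benchmark code.

```python
def write_top_n(predictions, query_paths, database_paths, out_file, n=5):
    """Write one line per query: the query path, then its top-n database paths.

    predictions[i] is assumed to be the ranked list of database indices
    retrieved for query i, best match first. Fields are tab-separated.
    """
    with open(out_file, "w", encoding="utf-8") as f:
        for query_index, ranked in enumerate(predictions):
            retrieved = [database_paths[i] for i in ranked[:n]]
            f.write("\t".join([query_paths[query_index]] + retrieved) + "\n")
```

Called once after the recalls are computed, this would give a file that can be inspected by hand or compared across methods.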
Hello, @gmberton!
I am currently working on a project involving deep visual geo-localization benchmark using a ResNet101conv4 model, which by default has an output dimension of 1024. For my specific application, I need to adjust the model's output dimension to 2048. I have identified two potential approaches to achieve this, and I would greatly appreciate your insights on which method might be more suitable or if there's another recommended strategy.
1. Utilizing a command-line argument to set the output dimension directly in the parser with --fc_output_dim=2048.
2. Modifying the network.py file to manually insert an additional Convolutional Layer into the existing CNN architecture, using the following code:
layers.append(nn.Conv2d(1024, 2048, kernel_size=(1,1), stride=(1,1), bias=False)).
Could you please provide guidance on the advantages or disadvantages of these approaches in the context of Deep Visual Geo-localization Benchmark? Which method would you recommend for effectively changing the output dimension while maintaining or enhancing model performance?
Looking forward to your reply. Thank you in advance!
Hi, while trying to reproduce the cct+netvlad result, I used the command:
python train.py --dataset_name=msls --backbone=cct384 --aggregation=netvlad --mining=partial --trunc_te=8 --freeze_te=1 --resize 384 384 --negs_num_per_query=5
which raised the error:
urllib.error.HTTPError: HTTP Error 404: Condition Intercepted
It seems that the URL of the pretrained model cannot be reached. When I open the URL directly in a browser, it shows:
thank you for your request. Unfortunately, the page you requested /~alih/compact transformers/checkpoints/finetuned/cct_14_7x2_384_imagenet.pth does not exist.
Hello, @gmberton!
I've encountered an issue when using the --resume option to load the best_model.pth file. Specifically, while the "model_state_dict" and "recalls" in the checkpoint file correctly store the weights and recall values for the best-performing epoch, the "epoch_num", "best_r5", and "not_improved_num" are from the epoch immediately before the best epoch.
For example, consider the following training log where epoch 77 achieves the highest R@5 of 79.1, after which the training stops:
2024-05-20 14:38:01 Start training epoch: 77
2024-05-20 14:38:01 Cache: 0 / 5
2024-05-20 14:40:31 Epoch[77](0/5): current batch triplet loss = 0.0058, average epoch triplet loss = 0.0059
2024-05-20 14:40:31 Cache: 1 / 5
2024-05-20 14:42:58 Epoch[77](1/5): current batch triplet loss = 0.0030, average epoch triplet loss = 0.0066
2024-05-20 14:42:58 Cache: 2 / 5
2024-05-20 14:45:25 Epoch[77](2/5): current batch triplet loss = 0.0096, average epoch triplet loss = 0.0065
2024-05-20 14:45:25 Cache: 3 / 5
2024-05-20 14:47:54 Epoch[77](3/5): current batch triplet loss = 0.0000, average epoch triplet loss = 0.0066
2024-05-20 14:47:54 Cache: 4 / 5
2024-05-20 14:50:23 Epoch[77](4/5): current batch triplet loss = 0.0039, average epoch triplet loss = 0.0065
2024-05-20 14:50:23 Finished epoch 77 in 0:12:22, average epoch triplet loss = 0.0065
2024-05-20 14:50:23 Extracting database features for evaluation/testing
2024-05-20 14:51:36 Extracting queries features for evaluation/testing
2024-05-20 14:52:17 Calculating recalls
2024-05-20 14:52:19 Recalls on val set < BaseDataset, msls - #database: 18871; #queries: 11084 >: R@1: 65.5, R@5: 79.1, R@10: 82.7, R@20: 86.0
2024-05-20 14:52:20 Improved: previous best R@5 = 78.4, current R@5 = 79.1
The best_model.pth file then contains:
epoch_num: <class 'int'>
    value: 77
model_state_dict: <class 'collections.OrderedDict'>
optimizer_state_dict: <class 'dict'>
recalls: <class 'numpy.ndarray'>
    value: [65.53590761 79.12306027 82.70479971 86.02490076]
best_r5: <class 'numpy.float64'>
    value: 78.44640923854205
not_improved_num: <class 'int'>
    value: 0
When I resume training with this checkpoint, the log shows:
2024-05-21 01:40:17 Loaded checkpoint: start_epoch_num = 77, current_best_R@5 = 78.4
2024-05-21 01:40:17 Resuming from epoch 77 with best recall@5 78.4
It appears that the checkpoint is saved before best_r5 and related variables are updated.
I think this issue can be resolved by updating these variables before saving the checkpoint.
Current train.py code:
...
is_best = recalls[1] > best_r5
# Save checkpoint, which contains all training parameters
util.save_checkpoint(args, {
    "epoch_num": epoch_num, "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(), "recalls": recalls, "best_r5": best_r5,
    "not_improved_num": not_improved_num
}, is_best, filename="last_model.pth")
# If recall@5 did not improve for "many" epochs, stop training
if is_best:
    logging.info(f"Improved: previous best R@5 = {best_r5:.1f}, current R@5 = {recalls[1]:.1f}")
    best_r5 = recalls[1]
    best_epoch = epoch_num
    not_improved_num = 0
else:
    not_improved_num += 1
    logging.info(f"Not improved: {not_improved_num} / {args.patience}: best R@5 = {best_r5:.1f} at epoch: {best_epoch:.1f}, current R@5 = {recalls[1]:.1f}")
    if not_improved_num >= args.patience:
        logging.info(f"Performance did not improve for {not_improved_num} epochs. Stop training.")
        break
Proposed change:
is_best = recalls[1] > best_r5
# If recall@5 did not improve for "many" epochs, stop training
if is_best:
    logging.info(f"Improved: previous best R@5 = {best_r5:.1f}, current R@5 = {recalls[1]:.1f}")
    best_r5 = recalls[1]
    best_epoch = epoch_num
    not_improved_num = 0
else:
    not_improved_num += 1
    logging.info(f"Not improved: {not_improved_num} / {args.patience}: best R@5 = {best_r5:.1f} at epoch: {best_epoch:.1f}, current R@5 = {recalls[1]:.1f}")
    if not_improved_num >= args.patience:
        logging.info(f"Performance did not improve for {not_improved_num} epochs. Stop training.")
        break
# Save checkpoint, which contains all training parameters
util.save_checkpoint(args, {
    "epoch_num": epoch_num+1, "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(), "recalls": recalls, "best_r5": best_r5,
    "not_improved_num": not_improved_num
}, is_best, filename="last_model.pth")
Would this change be a proper solution to ensure that the best_model.pth file correctly reflects the best-performing epoch?
Looking forward to your reply. Thank you in advance!
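The ordering effect described in this report can be reproduced without the training code. The sketch below is a toy simulation, not the benchmark's train.py: run_epoch and its dictionary fields are hypothetical stand-ins for the bookkeeping variables, and the numbers are the rounded values from the log above.

```python
def run_epoch(recall_at_5, state, update_before_save):
    """Simulate the end of one epoch and return the checkpoint that would be saved."""
    is_best = recall_at_5 > state["best_r5"]

    def snapshot():
        # Copy of the values that would be written to disk at this moment.
        return {"recall_r5": recall_at_5,
                "best_r5": state["best_r5"],
                "not_improved_num": state["not_improved_num"]}

    checkpoint = None
    if not update_before_save:
        checkpoint = snapshot()  # saved with the stale bookkeeping values
    if is_best:
        state["best_r5"] = recall_at_5
        state["not_improved_num"] = 0
    else:
        state["not_improved_num"] += 1
    if update_before_save:
        checkpoint = snapshot()  # saved after the bookkeeping is updated
    return checkpoint


old = run_epoch(79.1, {"best_r5": 78.4, "not_improved_num": 0}, update_before_save=False)
new = run_epoch(79.1, {"best_r5": 78.4, "not_improved_num": 0}, update_before_save=True)
print(old["best_r5"], new["best_r5"])  # 78.4 79.1
```

One detail worth checking in the proposed ordering: the break now runs before util.save_checkpoint, so the epoch that triggers early stopping would not be written to last_model.pth.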
Hi, thanks for your great work on the VPR benchmark!
However, I do not know how to reproduce the ViT-NetVLAD result. Could you show me the steps to reproduce it?
Another question: I checked the ViT model in network.py, but I cannot find its structure, such as the number of blocks or other details. Can I swap in a different ViT model within your code structure?
Hello, I really appreciate your great work. I'm having some trouble with the model you provided. For example, I want to use the provided CCT-NetVLAD model to extract a 1D descriptor vector from an image. How should I proceed?
I'm sorry to bother you.
Best wishes!
Thank you in advance.
I can see the loss functions (sare-joint and sare-ind) in the code. How can I reproduce the results of their paper?
Liu Liu, Hongdong Li, and Yuchao Dai. Stochastic Attraction-Repulsion Embedding for Large Scale Image Localization. In IEEE International Conference on Computer Vision, 2019.
I tried the following command, but without success:
python3 train.py --dataset_name=pitts30k --backbone=vgg16 --criterion=sare_joint
Could you please suggest something?
Hi,
The model links are no longer valid.
PRETRAINED_MODELS = {
'resnet18_places' : '1DnEQXhmPxtBUrRc81nAvT8z17bk-GBj5',
'resnet50_places' : '1zsY4mN4jJ-AsmV3h4hjbT72CBfJsgSGC',
'resnet101_places' : '1E1ibXQcg7qkmmmyYgmwMTh7Xf1cDNQXa',
'vgg16_places' : '1UWl1uz6rZ6Nqmp1K5z3GHAIZJmDh4bDu',
'resnet18_gldv2' : '1wkUeUXFXuPHuEvGTXVpuP5BMB-JJ1xke',
'resnet50_gldv2' : '1UDUv6mszlXNC1lv6McLdeBNMq9-kaA70',
'resnet101_gldv2' : '1apiRxMJpDlV0XmKlC5Na_Drg2jtGL-uE',
'vgg16_gldv2' : '10Ov9JdO7gbyz6mB5x0v_VSAUMj91Ta4o'
}
In Section 4.1, you mention the following: "Moreover, results considerably depend on the training data: as an example, training the same network on Pitts30k or MSLS yields a 30% gap testing the model on St. Lucia, as well as a noticeable difference on other datasets too. This effect demonstrates that comparing models trained on different datasets, as done in [85], can be misleading."
In the fourth row of Table 3, the model was trained on MSLS and then tested on MSLS, where it performed far worse than on St. Lucia. How should this be understood? Generally speaking, a model trained on dataset A should perform better on dataset A than on other datasets.
Hi, gmberton!
I trained a model using the pitts30k dataset and evaluated it on the tokyo 247 dataset. The output is as follows:
2022-09-15 09:26:21 Calculating recalls
2022-09-15 09:27:12 Recalls on < BaseDataset, tokyo247 - #database: 75984; #queries: 315 >: R@1: 55.2, R@5: 69.5, R@10: 75.2, R@20: 77.5
2022-09-15 09:27:12 Finished in 0:04:37
Here are some key parameters I use:
aggregation='netvlad', backbone='resnet18conv4', mining='partial', train_batch_size=16
I guess the difference in the results is caused by my change to the batch size.
So I downloaded the model you provided and ran the following command:
python3 eval.py --backbone=resnet18conv4 --aggregation=netvlad --resume=logs/pretrained/pitt_r18l3_netvlad_full.pth --dataset_name=tokyo247
Then I got the error:
Traceback (most recent call last):
File "eval.py", line 89, in <module>
state_dict = torch.load(args.resume)["model_state_dict"]
KeyError: 'model_state_dict'
So am I typing something wrong in the terminal?
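The KeyError indicates that the loaded object has no "model_state_dict" key. One possible explanation, which is an assumption and not confirmed here, is that the downloaded file stores the weights directly instead of wrapping them in a training checkpoint. A small hypothetical helper, extract_state_dict, can accept both layouts:

```python
def extract_state_dict(checkpoint):
    """Return the model weights from either checkpoint layout.

    Training checkpoints are assumed to be dicts holding the weights under
    "model_state_dict"; anything else is treated as a bare state dict.
    """
    if isinstance(checkpoint, dict) and "model_state_dict" in checkpoint:
        return checkpoint["model_state_dict"]
    return checkpoint  # assumed to already be the weights themselves


wrapped = {"epoch_num": 77, "model_state_dict": {"layer.weight": [0.1, 0.2]}}
bare = {"layer.weight": [0.1, 0.2]}
print(extract_state_dict(wrapped) == extract_state_dict(bare))  # True
```

In eval.py this would wrap the existing call, for example extract_state_dict(torch.load(args.resume)), in place of indexing the key directly.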
I also found a small error in README.md.
In the sections "Pretrained networks employing different backbones" and "Pretrained models with different mining methods", some of the results seem to be mismatched.
Hi there, I noticed the VitWrapper in network.py. Could you explain what VitWrapper does? Why does the ViT backbone drop the class token with self.vit_model(x).last_hidden_state[:, 1:, :] when it is connected to a NetVLAD or GeM aggregation layer? ^ ^
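For context, in ViT-style models the token sequence usually starts with one learned class token followed by one token per image patch. NetVLAD and GeM aggregate local descriptors, so a plausible reading, offered here as an assumption, is that only the patch tokens are passed on. The sketch below mimics the [:, 1:, :] slice with plain nested lists; patch_tokens and num_tokens are hypothetical helpers.

```python
def num_tokens(image_size, patch_size):
    """Sequence length of a ViT with one class token and square patches."""
    patches_per_side = image_size // patch_size
    return 1 + patches_per_side * patches_per_side


def patch_tokens(hidden_states):
    """Drop the first (class) token of every sequence, like [:, 1:, :]."""
    return [sequence[1:] for sequence in hidden_states]


# One image: 1 class token followed by 4 patch tokens, hidden size 2.
batch = [[[9.0, 9.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]]
local_features = patch_tokens(batch)
print(len(batch[0]), len(local_features[0]))  # 5 4
print(num_tokens(224, 16))  # 197
```

The remaining tokens can be arranged back into a spatial grid, which is the kind of input a pooling layer designed for CNN feature maps expects.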
Hi, the website linked in the README cannot be reached.
Hi, I really appreciate the work you've done. I wonder whether the CCT-NetVLAD model will be released. How should I set the parameters (trunc_te/freeze_te) to reproduce the CCT-NetVLAD result on the MSLS dataset?
Hi, first of all, thanks for this nice work!
I am confused by one thing: why do you use L2Norm before GeM?
I also studied the architecture proposed in the original GeM paper, in which the authors normalize the final vector rather than normalizing before the pooling layer. Have you benchmarked the performance of applying L2Norm before versus after the pooling layer?
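To make the two orderings concrete, here is a small pure-Python sketch. GeM pools each channel as the p-th root of the mean of the activations raised to p, so p = 1 is average pooling and large p approaches max pooling. The function names are hypothetical, the feature map is toy data, and this is not the benchmark's implementation.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def gem(activations, p=3.0, eps=1e-6):
    """Generalized-mean pooling of one channel's activations."""
    clamped = [max(a, eps) for a in activations]
    return (sum(a ** p for a in clamped) / len(clamped)) ** (1.0 / p)


def normalize_locations(feature_map):
    """L2-normalize the channel vector at every spatial location."""
    columns = [l2_normalize(list(column)) for column in zip(*feature_map)]
    return [list(channel) for channel in zip(*columns)]


def gem_descriptor(feature_map, p=3.0):
    """Pool every channel with GeM, then L2-normalize the descriptor."""
    return l2_normalize([gem(channel, p) for channel in feature_map])


# Toy feature map: 2 channels, each flattened to 2 spatial locations.
feature_map = [[1.0, 2.0], [3.0, 4.0]]
norm_after_only = gem_descriptor(feature_map)
norm_before_too = gem_descriptor(normalize_locations(feature_map))
print(norm_after_only, norm_before_too)
```

The two descriptors generally differ, because normalizing per location removes the magnitude differences between locations before they are pooled.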
Looking forward to your reply!
Great paper!
Where can I find the trained models?
Thanks
Is this website (https://deep-vg-bench.herokuapp.com/) unavailable?
It is great work, but unfortunately I can't open the website.
Thanks for your great work!
I have some questions about the network structure of CCT-NetVLAD. I'm not sure whether its structure includes a Seqpooling layer. Which layer of the CCT is connected to the NetVLAD layer?
I'm sorry to bother you.
Looking forward to your reply!
Thanks for your great work! Could you provide a download link for your ViT model and its training configuration?
Hi, I have selected gl18-tl-resnet50-gem-w. By default it outputs a 2048-dimensional vector. What should I do if I want to reduce it to 512 dimensions?