
vicentevivan / geo-clip

96 stars · 2 watchers · 13 forks · 41.32 MB

This is an official PyTorch implementation of our NeurIPS 2023 paper "GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization"

Home Page: https://arxiv.org/abs/2309.16020

License: MIT License

Python 100.00%
deep-learning geolocalization geolocation-estimation machine-learning pytorch gps-embeddings geography

geo-clip's People

Contributors

vicentevivan

geo-clip's Issues

Some questions

Dear author,
I appreciate your work and would like to learn the details of GeoCLIP's training. Could you publish the complete training code? Thank you.

Also, in the GeoCLIP code, self.opt is not defined in __init__. How should this be fixed?

    def _dequeue_and_enqueue(self, gps):
        """Update the GPS queue.

        Args:
            gps (torch.Tensor): GPS tensor of shape (batch_size, 2)
        """
        opt = self.opt
        gps_batch_size = gps.shape[0]
        batch_size = opt.batch_size

        gps_ptr = int(self.gps_queue_ptr)
        assert self.queue_size % batch_size == 0

        # Replace the GPS from ptr to ptr+batch_size (dequeue and enqueue)
        self.gps_queue[:, gps_ptr:gps_ptr + gps_batch_size] = gps.t()
        gps_ptr = (gps_ptr + batch_size) % self.queue_size  # move pointer
        self.gps_queue_ptr[0] = gps_ptr
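
One way to make this method runnable (a sketch of my own, not the authors' training code; the opt namespace and buffer names are assumptions based on the snippet above) is to define the missing state in __init__:

    import torch
    import torch.nn as nn

    class GeoCLIP(nn.Module):
        def __init__(self, opt, queue_size=4096):
            super().__init__()
            self.opt = opt                # assumed: exposes opt.batch_size
            self.queue_size = queue_size  # must be divisible by opt.batch_size

            # GPS queue of shape (2, queue_size) plus a pointer into it,
            # registered as buffers so they move with the model's device
            # but receive no gradients
            self.register_buffer("gps_queue", torch.randn(2, queue_size))
            self.register_buffer("gps_queue_ptr", torch.zeros(1, dtype=torch.long))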

In the following code, should the self.gps in self._dequeue_and_enqueue(self.gps) be modified to self.gps_queue?

    def append_gps_queue_features(self, gps_features):
        """ Compute the GPS queue features and append them to the given GPS features."""
        # Get the GPS queue features
        location_queue = self.gps_queue.t().detach()
        gps_queue_features = self.location_encoder(location_queue)
        gps_queue_features = F.normalize(gps_queue_features, dim=1)

        # Concatenate Features (GPS Features & GPS Queue Features)
        gps_features = torch.cat([gps_features, gps_queue_features], dim=0)

        # Update GPS queue
        self._dequeue_and_enqueue(self.gps) 

        return gps_features
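
For what it's worth, enqueuing self.gps_queue into itself would not add any new coordinates, so my guess (an assumption, not the authors' answer) is that the batch's raw GPS tensor should be passed in explicitly:

    def append_gps_queue_features(self, gps_features, gps):
        """Append the queue's encoded features, then enqueue this batch's GPS.

        Args:
            gps_features (torch.Tensor): encoded GPS features, shape (batch_size, D)
            gps (torch.Tensor): raw GPS coordinates, shape (batch_size, 2)
        """
        # Encode and normalize the queued GPS coordinates
        location_queue = self.gps_queue.t().detach()
        gps_queue_features = self.location_encoder(location_queue)
        gps_queue_features = F.normalize(gps_queue_features, dim=1)

        # Concatenate batch features with queue features
        gps_features = torch.cat([gps_features, gps_queue_features], dim=0)

        # Enqueue the batch's coordinates, not the queue itself
        self._dequeue_and_enqueue(gps)

        return gps_features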

Looking forward to your response, thank you.

When will the testing code be released?

Dear author,
Great work! I have recently been developing a new model for the same task and would like to use your testing code to evaluate my model's performance.

I would appreciate your generosity. Looking forward to your response, thank you.

Ten Crop benchmark

@VicenteVivan It is mentioned in the paper that a ten-crop method is used for evaluation, where you average your prediction over the 10 cropped images. How do you perform the averaging?

For example, you could average the predicted GPS coordinates, or you could average the embeddings before you evaluate. The two methods will give very different results. Thankful for any answer ^^
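
To make the two options concrete, here is a minimal sketch (the model and gallery names are placeholders of mine, not the repository's API):

    import torch
    import torch.nn.functional as F
    import torchvision.transforms as T
    import torchvision.transforms.functional as TF

    ten_crop = T.Compose([T.Resize(256), T.TenCrop(224)])

    def predict_ten_crop(model, image, gps_gallery, gps_gallery_features,
                         average="embeddings"):
        """Predict a GPS coordinate for one PIL image via ten-crop averaging.

        gps_gallery: (N, 2) candidate coordinates
        gps_gallery_features: (N, D) their L2-normalized embeddings
        """
        crops = torch.stack([TF.to_tensor(c) for c in ten_crop(image)])
        with torch.no_grad():
            feats = F.normalize(model.image_encoder(crops), dim=1)  # (10, D)

        if average == "embeddings":
            # Option A: average the 10 embeddings, then pick the best match once
            mean_feat = F.normalize(feats.mean(0, keepdim=True), dim=1)
            best = (mean_feat @ gps_gallery_features.t()).argmax(dim=1)
            return gps_gallery[best]                                 # (1, 2)
        else:
            # Option B: pick a best match per crop, then average the coordinates
            best = (feats @ gps_gallery_features.t()).argmax(dim=1)  # (10,)
            return gps_gallery[best].mean(0, keepdim=True)           # (1, 2)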

Higher Resolution for GPS Coordinates

Thanks for the great work!

Did I understand correctly that the sigma parameter controls the resolution of the frequencies, and that if you need a higher resolution for the GPS coordinates you have to increase it? You use [2**0, 2**4, 2**8] for a resolution of up to one kilometre; what about metre-level resolution?

Thank you very much
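
For concreteness, here is a minimal sketch of Gaussian random Fourier features at several sigmas (an illustration of the mechanism only, not the repository's exact location encoder; the extra 2**12 scale is a hypothetical example of mine):

    import torch
    import torch.nn as nn

    class RandomFourierFeatures(nn.Module):
        """gamma(x) = [cos(2*pi*Bx), sin(2*pi*Bx)] with B ~ N(0, sigma^2)."""

        def __init__(self, in_dim=2, mapping_dim=256, sigma=1.0):
            super().__init__()
            # Frequencies are fixed at init; a larger sigma draws higher
            # frequencies, which separate nearby inputs (finer resolution)
            self.register_buffer("B", torch.randn(in_dim, mapping_dim) * sigma)

        def forward(self, x):
            proj = 2 * torch.pi * x @ self.B
            return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)

    # One encoder per scale; appending a larger sigma (e.g. 2**12) is one way
    # to probe finer-than-kilometre detail
    encoders = nn.ModuleList(
        RandomFourierFeatures(sigma=s) for s in (2**0, 2**4, 2**8, 2**12)
    )
    coords = torch.rand(8, 2)  # normalized (lat, lon) pairs, shape (batch, 2)
    features = torch.cat([enc(coords) for enc in encoders], dim=-1)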

Request for detailed GEO-CLIP training code

Thank you very much for the excellent work you are doing! I want to try to train it myself, but I found that there is only a simple training-loop Python snippet on your GitHub. Could you please release your detailed and complete training code? Thank you very much!
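
In the meantime, here is a minimal sketch of what one training step might look like, assembled from the public snippets (the dataloader, optimizer, temperature, and the two-argument queue helper are my assumptions, not the authors' released code):

    import torch
    import torch.nn.functional as F

    def train_one_epoch(model, dataloader, optimizer, tau=0.07, device="cuda"):
        model.train()
        for images, gps in dataloader:                    # gps: (B, 2)
            images, gps = images.to(device), gps.to(device)

            image_features = F.normalize(model.image_encoder(images), dim=1)
            gps_features = F.normalize(model.location_encoder(gps), dim=1)

            # Add queued GPS embeddings as extra negatives, then enqueue this batch
            gps_features = model.append_gps_queue_features(gps_features, gps)

            logits = image_features @ gps_features.t() / tau  # (B, B + S)
            targets = torch.arange(images.size(0), device=device)
            loss = F.cross_entropy(logits, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()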

Loss function used for training

Hello, a partner and I are interested in fine-tuning the GeoCLIP model; however, we are unsure about the implementation of the loss function. Could you share the loss function you used, or give any tips for implementing it?
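
While waiting for an answer, a common starting point is the standard CLIP-style symmetric InfoNCE loss (a sketch under that assumption; it is not necessarily the exact loss GeoCLIP uses, which also involves a GPS queue):

    import torch
    import torch.nn.functional as F

    def clip_style_loss(image_features, gps_features, temperature=0.07):
        """Symmetric InfoNCE: matched (image, GPS) pairs lie on the diagonal."""
        image_features = F.normalize(image_features, dim=1)
        gps_features = F.normalize(gps_features, dim=1)

        logits = image_features @ gps_features.t() / temperature  # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)

        # Cross-entropy over rows (image -> GPS) and columns (GPS -> image)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2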

Questions regarding the Loss function

Dear authors,

I am currently working on reproducing the results from your paper. It doesn't seem like you have included any code for the implementation of your loss function, so I have some questions on the matter.

(screenshot of the loss function from the paper)

From my understanding of the loss, you have modified it to account for the dynamic queue (additional GPS embeddings).
$P$ - the number of augmented views of each image in a batch; let's take it to be 1 for simplicity.
$V$ - the embedded image
$L$ - the embedded GPS coordinate

This simplifies the Loss for a single view of a single image in a batch to the following:

$$L_i = - \log \frac{ \exp(V_i \cdot L_i / \tau)}{\sum_{i = 0} \exp(V_i \cdot L_i / \tau) + \sum_{i = 0} \exp(V_i \cdot \tilde{L}_i / \tau)}$$

Where in the denominator, the first sum is for a batch of length B, and the second sum is for the dynamic queue of length S.

My questions are the following:

    1. It seems like you are using the same index $i$ for the $i^{th}$ sample of the batch, the sum over the batch, and the sum over the dynamic queue. Did you mean to take something like the loss below (index $i$ changed to $k$ in the denominator)?

$$L_i = - \log \frac{ \exp(V_i \cdot L_i / \tau)}{\sum_{k = 0} \exp(V_i \cdot L_k / \tau) + \sum_{k = 0} \exp(V_i \cdot \tilde{L}_k / \tau)}$$

By doing so, you do contrastive learning of each image over all other coordinates while keeping the same image $V_i$ in the denominator.

    2. If it is true that you do contrastive learning of each image over all other coordinates, why did you decide not to do contrastive learning of each GPS coordinate over all other images as well? In the original CLIP paper, the cross-entropy loss is applied both horizontally and vertically (over the rows and the columns of the similarity matrix), yet you have chosen to use it only horizontally. Is there a specific reason for this decision?
    3. Going back to the $P$ augmented views: you mention in your paper that a benefit of using a frozen CLIP backbone is that one can pre-encode all images, making the training process faster. Yet if you perform $P$ augmentations for each image in each batch, don't you have to re-encode the augmented images, and therefore lose this benefit?

I look forward to hearing from you! Thanks.
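
For anyone else reproducing this, here is a sketch of the loss as written in my corrected equation above (index $k$ in the denominator, queue embeddings as additional negatives; this is my reading, not confirmed by the authors):

    import torch
    import torch.nn.functional as F

    def queue_infonce(V, L, L_queue, tau=0.07):
        """V: (B, D) image embeddings, L: (B, D) GPS embeddings,
        L_queue: (S, D) queued GPS embeddings used as extra negatives."""
        V, L, L_queue = (F.normalize(x, dim=1) for x in (V, L, L_queue))

        # Row i holds V_i . L_k for the batch, then V_i . L~_k for the queue
        logits = V @ torch.cat([L, L_queue], dim=0).t() / tau  # (B, B + S)
        targets = torch.arange(V.size(0), device=V.device)     # positives: diagonal

        # One-directional (image -> location) cross-entropy, as in the equation
        return F.cross_entropy(logits, targets)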
