Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval [ECCV 2020]

Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval,
Christopher Thomas and Adriana Kovashka, Proceedings of the European Conference on Computer Vision, 2020

Video presentations (both short and long) of our paper are available on the project webpage: http://www.cs.pitt.edu/~chris/semantic_neighborhoods

Abstract

The abundance of multimodal data (e.g. social media posts with text and images) has inspired interest in cross-modal retrieval methods. However, most prior methods have focused on the case where image and text convey redundant information; in contrast, real-world image-text pairs convey complementary information with little overlap. Popular approaches to cross-modal retrieval rely on a variety of metric learning losses, which prescribe what the proximity of image and text should be in the learned space. However, images in news articles and media portray topics in a visually diverse fashion; thus, we need to take special care to ensure a meaningful image representation. We propose novel within-modality losses which ensure that not only are paired images and texts close, but the expected image-image and text-text relationships are also observed. Specifically, our method encourages semantic coherency in both the text and image subspaces, and improves the results of cross-modal retrieval in three challenging scenarios.

Method Overview

We propose a metric learning approach where we use the semantic relationships between text segments to guide the embedding learned for corresponding images. In other words, to understand what an image "means", we look at what articles it appeared with. Unlike prior approaches, we capture this information not only across modalities, but within the image modality itself. If texts y_i and y_j are semantically similar, we learn an embedding where we explicitly encourage their paired images x_i and x_j to be similar, using a new unimodal loss. Note that in general x_i and x_j need not be similar in the original visual space. In addition, we encourage texts y_i and y_j, which were close in the unimodal space, to remain close. Our novel loss formulation explicitly encourages within-modality semantic coherence. We show how our method brings paired images and text closer while also preserving semantically coherent regions, e.g. the texts remain close in the graphic above.
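To make the idea concrete, below is a minimal sketch of what a within-modality image term could look like: a triplet-style margin loss over image embeddings, where the positive for each anchor is an image whose paired text is a semantic neighbor. This is illustrative only and not the paper's exact formulation; the real losses are defined in the training code.

import torch.nn.functional as F

def within_modality_image_loss(img_emb, pos_idx, neg_idx, margin=0.1):
    # img_emb: (B, D) image embeddings for the batch.
    # pos_idx / neg_idx: for each anchor, the batch index of an image whose
    # paired text is / is not a semantic neighbor of the anchor's text.
    emb = F.normalize(img_emb, dim=1)            # work in cosine-similarity space
    pos_sim = (emb * emb[pos_idx]).sum(dim=1)    # similarity to a neighbor's image
    neg_sim = (emb * emb[neg_idx]).sum(dim=1)    # similarity to a non-neighbor's image
    return F.relu(margin + neg_sim - pos_sim).mean()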

Setup

This code was developed using Python 3.7.6. It requires PyTorch version 1.5.0, torchvision, and tensorboard. Anaconda is also strongly recommended. You will additionally need the following packages, which are not included in the Anaconda distribution:

conda install -c anaconda gensim 
conda install -c conda-forge nltk

Additionally, you will need to install NMSlib for nearest neighbor computation.
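NMSlib is typically available from PyPI, e.g.:

pip install nmslib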

Training a model

Setup and neighborhood computation

Our method begins by calculating semantic neighborhoods in text space using a pre-trained Doc2Vec model. These instructions all assume the Politics dataset is used; however, the code can easily be modified to work with any image-text paired dataset. Begin by downloading the Politics dataset (see the project webpage linked above). You may need to adjust the path to the dataset in the code, and make sure that the image paths on your machine correspond to the relative paths in train_test_paths.pickle (a quick sanity check is sketched below). In general, porting to an arbitrary dataset requires little effort.
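Before running the scripts below, it can help to verify that the relative paths resolve on your machine. The following is a rough sketch only: the exact structure of train_test_paths.pickle is an assumption here, and DATASET_ROOT is a hypothetical placeholder for your local image directory.

import os
import pickle

DATASET_ROOT = '/path/to/politics/images'  # hypothetical local image root

with open('train_test_paths.pickle', 'rb') as f:
    splits = pickle.load(f)

# Flatten whatever container the pickle holds into a list of relative paths
# (adjust this unpacking to match the actual structure of the file).
rel_paths = [p for group in (splits.values() if isinstance(splits, dict) else [splits]) for p in group]
missing = [p for p in rel_paths if not os.path.exists(os.path.join(DATASET_ROOT, p))]
print(f'{len(missing)} of {len(rel_paths)} image paths not found under {DATASET_ROOT}')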

# Train the Doc2Vec model on the train set of text from the Politics dataset.
python train_doc2vec.py

# Extract Doc2Vec vectors for the train set of text
python extract_doc2vec_vectors.py

# Perform approximate k-nearest neighbors using NMSLib
python knn_document_features.py

This implementation closely follows the method described in the paper. However, you may need to adjust several parameters depending on the specifics of your dataset. For example, you may wish to train Doc2Vec for more than 20 epochs (e.g. we trained Doc2Vec for 50 epochs on COCO due to its smaller size). You may also consider training Doc2Vec on the entire articles from GoodNews, rather than just the captions. Similarly, we constrain the text in Politics to its first two sentences (due to the lack of captions), but you may wish to train on the entire caption where captions are available.
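For reference, the pipeline implemented by the three scripts above looks roughly like the sketch below. It is illustrative only: tokenized_texts is a hypothetical placeholder for the tokenized train-set texts, and the parameter values shown are simply the kind of knobs you may want to tune.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nmslib
import numpy as np

# Placeholder corpus -- in practice, load and tokenize the train-set texts.
tokenized_texts = [
    ['senator', 'gives', 'speech', 'on', 'healthcare'],
    ['protesters', 'gather', 'outside', 'the', 'capitol'],
    ['new', 'healthcare', 'bill', 'debated', 'in', 'the', 'senate'],
]
corpus = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tokenized_texts)]

# Train Doc2Vec (vector_size / epochs / min_count are dataset-dependent choices).
model = Doc2Vec(vector_size=300, min_count=1, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer one vector per document and build an approximate kNN index with NMSLib.
vectors = np.stack([model.infer_vector(doc.words) for doc in corpus])
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(vectors)
index.createIndex({'post': 2})

# (neighbor ids, distances) of the k nearest semantic neighbors for each document.
neighbors = index.knnQueryBatch(vectors, k=2, num_threads=4)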

Training

The prior steps trained a Doc2Vec model and calculated the semantic neighborhoods to be preserved for each image-text pair. We next train the cross-modal model with our constraints to preserve the semantic neighborhoods discovered in the previous step. The first command-line argument is the weight of the symmetric retrieval constraint (text-to-image retrieval), the second is the weight of the within-modality image constraint (image -> image neighbors, i.e. L_img), and the third is the weight of the within-modality text constraint, i.e. L_text. These weights should be tuned for your dataset; the sketch after the command shows roughly how they enter the objective.

# Train the cross-modal retrieval model
python train_cross_modal_retrieval_model.py 1 0.3 0.2  

Training should be stopped once the validation loss has failed to decrease for ~10 epochs.
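Conceptually, the three command-line weights enter the overall training objective along the following lines. This is an illustrative sketch, not the actual code; see train_cross_modal_retrieval_model.py for the real loss terms.

import sys

# e.g. `python train_cross_modal_retrieval_model.py 1 0.3 0.2`
w_sym, w_img, w_txt = (float(a) for a in sys.argv[1:4])

def total_loss(l_img2txt, l_txt2img, l_img, l_txt):
    # Base image->text retrieval loss, plus the weighted symmetric (text->image)
    # term and the weighted within-modality terms L_img and L_text.
    return l_img2txt + w_sym * l_txt2img + w_img * l_img + w_txt * l_txt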

BibTeX Citation

@inproceedings{thomas2020preserving,
  title={Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval},
  author={Thomas, Christopher and Kovashka, Adriana},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  month = {August},
  year = {2020}
}

semantic_neighborhoods's Issues

Cannot find 'complete_db.pickle' and 'db.pickle' in the Politics dataset

Hello, thanks for your nice code, but I have run into a problem: I cannot find 'complete_db.pickle' or 'db.pickle' in the Politics dataset when I run train_doc2vec.py.

def main():
    dataset = pickle.load(open('db.pickle', 'rb'))

and

def get_db():
    db = pickle.load(open('complete_db.pickle', 'rb'))

Thanks for your help.

Best wishes!

Zhiqiang
