abhimanyudubey / geoyfcc

Dataset accompanying the paper "Adaptive Methods for Real-World Domain Generalization"

License: Creative Commons Zero v1.0 Universal

geoyfcc's Introduction

Geographically Split YFCC for Domain Generalization (Geo-YFCC)

Samples from GeoYFCC

This repository contains a link to the Geo-YFCC dataset and instructions for downloading it. The dataset was presented in the paper "Adaptive Methods for Real-World Domain Generalization", appearing at CVPR 2021. It contains a subset of the popular YFCC100M dataset, partitioned based on the images' country of origin. Note that, to limit the hosting space required, the dataset available here contains only the metadata and domain assignments, not the original images. To obtain the original images, we urge the user to download them directly via the YFCC100M link (or use this API).

Information about the dataset: We use geotags to partition images based on their country of origin. For the label space, we consider the 4K categories from ImageNet-5K that are not present in ILSVRC12. These categories are selected in order to eliminate biased prior knowledge from pre-training on ILSVRC12. For each of the 4K labels, we select the corresponding images from YFCC100M using simple keyword filtering of the image tags. This gives us 1,261 categories with at least one image each, and every category is present in at least 5 countries. We group images by their country of origin and retain only countries that have at least 10K images. For any domain with more than 20K images, we randomly sub-sample it down to 20K images. Each domain (i.e., country) therefore has between 10K and 20K images, giving us a total of 1,147,059 images from 1,261 categories across 62 countries (domains), and each image is associated with a class label and a country (domain). We randomly partition the data into 40 training, 7 validation and 15 test domains (by country). For each domain, we sample 3K points to create a per-domain test set and use the remaining points for training and validation.
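The grouping, sub-sampling and hold-out steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used to build Geo-YFCC: the function names `build_domains` and `split_domains`, the `(image_id, country)` input format and the `seed` parameter are hypothetical, and the thresholds are parameters so that the logic can be tried on small inputs.

```python
import random
from collections import defaultdict


def build_domains(records, min_images=10_000, max_images=20_000,
                  test_size=3_000, seed=0):
    """Group (image_id, country) pairs into per-country domains.

    Hypothetical helper: countries with fewer than `min_images` images are
    dropped, larger ones are randomly sub-sampled to `max_images`, and
    `test_size` images per domain are held out as that domain's test set.
    """
    rng = random.Random(seed)
    by_country = defaultdict(list)
    for image_id, country in records:
        by_country[country].append(image_id)

    domains = {}
    for country, ids in sorted(by_country.items()):
        if len(ids) < min_images:
            continue  # too few images to form a domain
        if len(ids) > max_images:
            ids = rng.sample(ids, max_images)  # random sub-sample
        else:
            ids = list(ids)
        rng.shuffle(ids)
        # First `test_size` shuffled images form the per-domain test set;
        # the rest remain available for training and validation.
        domains[country] = {"test": ids[:test_size], "train": ids[test_size:]}
    return domains


def split_domains(countries, n_train=40, n_val=7, seed=0):
    """Randomly assign countries to training, validation and test domains."""
    names = sorted(countries)
    random.Random(seed).shuffle(names)
    return (names[:n_train],
            names[n_train:n_train + n_val],
            names[n_train + n_val:])
```

With the default thresholds, every retained country ends up with between 10K and 20K images, 3K of which are held out, and 62 retained countries are divided 40/7/15.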

The metadata file is available at this Google Drive link. To download this via CLI, we suggest using gdown.

pip install gdown
gdown "http://drive.google.com/uc?id=1HvpAeEc37R9nLcI79iSeVCX2PYg3AgXZ"
echo "db7419355b1e9827a2cf8f480ee36120  GeoYFCC.tar.gz" | md5sum -c -

The file has been downloaded correctly if the above command ends with OK, meaning that the md5sum of the downloaded file matches db7419355b1e9827a2cf8f480ee36120. The file is a gzipped tar archive; extracting it yields a pickle file that stores a pandas DataFrame with the following columns:
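For users who prefer to verify and unpack the archive from Python, a minimal sketch follows. The helper names `md5_of`, `extract_archive` and `load_metadata` are hypothetical, and the name of the pickle file inside the archive is not stated here, so the sketch returns whatever files the archive contains instead of guessing it. Note that unpickling can execute arbitrary code, so only load files from a source you trust.

```python
import hashlib
import tarfile
from pathlib import Path

# Checksum quoted in the download instructions.
EXPECTED_MD5 = "db7419355b1e9827a2cf8f480ee36120"


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_archive(archive_path, out_dir="."):
    """Extract a .tar.gz archive and return the paths of the extracted files."""
    out = Path(out_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
        tar.extractall(out)
    return [out / name for name in names]


def load_metadata(pickle_path):
    """Load the pickled pandas DataFrame (requires pandas to be installed)."""
    import pandas as pd  # imported lazily so the helpers above need only the stdlib
    return pd.read_pickle(pickle_path)


# Example usage, assuming the archive was saved as GeoYFCC.tar.gz and
# contains a single pickle file:
#   assert md5_of("GeoYFCC.tar.gz") == EXPECTED_MD5, "checksum mismatch"
#   files = extract_archive("GeoYFCC.tar.gz")
#   df = load_metadata(files[0])
#   print(df.columns.tolist())
```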

| Column | Description |
| --- | --- |
| yfcc_row_id | Corresponding row ID within YFCC100M (the first column of the yfcc100m_dataset file from YFCC100M; numbering begins at 0) |
| label_ids | Labels; the dataset is multi-label, so this is a list |
| country | Plaintext name of the domain (country) |
| country_id | Integer ID of the domain (country), from 0 to 61 |
| in_5k_label_ids | Corresponding labels in the ImageNet-5K dataset (use the in5k_map.json file to map these IDs to synset IDs) |
| is_train | Boolean specifying whether the row is in the training image split |
| yfcc_metadata | Copy of the original YFCC metadata for the image |

By default, domains with country_id 0-39 are training domains, 40-46 are validation domains, and 47-61 are test domains. Each domain is further divided into a train and a test split, as specified in the is_train field.
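As an illustration of how the two levels of splitting combine, the sketch below maps a country_id to its default domain role and filters rows by both the domain role and the is_train flag. The names `domain_role` and `select_rows` are hypothetical helpers, and rows are assumed to be plain dictionaries with `country_id` and `is_train` keys.

```python
# Default domain ranges taken from the description above.
TRAIN_DOMAINS = range(0, 40)   # country_id 0-39
VAL_DOMAINS = range(40, 47)    # country_id 40-46
TEST_DOMAINS = range(47, 62)   # country_id 47-61


def domain_role(country_id):
    """Return "train", "val" or "test" for a country_id under the default split."""
    cid = int(country_id)  # also accepts NumPy integer types
    if cid in TRAIN_DOMAINS:
        return "train"
    if cid in VAL_DOMAINS:
        return "val"
    if cid in TEST_DOMAINS:
        return "test"
    raise ValueError(f"country_id out of range: {country_id!r}")


def select_rows(rows, role, is_train):
    """Keep rows whose domain has the given role and whose is_train flag matches."""
    return [
        row for row in rows
        if domain_role(row["country_id"]) == role
        and bool(row["is_train"]) == is_train
    ]
```

With the metadata loaded as a pandas DataFrame `df`, the same selection can be written as `df[df["country_id"].map(domain_role).eq("train") & df["is_train"]]`, which keeps the training images of the training domains.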

If you find this dataset relevant to your research, please consider citing our work below.

@InProceedings{dubey2021adaptive,
  title={Adaptive Methods for Real-World Domain Generalization},
  author={Dubey, Abhimanyu and Ramanathan, Vignesh and Pentland, Alex and Mahajan, Dhruv},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2021}
}

geoyfcc's Issues

Wordnet Ids corresponding to class labels

Hi! Thanks for creating the GeoYFCC dataset! I am trying to analyze how well the labels describe the actual content of the images, and I realized that the metadata has no text labels (only label IDs). Do you have the corresponding text or WordNet IDs for all the classes? That would help a lot. Thank you!

download GeoYFCC

Hi, it is awesome work to collect such a large dataset for domain generalization! Could you please share a script to download the dataset? Many thanks!

LT-ImageNet dataset

Hi, excellent work on Domain Generalization!
Besides the large GeoYFCC dataset, I'm wondering if you would be willing to open-source LT-ImageNet?

yfcc_row_id 0 indexed or 1 indexed

Hi,

Thank you for making the dataset public. I assume yfcc_row_id is the line number in the yfcc100m_dataset file. May I know if yfcc_row_id is 0 indexed or 1 indexed?

Thanks

URLs for many images in the metadata file are broken

Thanks for releasing this dataset! I was able to download the YFCC images from the URLs provided in the metadata file for ~730k out of the 1.1 million images. The URLs for the rest unfortunately appear to be broken.

Do you have any recommendations for a source that has all the images? I was thinking of trying https://pypi.org/project/yfcc100m/ that you recommend in your README - have you had luck with using that to download all images?

Thanks!
