GithubHelp home page GithubHelp logo

solertis / mih Goto Github PK

View Code? Open in Web Editor NEW

This project forked from norouzi/mih

0.0 3.0 0.0 7.55 MB

Fast exact nearest neighbor search in Hamming distance on binary codes with Multi-index hashing

Home Page: http://www.cs.toronto.edu/~norouzi/research/mih/

License: Other

CMake 0.85% Shell 3.61% MATLAB 17.46% C++ 70.92% C 7.16%

mih's Introduction

Multi Index Hashing (MIH)

An implementation of "Fast Exact Search in Hamming Space with Multi-Index Hashing, M. Norouzi, A. Punjani, D. J. Fleet, IEEE TPAMI 2014". See http://www.cs.toronto.edu/~norouzi/research/mih/.

This algorithm performs fast exact nearest neighbor search in Hamming distance on binary codes. Using this code, one can re-run the experiments described in the paper. For best results, consider using libhugetlbfs with multi-index hashing.

Compilation

You need make, cmake, hdf5 library, hdf5-dev package to build this project. To compile, create a folder called build, and run:

cd build
rm * -rf
cmake ..
make

This should generate two binary files: mih and linscan

Datasets

An example binary code dataset with 1 million 64-bit codes from SIFT is stored in the data folder. To generate larger binary code datasets, one should download raw data which can be converted to binary codes using hashing techniques (e.g., LSH or MLH). For example, download the INRIA bigann dataset (1 billion SIFT features) from http://corpus-texmex.irisa.fr/ and store it under data/inria/. You can also download the Tiny images dataset (80 million GIST descriptors) from http://horatio.cs.nyu.edu/mit/tiny/data/index.html and store it under data/tiny.

By running create_lsh_codes.m (a matlab snippet) one can generate binary codes from the above datasets using random projections (LSH, "Similarity estimation techniques from rounding algorithms, M. Charikar, STOC. 2002"). By changing the first few lines of create_lsh_codes, you can control the parameters of the matlab snippet. The output is in matlab (version 7.3) format, which is essentially hdf5 format. Hence, we use hdf5 library to read the binary code datasets.

Usage

RUN.sh is a bash script showing an example run of the program 64-bit codes. Set the parameters nb, HUGE, hashfunc, etc. to change the setting. RUN.sh includes suggested values for m: number of hash tables.

Linear scan

linscan provides an efficient implementation of exhaustive linear scan for kNN in Hamming distance on binary codes. This serves as a good baseline.

linscan <infile> <outfile> [options]
Options:
  -N <number>       Set the number of binary codes from the beginning of the dataset file to be used
  -B <number>       Set the number of bits per code, default autodetect
  -Q <number>       Set the number of query points to use from <infile>, default all
  -K <number>       Set number of nearest neighbors to be retrieved

Examples:

./build/linscan data/lsh_64_sift_1M.mat linscan_64_1M.h5 -N 100000  -B 64 -Q 1000 -K 100
./build/linscan data/lsh_64_sift_1M.mat linscan_64_1M.h5 -N 1000000 -B 64 -Q 1000 -K 100

Assuming that a dataset of 128-bit binary codes is stored at codes/lsh_64_sift_1M.mat, running the above lines will create an output file linscan_64_1M.h5, which stores the results and timings for 100-NN search on 100K and 1M binary codes. If the output file does not exist (the first time), the output file is created. If the output file exists (since the second time), the file is appended with the new results.

Multi Index Hashing

mih provides an implementation of multi-index hashing for fast exact kNN in Hamming distance on binary codes.

mih <infile> <outfile> [options]
Options:
  -N <number>          Set the number of binary codes from the beginning of the dataset file to be used
  -B <number>          Set the number of bits per code, default autodetect
  -Q <number>          Set the number of query points to use from <infile>, default all
  -m <number>          Set the number of chunks to use, default 1
  -K <number>          Set number of nearest neighbors to be retrieved
  -R <number>          Set the number of codes (in Millions) to use in computing the optimal bit reordering, default OFF (0)

Examples:

./build/mih data/lsh_64_sift_1M.mat mih_64_1M.h5 -N 100000 -B 64 -m 5 -Q 10000 -K 100
./build/mih data/lsh_64_sift_1M.mat mih_64_1M.h5 -N 1000000 -B 64 -m 4 -Q 10000 -K 100

The mih's options are very similar to linscan. It has an additional argument (-m) to determine the number of hash tables. It also has a flag (-R) to determine whether the assignment of bits to the substrings should be optimized.

FAQs

Q: I have tried your code with some of my datasets. It works well when I used small datasets, but it does not perform well with large datasets.

A: Did you try decreasing the number of hash tables (by the -m switch) as you increased the size of the database? My experience is that with the correct choice of m, the speedup on larger datasets should be much better. In the RUN.sh file, I have a set of suggestions for the values of m for different number of codes in the datasets.

License

Copyright (c) 2012, Mohammad Norouzi [[email protected]] and Ali Punjani [[email protected]]. This is a free software; for license information please refer to license.txt file.

TODO

  • Automatic suggestion for the m parameter.
  • Multi-core functionality.
  • Improve SparseHashtable insertion speed. It is currently very slow, but can be improved in different ways.

mih's People

Contributors

kpjoshi avatar norouzi avatar simongog avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.