nmslib / nmslib

Non-Metric Space Library (NMSLIB): An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

License: Apache License 2.0

Python 10.22% Shell 2.80% Makefile 0.59% Perl 5.65% C 0.51% C++ 64.22% CMake 0.63% R 0.04% Java 0.42% Thrift 0.14% Batchfile 0.60% Roff 0.58% TeX 10.09% Jupyter Notebook 3.51%
knn-search non-metric neighborhood-graphs k-nn-graphs vp-tree

nmslib's Introduction


Non-Metric Space Library (NMSLIB)

Important Notes

  • NMSLIB is generic yet fast; see the results of the ANN benchmarks.
  • A standalone implementation of our fastest method, HNSW, also exists as a header-only library.
  • All the documentation (including the Python bindings, the query server, descriptions of methods and spaces, building the library, etc.) can be found on this page.
  • For generic questions/inquiries, please use the Gitter chat; the GitHub issues page is for bugs and feature requests.

Objectives

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core library does not have any third-party dependencies. It has been gaining popularity recently; in particular, it has become a part of Amazon Elasticsearch Service.

The goal of the project is to create an effective and comprehensive toolkit for searching in generic and non-metric spaces. Even though the library contains a variety of metric-space access methods, our main focus is on generic and approximate search methods, in particular on methods for non-metric spaces. NMSLIB is possibly the first library with principled support for non-metric space searching.

NMSLIB is an extensible library: it is possible to add new search methods and distance functions. NMSLIB can be used directly in C++ and in Python (via Python bindings). In addition, it is possible to build a query server, which can be used from Java (or other languages supported by Apache Thrift, version 0.12). Java has a native client, i.e., it works on many platforms without requiring a C++ library to be installed.

Authors: Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov, David Novak. With contributions from Ben Frederickson, Lawrence Cayton, Wei Dong, Avrelin Nikita, Dmitry Yashunin, Bob Poekert, @orgoro, @gregfriedland, Scott Gigante, Maxim Andreev, Daniel Lemire, Nathan Kurz, Alexander Ponomarenko.

Brief History

NMSLIB started as a personal project of Bilegsaikhan Naidan, who created the initial code base and the Python bindings and participated in earlier evaluations. The most successful class of methods, neighborhood/proximity graphs, is represented by the Hierarchical Navigable Small World (HNSW) graph due to Malkov and Yashunin (see the publications below). Other useful methods include a modification of the VP-tree due to Boytsov and Naidan (2013), the Neighborhood APProximation index (NAPP) proposed by Tellez et al. (2013) and improved by David Novak, as well as a vanilla uncompressed inverted file.

Credits and Citing

If you find this library useful, feel free to cite our SISAP paper [BibTex] as well as the other papers listed at the end. One crucial contribution to cite is the fast Hierarchical Navigable Small World (HNSW) graph method [BibTex]. Please also check out the stand-alone HNSW implementation by Yury Malkov, which is released as the header-only HNSWLib library.

License

The code is released under the Apache License Version 2.0 (http://www.apache.org/licenses/). Older versions of the library (but not NMSLIB 2.x) included additional components with different licenses:

  • The LSHKIT, which is embedded in our library, is distributed under the GNU General Public License, see http://www.gnu.org/licenses/.
  • The k-NN graph construction algorithm NN-Descent due to Dong et al. 2011 (see the links below), which is also embedded in our library, seems to be covered by a free-to-use license, similar to Apache 2.
  • The FALCONN library's license is MIT.

Funding

Leonid Boytsov was supported by the Open Advancement of Question Answering Systems (OAQA) group and the following NSF grant #1618159: "Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond". Bileg was supported by the iAd Center.

Related Publications

The most important related papers are listed below in chronological order:

nmslib's People

Contributors

8w9ag, amoussawi, andrusha97, bejvisek, benfred, bhavaygg, bileg, bobpoekert, brendanchambers, cadovvl, deneutoy, doug-friedman, geofft, gitter-badger, gokceneraslan, gregf-atomwise, huonw, janaknat, jjjamie, jmazanec15, jschmitz28, jvkersch, mdeff, neggert, orgoro, pabs3, scottgigante, searchivarius, sjhewitt, yurymalkov


nmslib's Issues

Port NN-descent

We need at least one MORE implementation of k-NN graph construction. Don't forget to update the list of publications.

Caching of ground-truth data

Currently, in each test we randomly divide the data set into testing and indexable subsets. Then, for each division, we compute the exact result set for the NN or range search via sequential search. We also memorize the search time of the sequential search.

This is fine for fast distances and small data sets, yet rather slow for large data sets (and slow distances). We should support caching of ground-truth data. If a special option is given, we should memorize the split parameters and the ground-truth data for a specific data set (and a selection of parameters); subsequent re-runs then take almost no time.

Support minimalistic objects without the wrapper Object class

Currently, we place all our data objects inside instances of the class Object. However, this is a bit complicated for users. Perhaps a better approach is to work directly on pointers. We do need to store object metadata somewhere (id, class label, type, and the size of the data), yet this can be kept in a separate hash table where object pointers are keys.

Perhaps, one solution is a thin wrapper Object class.

This is not for the closest release, but for some more distant future.

Sync the documentation

Cover several issues:

  1. New methods.
  2. New utilities.
  3. Changes in the interface of the Space.
  4. Auto-tuning for the VP-tree (in the VP-tree section).

Move all utility code to a separate util library

Visual Studio project files should be changed as well. Currently, we have the following utility binaries, which would be better placed in a separate util/apps folder. Now they are either in the src or in the test folder:

  1. bench_distfunc
  2. bench_projection
  3. tune_vptree
  4. dummy_app
  5. report_intr_dim

Record the # of queries per second

This is a necessary (but missing) statistic for multi-threaded testing. Even though it is possible to compute it from the average query time, it would be best to produce it automatically.

Perhaps, there is a minor bug in the evaluation code

This one happens only for float; see the TODO in eval_results.h:

      /*
       * TODO: @leo These eps are quite adhoc.
       *            There can be a bug here (where approx is better than exact??), 
       *            to reproduce a situation when the below condition is triggered 
       *            for epsRel = 1e-5 and epsAbs = 1e-5:
       *            release/experiment  --dataFile ~/TextCollect/VectorSpaces/colors112.txt --knn 1 --testSetQty 1 --maxNumQuery 1000  --method vptree:alphaLeft=0.8,alphaRight=0.8  -s cosinesimi 
       *
       */
      if (mx > 0 && (1- mn/mx) > epsRel && (mx - mn) > epsAbs) {
        for (size_t i = 0; i < std::min(ExactDists_.size(), ApproxDists_.size()); ++i ) {
          LOG(INFO) << "Ex: " << ExactDists_[i].first <<
                       " -> Apr: " << ApproxDists_[i] <<
                       " 1 - ratio: " << (1 - mn/mx) << " diff: " << (mx - mn);
        }
      }

Add an implementation of a kernelized hashing

http://www.cse.ohio-state.edu/~kulis/klsh/klsh.htm

Papers:

Kernelized Locality-Sensitive Hashing for Scalable Image Search. Brian Kulis & Kristen Grauman. In Proc. 12th International Conference on Computer Vision (ICCV), 2009.

Also see the following related papers, which apply LSH to learned Mahalanobis metrics:

Fast Similarity Search for Learned Metrics. Brian Kulis, Prateek Jain, & Kristen Grauman. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2143-2157, 2009.

Fast Image Search for Learned Metrics. Prateek Jain, Brian Kulis, & Kristen Grauman. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

The VP-tree should be able to call the tuning procedure itself

Make it possible to call the tuning procedure (currently you can only run tune_vptree) from the VP-tree automatically. The user should be able to specify the following parameters:

  • minExp (default 1)
  • maxExp (default 1)
  • maxTuneQty (default 10000)
  • bucketSize (default 50)

Require the following: bucketSize * maxTuneQty >= dataQty.

Improve a vp-tree pivot-selection procedure (for low-dimensional spaces).

While in higher-dimensional spaces (e.g., SIFT) with narrow histograms of distances (between random pairs of points), random pivots may be almost as good as any other pivots, this is not the case of lower dimensional spaces, or high-dimensional spaces with very low intrinsic dimensionality. A user should be able to specify an option for a better pivot-selection. In particular, trying several pivots and choosing one with the highest distance spread should help in many cases.

Logging improvements

Need to handle logging better than just dumping everything to the screen. Two improvements are needed:

  1. Log-level control through a switch
  2. An ability to redirect non-fatal errors to the separate log-file

Better caching of ground-truth data

Currently, we are going to save only a fixed number of GS entries, which is 1000 by default. This is fine for most k-NN scenarios but may be insufficient for range searching. We need a "knob" to specify the maximum number of entries to save relative to the result size.

In addition, storing the complete set of all GS entries uses a lot of memory; in some cases, we simply run out of memory. As a quick fix (see commit 91298da), we never keep more than a specified number of GS entries. Obviously, this is not a satisfactory solution in general, so something cleverer needs to be done.

Getting rid of LOG(LIB_FATAL)

In many places we use LOG(LIB_FATAL), but it kills the app. This call can also reside inside CHECK.

This is hardly acceptable for a good library. We should replace most such calls with throwing an exception, taking some care to ensure this does not leak memory.

Class hierarchy for objects

We absolutely need a hierarchy of object classes that can do the following:

  1. Read an object from an input stream.
  2. Write an object to an output stream.
  3. Provide access to the data so distances can be computed.

Currently there is no way to write objects (not even for debugging purposes). Reading is done by classes inherited from Space. This is not good practice; we need to decouple this object-related functionality from spaces. We need objects such as sparse vectors, dense vectors, etc. (in the future we will add strings and other weird thingies).

Implement M-index

Metric Index: An efficient and scalable solution for precise and approximate similarity search. David Novak, Michal Batko, Pavel Zezula.

Source code is available at:

http://mufin.fi.muni.cz/trac/m-index

Implement Fagin's rank aggregation methods

Efficient similarity search and classification via rank aggregation
R. Fagin, R. Kumar, D. Sivakumar. Proceedings of the 2003 ACM …, 2003.

It's also patented:

Efficient similarity search and classification via rank aggregation
United States Patent Application 20040249831 Kind Code: A1

Better proxy for distance computations

Instead of the current approach, where the space provides a virtual function to compute distances, we need the space to provide a function pointer to compute the distance! To estimate the number of distance computations in experiments, we will simply create a wrapper space that provides a wrapper distance function. These functions will proxy computations and count their number. This will be done only during the one type of experiment where we also compute accuracy.

This is related to closed #78: for low-dimensional spaces, though, even a pointer can be an overhead.

Some motivation
Space objects can be used to compute distances only during indexing time. At search time, all distance computations are proxied through a query object. During indexing, the query parameter can be NULL, but during search one needs to supply the actual query.

This is actually not such a good approach: we need a separate object to proxy distance computations, and it should not be the query. Then functions like Projection::compProj will not need to use both pObj (at index time) and pQuery (at search time).

Object class improvements

In version 2, we will introduce several types of memory allocators (available to indexing methods that do memory allocation for data themselves) that will take care of:

  1. allocating
  2. freeing
  3. properly aligning memory (previously #9)

An Object will be renamed to DataPoint.

The hierarchy of object classes referenced in #7 will also be very useful.

These allocators will allow several modes:

  1. fully dynamic: can delete/add each object independently
  2. statically mmaped
  3. statically allocated
  4. perhaps dynamically growing allocated as a single chunk
  5. perhaps dynamically growing allocated as multiple chunks

We need to take special care to efficiently represent constant-size objects and variable-size objects.

Some motivation

It isn't clear which code should delete Object instances stored, e.g., in an ObjectVector. This is why they are not always deleted, e.g., in the utility report_intr_dim.

We need a policy that specifies Object ownership. One option is to use shared_ptr, but this choice may affect performance because reference counts are synchronized among threads.

Important note: currently, Object instances are automatically deleted only by ExperimentConfig. So we are fine in this case, but run into trouble when somebody calls Space<dist_t>::ReadDataset directly, e.g.:

  ObjectVector data;
  space->ReadDataset(data, NULL, dataFile.c_str(), maxNumData);

Incremental sorting bug

Try, e.g., commenting out incremental sorting for now:

release/experiment --dataFile /home/leonid/TextCollect/sift_texmex_learn5m.txt --maxNumData 1000000 --maxNumQuery 200 --distType double --spaceType l2 --knn 10 --testSetQty 1 --cachePrefixGS /home/leonid/GIT/NonMetricSpaceLib/similarity_search/gs_cache/sift1M --method proj_incsort:projType=rand,projDim=1024,knnAmp=60 --method proj_incsort:projType=rand,projDim=1024,knnAmp=200 --method proj_incsort:projType=rand,projDim=1024,knnAmp=400

Embed LM-tree

An efficient tree structure for indexing feature vectors. The-Anh Pham, Sabine Barrat, Mathieu Delalandre, Jean-Yves Ramel.

Separate indexing and searching parameters

We need to be able to specify search parameters that can be changed during the search phase. In this way, we can modify them and re-run queries without rebuilding the index.

Finish implementing test_integr

PS: a regression utility that tests our methods using a variety of parameters, in single- and multi-threaded modes. It then checks whether the recall and the improvement in the number of distance computations match pre-recorded values.
