nmslib / nmslib

Non-Metric Space Library (NMSLIB): An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

License: Apache License 2.0

Python 10.22% Shell 2.80% Makefile 0.59% Perl 5.65% C 0.51% C++ 64.22% CMake 0.63% R 0.04% Java 0.42% Thrift 0.14% Batchfile 0.60% Roff 0.58% TeX 10.09% Jupyter Notebook 3.51%
knn-search non-metric neighborhood-graphs k-nn-graphs vp-tree

nmslib's Introduction


Non-Metric Space Library (NMSLIB)

Important Notes

  • NMSLIB is generic yet fast; see the results of the ANN benchmarks.
  • A standalone implementation of our fastest method, HNSW, also exists as a header-only library.
  • All the documentation (including the Python bindings, the query server, descriptions of methods and spaces, building the library, etc.) can be found on this page.
  • For generic questions/inquiries, please use the Gitter chat; the GitHub issues page is for bugs and feature requests.

Objectives

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core library does not have any third-party dependencies. It has been gaining popularity recently; in particular, it has become a part of Amazon Elasticsearch Service.

The goal of the project is to create an effective and comprehensive toolkit for searching in generic and non-metric spaces. Even though the library contains a variety of metric-space access methods, our main focus is on generic and approximate search methods, in particular on methods for non-metric spaces. NMSLIB is possibly the first library with principled support for non-metric space searching.

NMSLIB is an extensible library: it is possible to add new search methods and distance functions. NMSLIB can be used directly in C++ and in Python (via Python bindings). In addition, it is possible to build a query server, which can be used from Java (or other languages supported by Apache Thrift, version 0.12). Java has a native client, i.e., it works on many platforms without requiring a C++ library to be installed.

Authors: Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov, David Novak. With contributions from Ben Frederickson, Lawrence Cayton, Wei Dong, Avrelin Nikita, Dmitry Yashunin, Bob Poekert, @orgoro, @gregfriedland, Scott Gigante, Maxim Andreev, Daniel Lemire, Nathan Kurz, Alexander Ponomarenko.

Brief History

NMSLIB started as a personal project of Bilegsaikhan Naidan, who created the initial code base and the Python bindings and participated in earlier evaluations. The most successful class of methods, neighborhood/proximity graphs, is represented by the Hierarchical Navigable Small World (HNSW) graph due to Malkov and Yashunin (see the publications below). Other useful methods include a modification of the VP-tree due to Boytsov and Naidan (2013), the Neighborhood APProximation index (NAPP) proposed by Tellez et al. (2013) and improved by David Novak, as well as a vanilla uncompressed inverted file.

Credits and Citing

If you find this library useful, feel free to cite our SISAP paper [BibTex] as well as the other papers listed at the end. One crucial contribution to cite is the fast Hierarchical Navigable Small World (HNSW) graph method [BibTex]. Please also check out the stand-alone HNSW implementation by Yury Malkov, which is released as the header-only HNSWLib library.

License

The code is released under the Apache License Version 2.0 (http://www.apache.org/licenses/). Older versions of the library (but not NMSLIB 2.x) included additional components with different licenses:

  • The LSHKIT, which is embedded in our library, is distributed under the GNU General Public License, see http://www.gnu.org/licenses/.
  • The k-NN graph construction algorithm NN-Descent due to Dong et al. 2011 (see the links below), which is also embedded in our library, seems to be covered by a free-to-use license, similar to Apache 2.
  • The FALCONN library's license is MIT.

Funding

Leonid Boytsov was supported by the Open Advancement of Question Answering Systems (OAQA) group and the following NSF grant #1618159: "Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond". Bileg was supported by the iAd Center.

Related Publications

The most important related papers are listed below in chronological order:

nmslib's People

Contributors

8w9ag, amoussawi, andrusha97, bejvisek, benfred, bhavaygg, bileg, bobpoekert, brendanchambers, cadovvl, deneutoy, doug-friedman, geofft, gitter-badger, gokceneraslan, gregf-atomwise, huonw, janaknat, jjjamie, jmazanec15, jschmitz28, jvkersch, mdeff, neggert, orgoro, pabs3, scottgigante, searchivarius, sjhewitt, yurymalkov


nmslib's Issues

Port NN-descent

We need at least one MORE implementation of k-NN graph construction. Don't forget to update the list of publications.

Caching of ground-truth data

Currently, in each test we randomly divide the data set into testing and indexable subsets. Then, for each division, we compute the exact result set for the NN or range search via sequential search. We also memorize the search time of the sequential search.

This is fine for fast distances and small data sets, yet rather slow for large data sets (and slow distances). We should support caching of ground-truth data. If a special option is given, we should memorize the split parameters and the ground-truth data for a specific data set (and a selection of parameters); subsequent re-runs then take almost no time.

Support minimalistic objects without the wrapper Object class

Currently, we place all our data objects inside instances of the class Object. However, this is a bit complicated for users. Perhaps a better approach is to work directly on pointers. We do need to store object metadata somewhere (id, class label, type, and the size of the data), yet this can be kept in a separate hash table where object pointers are keys.

Perhaps, one solution is a thin wrapper Object class.

This is not for the closest release, but for some more distant future.

Sync the documentation

Cover several issues:

  1. New methods.
  2. New utilities.
  3. Changes in the interface of the Space.
  4. Auto-tuning for the VP-tree (in the VP-tree section).

Move all utility code to a separate util library

Visual Studio project files should be changed as well. Currently, we have the following utility binaries, which would be better placed in a separate util/apps folder. Now they are either in the src or in the test folder:

  1. bench_distfunc
  2. bench_projection
  3. tune_vptree
  4. dummy_app
  5. report_intr_dim

Record the # of queries per second

This is a necessary (but missing) statistic for multi-threaded testing. Even though it is possible to compute it from the average query time, it would be best to produce it automatically.

Perhaps, there is a minor bug in the evaluation code

This one happens only for float; see the TODO in eval_results.h:

      /*
       * TODO: @leo These eps are quite adhoc.
       *            There can be a bug here (where approx is better than exact??), 
       *            to reproduce a situation when the below condition is triggered 
       *            for epsRel = 1e-5 and epsAbs = 1e-5:
       *            release/experiment  --dataFile ~/TextCollect/VectorSpaces/colors112.txt --knn 1 --testSetQty 1 --maxNumQuery 1000  --method vptree:alphaLeft=0.8,alphaRight=0.8  -s cosinesimi 
       *
       */
      if (mx > 0 && (1- mn/mx) > epsRel && (mx - mn) > epsAbs) {
        for (size_t i = 0; i < std::min(ExactDists_.size(), ApproxDists_.size()); ++i ) {
          LOG(INFO) << "Ex: " << ExactDists_[i].first <<
                       " -> Apr: " << ApproxDists_[i] <<
                       " 1 - ratio: " << (1 - mn/mx) << " diff: " << (mx - mn);
        }
      }

Add an implementation of a kernelized hashing

http://www.cse.ohio-state.edu/~kulis/klsh/klsh.htm

Papers:

Kernelized Locality-Sensitive Hashing for Scalable Image Search. Brian Kulis & Kristen Grauman. In Proc. 12th International Conference on Computer Vision (ICCV), 2009.

Also see the following related papers, which apply LSH to learned Mahalanobis metrics:

Fast Similarity Search for Learned Metrics. Brian Kulis, Prateek Jain, & Kristen Grauman. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2143-2157, 2009.

Fast Image Search for Learned Metrics. Prateek Jain, Brian Kulis, & Kristen Grauman. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

The VP-tree should be able to call the tuning procedure itself

Make it possible to call the tuning procedure (currently you can only run tune_vptree) from the VP-tree automatically. The user should be able to specify the following parameters:

  • minExp (default 1)
  • maxExp (default 1)
  • maxTuneQty (default 10000)
  • bucketSize (default 50)

Require the following: bucketSize * maxTuneQty >= dataQty.

Improve a vp-tree pivot-selection procedure (for low-dimensional spaces).

While in higher-dimensional spaces (e.g., SIFT) with narrow histograms of distances (between random pairs of points), random pivots may be almost as good as any other pivots, this is not the case of lower dimensional spaces, or high-dimensional spaces with very low intrinsic dimensionality. A user should be able to specify an option for a better pivot-selection. In particular, trying several pivots and choosing one with the highest distance spread should help in many cases.

Logging improvements

Need to handle logging better than just dumping everything to the screen. Two improvements are needed:

  1. Log-level control through a switch
  2. An ability to redirect non-fatal errors to the separate log-file

Better caching of ground-truth data

Currently, we are going to save only a fixed number of GS entries, which is 1000 by default. This is fine for most k-NN scenarios but may be insufficient for range searching. We need a "knob" to specify the maximum number of entries to save relative to the result size.

In addition, storing the complete set of all GS entries uses a lot of memory; in some cases, we simply run out of memory. As a quick fix (see commit 91298da), we never keep more than a specified number of GS entries. Obviously, this is not a satisfactory solution in general, so something cleverer needs to be done.

Getting rid of LOG(LIB_FATAL)

In many places we use LOG(LIB_FATAL), but it kills the app. This call can also reside inside CHECK.

This is hardly acceptable for a good library. We should replace most such calls with throwing an exception, taking some care to ensure this does not leak memory.

Class hierarchy for objects

We absolutely need a hierarchy of object classes that can do the following:

  1. Read an object from an input stream.
  2. Write an object to an output stream.
  3. Provide access to the data so distances can be computed.

Currently there is no way to write objects (not even for debugging purposes). Reading is done by classes inherited from Space. This is not good practice; we need to decouple this object-related functionality from spaces. We need objects such as sparse vectors, dense vectors, etc. (in the future we will add strings and other weird thingies).

Implement M-index

Metric Index: An efficient and scalable solution for precise and approximate similarity search. David Novak, Michal Batko, Pavel Zezula.

Source code is available at:

http://mufin.fi.muni.cz/trac/m-index

Implement Fagin's rank aggregation methods

Efficient similarity search and classification via rank aggregation
R. Fagin, R. Kumar, D. Sivakumar. Proceedings of the 2003 ACM …, 2003.

It's also patented:

Efficient similarity search and classification via rank aggregation
United States Patent Application 20040249831 Kind Code: A1

Better proxy for distance computations

Instead of the current approach, where the space provides a virtual function to compute distances, we need the space to provide a function pointer to compute the distance! To estimate the number of distance computations in experiments, we will simply create a wrapper space that provides a wrapper distance function. These functions will proxy computations and count their number. This will be done only during the one type of experiment where we also compute accuracy.

This is related to closed #78: for low-dimensional spaces, though, even a pointer can be an overhead.

Some motivation
Space objects can be used to compute distances only during indexing time. At search time, all distance computations are proxied through a query object. During indexing, the query parameter can be NULL, but during search one needs to supply the actual query.

This is actually not such a good approach: we need a separate object to proxy distance computations, and it should not be the query. Then functions like Projection::compProj will not need to use both pObj (at index time) and pQuery (at search time).

Object class improvements

In version 2, we will introduce several types of memory allocators (available to indexing methods that do memory allocation for data themselves) that will take care of:

  1. allocating
  2. freeing
  3. properly aligning memory (previously #9)

An Object will be renamed to DataPoint.

The hierarchy of object classes referenced in #7 will also be very useful.

These allocators will allow several modes:

  1. fully dynamic: can delete/add each object independently
  2. statically mmaped
  3. statically allocated
  4. perhaps dynamically growing allocated as a single chunk
  5. perhaps dynamically growing allocated as multiple chunks

We need to take special care to efficiently represent constant-size objects and variable-size objects.

Some motivation

It isn't clear which code should delete Object instances stored, e.g., in an ObjectVector. This is why they are not always deleted, e.g., in the utility report_intr_dim.

We need a policy that specifies Object ownership. One option is to use shared_ptr, but this choice may affect performance because reference counts are synchronized among threads.

Important note: currently, Object instances are automatically deleted only by ExperimentConfig. So we are fine in this case, but run into trouble when somebody calls Space<dist_t>::ReadDataset directly, e.g.:

  ObjectVector data;
  space->ReadDataset(data, NULL, dataFile.c_str(), maxNumData);

Incremental sorting bug

Try, e.g., commenting out incremental sorting for now:

release/experiment --dataFile /home/leonid/TextCollect/sift_texmex_learn5m.txt --maxNumData 1000000 --maxNumQuery 200 --distType double --spaceType l2 --knn 10 --testSetQty 1 --cachePrefixGS /home/leonid/GIT/NonMetricSpaceLib/similarity_search/gs_cache/sift1M --method proj_incsort:projType=rand,projDim=1024,knnAmp=60 --method proj_incsort:projType=rand,projDim=1024,knnAmp=200 --method proj_incsort:projType=rand,projDim=1024,knnAmp=400

Embed LM-tree

An efficient tree structure for indexing feature vectors. The-Anh Pham, Sabine Barrat, Mathieu Delalandre, Jean-Yves Ramel.

Separate indexing and searching parameters

We need to be able to specify search parameters that can be changed during the search phase. In this way, we can modify them and re-run queries without rebuilding the index.

Finish implementing test_integr

PS: a regression utility that tests our methods using a variety of parameters, in single- and multi-threaded modes. It then checks whether the recall and the improvement in the number of distance computations match pre-recorded values.
