GithubHelp home page GithubHelp logo

fcjy / hnsw Goto Github PK

View Code? Open in Web Editor NEW

This project forked from nmslib/hnswlib

0.0 2.0 0.0 148 KB

Header-only C++/python library for fast approximate nearest neighbors

Home Page: https://github.com/nmslib/hnsw

License: Apache License 2.0

CMake 0.35% Python 5.32% C++ 94.33%

hnsw's Introduction

Hnswlib - fast approximate nearest neighbor search

Header-only C++ HNSW implementation with python bindings. Paper code for the HNSW 200M SIFT experiment

Highlights:

  1. Lightweight, header-only, no dependences other than C++ 11.
  2. Interfaces for C++, python and R (https://github.com/jlmelville/rcpphnsw).
  3. Has full support for incremental index construction.
  4. Can work with custom user distances (C++).
  5. Significantly less memory footprint and faster build time compared to current nmslib's implementation, although is slightly slower in terms of the search speed.

Description of the algroithm's parameters can be found in ALGO_PARAMS.md.

Python bindings

Supported distances:

Distance parameter Equation
Squared L2 'l2' d = sum((Ai-Bi)^2)
Inner product 'ip' d = 1.0 - sum(Ai*Bi))
Cosine similarity 'cosine' d = 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi))

Note that inner product is not a metric. An element can be closer to some other element than to itself.

For other spaces use the nmslib library https://github.com/nmslib/nmslib.

Python bindings examples

import hnswlib
import numpy as np

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
data_labels = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip

# Initing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

# Element insertion (can be called several times):
p.add_items(data, data_labels)

# Controlling the recall by setting ef:
p.set_ef(50) # ef should always be > k

# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k = 1)

An example with updates after serialization/deserialization:

import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects the memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)


print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Reiniting, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")

Bindings installation

apt-get install -y python-setuptools python-pip
pip3 install pybind11 numpy setuptools
cd python_bindings
python3 setup.py install

Other implementations

200M SIFT test reproduction

To download and extract the bigann dataset:

python3 download_bigann.py

To compile:

cmake .
make all

To run the test on 200M SIFT subset:

./main

The size of the bigann subset (in millions) is controlled by the variable subset_size_milllions hardcoded in sift_1b.cpp.

HNSW example demos

References

Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." arXiv preprint arXiv:1603.09320 (2016). https://arxiv.org/abs/1603.09320

hnsw's People

Contributors

bli25 avatar ironyoung avatar jlmelville avatar moretimes avatar piem avatar searchivarius avatar thomasdelteil avatar yurymalkov avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.