manzilzaheer / covertree


Cover Tree implementation in C++ for k-Nearest Neighbours and range search

License: Apache License 2.0

Makefile 0.18% Python 0.13% CMake 0.01% C++ 98.76% C 0.92%


Tree Based Nearest Neighbor Search with Guarantees

We present parallel implementations of two tree based nearest neighbor search data structures: SG-Tree and Cover Tree.

Cover Tree

The cover tree data structure was originally presented in the first paper below and improved in the second:

Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
Mike Izbicki and Christian Shelton. "Faster cover trees." Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.

SG-Tree

SG-Tree is a new data structure for exact nearest neighbor search, inspired by the Cover Tree and its improvement, which has been used in the TerraPattern project. At a high level, SG-Tree tries to create a hierarchical tree where each node performs a "coarse" clustering. The centers of these "clusters" become the children, and subsequent insertions are recursively performed on these children. When performing the NN query, we prune out solutions based on a subset of the dimensions that are being queried. This is particularly useful when trying to find the nearest neighbor in a highly clustered subset of the data, e.g. when the data comes from a recursive mixture of Gaussians or, more generally, a time-marginalized coalescent process. The effect of these two optimizations is that our data structure is extremely simple, highly parallelizable, and comparable in performance to existing NN implementations on many datasets.
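To make the idea of pruning during an NN query concrete, here is a minimal sketch of the generic ball-bound rule used in cover-tree-style searches. It is not the SG-Tree dimension-subset pruning described above and it is not taken from this repository: the `Node` layout, the `maxdist` field, and the function names are assumptions chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <memory>
#include <vector>

using Point = std::vector<double>;

// Euclidean (L2) distance between two points of equal dimension.
double l2(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        s += d * d;
    }
    return std::sqrt(s);
}

// Hypothetical node: a center, an upper bound on the distance from the
// center to any point stored below it, and the child nodes.
struct Node {
    Point center;
    double maxdist = 0.0;
    std::vector<std::unique_ptr<Node>> children;
};

// Best candidate found so far.
struct Best {
    const Node* node = nullptr;
    double dist = std::numeric_limits<double>::infinity();
};

// Depth-first nearest-neighbour search. By the triangle inequality, no point
// under child c can be closer to q than d(q, c.center) - c.maxdist, so the
// subtree is skipped when that lower bound cannot beat the current best.
// (The child distance is recomputed in the recursive call to keep the
// sketch short.)
void nearest(const Node& n, const Point& q, Best& best) {
    const double d = l2(n.center, q);
    if (d < best.dist) {
        best.node = &n;
        best.dist = d;
    }
    for (const auto& c : n.children) {
        if (l2(c->center, q) - c->maxdist < best.dist) {
            nearest(*c, q, best);
        }
    }
}
```

Calling `nearest(root, query, best)` with a default-constructed `Best` leaves the closest node and its distance in `best`. The pruning is only sound if `maxdist` really is an upper bound for every descendant.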

Under active development

New: Moving to Python3

New: Python wrappers added

Just run python setup.py install, and then in Python you can import nntree. The Python API details are provided in API.pdf. If you do not have root privileges, install with python setup.py install --user and make sure the install folder is in your path.

Organisation

All code is under src, within the respective folders
Dependencies are provided under the lib folder
An example script for running the cover tree is provided under scripts
data is a placeholder folder for the datasets
build and dist folders will be created to hold the executables

Requirements

gcc >= 5.0 or Intel® C++ Compiler 2017 for using C++14 features

How to use

We will show how to run our Cover Tree on a single machine using a synthetic dataset.

First, compile by running make
```
  make
```
Generate a synthetic dataset
```
  python data/generateData.py
```

Run Cover Tree

   dist/cover_tree data/train_100d_1000k_1000.dat data/test_100d_1000k_10.dat

The make file has some useful features:

If you have the Intel® C++ Compiler, you can instead run
```
  make intel
```
or, if you want to use the Intel® C++ Compiler's cross-file optimization (ipo), run
```
  make inteltogether
```
Yet another alternative is to use the LLVM/Clang compiler (the minimum required version is 3.4, for C++14 support)
```
  make llvm
```

For this to work under Linux, you will probably have to install at least these packages (version 3.4 or later): clang libc++-dev

Also you can selectively compile individual modules by specifying
```
  make <module-name>
```
or clean individually by
```
  make clean-<module-name>
```

Performance

Based on our evaluation, the implementation is scalable and efficient. For example, on an Amazon EC2 c4.8xlarge instance, we could insert more than 1 million vectors of 1000 dimensions in Euclidean space (L2 norm) in under 250 seconds. At query time, we can process more than 300 queries per second per core.

Troubleshooting

If the build fails with an error like "instruction not found", the system most probably does not support the AVX2 instruction set. To solve this issue, change march=core-avx2 to march=corei7 in setup.py and src/cover_tree/makefile.
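Before editing the flags, it can help to confirm whether the CPU actually reports AVX2. The following is a minimal sketch, assuming GCC or Clang on an x86 machine; `__builtin_cpu_supports` is a compiler builtin, and the helper name `cpu_has_avx2` is made up for this example and is not part of the repository.

```cpp
// Returns true when the running CPU reports AVX2 support.
// On compilers or architectures without the builtin, this conservatively
// returns false rather than guessing.
bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}
```

If this returns false on the build machine, the march=corei7 change described above is the likely fix.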


covertree's Issues

python wrapper ImportError: undefined symbol: _ZTINSt6thread6_StateE

I cloned the repository and installed the python module using setup.py.

When I tried to import the library in a script using

import covertree

I get the following

File "../knn/agents.py", line 3, in <module>
import covertree
File "/home/joe/anaconda3/envs/knnenv/lib/python2.7/site-packages/covertree/__init__.py", line 1, in <module>
from covertree import CoverTree
File "/home/joe/anaconda3/envs/knnenv/lib/python2.7/site-packages/covertree/covertree.py", line 1, in <module>
import covertreec
ImportError: /home/joe/anaconda3/envs/knnenv/lib/python2.7/site-packages/covertreec.so: undefined symbol: _ZTINSt6thread6_StateE

incorrect computation of nearest neighbor

Hello, I am finding that the nearestNeighbour queries are not giving the correct result.

In the attached test case, I have three 2D points in the cover tree, and for two test points I do get the correct distances from kNearestneighbours (k=3), but the reported indices differ from the direct computation and the single nearest neighbor is incorrect.

I will try to look into the code in detail to figure out the cause. But if you have any ideas as to why this could be happening, I would appreciate the fix for this.

The attached program produces the following output.

Number of OpenMP threads: 1
 adding 3 points to cover tree!
Entered case 1: 3.31402 1 0
Requesting global lock!
 testing for nearest neighbors!

 query point 0 : (-0.944485 0.116473)
 doing 3-nearest neighbors using direct computation!
 	 nearest 0 : 1, 1.438655
 	 nearest 1 : 2, 1.651407
 	 nearest 2 : 0, 2.003489
 doing 3-nearest neighbors using cover_tree!
	 cover_tree 0 : 2, 1.438655
	 cover_tree 1 : 0, 1.651407
	 cover_tree 2 : 0, 2.003489
 nearest	 :: direct = (1, 1.438655)	 cover_tree = (0, 1.651407)
---------------------------------------------------------------------> mismatch

 query point 1 : (-0.931471 0.781848)
 doing 3-nearest neighbors using direct computation!
 	 nearest 0 : 1, 1.537557
 	 nearest 1 : 2, 2.114969
 	 nearest 2 : 0, 2.256513
 doing 3-nearest neighbors using cover_tree!
	 cover_tree 0 : 2, 1.537557
	 cover_tree 1 : 0, 2.114969
	 cover_tree 2 : 0, 2.256513
 nearest	 :: direct = (1, 1.537557)	 cover_tree = (0, 2.114969)
---------------------------------------------------------------------> mismatch

main.cpp.txt
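For readers who want to reproduce the comparison, the "direct computation" baseline can be sketched as a brute-force scan. This is an illustrative reconstruction, not the attached main.cpp; `knn_direct`, `Hit`, and `l2` are names invented here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Point = std::vector<double>;
using Hit = std::pair<std::size_t, double>;  // (index into data, distance)

// Euclidean (L2) distance between two points of equal dimension.
double l2(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        s += d * d;
    }
    return std::sqrt(s);
}

// Brute-force k-NN: compute the distance to every point, then keep the k
// smallest, sorted closest first. Returns fewer than k hits when data has
// fewer than k points.
std::vector<Hit> knn_direct(const std::vector<Point>& data, const Point& q,
                            std::size_t k) {
    std::vector<Hit> all;
    all.reserve(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        all.emplace_back(i, l2(data[i], q));
    }
    k = std::min(k, all.size());
    std::partial_sort(all.begin(), all.begin() + k, all.end(),
                      [](const Hit& a, const Hit& b) { return a.second < b.second; });
    all.resize(k);
    return all;
}
```

When checking a tree against this baseline, compare both fields of each hit. In the output above the distances agree while the indices do not, so a distance-only check would miss the mismatch.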

support for removal of points

When will support for removal of points become available?

allow different distance metrics and point classes

you hardcode euclidean distance and Eigen::VectorXd points. i would prefer to have the options of

using points that can be owned, or slices of (sparse or dense) matrices.
using other distances.

i.e. i’d like to have

template<class Point>
class CoverTree {
   ...
}

with Point needing to specify distance and operator==, e.g.:

template<class Distance, class Vector>
class IndexedPoint {
private:
	Vector _vec;
	size_t _idx;
public:
	IndexedPoint(Vector v, size_t i) : _vec(v), _idx(i) {}
	const Vector& vec() const { return this->_vec; }
	size_t        idx() const { return this->_idx; }
	// IndexedPoint inside the class names the current specialisation,
	// so both template arguments are implied.
	bool operator==(const IndexedPoint& p) const {
		return is_true(all(this->_vec == p.vec()));
	}
	double distance(const IndexedPoint& p) const {
		return Distance::distance(*this, p);
	}
};

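class CosineDistance {
public:
	// Templated on Vector because IndexedPoint takes two template arguments.
	template<class Vector>
	static double distance(const IndexedPoint<CosineDistance, Vector>& p1, const IndexedPoint<CosineDistance, Vector>& p2) {
		return 1 - cor(p1.vec(), p2.vec());
	}
};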

class EuclideanDistance { ... }

then i can do:

CoverTree<IndexedPoint<EuclideanDistance, Eigen::VectorXd>> ct;
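A self-contained variant of this proposal, using only the standard library, might look as follows. It is a sketch under assumptions: `std::vector<double>` stands in for the vector type, the policies operate on the raw vectors rather than on the points, and cosine distance here means 1 minus cosine similarity (the proposal above uses `cor` instead). None of this is the repository's current API.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Distance policies: each exposes a static distance() over any container of
// doubles that provides size() and operator[].
struct EuclideanDistance {
    template <class Vector>
    static double distance(const Vector& a, const Vector& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            const double d = a[i] - b[i];
            s += d * d;
        }
        return std::sqrt(s);
    }
};

struct CosineDistance {
    // 1 - cosine similarity. Undefined for zero vectors (division by zero).
    template <class Vector>
    static double distance(const Vector& a, const Vector& b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (std::sqrt(na) * std::sqrt(nb));
    }
};

// A point that carries its index and delegates distance to the policy.
template <class Distance, class Vector = std::vector<double>>
class IndexedPoint {
public:
    IndexedPoint(Vector v, std::size_t i) : vec_(std::move(v)), idx_(i) {}
    const Vector& vec() const { return vec_; }
    std::size_t idx() const { return idx_; }
    bool operator==(const IndexedPoint& p) const { return vec_ == p.vec_; }
    double distance(const IndexedPoint& p) const {
        return Distance::distance(vec_, p.vec_);
    }

private:
    Vector vec_;
    std::size_t idx_;
};
```

One caveat on the design: 1 minus cosine similarity does not satisfy the triangle inequality, so a tree that prunes with triangle-inequality bounds would no longer be guaranteed exact with that policy.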

Python API segfaults on small datasets

I'm getting a segfault when using the Python API (very grateful for that addition by the way =D), am I using it correctly? The API reference PDF doesn't list the constructor. I'm using gcc/g++7.1.0 and Python 2.7, see the gdb trace below.

>>> ct = CoverTree.from_matrix(document_embeddings)
Faster Cover Tree with base 1.3
Max distance: 65.5565
Scale chosen: 18
100% [==================================================]
[New Thread 0x7ffff13fe700 (LWP 11853)]
[New Thread 0x7fffe8bfd700 (LWP 11854)]
Duplicate entry!!!

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe8bfd700 (LWP 11854)]
run (src=..., dst=...) at lib/Eigen/src/Core/Assign.h:410
410 dst.template copyPacket<Derived2, dstAlignment, srcAlignment>(index, src);

Question about setting maxdistUB on insert

I found that in bool CoverTree::insert(const pointType& p), several statements that set maxdistUB have been commented out. I wonder how the tree works without maxdistUB being set properly.
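For context, this is roughly how a descendant-distance upper bound is usually maintained in a simplified cover tree insert, in the style of the "Faster cover trees" paper. It is a hedged sketch and not the code from CoverTree::insert: the `Node` layout and function names are assumptions, and the base 1.3 is borrowed from the "Faster Cover Tree with base 1.3" line in the log output of another issue.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Point = std::vector<double>;

// Euclidean (L2) distance between two points of equal dimension.
double l2(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        s += d * d;
    }
    return std::sqrt(s);
}

// Covering radius of a node at the given level (base 1.3 is an assumption).
double covdist(int level) { return std::pow(1.3, level); }

// Hypothetical node: maxdistUB bounds the distance from center to any
// point stored in the subtree below this node.
struct Node {
    Point center;
    int level;
    double maxdistUB = 0.0;
    std::vector<std::unique_ptr<Node>> children;
    Node(Point c, int l) : center(std::move(c)), level(l) {}
};

// Simplified insert. Precondition: l2(n.center, p) <= covdist(n.level).
// Every node on the insertion path widens its bound to include p, so the
// bound stays valid for query-time pruning.
void insert(Node& n, const Point& p) {
    n.maxdistUB = std::max(n.maxdistUB, l2(n.center, p));
    for (auto& c : n.children) {
        if (l2(c->center, p) <= covdist(c->level)) {
            insert(*c, p);
            return;
        }
    }
    n.children.push_back(std::make_unique<Node>(p, n.level - 1));
}
```

If the bound is never updated, any query that prunes with it can discard subtrees that actually contain closer points. Whether the repository compensates for the commented-out statements in some other way is not something this sketch can answer.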

how to check the result from covertree

Hi,
I tried to use this demo as a test, and I ran it like this:

dist/cover_tree data/train_100d_1000k_1000.dat data/test_100d_1000k_10.dat
data/train_100d_1000k_1000.dat
data/test_100d_1000k_10.dat
Number of OpenMP threads: 8
Number of points: 1000000
Number of dims : 100
56.5687 85.2284 -26.9832
Build time: 62972
Number of OpenMP threads: 8
Number of points: 10000
Number of dims : 100
99.7609 -40.0263 99.2302
Querying serially
Quering parallely
k-NN serially
range serially
Query time: 4394231
sh: 1: pause: not found

How can I check the output of the cover tree, e.g. the indices and distances?

Thank you.

Query time for large K values

Hey, can you share an analysis of how query time changes as the value of K increases? We tried K=200, but it is taking a long time to query. There are 21 million test points.