lshash's People

Contributors

kayzhu

lshash's Issues

Why can't I get the exact number of results?

Hi, I set r = lsh.query(inputs[200], distance_func='hamming', num_results=20), expecting 20 returned results, but sometimes I get only one result (or fewer than 20 items). Can you tell me the reason, or how I should fix it?
Thanks a lot,
Sincerely
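
A likely explanation, not confirmed against the library internals: num_results appears to be an upper bound, and query() only ranks candidates that land in the same bucket as the query point in at least one hash table, so it can return fewer than num_results items. A hedged sketch of one way to get more candidates, reusing inputs from the question above and assuming the usual LSHash(hash_size, input_dim, num_hashtables) constructor:

    from lshash import LSHash

    # More tables and a shorter hash make bucket collisions (and therefore
    # candidates) more likely; the parameter values here are illustrative.
    lsh = LSHash(8, len(inputs[0]), num_hashtables=5)
    for v in inputs:
        lsh.index(v)
    r = lsh.query(inputs[200], distance_func='hamming', num_results=20)
    print(len(r))  # still capped at 20, but usually closer to it with more tables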

query method inefficiently implemented?

In its current form, it seems that LSHash.query simply computes a hash for the query vector and then computes the Hamming distance manually for each point in each hash table. That means we are doing num_hashtables * num_indexed_points distance computations for every query.

Specifically, the problem is here:
https://github.com/kayzh/LSHash/blob/master/lshash/lshash.py#L237
We should not need to iterate through all keys in the index just to query one vector.

From what I understand, the whole point of LSH is to do hashed lookups rather than scanning all our data. Here we are computing hashes and then not using the one feature that makes them so fast, the fact that we can throw them into an index and do lookups in O(1) or at least O(log(n)). Lookups in O(n) are what LSH is supposed to be replacing. This implementation may still be fast enough for some use cases since the hamming distance computations might be cheaper than dot products on large vectors, but it does not scale in the number of samples in the index, which I would see as a serious problem.

Any thoughts on this? Please correct me if I got anything wrong here. I don't mean to complain; this is a library I enjoy using, but it seems that a core part of it could be made more efficient.
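
For reference, a sketch of the bucket-lookup behaviour the issue is describing, written against plain numpy rather than the library (the dict keyed on the binary hash is the point being illustrated):

    import numpy as np
    from collections import defaultdict

    np.random.seed(0)
    planes = np.random.randn(8, 64)        # one table: hash_size=8, input_dim=64
    buckets = defaultdict(list)            # binary hash -> indices of stored points

    def signature(v):
        # Sign of each random projection, as a hashable bit string.
        return ''.join('1' if d > 0 else '0' for d in planes.dot(v))

    data = np.random.randn(10000, 64)
    for i, v in enumerate(data):
        buckets[signature(v)].append(i)

    q = data[123] + 0.01 * np.random.randn(64)
    candidates = buckets.get(signature(q), [])   # O(1) bucket fetch
    # Rank only the bucket members, not all 10000 indexed points.
    ranked = sorted(candidates, key=lambda i: np.linalg.norm(data[i] - q))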

using on word vectors

I want to use this on a bunch of word vectors and find the similar ones.

Should I first index all of the vectors and then query each one again to find its bucket?
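
A minimal sketch of that workflow, assuming the LSHash API from the README; word_vectors and the 300-dimensional toy data are placeholders:

    import numpy as np
    from lshash import LSHash

    np.random.seed(0)
    word_vectors = {'w%d' % i: np.random.randn(300) for i in range(1000)}  # hypothetical data

    lsh = LSHash(10, 300, num_hashtables=3)
    # Index every vector once, attaching the word via extra_data so results
    # are self-describing; there is no need to look bucket numbers up by hand.
    for word, vec in word_vectors.items():
        lsh.index(vec, extra_data=word)

    # Then query with any vector to get its near neighbours directly.
    neighbours = lsh.query(word_vectors['w0'], num_results=10, distance_func='cosine')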

This still doesn't work

Yeah, I'm getting the same errors as everyone else. Is it possible that PyPI wasn't updated with your latest edits?

pip install error

I am trying to install this package. When I run pip install, it gives me this error:

pip install lshash==0.0.4dev
Collecting lshash==0.0.4dev
Using cached lshash-0.0.4dev.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash/setup.py", line 3, in <module>
import lshash
File "/private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash/lshash/__init__.py", line 12, in <module>
from lshash import LSHash
ImportError: cannot import name 'LSHash'

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash
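
The traceback is consistent with a Python-2-style implicit relative import inside the package's __init__.py; under Python 3 that line would have to be an explicit relative import, roughly:

    # lshash/__init__.py -- sketch of the Python-3-compatible form.
    # The failing line `from lshash import LSHash` only resolves on Python 2,
    # where it implicitly means "from the lshash module in this package".
    from .lshash import LSHash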

Anyone know how to change or add a distance function?

lshash.py line 298 looks like this:

    @staticmethod
    def euclidean_dist(x, y):
        """ This is a hot function, hence some optimizations are made. """
        diff = np.array(x) - y
        return np.sqrt(np.dot(diff, diff))

I changed the diff like this:
diff = np.array(x)-y changed to np.array(x)-(mean(x)-mean(y))-y

But when I print(lsh.query([3,4,5,3,4,5,3,4], distance_func="euclidean")), the result is the same as with the original euclidean_dist.
It seems this is not the correct way to change the distance function.
Could anyone tell me how to change the distance function?
Thanks a lot!
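
One possible explanation, worth verifying against lshash.py: the "euclidean" option may dispatch to a different helper (a squared-distance function) rather than euclidean_dist, in which case editing euclidean_dist would not change the "euclidean" results at all. A sketch that avoids editing the library entirely, by re-ranking the returned candidates with a custom metric (the metric below just mirrors the modification from the question):

    import numpy as np
    from lshash import LSHash

    def my_dist(x, y):
        # Hypothetical custom metric from the question: remove the difference
        # of the means before taking the Euclidean distance.
        diff = np.array(x) - (np.mean(x) - np.mean(y)) - np.array(y)
        return np.sqrt(np.dot(diff, diff))

    lsh = LSHash(6, 8)
    lsh.index([1, 2, 3, 4, 5, 6, 7, 8])
    lsh.index([10, 12, 99, 1, 5, 31, 2, 3])

    q = [3, 4, 5, 3, 4, 5, 3, 4]
    # query() returns (point, distance) pairs; keep the candidate points and
    # sort them again with the custom metric.
    candidates = [point for point, _ in lsh.query(q)]
    reranked = sorted(candidates, key=lambda p: my_dist(q, p))
    print(reranked)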

Sparse Matrix

Is it possible to use this package with a sparse matrix?
I mean without converting it to a dense matrix, i.e. without doing something like this:

np.ndarray.flatten(vector.toarray())

Actually, I want to use Locality Sensitive Hashing on text data, and currently I convert the text to vectors with scikit-learn's feature extraction module.
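
As far as the README shows, index() takes a flat dense vector per call, so the sparse matrix itself cannot be passed in directly; a sketch of the usual compromise, densifying one row at a time so memory stays bounded (the sklearn usage here is illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from lshash import LSHash

    docs = ["first toy document", "second toy document", "another short text"]
    X = TfidfVectorizer().fit_transform(docs)        # scipy CSR sparse matrix

    lsh = LSHash(8, X.shape[1])
    for i in range(X.shape[0]):
        # Convert a single row to dense just before indexing, and keep the
        # row number as extra_data so results can be traced back.
        lsh.index(X[i].toarray().ravel(), extra_data=i)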

Is there any way to also obtain the indices in the result of a query?

Hi,

I was wondering if there is any way to also obtain the indices of the results of a query, since a lot of the time we may need to, for example, go back to some sort of embedding and see which element a result represents.

Thank you so much for the amazing work!
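
One way to get this with the current API, assuming extra_data is returned alongside the matched point as the README example suggests, is to attach the row index at indexing time:

    import numpy as np
    from lshash import LSHash

    np.random.seed(0)
    embeddings = np.random.randn(500, 32)          # hypothetical embedding matrix

    lsh = LSHash(8, 32)
    for i, vec in enumerate(embeddings):
        lsh.index(vec, extra_data=i)               # carry the row index along

    for match, dist in lsh.query(embeddings[42], num_results=5):
        point, row_index = match                   # match should be ((coords...), extra_data)
        print(row_index, dist)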

Is faster query time possible?

Hi,

I was wondering if there would be any way to reduce the query time.
For example, for my use case with the following parameters, a single query takes 0.2 s, which would be too slow for querying my whole dataset:

    lsh = LSHash(10, 300)
    lsh.query(example_vector, num_results=5)   # changing num_results has no effect on the run time

Any suggestions would be appreciated!

Thank you
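
Not a definitive answer, but two things that seem consistent with the observation above: num_results likely only truncates the ranked list after all candidate distances are computed (hence no effect on run time), and a larger hash_size spreads points over more, smaller buckets, so each query has fewer candidates to rank (at some cost in recall). A rough timing sketch under those assumptions:

    import timeit
    import numpy as np
    from lshash import LSHash

    np.random.seed(0)
    data = np.random.randn(20000, 300)             # illustrative dataset size
    query_vec = data[0]

    for hash_size in (10, 16, 24):
        lsh = LSHash(hash_size, 300)
        for v in data:
            lsh.index(v)
        t = timeit.timeit(lambda: lsh.query(query_vec, num_results=5), number=20)
        print(hash_size, t / 20)                   # average seconds per query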

ImportError: cannot import name 'LSHash'

I pip installed lshash and attempted to import LSHash; however, I get the above error.

versions:
python - 3.6.7
numpy - 1.15.4

I am operating within a conda environment.

projection type

The code uses np.random.randn() times the input vector.
In the LSH survey paper, we use either (Gaussian distribution * input + bias)/W or (uniform distribution * input). I was wondering if we should change the distribution to uniform in the code?
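
For reference, a sketch of the two standard families being contrasted, in plain numpy rather than the library's code: the sign-of-Gaussian-projection hash used for cosine similarity, and the p-stable scheme h(x) = floor((a.x + b) / w) used for Euclidean distance, where b is drawn uniformly from [0, w):

    import numpy as np

    rng = np.random.RandomState(0)
    dim, w = 64, 4.0
    a = rng.randn(dim)                 # Gaussian projection vector
    b = rng.uniform(0, w)              # uniform offset, only used by the p-stable hash

    def sign_hash(x):
        # Random-hyperplane LSH (cosine similarity): one bit per plane.
        return int(np.dot(a, x) > 0)

    def p_stable_hash(x):
        # p-stable LSH (Euclidean distance): quantised projection.
        return int(np.floor((np.dot(a, x) + b) / w))

    x = rng.randn(dim)
    print(sign_hash(x), p_stable_hash(x))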

LSH family for Hamming distance

I've been learning the LSH algorithm recently, and I found your implementation. Quite useful to me!
But as far as I know, there are different LSH families for different distance measures. It seems the index method you use is random hyperplanes for cosine distance, am I right?
I'd like to ask you:

  1. Is it necessary or correct to extend the index function to support different distance measures?
  2. If it is, will you do it, or may I?
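
For context, a sketch of the classic bit-sampling family for Hamming distance (each hash function reads a fixed random subset of coordinates of a binary vector); this is a different family from the random-hyperplane hashing the index method appears to use:

    import numpy as np

    rng = np.random.RandomState(0)
    dim, k = 64, 8
    sampled_bits = rng.choice(dim, size=k, replace=False)   # fixed per hash table

    def hamming_hash(x):
        # x is a 0/1 vector; the hash is the concatenation of the k sampled bits.
        return ''.join(str(int(x[i])) for i in sampled_bits)

    x = rng.randint(0, 2, size=dim)
    y = x.copy()
    y[:3] ^= 1                                   # a near neighbour in Hamming distance
    print(hamming_hash(x), hamming_hash(y))      # equal with high probability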

pip install fails

c:\Python33\Scripts>pip install lshash
Downloading/unpacking lshash
Could not find a version that satisfies the requirement lshash (from versions:
0.0.2dev, 0.0.3dev, 0.0.4dev)
Cleaning up...
No distributions matching the version for lshash
Storing complete log in C:\Users\t-hanans\pip\pip.log
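
Only dev releases are listed (0.0.2dev through 0.0.4dev), and pip has excluded pre-release/dev versions from its default resolution since version 1.4, so, assuming that is the cause here, either of these usually gets past the error:

    pip install --pre lshash
    pip install lshash==0.0.4dev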

save hashtable to .npz file

I have installed redis successfully, and I am constructing the index with the arguments
storage_config={"redis": {"host": 'localhost', "port": 6379}}, matrices_filename="/home/username/filename.npz"
but filename.npz is never created, nor is the hash table stored in redis.
The program and queries run successfully and return the output vectors, but no new .npz file is saved and the hash tables are not persisted.
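
A sketch of the constructor call (parameter names per the library's docstrings, worth double-checking), with two hedged observations: matrices_filename, which has to end in .npz, appears to store only the random projection planes rather than the hash tables, and the redis backend needs the redis Python client importable, not just a running server:

    from lshash import LSHash

    lsh = LSHash(
        6, 8,
        storage_config={"redis": {"host": "localhost", "port": 6379}},
        matrices_filename="/home/username/filename.npz",   # path from the issue
        overwrite=True,                                     # allow (re)writing the .npz
    )
    lsh.index([1, 2, 3, 4, 5, 6, 7, 8])   # hash table entries should land in redis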

Way to handle NaN-values

It would be nice to have a way to specify how to handle NaNs in the data, for example by ignoring them in distance calculations.
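
Until such an option exists, a workable sketch is to mask NaNs before indexing (so the random projections stay finite) and, if needed, re-rank with a NaN-aware metric outside the library:

    import numpy as np

    def nan_safe(v, fill=0.0):
        # Replace NaNs before indexing.
        v = np.asarray(v, dtype=float)
        return np.where(np.isnan(v), fill, v)

    def nan_euclidean(x, y):
        # Distance over the coordinates where both vectors are defined.
        x, y = np.asarray(x, float), np.asarray(y, float)
        mask = ~(np.isnan(x) | np.isnan(y))
        return float(np.sqrt(np.sum((x[mask] - y[mask]) ** 2)))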

How to set an appropriate hash size?

Sorry to bother you; this might be a stupid question.
I really don't know how to set the hash size. Does it depend on my data size?
In the quick start example, I tried changing LSHash(6, 8) to LSHash(3, 8) and got the same result.

lsh2 = LSHash(3, 8)
lsh2.index([1,2,3,4,5,6,7,8])
lsh2.index([10,12,99,1,5,31,2,3])
lsh2.query([1,2,3,4,5,6,7,7])
[((1, 2, 3, 4, 5, 6, 7, 8), 1), ((2, 3, 4, 5, 6, 7, 8, 9), 11)]

Thanks!
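
A hedged rule of thumb rather than anything from the docs: hash_size is the number of projection bits, so each table has up to 2**hash_size buckets; with only two indexed points almost any hash size behaves the same, and the trade-off only shows up with more data (fewer bits means more collisions, hence more candidates and higher recall but slower ranking). A toy comparison:

    import numpy as np
    from lshash import LSHash

    np.random.seed(0)
    data = np.random.randn(2000, 8)
    query_vec = data[0] + 0.05 * np.random.randn(8)

    for hash_size in (3, 6, 10):
        lsh = LSHash(hash_size, 8)
        for v in data:
            lsh.index(v)
        print(hash_size, len(lsh.query(query_vec)))   # number of candidates returned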

lshash.query() returning empty values

Hi everyone. I don't know what's going on, but my queries only return results for points I have already indexed; near neighbours come back empty. For example:

>>> import numpy as np
>>> import lshash
>>> lsh = lshash.LSHash(100,100)
>>> sample = np.zeros(100)
>>> sample[13]=1
>>> sample[43]=1
>>> sample[73]=1
>>> lsh.index(sample)
>>> sample[93]=1
>>> lsh.index(sample)
>>> sample = np.zeros(100)
>>> sample[33]=1
>>> sample[32]=1
>>> lsh.index(sample)
>>> sample[33]=0
>>> sample[93]=1
>>> lsh.query(sample)
[]
>>> sample = np.zeros(100)
>>> sample[33]=1
>>> sample[32]=1
>>> lsh.query(sample)
[((0.0, 0.0, 0.0, ..., 0.0, 0.0), 0.0)]
>>>

Am I doing something wrong?
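
A plausible reading, worth verifying: with hash_size=100 on 100-dimensional inputs, every point gets a 100-bit signature, so even close neighbours almost never share a bucket and only (near-)identical re-submissions come back. A sketch of the usual remedy, shorter hashes plus several tables:

    import numpy as np
    from lshash import LSHash

    lsh = LSHash(10, 100, num_hashtables=5)

    sample = np.zeros(100)
    sample[[13, 43, 73]] = 1
    lsh.index(sample)

    probe = sample.copy()
    probe[93] = 1                     # one extra coordinate set
    print(lsh.query(probe))           # far more likely to return the stored neighbour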
