lshash's People

Contributors

kayzhu

lshash's Issues

Why can't I get the exact number of results?

Hi, I set r = lsh.query(inputs[200], distance_func='hamming', num_results=20), expecting 20 returned results, but sometimes I get only one result (or fewer than 20 items). Can you tell me the reason, or how I should fix it?
Thanks a lot,
Sincerely
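
A likely explanation, not confirmed against the library internals: num_results appears to be an upper bound, and query() only ranks candidates that land in the same bucket as the query point in at least one hash table, so it can return fewer than num_results items. A hedged sketch of one way to get more candidates, reusing inputs from the question above and assuming the usual LSHash(hash_size, input_dim, num_hashtables) constructor:

    from lshash import LSHash

    # More tables and a shorter hash make bucket collisions (and therefore
    # candidates) more likely; the parameter values here are illustrative.
    lsh = LSHash(8, len(inputs[0]), num_hashtables=5)
    for v in inputs:
        lsh.index(v)
    r = lsh.query(inputs[200], distance_func='hamming', num_results=20)
    print(len(r))  # still capped at 20, but usually closer to it with more tables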

query method inefficiently implemented?

In its current form, it seems that LSHash.query simply computes a hash for the query vector and then computes the Hamming distance manually for each point in each hash table. That means we are doing num_hashtables * num_indexed_points distance computations for every query.

Specifically, the problem is here:
https://github.com/kayzh/LSHash/blob/master/lshash/lshash.py#L237
We should not need to iterate through all keys in the index just to query one vector.

From what I understand, the whole point of LSH is to do hashed lookups rather than scanning all our data. Here we are computing hashes and then not using the one feature that makes them so fast, the fact that we can throw them into an index and do lookups in O(1) or at least O(log(n)). Lookups in O(n) are what LSH is supposed to be replacing. This implementation may still be fast enough for some use cases since the hamming distance computations might be cheaper than dot products on large vectors, but it does not scale in the number of samples in the index, which I would see as a serious problem.

Any thoughts on this? Please correct me if I got anything wrong here. I don't mean to complain; this is a library I enjoy using, but it seems that a core part of it could be made more efficient.
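
For reference, a sketch of the bucket-lookup behaviour the issue is describing, written against plain numpy rather than the library (the dict keyed on the binary hash is the point being illustrated):

    import numpy as np
    from collections import defaultdict

    np.random.seed(0)
    planes = np.random.randn(8, 64)        # one table: hash_size=8, input_dim=64
    buckets = defaultdict(list)            # binary hash -> indices of stored points

    def signature(v):
        # Sign of each random projection, as a hashable bit string.
        return ''.join('1' if d > 0 else '0' for d in planes.dot(v))

    data = np.random.randn(10000, 64)
    for i, v in enumerate(data):
        buckets[signature(v)].append(i)

    q = data[123] + 0.01 * np.random.randn(64)
    candidates = buckets.get(signature(q), [])   # O(1) bucket fetch
    # Rank only the bucket members, not all 10000 indexed points.
    ranked = sorted(candidates, key=lambda i: np.linalg.norm(data[i] - q))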

using on word vectors

I want to use this on a bunch of word vectors and find the similar ones.

Should I first index all of the vectors and then query each one again to find its bucket?
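
A minimal sketch of that workflow, assuming the LSHash API from the README; word_vectors and the 300-dimensional toy data are placeholders:

    import numpy as np
    from lshash import LSHash

    np.random.seed(0)
    word_vectors = {'w%d' % i: np.random.randn(300) for i in range(1000)}  # hypothetical data

    lsh = LSHash(10, 300, num_hashtables=3)
    # Index every vector once, attaching the word via extra_data so results
    # are self-describing; there is no need to look bucket numbers up by hand.
    for word, vec in word_vectors.items():
        lsh.index(vec, extra_data=word)

    # Then query with any vector to get its near neighbours directly.
    neighbours = lsh.query(word_vectors['w0'], num_results=10, distance_func='cosine')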

This still doesn't work

Yeah, I'm getting the same errors as everyone else. Is it possible that PyPI wasn't updated with your latest edits?

pip install error

I am trying to install this package. When I run pip install, it gives me this error:

pip install lshash==0.0.4dev
Collecting lshash==0.0.4dev
Using cached lshash-0.0.4dev.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash/setup.py", line 3, in <module>
import lshash
File "/private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash/lshash/__init__.py", line 12, in <module>
from lshash import LSHash
ImportError: cannot import name 'LSHash'

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash
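
The traceback is consistent with a Python-2-style implicit relative import inside the package's __init__.py; under Python 3 that line would have to be an explicit relative import, roughly:

    # lshash/__init__.py -- sketch of the Python-3-compatible form.
    # The failing line `from lshash import LSHash` only resolves on Python 2,
    # where it implicitly means "from the lshash module in this package".
    from .lshash import LSHash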

Anyone know how to change or add a distance function?

lshash.py line 298 looks like this:

    @staticmethod
    def euclidean_dist(x, y):
        """ This is a hot function, hence some optimizations are made. """
        diff = np.array(x) - y
        return np.sqrt(np.dot(diff, diff))

I changed the diff like this:
diff = np.array(x)-y changed to np.array(x)-(mean(x)-mean(y))-y

But when I print(lsh.query([3,4,5,3,4,5,3,4], distance_func="euclidean")), the result is the same as with the original euclidean_dist.
It seems this is not the correct way to change the distance function.
Could anyone tell me how to change the distance function?
Thanks a lot!
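
One possible explanation, worth verifying against lshash.py: the "euclidean" option may dispatch to a different helper (a squared-distance function) rather than euclidean_dist, in which case editing euclidean_dist would not change the "euclidean" results at all. A sketch that avoids editing the library entirely, by re-ranking the returned candidates with a custom metric (the metric below just mirrors the modification from the question):

    import numpy as np
    from lshash import LSHash

    def my_dist(x, y):
        # Hypothetical custom metric from the question: remove the difference
        # of the means before taking the Euclidean distance.
        diff = np.array(x) - (np.mean(x) - np.mean(y)) - np.array(y)
        return np.sqrt(np.dot(diff, diff))

    lsh = LSHash(6, 8)
    lsh.index([1, 2, 3, 4, 5, 6, 7, 8])
    lsh.index([10, 12, 99, 1, 5, 31, 2, 3])

    q = [3, 4, 5, 3, 4, 5, 3, 4]
    # query() returns (point, distance) pairs; keep the candidate points and
    # sort them again with the custom metric.
    candidates = [point for point, _ in lsh.query(q)]
    reranked = sorted(candidates, key=lambda p: my_dist(q, p))
    print(reranked)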

Sparse Matrix

Is it possible to use this package with a sparse matrix?
I mean without converting it to a dense matrix, i.e. without doing something like this:

np.ndarray.flatten(vector.toarray())

Actually, I want to use Locality Sensitive Hashing on text data, and currently I convert the text to vectors with scikit-learn's feature extraction module.
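
As far as the README shows, index() takes a flat dense vector per call, so the sparse matrix itself cannot be passed in directly; a sketch of the usual compromise, densifying one row at a time so memory stays bounded (the sklearn usage here is illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from lshash import LSHash

    docs = ["first toy document", "second toy document", "another short text"]
    X = TfidfVectorizer().fit_transform(docs)        # scipy CSR sparse matrix

    lsh = LSHash(8, X.shape[1])
    for i in range(X.shape[0]):
        # Convert a single row to dense just before indexing, and keep the
        # row number as extra_data so results can be traced back.
        lsh.index(X[i].toarray().ravel(), extra_data=i)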

Is there any way to also obtain the indices in the result of a query?

Hi,

I was wondering if there is any way to also obtain the indices of the results of a query, since a lot of the time we may need to, for example, go back to some sort of embedding and see which element a result represents.

Thank you so much for the amazing work!
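
One way to get this with the current API, assuming extra_data is returned alongside the matched point as the README example suggests, is to attach the row index at indexing time:

    import numpy as np
    from lshash import LSHash

    np.random.seed(0)
    embeddings = np.random.randn(500, 32)          # hypothetical embedding matrix

    lsh = LSHash(8, 32)
    for i, vec in enumerate(embeddings):
        lsh.index(vec, extra_data=i)               # carry the row index along

    for match, dist in lsh.query(embeddings[42], num_results=5):
        point, row_index = match                   # match should be ((coords...), extra_data)
        print(row_index, dist)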

Is faster query time possible?

Hi,

I was wondering if there would be any way to reduce the query time.
For example, for my use case with the following parameters, a single query takes 0.2 s, which would be too slow for querying my whole dataset:

    lsh = LSHash(10, 300)
    lsh.query(example_vector, num_results=5)   # changing num_results has no effect on the run time

Any suggestions would be appreciated!

Thank you
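
Not a definitive answer, but two things that seem consistent with the observation above: num_results likely only truncates the ranked list after all candidate distances are computed (hence no effect on run time), and a larger hash_size spreads points over more, smaller buckets, so each query has fewer candidates to rank (at some cost in recall). A rough timing sketch under those assumptions:

    import timeit
    import numpy as np
    from lshash import LSHash

    np.random.seed(0)
    data = np.random.randn(20000, 300)             # illustrative dataset size
    query_vec = data[0]

    for hash_size in (10, 16, 24):
        lsh = LSHash(hash_size, 300)
        for v in data:
            lsh.index(v)
        t = timeit.timeit(lambda: lsh.query(query_vec, num_results=5), number=20)
        print(hash_size, t / 20)                   # average seconds per query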

ImportError: cannot import name 'LSHash'

I pip installed lshash and attempted to import LSHash; however, I get the above error.

versions:
python - 3.6.7
numpy - 1.15.4

I am operating within a conda environment.

projection type

The code uses np.random.randn() times the input vector.
In the LSH survey paper, we use either (Gaussian distribution * input + bias)/W or (uniform distribution * input). I was wondering if we should change the distribution to uniform in the code?
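
For reference, a sketch of the two standard families being contrasted, in plain numpy rather than the library's code: the sign-of-Gaussian-projection hash used for cosine similarity, and the p-stable scheme h(x) = floor((a.x + b) / w) used for Euclidean distance, where b is drawn uniformly from [0, w):

    import numpy as np

    rng = np.random.RandomState(0)
    dim, w = 64, 4.0
    a = rng.randn(dim)                 # Gaussian projection vector
    b = rng.uniform(0, w)              # uniform offset, only used by the p-stable hash

    def sign_hash(x):
        # Random-hyperplane LSH (cosine similarity): one bit per plane.
        return int(np.dot(a, x) > 0)

    def p_stable_hash(x):
        # p-stable LSH (Euclidean distance): quantised projection.
        return int(np.floor((np.dot(a, x) + b) / w))

    x = rng.randn(dim)
    print(sign_hash(x), p_stable_hash(x))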

LSH family for Hamming distance

I've been learning the LSH algorithm recently, and I found your implementation. Quite useful to me!
But as far as I know, there are different LSH families for different distance measures. It seems the index method you use is random hyperplanes for cosine distance, am I right?
I'd like to ask you:

  1. Is it necessary or correct to extend the index function to support different distance measures?
  2. If it is, will you do it, or may I?
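
For context, a sketch of the classic bit-sampling family for Hamming distance (each hash function reads a fixed random subset of coordinates of a binary vector); this is a different family from the random-hyperplane hashing the index method appears to use:

    import numpy as np

    rng = np.random.RandomState(0)
    dim, k = 64, 8
    sampled_bits = rng.choice(dim, size=k, replace=False)   # fixed per hash table

    def hamming_hash(x):
        # x is a 0/1 vector; the hash is the concatenation of the k sampled bits.
        return ''.join(str(int(x[i])) for i in sampled_bits)

    x = rng.randint(0, 2, size=dim)
    y = x.copy()
    y[:3] ^= 1                                   # a near neighbour in Hamming distance
    print(hamming_hash(x), hamming_hash(y))      # equal with high probability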

pip install fails

c:\Python33\Scripts>pip install lshash
Downloading/unpacking lshash
Could not find a version that satisfies the requirement lshash (from versions:
0.0.2dev, 0.0.3dev, 0.0.4dev)
Cleaning up...
No distributions matching the version for lshash
Storing complete log in C:\Users\t-hanans\pip\pip.log
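
Only dev releases are listed (0.0.2dev through 0.0.4dev), and pip has excluded pre-release/dev versions from its default resolution since version 1.4, so, assuming that is the cause here, either of these usually gets past the error:

    pip install --pre lshash
    pip install lshash==0.0.4dev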

save hashtable to .npz file

I have installed redis successfully, and I am constructing the index with the arguments
storage_config={"redis": {"host": 'localhost', "port": 6379}}, matrices_filename="/home/username/filename.npz"
but filename.npz is never created, nor is the hash table stored in redis.
The program and queries run successfully and return the output vectors, but no new .npz file is saved and the hash tables are not persisted.
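
A sketch of the constructor call (parameter names per the library's docstrings, worth double-checking), with two hedged observations: matrices_filename, which has to end in .npz, appears to store only the random projection planes rather than the hash tables, and the redis backend needs the redis Python client importable, not just a running server:

    from lshash import LSHash

    lsh = LSHash(
        6, 8,
        storage_config={"redis": {"host": "localhost", "port": 6379}},
        matrices_filename="/home/username/filename.npz",   # path from the issue
        overwrite=True,                                     # allow (re)writing the .npz
    )
    lsh.index([1, 2, 3, 4, 5, 6, 7, 8])   # hash table entries should land in redis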

Way to handle NaN-values

It would be nice to have a way to specify how to handle NaNs in the data, for example by ignoring them in distance calculations.
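
Until such an option exists, a workable sketch is to mask NaNs before indexing (so the random projections stay finite) and, if needed, re-rank with a NaN-aware metric outside the library:

    import numpy as np

    def nan_safe(v, fill=0.0):
        # Replace NaNs before indexing.
        v = np.asarray(v, dtype=float)
        return np.where(np.isnan(v), fill, v)

    def nan_euclidean(x, y):
        # Distance over the coordinates where both vectors are defined.
        x, y = np.asarray(x, float), np.asarray(y, float)
        mask = ~(np.isnan(x) | np.isnan(y))
        return float(np.sqrt(np.sum((x[mask] - y[mask]) ** 2)))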

How to set an appropriate hash size?

Sorry to bother you; this might be a stupid question.
I really don't know how to set the hash size. Does it depend on my data size?
In the quick start example, I tried changing LSHash(6, 8) to LSHash(3, 8) and got the same result.

lsh2 = LSHash(3, 8)
lsh2.index([1,2,3,4,5,6,7,8])
lsh2.index([10,12,99,1,5,31,2,3])
lsh2.query([1,2,3,4,5,6,7,7])
[((1, 2, 3, 4, 5, 6, 7, 8), 1), ((2, 3, 4, 5, 6, 7, 8, 9), 11)]

Thanks!
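
A hedged rule of thumb rather than anything from the docs: hash_size is the number of projection bits, so each table has up to 2**hash_size buckets; with only two indexed points almost any hash size behaves the same, and the trade-off only shows up with more data (fewer bits means more collisions, hence more candidates and higher recall but slower ranking). A toy comparison:

    import numpy as np
    from lshash import LSHash

    np.random.seed(0)
    data = np.random.randn(2000, 8)
    query_vec = data[0] + 0.05 * np.random.randn(8)

    for hash_size in (3, 6, 10):
        lsh = LSHash(hash_size, 8)
        for v in data:
            lsh.index(v)
        print(hash_size, len(lsh.query(query_vec)))   # number of candidates returned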

lshash.query() returning empty values

Hi everyone. I don't know what's going on, but my queries only return results for points I have already indexed; near neighbours come back empty. For example:

>>> import numpy as np
>>> import lshash
>>> lsh = lshash.LSHash(100,100)
>>> sample = np.zeros(100)
>>> sample[13]=1
>>> sample[43]=1
>>> sample[73]=1
>>> lsh.index(sample)
>>> sample[93]=1
>>> lsh.index(sample)
>>> sample = np.zeros(100)
>>> sample[33]=1
>>> sample[32]=1
>>> lsh.index(sample)
>>> sample[33]=0
>>> sample[93]=1
>>> lsh.query(sample)
[]
>>> sample = np.zeros(100)
>>> sample[33]=1
>>> sample[32]=1
>>> lsh.query(sample)
[((0.0, 0.0, 0.0, ..., 0.0, 0.0), 0.0)]
>>>

Am I doing something wrong?
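
A plausible reading, worth verifying: with hash_size=100 on 100-dimensional inputs, every point gets a 100-bit signature, so even close neighbours almost never share a bucket and only (near-)identical re-submissions come back. A sketch of the usual remedy, shorter hashes plus several tables:

    import numpy as np
    from lshash import LSHash

    lsh = LSHash(10, 100, num_hashtables=5)

    sample = np.zeros(100)
    sample[[13, 43, 73]] = 1
    lsh.index(sample)

    probe = sample.copy()
    probe[93] = 1                     # one extra coordinate set
    print(lsh.query(probe))           # far more likely to return the stored neighbour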
