tdebatty / java-lsh


291.0 291.0 83.0 310 KB

A Java implementation of Locality Sensitive Hashing (LSH)

License: Other

Java 100.00%


java-lsh's Issues

A doubt

Hello,

I'm using the java-LSH code, and I think it is a great project.

LSH is a technique for handling high-dimensional datasets, for instance datasets that have 100000 features, or even more...

When I run the examples SuperBitExample, SuperBitSparseExample or LSHSuperBitExample, they run fine. However, if I increase the number of dimensions, for instance to 1000, the program becomes very slow.

Can I use this project to work with high-dimensional datasets?

Best regards,

Oscar

Plot of Similarity

Hi,
How are you drawing the Jaccard and cosine plots in your example files? What are your axis inputs?

Missing utils classes

GRADLE: info.debatty:java-lsh:0.11

The class info.debatty.java.lsh.SuperBit has a dependency on the following classes from the "info.debatty.java.utils" package. However, these classes are not in the source code base.

info.debatty.java.utils.SparseDoubleVector;
info.debatty.java.utils.SparseIntegerVector;


Avoid Java serialization

Java serialization can be quite tricky to get right, especially as code changes over time. It would be nice if client code had the option of passing a seed value to the LSH classes. The seed would be passed through to the Random objects created during initialization, allowing for consistent hashes across object instantiations without having to deal with serialization at all.
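To illustrate the request, here is a minimal sketch that does not use the library at all. The class `SeededHyperplanes` and its `generate` method are hypothetical names; the only point is that `java.util.Random` built from the same seed yields the same sequence, so two independently constructed instances end up with identical random hyperplanes.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of java-lsh.
final class SeededHyperplanes {

    // Draws `count` Gaussian vectors of length `dim` from a seeded generator.
    static double[][] generate(int count, int dim, long seed) {
        Random rnd = new Random(seed);
        double[][] planes = new double[count][dim];
        for (double[] plane : planes) {
            for (int j = 0; j < dim; j++) {
                plane[j] = rnd.nextGaussian();
            }
        }
        return planes;
    }

    public static void main(String[] args) {
        double[][] first = generate(4, 3, 42L);
        double[][] second = generate(4, 3, 42L);
        // Same seed, same hyperplanes: prints true.
        System.out.println(Arrays.deepEquals(first, second));
    }
}
```

With this approach only the seed (plus the construction parameters) has to be stored, rather than a serialized object.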

Understanding LSH.hashSignature Implementation

Hello Thibault,

Thank you for the great project, I am using the LSHSuperBit class for some experimentation to find similar items.

I am not able to understand the logic LSH.hashSignature(final boolean[] signature) method uses to group similar items into same bucket.

Can you please point me to some resource which you used to implement this? Thank you.

Examples show boolean vectors, what about string vectors?

Hi,
I was wondering how to use this library for comparing two different Strings that are tokenized into a string vector each.
The examples only show boolean vectors which are just "post-transformation". As a newbie and to make great use of the library, it would be great to have the transformation part covered in the examples.
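One common way to do this transformation, sketched below under my own assumptions, is to build a dictionary of all distinct tokens and mark each token's position as true. `TokenVectorizer`, `dictionary` and `toVector` are hypothetical names and are not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of java-lsh.
final class TokenVectorizer {

    // Builds a sorted dictionary of every distinct token in the corpus.
    static List<String> dictionary(List<String[]> docs) {
        TreeSet<String> tokens = new TreeSet<>();
        for (String[] doc : docs) {
            tokens.addAll(Arrays.asList(doc));
        }
        return new ArrayList<>(tokens);
    }

    // Marks position i as true when dictionary token i occurs in the document.
    static boolean[] toVector(String[] doc, List<String> dictionary) {
        boolean[] vector = new boolean[dictionary.size()];
        for (String token : doc) {
            int index = dictionary.indexOf(token);
            if (index >= 0) {
                vector[index] = true;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] a = "the cat".split(" ");
        String[] b = "the dog".split(" ");
        List<String> dict = dictionary(Arrays.asList(a, b));
        System.out.println(dict);
        System.out.println(Arrays.toString(toVector(a, dict)));
    }
}
```

The resulting boolean arrays all have the dictionary's length, so they can be compared position by position. For large dictionaries a map from token to index would be faster than `indexOf`.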

Hash signature method should be order dependent

The hash signature method of the LSH class is order independent. But according to Mining of Massive Datasets, two bands should land in the same bucket only when they are identical. In the current implementation, two bands of three rows, {1,2,3} and {3,2,1}, are hashed to the same bucket.
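The difference can be shown with a small sketch. This is not the library's code: `BandHash` is a hypothetical class that contrasts a sum-based hash, which ignores order, with `Arrays.hashCode`, which depends on element positions.

```java
import java.util.Arrays;

// Hypothetical illustration, not the hashing used by java-lsh.
final class BandHash {

    // Order independent: any permutation of the same values collides.
    static int sumHash(int[] band) {
        int sum = 0;
        for (int value : band) {
            sum += value;
        }
        return sum;
    }

    // Order dependent: Arrays.hashCode weights each position differently.
    static int orderedHash(int[] band) {
        return Arrays.hashCode(band);
    }

    // Maps the ordered hash to a bucket index in [0, buckets).
    static int bucket(int[] band, int buckets) {
        return Math.floorMod(orderedHash(band), buckets);
    }

    public static void main(String[] args) {
        int[] first = {1, 2, 3};
        int[] second = {3, 2, 1};
        System.out.println(sumHash(first) == sumHash(second));         // true
        System.out.println(orderedHash(first) == orderedHash(second)); // false
    }
}
```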

[QUESTION] Configuration of LSHMinHash threshold

I see that the algorithm is based on the MMDS book by Ullman et al. However, your implementation seems to use a fixed THRESHOLD value of 0.5, whereas in the book they describe the THRESHOLD as a chosen value at which documents should be regarded as a "similar pair". From section 3.4.3:

Choose a threshold t that defines how similar documents have to be in order for them to be regarded as a desired “similar pair.” Pick a number of bands b and a number of rows r such that br = n, and the threshold t is approximately $(1/b)^{1/r}$. If avoidance of false negatives is important, you may wish to select b and r to produce a threshold lower than t; if speed is important and you wish to limit false positives, select b and r to produce a higher threshold.
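As a worked example of the quoted rule, the sketch below computes the approximate threshold (1/b)^(1/r) and the book's banding probability 1 - (1 - s^r)^b that two items with similarity s become candidates. The class `LshThreshold` and its method names are hypothetical and are not part of the library.

```java
// Hypothetical helper, not part of java-lsh.
final class LshThreshold {

    // Approximate similarity threshold for b bands of r rows: (1/b)^(1/r).
    static double threshold(int bands, int rows) {
        return Math.pow(1.0 / bands, 1.0 / rows);
    }

    // Probability that two items with similarity s agree on at least one band.
    static double candidateProbability(double s, int bands, int rows) {
        return 1.0 - Math.pow(1.0 - Math.pow(s, rows), bands);
    }

    public static void main(String[] args) {
        int bands = 20;
        int rows = 5;
        System.out.printf("threshold = %.4f%n", threshold(bands, rows));
        for (double s = 0.2; s < 0.95; s += 0.2) {
            System.out.printf("s = %.1f -> P(candidate) = %.4f%n",
                    s, candidateProbability(s, bands, rows));
        }
    }
}
```

With b = 20 and r = 5 the threshold comes out at roughly 0.55: pairs well above it are almost always candidates, pairs well below it almost never.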

Use BitSet instead of boolean[]

The current implementation uses a boolean[] as an input. Use of a BitSet (https://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html) would be a lot more efficient.

For example, if the dictionary size is Integer.MAX_VALUE, as it would be with the "hashing shingles" approach given in 3.2.3 of Ullman et al, I need to allocate 2 GB of memory to store an array of booleans. With BitSet, I can store that in approximately 8 times less space.
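A rough sketch of what BitSet-based input could look like is below. It uses only `java.util.BitSet` from the JDK; the class `BitSetShingles` and its methods are hypothetical, and nothing here claims the library accepts a BitSet today.

```java
import java.util.BitSet;

// Hypothetical helper, not part of java-lsh.
final class BitSetShingles {

    // Builds a set from the indices of the shingles that are present.
    static BitSet fromIndices(int... indices) {
        BitSet bits = new BitSet();
        for (int index : indices) {
            bits.set(index);
        }
        return bits;
    }

    // Jaccard similarity via bitwise AND / OR on copies of the inputs.
    static double jaccard(BitSet a, BitSet b) {
        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);
        BitSet union = (BitSet) a.clone();
        union.or(b);
        if (union.isEmpty()) {
            return 0.0; // convention chosen here for two empty sets
        }
        return (double) intersection.cardinality() / union.cardinality();
    }

    public static void main(String[] args) {
        BitSet a = fromIndices(1, 2, 3);
        BitSet b = fromIndices(2, 3, 4);
        System.out.println(jaccard(a, b)); // 2 shared / 4 total = 0.5
    }
}
```

Note that a BitSet still allocates storage up to its highest set bit, so the saving is about a factor of 8 over one byte per boolean rather than a truly sparse representation.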

computeSignatureSize error

This is in the method computeSignatureSize in class LSHMinHash. I think it should return rb instead of rs.

Why can we get two vectors' similarity after random projection?

I am new to LSH.
Thank you very much.
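A short sketch may help with the intuition. For a random hyperplane, two vectors at angle θ fall on the same side with probability 1 - θ/π, so the fraction of agreeing signature bits gives an estimate of θ and hence of the cosine. The code below is plain sign random projection with Gaussian hyperplanes, not the Super-Bit variant the library implements, and `SignRandomProjection` is a hypothetical class.

```java
import java.util.Random;

// Hypothetical illustration of sign random projection, not java-lsh code.
final class SignRandomProjection {

    // One bit per random hyperplane: true when the vector lies on its positive side.
    // The same seed must be used for every vector so they share the hyperplanes.
    static boolean[] signature(double[] vector, int bits, long seed) {
        Random rnd = new Random(seed);
        boolean[] sig = new boolean[bits];
        for (int i = 0; i < bits; i++) {
            double dot = 0.0;
            for (double x : vector) {
                dot += x * rnd.nextGaussian();
            }
            sig[i] = dot >= 0.0;
        }
        return sig;
    }

    // Fraction of agreeing bits ~ 1 - theta/pi, so theta ~ pi * (1 - agreement).
    static double estimateCosine(boolean[] a, boolean[] b) {
        int agree = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                agree++;
            }
        }
        double angle = Math.PI * (1.0 - (double) agree / a.length);
        return Math.cos(angle);
    }

    public static void main(String[] args) {
        double[] u = {1.0, 0.0};
        double[] v = {1.0, 1.0};
        boolean[] su = signature(u, 10000, 7L);
        boolean[] sv = signature(v, 10000, 7L);
        // The true cosine is about 0.707; the estimate should be close to it.
        System.out.println(estimateCosine(su, sv));
    }
}
```

The estimate gets tighter as the number of bits grows, at the cost of longer signatures.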

Signature

I'm not sure if this is an issue, but how do I combine signatures? Each new instance of LSHSuperBit produces different signatures for the same vector (which is expected, because the hyperplanes are random).
It means we get different "bucket hits", and therefore a random cosine similarity for the same vector instead of 1.
How should I use or compare them then? Should I save the random hyperplanes and use them for all vectors in the database, or is there a way to make a "fingerprint"?

Another similarity measure

Hello,
Thank you for your great library. I have a question: how can I change the similarity measure? I want to use Euclidean distance instead of cosine similarity. Is it possible to change your code to use another similarity measure instead of cosine similarity?

Thank you in advance

Negative bucket index

When I run SimpleLSHMinHashExample with a large vector size, say n = 100000, I end up getting a negative bucket index:

Errors with

java -Xmx16G -classpath java-string-similarity-0.13.jar:java-lsh-0.8.jar:. SimpleLSHMinHashExample
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -4
at SimpleLSHMinHashExample.main(SimpleLSHMinHashExample.java:48)
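The trace does not show where the negative value comes from. One common cause of a negative index in Java, offered here only as an assumption about this case, is applying `%` to a hash that has overflowed to a negative int, since `%` keeps the sign of the dividend. The hypothetical `BucketIndex` class below contrasts that with `Math.floorMod`.

```java
// Hypothetical illustration, not the code used by java-lsh.
final class BucketIndex {

    // Java's % keeps the sign of the dividend, so this can be negative.
    static int naive(int hash, int buckets) {
        return hash % buckets;
    }

    // Math.floorMod always returns a value in [0, buckets) for buckets > 0.
    static int safe(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        int hash = -14; // stands in for a hash that overflowed to a negative value
        System.out.println(naive(hash, 10)); // -4
        System.out.println(safe(hash, 10));  // 6
    }
}
```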

Dealing with missing data

I am contemplating using LSH in my application, but I am unsure how to deal with absent/missing data in a vector. The nearest neighbor imputation implies that this type of algorithm deals with this scenario, but how would I go about implementing it?

Why should the number of elements per bucket be at least 100 to get relevant results?

In the comments of the example LSHMinHash code, it says that 'to get relevant results, the number of elements per bucket should be at least 100'. Why?

I tried specifying a number of buckets such that the average number of elements per bucket is lower than 100, and it turned out that many buckets were empty. Does this have to do with the hashing function that calculates the bucket for each band of the signature? Or is it because a large portion of the signatures are likely to be identical after banding, so they are hashed to the same buckets?

Thanks in advance!

How to return top-k similar items?

Hi, I don't know how to return the top-k similar vectors. Should I use the signatures to calculate similarity, or the hash values of the signatures?
Thanks very much,
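One common pattern, sketched here as an assumption rather than as the library's recommended usage, is to treat the items that share a bucket with the query as candidates and then rank those candidates by their exact similarity on the original vectors. `TopK` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of java-lsh.
final class TopK {

    // Exact cosine similarity; returns 0 when either vector has zero norm.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Returns the indices of the k candidates most similar to the query, best first.
    static List<Integer> topK(double[] query, List<double[]> candidates, int k) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            order.add(i);
        }
        order.sort((i, j) -> Double.compare(
                cosine(query, candidates.get(j)),
                cosine(query, candidates.get(i))));
        return new ArrayList<>(order.subList(0, Math.min(k, order.size())));
    }

    public static void main(String[] args) {
        double[] query = {1.0, 0.0};
        // In practice these would be the vectors found in the query's bucket(s).
        List<double[]> candidates = Arrays.asList(
                new double[] {0.0, 1.0},
                new double[] {1.0, 0.0},
                new double[] {1.0, 1.0});
        System.out.println(topK(query, candidates, 2)); // [1, 2]
    }
}
```

Ranking on the original vectors avoids the estimation error that comes with comparing signatures, and the bucket lookup keeps the candidate list short.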

I get different hash values on the same vector when I run LSH multiple times.

Boundary condition problem in MinHash

I'm not sure whether this is reasonable, but please try this test case:

boolean[] set1 = new boolean[5];
set1[0] = false;
set1[1] = false;
set1[2] = false;
set1[3] = false;
set1[4] = false;
int[] sig1 = minhash.hash(set1);
minhash.printSignature(sig1);


TreeSet<Integer> set2 = new TreeSet<Integer>();
set2.add(0);
int[] sig2 = minhash.hash(set2);
minhash.printSignature(sig2);

System.out.println(minhash.similarity(sig1, sig2));

sig1 will be all Integer.MAX_VALUE and the resulting similarity will be 0.

Oh, I saw what's going on right after posting: the similarity is a / (a+b+c) instead of (a+d) / (a+b+c+d).
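To make the two formulas concrete: with a = positions set in both vectors, b and c = positions set in only one of them, and d = positions set in neither, Jaccard is a / (a+b+c) while the simple matching coefficient is (a+d) / (a+b+c+d). The sketch below computes both exactly on boolean vectors; `BinarySimilarity` is a hypothetical class, and returning 0 for two empty sets is a convention chosen here.

```java
// Hypothetical helper, not part of java-lsh.
final class BinarySimilarity {

    // Jaccard index: a / (a + b + c). Shared absences (d) are ignored.
    static double jaccard(boolean[] x, boolean[] y) {
        int both = 0;
        int either = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] && y[i]) {
                both++;
            }
            if (x[i] || y[i]) {
                either++;
            }
        }
        return either == 0 ? 0.0 : (double) both / either;
    }

    // Simple matching coefficient: (a + d) / (a + b + c + d).
    static double simpleMatching(boolean[] x, boolean[] y) {
        int same = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] == y[i]) {
                same++;
            }
        }
        return (double) same / x.length;
    }

    public static void main(String[] args) {
        boolean[] set1 = new boolean[5];                       // empty set
        boolean[] set2 = {true, false, false, false, false};   // {0}
        System.out.println(jaccard(set1, set2));        // 0.0
        System.out.println(simpleMatching(set1, set2)); // 0.8
    }
}
```

For the test case above, Jaccard is 0 because the two sets share no element, even though the vectors agree on four of five positions.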

Request

Hello,

Could you provide the code of info.debatty.java.utils.SparseDoubleVector and info.debatty.java.utils.SparseIntegerVector? I have downloaded the project, but these two files are not included in the zip file.

Best regards.

Oscar
