tdebatty / java-lsh
A Java implementation of Locality Sensitive Hashing (LSH)
License: Other
Hello,
I'm using the java-LSH code, I consider that it is a great project.
LSH is a technique for handling high-dimensional datasets, for instance datasets that have 100000 features, or even more...
When I run the examples SuperBitExample, SuperBitSparseExample or LSHSuperBitExample, they run fine. However, if I increase the number of dimensions, for instance to 1000, the program becomes very slow.
Can I use this project to work with high-dimensional datasets?
Best regards,
Oscar
Hi,
How are you drawing the Jaccard and cosine plots in your example files? What are your axis inputs?
GRADLE: info.debatty:java-lsh:0.11
The class info.debatty.java.lsh.SuperBit has a dependency on the following classes from the "info.debatty.java.utils" package. However, these classes are not in the source code base.
Java serialization can be quite tricky to get right, especially as code changes over time. It would be nice if client code had the option of passing a seed value to the LSH classes that would be passed through to the Random objects created during initialization, thus allowing for consistent hashes across object instantiations without having to deal with serialization at all.
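The idea behind this request can be illustrated outside the library. The following is a minimal sketch, assuming a hypothetical `SeededHyperplanes` class that is not part of java-lsh: it draws random hyperplanes from a seeded `java.util.Random`, so two separately constructed instances produce the same signature for the same vector.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration, not part of java-lsh: hyperplanes drawn from a
// seeded Random are identical across instances, so signatures match without
// serializing the object.
class SeededHyperplanes {
    private final double[][] planes;

    SeededHyperplanes(int bits, int dimensions, long seed) {
        Random rand = new Random(seed); // same seed -> same sequence of values
        planes = new double[bits][dimensions];
        for (double[] plane : planes) {
            for (int i = 0; i < dimensions; i++) {
                plane[i] = rand.nextGaussian();
            }
        }
    }

    // One bit per hyperplane: the side of the plane on which the vector lies.
    boolean[] signature(double[] vector) {
        boolean[] sig = new boolean[planes.length];
        for (int b = 0; b < planes.length; b++) {
            double dot = 0;
            for (int i = 0; i < vector.length; i++) {
                dot += planes[b][i] * vector[i];
            }
            sig[b] = dot >= 0;
        }
        return sig;
    }

    public static void main(String[] args) {
        double[] v = {1.0, -2.0, 0.5};
        boolean[] a = new SeededHyperplanes(16, 3, 42L).signature(v);
        boolean[] b = new SeededHyperplanes(16, 3, 42L).signature(v);
        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```

With this approach only the seed and the construction parameters need to be stored, rather than the serialized object.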
Hello Thibault,
Thank you for the great project, I am using the LSHSuperBit class for some experimentation to find similar items.
I am not able to understand the logic that the LSH.hashSignature(final boolean[] signature) method uses to group similar items into the same bucket.
Can you please point me to some resource which you used to implement this? Thank you.
Hi,
I was wondering how to use this library for comparing two different Strings that are tokenized into a string vector each.
The examples only show boolean vectors which are just "post-transformation". As a newbie and to make great use of the library, it would be great to have the transformation part covered in the examples.
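One common way to do this transformation is to build a dictionary over all tokens and mark, for each string, which dictionary entries it contains. The sketch below assumes whitespace tokenization and uses a hypothetical `TokenVectors` helper; neither comes from the library.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of java-lsh: turns token arrays into boolean
// vectors over a shared dictionary.
class TokenVectors {
    // Assigns each distinct token an index, in order of first appearance.
    static Map<String, Integer> buildDictionary(String[]... documents) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String[] doc : documents) {
            for (String token : doc) {
                dict.putIfAbsent(token, dict.size());
            }
        }
        return dict;
    }

    // vector[i] is true when the token with dictionary index i occurs.
    static boolean[] toVector(String[] tokens, Map<String, Integer> dict) {
        boolean[] vector = new boolean[dict.size()];
        for (String token : tokens) {
            Integer index = dict.get(token);
            if (index != null) {
                vector[index] = true;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] s1 = "the quick brown fox".split(" ");
        String[] s2 = "the lazy brown dog".split(" ");
        Map<String, Integer> dict = buildDictionary(s1, s2);
        // prints "[true, true, true, true, false, false]"
        System.out.println(Arrays.toString(toVector(s1, dict)));
        // prints "[true, false, true, false, true, true]"
        System.out.println(Arrays.toString(toVector(s2, dict)));
    }
}
```

Both vectors have the same length because they share one dictionary, which is what a set-based similarity such as Jaccard needs.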
The hashSignature method of the LSH class is order independent. But according to Mining of Massive Datasets, bands should land in the same bucket only when they are identical. In the current implementation, two bands of three rows, say {1,2,3} and {3,2,1}, are hashed to the same bucket.
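The difference can be shown with a small sketch. The two functions below are hypothetical and are not the library's code: `sumBucket` stands in for any order-independent hash, and `orderedBucket` uses `Arrays.hashCode`, which mixes each element's position into the result.

```java
import java.util.Arrays;

// Hypothetical illustration, not java-lsh code.
class BandHash {
    // Order-independent: any permutation of the band gives the same bucket.
    static int sumBucket(int[] band, int buckets) {
        long sum = 0;
        for (int v : band) {
            sum += v;
        }
        return (int) Math.floorMod(sum, (long) buckets);
    }

    // Order-dependent: Arrays.hashCode depends on element positions.
    static int orderedBucket(int[] band, int buckets) {
        return Math.floorMod(Arrays.hashCode(band), buckets);
    }

    public static void main(String[] args) {
        int[] first = {1, 2, 3};
        int[] second = {3, 2, 1};
        System.out.println(sumBucket(first, 1000) == sumBucket(second, 1000));         // prints "true"
        System.out.println(orderedBucket(first, 1000) == orderedBucket(second, 1000)); // prints "false"
    }
}
```

An order-dependent hash can still collide for different bands, but it no longer does so systematically for permutations.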
I see that the algorithm is based on the MMDS book by Ullman et al. However, your implementation seems to use a fixed THRESHOLD value of 0.5, whereas in the book they describe the THRESHOLD as a chosen value at which documents should be regarded as a "similar pair". From section 3.4.3:
Choose a threshold t that defines how similar documents have to be in order for them to be regarded as a desired “similar pair.” Pick a number of bands b and a number of rows r such that br = n, and the threshold t is approximately (1/b)^(1/r). If avoidance of false negatives is important, you may wish to select b and r to produce a threshold lower than t; if speed is important and you wish to limit false positives, select b and r to produce a higher threshold.
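The relationship in the quoted passage can be checked numerically. This is a small sketch with a hypothetical `LshThreshold` class: `threshold` evaluates (1/b)^(1/r), and `candidateProbability` evaluates the banding S-curve 1 - (1 - s^r)^b, the probability that two items with similarity s agree on at least one band.

```java
// Hypothetical helper for exploring band/row choices; not part of java-lsh.
class LshThreshold {
    // Approximate similarity at which the S-curve rises most steeply.
    static double threshold(int bands, int rows) {
        return Math.pow(1.0 / bands, 1.0 / rows);
    }

    // Probability that a pair with similarity s becomes a candidate pair.
    static double candidateProbability(double s, int bands, int rows) {
        return 1.0 - Math.pow(1.0 - Math.pow(s, rows), bands);
    }

    public static void main(String[] args) {
        int bands = 20;
        int rows = 5; // signature length n = bands * rows = 100
        System.out.printf("threshold ~ %.3f%n", threshold(bands, rows));
        for (double s = 0.2; s <= 0.81; s += 0.2) {
            System.out.printf("s = %.1f -> P(candidate) = %.4f%n",
                    s, candidateProbability(s, bands, rows));
        }
    }
}
```

With b = 20 and r = 5 the threshold comes out at roughly 0.55; increasing b lowers it and increasing r raises it.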
The current implementation uses a boolean[] as input. Using a BitSet (https://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html) would be a lot more efficient.
For example, if the dictionary size is Integer.MAX_VALUE, as it would be with the "hashing shingles" approach given in 3.2.3 of Ullman et al., I need to allocate 2GB of memory to store an array of booleans. With a BitSet, I can store that in approximately 8 times less space.
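As a rough illustration of the proposal, the sketch below stores hashed shingle positions in a `java.util.BitSet` and computes Jaccard similarity directly on the bit sets. The `BitSetShingles` class is hypothetical; nothing here implies that the library accepts a BitSet.

```java
import java.util.BitSet;

// Hypothetical illustration of a BitSet-backed set representation.
class BitSetShingles {
    // Marks the position of each hashed shingle. Indices must be non-negative.
    static BitSet toBitSet(int[] shingleHashes) {
        BitSet bits = new BitSet();
        for (int h : shingleHashes) {
            bits.set(h);
        }
        return bits;
    }

    // Jaccard similarity |A and B| / |A or B|; returns 0 for two empty sets
    // (a convention chosen for this sketch).
    static double jaccard(BitSet a, BitSet b) {
        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);
        BitSet union = (BitSet) a.clone();
        union.or(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        return (double) intersection.cardinality() / union.cardinality();
    }

    public static void main(String[] args) {
        BitSet a = toBitSet(new int[] {1, 5, 900});
        BitSet b = toBitSet(new int[] {5, 900, 1200});
        System.out.println(jaccard(a, b)); // prints "0.5"
    }
}
```

A BitSet packs 64 positions into each `long` and only grows up to the highest bit that is set, so sparse sets with small indices stay small.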
I am new to LSH.
Thank you very much.
I'm not sure if this is an issue, but how do I combine signatures? Each new instance of LSHSuperBit produces different signatures for the same vector (as it should, because the hyperplanes are random).
This means we get different "bucket hits", and therefore a random cosine similarity for the same vector instead of 1.
How should I use and compare signatures then? Should I save the random hyperplanes and use them for all vectors in the database, or is there a way to make a "fingerprint"?
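Signatures are only comparable when they were produced with the same hyperplanes. Under that assumption, the standard sign-random-projection estimate recovers the angle from the fraction of agreeing bits. The sketch below uses a hypothetical `SignatureCosine` helper and is not a description of what the library's own similarity method does.

```java
// Hypothetical helper; assumes both signatures come from the SAME hyperplanes.
class SignatureCosine {
    // The fraction of disagreeing bits estimates angle / pi.
    static double estimateCosine(boolean[] s1, boolean[] s2) {
        if (s1.length != s2.length || s1.length == 0) {
            throw new IllegalArgumentException("signatures must have equal, non-zero length");
        }
        int agree = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                agree++;
            }
        }
        double angle = Math.PI * (1.0 - (double) agree / s1.length);
        return Math.cos(angle);
    }

    public static void main(String[] args) {
        boolean[] a = {true, false, true, true};
        boolean[] b = {true, false, false, false};
        System.out.println(estimateCosine(a, a)); // identical signatures: prints "1.0"
        System.out.println(estimateCosine(a, b)); // half the bits agree: close to 0
    }
}
```

Signatures produced by two instances with different hyperplanes agree only by chance, which matches the random similarity described above.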
Hello,
Thank you for your great library. I have a question: how can I change the similarity measure? I want to use Euclidean distance instead of cosine similarity. Is it possible to change your code to use another similarity measure?
Thank you in advance
When I run SimpleLSHMinHashExample with a large vector size, say n = 100000, I end up getting a negative bucket index.
It fails with:
java -Xmx16G -classpath java-string-similarity-0.13.jar:java-lsh-0.8.jar:. SimpleLSHMinHashExample
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -4
at SimpleLSHMinHashExample.main(SimpleLSHMinHashExample.java:48)
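One plausible cause, not confirmed by this report, is a hash value that has overflowed to a negative number before being reduced with `%`. In Java, `%` keeps the sign of the dividend, while `Math.floorMod` always returns a value in `[0, buckets)` for a positive modulus. The sketch below uses a hypothetical `BucketIndex` class and a made-up hash value to show the difference.

```java
// Hypothetical illustration; the hash value is made up for the example.
class BucketIndex {
    // Remainder keeps the sign of the hash, so the index can be negative.
    static int naiveBucket(long hash, int buckets) {
        return (int) (hash % buckets);
    }

    // floorMod always yields an index in [0, buckets) when buckets > 0.
    static int safeBucket(long hash, int buckets) {
        return (int) Math.floorMod(hash, (long) buckets);
    }

    public static void main(String[] args) {
        long hash = -14L; // stands in for an overflowed hash value
        System.out.println(naiveBucket(hash, 10)); // prints "-4"
        System.out.println(safeBucket(hash, 10));  // prints "6"
    }
}
```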
I am contemplating using LSH in my application, but I am unsure how to deal with absent/missing data in a vector. The nearest neighbor imputation implies that this type of algorithm deals with this scenario, but how would I go about implementing it?
In the comments of the example LSHMinHash code, it says that 'to get relevant results, the number of elements per bucket should be at least 100'. Why?
I tried specifying a number of buckets such that the average number of elements per bucket is lower than 100, and it turned out that many buckets were empty. Does this have to do with the hashing function that calculates the bucket for each band of the signature? Or is it because a large portion of the signatures are likely to be identical after banding, so they are hashed to the same buckets?
Thanks in advance!
Hi, I don't know how to return the top-k most similar vectors. Should I compute similarity from the signatures, or from the hash values of the signatures?
Thanks very much,
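A common pattern, given here as an assumption rather than as the library's recommended method, is to use the LSH buckets only to select candidates and then rank those candidates by exact similarity on the original vectors. The sketch below shows the ranking step with a hypothetical `TopK` class and cosine similarity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical ranking step; candidates would be the vectors that share a
// bucket with the query.
class TopK {
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // convention for zero vectors in this sketch
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Returns the indices of the k candidates most similar to the query.
    static List<Integer> topK(double[] query, List<double[]> candidates, int k) {
        double[] score = new double[candidates.size()];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            score[i] = cosine(query, candidates.get(i));
            order.add(i);
        }
        order.sort((i, j) -> Double.compare(score[j], score[i])); // descending
        return new ArrayList<>(order.subList(0, Math.min(k, order.size())));
    }

    public static void main(String[] args) {
        List<double[]> candidates = new ArrayList<>();
        candidates.add(new double[] {0, 1});
        candidates.add(new double[] {1, 0.1});
        candidates.add(new double[] {1, 1});
        candidates.add(new double[] {-1, 0});
        System.out.println(topK(new double[] {1, 0}, candidates, 2)); // prints "[1, 2]"
    }
}
```

Ranking on the original vectors avoids the estimation error of signatures; ranking on signatures is cheaper when the original vectors are too large to keep in memory.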
I get different hash values on the same vector when I run LSH multiple times.
I'm not sure whether this is reasonable, but please try this test case:
boolean[] set1 = new boolean[5];
set1[0] = false;
set1[1] = false;
set1[2] = false;
set1[3] = false;
set1[4] = false;
int[] sig1 = minhash.hash(set1);
minhash.printSignature(sig1);
TreeSet<Integer> set2 = new TreeSet<Integer>();
set2.add(0);
int[] sig2 = minhash.hash(set2);
minhash.printSignature(sig2);
System.out.println(minhash.similarity(sig1, sig2));
sig1 will be all Integer.MAX_VALUE and the resulting similarity will be 0.
Oh, I saw what's going on right after posting: the similarity is a / (a+b+c) instead of (a+d) / (a+b+c+d).
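The two formulas can be compared directly on the vectors from the test case. In the sketch below, a hypothetical `BooleanJaccard` class treats `a` as the number of positions where both vectors are true and `d` as the number where both are false, which is the usual reading of these formulas. Jaccard ignores `d`, so an all-false vector has similarity 0 with any non-empty set.

```java
// Hypothetical illustration of the two similarity formulas.
class BooleanJaccard {
    // a / (a + b + c): positions where both are false do not count.
    static double jaccard(boolean[] x, boolean[] y) {
        int both = 0;
        int either = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] && y[i]) {
                both++;
            }
            if (x[i] || y[i]) {
                either++;
            }
        }
        return either == 0 ? 0.0 : (double) both / either; // 0 for two empty sets, by convention here
    }

    // (a + d) / (a + b + c + d): every agreeing position counts.
    static double simpleMatching(boolean[] x, boolean[] y) {
        int agree = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] == y[i]) {
                agree++;
            }
        }
        return (double) agree / x.length;
    }

    public static void main(String[] args) {
        boolean[] set1 = new boolean[5];                         // all false
        boolean[] set2 = {true, false, false, false, false};     // contains only element 0
        System.out.println(jaccard(set1, set2));        // prints "0.0"
        System.out.println(simpleMatching(set1, set2)); // prints "0.8"
    }
}
```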
Hello,
Could you provide the code of info.debatty.java.utils.SparseDoubleVector and info.debatty.java.utils.SparseIntegerVector? I have downloaded the project, but these two files are not included in the zip file.
Best regards.
Oscar