java-lsh's People

Contributors

jlleitschuh, joyouskoala, kireet, stevie400, tdebatty

java-lsh's Issues

A doubt

Hello,

I'm using the java-LSH code, and I think it is a great project.

LSH is a technique for handling high-dimensional datasets, for instance datasets that have 100000 features or even more.

When I run the examples SuperBitExample, SuperBitSparseExample or LSHSuperBitExample, they run fine. However, if I increase the number of dimensions, for instance to 1000, the program becomes very slow.

Can I use this project to work with high-dimensional datasets?

Best regards,

Oscar

Plot of Similarity

Hi,
How are you drawing the Jaccard and cosine plots in your example files? What are the axis inputs?

Missing utils classes

GRADLE: info.debatty:java-lsh:0.11

The class info.debatty.java.lsh.SuperBit has a dependency on the following classes from the "info.debatty.java.utils" package. However, these classes are not in the source code base.

  • info.debatty.java.utils.SparseDoubleVector;
  • info.debatty.java.utils.SparseIntegerVector;

avoid java serialization

Java serialization can be quite tricky to get right, especially as code changes over time. It would be nice if client code had the option of passing a seed value to the LSH classes, which would pass it through to the Random objects created during initialization. That would allow consistent hashes across object instantiations without having to deal with serialization at all.
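The idea behind this request can be illustrated without the library. The sketch below is a minimal, standalone example; the class `SeededCoefficients` and its method are hypothetical names, not part of java-lsh. It only relies on the documented behaviour of `java.util.Random`: two instances created with the same seed produce the same sequence.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of java-lsh: shows that a seed alone is
// enough to rebuild the same pseudo-random hash coefficients.
public class SeededCoefficients {

    // Draws `count` coefficients from a Random created with the given seed.
    static long[] coefficients(int count, long seed) {
        Random rand = new Random(seed);
        long[] coeffs = new long[count];
        for (int i = 0; i < count; i++) {
            coeffs[i] = rand.nextLong();
        }
        return coeffs;
    }

    public static void main(String[] args) {
        long[] first = coefficients(4, 42L);
        long[] second = coefficients(4, 42L);
        // Same seed, same sequence, so both "instantiations" agree.
        System.out.println(Arrays.equals(first, second)); // prints "true"
    }
}
```

Under this approach only the seed (and the construction parameters) would need to be stored, rather than a serialized object.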

Understanding LSH.hashSignature Implementation

Hello Thibault,

Thank you for the great project, I am using the LSHSuperBit class for some experimentation to find similar items.

I am not able to understand the logic that the LSH.hashSignature(final boolean[] signature) method uses to group similar items into the same bucket.

Can you please point me to a resource that you used to implement this? Thank you.

Examples show boolean vectors, what about string vectors?

Hi,
I was wondering how to use this library to compare two Strings that are each tokenized into a string vector.
The examples only show boolean vectors, which are already "post-transformation". As a newbie who wants to make good use of the library, it would be great to have the transformation step covered in the examples.
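One possible transformation, sketched here as an assumption rather than as the library's recommended approach, is to build a dictionary of all distinct tokens and mark which dictionary entries each document contains. The class `TokenVectors` and its methods are hypothetical and use only the Java standard library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of java-lsh.
public class TokenVectors {

    // Builds a sorted dictionary of every distinct token across all documents.
    static List<String> dictionary(List<String[]> documents) {
        TreeSet<String> tokens = new TreeSet<>();
        for (String[] doc : documents) {
            tokens.addAll(Arrays.asList(doc));
        }
        return new ArrayList<>(tokens);
    }

    // Sets position i to true when the document contains dictionary token i.
    // indexOf is linear; a Map<String, Integer> would be faster for large dictionaries.
    static boolean[] toBooleanVector(String[] doc, List<String> dictionary) {
        boolean[] vector = new boolean[dictionary.size()];
        for (String token : doc) {
            int index = dictionary.indexOf(token);
            if (index >= 0) {
                vector[index] = true;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] a = "the quick brown fox".split(" ");
        String[] b = "the lazy brown dog".split(" ");
        List<String> dict = dictionary(Arrays.asList(a, b));
        System.out.println(dict);
        System.out.println(Arrays.toString(toBooleanVector(a, dict)));
        System.out.println(Arrays.toString(toBooleanVector(b, dict)));
    }
}
```

The resulting boolean arrays all have the dictionary's length, so they can be compared position by position, which is the shape the boolean-vector examples start from.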

Hash signature method should be order dependent

The hash signature method of the LSH class is order independent. But according to Mining of Massive Datasets, two items should land in the same bucket only when their bands are identical. In the current implementation, two bands of three rows, {1,2,3} and {3,2,1}, are hashed to the same bucket.
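The difference can be shown with two toy hash functions. This is a standalone sketch, not the library's code: `sumHash` is an assumed stand-in for any hash that only accumulates the rows, and `orderedHash` uses `java.util.Arrays.hashCode`, which depends on element positions.

```java
import java.util.Arrays;

// Hypothetical illustration, not the java-lsh implementation.
public class BandHash {

    // Order-independent: a plain sum maps {1,2,3} and {3,2,1} to the same bucket.
    static int sumHash(int[] band, int buckets) {
        long sum = 0;
        for (int row : band) {
            sum += row;
        }
        return (int) Math.floorMod(sum, (long) buckets);
    }

    // Order-dependent: Arrays.hashCode folds each element in by position.
    static int orderedHash(int[] band, int buckets) {
        return Math.floorMod(Arrays.hashCode(band), buckets);
    }

    public static void main(String[] args) {
        int[] first = {1, 2, 3};
        int[] second = {3, 2, 1};
        int buckets = 1000;
        System.out.println(sumHash(first, buckets) == sumHash(second, buckets));         // prints "true"
        System.out.println(orderedHash(first, buckets) == orderedHash(second, buckets)); // prints "false"
    }
}
```

An order-dependent hash can still collide by chance for some bucket counts, but permuted bands no longer collide systematically.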

[QUESTION] Configuration of LSHMinHash threshold

I see that the algorithm is based on the MMDS book by Ullman et al. However, your implementation seems to use a fixed THRESHOLD value of 0.5, whereas in the book they describe the THRESHOLD as a chosen value at which documents should be regarded as a "similar pair". From section 3.4.3:

Choose a threshold t that defines how similar documents have to be in order for them to be regarded as a desired “similar pair.” Pick a number of bands b and a number of rows r such that br = n, and the threshold t is approximately $(1/b)^{1/r}$. If avoidance of false negatives is important, you may wish to select b and r to produce a threshold lower than t; if speed is important and you wish to limit false positives, select b and r to produce a higher threshold.
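The quoted rule can be checked numerically. The sketch below is a standalone example with hypothetical names; it computes the approximate threshold $(1/b)^{1/r}$ and the standard banding probability $1 - (1 - s^r)^b$ that two items with similarity s become a candidate pair.

```java
// Hypothetical helper, not part of java-lsh.
public class BandingThreshold {

    // Approximate similarity threshold for b bands of r rows each.
    static double threshold(int bands, int rows) {
        return Math.pow(1.0 / bands, 1.0 / rows);
    }

    // Probability that two items with similarity s agree on at least one whole band.
    static double candidateProbability(double s, int bands, int rows) {
        return 1.0 - Math.pow(1.0 - Math.pow(s, rows), bands);
    }

    public static void main(String[] args) {
        int bands = 20;
        int rows = 5; // signature length n = bands * rows = 100
        System.out.printf("threshold ~ %.3f%n", threshold(bands, rows));
        for (double s = 0.2; s <= 0.81; s += 0.2) {
            System.out.printf("s = %.1f -> P(candidate) = %.4f%n",
                    s, candidateProbability(s, bands, rows));
        }
    }
}
```

With b = 20 and r = 5 the threshold comes out near 0.55, so the effective threshold is set by the choice of b and r.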

Signature

I'm not sure if this is an issue, but how do I combine signatures? Each new instance of LSHSuperBit produces different signatures for the same vector (as it should, because the hyperplanes are random).
This means we get different "bucket hits", and therefore a random cosine similarity for the same vector instead of 1.
How should signatures be used or compared then? Should I save the random hyperplanes and use them for all vectors in the database, or is there a way to make a "fingerprint"?
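Signatures are only comparable when they come from the same hyperplanes. The sketch below illustrates this with plain random-hyperplane hashing (not SuperBit, which additionally orthogonalizes the hyperplanes). The class `HyperplaneSignature` is hypothetical and assumes the hyperplanes are regenerated from a fixed seed.

```java
import java.util.Random;

// Hypothetical illustration of sign-random-projection signatures; not java-lsh code.
public class HyperplaneSignature {
    final double[][] hyperplanes;

    // The same (bits, dimensions, seed) triple always yields the same hyperplanes.
    HyperplaneSignature(int bits, int dimensions, long seed) {
        Random rand = new Random(seed);
        hyperplanes = new double[bits][dimensions];
        for (double[] plane : hyperplanes) {
            for (int d = 0; d < dimensions; d++) {
                plane[d] = rand.nextGaussian();
            }
        }
    }

    // One bit per hyperplane: which side of the plane the vector falls on.
    boolean[] signature(double[] v) {
        boolean[] sig = new boolean[hyperplanes.length];
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0;
            for (int d = 0; d < v.length; d++) {
                dot += hyperplanes[i][d] * v[d];
            }
            sig[i] = dot >= 0;
        }
        return sig;
    }

    // Estimates cosine similarity from the fraction of agreeing bits.
    static double estimatedCosine(boolean[] a, boolean[] b) {
        int agree = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                agree++;
            }
        }
        return Math.cos(Math.PI * (1.0 - (double) agree / a.length));
    }

    public static void main(String[] args) {
        double[] v = {1.0, 2.0, 3.0};
        boolean[] s1 = new HyperplaneSignature(64, 3, 7L).signature(v);
        boolean[] s2 = new HyperplaneSignature(64, 3, 7L).signature(v);
        System.out.println(estimatedCosine(s1, s2)); // prints "1.0"
    }
}
```

Two instances built with different seeds would give unrelated signatures for the same vector, which matches the behaviour described above.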

Another similarity measure

Hello,
Thank you for your great library. I have a question: how can I change the similarity measure? I want to use Euclidean distance instead of cosine similarity. Is it possible to change your code to use another similarity measure?

Thank you in advance

Negative bucket index

When I run SimpleLSHMinHashExample with a large vector size, say n = 100000, I end up with a negative bucket index. It fails with:

java -Xmx16G -classpath java-string-similarity-0.13.jar:java-lsh-0.8.jar:. SimpleLSHMinHashExample
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -4
at SimpleLSHMinHashExample.main(SimpleLSHMinHashExample.java:48)
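The stack trace alone does not show which computation produced -4. One common source of negative indices in Java, offered here only as an assumption about this case, is the `%` operator: it keeps the sign of its left operand, so a negative (for example, overflowed) hash value yields a negative remainder. The class below is a hypothetical standalone sketch comparing `%` with `Math.floorMod`.

```java
// Hypothetical illustration, not the java-lsh implementation.
public class BucketIndex {

    // The % operator keeps the sign of the dividend, so this can be negative.
    static int naiveBucket(long hash, int buckets) {
        return (int) (hash % buckets);
    }

    // Math.floorMod always returns a value in [0, buckets) for positive buckets.
    static int safeBucket(long hash, int buckets) {
        return (int) Math.floorMod(hash, (long) buckets);
    }

    public static void main(String[] args) {
        long hash = -14L; // stands in for a hash that went negative
        System.out.println(naiveBucket(hash, 10)); // prints "-4"
        System.out.println(safeBucket(hash, 10));  // prints "6"
    }
}
```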

Dealing with missing data

I am contemplating using LSH in my application, but I am unsure how to deal with absent or missing data in a vector. Nearest-neighbor imputation suggests that this type of algorithm can handle this scenario, but how would I go about implementing it?

Why should the number of elements per bucket be at least 100 to get relevant results?

In the comments of the example LSHMinHash code, it says that 'to get relevant results, the number of elements per bucket should be at least 100'. Why?

I tried specifying a number of buckets such that the average number of elements per bucket is lower than 100, and it turned out that many buckets were empty. Does this have to do with the hashing function that calculates the bucket for each band of the signature? Or is it because a large portion of the signatures, after banding, are likely to be identical and so are hashed to the same buckets?

Thanks in advance!

How to return the top-k similar items?

Hi, I don't know how to return the top-k similar vectors. Should I use the signatures to calculate similarity, or the hash values of the signatures?
Thanks very much.
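A common pattern, sketched here as an assumption rather than as the library's answer, is to use the buckets only to collect candidates and then rank those candidates by exact similarity on the original vectors. The class `TopK` is hypothetical; the candidate list stands in for the vectors that share a bucket with the query, and all vectors are assumed to be non-zero.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of java-lsh.
public class TopK {

    // Exact cosine similarity; assumes neither vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Returns the indices of the k candidates most similar to the query, best first.
    static List<Integer> topK(double[] query, List<double[]> candidates, int k) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            order.add(i);
        }
        // Negate the similarity so that sorting ascending puts the best match first.
        order.sort(Comparator.comparingDouble((Integer i) -> -cosine(query, candidates.get(i))));
        return order.subList(0, Math.min(k, order.size()));
    }

    public static void main(String[] args) {
        double[] query = {1, 0};
        List<double[]> candidates = Arrays.asList(
                new double[]{0, 1}, new double[]{1, 0}, new double[]{1, 1});
        System.out.println(topK(query, candidates, 2)); // prints "[1, 2]"
    }
}
```

Ranking by exact similarity keeps the final ordering independent of the signature length; the hash values themselves only say whether two items collided.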

Boundary conditions problem of minhash.

I'm not sure whether this is reasonable, but please try this test case:

boolean[] set1 = new boolean[5];
set1[0] = false;
set1[1] = false;
set1[2] = false;
set1[3] = false;
set1[4] = false;
int[] sig1 = minhash.hash(set1);
minhash.printSignature(sig1);


TreeSet<Integer> set2 = new TreeSet<Integer>();
set2.add(0);
int[] sig2 = minhash.hash(set2);
minhash.printSignature(sig2);

System.out.println(minhash.similarity(sig1, sig2));

sig1 will be all Integer.MAX_VALUE and the resulting similarity will be 0.

Oh, I saw what's going on right after posting: the similarity is a / (a+b+c) instead of (a+d) / (a+b+c+d).
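The two formulas can be compared directly on the test case above. In this standalone sketch (the class name is hypothetical), a counts positions set in both vectors, b and c count positions set in only one, and d counts positions set in neither. Returning NaN for two empty sets is an arbitrary choice made here, since Jaccard similarity is undefined in that case.

```java
// Hypothetical illustration, not the java-lsh implementation.
public class JaccardBoundary {

    // Jaccard similarity: a / (a + b + c). Positions false in both (d) are ignored.
    static double jaccard(boolean[] x, boolean[] y) {
        int both = 0, either = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] && y[i]) both++;
            if (x[i] || y[i]) either++;
        }
        // Undefined for two empty sets; NaN is an assumption of this sketch.
        return either == 0 ? Double.NaN : (double) both / either;
    }

    // Simple matching coefficient: (a + d) / (a + b + c + d).
    static double simpleMatching(boolean[] x, boolean[] y) {
        int same = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] == y[i]) same++;
        }
        return (double) same / x.length;
    }

    public static void main(String[] args) {
        boolean[] set1 = new boolean[5];                       // empty set
        boolean[] set2 = {true, false, false, false, false};  // the set {0}
        System.out.println(jaccard(set1, set2));        // prints "0.0"
        System.out.println(simpleMatching(set1, set2)); // prints "0.8"
    }
}
```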

Request

Hello,

Could you provide the code of info.debatty.java.utils.SparseDoubleVector and info.debatty.java.utils.SparseIntegerVector? I have downloaded the project, but these two files are not included in the zip file.

Best regards.

Oscar
