serega / gaoya

Locality Sensitive Hashing

License: MIT

locality-sensitive-hashing lsh minhash search simhash similarity

gaoya's Introduction

Gaoya

About

This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.

Main Features

  • 64-, 32-, 16-, and 8-bit MinHash
  • 64- and 128-bit SimHash
  • Fast implementation in Rust
  • Multi-threaded thanks to rayon
  • Python bindings
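To make the MinHash feature above concrete, here is a self-contained Python sketch of the classic construction (illustrative only, not gaoya's implementation): each signature slot keeps the minimum of a seeded hash over the document's tokens, and the fraction of matching slots between two signatures estimates their Jaccard similarity.

```python
import hashlib

def minhash_signature(tokens, num_hashes=128):
    """One signature slot per seeded hash function: keep the minimum
    hash value seen over all tokens (the classic MinHash construction)."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(16, "little")  # blake2b accepts up to 16 salt bytes
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in tokens
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Same token *set* in a different order: MinHash sees identical sets.
a = minhash_signature("this is the first document".split())
b = minhash_signature("is this the first document".split())
print(estimate_jaccard(a, b))  # -> 1.0
```

Because the minimum is taken over the token set, word order is irrelevant, which is why the two sentences above estimate as identical.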

Python Example

>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32, 
                                             jaccard_threshold=0.5, 
                                             num_bands=42, 
                                             band_size=3,
                                             num_hashes=42*3,
                                             analyzer='word', 
                                             lowercase=True, 
                                             ngram_range=(1,1))
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>> 
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
... 
>>> index.query('This is the first document.')
[0, 1, 2, 3]
>>> 
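The `num_bands` and `band_size` parameters trade recall against precision through the standard LSH banding formula: two documents with Jaccard similarity s become candidates with probability 1 − (1 − s^r)^b, where r is the band size and b the number of bands. A quick sketch using the example's parameters (42 bands of width 3):

```python
def collision_probability(s, num_bands=42, band_size=3):
    """Probability that two documents with Jaccard similarity s share
    at least one LSH band bucket (standard banding formula)."""
    return 1.0 - (1.0 - s ** band_size) ** num_bands

# Probabilities rise steeply around the threshold the parameters encode.
for s in (0.3, 0.5, 0.7, 0.9):
    print(f"s={s:.1f} -> p={collision_probability(s):.3f}")
```

With these parameters, pairs at the configured 0.5 threshold are retrieved with probability above 0.99, while much less similar pairs are usually filtered out.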

Installation

$ pip3 install gaoya

Examples

Document Deduplication with Gaoya

Rust Example

use gaoya::minhash::{MinHashIndex, MinHasher32, MinHasher};
use gaoya::text::whitespace_split;
use fxhash::FxHashSet;
let corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third document.",
    "Is this the first document?",
    "This not the first nor the second nor the third, but the fourth document"];
let (num_bands, band_width) = (42, 3);
let minhasher = MinHasher32::new(num_bands * band_width);
let mut index = MinHashIndex::new(num_bands, band_width, 0.5);
for (i, doc) in corpus.iter().enumerate() {
    index.insert(i, minhasher.create_signature(whitespace_split(&doc.to_lowercase())));
}
for (i, doc) in corpus.iter().enumerate() {
    if i < 4 {
        let mut expected = FxHashSet::default();
        expected.extend(vec![0, 1, 2, 3].into_iter());
        let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
        assert_eq!(index.query_owned(&signature), expected);
    } else {
        let mut expected = FxHashSet::default();
        expected.insert(4);
        let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
        assert_eq!(index.query_owned(&signature), expected);
    }
}

References

[1] Chapter 3, Mining of Massive Datasets

[2] Similarity Estimation Techniques from Rounding Algorithms

[3] Detecting Near-Duplicates for Web Crawling

gaoya's People

Contributors

serega

gaoya's Issues

Duplicate ids cause a panic

The following code results in a panic.

#[test]
pub fn test_duplidate_ids() {
    let (b, r) = calculate_minhash_params(0.5, 200);
    let min_hash = MinHasher64V1::new(b * r);
    let mut lsh_index = MinHashIndex::new(b, r, 0.5);
    lsh_index.insert(1, min_hash.create_signature(S1.split_whitespace()));
    lsh_index.insert(1, min_hash.create_signature(S4.split_whitespace()));
    lsh_index.insert(6, min_hash.create_signature(S6.split_whitespace()));
    
    println!("{}", lsh_index);
    assert_eq!(lsh_index.size(), 2);
    lsh_index.remove(&1);
    let ret = lsh_index.query(&min_hash.create_signature(S1.split_whitespace()));
    println!("{:?}", ret);
}
no entry found for key
thread 'minhash::minhash_index::tests::test_duplidate_ids' panicked at 'no entry found for key', gaoya/src/minhash/minhash_index.rs:552:30
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_display
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:147:5
   3: core::panicking::panic_str
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:131:5
   4: core::option::expect_failed
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/option.rs:1924:5
   5: core::option::Option<T>::expect
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/option.rs:786:21
   6: <std::collections::hash::map::HashMap<K,V,S> as core::ops::index::Index<&Q>>::index
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/collections/hash/map.rs:1340:23
   7: gaoya::minhash::minhash_index::MinHashIndex<T,Id,C>::query::{{closure}}
             at ./src/minhash/minhash_index.rs:552:30
   8: hashbrown::set::HashSet<T,S,A>::retain::{{closure}}
             at /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.12.3/src/set.rs:325:32
   9: hashbrown::map::HashMap<K,V,S,A>::retain
             at /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.12.3/src/map.rs:834:21
  10: hashbrown::set::HashSet<T,S,A>::retain
             at /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.12.3/src/set.rs:325:9
  11: std::collections::hash::set::HashSet<T,S>::retain
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/collections/hash/set.rs:333:9
  12: gaoya::minhash::minhash_index::MinHashIndex<T,Id,C>::query
             at ./src/minhash/minhash_index.rs:551:9
  13: gaoya::minhash::minhash_index::tests::test_duplidate_ids
             at ./src/minhash/minhash_index.rs:1208:19
  14: gaoya::minhash::minhash_index::tests::test_duplidate_ids::{{closure}}
             at ./src/minhash/minhash_index.rs:1197:33
  15: core::ops::function::FnOnce::call_once
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/ops/function.rs:250:5
  16: core::ops::function::FnOnce::call_once
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

The current logic does not check in the `insert` method whether the ID is already present in the index. The assumption is that IDs are unique, and it is up to the caller to handle duplicates.
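Until the library guards against this, callers can enforce unique ids themselves. The sketch below is a hypothetical Python wrapper; `_StubIndex` is a stand-in for a real LSH index (it assumes the underlying index exposes `insert_document` and `remove`, which is an assumption about the API, not a statement of gaoya's bindings):

```python
class _StubIndex:
    """Minimal stand-in for an LSH index (dict of id -> doc), used only
    to demonstrate the wrapper; a real index would go here."""
    def __init__(self):
        self.docs = {}
    def insert_document(self, doc_id, doc):
        self.docs[doc_id] = doc
    def remove(self, doc_id):
        del self.docs[doc_id]

class DedupIndex:
    """Caller-side guard: remove any existing entry before re-inserting
    under the same id, so ids in the index stay unique."""
    def __init__(self, index):
        self.index = index
        self.seen = set()

    def insert(self, doc_id, doc):
        if doc_id in self.seen:
            self.index.remove(doc_id)  # drop the stale entry first
        self.index.insert_document(doc_id, doc)
        self.seen.add(doc_id)

idx = DedupIndex(_StubIndex())
idx.insert(1, "first")
idx.insert(1, "second")  # same id: old entry removed, then re-inserted
print(idx.index.docs)    # -> {1: 'second'}
```

The `seen` set costs one entry per id, which is far cheaper than the inconsistent band state that silent duplicate inserts would create.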

Indexing without storing signatures

Hi,

First of all, many thanks for this library; it is helping me a lot. I'm using it to near-deduplicate very large collections of text with MinHash (tens of TB, or even a hundred TB, compressed size), and I'm constrained by the amount of RAM available, so I'm making some modifications to address this. The first was to have index objects that each store only one of the bands, so I can distribute the index across different machines. Now I'm wondering whether it would be possible to avoid storing all the signatures in the `id_signatures: HashMap` and store only ids. As far as I understood from the code, only the bands are needed to query a document and return matches; `id_signatures` would only be needed if a similarity score is requested or if I need to query by id. Am I right?

Not asking for you to implement it, just wanted to double check if this is feasible.

Thanks in advance,
Jaume
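For what it's worth, that reading matches the standard banding scheme: candidate retrieval only needs the band buckets, while per-id signatures are needed only to compute similarity scores or to remove entries by id. A minimal bands-only sketch in Python (illustrative structure, not gaoya's actual internals):

```python
from collections import defaultdict

class BandsOnlyIndex:
    """Stores only band-bucket -> ids; signatures are discarded after
    insertion, so queries return unscored candidate ids."""
    def __init__(self, num_bands, band_size):
        self.num_bands = num_bands
        self.band_size = band_size
        self.bands = [defaultdict(set) for _ in range(num_bands)]

    def _band_keys(self, signature):
        # Split the signature into num_bands contiguous chunks.
        for i in range(self.num_bands):
            yield i, tuple(signature[i * self.band_size:(i + 1) * self.band_size])

    def insert(self, doc_id, signature):
        for i, key in self._band_keys(signature):
            self.bands[i][key].add(doc_id)

    def query(self, signature):
        out = set()
        for i, key in self._band_keys(signature):
            out |= self.bands[i].get(key, set())
        return out

idx = BandsOnlyIndex(num_bands=2, band_size=2)
idx.insert("a", [1, 2, 3, 4])
idx.insert("b", [1, 2, 9, 9])
print(idx.query([1, 2, 7, 7]))  # shares the first band with both -> {'a', 'b'}
```

The trade-off: without stored signatures, removing an id requires either scanning every bucket or keeping a secondary id-to-band-keys map, and queries cannot filter candidates by exact similarity.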

Interpreting results

Good morning!

I'm exploring simhash as an alternative algorithm to ssdeep for identifying near-duplicate webpage responses for my project. The reason is that ssdeep doesn't do well if the webpage's content length is below a certain threshold.

I currently use fuzzyhash and can ask it to compare two hashes, returning a similarity percentage:

s1: 24:hY6svD+6zSU6pedQf3Zvcn1BZdAe1nCr1LTHI5z8xTxS8f:3qD+2+pUAew85zsTUA 
s2: 24:hY6svD+6zSU6pedQf3Zvcn1BZdAe1nCr1LTHI5z8xnZS8f:3qD+2+pUAew85zsnsA

similarity: 97%

Is there a way for me to (correctly) interpret simhash's hamming distance as a percentage-based similarity?

I played around with the 128-bit SimHash, and it seems as though the maximum distance is 61. Could I use that as the maximum when calculating a percentage?
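A note on that observed maximum: hashes of unrelated inputs behave roughly like random 128-bit values, which differ in about half their bits on average, so observed distances cluster around 64 and rarely exceed it by much; the 61 you saw is that statistical ceiling, not a property of the algorithm. The common convention is to normalize by the hash width instead. A small sketch (my own convention, not something gaoya exposes):

```python
def simhash_similarity(h1: int, h2: int, bits: int = 128) -> float:
    """Map Hamming distance to a [0, 1] similarity score by normalizing
    by the hash width, not by an observed maximum distance."""
    distance = bin(h1 ^ h2).count("1")
    return 1.0 - distance / bits

print(simhash_similarity(0b1011, 0b1001, bits=4))  # distance 1 -> 0.75
```

With this convention, two unrelated documents score around 0.5 rather than 0 (random hashes agree on roughly half their bits), so some implementations rescale the [0.5, 1.0] range to [0, 1].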

Side question: what is the relevance of the keys in this constructor?
SimSipHasher128::new(1, 2)

Example webpage responses that differ only slightly are below; ssdeep reports 97% similarity for the two static strings.

use gaoya::simhash::SimHash;
use gaoya::simhash::SimHashBits;
use gaoya::simhash::SimSipHasher128;
use gaoya::text::whitespace_split;

static goog32: &str = r#"<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 404 (Not Found)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>404.</b> <ins>That’s an error.</ins>
  <p>The requested URL <code>/f1ae349ad8674a4c8e6ecc6403ffa2a2</code> was not found on this server.  <ins>That’s all we know.</ins>
"#;

static goog96: &str = r#"<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 404 (Not Found)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>404.</b> <ins>That’s an error.</ins>
  <p>The requested URL <code>/some-longer/document/perhaps</code> was not found on this server.  <ins>That’s all we know.</ins>
"#;
fn main() {
    let sim_hash: SimHash<SimSipHasher128, u128, 128> = SimHash::new(SimSipHasher128::new(1, 2));
    let g1 = sim_hash.create_signature(whitespace_split(goog32));
    let g2 = sim_hash.create_signature(whitespace_split(goog96));
    println!("g1 {} g2  {} distance {}", g1, g2, g1.hamming_distance(&g2));
    println!("g1 {} g2  {} distance {}", g1, g2, g1.hamming_angle(&g2));
}

Optimization Ideas For MinHash

Noticed your comment on Hacker News about this repo. I worked on something similar at a previous job (so I don't have the code to share), but I looked pretty deeply into optimizing MinHash. Not sure if it's helpful or if you're already aware of these tricks, but I found they significantly sped up my MinHash implementation (and they are fun to code :P).

  1. One Permutation Hashing. Basically, use a single hash function in a clever way instead of the typical 128. Made a pretty big difference in speed during my testing.
  2. Densified One Permutation Hashing. A small tweak for handling the "empty bin" problem of one permutation hashing.
  3. Reservoir Sampling. A trick for using less memory and avoiding dynamic allocations. This last paper also has a summary of the first two.
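To make (1) concrete, here is a rough, self-contained Python sketch of one-permutation hashing (simplified: each token is hashed once and routed to a bin by its value, keeping the minimum per bin; empty bins are left unfilled rather than densified, and `blake2b` is just a stand-in hash):

```python
import hashlib

def one_permutation_signature(tokens, num_bins=16, hash_bits=32):
    """Hash each token once, route it to a bin by its high bits, and keep
    the per-bin minimum. With k tokens and b bins this costs k hashes
    instead of k*b; empty bins stay None (densification would fill them)."""
    space = 1 << hash_bits
    bin_width = space // num_bins
    sig = [None] * num_bins
    for t in tokens:
        h = int.from_bytes(
            hashlib.blake2b(t.encode(), digest_size=4).digest(), "little"
        )
        b = min(h // bin_width, num_bins - 1)  # guard against the top edge
        if sig[b] is None or h < sig[b]:
            sig[b] = h
    return sig

sig = one_permutation_signature("this is the first document".split())
print(sum(v is not None for v in sig), "of", len(sig), "bins filled")
```

With only five tokens and sixteen bins, most bins stay empty, which is exactly the "empty bin" problem that densified one-permutation hashing (2) addresses.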

Caveat: it seems like you're working on SuperMinHash, which I was never able to fully code myself. I don't know how it compares to the one-hash approach.
