
mattilyra / lsh

273.0 10.0 77.0 525 KB

Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents

License: MIT License

Python 56.28% C++ 27.83% C 3.90% Cython 11.98%
lsh cython minhash duplicate-documents deduplication

lsh's People

Contributors

bugggggggg, hbrylkowski, mattilyra, stultus

lsh's Issues

Need help installing this

I'm fairly new to Python. What's the easiest way to install this package? I don't see a setup.py, so it's not clear what I need to do. I'm running OSX 10.8.5.

Thanks! I've implemented near-duplicate detection using LSH in Java, but my new code base is in Python, so this would help a lot.

Python 3.10 compatibility

Encountered some compatibility issues while installing LSH due to different Python versions:

In "lsh/cMinhash.cpp" at line 19292, there is an error related to 'PyThreadState' (also known as 'struct _ts') where it mentions that 'exc_type' is not a member, and it suggests replacing it with 'curexc_type' to resolve the issue.

In "lsh/cMinhash.cpp" at line 17704, there is another error involving 'PyTypeObject' (or 'struct _typeobject') where 'tp_print' is not a member. The solution is to replace 'tp_print' with 'tp_vectorcall_offset'.

There is another one.

python3.10/object.h:133:33: error: lvalue required as increment operand
        133 | #define Py_REFCNT(ob) _Py_REFCNT(_PyObject_CAST_CONST(ob))
            |                       ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~

Jaccard should be performed on sets, but appears to be given numpy arrays

It appears that MinHash.jaccard is expecting two sets to be given here, where the & and | are used for set intersection and union, respectively:

return len(f_a & f_b) / len(f_a | f_b)

From what I understand, it's being passed two numpy arrays from Cache (since they're outputs of the fingerprint functions):

LSH/lsh/cache.py

Lines 65 to 66 in da67215

jaccard = self.hasher.jaccard(self.fingerprints[id1],
self.fingerprints[id2])

The code doesn't raise an exception, because & and | are overloaded for numpy, but I'm concerned that this may not be computing jaccard correctly.

From my testing, I found that this jaccard function did not work as expected (didn't filter any candidates).
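To illustrate the concern, here is a minimal sketch. It assumes the fingerprints are equal-length NumPy integer arrays, as described above. The function names jaccard_as_written, jaccard_on_sets and signature_agreement are made up for this example and are not part of lsh.

```python
import numpy as np


def jaccard_as_written(f_a, f_b):
    # On NumPy arrays, & and | are element-wise bitwise AND / OR, so both
    # results have the same length as the inputs and the ratio is always 1.0.
    return len(f_a & f_b) / len(f_a | f_b)


def jaccard_on_sets(f_a, f_b):
    # Convert to Python sets first so that & and | mean intersection / union.
    s_a, s_b = set(f_a.tolist()), set(f_b.tolist())
    return len(s_a & s_b) / len(s_a | s_b)


def signature_agreement(f_a, f_b):
    # The usual MinHash estimate: fraction of positions where the two
    # signatures hold the same minimum.
    return float(np.mean(f_a == f_b))


# Two toy "fingerprints" (hypothetical values, not real MinHash output).
a = np.array([1, 2, 3, 4], dtype=np.uint64)
b = np.array([1, 2, 7, 8], dtype=np.uint64)

print(jaccard_as_written(a, b))   # 1.0, regardless of the contents
print(jaccard_on_sets(a, b))      # 2 shared values out of 6 distinct ones
print(signature_agreement(a, b))  # 0.5
```

If the arrays really are passed in unchanged, the first function returns 1.0 for any pair of equal-length fingerprints, so it would not discriminate between similar and dissimilar candidates.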

I apologize if I'm not understanding this correctly, please correct me if I'm wrong!

parallel deduplication

Allow a stream to be fed in and deduplicated in parallel. Obviously the deduplication itself cannot happen in parallel, but shingling and minhashing the documents can. Given a fast enough backend for storing the fingerprints, this should significantly speed up deduplicating large document collections.
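A rough sketch of the idea, not an implementation of this feature: fingerprints are computed through an executor while the duplicate check stays serial. Everything here is an assumption made for illustration. The fingerprint function is a pure-Python stand-in built on hashlib rather than the library's Cython MinHash, the serial step is a plain linear scan instead of an LSH cache, and the names shingles, fingerprint and deduplicate are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def shingles(text, n=5):
    """Character n-grams of a document."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def fingerprint(text, num_seeds=32, n=5):
    """Stand-in MinHash: per seed, the smallest hash over all shingles."""
    grams = [g.encode("utf-8") for g in shingles(text, n)]
    return tuple(
        min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8,
                                salt=seed.to_bytes(8, "little")).digest(),
                "little",
            )
            for g in grams
        )
        for seed in range(num_seeds)
    )


def _fingerprint_item(item):
    doc_id, text = item
    return doc_id, fingerprint(text)


def deduplicate(docs, executor, threshold=0.9):
    """docs yields (doc_id, text); returns the ids kept as non-duplicates."""
    kept = {}
    # Parallel part: shingling and minhashing run in the executor's workers.
    for doc_id, fp in executor.map(_fingerprint_item, docs):
        # Serial part: compare against what has been kept so far.
        is_duplicate = any(
            sum(x == y for x, y in zip(fp, other)) / len(fp) >= threshold
            for other in kept.values()
        )
        if not is_duplicate:
            kept[doc_id] = fp
    return list(kept)


docs = [
    (1, "the quick brown fox jumps over the lazy dog"),
    (2, "the quick brown fox jumps over the lazy dog"),
    (3, "locality sensitive hashing finds near duplicates"),
]
with ThreadPoolExecutor(max_workers=4) as pool:
    unique_ids = deduplicate(docs, pool)
print(unique_ids)  # [1, 3]
```

The executor is passed in so that a ProcessPoolExecutor can be swapped in for CPU-bound pure-Python hashing. Note that executor.map submits the whole input up front, so a truly unbounded stream would need to be fed in chunks.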

Check buckets of LSH MinHash

I want a rough clustering of the input, that is, input sets that are similar to each other in terms of the Jaccard coefficient should be grouped together. I think LSH may be able to accomplish such a task since it hashes items similar to each other into the same bucket. I wonder if I can check which items are grouped together. I tried the following, but I do not think it is a good way.
lsh.hashtables[6]._dict.values()
Any help or suggestions are much appreciated.
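One way to look at bucket membership without reaching into private attributes is to rebuild the buckets from the fingerprints. The sketch below shows generic MinHash banding under stated assumptions; it is not this library's API. The function band_buckets is hypothetical, and the fingerprints are assumed to be available as a dict of doc id to a sequence of minhash values.

```python
from collections import defaultdict


def band_buckets(fingerprints, num_bands):
    """Group doc ids whose fingerprints agree on at least one whole band."""
    buckets = defaultdict(set)
    for doc_id, fp in fingerprints.items():
        rows, remainder = divmod(len(fp), num_bands)
        if remainder:
            raise ValueError("fingerprint length must be divisible by num_bands")
        for band in range(num_bands):
            # The band index is part of the key so that equal values in
            # different bands do not land in the same bucket.
            key = (band, tuple(fp[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    # Only buckets holding more than one document are interesting.
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}


# Toy fingerprints (made-up values): "a" and "b" agree on the first band.
fps = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
shared = band_buckets(fps, num_bands=2)
print(shared)  # {(0, (1, 2)): {'a', 'b'}}
```

Documents that share a bucket are only candidates; a rough clustering would still need a similarity check on each candidate pair.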

ModuleNotFoundError: No module named 'lsh.cMinhash'

I am using Python 3.6.8. After installation, when I try to import using
from lsh import minhash
I get the following error trace

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nithin/inflict/LSH/lsh/minhash.py", line 9, in <module>
    from .cMinhash import minhash_32, minhash_64
ModuleNotFoundError: No module named 'lsh.cMinhash'

A few questions about the `len` argument in the function `MurmurHash3_x86_32`.

Hello @mattilyra, thanks for your awesome examples of detecting duplicated documents.
(The Jupyter notebook was so thorough that it was easy to understand.)

I'm trying to cluster short sentences using this method, but with word-level n-grams.
So I'm working on extending your cMinhash.pyx, but I cannot figure out what the second argument len does in the function MurmurHash3_x86_32.

My questions are:

  1. why did you use char_ngram as the len argument, and
  2. are there any harmful effects if I use word_ngram as the len argument?

Thanks.
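For context on the len question: in the reference MurmurHash3 C++ code, the second parameter of MurmurHash3_x86_32(key, len, seed, out) is the length of the key in bytes. With single-byte characters, a character n-gram of size n is exactly n bytes long, which would explain passing char_ngram as len. A word n-gram has a variable byte length, so passing the number of words would hash only the first few bytes of each key. The sketch below illustrates that effect with a stand-in: hash_prefix is a hypothetical helper built on zlib.crc32, not MurmurHash3 and not code from cMinhash.pyx.

```python
import zlib


def hash_prefix(key: bytes, length: int, seed: int) -> int:
    """Stand-in for a C hash with a (key, len, seed) signature.

    Like such a function, it reads only the first `length` bytes of `key`.
    """
    return zlib.crc32(key[:length], seed)


first = b"big cat"   # a word 2-gram
second = b"big dog"  # a different word 2-gram

# Passing the number of words (2) as the byte length hashes only b"bi"
# for both keys, so two different n-grams collide.
print(hash_prefix(first, 2, 0) == hash_prefix(second, 2, 0))  # True

# Passing the real byte length hashes the whole key.
print(hash_prefix(first, len(first), 0) == hash_prefix(second, len(second), 0))  # False
```

Under that reading, word-level n-grams would need the byte length of each encoded n-gram as len rather than the word count.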

BTW, I found this problem because the hash results from my implementation looked quite different from yours, even though the configuration was almost the same.

My result of hashing "Lorem Ipsum dolor sit amet" with 100 seeds and char-level 5-grams, using hajimes/mmh3, a Python wrapper of MurmurHash3, as the MinHasher:

array([  65436857,   26223331,  165959958,   35857255,  212417650,
        185799665,   72344264,   29695203,  306301591,   88841905,
         49846023,  193880158,  394644100,  393466921,   88563338,
        193342788,  289561251,   41457677,   46269772,   45140637,
         88731786,  154944682,  167707365,   12226981,  134694109,
        152174644,  149058781,  137634731,  282990808,  660085804,
         31993919,   95610818,   82276674,  393466240,  168429263,
        310122140,   96764607,  170415308,  793383417,   67665263,
        369128956,  663065730,    7993604,   62970620,  732822434,
        237305329,  161302415,  290720290,   68378231,   13636483,
        193113465,    3015742,   40301015,  455083766,  108353051,
        262511163,   84328315,   29373936,   97439899,   86035674,
        169048511,  301589216,  304074377,   44969229,  320465503,
         10129839,  429861020,  120736105,   69736016,  143478980,
        360628113,  348757135,  120123671, 1052150375,   61331130,
         25125176,   34933924,  182346076,  464411593,  305861551,
        325756924,  259878569,  369066011,   87468108,  557439393,
        104788999,   33171267,  268620735,  155177532,   29934811,
         19180594,   58288667,    8061171,  109245552,  104467657,
        176372959,  130951767,  258276624,   59320468,  915427336],
      dtype=uint32)

Your result of hashing with the same configuration.

array([2270775894, 2244931819, 2222833540, 2370931475, 2358887817,
       2286506241, 2483588865, 2235209090, 2242850826, 2670956706,
       2332349427, 2205899159, 3046739795, 2412257222, 2639427412,
       2439806156, 2481864998, 2315134778, 2276036063, 2173185890,
       2356592485, 2250310001, 2426157323, 2197343414, 2170959327,
       2666745886, 2497212147, 2227519238, 2270253453, 2682657866,
       2355382986, 2167642277, 2407297617, 2388667035, 2309089485,
       2186779532, 2574604323, 2216949965, 2218059463, 2158519866,
       2506498897, 2271297387, 2766549748, 2333709880, 2192453023,
       2213638709, 2298919119, 2334076817, 2655285423, 2181653514,
       2169583114, 2758877533, 2205629894, 2266512646, 2308863664,
       2190394274, 2694111477, 2799473812, 2430748017, 2214130591,
       2380590935, 2178089510, 2203907876, 2593729455, 2185184798,
       2274709474, 2494067266, 2626021353, 2202501877, 2355924309,
       2242977078, 2162025102, 2612350777, 2213862508, 2205571482,
       2238265438, 2305791018, 2187691276, 2318248647, 2219845855,
       2265366812, 2633383060, 2311319978, 2379408053, 2188968632,
       2639427412, 2383615522, 2401562252, 2164974019, 3230385414,
       2278782695, 2193521393, 2379669319, 2249922125, 2161391929,
       2178875277, 2261101105, 2341046147, 2664062261, 2251239581],
      dtype=uint32)

Unable to install in Python3

/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:241:15: note:
'curexc_traceback' declared here
PyObject *curexc_traceback;
^
lsh/cMinhash.cpp:19391:13: error: no member named 'exc_type' in '_ts'
tstate->exc_type = *type;
~~~~~~ ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
1 warning and 20 errors generated.

It works with Python 2.7, but I am not able to install it with Python 3.7.

storing cache

Is it possible to store the cache? If I created a cache and I want to use it later, how would I go about it?

How to make minhash scalable

Suppose I have 100,000 sentences or documents and I want to find the pairwise Jaccard similarity. How can the minhash algorithm be made scalable? Could you please add an example for this?

Unable to install

I am using Python 2.7.15, and after setup I get the following error when I try to run from lsh import minhash.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lsh/minhash.py", line 9, in <module>
    from .cMinhash import minhash_32, minhash_64
ImportError: No module named cMinhash

Unable to install on Mojave (10.14.2) or Ubuntu

This implementation has been working astonishingly well! Such a great resource and explanation of everything. I've been running this on a Windows 10 box and realized that the number of candidate duplicates isn't responding to bands / duplicates - you can specify, for example, 128 bands but cache.py will override that, i.e. setting 16 bands and 128 seeds will clear your example code ValueError but will raise an error from line 35 of cache.py. I was wondering if it was a Windows thing (edit: this was my issue) and tried to install on my Mac, and have been beating my head against the wall with the error below. Any chance you've encountered something like this before?

I've updated Xcode / command line tools / gcc but am getting nowhere (and the limits.h file does exist where it says it doesn't). Would love to get this up and running on my Mac if you have any ideas.

building 'lsh.cMinhash' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/admin/anaconda3/include -arch x86_64 -I/Users/admin/anaconda3/include -arch x86_64 -Ilsh -I/Users/admin/anaconda3/lib/python3.6/site-packages/numpy/core/include -I/Users/admin/anaconda3/include/python3.6m -c lsh/cMinhash.cpp -o build/temp.macosx-10.7-x86_64-3.6/lsh/cMinhash.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
In file included from /Users/admin/anaconda3/lib/gcc/x86_64-apple-darwin11.4.2/4.8.5/include-fixed/syslimits.h:7:0,
from /Users/admin/anaconda3/lib/gcc/x86_64-apple-darwin11.4.2/4.8.5/include-fixed/limits.h:34,
from /Users/admin/anaconda3/include/python3.6m/Python.h:11,
from lsh/cMinhash.cpp:27:
/Users/admin/anaconda3/lib/gcc/x86_64-apple-darwin11.4.2/4.8.5/include-fixed/limits.h:168:61: fatal error: limits.h: No such file or directory
#include_next <limits.h> /* recurse down to the real one */
^
compilation terminated.
error: command 'gcc' failed with exit status 1

Unable to install on Windows

I was unable to install the lib on Windows 10 x64, either with or without the Cython flag.
Here's the full setup log:

running install
running bdist_egg
running egg_info
writing lsh.egg-info\PKG-INFO
writing dependency_links to lsh.egg-info\dependency_links.txt
writing requirements to lsh.egg-info\requires.txt
writing top-level names to lsh.egg-info\top_level.txt
reading manifest file 'lsh.egg-info\SOURCES.txt'
writing manifest file 'lsh.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_py
running build_ext
building 'lsh.cMinhash' extension
C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.20.27508\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ilsh -Ic:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include -Ic:\_PROG_\WPy64-3720\python-3.7.2.amd64\include -Ic:\_PROG_\WPy64-3720\python-3.7.2.amd64\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.20.27508\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.20.27508\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\cppwinrt" /EHsc /Tplsh/cMinhash.cpp /Fobuild\temp.win-amd64-3.7\Release\lsh/cMinhash.obj
cMinhash.cpp
lsh/cMinhash.cpp(2153): warning C4244: =: converting "Py_ssize_t" to "uint32_t", data loss possible
lsh/cMinhash.cpp(2271): warning C4018: <: signed and unsigned types do not correspond
lsh/cMinhash.cpp(2554): warning C4244: =: converting "Py_ssize_t" to "uint32_t", data loss possible
lsh/cMinhash.cpp(2672): warning C4018: <: signed and unsigned types do not correspond
lsh/cMinhash.cpp(2957): error C2065: NPY_C_CONTIGUOUS: undeclared indentifier
lsh/cMinhash.cpp(3013): error C2065: NPY_F_CONTIGUOUS: undeclared indentifier
lsh/cMinhash.cpp(3198): error C2039: descr: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note:  see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(4835): error C2039: base: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note:  see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(4844): error C2039: base: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note:  see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(4879): error C2039: base: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note:  see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(4910): error C2039: base: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note:  see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(4911): error C2039: base: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note:  see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(19294): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19295): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19296): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19303): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19304): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19305): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19306): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19307): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19308): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19363): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19364): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19365): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19366): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19367): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19368): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19390): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19391): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19392): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19393): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19394): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
lsh/cMinhash.cpp(19395): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note:  see declaration "_ts"
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Professional\\VC\\Tools\\MSVC\\14.20.27508\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
