mattilyra / lsh
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
License: MIT License
I'm fairly new to Python. What's the easiest way to install this package? I don't see a setup.py, so it's not clear what I need to do. I'm running OS X 10.8.5.
Thanks! I've implemented near-duplicate detection using LSH in Java, but my new code base is in Python, so this would help a lot.
Encountered some compatibility issues while installing LSH due to different Python versions:
In "lsh/cMinhash.cpp" at line 19292, there is an error related to 'PyThreadState' (also known as 'struct _ts') where it mentions that 'exc_type' is not a member, and it suggests replacing it with 'curexc_type' to resolve the issue.
In "lsh/cMinhash.cpp" at line 17704, there is another error involving 'PyTypeObject' (or 'struct _typeobject') where 'tp_print' is not a member. The solution is to replace 'tp_print' with 'tp_vectorcall_offset'.
There is another one.
python3.10/object.h:133:33: error: lvalue required as increment operand
133 | #define Py_REFCNT(ob) _Py_REFCNT(_PyObject_CAST_CONST(ob))
| ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
It appears that MinHash.jaccard is expecting two sets to be given here, where the & and | are used for set intersection and union, respectively:
Line 76 in da67215
From what I understand, it's being passed two numpy arrays from Cache (since they're outputs of the fingerprint functions):
Lines 65 to 66 in da67215
The code doesn't raise an exception, because & and | are overloaded for numpy, but I'm concerned that this may not be computing Jaccard correctly.
From my testing, I found that this jaccard function did not work as expected (didn't filter any candidates).
I apologize if I'm not understanding this correctly, please correct me if I'm wrong!
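The concern can be checked with a small experiment. The referenced line is not reproduced here, so this sketch assumes the similarity is computed as len(a & b) / len(a | b); the function names jaccard_sets and minhash_agreement are hypothetical and not part of the lsh package. On numpy arrays, & and | are elementwise bitwise operations, so both results keep the input length and the ratio is always 1.0 for equal-length fingerprints, which would match the observation that no candidates were filtered.

```python
import numpy as np

def jaccard_sets(a, b):
    """Jaccard similarity of the distinct values in two fingerprints."""
    sa, sb = set(np.asarray(a).tolist()), set(np.asarray(b).tolist())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def minhash_agreement(a, b):
    """Fraction of positions where two equal-length fingerprints agree.

    This is the usual MinHash estimate of the Jaccard similarity of the
    underlying shingle sets.
    """
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        raise ValueError("fingerprints must have the same shape")
    return float(np.mean(a == b))

# Two made-up fingerprints that differ in one position.
fp_a = np.array([3, 5, 8, 13], dtype=np.uint64)
fp_b = np.array([3, 5, 9, 13], dtype=np.uint64)

# Assumed form of the problematic computation: bitwise ops keep the length,
# so the ratio is 4 / 4 regardless of the contents.
naive = len(fp_a & fp_b) / len(fp_a | fp_b)

set_based = jaccard_sets(fp_a, fp_b)        # 3 shared values out of 5 distinct
positional = minhash_agreement(fp_a, fp_b)  # 3 of 4 positions agree
```

Whether the set-based or the positional estimate is the right replacement depends on what the fingerprints are meant to represent; the positional one is the standard MinHash estimator.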
at the moment all the documents are stored in an in-memory database - it should be possible to define this to be anything that supports getting/setting items
Allow a stream to be fed in and deduplicated in parallel. Obviously the deduplication itself cannot happen in parallel, but shingling and minhashing the documents can. Given a fast enough backend for storing the fingerprints, this should significantly speed up deduplicating large document collections.
Wanted to see if the current project can be extended to popular document formats like pdf, docx etc.
I want a rough clustering of the input, that is, input sets that are similar to each other in terms of the Jaccard coefficient should be grouped together. I think LSH may be able to accomplish such a task, since it hashes similar items into the same bucket. I wonder if I can check which items are grouped together. I tried the following, but I do not think it is a good way to do it.
lsh.hashtables[6]._dict.values()
Any help or suggestion is much appreciated.
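One way to turn buckets into rough clusters is to treat every bucket as evidence that its members belong together and merge overlapping buckets with a union-find structure. The sketch below assumes the buckets are available as plain collections of document ids (how to obtain them from a particular library is not covered); cluster_from_buckets is a hypothetical helper, not an API of any LSH package.

```python
from collections import defaultdict

def cluster_from_buckets(buckets):
    """Merge LSH buckets that share members into disjoint groups.

    buckets: iterable of collections of hashable document ids.
    Returns a sorted list of sorted groups.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for bucket in buckets:
        items = list(bucket)
        if not items:
            continue
        find(items[0])  # register singleton buckets too
        for other in items[1:]:
            union(items[0], other)

    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    return sorted(sorted(group) for group in groups.values())

example_buckets = [{"a", "b"}, {"b", "c"}, {"d"}, {"e", "f"}]
clusters = cluster_from_buckets(example_buckets)
```

Because LSH buckets contain false positives, groups built this way can chain together documents that are only loosely related; verifying candidate pairs with an exact Jaccard computation before merging is a common safeguard.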
I am using Python 3.6.8. After installation, when I try to import using
from lsh import minhash
I get the following error trace
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/nithin/inflict/LSH/lsh/minhash.py", line 9, in <module>
from .cMinhash import minhash_32, minhash_64
ModuleNotFoundError: No module named 'lsh.cMinhash'
Hello @mattilyra, thanks for your awesome examples of detecting duplicate documents.
(The Jupyter notebook was so thorough that it was easy to understand.)
I'm trying to cluster short sentences using this method, but with word-level n-grams.
So I'm working on extending your cMinhash.pyx, but I cannot figure out what the second argument len is doing in the function MurmurHash3_x86_32.
My question is about using char_ngram as the len argument, and word_ngram as the len argument. Thanks.
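In the reference C implementation of MurmurHash3, the second parameter len is the length of the key in bytes. For character n-grams over a byte string that length is constant, but for word n-grams it changes from one n-gram to the next. The sketch below only computes the (offset, length) pairs that would be handed to the hash function; it does not call MurmurHash3, and the helper names are hypothetical. How cMinhash.pyx itself slides over the input is not shown in this thread, so that part is an assumption.

```python
def char_ngram_spans(data: bytes, n: int):
    """(offset, length) of every n-byte window; the length is always n."""
    return [(i, n) for i in range(len(data) - n + 1)]

def word_ngram_spans(text: str, n: int, encoding: str = "utf-8"):
    """(offset, length) in bytes of every run of n whitespace-separated words."""
    data = text.encode(encoding)
    words = []  # (start, end) byte offsets of each word
    pos = 0
    for word in data.split():
        start = data.index(word, pos)
        end = start + len(word)
        words.append((start, end))
        pos = end
    return [
        (words[i][0], words[i + n - 1][1] - words[i][0])
        for i in range(len(words) - n + 1)
    ]

text = "Lorem Ipsum dolor sit amet"
char_spans = char_ngram_spans(text.encode("utf-8"), 5)
word_spans = word_ngram_spans(text, 2)
# word_spans -> [(0, 11), (6, 11), (12, 9), (18, 8)]
```

Under this reading, a word-level variant would need to pass the byte length of each individual n-gram as len, rather than the number of words.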
BTW, I found this problem because the hash values looked quite different between my implementation and yours, even though the configuration was almost the same.
My result of hashing "Lorem Ipsum dolor sit amet" with 100 seeds and character-level 5-grams, using hajimes/mmh3 (a Python wrapper of MurmurHash3) as the MinHasher:
array([ 65436857, 26223331, 165959958, 35857255, 212417650,
185799665, 72344264, 29695203, 306301591, 88841905,
49846023, 193880158, 394644100, 393466921, 88563338,
193342788, 289561251, 41457677, 46269772, 45140637,
88731786, 154944682, 167707365, 12226981, 134694109,
152174644, 149058781, 137634731, 282990808, 660085804,
31993919, 95610818, 82276674, 393466240, 168429263,
310122140, 96764607, 170415308, 793383417, 67665263,
369128956, 663065730, 7993604, 62970620, 732822434,
237305329, 161302415, 290720290, 68378231, 13636483,
193113465, 3015742, 40301015, 455083766, 108353051,
262511163, 84328315, 29373936, 97439899, 86035674,
169048511, 301589216, 304074377, 44969229, 320465503,
10129839, 429861020, 120736105, 69736016, 143478980,
360628113, 348757135, 120123671, 1052150375, 61331130,
25125176, 34933924, 182346076, 464411593, 305861551,
325756924, 259878569, 369066011, 87468108, 557439393,
104788999, 33171267, 268620735, 155177532, 29934811,
19180594, 58288667, 8061171, 109245552, 104467657,
176372959, 130951767, 258276624, 59320468, 915427336],
dtype=uint32)
Your result of hashing with the same configuration.
array([2270775894, 2244931819, 2222833540, 2370931475, 2358887817,
2286506241, 2483588865, 2235209090, 2242850826, 2670956706,
2332349427, 2205899159, 3046739795, 2412257222, 2639427412,
2439806156, 2481864998, 2315134778, 2276036063, 2173185890,
2356592485, 2250310001, 2426157323, 2197343414, 2170959327,
2666745886, 2497212147, 2227519238, 2270253453, 2682657866,
2355382986, 2167642277, 2407297617, 2388667035, 2309089485,
2186779532, 2574604323, 2216949965, 2218059463, 2158519866,
2506498897, 2271297387, 2766549748, 2333709880, 2192453023,
2213638709, 2298919119, 2334076817, 2655285423, 2181653514,
2169583114, 2758877533, 2205629894, 2266512646, 2308863664,
2190394274, 2694111477, 2799473812, 2430748017, 2214130591,
2380590935, 2178089510, 2203907876, 2593729455, 2185184798,
2274709474, 2494067266, 2626021353, 2202501877, 2355924309,
2242977078, 2162025102, 2612350777, 2213862508, 2205571482,
2238265438, 2305791018, 2187691276, 2318248647, 2219845855,
2265366812, 2633383060, 2311319978, 2379408053, 2188968632,
2639427412, 2383615522, 2401562252, 2164974019, 3230385414,
2278782695, 2193521393, 2379669319, 2249922125, 2161391929,
2178875277, 2261101105, 2341046147, 2664062261, 2251239581],
dtype=uint32)
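Every value in the second array is at least 2^31, while every value in the first is below it. One possible explanation, which is a hypothesis and not confirmed anywhere in this thread, is that one implementation takes the minimum over hashes interpreted as signed 32-bit integers and the other over unsigned ones. The sketch below uses made-up hash values to show how the two interpretations pick different minima.

```python
def to_signed32(u: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit value."""
    return u - (1 << 32) if u >= (1 << 31) else u

def to_unsigned32(s: int) -> int:
    """Reinterpret a signed 32-bit value as an unsigned 32-bit value."""
    return s & 0xFFFFFFFF

# Made-up hash values for illustration only.
hashes = [5, (1 << 31) + 10, 4_000_000_000, 123_456]

# Minimum over the unsigned interpretation: the smallest value wins.
unsigned_min = min(hashes)

# Minimum over the signed interpretation: the most negative value wins,
# which reads back as a number just above 2**31 when viewed as uint32.
signed_min = to_unsigned32(min(to_signed32(h) for h in hashes))
```

If this is what happens, both outputs are valid MinHash fingerprints as long as the same interpretation is used for every document, but fingerprints produced by the two implementations would not be comparable with each other.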
/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:241:15: note:
'curexc_traceback' declared here
PyObject *curexc_traceback;
^
lsh/cMinhash.cpp:19391:13: error: no member named 'exc_type' in '_ts'
tstate->exc_type = *type;
~~~~~~ ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
1 warning and 20 errors generated.
It works for Python 2.7, but I am not able to install it under Python 3.7.
I cannot use it.
SimHash is another LSH technique for near-duplicate detection; it relies on cosine similarity instead of Jaccard similarity.
https://en.wikipedia.org/wiki/SimHash
https://doi.org/10.1145/509907.509965
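For orientation, a minimal SimHash sketch in pure Python follows. It is an illustration of the general technique, not code from this package: the tokenization (lowercased word tokens), the unweighted voting and the use of BLAKE2b as the 64-bit token hash are all assumptions made for the example.

```python
import hashlib
import re

BITS = 64

def _hash64(token: str) -> int:
    """64-bit hash of a token (BLAKE2b is a stand-in choice here)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def simhash(text: str) -> int:
    """64-bit SimHash fingerprint of the word tokens in text."""
    votes = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        h = _hash64(token)
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

fp1 = simhash("The quick brown fox jumps over the lazy dog")
fp2 = simhash("the quick brown fox jumps over the lazy dog")
```

Near-duplicate documents tend to yield fingerprints with a small Hamming distance, so candidates are usually found by thresholding that distance rather than by comparing fingerprints for equality.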
Installation issue on the virtual environment:
error: command 'cc' failed with exit status 1
Is it possible to store the cache? If I created a cache and I want to use it later, how would I go about it?
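One generic option in Python is pickle. Whether the Cache object of this package can be pickled is not confirmed anywhere in this thread, so the sketch below uses a plain dictionary of made-up fingerprints as a stand-in; save_object and load_object are hypothetical helper names.

```python
import os
import pickle
import tempfile

def save_object(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_object(path):
    """Load a pickled object from path (only use with trusted files)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Stand-in for a cache: document id -> fingerprint (made-up values).
stand_in_cache = {"doc-1": [12, 7, 99], "doc-2": [12, 8, 99]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.pkl")
    save_object(stand_in_cache, path)
    restored = load_object(path)
```

If the real cache holds references to compiled hasher objects, pickling may fail; in that case storing only the fingerprints and rebuilding the cache from them is a possible fallback.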
lsh should be pip installable, use cookiecutter
Suppose I have 100,000 sentences or documents and I want to find the pairwise Jaccard similarity. How can the MinHash algorithm be made scalable? Could you please add an example for this?
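The usual way to avoid comparing all pairs is banding: split each MinHash signature into bands, bucket documents by each band, and compute the exact Jaccard similarity only for documents that share at least one bucket. The following is a self-contained sketch of that idea in pure Python. It does not use the lsh package API; the function names, the BLAKE2b-based hashing and the default parameters are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, n=5):
    """Set of character n-grams after whitespace normalization."""
    text = " ".join(text.split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per seeded hash function over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8,
                                salt=salt).digest(), "big")
            for s in shingle_set))
    return signature

def candidate_pairs(docs, num_hashes=64, bands=16):
    """Pairs of document ids that collide in at least one band."""
    if num_hashes % bands:
        raise ValueError("num_hashes must be divisible by bands")
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

docs = {
    "a": "Lorem ipsum dolor sit amet",
    "b": "Lorem ipsum   dolor sit amet",  # same text, extra whitespace
    "c": "An entirely unrelated sentence about hashing",
}
pairs = candidate_pairs(docs)
```

More bands with fewer rows each make collisions more likely and so raise recall at the cost of more candidates to verify; fewer, longer bands do the opposite.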
I am using Python 2.7.15, and after setup I get the following error when I try to run from lsh import minhash.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lsh/minhash.py", line 9, in <module>
from .cMinhash import minhash_32, minhash_64
ImportError: No module named cMinhash
This implementation has been working astonishingly well! Such a great resource and explanation of everything. I've been running this on a Windows 10 box and realized that the number of candidate duplicates isn't responding to the number of bands: you can specify, for example, 128 bands, but cache.py will override that. That is, setting 16 bands and 128 seeds will clear the ValueError in your example code but will raise an error from line 35 of cache.py. I was wondering if it was a Windows thing (edit: this was my issue) and tried to install on my Mac, and have been beating my head against the wall with the error below. Any chance you've encountered something like this before?
I've updated Xcode / command line tools / gcc but am getting nowhere (and the limits.h file does exist where it says it doesn't). Would love to get this up and running on my Mac if you have any ideas.
building 'lsh.cMinhash' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/admin/anaconda3/include -arch x86_64 -I/Users/admin/anaconda3/include -arch x86_64 -Ilsh -I/Users/admin/anaconda3/lib/python3.6/site-packages/numpy/core/include -I/Users/admin/anaconda3/include/python3.6m -c lsh/cMinhash.cpp -o build/temp.macosx-10.7-x86_64-3.6/lsh/cMinhash.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
In file included from /Users/admin/anaconda3/lib/gcc/x86_64-apple-darwin11.4.2/4.8.5/include-fixed/syslimits.h:7:0,
from /Users/admin/anaconda3/lib/gcc/x86_64-apple-darwin11.4.2/4.8.5/include-fixed/limits.h:34,
from /Users/admin/anaconda3/include/python3.6m/Python.h:11,
from lsh/cMinhash.cpp:27:
/Users/admin/anaconda3/lib/gcc/x86_64-apple-darwin11.4.2/4.8.5/include-fixed/limits.h:168:61: fatal error: limits.h: No such file or directory
#include_next <limits.h> /* recurse down to the real one */
^
compilation terminated.
error: command 'gcc' failed with exit status 1
Is it possible to extend LSH to detect near duplicate images?
I was unable to install the lib on Windows 10 x64, either with or without the Cython flag.
Here's the full setup log:
running install
running bdist_egg
running egg_info
writing lsh.egg-info\PKG-INFO
writing dependency_links to lsh.egg-info\dependency_links.txt
writing requirements to lsh.egg-info\requires.txt
writing top-level names to lsh.egg-info\top_level.txt
reading manifest file 'lsh.egg-info\SOURCES.txt'
writing manifest file 'lsh.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_py
running build_ext
building 'lsh.cMinhash' extension
C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.20.27508\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ilsh -Ic:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include -Ic:\_PROG_\WPy64-3720\python-3.7.2.amd64\include -Ic:\_PROG_\WPy64-3720\python-3.7.2.amd64\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.20.27508\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.20.27508\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\cppwinrt" /EHsc /Tplsh/cMinhash.cpp /Fobuild\temp.win-amd64-3.7\Release\lsh/cMinhash.obj
cMinhash.cpp
lsh/cMinhash.cpp(2153): warning C4244: =: converting "Py_ssize_t" to "uint32_t", data loss possible
lsh/cMinhash.cpp(2271): warning C4018: <: signed and unsigned types do not correspond
lsh/cMinhash.cpp(2554): warning C4244: =: converting "Py_ssize_t" to "uint32_t", data loss possible
lsh/cMinhash.cpp(2672): warning C4018: <: signed and unsigned types do not correspond
lsh/cMinhash.cpp(2957): error C2065: NPY_C_CONTIGUOUS: undeclared identifier
lsh/cMinhash.cpp(3013): error C2065: NPY_F_CONTIGUOUS: undeclared identifier
lsh/cMinhash.cpp(3198): error C2039: descr: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note: see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(4835): error C2039: base: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note: see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(4844): error C2039: base: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note: see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(4879): error C2039: base: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note: see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(4910): error C2039: base: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note: see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(4911): error C2039: base: is not a member of "tagPyArrayObject"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\lib\site-packages\numpy\core\include\numpy\ndarraytypes.h(722): note: see declaration "tagPyArrayObject"
lsh/cMinhash.cpp(19294): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19295): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19296): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19303): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19304): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19305): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19306): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19307): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19308): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19363): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19364): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19365): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19366): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19367): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19368): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19390): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19391): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19392): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19393): error C2039: exc_type: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19394): error C2039: exc_value: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
lsh/cMinhash.cpp(19395): error C2039: exc_traceback: is not a member of "_ts"
c:\_PROG_\WPy64-3720\python-3.7.2.amd64\include\pystate.h(212): note: see declaration "_ts"
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Professional\\VC\\Tools\\MSVC\\14.20.27508\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2