Comments (1)
The method will work as long as you have some way of extracting the text content from the PDF and DOCX files. I don't intend to integrate a PDF or DOCX parser into lsh,
but there are a number of Python libraries for handling those documents:
For PDFs
https://pypi.python.org/pypi/PyPDF2
http://www.unixuser.org/~euske/python/pdfminer/index.html
For DOCX
https://python-docx.readthedocs.io/en/latest/
I have not used any of these libraries myself, though.
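
As a rough illustration of the extraction step, here is a minimal sketch that turns a PDF or DOCX file into a plain string. It is not part of lsh. It assumes PyPDF2 2.0 or later (for `PdfReader`) and python-docx are installed, and the function names `extract_pdf_text`, `extract_docx_text`, and `extract_text` are made up for this example. The plain-text branch is only there for convenience.

```python
from pathlib import Path


def extract_pdf_text(path):
    """Return the text of every page in a PDF, joined by newlines."""
    # Imported lazily so PyPDF2 is only needed when a PDF is actually read.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 >= 2.0

    reader = PdfReader(str(path))
    # extract_text() may yield nothing for scanned or image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_docx_text(path):
    """Return the text of every body paragraph in a DOCX file."""
    # The python-docx package is imported under the name "docx".
    from docx import Document

    document = Document(str(path))
    # Only body paragraphs are read; tables, headers and footers are skipped.
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


def extract_text(path):
    """Pick an extractor based on the file extension (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return extract_pdf_text(path)
    if suffix == ".docx":
        return extract_docx_text(path)
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

The string returned by `extract_text` can then be fed to the deduplication step in the same way as any other document text. Extraction quality varies a lot between PDFs, so it is worth spot-checking the output before relying on it.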