
steven-s / minhash-document-clusters


Minhash clustering of text documents

License: MIT License

Scala 90.92% Shell 9.08%
document-clustering clustering lsh text-mining locality-sensitive-hashing minhash-lsh-algorithm minhash

minhash-document-clusters's Introduction

Document Clustering utilizing MinHash signatures and LSH

This project implements two basic approaches to clustering similar documents.

Brute Force

The brute-force clustering runs as a Spark job. It compares every pair of documents, using k-shingling to measure their similarity.
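As a rough illustration of the brute-force idea, the sketch below breaks each document into character k-shingles and scores every pair of documents. It uses plain Scala collections rather than Spark. The names `BruteForceSketch`, `shingles`, `jaccard`, and `similarPairs`, the use of character (rather than word) shingles, and the choice of Jaccard similarity as the score are all assumptions made for illustration, not taken from this project's source.

```scala
object BruteForceSketch {
  // Character k-shingles: every contiguous substring of length k,
  // after lower-casing and collapsing runs of whitespace.
  def shingles(text: String, k: Int): Set[String] = {
    val normalized = text.toLowerCase.replaceAll("\\s+", " ").trim
    normalized.sliding(k).toSet
  }

  // Jaccard similarity: |A intersect B| / |A union B|.
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size

  // Compare every unordered pair of documents (quadratic in the number
  // of documents) and keep the pairs at or above the threshold.
  def similarPairs(
      docs: Map[String, String],
      k: Int,
      threshold: Double
  ): Seq[(String, String, Double)] = {
    val shingled = docs.toSeq
      .map { case (id, text) => (id, shingles(text, k)) }
      .sortBy(_._1)
    for {
      (idA, setA) <- shingled
      (idB, setB) <- shingled
      if idA < idB
      similarity = jaccard(setA, setB)
      if similarity >= threshold
    } yield (idA, idB, similarity)
  }
}
```

Because every pair is compared, the work grows quadratically with the number of documents, which is the cost the LSH approach tries to avoid.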

LSH

The LSH clustering also runs as a Spark job. It computes MinHash signatures for the documents and then uses Locality-Sensitive Hashing (LSH) to generate candidate pairs, which can then be tested for similarity.
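The following sketch shows one way the two steps could fit together: a MinHash signature per document, then banding of the signatures so that documents sharing any band become candidate pairs. It is written against plain Scala collections rather than Spark. The object and method names, the use of seeded `MurmurHash3` as the hash family, and the fixed random seed are assumptions for illustration and do not reflect this project's actual code.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

object MinHashLshSketch {
  // One seed per signature position; each seed stands in for a
  // separate hash function (an assumption of this sketch).
  def seeds(numHashes: Int, seed: Long = 42L): Vector[Int] = {
    val rng = new Random(seed)
    Vector.fill(numHashes)(rng.nextInt())
  }

  // MinHash signature: for each hash function, the minimum hash value
  // over all shingles of the document.
  def signature(shingles: Set[String], seeds: Vector[Int]): Vector[Int] =
    seeds.map { seed =>
      shingles.foldLeft(Int.MaxValue) { (best, shingle) =>
        math.min(best, MurmurHash3.stringHash(shingle, seed))
      }
    }

  // The fraction of positions where two signatures agree estimates
  // the Jaccard similarity of the underlying shingle sets.
  def estimateSimilarity(a: Vector[Int], b: Vector[Int]): Double = {
    require(a.nonEmpty && a.length == b.length, "signatures must match in length")
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.length
  }

  // LSH banding: split each signature into bands of rowsPerBand values.
  // Documents that agree on every row of at least one band are candidates.
  def candidatePairs(
      signatures: Map[String, Vector[Int]],
      rowsPerBand: Int
  ): Set[(String, String)] = {
    val bucketed = for {
      (id, sig) <- signatures.toSeq
      (band, index) <- sig.grouped(rowsPerBand).zipWithIndex.toSeq
    } yield ((index, band), id)

    bucketed
      .groupBy(_._1)
      .values
      .flatMap { entries =>
        val ids = entries.map(_._2).sorted
        for { a <- ids; b <- ids if a < b } yield (a, b)
      }
      .toSet
  }
}
```

With b bands of r rows each, two documents whose Jaccard similarity is s become a candidate pair with probability 1 - (1 - s^r)^b, so the band and row counts control where the similarity threshold effectively falls. Candidate pairs still need to be checked against the real similarity measure, since banding can produce false positives.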

Locality Sensitive-what now

More information on this subject can be found in Chapter 3 of the Stanford Mining of Massive Datasets text.

Shortcomings

This project finds similar documents quite well, but the way it builds clusters from the resulting matches is imperfect: it sometimes outputs clusters that are subsets of other clusters, or clusters that intersect one another. There are approaches that eliminate these issues, but they are not implemented in this source yet.
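One commonly used way to avoid subset and intersecting clusters is to treat each matched pair as an edge in a graph and report the connected components, so that every document ends up in exactly one cluster. The sketch below does this with a small union-find structure. It is not part of this project, and `ClusterMergeSketch` and `connectedComponents` are hypothetical names chosen for illustration.

```scala
import scala.collection.mutable

object ClusterMergeSketch {
  // Merge matched pairs into disjoint clusters using union-find.
  def connectedComponents(pairs: Iterable[(String, String)]): Set[Set[String]] = {
    // parent(x) points towards the representative of x's cluster.
    val parent = mutable.Map.empty[String, String]

    // Find the representative of x, compressing the path on the way back.
    def find(x: String): String = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x
      else {
        val root = find(p)
        parent(x) = root
        root
      }
    }

    // Join the clusters that contain a and b.
    def union(a: String, b: String): Unit = {
      val rootA = find(a)
      val rootB = find(b)
      if (rootA != rootB) parent(rootA) = rootB
    }

    pairs.foreach { case (a, b) => union(a, b) }

    // Group every seen document by its representative.
    parent.keys.toSeq.groupBy(find).values.map(_.toSet).toSet
  }
}
```

The trade-off is that merging is transitive: if A matches B and B matches C, all three land in one cluster even when A and C are not similar enough to match directly.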

Where to find the data used when developing this project?

I found a collection of short BBC articles in plain text at this site:

http://mlg.ucd.ie/datasets/bbc.html

Surprisingly, the different corpora already contain duplicates.


minhash-document-clusters's Issues

For better cluster result

This project can find similarities between documents quite well, but its clustering approach with the resulting matches is not perfect and will sometimes output subsets of clusters and intersecting clusters.

Hello Stev, do you know of any approaches that improve the clustering results after using LSH and MinHash? Thank you.
