taki0112 / vector_similarity

Python and Java implementations of TS-SS, the similarity measure proposed in "A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"

License: MIT License

Java 30.22% Python 69.78%
vector-similarity document-clustering

vector_similarity's Introduction

Vector_Similarity

  • Python and Java implementations of TS-SS, the similarity measure proposed in "A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"
  • I have also summarized the paper below.
  • I recommend TS-SS over cosine similarity or Euclidean distance.

The reasons are...

Cosine drawbacks

[Figure: cosine drawbacks]

Euclidean drawbacks

[Figure: Euclidean drawbacks]
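The two drawbacks can be reproduced numerically. The sketch below assumes the usual readings of them: cosine ignores magnitude, and Euclidean distance ignores the angle between the vectors. The vectors and helper names are made up for illustration and are not taken from the repository.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)

# Cosine drawback: same direction, very different magnitudes,
# yet cosine reports them as (almost exactly) identical.
short_vec, long_vec = (1.0, 2.0), (10.0, 20.0)
cos_same_direction = cosine_similarity(short_vec, long_vec)

# Euclidean drawback: both pairs are sqrt(2) apart, but the first pair
# is orthogonal while the second pair points in nearly the same direction.
orthogonal_pair = ((1.0, 0.0), (0.0, 1.0))
narrow_pair = ((10.0, 0.0), (10.0, math.sqrt(2.0)))

print(cos_same_direction)
print(euclidean_distance(*orthogonal_pair), cosine_similarity(*orthogonal_pair))
print(euclidean_distance(*narrow_pair), cosine_similarity(*narrow_pair))
```

Neither measure alone separates these cases, which is the motivation for combining angle and magnitude in TS-SS.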

Triangle's Area Similarity (TS)

[Figure: TS]
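For readers without the figure, this is the TS definition as I understand it from the paper [1]. The notation is an assumption on my part: V(A, B) is the cosine similarity and θ' is the angle between the vectors plus a 10 degree offset, so please check it against the paper.

```latex
\theta' = \cos^{-1}\!\big(V(A,B)\big) + 10^{\circ},
\qquad
TS(A,B) = \frac{|A|\,|B|\,\sin(\theta')}{2}
```

The offset keeps the triangle from collapsing to zero area when the two vectors point in exactly the same direction.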

Sector's Area Similarity (SS)

[Figure: SS]
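A sketch of the SS definition, again as I understand it from the paper [1]. The symbols are assumed notation: ED is the Euclidean distance, MD the magnitude difference, and θ' the cosine angle plus 10 degrees, expressed in degrees.

```latex
MD(A,B) = \big|\,|A| - |B|\,\big|,
\qquad
SS(A,B) = \pi\,\big(ED(A,B) + MD(A,B)\big)^{2}\,\frac{\theta'}{360}
```

The factor θ'/360 is the fraction of a full circle of radius ED + MD covered by the sector, which is why θ' has to be in degrees here.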

TS-SS

[Figure: TS-SS]
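TS-SS is the product of the two areas. Below is a minimal pure-Python sketch, assuming the paper's definitions (a 10 degree offset on the angle, sine taken of the angle in radians, sector fraction taken in degrees). The function names are illustrative and are not the API of this repository.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def theta_degrees(a, b):
    """Angle between a and b plus the 10 degree offset, in degrees."""
    c = max(-1.0, min(1.0, cosine(a, b)))  # clamp against float rounding
    return math.degrees(math.acos(c)) + 10.0

def triangle_area(a, b):
    # math.sin expects radians, so convert exactly once.
    return norm(a) * norm(b) * math.sin(math.radians(theta_degrees(a, b))) / 2.0

def sector_area(a, b):
    ed = math.dist(a, b)            # Euclidean distance
    md = abs(norm(a) - norm(b))     # magnitude difference
    return math.pi * (ed + md) ** 2 * theta_degrees(a, b) / 360.0

def ts_ss(a, b):
    return triangle_area(a, b) * sector_area(a, b)

print(ts_ss((1.0, 2.0), (2.0, 4.0)))
print(ts_ss((1.0, 2.0), (1.0, 2.0)))  # identical vectors: sector area is 0
```

Smaller values mean more similar vectors; identical vectors give 0 because both ED and MD vanish. Zero-length vectors are not handled in this sketch.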

Results

[Figure: results]

Conclusion

  • On the largest dataset, TS-SS outperforms cosine by a significant margin, while on the other datasets it outperforms cosine only slightly.

  • The significantly better result on the largest dataset therefore supports the robustness and reliability of the model for big data and real-world data, where the variety of documents/texts is high.

Reference

[1] "A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"

vector_similarity's People

Contributors

markomih, taki0112, trdavidson

vector_similarity's Issues

Angle theta converted to radians twice

Hi,

thanks a lot for the great implementation of TS-SS!
I noticed that in your implementation, as well as in all the others I have found, the angle theta is converted to radians although it is already in radians.

See the following line here:
https://github.com/taki0112/Vector_Similarity/blob/3a1f6248aba6be8eed865339a26e800dcf02c028/python/TS_SS/Vector_Similarity.py#L20C1-L20C43

Is there a reason for this conversion? As far as I can tell, Theta() outputs the angle in radians (as also shown by the fact that math.radians(10) is added to the acos result).

Similarly, the area of the sector is calculated with the formula for an angle in degrees, but again, as far as I can tell, Theta is in radians.

Thanks a lot in advance!
Clemens
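To illustrate the concern, here is a small standard-library sketch. It does not reproduce the repository code; it only mimics the pattern described above for a hypothetical pair of parallel vectors (cosine similarity 1.0), and the variable names are made up.

```python
import math

cos_sim = 1.0  # hypothetical: two parallel vectors

# acos returns radians, and math.radians(10) is also in radians,
# so this sum is already a radian value (about 0.1745, i.e. 10 degrees).
theta_rad = math.acos(cos_sim) + math.radians(10)

# Converting a second time scales the angle by another factor of pi/180.
theta_twice = math.radians(theta_rad)

sin_double_converted = math.sin(theta_twice)  # about 0.0030
sin_single_converted = math.sin(theta_rad)    # about 0.1736, i.e. sin(10 degrees)

# The sector formula divides by 360, which presumes an angle in degrees.
fraction_if_radians_used = theta_rad / 360.0
fraction_if_degrees_used = math.degrees(theta_rad) / 360.0

print(sin_double_converted, sin_single_converted)
print(fraction_if_radians_used, fraction_if_degrees_used)
```

Both spots shrink the result by roughly a factor of 57 relative to a unit-consistent calculation. Whether that changes rankings in practice is a separate question that this sketch does not answer.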

License

What is the license for this software and the TS algorithms? Are they open source and available for commercial use?

Vectorized Calculation?

If we want to compute TS-SS for large collections of documents, can it be vectorized like cosine similarity?
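One possible approach, sketched with NumPy broadcasting. This is not part of the repository: `ts_ss_pairwise` is a hypothetical name, and the unit handling (10 degree offset, sine in radians, sector fraction in degrees) is my reading of the paper's definitions, not necessarily what the existing code does.

```python
import numpy as np

def ts_ss_pairwise(X, Y):
    """Return an (n, m) matrix of TS-SS values between rows of X and rows of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)

    x_norm = np.linalg.norm(X, axis=1)        # shape (n,)
    y_norm = np.linalg.norm(Y, axis=1)        # shape (m,)
    norm_prod = np.outer(x_norm, y_norm)      # shape (n, m)
    dot = X @ Y.T                             # all pairwise dot products

    # Angle plus offset, in radians; clip guards against rounding outside [-1, 1].
    cos = np.clip(dot / norm_prod, -1.0, 1.0)
    theta = np.arccos(cos) + np.radians(10.0)

    # Pairwise Euclidean distance via |x|^2 + |y|^2 - 2 x.y
    sq_dist = x_norm[:, None] ** 2 + y_norm[None, :] ** 2 - 2.0 * dot
    ed = np.sqrt(np.maximum(sq_dist, 0.0))
    md = np.abs(x_norm[:, None] - y_norm[None, :])

    triangle = norm_prod * np.sin(theta) / 2.0
    sector = np.pi * (ed + md) ** 2 * np.degrees(theta) / 360.0
    return triangle * sector

docs = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 1.0]])
print(ts_ss_pairwise(docs, docs))
```

The cost is dominated by the single matrix product, as with vectorized cosine similarity. Zero-length rows would divide by zero, and the expanded distance formula loses some precision for nearly identical rows, so treat this as a starting point.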

Why do the Euclidean and TS-SS results exceed 1?

I tried the Java version. When I run the example from the code, the result is as follows:
euclidean: 2.23606797749979
cosine similarity: 0.9999999999999998
ts ss: 4.6395825669994775E-4

I understand that the cosine similarity is 0.99, because I expect it to be out of 1 or 100%. But why do Euclidean and TS-SS exceed 100%? How should they be measured, then?
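A small sketch that may help read these numbers. The vectors (1, 2) and (2, 4) are an assumption: the issue does not state the inputs, and these are used only because they reproduce the reported Euclidean value.

```python
import math

# Hypothetical inputs, chosen because they reproduce the reported Euclidean value.
a, b = (1.0, 2.0), (2.0, 4.0)

euclidean = math.dist(a, b)  # sqrt(5)

# Euclidean distance is a distance, not a percentage: it grows with scale.
scaled = math.dist((10.0, 20.0), (20.0, 40.0))

# The reported TS-SS value is in scientific notation: E-4 means "times 10^-4".
reported_ts_ss = float("4.6395825669994775E-4")

print(euclidean, scaled)
print(f"{reported_ts_ss:.10f}")
```

So only the Euclidean value is above 1, which is expected for an unbounded distance; the reported TS-SS value is about 0.00046. As far as I understand the paper, TS-SS is also unbounded, and for both measures smaller means more similar, so the values are best compared across document pairs and not read as a percentage.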
