
This project forked from jpmckinney/tf-idf-similarity


Ruby gem to calculate the similarity between texts using tf*idf

License: MIT License

Ruby 100.00%


Ruby Vector Space Model (VSM) with tf*idf weights


Calculates the similarity between texts using a bag-of-words Vector Space Model with Term Frequency-Inverse Document Frequency (tf*idf) weights. If your use case demands performance, use Lucene (see below).

Usage

require 'matrix'
require 'tf-idf-similarity'

Create a set of documents:

document1 = TfIdfSimilarity::Document.new("Lorem ipsum dolor sit amet...")
document2 = TfIdfSimilarity::Document.new("Pellentesque sed ipsum dui...")
document3 = TfIdfSimilarity::Document.new("Nam scelerisque dui sed leo...")
corpus = [document1, document2, document3]

Create a document-term matrix using the Term Frequency-Inverse Document Frequency function:

model = TfIdfSimilarity::TfIdfModel.new(corpus)

Or, create a document-term matrix using the Okapi BM25 ranking function:

model = TfIdfSimilarity::BM25Model.new(corpus)
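
For orientation, the Okapi BM25 weight of a single term can be sketched in plain Ruby as follows. This is the textbook formulation and may differ in detail from what BM25Model computes internally; the method name bm25_weight and the defaults k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```ruby
# Textbook Okapi BM25 weight for one term in one document (illustrative sketch,
# not the gem's implementation).
# count:        occurrences of the term in the document
# doc_size:     number of tokens in the document
# avg_doc_size: average number of tokens per document in the corpus
# doc_count:    number of documents in the corpus
# doc_freq:     number of documents that contain the term
def bm25_weight(count, doc_size, avg_doc_size, doc_count, doc_freq, k1: 1.2, b: 0.75)
  # Inverse document frequency: rarer terms get a larger weight.
  idf = Math.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
  # Saturating term frequency, penalized for documents longer than average.
  tf = (count * (k1 + 1)) / (count + k1 * (1 - b + b * doc_size / avg_doc_size.to_f))
  idf * tf
end
```

Unlike plain tf*idf, the term-frequency part saturates as count grows, and b controls how strongly long documents are penalized.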

Create a similarity matrix:

matrix = model.similarity_matrix

Find the similarity of two documents in the matrix:

matrix[model.document_index(document1), model.document_index(document2)]
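
In a vector space model, the similarity of two documents is typically the cosine of the angle between their term-weight vectors. The following is a minimal sketch of that measure in plain Ruby, assuming each document is represented as a hash from term to weight; cosine_similarity is a hypothetical helper, not part of the gem's API.

```ruby
# Cosine similarity between two sparse vectors stored as {term => weight} hashes.
def cosine_similarity(a, b)
  # Dot product over the terms of a; terms missing from b contribute zero.
  dot = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  # An empty or all-zero vector has no direction; treat it as dissimilar.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

doc1 = { 'lorem' => 1.0, 'ipsum' => 2.0 }
doc2 = { 'ipsum' => 2.0, 'dui' => 1.0 }
puts cosine_similarity(doc1, doc2) # 4 / 5, so approximately 0.8
```

A value near 1 means the documents weight the same terms in similar proportions; a value of 0 means they share no weighted terms.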

Print the tf*idf values for terms in a document:

tfidf_by_term = {}
document1.terms.each do |term|
  tfidf_by_term[term] = model.tfidf(document1, term)
end
puts tfidf_by_term.sort_by { |_, tfidf| -tfidf }

Tokenize a document yourself, for example by excluding stop words:

require 'unicode_utils'
text = "Lorem ipsum dolor sit amet..."
tokens = UnicodeUtils.each_word(text).to_a - ['and', 'the', 'to']
document1 = TfIdfSimilarity::Document.new(text, :tokens => tokens)

Or count the terms yourself and provide both the number of times each term appears and the number of tokens in the document:

require 'unicode_utils'
text = "Lorem ipsum dolor sit amet..."
tokens = UnicodeUtils.each_word(text).to_a - ['and', 'the', 'to']
term_counts = Hash.new(0)
size = 0
tokens.each do |token|
  # Unless the token is numeric.
  unless token[/\A\d+\z/]
    # Remove all punctuation from tokens.
    term_counts[token.gsub(/\p{Punct}/, '')] += 1
    size += 1
  end
end
document1 = TfIdfSimilarity::Document.new(text, :term_counts => term_counts, :size => size)

Read the documentation at RubyDoc.info.

Troubleshooting

NoMethodError: undefined method `[]' for Matrix:Module

The matrix gem conflicts with the Matrix class in Ruby's standard library. Don't install or load the matrix gem.
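
A quick way to check which implementation is loaded is sketched below. It assumes, based only on the error message above, that the conflicting gem defines Matrix as a module rather than a class.

```ruby
require 'matrix'

# With the standard library's implementation, Matrix is a Class and responds to `[]`.
puts Matrix.class

# Building a matrix with `[]` is the call that fails in the error above.
identity = Matrix[[1, 0], [0, 1]]
puts identity.determinant
```

If the first line prints Module instead of Class, something other than the standard library's Matrix has been loaded.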

Speed

Instead of using the Ruby Standard Library's Matrix class, you can use one of the GNU Scientific Library (GSL), NArray or NMatrix (0.0.9 or greater) gems for faster matrix operations. For example:

require 'narray'
model = TfIdfSimilarity::TfIdfModel.new(corpus, :library => :narray)

NArray seems to have the best performance of the three libraries.

The NMatrix gem gives access to Automatically Tuned Linear Algebra Software (ATLAS), which you may know of through Linear Algebra PACKage (LAPACK) or Basic Linear Algebra Subprograms (BLAS). Follow these instructions to install the NMatrix gem.

Extras

You can access more term frequency, document frequency, and normalization formulas with:

require 'tf-idf-similarity/extras/document'
require 'tf-idf-similarity/extras/tf_idf_model'

The default tf*idf formula follows the Lucene Conceptual Scoring Formula.
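
As a rough illustration, Lucene's classic tf*idf similarity is usually described with a term frequency of sqrt(count) and an inverse document frequency of 1 + ln(N / (df + 1)). The sketch below encodes those two pieces in plain Ruby. The method names are hypothetical, and it is an assumption that the gem's default model matches this exactly.

```ruby
# Term frequency as in Lucene's classic similarity: square root of the raw count.
def lucene_tf(count)
  Math.sqrt(count)
end

# Inverse document frequency as in Lucene's classic similarity.
# doc_count: number of documents in the corpus
# doc_freq:  number of documents that contain the term
def lucene_idf(doc_count, doc_freq)
  1 + Math.log(doc_count / (doc_freq + 1.0))
end

# Weight of a term in a document, before any length normalization.
def lucene_tfidf(count, doc_count, doc_freq)
  lucene_tf(count) * lucene_idf(doc_count, doc_freq)
end

puts lucene_tfidf(4, 3, 2) # sqrt(4) * (1 + ln(3 / 3)) = 2.0
```

The square root dampens the effect of repeated terms, and the +1 in the denominator avoids division by zero for terms that appear in no document.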

Why?

At the time of writing, no other Ruby gem implemented the tf*idf formula used by Lucene, Sphinx and Ferret.

Term frequencies

  • The vss gem does not normalize the frequency of a term in a document; this occurs frequently in the academic literature, but only to demonstrate why normalization is important.
  • The tf_idf and similarity gems normalize the frequency of a term in a document to the number of terms in that document, which never occurs in the literature.
  • The tf-idf gem normalizes the frequency of a term in a document to the number of unique terms in that document, which never occurs in the literature.
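
To make the difference concrete, the sketch below computes these term-frequency variants for one term in a tokenized document, alongside the square-root form used by Lucene's classic similarity. The method name term_frequency_variants and the hash keys are hypothetical labels, not any gem's API, and "number of terms" is read here as the total token count.

```ruby
# Compare term-frequency variants for one term in a tokenized document.
def term_frequency_variants(tokens, term)
  count = tokens.count(term)
  {
    raw: count,                                      # no normalization
    by_token_count: count / tokens.size.to_f,        # divided by the number of tokens
    by_unique_terms: count / tokens.uniq.size.to_f,  # divided by the number of unique terms
    sqrt: Math.sqrt(count)                           # square root of the raw count
  }
end

tokens = %w[lorem ipsum ipsum ipsum ipsum dolor]
p term_frequency_variants(tokens, 'ipsum')
```

With raw counts, a document that merely repeats a term dominates the similarity score, which is the problem the normalized variants try to address in different ways.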

Document frequencies

  • The vss gem does not normalize the inverse document frequency.
  • The treat, tf_idf, tf-idf and similarity gems use variants of the typical inverse document frequency formula.

Normalization

  • The treat, tf_idf, tf-idf, rsemantic and vss gems have no normalization component.

Additional adapters

Adapters for the following projects were also considered:

  • Ruby-LAPACK is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme.
  • Linalg and RNum give access to LAPACK from Ruby but are old and unavailable as gems.

Further Reading

Lucene implements many more similarity functions and can even combine similarity measures.

Copyright (c) 2012 James McKinney, released under the MIT license
