
text's Introduction

Text

A collection of text algorithms.

Usage

require 'text'

Levenshtein distance

Text::Levenshtein.distance('test', 'test')
# => 0
Text::Levenshtein.distance('test', 'tent')
# => 1
Text::Levenshtein.distance('test', 'testing')
# => 3
Text::Levenshtein.distance('test', 'testing', 2)
# => 2
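
For readers curious how the numbers above come about, here is a minimal standalone sketch of the classic dynamic-programming edit distance. It is not the gem's implementation: the name levenshtein_sketch is hypothetical, and the optional third argument is assumed to act as a cap on the returned distance, as the last call above suggests.

```ruby
# Standalone sketch of Levenshtein distance (hypothetical helper, not the
# gem's code). The optional max is assumed to cap the returned distance.
def levenshtein_sketch(a, b, max = nil)
  s = a.chars
  t = b.chars
  return [t.length, max].compact.min if s.empty?
  return [s.length, max].compact.min if t.empty?
  # The length difference is a lower bound on the distance.
  return max if max && (s.length - t.length).abs >= max

  previous = (0..t.length).to_a
  s.each_with_index do |sc, i|
    current = [i + 1]
    t.each_with_index do |tc, j|
      cost = sc == tc ? 0 : 1
      current << [previous[j + 1] + 1, # deletion
                  current[j] + 1,      # insertion
                  previous[j] + cost].min # substitution or match
    end
    # Row minima never decrease, so the final distance is at least this.
    return max if max && current.min >= max
    previous = current
  end
  max ? [previous.last, max].min : previous.last
end

levenshtein_sketch('test', 'tent')       # => 1
levenshtein_sketch('test', 'testing')    # => 3
levenshtein_sketch('test', 'testing', 2) # => 2
```

The early exits only matter when a cap is given; without one the function fills the whole table row by row.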

Metaphone

Text::Metaphone.metaphone('BRIAN')
# => 'BRN'

Text::Metaphone.double_metaphone('Coburn')
# => ['KPRN', nil]
Text::Metaphone.double_metaphone('Angier')
# => ['ANJ', 'ANJR']

Soundex

Text::Soundex.soundex('Knuth')
# => 'K530'

Porter stemming

Text::PorterStemming.stem('abatements')  # => 'abat'

White similarity

white = Text::WhiteSimilarity.new
white.similarity('Healed', 'Sealed')   # 0.8
white.similarity('Healed', 'Help')     # 0.25

Note that some intermediate information is cached on the instance to improve performance.
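
The scores above are consistent with a Dice coefficient over adjacent letter pairs. The following is a rough standalone sketch under that assumption; the method names are hypothetical and, unlike the library, it does no caching.

```ruby
# Standalone sketch of a letter-pair similarity (hypothetical helpers,
# not the gem's code; no caching is done here).
def letter_pairs_sketch(str)
  str.upcase.split(/\s+/).flat_map do |word|
    # "HEALED" yields HE, EA, AL, LE, ED
    (0...(word.length - 1)).map { |i| word[i, 2] }
  end
end

def white_similarity_sketch(a, b)
  pairs_a = letter_pairs_sketch(a)
  remaining = letter_pairs_sketch(b)
  total = pairs_a.length + remaining.length
  # Count shared pairs, consuming each match so duplicates count once.
  shared = pairs_a.count do |pair|
    index = remaining.index(pair)
    remaining.delete_at(index) if index
  end
  (2.0 * shared) / total
end

white_similarity_sketch('Healed', 'Sealed') # => 0.8
white_similarity_sketch('Healed', 'Help')   # => 0.25
```

'Healed' and 'Sealed' each produce five pairs and share four of them, giving 2 * 4 / 10 = 0.8.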

Ruby version compatibility

The library has been tested on Ruby 1.8.6 to 1.9.3 and on JRuby.

Thanks

  • Hampton Catlin (hcatlin) for Ruby 1.9 compatibility work

  • Wilker Lúcio for the initial implementation of the White algorithm

License

MIT. See COPYING.txt for details.

text's People

Contributors

bill-kolokithas, hamptonmakes, martoche, pablorusso, qnm, tcrouch, threedaymonk, tim, wilkerlucio


text's Issues

License status

Hello,

can you please add a license file or a statement on which licenses you're referring to?

I'd like to package your gem for Gentoo, but don't know if it is licensed under Ruby 1.8's license (Ruby or GPL-2) or Ruby 1.9's license (Ruby-BSD or BSD-2), which makes it hard to package.

Thank you in advance

Rogue UTF-8 Characters are failing to encode to UTF-8

I'm getting the following error:

"\xC2" from ASCII-8BIT to UTF-8
vendor/bundle/ruby/2.1.0/gems/text-1.2.3/lib/text/levenshtein.rb:35:in `encode'

This occurs when it tries to encode the following bytes, which look like a 'degree' symbol (they are the UTF-8 encoding of U+00BA, º):
\xC2\xBA

I'm using the gem 'mysql', '~> 2.8.1' because we need to be able to execute stored procedures, which I believe mysql2 has major problems with. Could that be the reason for this error?
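
One plausible reading, offered as an assumption rather than something confirmed in this thread, is that the strings arrive tagged as ASCII-8BIT (binary) even though their bytes are already UTF-8. Transcoding such a string fails on any byte above 0x7F, while relabelling it with force_encoding succeeds. A minimal sketch follows; the helper name utf8_from_binary_sketch is hypothetical.

```ruby
# Hypothetical helper: relabel a binary-tagged string as UTF-8 when its
# bytes are already valid UTF-8, instead of transcoding it.
def utf8_from_binary_sketch(str)
  candidate = str.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, 'bytes are not valid UTF-8' unless candidate.valid_encoding?
  candidate
end

raw = "\xC2\xBA".b # the two bytes from the report, tagged ASCII-8BIT

begin
  raw.encode('UTF-8') # transcoding from binary raises on "\xC2"
rescue Encoding::UndefinedConversionError => e
  warn "encode failed: #{e.message}"
end

fixed = utf8_from_binary_sketch(raw)
fixed.encoding # => #<Encoding:UTF-8>
```

Whether relabelling is safe depends on where the data comes from: it only makes sense if the database connection really delivers UTF-8 bytes.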

Metaphone 3 support

The Metaphone 3 algorithm has been released as FOSS as part of Google Refine here:

https://code.google.com/p/google-refine/source/browse/trunk/main/src/com/google/refine/clustering/binning/Metaphone3.java

The disclaimer is as follows:

Copyright 2010, Lawrence Philips
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  • Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following disclaimer
    in the documentation and/or other materials provided with the
    distribution.
  • Neither the name of Google Inc. nor the names of its
    contributors may be used to endorse or promote products derived from
    this software without specific prior written permission.

So, provided that we include the disclaimer, it would be OK to release a Ruby port of the code as part of this gem.

Single-character words in inputs get ignored with White Similarity

The following returns an exact match:

irb(main):164:0> Text::WhiteSimilarity.new.similarity("John F Kennedy", "John A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Kennedy")
1.0

I would not expect it to match exactly.

This returns NaN:

irb(main):165:0> Text::WhiteSimilarity.new.similarity("C J", "C J")
NaN

I am expecting this to return a 1.0.

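The issue is in the private method word_letter_pairs of Text::WhiteSimilarity, which assumes that every word parsed from the input string is at least two characters long. A single-character word contributes no letter pairs, so it is silently dropped, and an input made up only of such words yields zero pairs on both sides.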

An example of a refactor for this method would be to check for single-character length words and handle them differently:

  def word_letter_pairs(str)
    @word_letter_pairs[str] ||=
      str.upcase.split(/\s+/).map{ |word|
        if word.length == 1
          [word]
        else
          (0 ... (word.length - 1)).map { |i| word[i, 2] }
        end
      }.flatten.freeze
  end

I am using version 1.3.1
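
To make the effect of the proposed change concrete, here is a standalone sketch that does not use the gem. The method names and the keep_single_letters flag are hypothetical, and the scoring is assumed to be a Dice coefficient over letter pairs. With the flag off it reproduces the 1.0 and NaN results reported above; with it on, single letters take part in the comparison.

```ruby
# Standalone illustration (hypothetical names, not the gem's code).
def pairs_sketch(str, keep_single_letters)
  str.upcase.split(/\s+/).flat_map do |word|
    if keep_single_letters && word.length == 1
      [word] # treat a one-letter word as its own token
    else
      (0...(word.length - 1)).map { |i| word[i, 2] }
    end
  end
end

def dice_sketch(a, b, keep_single_letters)
  pairs_a = pairs_sketch(a, keep_single_letters)
  remaining = pairs_sketch(b, keep_single_letters)
  total = pairs_a.length + remaining.length
  shared = pairs_a.count do |pair|
    index = remaining.index(pair)
    remaining.delete_at(index) if index
  end
  (2.0 * shared) / total # 0.0 / 0 is NaN when neither side has tokens
end

dice_sketch('C J', 'C J', false).nan? # => true
dice_sketch('C J', 'C J', true)       # => 1.0
```

Two empty strings would still produce NaN even with the flag on, so a real fix may also want an explicit guard for a zero denominator.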

please include COPYING.txt in the gem

Hi,

The MIT license requires the full text of the license to be included with the source, so it should also be included in the gem. Could you please add it?

Cheers,

Cédric

Soundex implementation broken

It appears that the Soundex implementation is broken due to two problems:

  1. Whitespace in the input causes Text::Soundex.soundex to always return nil
  2. The length of the returned soundex codes is not correct.

The first problem can be reproduced as follows:

require 'text'

p Text::Soundex.soundex('San Francisco') # => nil

The second problem can be reproduced as follows:

require 'text'

p Text::Soundex.soundex('SanFranciscoooooooblalalalalal') # => "S516"

If one were to use the SOUNDEX() function provided by MySQL they would get the following instead:

mysql> SELECT SOUNDEX('SanFranciscoooooooblalalalalal');
+-------------------------------------------+
| SOUNDEX('SanFranciscoooooooblalalalalal') |
+-------------------------------------------+
| S5165214                                  |
+-------------------------------------------+
1 row in set (0.00 sec)

Perhaps this is intended behaviour. In that case it should be documented somewhere.
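
For context, the commonly described American Soundex code is a letter followed by three digits, which matches the "S516" above. MySQL documents that its SOUNDEX() returns an arbitrarily long string and follows the original algorithm (vowels discarded before duplicates), so the two outputs are not directly comparable. Below is a standalone sketch of the four-character variant. The name soundex_sketch is hypothetical, and stripping non-letters such as spaces is a choice made here, not the gem's behaviour.

```ruby
# Standalone sketch of four-character American Soundex (hypothetical
# helper, not the gem's code). Non-letters are stripped by choice.
SOUNDEX_CODES_SKETCH = {
  'B' => '1', 'F' => '1', 'P' => '1', 'V' => '1',
  'C' => '2', 'G' => '2', 'J' => '2', 'K' => '2',
  'Q' => '2', 'S' => '2', 'X' => '2', 'Z' => '2',
  'D' => '3', 'T' => '3',
  'L' => '4',
  'M' => '5', 'N' => '5',
  'R' => '6'
}.freeze

def soundex_sketch(word)
  letters = word.upcase.gsub(/[^A-Z]/, '')
  return nil if letters.empty?

  result = letters[0].dup
  previous = SOUNDEX_CODES_SKETCH[letters[0]]
  letters[1..-1].each_char do |ch|
    code = SOUNDEX_CODES_SKETCH[ch]
    if code
      result << code unless code == previous
      previous = code
    elsif ch != 'H' && ch != 'W'
      previous = nil # a vowel separates repeated codes; H and W do not
    end
    break if result.length == 4
  end
  result.ljust(4, '0') # pad short codes with zeros
end

soundex_sketch('Knuth')         # => "K530"
soundex_sketch('San Francisco') # => "S516"
```

Under this variant, truncation to four characters is by design, so a longer result would call for a separate option rather than a change in the default.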
