
aceakash / string-similarity


Finds degree of similarity between two strings, based on Dice's Coefficient, which is mostly better than Levenshtein distance.

License: MIT License

JavaScript 100.00%
javascript dice-coefficient string-comparison string-similarity strings

string-similarity's Introduction

⚰️ ⚰️ DEPRECATED ⚰️ ⚰️

This repository and the associated NPM package are no longer maintained.

string-similarity

Finds degree of similarity between two strings, based on Dice's Coefficient, which is mostly better than Levenshtein distance.

Usage

For Node.js

Install using:

npm install string-similarity --save

In your code:

var stringSimilarity = require("string-similarity");

var similarity = stringSimilarity.compareTwoStrings("healed", "sealed");

var matches = stringSimilarity.findBestMatch("healed", [
  "edward",
  "sealed",
  "theatre",
]);

For browser apps

Include <script src="//unpkg.com/string-similarity/umd/string-similarity.min.js"></script> to get the latest version.

Or <script src="//unpkg.com/string-similarity@4.0.1/umd/string-similarity.min.js"></script> to get a specific version (4.0.1 in this case).

This exposes a global variable called stringSimilarity which you can start using.

<script>
  stringSimilarity.compareTwoStrings('what!', 'who?');
</script>

(The package is exposed as UMD, so you can consume it as such)

API

The package contains two methods:

compareTwoStrings(string1, string2)

Returns a fraction between 0 and 1, which indicates the degree of similarity between the two strings. 0 indicates completely different strings, 1 indicates identical strings. The comparison is case-sensitive.

Arguments
  1. string1 (string): The first string
  2. string2 (string): The second string

Order does not make a difference.

Returns

(number): A fraction from 0 to 1, both inclusive. Higher number indicates more similarity.

Examples
stringSimilarity.compareTwoStrings("healed", "sealed");
// → 0.8

stringSimilarity.compareTwoStrings(
  "Olive-green table for sale, in extremely good condition.",
  "For sale: table in very good  condition, olive green in colour."
);
// → 0.6060606060606061

stringSimilarity.compareTwoStrings(
  "Olive-green table for sale, in extremely good condition.",
  "For sale: green Subaru Impreza, 210,000 miles"
);
// → 0.2558139534883721

stringSimilarity.compareTwoStrings(
  "Olive-green table for sale, in extremely good condition.",
  "Wanted: mountain bike with at least 21 gears."
);
// → 0.1411764705882353
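For readers who want to see the idea behind the score, here is a minimal sketch of a bigram-based Dice coefficient. It is not the package's source: the function names `bigramCounts` and `diceSketch` are made up for illustration, and stripping whitespace is an assumption based on the 3.0.0 release note about disregarding spaces.

```javascript
// Sketch only: a bigram Dice coefficient that counts repeated bigrams.
// Not the package's actual implementation.
function bigramCounts(str) {
  const counts = new Map();
  for (let i = 0; i < str.length - 1; i++) {
    const bigram = str.substring(i, i + 2);
    counts.set(bigram, (counts.get(bigram) || 0) + 1);
  }
  return counts;
}

function diceSketch(first, second) {
  // Assumption: whitespace is ignored, following the 3.0.0 release note.
  const a = first.replace(/\s+/g, "");
  const b = second.replace(/\s+/g, "");
  if (a === b) return 1; // identical strings, including two empty ones
  if (a.length < 2 || b.length < 2) return 0; // no bigrams to compare

  const countsA = bigramCounts(a);
  let shared = 0;
  for (let i = 0; i < b.length - 1; i++) {
    const bigram = b.substring(i, i + 2);
    const remaining = countsA.get(bigram) || 0;
    if (remaining > 0) {
      countsA.set(bigram, remaining - 1); // consume one occurrence
      shared++;
    }
  }
  // 2 * shared bigrams / (bigrams in a + bigrams in b)
  return (2 * shared) / (a.length - 1 + (b.length - 1));
}

console.log(diceSketch("healed", "sealed")); // 0.8
```

"healed" and "sealed" each have five bigrams and share four of them (ea, al, le, ed), so the score is 2 × 4 / 10.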

findBestMatch(mainString, targetStrings)

Compares mainString against each string in targetStrings.

Arguments
  1. mainString (string): The string to match each target string against.
  2. targetStrings (Array): Each string in this array will be matched against the main string.
Returns

(Object): An object with a ratings property, which gives a similarity rating for each target string, a bestMatch property, which specifies which target string was most similar to the main string, and a bestMatchIndex property, which specifies the index of the bestMatch in the targetStrings array.

Examples
stringSimilarity.findBestMatch('Olive-green table for sale, in extremely good condition.', [
  'For sale: green Subaru Impreza, 210,000 miles',
  'For sale: table in very good condition, olive green in colour.',
  'Wanted: mountain bike with at least 21 gears.'
]);
// →
{ ratings:
   [ { target: 'For sale: green Subaru Impreza, 210,000 miles',
       rating: 0.2558139534883721 },
     { target: 'For sale: table in very good condition, olive green in colour.',
       rating: 0.6060606060606061 },
     { target: 'Wanted: mountain bike with at least 21 gears.',
       rating: 0.1411764705882353 } ],
  bestMatch:
   { target: 'For sale: table in very good condition, olive green in colour.',
     rating: 0.6060606060606061 },
  bestMatchIndex: 1
}
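The shape of that result can be reproduced with a few lines. The sketch below is an illustration, not the package's code: `findBestMatchSketch` takes the scoring function as a third argument, and `positionalScore` is a hypothetical stand-in scorer so the example runs without the package. Keeping the first target on a tie is also an assumption.

```javascript
// Sketch only: `compare` is any function returning a score between 0 and 1.
function findBestMatchSketch(mainString, targetStrings, compare) {
  const ratings = targetStrings.map((target) => ({
    target,
    rating: compare(mainString, target),
  }));
  let bestMatchIndex = 0;
  for (let i = 1; i < ratings.length; i++) {
    // Assumption: on a tie, the earlier target is kept.
    if (ratings[i].rating > ratings[bestMatchIndex].rating) bestMatchIndex = i;
  }
  return { ratings, bestMatch: ratings[bestMatchIndex], bestMatchIndex };
}

// Hypothetical stand-in scorer: fraction of positions holding the same character.
function positionalScore(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1;
  let same = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] === b[i]) same++;
  }
  return same / longest;
}

const demoResult = findBestMatchSketch(
  "healed",
  ["edward", "sealed", "theatre"],
  positionalScore
);
console.log(demoResult.bestMatch.target, demoResult.bestMatchIndex); // sealed 1
```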

Release Notes

2.0.0

  • Removed production dependencies
  • Updated to ES6 (this breaks backward-compatibility for pre-ES6 apps)

3.0.0

  • Performance improvement for compareTwoStrings(..): now O(n) instead of O(n^2)
  • The algorithm has been tweaked slightly to disregard spaces and word boundaries. This will change the rating values slightly but not enough to make a significant difference
  • Adding a bestMatchIndex to the results for findBestMatch(..) to point to the best match in the supplied targetStrings array

3.0.1

  • Refactoring: removed unused functions; used substring instead of substr
  • Updated dependencies

4.0.1

  • Distributing as a UMD build to be used in browsers.

4.0.2

  • Update dependencies to latest versions.

4.0.3

  • Make compatible with IE and ES5. Also, update deps. (see PR56)

4.0.4

  • Simplify some conditional statements. Also, update deps. (see PR50)


string-similarity's Issues

Doesn't work for strings with length 1

This looks like expected behavior, but it could be useful to fall back to a simple algorithm if one of the inputs is a length 1 string.

  if (first.length === 1 || second.length === 1) { // if either is a 1-letter string
    let [smaller, larger] = (first.length === 1)
      ? [first, second]
      : [second, first];
    return larger.includes(smaller) ? 2.0 / (larger.length + 1) : 0;
  }

This came up when I tried to use compareTwoStrings for a search ranking.

if (first.length < 2 || second.length < 2) return 0; // if either is a 1-letter string

Handling of tiny strings not functioning as expected

For some reason, when both strings are one character or shorter, compareTwoStrings returns Number.NaN instead of the expected 1 or 0.

Examples:

compareTwoStrings("", ""); // → NaN
compareTwoStrings("a", "a"); // → NaN
compareTwoStrings("a", ""); // → NaN
compareTwoStrings("aa", "aa"); // → 1
compareTwoStrings("aa", ""); // → 0

This is a problem because every ordering comparison with Number.NaN evaluates to false,
e.g. the following always returns false, even though true is expected:
compareTwoStrings("", "") > 0.9

I have temporarily got around this in my own project, by simply using:

if (a.length <= 1 && b.length <= 1) return a === b ? 1 : 0;
return compareTwoStrings(a, b);

Cheers,
Josh
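The workaround above can be packaged as a small wrapper. This is a sketch under assumptions: `compareWithShortStringGuard` is a made-up name, the similarity function is passed in rather than imported, and `alwaysNaN` is a stub that imitates the reported behaviour so the example runs on its own.

```javascript
// Sketch only: `compare` stands in for whichever similarity function is in use.
function compareWithShortStringGuard(a, b, compare) {
  // Same rule as the workaround in the issue: tiny strings match only if equal.
  if (a.length <= 1 && b.length <= 1) return a === b ? 1 : 0;
  const score = compare(a, b);
  // Defensive: never let NaN reach a threshold check.
  return Number.isNaN(score) ? 0 : score;
}

// Stub that mimics the reported behaviour by always returning NaN.
const alwaysNaN = () => NaN;

console.log(NaN > 0.9); // false
console.log(NaN <= 0.9); // false
console.log(compareWithShortStringGuard("", "", alwaysNaN)); // 1
console.log(compareWithShortStringGuard("a", "", alwaysNaN)); // 0
```

Both comparisons against NaN print false, which is why a threshold test silently fails when the score is NaN.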

Issue comparing a word against the same word plus a blank and another letter

Hello, I found an issue when comparing a word against the same word plus a blank and another letter.
E.g.:
"Iphone" compared with "Iphone X" gives me a match of 1, but the texts are not equal. It should be close to 1 but not 1.

I'm using version 2.0.0

findBestMatch('Iphone', ['Iphone 8', 'Iphone 10', 'Iphone X', 'Iphone XS'])

Huge difference compared to the Levenshtein distance method

If we test with https://planetcalc.com for:
source : Olive-green table for sale, in extremely good condition.
target : For sale: table in very good condition, olive green in colour.
the number of moves is 47

source : Olive-green table for sale, in extremely good condition.
target : Wanted: mountain bike with at least 21 gears..
the number of moves is 47

They look the same, lol :D, but that doesn't make sense. Sørensen–Dice is very accurate.

compareTwoStrings returns wrong output

I noticed that if we pass ababacac and abacabac to compareTwoStrings, it returns 1, which is wrong.

var stringSimilarity = require('string-similarity');
console.log(stringSimilarity.compareTwoStrings('ababacac', 'abacabac'));

Expected
Should not be 1

Output
1
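A likely explanation, assuming the score is computed from adjacent character pairs (bigrams) as the README describes: the two strings contain exactly the same bigrams, just in a different order. The helper `sortedBigrams` below is a made-up name used only to show this.

```javascript
// Sketch only: list every adjacent character pair, sorted for easy comparison.
function sortedBigrams(str) {
  const bigrams = [];
  for (let i = 0; i < str.length - 1; i++) {
    bigrams.push(str.substring(i, i + 2));
  }
  return bigrams.sort();
}

console.log(sortedBigrams("ababacac").join(" ")); // ab ab ac ac ba ba ca
console.log(sortedBigrams("abacabac").join(" ")); // ab ab ac ac ba ba ca
```

Any measure that looks only at bigram counts cannot tell these two strings apart, so a score of 1 follows from the method rather than from a coding error.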

This is not the Dice coefficient

Your algorithm is not the Dice coefficient. It counts all bigram duplicates, whereas the Dice coefficient only counts distinct bigrams (as defined in Wikipedia).

As an example, let's compare two versions of the main file of this repo (https://github.com/aceakash/string-similarity/blob/2718c82bbbf5190ebb8e9c54d4cbae6d1259527a/compare-strings.js and the latest, https://github.com/aceakash/string-similarity/blob/eaeec5d74c98a6f6fcb1b06fad44ad7f3d8c2965/src/index.js). They have a Dice coefficient of 0.90, but this lib string-similarity outputs 0.74 when comparing these two files.

Please have a look at the implementations in Talisman, NLTK or in many languages in https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient
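To make the distinction in this issue concrete, here is a sketch of the set-based variant, where each distinct bigram counts once. The names `bigramSet` and `diceOnSets` are invented for illustration and this is not code from any of the libraries mentioned.

```javascript
// Sketch only: Dice coefficient over sets of distinct bigrams.
function bigramSet(str) {
  const set = new Set();
  for (let i = 0; i < str.length - 1; i++) {
    set.add(str.substring(i, i + 2));
  }
  return set;
}

function diceOnSets(a, b) {
  const setA = bigramSet(a);
  const setB = bigramSet(b);
  // Neither string has a bigram: fall back to plain equality.
  if (setA.size + setB.size === 0) return a === b ? 1 : 0;
  let shared = 0;
  for (const bigram of setA) {
    if (setB.has(bigram)) shared++;
  }
  return (2 * shared) / (setA.size + setB.size);
}

console.log(diceOnSets("night", "nacht")); // 0.25
console.log(diceOnSets("aaaa", "aa")); // 1
```

The second call shows where the two variants diverge: with distinct bigrams both strings reduce to the single bigram "aa" and score 1, whereas counting duplicates gives 2 × 1 / (3 + 1) = 0.5.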

Does not seem to care about the order of words

I have the following two strings:
grid styling xs 1/12
grid styling xs 2/12

And my search input is

xs 2

Both strings get the exact same score, which doesn't seem right.
The input xs 2 has a longer "direct match", in word order, with the second string than with the first one.
The first string only gets the same score because it also contains a "2". But that "2" is in a position that shouldn't influence the score as much as the "2" in the "right spot".

Strange/incorrect matching

stringSimilarity.findBestMatch('wall e', ['wall·e', 'wall']);
stringSimilarity.findBestMatch('wall-e', ['wall·e', 'wall']);
stringSimilarity.findBestMatch('wall_e', ['wall-e', 'wall']);

These all return a bestMatchIndex of 1, as though "wall" is the best match. They should all return 0, since those targets differ by only 1 character and are more symbolically similar.

Matching % seems incorrect

I just tried the example:

stringSimilarity.compareTwoStrings("healed", "sealed");
//0.8

=> 80% for a 1 letter change.

stringSimilarity.compareTwoStrings("healed", "ehaled");
//0.6

=> 60% for swapping two letters

OK, I get it, but now I try with another word that is 1 letter shorter (5 characters vs 6)

stringSimilarity.compareTwoStrings("fuira", "fuia");
//0.57

=> 57% for a 1 letter change (just lost 23%)

stringSimilarity.compareTwoStrings("furia", "fuira");
//0.25

=> 25% for a 1 letter change (just lost 35%)

It seems to me that the shorter the string, the more severe the penalty.
Is there a way to make it "average", independent of the length?
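The figures in this issue can be reproduced by counting bigrams by hand, assuming the score is 2 × shared bigrams / total bigrams. The helpers `bigramList` and `sharedBigramStats` are made-up names for this sketch, not part of the package.

```javascript
// Sketch only: count shared bigrams, matching each occurrence at most once.
function bigramList(str) {
  const list = [];
  for (let i = 0; i < str.length - 1; i++) list.push(str.substring(i, i + 2));
  return list;
}

function sharedBigramStats(first, second) {
  const remaining = bigramList(second);
  let shared = 0;
  for (const bigram of bigramList(first)) {
    const index = remaining.indexOf(bigram);
    if (index !== -1) {
      remaining.splice(index, 1); // each occurrence can be matched once
      shared++;
    }
  }
  const total = first.length - 1 + (second.length - 1);
  return { shared, total, score: (2 * shared) / total };
}

const pairs = [
  ["healed", "sealed"], // 4 shared of 10
  ["healed", "ehaled"], // 3 shared of 10
  ["fuira", "fuia"], // 2 shared of 7
  ["furia", "fuira"], // 1 shared of 8
];
for (const [a, b] of pairs) {
  const { shared, total } = sharedBigramStats(a, b);
  console.log(`${a} / ${b}: ${shared} shared of ${total} bigrams`);
}
```

A single changed character can break up to two bigrams, and a five-letter word only has four to begin with, so short strings lose a larger share of their score per edit.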

Is it possible to import it in an ES6 module? (front end)

I'm trying to use the package to implement a fuzzy search in a React component.
I don't want to use UMD.

I'm trying to import the module like so:

import stringSimilarity from 'stringSimilarity'

Node throws: Cannot find module 'stringSimilarity'.

Is it possible with this package?

findBestMatch : accept a list of objects

findBestMatch is really cool, but it could use one extra layer of convenience....

You see...I have an array of objects, and one of the attributes is the string that I am comparing.
I need to find the object from the array with the best matching string.

As it is, I have to extract the strings from all objects, find the best matching string, and then go back and find the object whose string is the best matching string.

It would be easier for me to pass-in the list of objects, along with the name of the attribute, and get back a reference to the best object. I suspect that this pattern of usage might be very common.

Again, this is not a bug, just a suggestion to make the API easier to use.
Thanks for this software -- I am going to use it to solve a tricky problem in converting some very old insurance data.
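Until something like this exists in the API, the bestMatchIndex property documented in the README makes the lookup short. The sketch below is an assumption-laden illustration: `findBestObject` is a made-up helper, the matching function is passed in as an argument, and `exactMatchFinder` is a hypothetical stand-in so the example runs without the package installed.

```javascript
// Sketch only: `findBestMatch` can be any function that returns an object
// with a bestMatchIndex property, as the README documents.
function findBestObject(mainString, objects, key, findBestMatch) {
  const targets = objects.map((item) => String(item[key]));
  const { bestMatchIndex } = findBestMatch(mainString, targets);
  return objects[bestMatchIndex];
}

// Hypothetical stand-in: prefers an exact match, otherwise falls back to index 0.
function exactMatchFinder(mainString, targets) {
  const index = targets.indexOf(mainString);
  return { bestMatchIndex: index === -1 ? 0 : index };
}

const catalog = [
  { name: "foo", otherProperty: 23 },
  { name: "bar", otherProperty: 27 },
  { name: "baz", otherProperty: 99 },
];

const bestObject = findBestObject("bar", catalog, "name", exactMatchFinder);
console.log(bestObject.otherProperty); // 27
```

With the package installed, passing stringSimilarity.findBestMatch as the last argument should work the same way, since only bestMatchIndex is read from the result.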

IE support

IE doesn't understand ES6 (const, let and arrow functions are the ES6 things that I see in the package sources) and to provide IE compatibility (curse him) we need to have our vendor bundles in ES5. And it's not easy to transpile a specific library during bundling...

The common way is to have ./dist/compare-strings.js in the npm package repo and an npm build script for ES6 -> ES5 transpilation process. If it's ok, I can provide a PR covering this situation. What do you think?

Weird behavior

Hi,

I'm getting a weird result when comparing these two strings.
It always returns a rating of 0 despite them having 2 letters in common.

stringSimilarity.compareTwoStrings('NOS', 'NPS')
//0

stringSimilarity.findBestMatch('NOS', ['NPS'])
//{ ratings: [ { target: 'NPS', rating: 0 } ], bestMatch: { target: 'NPS', rating: 0 } }

https://runkit.com/588655d7fb7a220014a01b47/5886577d0629220014e341d7

Thanks.
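A probable reason, assuming the bigram-based scoring the README describes: the two strings share letters but no pair of adjacent letters. The helper name `bigramsOf` is invented for this sketch.

```javascript
// Sketch only: list adjacent character pairs and look for overlap.
function bigramsOf(str) {
  const pairsFound = [];
  for (let i = 0; i < str.length - 1; i++) {
    pairsFound.push(str.substring(i, i + 2));
  }
  return pairsFound;
}

const nosBigrams = bigramsOf("NOS"); // NO, OS
const npsBigrams = bigramsOf("NPS"); // NP, PS
const sharedPairs = nosBigrams.filter((bigram) => npsBigrams.includes(bigram));
console.log(sharedPairs.length); // 0
```

With three-letter strings, changing the middle letter alters both bigrams, so nothing is left to match.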

incompatible with uglifyjs

Hello, I'm getting this when building the production package

....... from UglifyJs
Unexpected token: punc (,) ....

Perhaps there is a way to add build configuration to the package to fix this?
I've gone around by copying the code in my utilities library.
Thanks!

Is it possible to support Chinese?

I have tested string-similarity with Chinese characters, but it does not seem to work; the result is "0"

var similarity = stringSimilarity.compareTwoStrings('布莱顿', '布赖顿');

Please advise. Thanks.

Wrong bestMatch with game titles

Hey there!

I've noticed that string-similarity is having issues with game titles. Here's a little example:

var matches = stringSimilarity.findBestMatch('Portal 2', ['Portal', 'Portal 2']);

This example returns 'Portal' as bestMatch. However if I change the order of the targetStrings array, like this:

var matches = stringSimilarity.findBestMatch('Portal 2', ['Portal 2', 'Portal']);

Then bestMatch is Portal 2. While this sounds like the solution, searching Portal would lead to bestMatch = Portal 2.

Testing around, I also found out that if I call compareTwoStrings('Portal', 'Portal 2'), the return value is 1, even though those 2 strings are obviously not exactly the same.

Is there any way to make the comparison more strict?

LICENSE.md and package.json disagree

The package.json reports the license to be ISC, while the LICENSE file reports it to be MIT.

It's quite important that this is fixed as license reporting tools will rightfully report this as problematic.

isEdgeCaseWithOneOrZeroChars?

Hi :)

Just finished modifying this script so that I can use it in mongoDB (SpiderMonkey with some parts of ES6) without the lodash dependency, and noticed this unused method, isEdgeCaseWithOneOrZeroChars.

It was introduced here, but that's over a year ago, and it hasn't been used since then.

So I'm wondering if it's some unfinished work that should be there, or just a stab at some approach deemed unnecessary and then accidentally left behind?

Cheers! :)

Daniel

Feature suggestion - pass array of objects as targets for findBestMatch function

Use case:
Instead of wanting to compare ["foo","bar","baz"], it can be useful to pass in an array of objects for which you want to compare one property, i.e.

[
    { name: "foo", otherProperty: 23 },
    { name: "bar", otherProperty: 27 },
    { name: "baz", otherProperty: 99 }
]

and instruct the function to compare based on the name property, but return the whole object in the response.

I have already created PR #124 for this. It just needs approval.

compareTwoStrings returning 1 for small and different strings

Hello everyone, I hope you are all doing well. I found a case in which two strings differ (search for <_15>): the first string has <_15>FLL while the second has <_15>ORD, yet the function returns 1 as if they were a perfect match. The version used for this comparison was 4.0.1. Below is an example ready to be run in Node.js (system version 14.4.0):

const similarity = require("string-similarity");

const body1 = '<REQ><_0>MSG</_0><_1/><_2>55</_2><_3>ORG</_3><_4>F1</_4><_5>MIA</_5><_6>07560685</_6><_7>AC30</_7><_8>HFD</_8><_9>F1</_9><_10>T</_10><_11>US</_11><_12>USD</_12><_13>ZE</_13><_14>ODI</_14><_15>FLL</_15><_16>ORD</_16><_17>UNT</_17><_18>5</_18><_19>1</_19><_20>UNZ</_20><_21>1</_21><_22>000000</_22><_23/></REQ>';

const body2 = '<REQ><_0>MSG</_0><_1/><_2>55</_2><_3>ORG</_3><_4>F1</_4><_5>MIA</_5><_6>07560685</_6><_7>AC30</_7><_8>HFD</_8><_9>F1</_9><_10>T</_10><_11>US</_11><_12>USD</_12><_13>ZE</_13><_14>ODI</_14><_15>ORD</_15><_16>FLL</_16><_17>UNT</_17><_18>5</_18><_19>1</_19><_20>UNZ</_20><_21>1</_21><_22>000000</_22><_23/></REQ>';

console.log(similarity.compareTwoStrings(body1, body2));

Thanks!

Memoize results

It would be good to memoize the result, so that running the function again with the same arguments returns the result right away instead of repeating the calculation.
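Memoization can also be layered on from the outside. The wrapper below is a sketch, not a feature of the package: `memoizeCompare` is a made-up name, the cache is an unbounded Map, and sharing one entry for both argument orders relies on the README's statement that order does not make a difference. `slowCompare` is a stub used to count calls.

```javascript
// Sketch only: cache results of any two-string comparison function.
function memoizeCompare(compare) {
  const cache = new Map();
  return function (first, second) {
    // Assumption: the score is symmetric, so both orders share one entry.
    const ordered = first <= second ? [first, second] : [second, first];
    const key = JSON.stringify(ordered);
    if (!cache.has(key)) cache.set(key, compare(first, second));
    return cache.get(key);
  };
}

// Stub comparison that counts how often it is actually invoked.
let calls = 0;
const slowCompare = (a, b) => {
  calls++;
  return a === b ? 1 : 0;
};

const fastCompare = memoizeCompare(slowCompare);
fastCompare("healed", "sealed");
fastCompare("sealed", "healed"); // served from the cache
console.log(calls); // 1
```

The cache grows with every distinct pair, so a long-running process would want a size limit or an eviction policy.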

[HELP!!] Latest version does not detect spaces, and 2.0.0 version is not case sensitive.

I need to compare two strings completely, that means it should also detect spaces and caps.

I have read the release notes, and I don't understand why you decided to disregard spaces from version 3.0.0. After running npm install --save string-similarity@2.0.0, it detects spaces, but it is not case sensitive.

Latest version:
stringSimilarity.compareTwoStrings("Te st", "Test"); //1.00

2.0 version
stringSimilarity.compareTwoStrings("TEST", "test"); //1.00

Please help, I need to get this done as soon as possible.
Thank you!

Weird result based on string case

I've tried to compare the strings as follows:

stringSimilarity.findBestMatch('bnp', ['BNP Paribas', 'absolutelyunrelated'])

Both ratings are 0

stringSimilarity.findBestMatch('BNP', ['BNP Paribas', 'absolutelyunrelated'])

BNP Paribas rating is 0.36363636363636365

I wouldn't expect 0 with bnp; I'd expect something around 0.2+, I guess. The difference based on the string case alone is huge.

Adding an optional param for replacing string and case sensitivity.

I am using compareTwoStrings in my own projects and have found a lot of use in having an optional parameter where you can give a regex/string to replace by. In my recent implementation I needed to exclude special characters. Having the parameter allowed me to just add that bit of regex in my call like so: compareTwoStrings('hello', 'hey', /[^\w\s]/gi). I also added an option to remove case sensitivity. I would love to contribute these features.
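The same effect is available today by normalizing both inputs before they reach the comparison. This is only a sketch of that idea: `compareNormalized` and its `strip` and `ignoreCase` options are hypothetical names, not part of the package, and `exactOnly` is a stub scorer so the example is self-contained.

```javascript
// Sketch only: normalize both strings, then delegate to any comparison function.
function compareNormalized(first, second, compare, options = {}) {
  const { strip = null, ignoreCase = false } = options;
  const normalize = (value) => {
    const stripped = strip ? value.replace(strip, "") : value;
    return ignoreCase ? stripped.toLowerCase() : stripped;
  };
  return compare(normalize(first), normalize(second));
}

// Stub scorer: 1 for equal strings, 0 otherwise.
const exactOnly = (a, b) => (a === b ? 1 : 0);

console.log(compareNormalized("Hello, World!", "hello world", exactOnly)); // 0
console.log(
  compareNormalized("Hello, World!", "hello world", exactOnly, {
    strip: /[^\w\s]/g,
    ignoreCase: true,
  })
); // 1
```

Keeping normalization in a wrapper leaves the comparison function untouched, which may be preferable for a package that is no longer maintained.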

Add browser support

It'd be great if this library were also usable in the browser. It currently uses require, which makes it impossible to use on the client side. 😞

findBestMatch

Hello, how can I pass a key:value array as targetStrings to findBestMatch(mainString, targetStrings)?
The match target is the key in the key:value array.

Thanks
