GithubHelp home page GithubHelp logo

hbollon / go-edlib Goto Github PK

View Code? Open in Web Editor NEW
466.0 13.0 23.0 78 KB

📚 String comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

License: MIT License

Go 99.19% Shell 0.81%
go golang edit-distance algorithms levenshtein levenshtein-distance lcs lcs-distance damerau-levenshtein string-matching

go-edlib's Introduction

Go-edlib : Edit distance and string comparison library

Coverage Status Go Report Card License: MIT PkgGoDev

Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...


Table of Contents


Requirements

  • Go (v1.13+)

Introduction

Golang open-source library which includes most (and soon all) edit-distance and string comparision algorithms with some extra!
Designed to be fully compatible with Unicode characters!
This library is 100% test covered 😁

Features

Benchmarks

You can check an interactive Google chart with few benchmark cases for all similarity algorithms in this library through StringsSimilarity function here

However, if you want or need more details, you can also viewing benchmark raw output here, which also includes memory allocations and test cases output (similarity result and errors).

If you are on Linux and want to run them on your setup, you can run ./tests/benchmark.sh script.

Installation

Open bash into your project folder and run:

go get github.com/hbollon/go-edlib

And import it into your project:

import (
	"github.com/hbollon/go-edlib"
)

Run tests

If you are on Linux and want to run all unit tests just run ./tests/tests.sh script.

For Windows users you can run:

go test ./... # Add desired parameters to this command if you want

Documentation

You can find all the documentation here : Documentation

Examples

Calculate string similarity index between two string

You can use StringSimilarity(str1, str2, algorithm) function. algorithm parameter must one of the following constants:

// Algorithm identifiers
const (
	Levenshtein Algorithm = iota
	DamerauLevenshtein
	OSADamerauLevenshtein
	Lcs
	Hamming
	Jaro
	JaroWinkler
	Cosine
)

Example with levenshtein:

res, err := edlib.StringsSimilarity("string1", "string2", edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Similarity: %f", res)
}

Execute fuzzy search based on string similarity algorithm

1. Most matching unique result without threshold

You can use FuzzySearch(str, strList, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearch("testnig", strList, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result: %s", res)
}
Result: testing 

2. Most matching unique result with threshold

You can use FuzzySearchThreshold(str, strList, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchThreshold("testnig", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig': %s", res)
}

res, err = edlib.FuzzySearchThreshold("hello", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'hello': %s", res)
}
Result for 'testnig': testing
Result for 'hello':

3. Most matching result set without threshold

You can use FuzzySearchSet(str, strList, resultQuantity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSet("testnig", strList, 3, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Results: %s", strings.Join(res, ", "))
}
Results: testing, test, tester 

4. Most matching result set with threshold

You can use FuzzySearchSetThreshold(str, strList, resultQuantity, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.5, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.5' threshold: %s", strings.Join(res, " "))
}

res, err = edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.7' threshold: %s", strings.Join(res, " "))
}
Result for 'testnig' with '0.5' threshold: testing test tester
Result for 'testnig' with '0.7' threshold: testing

Get raw edit distance (Levenshtein, LCS, Damerau–Levenshtein, Hamming)

You can use one of the following function to get an edit distance between two strings :

Example with Levenshtein distance:

res := edlib.LevenshteinDistance("kitten", "sitting")
fmt.Printf("Result: %d", res)
Result: 3

LCS, LCS Backtrack and LCS Diff

1. Compute LCS(Longuest Common Subsequence) between two strings

You can use LCS(str1, str2) function.

lcs := edlib.LCS("ABCD", "ACBAD")
fmt.Printf("Length of their LCS: %d", lcs)
Length of their LCS: 3

2. Backtrack their LCS

You can use LCSBacktrack(str1, str2) function.

res, err := edlib.LCSBacktrack("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", res)
}
LCS: ABD

3. Backtrack all their LCS

You can use LCSBacktrackAll(str1, str2) function.

res, err := edlib.LCSBacktrackAll("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", strings.Join(res, ", "))
}
LCS: ABD, ACD

4. Get LCS Diff between two strings

You can use LCSDiff(str1, str2) function.

res, err := edlib.LCSDiff("computer", "houseboat")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: \n%s\n%s", res[0], res[1])
}
LCS Diff: 
 h c o m p u s e b o a t e r
 + -   - -   + + + + +   - -

Author

👤 Hugo Bollon

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check issues page.

Show your support

Give a ⭐️ if this project helped you!

📝 License

Copyright © 2020 Hugo Bollon.
This project is MIT License licensed.

go-edlib's People

Contributors

akosiaris avatar deepsourcebot avatar github-actions[bot] avatar hbollon avatar shogg avatar shriprajwalk avatar tang-hi avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

go-edlib's Issues

Missing methods for string shingling

Missing methods for string shingling(https://en.wikipedia.org/wiki/W-shingling) aka n-grams for strings. In general, Cosine similarity and Jaccard index, among other methods, are computed on n-grams of the string, instead of the string itself(eg: https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/). Right now, we just split the string on spaces and compute the index based on the words in the string, which is fine but is probably not technically correct.
I propose we introduce a method to shingle the given strings so that we can later use the shingled string in finding cosine similarity, etc. This has been done before here:
Java: https://github.com/tdebatty/java-string-similarity
Julia: https://github.com/matthieugomez/StringDistances.jl

I think we should introduce a function that does something like:

// Function signature
shingle(str string, n int) map[string]int
// Example usage:
shingled_s1 := shingle("ABCD", 2) // returns map[AB:1 BC:1 CD:1]
shingled_s2 := shingle("ABCE", 2) // returns map[AB:1 BC:1 CE:1]

This will enable us to use these maps to later compute stuff, eg:

res := JaccardSimilarityWithShingles(shingled_s1, shingled_s2) // should return 0.5.

Reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m,n))

Discussed in #15

Originally posted by tang-hi July 2, 2022
When we calculate the distance of OSADamerau–Levenshtein, we actually calculate the matrix with dimension(m*n)
Suppose we have a matrix looks like

1656691342850
the value in (3,0) is always 3
the value in (3,1) depends on (3,0) (2,0) (2,1)
the value in (3,2) depends on (3,1) (2,1) (2,2) and (1,1)
we can found that the value in (row : i) only depends values in (row:i-1,i-2)
eg. the value in (row : 3) only depends values in (row:2,1)

So any row number >= 3, we could write it in (row %3, j),
eg.
write the value of (3,0) in (0,0) .... (3,j) in (0,j)
write the value of (4,0) in (1,0) .... (4,j) in (1,j)
write the value of (i,0) in (i%3,0) .... (i,j) in (i%3,j)

In this way, we only need the matrix with dimension(3, min(m,n)) to calculate OSADamerau–Levenshtein distance
Here is the code to illustrate my idea

StringsSimilarity is broken for unicode

The test is failing

func TestStringsSimilarity(t *testing.T) {
	str1 := "abcde"
	str2 := "бвгдж"
	str3 := "fghjk"
	sim12, _ := edlib.StringsSimilarity(str1, str2, edlib.Levenshtein)
	sim13, _ := edlib.StringsSimilarity(str1, str3, edlib.Levenshtein)
	if sim12 != sim13 {
		t.Errorf("different similarities, %v != %v", sim12, sim13)
	}
}

output: different similarities, 0.5 != 0

Similarity between "abcde" and "бвгдж" should be the same as between "abcde" and "fghjk".
Looks like bug in the matchingIndex() because it compares length of strings but ...Distance() functions use runes.

LCSBacktrackAll not work

My code:

ress,err:=edlib.LCSBacktrackAll(" 是ab是cde22f123g", "222222是ab是cd123") if err!=nil{ fmt.Println(err) }else { fmt.Printf("%s", strings.Join(ress, "\r\n")) }

Shoud be print
`
是ab是cd

123
`

But is print
是ab是cd123

benchmarks

Hi, it would be great to see benchmarks between the algorithms. Creating a graph and putting them in the README, would be really nice =)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.