markvanderloo / stringdist
String distance functions for R
Q-grams are stored internally in a binary tree. We could make the storage structure user-switchable, depending on the use case.
There's a factor of 2 overhead when calling
stringdistmatrix(a=x, b=x)
The default could be changed to
stringdistmatrix(a,b=NULL)
so that
stringdistmatrix(x)
only computes the lower triangle, similar to native R's dist function.
Using R 3.1.2 under Cygwin. Seems that it installs fine, but is unable to read some kind of number of available threads or something while testing the installation. I guess it should have a default value if it's not found?
> install.packages("stringdist")
Installing package into ‘/usr/lib/R/site-library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/stringdist_0.9.0.tar.gz'
Content type 'application/x-gzip' length 44307 bytes (43 Kb)
opened URL
==================================================
downloaded 43 Kb
* installing *source* package ‘stringdist’ ...
** package ‘stringdist’ successfully unpacked and MD5 sums checked
** libs
gcc -I/usr/lib/R/include -DNDEBUG -fopenmp -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1 -c dl.c -o dl.o
gcc -I/usr/lib/R/include -DNDEBUG -fopenmp -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1 -c hamming.c -o hamming.o
gcc -I/usr/lib/R/include -DNDEBUG -fopenmp -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1 -c jaro.c -o jaro.o
gcc -I/usr/lib/R/include -DNDEBUG -fopenmp -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1 -c lcs.c -o lcs.o
gcc -I/usr/lib/R/include -DNDEBUG -fopenmp -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1 -c lv.c -o lv.o
gcc -I/usr/lib/R/include -DNDEBUG -fopenmp -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1 -c osa.c -o osa.o
gcc -I/usr/lib/R/include -DNDEBUG -fopenmp -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1 -c qgram.c -o qgram.o
gcc -I/usr/lib/R/include -DNDEBUG -fopenmp -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1 -c soundex.c -o soundex.o
gcc -I/usr/lib/R/include -DNDEBUG -fopenmp -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1 -c utf8ToInt.c -o utf8ToInt.o
gcc -I/usr/lib/R/include -DNDEBUG -fopenmp -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1 -c utils.c -o utils.o
gcc -shared -L/usr/lib/R/lib -o stringdist.dll dl.o hamming.o jaro.o lcs.o lv.o osa.o qgram.o soundex.o utf8ToInt.o utils.o -fopenmp -L/usr/lib/R/lib -lR -lintl -lpcre -llzma -lbz2 -lz -lrt -ldl -lm -liconv -licuuc -licui18n
installing to /usr/lib/R/site-library/stringdist/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error : .onLoad failed in loadNamespace() for 'stringdist', details:
call: if (nthread >= 4) nthread <- nthread - 1
error: missing value where TRUE/FALSE needed
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/usr/lib/R/site-library/stringdist’
The downloaded source packages are in
‘/tmp/RtmpACPgeU/downloaded_packages’
Warning message:
In install.packages("stringdist") :
installation of package ‘stringdist’ had non-zero exit status
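A minimal sketch of the kind of fallback suggested above, assuming the thread count is obtained from parallel::detectCores(), which is documented to return NA when the number of cores cannot be determined. The helper name default_nthread is hypothetical and is not the package's actual .onLoad code; the "leave one core free from four cores up" rule simply mirrors the failing call in the log.

```r
# Hypothetical helper, not part of stringdist.
default_nthread <- function(detected = parallel::detectCores()) {
  # Fall back to a single thread when the core count is unknown.
  if (length(detected) != 1L || is.na(detected) || detected < 1L) {
    return(1L)
  }
  detected <- as.integer(detected)
  # Same rule as the failing call: leave one core free on larger machines.
  if (detected >= 4L) detected - 1L else detected
}

default_nthread(NA_integer_)  # 1
default_nthread(8L)           # 7
```

With a guard like this, an undetectable core count would degrade to single-threaded operation instead of aborting the package load.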
Hi
Version 0.8.x would spawn multiple R processes and use all 24 cores on my Mac under R 3.2.1. Version 0.9.x instead spawns multiple threads, not processes. Looking at the Activity Monitor, it's clear it's using just a single core (and the runtime confirms that).
I've reverted to 0.8.x in the meantime.
Only add penalty when jaro-dist exceeds a threshold. see here.
Suggested via email by Riki Saito
Hi,
Please see below a problem I have with asymmetry in the distance for the 'jw' method. This is causing a problem in my data set. In fact, for the specific strings in my data set (which I can't share due to PHI), the symmetry check fails even more badly, with a much larger delta when the arguments are reversed.
Thank you
round(stringdist('HENCERANGE3058RUNAWAY2HELLCITYAA12345', 'RANCHRANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw'), 8) == round(stringdist('RANCHRANGE3058RUNAWAY2HELLCITYAA12345', 'HENCERANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw'), 8)
[1] FALSE
'HENCERANGE3058RUNAWAY2HELLCITYAA12345' == 'HENCERANGE3058RUNAWAY2HELLCITYAA12345'
[1] TRUE
'RANCHRANGE3058RUNAWAY2HELLCITYAA12345' == 'RANCHRANGE3058RUNAWAY2HELLCITYAA12345'
[1] TRUE
stringdist('HENCERANGE3058RUNAWAY2HELLCITYAA12345', 'RANCHRANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw')
[1] 0.2550837
stringdist('RANCHRANGE3058RUNAWAY2HELLCITYAA12345', 'HENCERANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw')
[1] 0.2265122
Hello! I ran across your blog and found out about the stringdist package. I think that it might be beneficial to implement some functions based on phonetic algorithms, such as Soundex. I apologize if my suggestion doesn't match the goals of your package or if the package already covers this functionality. I believe that implementing my suggestion might also be helpful for automatic data correction processes via the editrules and deducorrect packages, as misspelling is a common data quality problem.
After some careful thought, I now believe that the convention for q-gram distances when q=0 should be
d(s,t,q=0) = 0 for all s, t in \Sigma^*
At the moment it is 0 when s=t="" and infinity if |s|+|t|>0. This will probably not break anyone's code but it does deviate from the convention I denoted in the R Journal.
stringdist("600 EXAMPLE AVE NJ 8629", "2100 EXAMPLE AVE NJ 8619", method="jaccard")
[1] 0.0625
stringdist("600 EXAMPLE AVE NJ 8629", "600 EXAMPLE AVE NJ 8629", method="jaccard")
[1] 0
amatch("600 EXAMPLE AVE NJ 8629", c("2100 EXAMPLE AVE NJ 8619", "600 EXAMPLE AVE NJ 8629"), method="jaccard")
[1] 1
In jaro.C the value of 3.0 is hard coded:
} else {
d = 1.0 - (1.0/3.0)*(w[0]*m/((double) x) + w[1]*m/((double) y) + w[2]*(m-t)/m);
}
This should be the sum of the weights. Otherwise, the score is no longer necessarily between 0 and 1.
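The proposed normalisation can be sketched as follows, written in R rather than C for brevity. The function name weighted_jaro_dist and the convention of returning 1 when there are no matches are assumptions made for illustration; m, t, x, y and w play the same roles as in the C line above.

```r
# Hypothetical helper: m = matches, t = transpositions,
# x and y = string lengths, w = the three weights.
weighted_jaro_dist <- function(m, t, x, y, w = c(1, 1, 1)) {
  if (m == 0) return(1)  # assumed convention: no matches means maximal distance
  score <- w[1] * m / x + w[2] * m / y + w[3] * (m - t) / m
  1 - score / sum(w)     # normalise by the weight sum instead of a fixed 3
}

weighted_jaro_dist(3, 0, 3, 4)              # equal weights: same as dividing by 3
weighted_jaro_dist(3, 0, 3, 3, c(2, 2, 2))  # 0, stays inside [0, 1]
```

Because each of the three terms lies in [0, 1], dividing by the weight sum makes the score a weighted average, so the distance stays in [0, 1] for any positive weights.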
Adding support for string kernel distances would be nice.
I was wondering if you've thought of including q-gram filtering for edit distance in the stringdist package. Often users only care about comparing strings that pass a certain similarity threshold, and q-gram filtering lets them do this significantly faster than calculating the Levenshtein distance on all the strings.
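A minimal base-R sketch of such a filter, relying on the known bound that the q-gram distance between two strings is at most 2*q times their Levenshtein distance. All function names here are hypothetical, and utils::adist stands in for the exact edit distance; this is not how stringdist implements anything.

```r
# Hypothetical helpers for illustration; not stringdist functions.
qgrams_of <- function(s, q) {
  n <- nchar(s)
  if (n < q) character(0) else substring(s, 1:(n - q + 1), q:n)
}

qgram_dist <- function(s, t, q) {
  gs <- qgrams_of(s, q)
  gt <- qgrams_of(t, q)
  keys <- unique(c(gs, gt))
  if (length(keys) == 0L) return(0)
  # Sum of absolute differences between the two q-gram count vectors.
  sum(abs(tabulate(match(gs, keys), nbins = length(keys)) -
          tabulate(match(gt, keys), nbins = length(keys))))
}

lev_within <- function(s, t, k, q = 2L) {
  # Cheap filter: a q-gram distance above 2*q*k rules out Levenshtein <= k.
  if (qgram_dist(s, t, q) > 2 * q * k) return(FALSE)
  # Pairs that survive the filter still need the exact check.
  utils::adist(s, t)[1, 1] <= k
}

lev_within("hello", "world", k = 1)  # FALSE, rejected by the filter alone
lev_within("hello", "hallo", k = 1)  # TRUE
```

The filter never rejects a pair that is actually within distance k; it only saves the exact computation for pairs that are clearly too far apart.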
For example
stringdist("hello","world",method="cosine", q=1:2)
would yield the cosine distance over the concatenation of 1-gram and 2-gram profiles.
This would also enhance compatibility, e.g. with the textcat package.
Would you be open to including a local alignment metric (https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)? I could prepare a patch.
Why not, a bit of user-friendliness :-).
Sometimes text data is long and is better thought of as being broken up into "words" rather than "characters."
Your qgram tokenizer is extremely fast and I've found it to be incredibly useful, but I still find myself looking for a fast word-ngram tokenizer.
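For illustration, a word n-gram tokenizer can be sketched in base R as below. The name word_ngrams is hypothetical, and this version makes no attempt at the speed asked for above; it only shows the intended output.

```r
# Hypothetical helper: split on whitespace and emit overlapping word n-grams.
word_ngrams <- function(text, n = 2L) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1L)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1L)], collapse = " "),
         character(1))
}

word_ngrams("the cat sat on the mat", n = 2L)
```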
When stringdistmatrix is called using a large single input vector, the result is a distance matrix of class dist with fewer elements than expected.
many_words <- sapply(1:100000, function(x) paste(sample(letters, 10, replace=T),
collapse=""))
# Needs a lot of memory
d <- stringdist::stringdistmatrix(many_words, method = "jw")
size <- attr(d, "Size")
stopifnot(inherits(d, "dist"))
stopifnot(size == length(many_words))
stopifnot(length(d) == size*(size - 1)/2)
# Error: length(d) == size * (size - 1)/2 is not TRUE
# > length(d)
# [1] 704982704
# > size*(size - 1)/2
# [1] 4999950000
For smaller character vectors I did not encounter any problems:
few_words <- sapply(1:1000, function(x) paste(sample(letters, 10, replace=T),
collapse=""))
d <- stringdist::stringdistmatrix(few_words, method = "jw")
stopifnot(inherits(d, "dist"))
size <- attr(d, "Size")
stopifnot(size == length(few_words))
stopifnot(length(d) == size*(size - 1)/2)
# No errors raised
I can provide more info when necessary.
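The reported length is consistent with the expected element count wrapping around at 2^32, which would point to a 32-bit integer holding the length somewhere. That cause is an assumption, not something confirmed in the report; the arithmetic itself can be checked in R:

```r
n <- 100000
expected <- n * (n - 1) / 2   # computed in double precision
wrapped  <- expected %% 2^32  # remainder after a 32-bit wraparound

format(expected, scientific = FALSE)  # "4999950000"
format(wrapped, scientific = FALSE)   # "704982704"
```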
Since the move to openMP I can't get stringdist to run on more than one core. When nthreads is set to 2, 3, or 4 (my max) I can tell from activity monitor that only one core is active. I've reinstalled gcc and checked that openMP is working elsewhere; I can't figure out what's wrong! Wondering if this is a known issue or just me.
> getOption("sd_num_thread")
[1] 3
> parallel::detectCores()
[1] 4
> sessionInfo(package = NULL)
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringdist_0.9.0
loaded via a namespace (and not attached):
[1] parallel_3.1.3 tools_3.1.3
We should probably make nchar listen to useBytes to compute the right maximum distances.
Something like
// store options, create workspace...
u = stringdist_open(dist, q,...)
stringdist(u,s,t)
stringdist_close(u)
So one could hash a (series of) objects or e.g. words in a sentence and compute the distance between them.
Technically this is easy to do; all C-level routines operate on unsigned int arrays.
Add an option to add meaningless pre- and postfixes when computing the q-gram distance.
Options
There are some messages by the CLANG undefined behaviour sanitizer reported on cran. These should be ironed out for the next release.
This could be an option; but the current compiler used by the Core Team does not support it under Windows.
At the moment all vectors are indexed with int in the underlying C code. I should update this to size_t.
Most string distances (except perhaps the ones based on string kernels) have a natural and easily computed minimum and maximum. I could add a stringsim function that returns a similarity measure between 0 and 1.
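As a sketch of the idea for one distance only: the unit-cost Levenshtein distance is bounded above by the length of the longer string, so dividing by that bound yields a similarity in [0, 1]. The name lev_sim is hypothetical and base R's utils::adist is used for the distance; this is not the proposed stringsim implementation.

```r
# Hypothetical helper: similarity = 1 - distance / maximum possible distance.
lev_sim <- function(a, b) {
  max_d <- max(nchar(a), nchar(b))  # largest possible Levenshtein distance
  if (max_d == 0) return(1)         # two empty strings are identical
  1 - utils::adist(a, b)[1, 1] / max_d
}

lev_sim("abc", "abc")  # 1
lev_sim("abc", "xyz")  # 0
```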
Q-gram binary tree is allocated per-node on the fly. Change this to pooling and doubling re-allocation
In the Python world, there is a very well-known library called fuzzywuzzy. A nice description of its features is available there.
I think stringdist already implements the basis for similar features, which would be very useful in two cases:
From my understanding, it just needs a small layer on top of what already exists in stringdist to provide, for instance, a distance between sentences that is not sensitive to word order. Do you think it would be easy to implement?
Kind regards,
Michael
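One way such a layer could look, as a rough sketch: sort the words of each sentence before computing an ordinary edit distance. The name token_sort_dist is hypothetical, the normalisation (lower-casing, whitespace splitting) is an assumption, and utils::adist stands in for whichever stringdist method would be used.

```r
# Hypothetical helper: edit distance after sorting the words of each sentence.
token_sort_dist <- function(a, b) {
  normalise <- function(s) {
    words <- strsplit(tolower(trimws(s)), "[[:space:]]+")[[1]]
    paste(sort(words), collapse = " ")
  }
  utils::adist(normalise(a), normalise(b))[1, 1]
}

token_sort_dist("new york mets", "mets new york")  # 0
```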
Add episode distance
When given weights, stringdist will sometimes give an incorrect result. It can even be inconsistent with itself when weights are adjusted by a common factor.
Example:
stringdist("ABC", "BC", method = "lv") # Returns 1, as it should
stringdist("ABC", "BC", method = "lv", weight = c(i=.1, d=.1, s=.1)) # Returns .2, should be .1
stringdist("ABC", "BC", method = "lv", weight = c(i=.1, d=.1, s=1)) # Returns 1, should be .1
This differs from what adist returns for the same inputs, too:
adist("ABC", "AB") # Returns 1
adist("ABC", "AB", costs = c(insertions=.1, deletions=.1, substitutions=.1)) # Returns .1
adist("ABC", "AB", costs = c(insertions=.1, deletions=.1, substitutions=1)) # Returns .1
Version 0.9.2, installed from CRAN.
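For reference, a plain dynamic-programming version of the weighted Levenshtein distance gives the values the report expects. This is an illustrative base-R sketch with a hypothetical name (weighted_lev), not the package's C implementation; the weights are looked up by the names i, d and s used in the example above.

```r
# Hypothetical reference implementation, for checking expected values only.
weighted_lev <- function(a, b, w = c(i = 1, d = 1, s = 1)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  D <- matrix(0, length(x) + 1, length(y) + 1)
  D[, 1] <- seq(0, length(x)) * w[["d"]]  # delete every character of a
  D[1, ] <- seq(0, length(y)) * w[["i"]]  # insert every character of b
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      sub_cost <- if (x[i] == y[j]) 0 else w[["s"]]
      D[i + 1, j + 1] <- min(
        D[i, j + 1] + w[["d"]],  # deletion
        D[i + 1, j] + w[["i"]],  # insertion
        D[i, j] + sub_cost       # match or substitution
      )
    }
  }
  D[length(x) + 1, length(y) + 1]
}

weighted_lev("ABC", "BC")                                # 1
weighted_lev("ABC", "BC", c(i = 0.1, d = 0.1, s = 0.1))  # 0.1
weighted_lev("ABC", "BC", c(i = 0.1, d = 0.1, s = 1))    # 0.1
```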
At the moment stringsim
assumes that all weights are equal to 1 for edit-based distances. Although this does yield a valid maximum (weights are maximally 1), using lower weights will lower the maximum possible similarity. It is probably more intuitive to scale the similarities taking weights into account.
I think I found an interesting bug. It looks like when a single-element vector is passed as the first argument to stringdistmatrix, the resulting matrix is transposed. It seems to happen regardless of the method used.
> stringdist::stringdistmatrix(c("foo", "bar"), c("foo", "a", "b"), method = "hamming")
[,1] [,2] [,3]
[1,] 0 Inf Inf
[2,] 3 Inf Inf
> stringdist::stringdistmatrix(c("foo"), c("foo", "a", "b"), method = "hamming")
[,1]
[1,] 0
[2,] Inf
[3,] Inf
The install.packages('stringdist') command in R is failing right now. Here is the error message:
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/stringdist_0.7.2.zip'
I don't know if this is a CRAN issue or if it's something that can be fixed here.
Regards,
Alun
There is one use case where it would be possible to get much better performance very easily.
When you want to deduplicate data, you pass the same vector of words twice to the matrix version of stringdist. In the resulting square matrix, every distance appears twice (distances are symmetric) and the diagonal is 0 by definition.
> stringdistmatrix(c('abc','abef','cde', 'akj'),c('abc','abef','cde', 'akj'))
[,1] [,2] [,3] [,4]
[1,] 0 2 3 2
[2,] 2 0 3 3
[3,] 3 3 0 3
[4,] 2 3 3 0
Would it be possible to have this use case managed by the function and computation not done when not required?
Kind regards,
Michael
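A sketch of what computing only the needed half could look like: evaluate each unordered pair once and store the result as a dist object, as base R's dist does. The helper lower_tri_dist and its distfun argument are hypothetical; utils::adist is used here as a stand-in distance, so this is not the package's implementation.

```r
# Hypothetical helper: compute each unordered pair once, return a "dist" object.
lower_tri_dist <- function(x, distfun) {
  n <- length(x)
  stopifnot(n >= 2L)
  pairs <- utils::combn(n, 2)  # columns (i, j) with i < j, in dist order
  d <- vapply(seq_len(ncol(pairs)),
              function(k) distfun(x[pairs[2, k]], x[pairs[1, k]]),
              numeric(1))
  structure(d, Size = n, Diag = FALSE, Upper = FALSE, class = "dist")
}

words <- c("abc", "abef", "cde", "akj")
d <- lower_tri_dist(words, function(a, b) utils::adist(a, b)[1, 1])
length(d)     # 6 distances instead of 16 matrix cells
as.matrix(d)  # full symmetric matrix, rebuilt only when needed
```

For n strings this evaluates n*(n-1)/2 distances rather than n^2, which is where the factor of roughly two comes from.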
The title pretty much says it all - the default method "optimal string alignment" returns NA for every comparison after encountering one NA. Does not depend on the order of arguments, but that's as far as I got with debugging. Package version 0.8.1, R version 3.1.2
a <- c("a", NA, "b", "c")
b <- c("aa", "bb", "cc", "dd")
stringdist(a,b, method="lv")
[1] 1 NA 2 2
stringdist(a,b, method="osa")
[1] 1 NA NA NA
Sometimes we only care about distances below a maximum value.
Postgres has this:
levenshtein_less_equal(text source, text target, int max_d) returns int
https://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html
I checked the documentation of this package, and it indicates that maxDist is deprecated.
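A rough sketch of how a capped computation can stop early: the minimum of each row of the dynamic-programming table never decreases, so once it exceeds max_d the final distance must exceed it too. The name lev_capped is hypothetical; returning max_d + 1 for "too far" follows the Postgres description that some value greater than max_d is returned in that case.

```r
# Hypothetical helper: exact distance if it is <= max_d, otherwise max_d + 1.
lev_capped <- function(a, b, max_d) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # The length difference is a lower bound on the distance.
  if (abs(length(x) - length(y)) > max_d) return(max_d + 1)
  prev <- seq(0, length(y))
  for (i in seq_along(x)) {
    cur <- numeric(length(y) + 1)
    cur[1] <- i
    for (j in seq_along(y)) {
      cur[j + 1] <- min(prev[j + 1] + 1,            # deletion
                        cur[j] + 1,                 # insertion
                        prev[j] + (x[i] != y[j]))   # match or substitution
    }
    # Early exit: no cell in this row can lead to a distance <= max_d.
    if (min(cur) > max_d) return(max_d + 1)
    prev <- cur
  }
  d <- prev[length(y) + 1]
  if (d > max_d) max_d + 1 else d
}

lev_capped("kitten", "sitting", 3)  # 3
lev_capped("hello", "world", 2)     # 3, meaning "more than 2"
```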
The description of string distances should be moved to a general section.
In certain cases, distance measures between two strings are undefined. The package currently returns Inf in this case (to allow for numerical comparison); this output should be made switchable.
--> Need to phase out maxDist first.
I understand that the q-gram distance is the sum of absolute differences between q-gram vectors of both strings. But I see some weird behavior when one of the strings is shorter than the chosen q.
So for these two strings, while the qgrams function is correct:
> qgrams("a", "the cat sat on the mat", q = 2)
th he t sa on n ma e c ca at s t o m
V1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V2 2 2 2 1 1 1 1 2 1 1 3 1 1 1 1
The stringdist function returns:
> stringdist("a", "the cat sat on the mat", q = 2, method = "qgram")
[1] Inf
Instead of returning:
> sum(qgrams("a", "the cat sat on the mat", q = 2)[2,])
[1] 21
Posted at SO by Giora Simchoni.
What would be the best way to match records with multiple variables like first name / last name ? I
More generally, something very useful would be a merge function that can accept a list of variables with weights (and where a subset can be required to match exactly)
The useNames argument of stringdistmatrix can only take TRUE (use strings as names for output) or FALSE. It makes more sense to have the choice between:
none: no output names (similar to FALSE)
strings: use strings (similar to TRUE)
names: use names(a) and names(b)
The latter will be especially useful when comparing long strings, like documents, using qgrams.
When my colleague and I pass vectors with more than 7k elements to stringdistmatrix using the cosine method, R crashes completely with a segfault saying that some memory was not mapped. This is on my Mac. Here's the traceback and error:
*** caught segfault ***
address 0xbc9900000, cause 'memory not mapped'
Traceback:
1: .Call("R_lower_tri", a, methnr, as.double(weight), as.double(p), as.integer(q), as.integer(useBytes), as.integer(nthread))
2: lower_tri(a, method = method, useBytes = useBytes, weight = weight, useNames = useNames, nthread = nthread)
3: stringdistmatrix(path.exitURL$exitPagePath_TermPretty, method = "cosine")
4: eval(expr, envir, enclos)
5: eval(ei, envir)
6: withVisible(eval(ei, envir))
7: source("code/path_analysis/cluster_path.R")
Mac system info:
Model Name: MacBook Pro
Model Identifier: MacBookPro11,5
Processor Name: Intel Core i7
Processor Speed: 2.5 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 16 GB
Boot ROM Version: MBP114.0172.B09
SMC Version (system): 2.30f2
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin15.5.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
It also fails on a fresh ubuntu and R installation. System info:
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty
*-memory
description: System memory
physical id: 0
size: 29GiB
*-cpu
product: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
vendor: Intel Corp.
physical id: 1
bus info: cpu@0
width: 64 bits
When running this example from another issue on the command line in Ubuntu, the only message I get is Killed:
many_words <- sapply(1:30000, function(x) paste(sample(letters, 10, replace=T),
collapse=""))
stringdist::stringdistmatrix(many_words, method = 'cosine')
We could add functionality to optionally retrieve the DP matrix for edit-like distances. This would conflict with my desire to lower memory usage by not storing the full DP-matrix for computation.
With thanks to Max Fritsche for reporting this.
To reproduce:
stringdist("aap","apen",method="jw",p=0.1)
stringdist("aap","apen",method="jw",p=0)
x <- c("aap","apen")
stringdistmatrix(x,method="jw")
stringdistmatrix(x,method="jw",p=0.1)
stringdistmatrix(x,x,method="jw")
stringdistmatrix(x,x,method="jw",p=0.1)