markvanderloo / stringdist

String distance functions for R

Shell 0.36% R 56.20% C 43.09% Makefile 0.35%

stringdist's Issues

storage option for q-grams

Q-grams are stored internally in a binary tree. We could make this user-switchable between

unsorted list
binary tree
hashed storage

depending on the use case.

stringdistmatrix outputs 'dist' object when called with single character argument

There's a factor of 2 overhead when calling

stringdistmatrix(a=x, b=x)

The default could be changed to

stringdistmatrix(a,b=NULL)

so that

stringdistmatrix(x)

only computes the lower triangle, similar to native R's dist function
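
A minimal sketch of the idea in base R, assuming `utils::adist()` (Levenshtein) as a stand-in for the package's own distance routines. `lower_tri_dist()` is a hypothetical helper, not part of stringdist; it only illustrates that n(n-1)/2 evaluations are enough and that the result fits a `dist` object.

```r
# Hypothetical helper: evaluate each unordered pair once and pack the
# results in the column-wise lower-triangle order used by 'dist' objects.
lower_tri_dist <- function(x, distfun = function(s, t) drop(utils::adist(s, t))) {
  n <- length(x)
  stopifnot(n >= 2)
  pairs <- utils::combn(n, 2)  # columns are (1,2), (1,3), ..., (2,3), ...
  d <- mapply(function(j, i) distfun(x[i], x[j]), pairs[1, ], pairs[2, ])
  structure(d, Size = n, Diag = FALSE, Upper = FALSE, class = "dist")
}

x <- c("abc", "abef", "cde", "akj")
d <- lower_tri_dist(x)
length(d)     # 6 values instead of the 16 cells of the full matrix
as.matrix(d)  # expands to the symmetric matrix when needed
```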

Installing package fails under Cygwin

Using R 3.1.2 under Cygwin. Seems that it installs fine, but is unable to read some kind of number of available threads or something while testing the installation. I guess it should have a default value if it's not found?

> install.packages("stringdist")
Installing package into ‘/usr/lib/R/site-library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/stringdist_0.9.0.tar.gz'
Content type 'application/x-gzip' length 44307 bytes (43 Kb)
opened URL
==================================================
downloaded 43 Kb

* installing *source* package ‘stringdist’ ...
** package ‘stringdist’ successfully unpacked and MD5 sums checked
** libs
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c dl.c -o dl.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c hamming.c -o hamming.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c jaro.c -o jaro.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c lcs.c -o lcs.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c lv.c -o lv.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c osa.c -o osa.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c qgram.c -o qgram.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c soundex.c -o soundex.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c utf8ToInt.c -o utf8ToInt.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c utils.c -o utils.o
gcc -shared -L/usr/lib/R/lib -o stringdist.dll dl.o hamming.o jaro.o lcs.o lv.o osa.o qgram.o soundex.o utf8ToInt.o utils.o -fopenmp -L/usr/lib/R/lib -lR -lintl -lpcre -llzma -lbz2 -lz -lrt -ldl -lm -liconv -licuuc -licui18n
installing to /usr/lib/R/site-library/stringdist/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error : .onLoad failed in loadNamespace() for 'stringdist', details:
  call: if (nthread >= 4) nthread <- nthread - 1
  error: missing value where TRUE/FALSE needed
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/usr/lib/R/site-library/stringdist’

The downloaded source packages are in
        ‘/tmp/RtmpACPgeU/downloaded_packages’
Warning message:
In install.packages("stringdist") :
  installation of package ‘stringdist’ had non-zero exit status
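
The error message suggests that the detected thread count is `NA` on this platform. A guarded default could look like the sketch below. It assumes the count comes from `parallel::detectCores()`, which is documented to return `NA` when the number of cores cannot be determined; `default_nthread()` is a hypothetical name and not the package's actual `.onLoad` code.

```r
# Hypothetical guard: fall back to a single thread when the core count
# is unknown, otherwise keep the "leave one core free" rule from the log.
default_nthread <- function(ncores = parallel::detectCores()) {
  if (is.na(ncores) || ncores < 1) return(1L)
  ncores <- as.integer(ncores)
  if (ncores >= 4L) ncores - 1L else ncores
}

default_nthread(NA)  # 1
default_nthread(8L)  # 7
```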

Multiprocessor support on Mavericks

Version 0.8.x would spawn multiple R processes and use all my 24 cores on my Mac under R3.2.1. Version 0.9.x instead spawns multiple threads, not processes. Looking at the Activity Monitor it's clear it's just using a single core (and the runtime confirms that).

I've reverted to 0.8.x in the meantime.

boost threshold for JW-dist

Only add the penalty when the Jaro distance exceeds a threshold. See here.

Suggested via email by Riki Saito

stringdist with 'jw' method not producing symmetric distance

Hi,

Please see below for a problem I have with asymmetry in the distance for the 'jw' method. This is causing a problem in my data set. In fact, for the specific strings I tested in my data set (which I can't share due to PHI), the symmetry test fails even more badly, with a much larger delta in the reverse direction.

Thank you

round(stringdist('HENCERANGE3058RUNAWAY2HELLCITYAA12345', 'RANCHRANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw'), 8) == round(stringdist('RANCHRANGE3058RUNAWAY2HELLCITYAA12345', 'HENCERANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw'), 8)
[1] FALSE
'HENCERANGE3058RUNAWAY2HELLCITYAA12345' == 'HENCERANGE3058RUNAWAY2HELLCITYAA12345'
[1] TRUE
'RANCHRANGE3058RUNAWAY2HELLCITYAA12345' == 'RANCHRANGE3058RUNAWAY2HELLCITYAA12345'
[1] TRUE

stringdist('HENCERANGE3058RUNAWAY2HELLCITYAA12345', 'RANCHRANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw')
[1] 0.2550837
stringdist('RANCHRANGE3058RUNAWAY2HELLCITYAA12345', 'HENCERANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw')
[1] 0.2265122

Feature suggestion - phonetic algorithms implementation

Hello! I ran across your blog and found out about the stringdist package. I think it might be beneficial to implement some functions based on phonetic algorithms, such as Soundex. I apologize if my suggestion doesn't match the goals of your package or if the package already covers this functionality. I believe that implementing my suggestion might also be helpful for automatic data correction processes via the editrules and deducorrect packages, as misspelling is a common data quality problem.

Convention for q=0

After some careful thought, I now believe that the convention for q-gram distances when q=0 should be

d(s,t,q=0) = 0 for all s, t in \Sigma^*

At the moment it is 0 when s=t="" and infinity if |s|+|t|>0. This will probably not break anyone's code, but it does deviate from the convention I described in the R Journal.

bug in amatch/jaccard

stringdist("600 EXAMPLE AVE NJ 8629", "2100 EXAMPLE AVE NJ 8619", method="jaccard")
[1] 0.0625

stringdist("600 EXAMPLE AVE NJ 8629", "600 EXAMPLE AVE NJ 8629", method="jaccard")
[1] 0

amatch("600 EXAMPLE AVE NJ 8629", c("2100 EXAMPLE AVE NJ 8619", "600 EXAMPLE AVE NJ 8629"), method="jaccard")
[1] 1
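
For reference, the expected behaviour can be reproduced in base R. The sketch below assumes the Jaccard distance is taken over sets of single characters (q = 1), which matches the 0.0625 reported above; `jaccard_q1()` is a hypothetical helper and not the package's implementation.

```r
# Hypothetical helper: Jaccard distance between the sets of characters
# occurring in two strings.
jaccard_q1 <- function(s, t) {
  a <- unique(strsplit(s, "")[[1]])
  b <- unique(strsplit(t, "")[[1]])
  1 - length(intersect(a, b)) / length(union(a, b))
}

x <- "600 EXAMPLE AVE NJ 8629"
candidates <- c("2100 EXAMPLE AVE NJ 8619", "600 EXAMPLE AVE NJ 8629")
d <- vapply(candidates, function(y) jaccard_q1(x, y), numeric(1))
unname(d)             # 0.0625 0.0000
unname(which.min(d))  # 2: the exact match should win
```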

Normalise weights in jaro distance

In jaro.c the value of 3.0 is hard coded:

  } else {
    d = 1.0 - (1.0/3.0)*(w[0]*m/((double) x) + w[1]*m/((double) y) + w[2]*(m-t)/m);
  }

This should be the sum of the weights. Otherwise, the score is no longer necessarily between 0 and 1.
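
The effect is easy to see by evaluating the same formula in R. This is a sketch under assumptions: `jaro_from_counts()` is a hypothetical helper that takes the match count `m`, transposition count `t` and string lengths as given, and the `m == 0` branch returning 1 is assumed, since the other branch of the C code is not shown above.

```r
# Hypothetical helper: Jaro distance from precomputed counts, dividing by
# the sum of the weights instead of a hard-coded 3.
jaro_from_counts <- function(m, t, nx, ny, w = c(1, 1, 1)) {
  if (m == 0) return(1)  # assumed convention for "no matches"
  1 - (w[1] * m / nx + w[2] * m / ny + w[3] * (m - t) / m) / sum(w)
}

# Identical 4-character strings: m = 4, t = 0.
jaro_from_counts(4, 0, 4, 4)                # 0
jaro_from_counts(4, 0, 4, 4, w = c(2, 2, 2)) # still 0 after normalisation
1 - (1 / 3) * (2 + 2 + 2)                   # -1 with the hard-coded 1/3
```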

String kernel distances

Adding support for string kernel distances would be nice.

Q-Gram Filtering

I was wondering if you've thought of including q-gram filtering for edit distance in the stringdist package. Oftentimes users are only concerned with comparing strings that pass a certain similarity threshold, and q-gram filtering allows them to do this significantly quicker than just calculating the Levenshtein distance on all the strings.

Allow for q-gram based distances with multiple q's

For example

stringdist("hello","world",method="cosine", q=1:2)

would yield the cosine distance over the concatenation of 1-gram and 2-gram profiles.
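
A rough base-R sketch of what this could compute. `qgram_counts()` and `cosine_multi_q()` are hypothetical helpers; the real implementation would live in the C code, and the sketch assumes both strings are at least `min(q)` characters long.

```r
# Hypothetical helper: named vector of q-gram counts, with q in the name
# so that profiles for different q can be concatenated safely.
qgram_counts <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(integer(0))
  tab <- table(substring(s, seq_len(n - q + 1), q:n))
  setNames(as.integer(tab), paste0(q, ":", names(tab)))
}

# Cosine distance over the concatenated profiles for all requested q.
cosine_multi_q <- function(s, t, q = 1:2) {
  ps <- unlist(lapply(q, function(k) qgram_counts(s, k)))
  pt <- unlist(lapply(q, function(k) qgram_counts(t, k)))
  keys <- union(names(ps), names(pt))
  fill <- function(p) { v <- p[keys]; v[is.na(v)] <- 0; unname(v) }
  vs <- fill(ps)
  vt <- fill(pt)
  1 - sum(vs * vt) / sqrt(sum(vs^2) * sum(vt^2))
}

cosine_multi_q("hello", "world", q = 1:2)  # 1 - 3/sqrt(99), about 0.698
```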

This would also enhance compatibility, e.g. with the textcat package.

Local alignment

Would you be open to including a local alignment metric (https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)? I could prepare a patch.

Allow weights to be defined with names as well as order

Why not, a bit of user-friendliness :-).

word-qgrams as well as character-qgrams

Sometimes text data is long and is better thought of as being broken up into "words" rather than "characters."

Your qgram tokenizer is extremely fast and I've found it to be incredibly useful, but I still find myself looking for a fast word-ngram tokenizer.
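
For illustration, a word n-gram tokenizer can be sketched in a few lines of base R. `word_ngrams()` is a hypothetical helper that splits on whitespace only; it makes no claim about speed and is not part of the package.

```r
# Hypothetical helper: split on whitespace and paste together every run
# of n consecutive words.
word_ngrams <- function(text, n = 2) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

word_ngrams("the cat sat on the mat", n = 2)
# "the cat" "cat sat" "sat on" "on the" "the mat"
```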

stringdistmatrix gives unexpected output for large input vectors

When stringdistmatrix is called using a large single input vector, the result is a distance matrix of class dist with fewer elements than expected.

many_words <- sapply(1:100000, function(x) paste(sample(letters, 10, replace=T),
                                                 collapse=""))
# Needs a lot of memory
d <- stringdist::stringdistmatrix(many_words, method = "jw")
size <- attr(d, "Size")
stopifnot(inherits(d, "dist"))
stopifnot(size == length(many_words))
stopifnot(length(d) == size*(size - 1)/2)
# Error: length(d) == size * (size - 1)/2 is not TRUE
# > length(d)
# [1] 704982704
# > size*(size - 1)/2
# [1] 4999950000

For smaller character vectors I did not encounter any problems:

few_words <- sapply(1:1000, function(x) paste(sample(letters, 10, replace=T),
                                              collapse=""))
d <- stringdist::stringdistmatrix(few_words, method = "jw")
stopifnot(inherits(d, "dist"))
size <- attr(d, "Size")
stopifnot(size == length(few_words))
stopifnot(length(d) == size*(size - 1)/2)
# No errors raised

I can provide more info when necessary.
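
One observation that may help with diagnosis: the two lengths reported above differ by exactly 2^32. That is consistent with the element count wrapping around in a 32-bit integer somewhere, although this is an inference from the numbers and not confirmed against the source.

```r
size     <- 1e5
expected <- size * (size - 1) / 2  # 4999950000, computed in double precision
observed <- 704982704              # length(d) from the report above

expected - observed == 2^32        # TRUE
expected > .Machine$integer.max    # TRUE: does not fit a 32-bit int
```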

multi threading not working

Since the move to openMP I can't get stringdist to run on more than one core. When nthreads is set to 2, 3, or 4 (my max) I can tell from activity monitor that only one core is active. I've reinstalled gcc and checked that openMP is working elsewhere; I can't figure out what's wrong! Wondering if this is a known issue or just me.

> getOption("sd_num_thread")
[1] 3
> parallel::detectCores()
[1] 4
> sessionInfo(package = NULL)
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringdist_0.9.0

loaded via a namespace (and not attached):
[1] parallel_3.1.3 tools_3.1.3

stringsim/nchar/useBytes

We should probably make nchar listen to useBytes to compute the right maximum distances.

Better C interface for basic string distance functions.

Something like

// store options, create workspace...
u = stringdist_open(dist, q,...)
stringdist(u,s,t)
stringdist_close(u)

Add weights for Jaro-Winkler distance

finally write that vignette

Support for distances between integer vectors

So one could hash a (series of) objects or e.g. words in a sentence and compute the distance between them.

Technically this is easy to do; all C-level routines operate on unsigned int arrays.

Option to add q-1 pre- and/or postfixes

Add an option to pad strings with meaningless pre- and postfixes when computing q-gram distances.
Options:

a padding function
put all under the hood and use non-utf8 codes
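
The first option could be as small as the sketch below. `pad_for_qgrams()` is a hypothetical name, and the visible `#` pad character is only for illustration; a real implementation would want a symbol that cannot occur in the input.

```r
# Hypothetical helper: add q-1 padding characters on both sides so that
# the first and last characters take part in q q-grams each.
pad_for_qgrams <- function(s, q, pad = "#") {
  stopifnot(q >= 1, nchar(pad) == 1)
  fix <- paste(rep(pad, q - 1), collapse = "")
  paste0(fix, s, fix)
}

pad_for_qgrams("abc", q = 3)  # "##abc##"
pad_for_qgrams("abc", q = 1)  # "abc" (nothing to add)
```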

solve UBSAN messages

There are some messages from the Clang undefined behaviour sanitizer reported on CRAN. These should be ironed out for the next release.

parallelisation with OpenMP

This could be an option; but the current compiler used by the Core Team does not support it under Windows.

support for long vectors

At the moment all vectors are indexed with int in the underlying C-code. I should update this to size_t

stringsim function

Most string distances (except perhaps the ones based on string kernels) have a natural and easily computed minimum and maximum. I could add a stringsim function that returns a similarity measure between 0 and 1.
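
For Levenshtein distance, for example, the maximum is the length of the longer string, so the scaling could be sketched as follows. `lv_sim()` is a hypothetical helper built on base R's `utils::adist()`; treating two empty strings as fully similar is an assumption.

```r
# Hypothetical helper: similarity = 1 - distance / maximum distance.
lv_sim <- function(a, b) {
  d <- mapply(function(s, t) drop(utils::adist(s, t)), a, b, USE.NAMES = FALSE)
  m <- pmax(nchar(a), nchar(b))
  ifelse(m == 0, 1, 1 - d / m)  # assumed: two empty strings are identical
}

lv_sim(c("abc", "abc", ""), c("abc", "xyz", ""))  # 1 0 1
```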

Memory management for q-gram storage

Q-gram binary tree is allocated per-node on the fly. Change this to pooling and doubling re-allocation

Better distance for sentences using tokenization + better perf in deduplication of > 10K sentences dataset

In the Python world, there is a very well-known library called fuzzywuzzy. A nice description of its features is available there.

I think stringdist already implements the basis to have similar features which would be very useful in two cases:

record linkage
deduplication

From my understanding, it just needs a small layer on top of what already exists in stringdist to provide a distance between sentences that is, for instance, not sensitive to word order. Do you think it would be easy to implement?

Kind regards,
Michael
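
To make the request concrete, here is a sketch of a word-order-insensitive distance in base R: normalise each sentence by sorting its tokens, then apply an ordinary edit distance. `token_sort()` and `token_sort_dist()` are hypothetical helpers, and `utils::adist()` stands in for whichever stringdist method one would actually use.

```r
# Hypothetical helpers: lower-case, split on whitespace, sort the tokens,
# and compare the re-joined strings with Levenshtein distance.
token_sort <- function(s) {
  paste(sort(strsplit(tolower(trimws(s)), "\\s+")[[1]]), collapse = " ")
}
token_sort_dist <- function(a, b) {
  drop(utils::adist(token_sort(a), token_sort(b)))
}

token_sort_dist("New York Mets", "mets new york")  # 0
token_sort_dist("New York Mets", "New York Jets")  # 1
```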

Episode distance

Add episode distance

stringdist gives incorrect result when given weights

When given weights, stringdist will sometimes give an incorrect result. It can even be inconsistent with itself when weights are adjusted by a common factor.

Example:

stringdist("ABC", "BC", method = "lv") # Returns 1, as it should
stringdist("ABC", "BC", method = "lv", weight = c(i=.1, d=.1, s=.1)) # Returns .2, should be .1
stringdist("ABC", "BC", method = "lv", weight = c(i=.1, d=.1, s=1)) # Returns 1, should be .1

This differs from what adist returns for the same inputs, too:

adist("ABC", "AB") # Returns 1
adist("ABC", "AB", costs = c(insertions=.1, deletions=.1, substitutions=.1)) # Returns .1
adist("ABC", "AB", costs = c(insertions=.1, deletions=.1, substitutions=1)) # Returns .1

Version 0.9.2, installed from CRAN.
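
As a cross-check, a textbook weighted Levenshtein recursion gives the expected values. This is a reference sketch only: `weighted_lv()` is a hypothetical helper with weights named `d`, `i` and `s`, and it is not the package's C implementation.

```r
# Hypothetical reference implementation: full dynamic-programming table,
# rows follow the first string, columns the second.
weighted_lv <- function(a, b, w = c(d = 1, i = 1, s = 1)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  D <- matrix(0, length(x) + 1, length(y) + 1)
  D[, 1] <- (0:length(x)) * w[["d"]]  # delete every character of a prefix
  D[1, ] <- (0:length(y)) * w[["i"]]  # insert every character of a prefix
  for (i in seq_along(x)) for (j in seq_along(y)) {
    D[i + 1, j + 1] <- min(D[i, j + 1] + w[["d"]],
                           D[i + 1, j] + w[["i"]],
                           D[i, j] + w[["s"]] * (x[i] != y[j]))
  }
  D[length(x) + 1, length(y) + 1]
}

weighted_lv("ABC", "BC")                              # 1
weighted_lv("ABC", "BC", c(d = 0.1, i = 0.1, s = 0.1)) # 0.1
weighted_lv("ABC", "BC", c(d = 0.1, i = 0.1, s = 1))   # 0.1
```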

C-code enhancements

Reorganize code and check if better abstractions are possible
Smarter indexing
Decrease memory footprint (as in lv-implementation of RecordLinkage)

stringsim: scale taking edit weights into account

At the moment stringsim assumes that all weights are equal to 1 for edit-based distances. Although this does yield a valid maximum (weights are maximally 1), using lower weights will lower the maximum possible similarity. It is probably more intuitive to scale the similarities taking weights into account.

Transpose on single argument

Think I found an interesting bug. It looks like when a vector of length one is passed as the first argument to stringdistmatrix, the resulting matrix is transposed. This seems to happen independently of the method being used as well.

> stringdist::stringdistmatrix(c("foo", "bar"), c("foo", "a", "b"), method = "hamming")
     [,1] [,2] [,3]
[1,]    0  Inf  Inf
[2,]    3  Inf  Inf

> stringdist::stringdistmatrix(c("foo"), c("foo", "a", "b"), method = "hamming")
     [,1]
[1,]    0
[2,]  Inf
[3,]  Inf
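
One common way such a transpose arises in R is that an apply-style call simplifies its result to a plain vector when one input has length one. Whether that is the cause here is a guess. The sketch below shows the defensive pattern of setting the dimensions explicitly; `dist_grid()` is a hypothetical helper that uses `utils::adist()`, so the values differ from the Hamming output above.

```r
# Hypothetical helper: always return a length(a) x length(b) matrix,
# regardless of how the intermediate results are simplified.
dist_grid <- function(a, b, f = function(x, y) drop(utils::adist(x, y))) {
  vals <- unlist(lapply(b, function(y) {
    vapply(a, function(x) f(x, y), numeric(1), USE.NAMES = FALSE)
  }))
  matrix(vals, nrow = length(a), ncol = length(b))  # column-major fill
}

dim(dist_grid("foo", c("foo", "a", "b")))            # 1 3
dim(dist_grid(c("foo", "bar"), c("foo", "a", "b")))  # 2 3
```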

Add useBytes option

install.packages() fail in R

The install.packages('stringdist') command in R is failing right now. Here is the error message:
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/stringdist_0.7.2.zip'

I don't know if this is a CRAN issue or if it's something that can be fixed here.

Regards,
Alun

Better perf for deduplication

There is one use case where it would be possible to get much better perf very easily.

When you want to deduplicate data, you pass the same vector of words twice to the matrix version of stringdist. In the resulting square matrix, every distance appears twice (distances are symmetric) and the diagonal is 0 by definition.

> stringdistmatrix(c('abc','abef','cde', 'akj'),c('abc','abef','cde', 'akj'))
     [,1] [,2] [,3] [,4]
[1,]    0    2    3    2
[2,]    2    0    3    3
[3,]    3    3    0    3
[4,]    2    3    3    0

Would it be possible to have this use case managed by the function and computation not done when not required?

Kind regards,
Michael

Method "osa" fails after encountering NA string

The title pretty much says it all - the default method "optimal string alignment" returns NA for every comparison after encountering one NA. Does not depend on the order of arguments, but that's as far as I got with debugging. Package version 0.8.1, R version 3.1.2

a <- c("a", NA, "b", "c")
b <- c("aa", "bb", "cc", "dd")
stringdist(a,b, method="lv")
[1] 1 NA 2 2
stringdist(a,b, method="osa")
[1] 1 NA NA NA

MaxDist?

Sometimes I only care about distances below a maximum value.

Postgres has this:

levenshtein_less_equal(text source, text target, int max_d) returns int

https://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html

I checked the documentation of this package, which indicates that maxDist is deprecated.
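
To illustrate why a bound helps, here is a sketch of Levenshtein distance with early termination in base R. `lev_bounded()` is a hypothetical helper; returning `Inf` above the bound is an assumption made here (the Postgres function has its own convention for distances above `max_d`).

```r
# Hypothetical helper: two-row Levenshtein that gives up as soon as the
# distance is known to exceed max_d.
lev_bounded <- function(a, b, max_d) {
  x <- utf8ToInt(a)
  y <- utf8ToInt(b)
  # The length difference is a lower bound on the distance.
  if (abs(length(x) - length(y)) > max_d) return(Inf)
  prev <- 0:length(y)
  for (i in seq_along(x)) {
    cur <- c(i, numeric(length(y)))
    for (j in seq_along(y)) {
      cur[j + 1] <- min(prev[j + 1] + 1, cur[j] + 1, prev[j] + (x[i] != y[j]))
    }
    # Row minima never decrease, so the final value cannot be smaller.
    if (min(cur) > max_d) return(Inf)
    prev <- cur
  }
  d <- prev[length(y) + 1]
  if (d > max_d) Inf else d
}

lev_bounded("kitten", "sitting", max_d = 3)  # 3
lev_bounded("kitten", "sitting", max_d = 2)  # Inf
```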

Update documentation

The description of string distances should be moved to a general section.

Make output switchable for undefined distances

In certain cases, distance measures between two strings are undefined. The package currently returns Inf in this case (to allow for numerical comparison); this output should be made switchable.

--> Need to phase out maxDist first.

Decrease memory usage for levenshtein-like distances

Give amatch the option to also return the actual distances

stringdist/qgram behaviour when q>nchar(x)

I understand that the q-gram distance is the sum of absolute differences between q-gram vectors of both strings. But I see some weird behavior when one of the strings is shorter than the chosen q.

So for these two strings, while the qgrams function is correct:

> qgrams("a", "the cat sat on the mat", q = 2)
   th he t  sa on n  ma e   c ca at  s  t  o  m
V1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
V2  2  2  2  1  1  1  1  2  1  1  3  1  1  1  1
The stringdist function returns:

> stringdist("a", "the cat sat on the mat", q = 2, method = "qgram")
[1] Inf

Instead of returning:

> sum(qgrams("a", "the cat sat on the mat", q = 2)[2,])
[1] 21

Posted at SO by Giora Simchoni.
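
The behaviour asked for can be written down in base R as a sketch: a string shorter than q simply has an empty q-gram profile. `qgram_profile()` and `qgram_dist()` are hypothetical helpers, not the package's code, and returning 0 for q = 0 follows the convention proposed in the q=0 issue above.

```r
# Hypothetical helper: named vector of q-gram counts; empty if the string
# is shorter than q.
qgram_profile <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(numeric(0))
  tab <- table(substring(s, seq_len(n - q + 1), q:n))
  setNames(as.numeric(tab), names(tab))
}

# Sum of absolute differences between the two profiles.
qgram_dist <- function(s, t, q = 1) {
  if (q == 0) return(0)  # proposed convention for q = 0
  ps <- qgram_profile(s, q)
  pt <- qgram_profile(t, q)
  keys <- union(names(ps), names(pt))
  count <- function(p) { v <- unname(p[keys]); v[is.na(v)] <- 0; v }
  sum(abs(count(ps) - count(pt)))
}

qgram_dist("a", "the cat sat on the mat", q = 2)  # 21
```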

Match records

What would be the best way to match records with multiple variables like first name / last name?

More generally, something very useful would be a merge function that can accept a list of variables with weights (and where a subset can be required to match exactly)

make useNames more flexible

The useNames argument of stringdistmatrix can only take TRUE (use strings as names for output) or FALSE. It makes more sense to have the choice between:

none no output names (similar to FALSE)
strings use strings (similar to TRUE)
names use names(a) and names(b)

The latter will be especially useful when comparing long strings, like documents using qgrams

Vectors with large n elements failing for stringdistmatrix() when method = cosine

When my colleague and I pass vectors with more than 7k elements to stringdistmatrix using the cosine method, R crashes completely with a segfault that says some memory was not mapped. This is on my Mac. Here's the traceback and error:

*** caught segfault ***
address 0xbc9900000, cause 'memory not mapped'

Traceback:
 1: .Call("R_lower_tri", a, methnr, as.double(weight), as.double(p),     as.integer(q), as.integer(useBytes), as.integer(nthread))
 2: lower_tri(a, method = method, useBytes = useBytes, weight = weight,     useNames = useNames, nthread = nthread)
 3: stringdistmatrix(path.exitURL$exitPagePath_TermPretty, method = "cosine")
 4: eval(expr, envir, enclos)
 5: eval(ei, envir)
 6: withVisible(eval(ei, envir))
 7: source("code/path_analysis/cluster_path.R")

Mac system info:
Model Name: MacBook Pro
Model Identifier: MacBookPro11,5
Processor Name: Intel Core i7
Processor Speed: 2.5 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 16 GB
Boot ROM Version: MBP114.0172.B09
SMC Version (system): 2.30f2

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin15.5.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

It also fails on a fresh ubuntu and R installation. System info:
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty
*-memory
description: System memory
physical id: 0
size: 29GiB
*-cpu
product: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
vendor: Intel Corp.
physical id: 1
bus info: cpu@0
width: 64 bits

When running this example from another issue on the command line in ubuntu the only message I get is Killed :

many_words <- sapply(1:30000, function(x) paste(sample(letters, 10, replace=T),
                                                 collapse=""))
stringdist::stringdistmatrix(many_words, method = 'cosine')
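
For context, a back-of-the-envelope estimate of the result size alone, assuming 8 bytes per stored distance and ignoring any working memory such as q-gram profiles. `dist_bytes()` is a hypothetical helper. The 30,000-word example needs several gigabytes for the result, while 7,000 elements need far less, so the result size by itself does not explain the segfault reported above.

```r
# Hypothetical helper: bytes needed for the n(n-1)/2 doubles of a 'dist'.
dist_bytes <- function(n) n * (n - 1) / 2 * 8

dist_bytes(30000) / 1e9  # about 3.6 GB
dist_bytes(7000) / 1e6   # about 196 MB
```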

retrieve dynamic programming matrix?

We could add functionality to optionally retrieve the DP matrix for edit-like distances. This would conflict with my desire to lower memory usage by not storing the full DP-matrix for computation.

JW-distance's p-parameter ignored when calling stringdistmatrix with a single argument

With thanks to Max Fritsche for reporting this.

To reproduce:

stringdist("aap","apen",method="jw",p=0.1)
stringdist("aap","apen",method="jw",p=0)
x <- c("aap","apen")

stringdistmatrix(x,method="jw")
stringdistmatrix(x,method="jw",p=0.1)

stringdistmatrix(x,x,method="jw")
stringdistmatrix(x,x,method="jw",p=0.1)
