stringdist's People

Contributors

bzki, chrismuir, markvanderloo, moohan, rsaporta, schoonees

stringdist's Issues

storage option for q-grams

Q-grams are stored internally in a binary tree. We could make this user-switchable between

  • unsorted list
  • binary tree
  • hashed storage

depending on the use case.

Installing package fails under Cygwin

I'm using R 3.1.2 under Cygwin. The package compiles fine, but loading it fails during the post-install test, apparently because the number of available threads cannot be determined. Shouldn't it fall back to a default value when that number is not found?

> install.packages("stringdist")
Installing package into ‘/usr/lib/R/site-library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/stringdist_0.9.0.tar.gz'
Content type 'application/x-gzip' length 44307 bytes (43 Kb)
opened URL
==================================================
downloaded 43 Kb

* installing *source* package ‘stringdist’ ...
** package ‘stringdist’ successfully unpacked and MD5 sums checked
** libs
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c dl.c -o dl.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c hamming.c -o hamming.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c jaro.c -o jaro.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c lcs.c -o lcs.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c lv.c -o lv.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c osa.c -o osa.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c qgram.c -o qgram.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c soundex.c -o soundex.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c utf8ToInt.c -o utf8ToInt.o
gcc -I/usr/lib/R/include -DNDEBUG     -fopenmp   -ggdb -O2 -pipe -Wimplicit-function-declaration -std=gnu99 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/build=/usr/src/debug/R-3.1.2-1 -fdebug-prefix-map=/pub/devel/R/R-3.1.2-1.x86_64/src/R-3.1.2=/usr/src/debug/R-3.1.2-1  -c utils.c -o utils.o
gcc -shared -L/usr/lib/R/lib -o stringdist.dll dl.o hamming.o jaro.o lcs.o lv.o osa.o qgram.o soundex.o utf8ToInt.o utils.o -fopenmp -L/usr/lib/R/lib -lR -lintl -lpcre -llzma -lbz2 -lz -lrt -ldl -lm -liconv -licuuc -licui18n
installing to /usr/lib/R/site-library/stringdist/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error : .onLoad failed in loadNamespace() for 'stringdist', details:
  call: if (nthread >= 4) nthread <- nthread - 1
  error: missing value where TRUE/FALSE needed
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/usr/lib/R/site-library/stringdist’

The downloaded source packages are in
        ‘/tmp/RtmpACPgeU/downloaded_packages’
Warning message:
In install.packages("stringdist") :
  installation of package ‘stringdist’ had non-zero exit status
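
The failing call suggests that nthread is NA when .onLoad runs. Below is a minimal sketch of a fallback, assuming the count comes from something like parallel::detectCores(), which can return NA when the number of cores is unknown. default_nthread is a hypothetical helper, not the package's actual code.

```r
# Hypothetical helper (not stringdist's actual .onLoad code): choose a
# default number of threads, falling back to 1 when detection fails.
default_nthread <- function(detected = parallel::detectCores()) {
  # detectCores() may return NA on platforms where the count is unknown.
  if (length(detected) != 1L || is.na(detected) || detected < 1L) {
    return(1L)
  }
  # Same rule as in the error message: leave one core free on larger machines.
  if (detected >= 4L) detected <- detected - 1L
  as.integer(detected)
}
```

With a guard like this, the comparison nthread >= 4 is never evaluated on a missing value.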

Multiprocessor support on Mavericks

Hi

Version 0.8.x would spawn multiple R processes and use all 24 cores on my Mac under R 3.2.1. Version 0.9.x instead spawns multiple threads, not processes. Looking at the Activity Monitor, it's clear that it is using just a single core (and the runtime confirms that).

I've reverted to 0.8.x in the meantime.

stringdist with 'jw' method not producing symmetric distance

Hi,

Please see below the problem I have with asymmetry in the distance for the 'jw' method. This is causing a problem in my data set. In fact, for the specific strings I test in my data set (which I can't share due to PHI), the symmetry check fails even more badly, with a much larger delta in the reverse direction.

Thank you

round(stringdist('HENCERANGE3058RUNAWAY2HELLCITYAA12345', 'RANCHRANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw'), 8) == round(stringdist('RANCHRANGE3058RUNAWAY2HELLCITYAA12345', 'HENCERANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw'), 8)
[1] FALSE
'HENCERANGE3058RUNAWAY2HELLCITYAA12345' == 'HENCERANGE3058RUNAWAY2HELLCITYAA12345'
[1] TRUE
'RANCHRANGE3058RUNAWAY2HELLCITYAA12345' == 'RANCHRANGE3058RUNAWAY2HELLCITYAA12345'
[1] TRUE

stringdist('HENCERANGE3058RUNAWAY2HELLCITYAA12345', 'RANCHRANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw')
[1] 0.2550837
stringdist('RANCHRANGE3058RUNAWAY2HELLCITYAA12345', 'HENCERANGE3058RUNAWAY2HELLCITYAA12345', method = 'jw')
[1] 0.2265122

Feature suggestion - phonetic algorithms implementation

Hello! I ran across your blog and found out about the stringdist package. I think it might be beneficial to implement some functions based on phonetic algorithms, such as Soundex. I apologize if my suggestion doesn't match the goals of your package or if the package already covers this functionality. I believe that implementing my suggestion might also be helpful for automatic data correction processes via the editrules and deducorrect packages, as misspelling is a common data quality problem.

Convention for q=0

After some careful thought, I now believe that the convention for q-gram distances when q=0 should be

d(s,t,q=0) = 0 for all s, t in \Sigma^*

At the moment it is 0 when s=t="" and infinity if |s|+|t|>0. This will probably not break anyone's code but it does deviate from the convention I denoted in the R Journal.

bug in amatch/jaccard

stringdist("600 EXAMPLE AVE NJ 8629", "2100 EXAMPLE AVE NJ 8619", method="jaccard")
[1] 0.0625

stringdist("600 EXAMPLE AVE NJ 8629", "600 EXAMPLE AVE NJ 8629", method="jaccard")
[1] 0

amatch("600 EXAMPLE AVE NJ 8629", c("2100 EXAMPLE AVE NJ 8619", "600 EXAMPLE AVE NJ 8629"), method="jaccard")
[1] 1

Normalise weights in jaro distance

In jaro.c the value of 3.0 is hard-coded:

  } else {
    d = 1.0 - (1.0/3.0)*(w[0]*m/((double) x) + w[1]*m/((double) y) + w[2]*(m-t)/m);
  }

This should be the sum of the weights. Otherwise, the score is no longer necessarily between 0 and 1.
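
To illustrate the proposal, here is a rough R transcription of the formula with the divisor replaced by the sum of the weights. The function name and the handling of m == 0 are assumptions; the quoted C snippet only shows the else-branch.

```r
# Sketch of the weighted Jaro distance, normalised by sum(w) instead of 3.
# m: number of matching characters, t: number of transpositions,
# x, y: lengths of the two strings, w: the three (non-negative) weights.
jaro_distance_normalised <- function(m, t, x, y, w = c(1, 1, 1)) {
  # Assumption: with no matches the distance is maximal. The quoted C code
  # does not show this branch.
  if (m == 0) return(1)
  1 - (w[1] * m / x + w[2] * m / y + w[3] * (m - t) / m) / sum(w)
}
```

Each of the three terms lies between 0 and 1, so dividing by sum(w) yields a weighted average and keeps the result in [0, 1]. With the default weights this reduces to the current formula.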

Q-Gram Filtering

I was wondering if you've thought of including q-gram filtering for edit distance in the stringdist package. Oftentimes users are only concerned with comparing strings that pass a certain similarity threshold, and q-gram filtering allows them to do this significantly quicker than just calculating the Levenshtein distance on all the strings.

Allow for q-gram based distances with multiple q's

For example

stringdist("hello","world",method="cosine", q=1:2)

would yield the cosine distance over the concatenation of 1-gram and 2-gram profiles.

This would also enhance compatibility, e.g. with the textcat package.
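
A base-R sketch of what this could compute, assuming "concatenation" means counting the q-grams for every requested q into one combined profile. The function names are hypothetical and this is not how the package implements q-gram distances.

```r
# Count all q-grams of `s`, for every q in `qs`, into one named vector.
multi_q_profile <- function(s, qs) {
  n <- nchar(s)
  grams <- unlist(lapply(qs, function(q) {
    if (q < 1L || n < q) return(character(0))  # no q-grams of this size
    starts <- seq_len(n - q + 1L)
    substring(s, starts, starts + q - 1L)
  }))
  counts <- table(grams)
  setNames(as.numeric(counts), names(counts))
}

# Cosine distance between the combined profiles of two strings.
# Returns NaN when either profile is empty.
cosine_distance_multi_q <- function(s, t, qs) {
  ps <- multi_q_profile(s, qs)
  pt <- multi_q_profile(t, qs)
  shared <- intersect(names(ps), names(pt))
  dot <- sum(ps[shared] * pt[shared])
  1 - dot / sqrt(sum(ps^2) * sum(pt^2))
}

d <- cosine_distance_multi_q("hello", "world", q = 1:2)
```

For "hello" and "world" only the 1-grams "l" and "o" are shared, so the 2-gram part of the profile contributes to the norms but not to the dot product.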

word-qgrams as well as character-qgrams

Sometimes text data is long and is better thought of as being broken up into "words" rather than "characters."

Your qgram tokenizer is extremely fast and I've found it to be incredibly useful, but I still find myself looking for a fast word-ngram tokenizer.
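
For reference, a plain base-R sketch of a word n-gram tokenizer. It is not optimised for speed, word_ngrams is a hypothetical name, and splitting on whitespace is an assumption about what counts as a "word".

```r
# Split `text` on whitespace and return all runs of `n` consecutive words.
word_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  if (n < 1L || length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1L)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1L)], collapse = " "),
         character(1))
}

bigrams <- word_ngrams("the cat sat on the mat", 2)
```

Tabulating the result with table() gives a word-level profile analogous to the character-level one.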

stringdistmatrix gives unexpected output for large input vectors

When stringdistmatrix is called using a large single input vector, the result is a distance matrix of class dist with fewer elements than expected.

many_words <- sapply(1:100000, function(x) paste(sample(letters, 10, replace=TRUE),
                                                 collapse=""))
# Needs a lot of memory
d <- stringdist::stringdistmatrix(many_words, method = "jw")
size <- attr(d, "Size")
stopifnot(inherits(d, "dist"))
stopifnot(size == length(many_words))
stopifnot(length(d) == size*(size - 1)/2)
# Error: length(d) == size * (size - 1)/2 is not TRUE
# > length(d)
# [1] 704982704
# > size*(size - 1)/2
# [1] 4999950000

For smaller character vectors I did not encounter any problems:

few_words <- sapply(1:1000, function(x) paste(sample(letters, 10, replace=TRUE),
                                              collapse=""))
d <- stringdist::stringdistmatrix(few_words, method = "jw")
stopifnot(inherits(d, "dist"))
size <- attr(d, "Size")
stopifnot(size == length(few_words))
stopifnot(length(d) == size*(size - 1)/2)
# No errors raised

I can provide more info when necessary.
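
A quick arithmetic check, offered as an observation rather than a confirmed diagnosis: the reported length(d) is exactly the expected length modulo 2^32. That is what one would see if the element count wrapped around in a 32-bit integer somewhere. The variable names below are only for illustration.

```r
n <- 100000
expected_length <- n * (n - 1) / 2   # computed in double precision
observed_length <- 704982704         # length(d) reported above

# The expected length does not fit in a signed 32-bit integer ...
exceeds_int <- expected_length > .Machine$integer.max
# ... and reducing it modulo 2^32 reproduces the observed length.
wrapped_length <- expected_length %% 2^32
```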

multi threading not working

Since the move to OpenMP I can't get stringdist to run on more than one core. When nthreads is set to 2, 3, or 4 (my max), I can tell from Activity Monitor that only one core is active. I've reinstalled gcc and checked that OpenMP is working elsewhere; I can't figure out what's wrong! I'm wondering if this is a known issue or just me.

> getOption("sd_num_thread")
[1] 3
> parallel::detectCores()
[1] 4
> sessionInfo(package = NULL)
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringdist_0.9.0

loaded via a namespace (and not attached):
[1] parallel_3.1.3 tools_3.1.3   

Support for distances between integer vectors

So one could hash a (series of) objects or e.g. words in a sentence and compute the distance between them.

Technically this is easy to do; all C-level routines operate on unsigned int arrays.

Option to add q-1 pre- and/or postfixes

Add an option to pad strings with meaningless prefixes and/or postfixes of length q-1 when computing q-gram distances.
Options:

  • a padding function
  • put all under the hood and use non-utf8 codes

solve UBSAN messages

There are some messages from the Clang undefined behaviour sanitizer (UBSAN) reported on CRAN. These should be ironed out for the next release.

parallelisation with OpenMP

This could be an option; but the current compiler used by the Core Team does not support it under Windows.

support for long vectors

At the moment all vectors are indexed with int in the underlying C code. I should update this to size_t.

stringsim function

Most string distances (except perhaps the ones based on string kernels) have a natural and easily computed minimum and maximum. I could add a stringsim function that returns a similarity measure between 0 and 1.
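
As a sketch of the idea for one distance: the unweighted Levenshtein distance is at most the length of the longer string, so dividing by that maximum gives a similarity in [0, 1]. The example uses base R's adist() as a stand-in for the package's own distance, works on single strings only, and lv_similarity is a hypothetical name.

```r
# Similarity in [0, 1] derived from the unweighted Levenshtein distance.
lv_similarity <- function(a, b) {
  max_dist <- max(nchar(a), nchar(b))  # largest possible Levenshtein distance
  if (max_dist == 0) return(1)         # two empty strings are identical
  1 - as.numeric(utils::adist(a, b)) / max_dist
}
```

Other distances would need their own maximum, which is why a general function has to know the bound for each method.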

Better distance for sentences using tokenization + better perf in deduplication of > 10K sentences dataset

In the Python world, there is a very well-known library called fuzzywuzzy. A nice description of its features is available there.

I think stringdist already implements the basis to have similar features which would be very useful in two cases:

  • record linkage
  • deduplication

From my understanding, it just needs a small layer on top of what already exists in stringdist to provide a distance between sentences that is, for instance, not sensitive to word order. Do you think it would be easy to implement?

Kind regards,
Michael
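
One way such a layer could look, as a rough sketch: sort the words of each sentence before comparing, so that word order no longer matters. This is similar in spirit to a token-sort comparison. The function names are hypothetical, and base R's adist() stands in for whichever string distance is preferred.

```r
# Normalise a sentence: lower-case, split on whitespace, sort the words.
token_sort_key <- function(s) {
  tokens <- strsplit(tolower(trimws(s)), "[[:space:]]+")[[1]]
  paste(sort(tokens), collapse = " ")
}

# Levenshtein distance between the normalised sentences.
token_sort_distance <- function(a, b) {
  as.numeric(utils::adist(token_sort_key(a), token_sort_key(b)))
}
```

For example, token_sort_distance("New York Mets", "mets new york") is 0, because both sentences reduce to the same sorted key.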

stringdist gives incorrect result when given weights

When given weights, stringdist will sometimes give an incorrect result. It can even be inconsistent with itself when weights are adjusted by a common factor.

Example:

stringdist("ABC", "BC", method = "lv") # Returns 1, as it should
stringdist("ABC", "BC", method = "lv", weight = c(i=.1, d=.1, s=.1)) # Returns .2, should be .1
stringdist("ABC", "BC", method = "lv", weight = c(i=.1, d=.1, s=1)) # Returns 1, should be .1

This differs from what adist returns for the same inputs, too:

adist("ABC", "AB") # Returns 1
adist("ABC", "AB", costs = c(insertions=.1, deletions=.1, substitutions=.1)) # Returns .1
adist("ABC", "AB", costs = c(insertions=.1, deletions=.1, substitutions=1)) # Returns .1

Version 0.9.2, installed from CRAN.
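
For comparison, here is a small reference implementation of weighted Levenshtein distance as a textbook dynamic program in base R. It is a sketch for checking expected values, not the package's C code, and the weight names d, i and s (deletion, insertion, substitution) are an assumption of this example.

```r
# Weighted Levenshtein distance between single strings `a` and `b`.
weighted_lv <- function(a, b, w = c(d = 1, i = 1, s = 1)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # D[i + 1, j + 1] holds the cost of turning the first i characters of `a`
  # into the first j characters of `b`.
  D <- matrix(0, length(x) + 1L, length(y) + 1L)
  D[, 1] <- (0:length(x)) * w[["d"]]   # delete everything
  D[1, ] <- (0:length(y)) * w[["i"]]   # insert everything
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      sub_cost <- if (x[i] == y[j]) 0 else w[["s"]]
      D[i + 1, j + 1] <- min(D[i, j + 1] + w[["d"]],   # delete x[i]
                             D[i + 1, j] + w[["i"]],   # insert y[j]
                             D[i, j] + sub_cost)       # match or substitute
    }
  }
  D[length(x) + 1L, length(y) + 1L]
}
```

Turning "ABC" into "BC" needs a single deletion, so this returns 0.1 whenever the deletion weight is 0.1, regardless of the substitution weight.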

C-code enhancements

  • Reorganize code and check if better abstractions are possible
  • Smarter indexing
  • Decrease memory footprint (as in lv-implementation of RecordLinkage)

stringsim: scale taking edit weights into account

At the moment stringsim assumes that all weights are equal to 1 for edit-based distances. Although this does yield a valid maximum (weights are maximally 1), using lower weights will lower the maximum possible similarity. It is probably more intuitive to scale the similarities taking weights into account.

Transpose on single argument

I think I found an interesting bug. It looks like when a single string is passed as the first argument to stringdistmatrix, the resulting matrix is transposed. This seems to happen independently of the method being used as well.

> stringdist::stringdistmatrix(c("foo", "bar"), c("foo", "a", "b"), method = "hamming")
     [,1] [,2] [,3]
[1,]    0  Inf  Inf
[2,]    3  Inf  Inf
> stringdist::stringdistmatrix(c("foo"), c("foo", "a", "b"), method = "hamming")
     [,1]
[1,]    0
[2,]  Inf
[3,]  Inf

Better perf for deduplication

There is one use case where it would be possible to get much better perf very easily.

When you want to deduplicate data, you pass the same vector of words twice to the matrix version of stringdist. In the resulting square matrix, each distance appears twice (distances are symmetric) and the diagonal is by definition 0.

> stringdistmatrix(c('abc','abef','cde', 'akj'),c('abc','abef','cde', 'akj'))
     [,1] [,2] [,3] [,4]
[1,]    0    2    3    2
[2,]    2    0    3    3
[3,]    3    3    0    3
[4,]    2    3    3    0

Would it be possible for the function to handle this use case and skip the computations that are not required?

Kind regards,
Michael
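
To make the saving concrete, here is a sketch that evaluates only the n(n-1)/2 pairs below the diagonal and stores them in a base-R dist object. pairwise_lower is a hypothetical helper, and adist() is used as a stand-in distance.

```r
# Compute distances for the lower triangle only and return a "dist" object.
pairwise_lower <- function(x, distfun) {
  n <- length(x)
  if (n < 2L) stop("need at least two strings")
  # Columns of `pairs` are (1,2), (1,3), ..., (n-1,n): the same order in
  # which a dist object stores its lower triangle.
  pairs <- utils::combn(n, 2L)
  d <- mapply(function(i, j) distfun(x[i], x[j]), pairs[1, ], pairs[2, ])
  structure(as.numeric(d), Size = n, Diag = FALSE, Upper = FALSE,
            class = "dist")
}

lv <- function(a, b) as.numeric(utils::adist(a, b))
d <- pairwise_lower(c("abc", "abef", "cde", "akj"), lv)
```

Here only 6 distances are computed instead of 16, and as.matrix(d) recovers the full symmetric matrix when it is needed.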

Method "osa" fails after encountering NA string

The title pretty much says it all - the default method "optimal string alignment" returns NA for every comparison after encountering one NA. Does not depend on the order of arguments, but that's as far as I got with debugging. Package version 0.8.1, R version 3.1.2

a <- c("a", NA, "b", "c")
b <- c("aa", "bb", "cc", "dd")
stringdist(a,b, method="lv")
[1] 1 NA 2 2
stringdist(a,b, method="osa")
[1] 1 NA NA NA

Make output switchable for undefined distances

In certain cases, distance measures between two strings are undefined. The package currently returns Inf in this case (to allow for numerical comparison). This output should be made switchable.

--> Need to phase out maxDist first.

stringdist/qgram behaviour when q>nchar(x)

I understand that the q-gram distance is the sum of absolute differences between q-gram vectors of both strings. But I see some weird behavior when one of the strings is shorter than the chosen q.

So for these two strings, while the qgrams function is correct:

> qgrams("a", "the cat sat on the mat", q = 2)
   th he t  sa on n  ma e   c ca at  s  t  o  m
V1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
V2  2  2  2  1  1  1  1  2  1  1  3  1  1  1  1

The stringdist function returns:

> stringdist("a", "the cat sat on the mat", q = 2, method = "qgram")
[1] Inf

Instead of returning:

> sum(qgrams("a", "the cat sat on the mat", q = 2)[2,])
[1] 21

Posted at SO by Giora Simchoni.
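
The behaviour the poster expects can be sketched in base R as follows, assuming that a string shorter than q simply has an empty q-gram profile. The function names are hypothetical and this is not the package's implementation.

```r
# Named vector of q-gram counts; empty when the string has no q-grams.
qgram_counts <- function(s, q) {
  n <- nchar(s)
  if (q < 1L || n < q) return(integer(0))
  starts <- seq_len(n - q + 1L)
  counts <- table(substring(s, starts, starts + q - 1L))
  setNames(as.integer(counts), names(counts))
}

# Sum of absolute differences between the two q-gram profiles.
qgram_distance <- function(s, t, q) {
  cs <- qgram_counts(s, q)
  ct <- qgram_counts(t, q)
  grams <- union(names(cs), names(ct))
  lookup <- function(counts) {
    v <- counts[grams]
    v[is.na(v)] <- 0L   # q-grams absent from this string count as zero
    v
  }
  sum(abs(lookup(cs) - lookup(ct)))
}

d <- qgram_distance("a", "the cat sat on the mat", q = 2)
```

Under this convention the distance equals the number of 2-grams in the longer string, which is the 21 computed above.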

Match records

What would be the best way to match records with multiple variables, like first name / last name?

More generally, something very useful would be a merge function that accepts a list of variables with weights (and where a subset can be required to match exactly).

make useNames more flexible

The useNames argument of stringdistmatrix can only take TRUE (use strings as names for output) or FALSE. It makes more sense to have the choice between:

  • none: no output names (similar to FALSE)
  • strings: use strings (similar to TRUE)
  • names: use names(a) and names(b)

The latter will be especially useful when comparing long strings, like documents, using q-grams.
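
A sketch of how the three choices could be handled on the R side, using match.arg() on a character argument. label_distances is a hypothetical helper that labels an already computed matrix. It does not reflect the current stringdistmatrix signature.

```r
# Attach row and column labels to a distance matrix `m` computed from a and b.
label_distances <- function(m, a, b,
                            useNames = c("none", "strings", "names")) {
  useNames <- match.arg(useNames)
  dimnames(m) <- switch(useNames,
    none    = NULL,                          # no labels
    strings = list(unname(a), unname(b)),    # the strings themselves
    names   = list(names(a), names(b))       # names(a) and names(b)
  )
  m
}

a <- c(doc1 = "foo", doc2 = "bar")
b <- c(x = "foo", y = "a", z = "b")
m <- label_distances(matrix(0, 2, 3), a, b, useNames = "names")
```

Accepting TRUE and FALSE as aliases for "strings" and "none" would keep existing calls working.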

Vectors with large n elements failing for stringdistmatrix() when method = cosine

When my colleague and I pass vectors with more than 7k elements to stringdistmatrix using the cosine method, R crashes completely with a segfault error saying that memory is not mapped. This happens on my Mac. Here are the traceback and error:

*** caught segfault ***
address 0xbc9900000, cause 'memory not mapped'

Traceback:
 1: .Call("R_lower_tri", a, methnr, as.double(weight), as.double(p),     as.integer(q), as.integer(useBytes), as.integer(nthread))
 2: lower_tri(a, method = method, useBytes = useBytes, weight = weight,     useNames = useNames, nthread = nthread)
 3: stringdistmatrix(path.exitURL$exitPagePath_TermPretty, method = "cosine")
 4: eval(expr, envir, enclos)
 5: eval(ei, envir)
 6: withVisible(eval(ei, envir))
 7: source("code/path_analysis/cluster_path.R")

Mac system info:
Model Name: MacBook Pro
Model Identifier: MacBookPro11,5
Processor Name: Intel Core i7
Processor Speed: 2.5 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 16 GB
Boot ROM Version: MBP114.0172.B09
SMC Version (system): 2.30f2

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin15.5.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

It also fails on a fresh Ubuntu and R installation. System info:
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty
*-memory
description: System memory
physical id: 0
size: 29GiB
*-cpu
product: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
vendor: Intel Corp.
physical id: 1
bus info: cpu@0
width: 64 bits

When running this example from another issue on the command line in Ubuntu, the only message I get is Killed:

many_words <- sapply(1:30000, function(x) paste(sample(letters, 10, replace=TRUE),
                                                collapse=""))
stringdist::stringdistmatrix(many_words, method = 'cosine')

retrieve dynamic programming matrix?

We could add functionality to optionally retrieve the DP matrix for edit-like distances. This would conflict with my desire to lower memory usage by not storing the full DP-matrix for computation.
