GithubHelp home page GithubHelp logo

simplefuzzymatchr's Introduction

Simple Fuzzy Match R

Simple vectorised Fuzzy Matching function in R based on Levenshtein distance

Problem

Base R function agrep() has several limitation in case vectorize matching:

  1. Returns all matches:
    1.1 Either with number of match (1)
    1.2 Either the match case (2)
fruits <- c("apple", "apples", "aple", "melone", "applejuice", "peanapple", "pear")

agrep("apple", fruits) #(1)
#>[1] 1 2 3 5 6

agrep("apple", fruits, value = TRUE) #(2)
#>[1] "apple"      "apples"     "aple"       "applejuice" "peanapple" 
  1. Returns "matches substrings of each element of x" (see documentation agrep()).
    But someone is likely to like strict matching: "apple", "apples", "aple", "applejuice", "peanapple"
  2. Enythere someone also likes vectorize form where in f(x, y): x, y are vectors for matching and f(x, y) returns vector of best matches from y in sequance of x
  3. f(x, y) should return NA if Levenshtein distance of matching is more than some max value

Solution

Simple function get_match() based on adist():

library("purrr")

get_match <- function(x, y, max = 3, cost = 1) {
  dist_mtrx <- adist(x, y, partial = FALSE, costs = cost) 
  dimnames(dist_mtrx) <- list(x, y)
  
  get_best <- function(n, dist_mtrx, max){
    min_dist <- dist_mtrx[n, ] %>% min()
    if(min_dist < max) res <- dist_mtrx[n, ] %>% which.min() %>% names()
    else res <- NA
    return(res)
  }
  
  1:nrow(dist_mtrx) %>% 
    map_chr(~get_best(., dist_mtrx, max = max))
}

#test result
fruits1 <- stringr::fruit
fruits2 <- c("pears","apples", "melone", "per", "apple", "orange")
get_match(fruits2, fruits1)

#>[1] "pear"   "apple"  NA       "pear"   "apple"  "orange"  

Same result but a little bit faster using only base R

get_match <- function(x, y, max = 3, cost = 1) {
  dist_mtrx <- adist(x, y, partial = FALSE, costs = cost) 
  dimnames(dist_mtrx) <- list(x, y)
  
  get_best <- function(n, dist_mtrx, max){
    min_dist <- min(dist_mtrx[n, ])
    if(min_dist < max) res <- names(which.min(dist_mtrx[n, ])) 
    else res <- NA
    return(res)
  }
  
  sapply(X = 1:nrow(dist_mtrx), FUN = function(x) get_best(x, dist_mtrx, max = max))
}

#test result
fruits1 <- stringr::fruit
fruits1

fruits2 <- c("pears","apples", "melone", "per", "apple", "orange")
get_match(fruits2, fruits1)

#>[1] "pear"   "apple"  NA       "pear"   "apple"  "orange"

© 2020 GitHub, Inc.

simplefuzzymatchr's People

Contributors

dkibalnikov avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.