k3jph / phonics-in-r

Phonetic Spelling Algorithms in R

Home Page: https://jameshoward.us/phonics-in-r

License: Other

Languages: R 83.99%, C++ 12.75%, TeX 3.27%
Topics: phonetic-spelling-algorithms, soundex, phonics, nysiis, metaphone, text-processing, linguistics, record-linkage, bsd-2-license

phonics-in-r's Introduction

The phonics package implements phonetic spelling algorithms in R. Several R packages already provide the Soundex algorithm, but other algorithms developed since Soundex can also produce phonetic spellings and test phonetic similarity.

Algorithms included

  • Caverphone
    • Original Caverphone
    • Caverphone 2
  • Cologne (Kölner)
  • Lein
  • Match Rating Approach
    • Encoder
    • Comparison
  • Metaphone
  • New York State Identification and Intelligence System
    • NYSIIS
    • Modified NYSIIS
  • Oxford Name Compression Algorithm
  • Phonex
  • Roger Root
  • Soundex
    • Original Soundex
    • Apache Refined Soundex
  • Statistics Canada
    • Census Modified

Dependencies

  • testthat
  • roxygen2
  • Rcpp
  • BH
  • data.table

Contribution guidelines

For more information

Acknowledgements

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. In particular, it used the Comet system at the San Diego Supercomputer Center (SDSC) through allocations TG-DBS170012 and TG-ASC150024.

phonics-in-r's People

Contributors

ahood, howardjp, kylehaynes

phonics-in-r's Issues

Roger Root

Phonics should include an implementation of the Roger Root name comparison algorithm. See this USDA publication for more information.

Add warnings to ONCA

  • Rewrite the unit tester
  • Add new test cases
  • Rewrite the code to process warnings

Ensure all algorithms return "" for input ""

  • Caverphone
  • Caverphone 2
  • Cologne
  • Lein
  • MRA
  • Metaphone
  • NYSIIS
  • Modified NYSIIS
  • Oxford Name Compression Algorithm
  • Phonex
  • Roger Root
  • Original Soundex
  • Apache Refined Soundex
  • Statistics Canada
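
One way to check this requirement is a small contract test that feeds "" to an encoder and confirms that "" comes back. The sketch below uses base R only. `toy_encoder`, `padding_encoder` and `returns_empty_for_empty` are hypothetical names made up for illustration, not functions from the package.

```r
# Hypothetical stand-in for a phonetic encoder: drops non-letters, uppercases.
toy_encoder <- function(word) toupper(gsub("[^[:alpha:]]", "", word))

# Hypothetical encoder that breaks the contract by always padding with zeros.
padding_encoder <- function(word) paste0(word, "000")

# TRUE only when the encoder maps the empty string to the empty string.
returns_empty_for_empty <- function(encoder) identical(encoder(""), "")

ok_toy <- returns_empty_for_empty(toy_encoder)          # TRUE
ok_padding <- returns_empty_for_empty(padding_encoder)  # FALSE
```

In a real test suite the same check could be run once per algorithm in the list above.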

NYSIIS encoding of 'CHRISTINA'

phonics::nysiis('CHRISTINA') outputs 'CHRASTAN' (for maxCodeLen >= 8), whereas the original algorithm gives 'CRASTAN' (see https://naldc.nal.usda.gov/download/27833/PDF or https://www.springer.com/us/book/9780387695020, and the somewhat vaguer https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System; I can't find the original report by Taft). Steps worked through here: christina.txt

The discrepancy appears to come from nysiis.R line 107, which strips the first letter of the name, i.e.
word <- substr(word, 2, nchar(word))
before the 'H' rule (Step 4.5) is applied, so the rule never sees the leading 'C' that precedes the 'H'.
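
The effect of the ordering can be seen by isolating the "H after a consonant" rule. This is a minimal sketch, not the package code: `drop_h_after_consonant` is a hypothetical helper, and the inputs are the name after the vowel mapping has turned every I into A.

```r
# Hypothetical helper: delete an H that directly follows a non-vowel.
drop_h_after_consonant <- function(x) gsub("([^AEIOU])H", "\\1", x)

# First letter split off before the rule: the H has no preceding letter to test.
first <- "C"
tail_first_removed <- drop_h_after_consonant("HRASTANA")
split_early <- paste0(first, tail_first_removed)   # "CHRASTANA"

# Rule applied while the first letter is still attached: the H is dropped.
split_late <- drop_h_after_consonant("CHRASTANA")  # "CRASTANA"
```

Once the trailing A is removed, these give 'CHRASTAN' and 'CRASTAN' respectively, which matches the reported and the expected encodings.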

Add warnings to MRA

  • Rewrite the unit tester
  • Add new test cases
  • Rewrite the code to process warnings

Ensure all algorithms return NA for input NA

  • Caverphone
  • Caverphone 2
  • Cologne
  • Lein
  • MRA
  • Metaphone
  • NYSIIS
  • Modified NYSIIS
  • Oxford Name Compression Algorithm
  • Phonex
  • Roger Root
  • Original Soundex
  • Apache Refined Soundex
  • Statistics Canada
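
A generic way to meet this requirement is to keep NA values away from the encoder and put them back afterwards. The sketch below assumes a vectorised encoder. `naive_encoder` and `with_na_passthrough` are hypothetical names for illustration and do not come from the package.

```r
# Hypothetical naive encoder: first letter plus three zeros.
naive_encoder <- function(word) paste0(substr(word, 1, 1), "000")

# Wrap an encoder so NA inputs come back as NA_character_ and never reach it.
with_na_passthrough <- function(encoder) {
  function(word) {
    out <- rep(NA_character_, length(word))
    ok <- !is.na(word)
    out[ok] <- encoder(word[ok])
    out
  }
}

safe_encoder <- with_na_passthrough(naive_encoder)
unsafe <- naive_encoder(c("Ann", NA, "Bob"))  # NA becomes the bogus code "NA000"
safe <- safe_encoder(c("Ann", NA, "Bob"))     # NA stays NA
```

The unwrapped call shows why the guard matters: `paste0()` silently turns NA into the string "NA", which then looks like a valid code.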

Metaphone crashing when encoding "gh"

Describe the bug
metaphone crashes when encoding "gh"

Possibly this is version dependent - I'm running an old R and cannot upgrade until I buy a new computer.

It's just strange that it seems to work for many words and only crash on gh. Maybe gh is producing some sort of strange unicode or something? IDK, I'm not much of a user of this package, but I need to get some code to run and this is breaking it. Any help would be appreciated.

To Reproduce
phonics::metaphone("sigh")

Or any other word with gh in it, as far as I can tell

Expected behavior
Should return the metaphone encoding for sigh.

Example

> phonics::metaphone("ruff")
[1] "RF"
> phonics::metaphone("rough")
Error in metaphone_internal(word, maxCodeLen) : 
  c++ exception (unknown reason)
> phonics::metaphone("funhouse")
[1] "FNHS"
> phonics::metaphone("bughouse")
Error in metaphone_internal(word, maxCodeLen) : 
  c++ exception (unknown reason)
library(stringr); words[!str_detect(words,"gh")] %>% phonics::metaphone()
# works properly on 962 other words :-)

Desktop (please complete the following information):

> version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          6.1                         
year           2019                        
month          07                          
day            05                          
svn rev        76782                       
language       R                           
version.string R version 3.6.1 (2019-07-05)
nickname       Action of the Toes

Running phonics v1.3.9
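
Until the crash itself is fixed, a batch job can be kept running by catching the error per word and recording NA instead. The sketch below is a workaround under stated assumptions, not a fix: `fragile_encoder` is a hypothetical stand-in that fails on "gh" the way the report describes, and `encode_or_na` is a made-up helper name. In real use the `encoder` argument would be something like `phonics::metaphone`.

```r
# Hypothetical stand-in that errors on any word containing "gh".
fragile_encoder <- function(word) {
  if (grepl("gh", word, fixed = TRUE)) stop("c++ exception (unknown reason)")
  toupper(word)
}

# Encode each word separately; a failing word yields NA instead of aborting.
encode_or_na <- function(words, encoder) {
  vapply(
    words,
    function(w) tryCatch(encoder(w), error = function(e) NA_character_),
    character(1),
    USE.NAMES = FALSE
  )
}

input <- c("ruff", "rough", "funhouse")
codes <- encode_or_na(input, fragile_encoder)
failed <- input[is.na(codes)]  # words to inspect by hand
```

Calling the encoder one word at a time is slower than a single vectorised call, but it isolates the failing inputs so the rest of the data can still be processed.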

Add warnings to Lein

  • Rewrite the unit tester
  • Add new test cases
  • Rewrite the code to process warnings

soundex single characters

Hi James (sorry for calling you by your surname!),

Currently, single-character strings are returned without zero padding. Would you consider this a bug?

Looking at three implementations of soundex ...

phonics::soundex("A")
# [1] "A"
RecordLinkage::soundex("A")
# [1] "A000"
stringdist::phonetic("A")
# [1] "A000"

It's a pretty rare edge case, but the names I deal with sometimes include abbreviations. When doing linkage, if a name was "DA" in one dataset and "D" in another, I might consider them a pair, but blocking on the Soundex code wouldn't produce one ("D" vs "D000").

Happy to do a pull request if you agree.

NYSIIS encoding of 'JOHN'

nysiis_original() returns 'J', whereas the encoding should be 'JAN'. This comes from a mistake in the use of gsub: the previous and next letters are part of the matched pattern, so they are consumed by the replacement, instead of being tested with lookarounds. I have forked the repository and will fix it.
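
The difference between consuming a neighbouring letter and only testing it can be shown on the tail of JOHN. This is a minimal sketch of the "next letter is a non-vowel" half of the H rule only; the first pattern is the one used in the perl = TRUE variant quoted in this thread, and the lookahead version is one possible rewrite, not necessarily the fix that was merged.

```r
# Tail of JOHN after the first letter is set aside and O is mapped to A.
tail_code <- "AHN"

# Consuming version: the following letter is part of the match, so the N is lost.
consuming <- gsub("(.)H[^AEIOU]", "\\1", tail_code, perl = TRUE)  # "A"

# Lookahead version: the next letter is tested but not consumed.
lookahead <- gsub("H(?=[^AEIOU])", "", tail_code, perl = TRUE)    # "AN"

fixed_code <- paste0("J", lookahead)                              # "JAN"
```

With the consuming pattern only "A" is left, and the later "remove a final A" step empties the tail, which explains the reported 'J'.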

Soundex returning single letter instead of augmenting with zeros

If I understand correctly from the Soundex algorithm steps on Wikipedia, the encoding of e.g. the string 'A' should be 'A000'. Indeed this is what is produced by other Soundex implementations I'm looking at. However, phonics::soundex('A') returns 'A'.

Happy to make a pull request if you agree that 'A000' is the correct encoding and if you agree with the rule that "If you have too few letters in your word that you can't assign three numbers, append with zeros until there are three numbers" (quoting from Step 4 in the Wikipedia article).
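
The padding rule quoted above could be sketched as a post-processing step like the one below. `pad_soundex` is a hypothetical helper, not part of the package, and leaving "" and NA untouched is an assumption made here so that the padding does not invent codes for missing input.

```r
# Hypothetical helper: right-pad short, non-empty codes with zeros to `width`.
pad_soundex <- function(code, width = 4L) {
  short <- !is.na(code) & nchar(code) > 0 & nchar(code) < width
  code[short] <- paste0(code[short], strrep("0", width - nchar(code[short])))
  code
}

padded <- pad_soundex(c("A", "D5", "R163", ""))  # "A000" "D500" "R163" ""
```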

NYSIIS encoding of 'HANNAH'

Both nysiis_original() and nysiis_modified() are returning 'HANAH'. The encoding rule for a terminal 'H' is ambiguous in this case because of its definition in terms of the preceding and following letters, whereas there is no following letter for the last letter in the name. However it seems more in the spirit of this phonetic encoding to omit the final 'H' (and therefore the second 'A') from the final encoding, and to return 'HAN' instead. The latter interpretation has been adopted in the plurality of implementations here, by the way.
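
The two readings of the terminal 'H' rule can be compared directly. The sketch below covers only the "next letter" half of the H rule (the letter before the final H is a vowel here, so the other half does not apply), and `finish` is a hypothetical helper that stands in for the later steps of collapsing repeated letters and dropping a final A.

```r
# Tail of HANNAH after the first letter is set aside; vowels are already A.
tail_code <- "ANNAH"

# Hypothetical helper for the later steps: collapse repeats, drop a final A.
finish <- function(x) sub("A$", "", gsub("([A-Z])\\1+", "\\1", x, perl = TRUE))

# Strict reading: a final H has no next letter, so it is kept.
strict <- gsub("H(?=[^AEIOU])", "", tail_code, perl = TRUE)
# Lenient reading: the end of the string also counts as "next is non-vowel".
lenient <- gsub("H(?=[^AEIOU]|$)", "", tail_code, perl = TRUE)

strict_code <- paste0("H", finish(strict))    # "HANAH"
lenient_code <- paste0("H", finish(lenient))  # "HAN"
```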

Match Rating Approach

Phonics should include the match rating approach algorithm, including the comparison engine.

Use of perl = TRUE

Hi Howard,

Thanks for the package.

Have you ever considered the use of the perl = TRUE argument in a lot of your gsub() functions?

It offers considerable time benefits.

Below is an example having updated the nysiis_original function.

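# install.packages("babynames")
# install.packages("phonics")
# install.packages("microbenchmark")
library("babynames")
library("phonics")
library("microbenchmark")
# nysiis_original_perl() is defined at the end of this post; run that first.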

name <- babynames$name

length(name)
# 1858689

system.time(a <- nysiis_original_perl(name))
# user  system elapsed 
# 13.36    0.14   13.54 

system.time(b <- nysiis(name))
#  user  system elapsed 
# 22.75    0.24   23.02 

# All equal?
all.equal(a, b)
# [1] TRUE

# microbenchmark'ing
microbenchmark(
  nysiis_original_perl(name),
  nysiis(name), times = 25
)
# Unit: milliseconds
#                        expr      min       lq     mean   median       uq      max neval
#  nysiis_original_perl(name) 308.5931 311.0220 316.0347 312.2456 315.8408 345.8459    25
#                nysiis(name) 568.2662 573.1073 577.4318 575.4571 577.5975 606.7362    25

sessionInfo()
# R version 3.5.0 (2018-04-23)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 17763)
# 
# Matrix products: default
# 
# locale:
# [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                       LC_TIME=English_Australia.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] phonics_1.1.0   babynames_0.3.0
# 
# loaded via a namespace (and not attached):
# [1] compiler_3.5.0 tools_3.5.0    pillar_1.3.1   tibble_1.4.2   Rcpp_1.0.0     crayon_1.3.4   rlang_0.3.0.1 


# nysiis_original with perl = TRUE ...
nysiis_original_perl <- function(word, maxCodeLen = 6) {

    ## First, remove any nonalphabetical characters and capitalize it
    word <- gsub("[^[:alpha:]]*", "", word, perl = TRUE)
    word <- toupper(word)

    ## Translate first characters of name: MAC to MCC, KN to N, K to C, PH,
    ## PF to FF, SCH to SSS
    word <- gsub("^MAC", "MCC", word, perl = TRUE)
    word <- gsub("KN", "NN", word, perl = TRUE)
    word <- gsub("K", "C", word, perl = TRUE)
    word <- gsub("^PF", "FF", word, perl = TRUE)
    word <- gsub("PH", "FF", word, perl = TRUE)
    word <- gsub("SCH", "SSS", word, perl = TRUE)

    ## Translate last characters of name: EE to Y, IE to Y, DT, RT, RD,
    ## NT, ND to D
    word <- gsub("EE$", "Y", word, perl = TRUE)
    word <- gsub("IE$", "Y", word, perl = TRUE)
    word <- gsub("DT$", "D", word, perl = TRUE)
    word <- gsub("RT$", "D", word, perl = TRUE)
    word <- gsub("RD$", "D", word, perl = TRUE)
    word <- gsub("NT$", "D", word, perl = TRUE)
    word <- gsub("ND$", "D", word, perl = TRUE)

    ## First character of key = first character of name.
    first <- substr(word, 1, 1)
    word <- substr(word, 2, nchar(word))

    ## EV to AF else A, E, I, O, U to A
    word <- gsub("EV", "AF", word, perl = TRUE)
    word <- gsub("E|I|O|U", "A", word, perl = TRUE)

    ## Q to G, Z to S, M to N
    word <- gsub("Q", "G", word, perl = TRUE)
    word <- gsub("Z", "S", word, perl = TRUE)
    word <- gsub("M", "N", word, perl = TRUE)

    ## KN to N else K to C
    ## SCH to SSS, PH to FF
    ## Rules are implemented as part of opening block

    ## H to If previous or next is non-vowel, previous.
    word <- gsub("([^AEIOU])H", "\\1", word, perl = TRUE)
    word <- gsub("(.)H[^AEIOU]", "\\1", word, perl = TRUE)

    ## W to If previous is vowel, A
    word <- gsub("([AEIOU])W", "A", word, perl = TRUE)

    ## If last character is S, remove it
    word <- gsub("S$", "", word, perl = TRUE)

    ## If last characters are AY, replace with Y
    word <- gsub("AY$", "Y", word, perl = TRUE)

    ## Remove duplicate consecutive characters
    word <- gsub("([A-Z])\\1+", "\\1", word, perl = TRUE)

    ## If last character is A, remove it
    word <- gsub("A$", "", word, perl = TRUE)

    ## Append word except for first character to first
    word <- paste(first, word, sep = "")

    ## Truncate to requested length
    word <- substr(word, 1, maxCodeLen)

    return(word)
}
