The phonics-in-r from k3jph

Roger Root

Phonics should include an implementation of the Roger Root name comparison algorithm. See this USDA publication for more information.

Add Beider-Morse

You can find more information about Beider-Morse at http://stevemorse.org/phonetics/bmpm.htm

Add warnings to ONCA

Rewrite the unit tester
Add new test cases
Rewrite the code to process warnings
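
As a rough illustration of the kind of warning and test these tasks describe, here is a minimal base-R sketch. The helper name `check_alpha` and the warning text are assumptions made for this example; they are not taken from the phonics source.

```r
# Hypothetical helper, not part of phonics: flags inputs that contain
# anything other than ASCII letters and raises a single warning.
check_alpha <- function(word) {
  bad <- !is.na(word) & grepl("[^A-Za-z]", word)
  if (any(bad)) {
    warning("non-alphabetical characters found, results may be inconsistent")
  }
  bad
}

# Base-R check that the warning fires; a testthat suite would use
# expect_warning() for the same purpose.
caught <- tryCatch(
  {
    check_alpha(c("Smith", "O'Brien"))
    FALSE
  },
  warning = function(w) TRUE
)
print(caught)
```

Each encoder could call a helper like this before encoding, so the unit tests only need to assert that the warning is raised for inputs with unexpected characters.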

Ensure all algorithms return "" for input ""

NYSIIS encoding of 'CHRISTINA'

Noticed that phonics::nysiis('CHRISTINA') outputs 'CHRASTAN' (for maxCodeLen >= 8), whereas it should be 'CRASTAN' as per the original algorithm (see https://naldc.nal.usda.gov/download/27833/PDF or https://www.springer.com/us/book/9780387695020 and the somewhat more vague https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System; I can't find the original report by Taft). Steps worked through here: christina.txt

The discrepancy appears to come from nysiis.R line 107, which drops the first letter of the name, i.e.
word <- substr(word, 2, nchar(word))
before the 'H' rule (Step 4.5) is applied.
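
The effect can be seen in isolation with a small sketch. It applies only the "previous character is a non-vowel" half of the H rule, using a hypothetical helper `h_rule` rather than the package code, once to the whole name and once to the name with its first letter stripped.

```r
# Hypothetical helper: drop an H that follows a non-vowel.
h_rule <- function(x) gsub("([^AEIOU])H", "\\1", x, perl = TRUE)

# H rule applied while the first letter is still attached: the H is removed.
whole <- h_rule("CHRISTINA")

# H rule applied after the first letter was split off: the H has no
# preceding character, so the pattern cannot match and the H survives.
stripped <- paste0("C", h_rule("HRISTINA"))

print(whole)     # "CRISTINA"
print(stripped)  # "CHRISTINA"
```

With the first letter kept in place the H disappears, which is consistent with the expected 'CRASTAN'; with it stripped the H survives into the final code.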

Add warnings to MRA

Rewrite the unit tester
Add new test cases
Rewrite the code to process warnings

Ensure all algorithms return NA for input NA
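
One way to meet this requirement, together with the empty-string requirement listed earlier, is a thin guard around each encoder. The sketch below is an assumption about how that could look; `with_guards` is a hypothetical name and `toupper` merely stands in for a real phonetic encoder.

```r
# Hypothetical wrapper: NA stays NA, "" stays "", and everything else
# is passed to the supplied encoder.
with_guards <- function(word, encoder) {
  out <- rep(NA_character_, length(word))
  empty <- !is.na(word) & word == ""
  todo <- !is.na(word) & word != ""
  out[empty] <- ""
  if (any(todo)) {
    out[todo] <- encoder(word[todo])
  }
  out
}

# toupper is only a placeholder for an encoder such as soundex().
result <- with_guards(c("Smith", "", NA), toupper)
print(result)  # "SMITH" "" NA
```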

Add support for Eudex

More information on Eudex is available at https://github.com/ticki/eudex

Add warnings to Statcan

Rewrite the unit tester
Add new test cases
Rewrite the code to process warnings

Add support for fuzzy Soundex

More information on fuzzy Soundex is available from http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf

Add warnings to Cologne

Rewrite the unit tester
Add new test cases
Rewrite the code to process warnings

Add warnings to NYSIIS

Rewrite the unit tester
Add new test cases
Rewrite the code to process warnings

Metaphone crashing when encoding "gh"

Describe the bug
metaphone crashes when encoding "gh"

Possibly this is version dependent - I'm running an old R and cannot upgrade until I buy a new computer.

It's just strange that it seems to work for many words and only crash on gh. Maybe gh is producing some sort of strange unicode or something? IDK, I'm not much of a user of this package, but I need to get some code to run and this is breaking it. Any help would be appreciated.

To Reproduce
phonics::metaphone("sigh")

Or any other word with gh in it, as far as I can tell

Expected behavior
Should return the metaphone encoding for sigh.

Example

> phonics::metaphone("ruff")
[1] "RF"
> phonics::metaphone("rough")
Error in metaphone_internal(word, maxCodeLen) : 
  c++ exception (unknown reason)
> phonics::metaphone("funhouse")
[1] "FNHS"
> phonics::metaphone("bughouse")
Error in metaphone_internal(word, maxCodeLen) : 
  c++ exception (unknown reason)
> library(stringr); words[!str_detect(words,"gh")] %>% phonics::metaphone()
# works properly on 962 other words :-)

Desktop (please complete the following information):

> version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          6.1                         
year           2019                        
month          07                          
day            05                          
svn rev        76782                       
language       R                           
version.string R version 3.6.1 (2019-07-05)
nickname       Action of the Toes

Running phonics v1.3.9
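
Until the crash is fixed, one possible workaround is to encode word by word and turn errors into NA. This is only a sketch: `safe_encode` and `fake_encoder` are hypothetical names, the stand-in encoder just imitates the reported failure, and the approach assumes the problem surfaces as an ordinary R error (as in the transcript above) rather than a hard crash of the R session.

```r
# Hypothetical wrapper: encode each word separately and return NA
# for any word whose encoding raises an error.
safe_encode <- function(words, encoder) {
  vapply(words, function(w) {
    tryCatch(encoder(w), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
}

# Stand-in encoder that fails on "gh", imitating the reported behaviour.
fake_encoder <- function(w) {
  if (grepl("gh", w, fixed = TRUE)) {
    stop("c++ exception (unknown reason)")
  }
  toupper(w)
}

encoded <- safe_encode(c("ruff", "rough"), fake_encoder)
print(encoded)  # "RUFF" NA
```

In real use the second argument would be phonics::metaphone, so the words containing "gh" come back as NA instead of stopping the whole run.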

Add warnings to Lein

Rewrite the unit tester
Add new test cases
Rewrite the code to process warnings

soundex single characters

Hi James (sorry for calling you by your surname!),

Currently, single-character strings are returned without being padded out with 0's. Would you consider this a bug?

Looking at three implementations of soundex ...

phonics::soundex("A")
# [1] "A"
RecordLinkage::soundex("A")
# [1] "A000"
stringdist::phonetic("A")
# [1] "A000"

It's a pretty edge case, but the names I deal with sometimes include abbreviations. When doing linkage, if a name was "DA" in one dataset and "D" in another, I might consider it a pair, yet blocking on the Soundex of the name wouldn't produce that pair ("D" vs "D000").

Happy to do a pull request if you agree.

NYSIIS encoding of 'JOHN'

nysiis_original() returns 'J', whereas the encoding should be 'JAN'. This is a mistake in the use of gsub (both previous and next letters were part of the 'string to replace' instead of lookarounds being used). Have forked and will fix.
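
The difference between a consuming pattern and a lookahead can be sketched on the intermediate string for 'JOHN'. After the first letter is split off and the vowels are mapped to A, the remainder is "AHN"; the helper `finish` is a hypothetical stand-in for the last steps of the encoder.

```r
# Consuming pattern: the letter after H is part of the match, so it is
# deleted together with the H.
consuming <- gsub("(.)H[^AEIOU]", "\\1", "AHN", perl = TRUE)

# Lookahead: the following letter is only inspected, not replaced.
lookahead <- gsub("(.)H(?=[^AEIOU])", "\\1", "AHN", perl = TRUE)

# Hypothetical final steps: drop a trailing A, then reattach the first letter.
finish <- function(rest) paste0("J", gsub("A$", "", rest, perl = TRUE))

print(finish(consuming))  # "J"
print(finish(lookahead))  # "JAN"
```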

Add warnings to Phonex

Rewrite the unit tester
Add new test cases
Rewrite the code to process warnings

Add warnings to Caverphone

Rewrite the unit tester
Add new test cases
Rewrite the code to process warnings

Kölner Phonetik

Add support for Kölner Phonetik

Daitch–Mokotoff Soundex

The package should include an implementation of Daitch-Mokotoff Soundex

Soundex returning single letter instead of augmenting with zeros

If I understand correctly from the Soundex algorithm steps on Wikipedia, the encoding of e.g. the string 'A' should be 'A000'. Indeed this is what is produced by other Soundex implementations I'm looking at. However, phonics::soundex('A') returns 'A'.

Happy to make a pull request if you agree that 'A000' is the correct encoding and if you agree with the rule that "If you have too few letters in your word that you can't assign three numbers, append with zeros until there are three numbers" (quoting from Step 4 in the Wikipedia article).
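
The padding rule quoted above could be applied as a post-processing step along these lines. This is a sketch, not the package's code: `pad_soundex` is a hypothetical helper, and leaving NA and empty strings untouched is an assumption.

```r
# Hypothetical helper: right-pad short Soundex codes with zeros so that
# every non-empty code has one letter plus three digits.
pad_soundex <- function(code, width = 4) {
  short <- !is.na(code) & nchar(code) > 0 & nchar(code) < width
  code[short] <- paste0(code[short], strrep("0", width - nchar(code[short])))
  code
}

padded <- pad_soundex(c("A", "D", "R163", NA, ""))
print(padded)  # "A000" "D000" "R163" NA ""
```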

NYSIIS encoding of 'HANNAH'

Both nysiis_original() and nysiis_modified() are returning 'HANAH'. The encoding rule for a terminal 'H' is ambiguous in this case because of its definition in terms of the preceding and following letters, whereas there is no following letter for the last letter in the name. However it seems more in the spirit of this phonetic encoding to omit the final 'H' (and therefore the second 'A') from the final encoding, and to return 'HAN' instead. The latter interpretation has been adopted in the plurality of implementations here, by the way.
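
The proposed interpretation amounts to treating the end of the name like a non-vowel. A sketch on the intermediate string for 'HANNAH' (first letter split off, vowels already mapped to A) shows the effect; the patterns are illustrative and are not the package's own.

```r
rest <- "ANNAH"

# H followed by a non-vowel OR by the end of the string is dropped.
rest <- gsub("(.)H(?=[^AEIOU]|$)", "\\1", rest, perl = TRUE)  # "ANNA"

# Collapse repeated letters, then drop a trailing A.
rest <- gsub("([A-Z])\\1+", "\\1", rest, perl = TRUE)         # "ANA"
rest <- gsub("A$", "", rest, perl = TRUE)                     # "AN"

code <- paste0("H", rest)
print(code)  # "HAN"
```

Without the end-of-string alternative in the lookahead, the same steps leave the final H in place and yield 'HANAH', matching the current output.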

Add Metaphone3

Add warnings to Metaphone

Rewrite the unit tester
Add new test cases
Rewrite the code to process warnings

Use of perl = TRUE

Hi Howard,

Thanks for the package.

Have you ever considered the use of the perl = TRUE argument in a lot of your gsub() functions?

It offers considerable time benefits.

Below is an example having updated the nysiis_original function.

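# install.packages("babynames")
# install.packages("phonics")
# install.packages("microbenchmark")
library("babynames")
library("phonics")
library("microbenchmark")

# nysiis_original_perl() is defined at the end of this post; run that definition first.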

name <- babynames$name

length(name)
# 1858689

system.time(a <- nysiis_original_perl(name))
# user  system elapsed 
# 13.36    0.14   13.54 

system.time(b <- nysiis(name))
#  user  system elapsed 
# 22.75    0.24   23.02 

# All equal?
all.equal(a, b)
# [1] TRUE

# microbenchmark'ing
microbenchmark(
  nysiis_original_perl(name),
  nysiis(name), times = 25
)
# Unit: milliseconds
#                        expr      min       lq     mean   median       uq      max neval
#  nysiis_original_perl(name) 308.5931 311.0220 316.0347 312.2456 315.8408 345.8459    25
#                nysiis(name) 568.2662 573.1073 577.4318 575.4571 577.5975 606.7362    25

sessionInfo()
# R version 3.5.0 (2018-04-23)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 17763)
# 
# Matrix products: default
# 
# locale:
# [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                       LC_TIME=English_Australia.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] phonics_1.1.0   babynames_0.3.0
# 
# loaded via a namespace (and not attached):
# [1] compiler_3.5.0 tools_3.5.0    pillar_1.3.1   tibble_1.4.2   Rcpp_1.0.0     crayon_1.3.4   rlang_0.3.0.1 


# nysiis_original with perl = TRUE ...
nysiis_original_perl <- function(word, maxCodeLen = 6) {

    ## First, remove any nonalphabetical characters and capitalize it
    word <- gsub("[^[:alpha:]]*", "", word, perl = TRUE)
    word <- toupper(word)

    ## Translate first characters of name: MAC to MCC, KN to NN, K to C,
    ## PH and PF to FF, SCH to SSS
    word <- gsub("^MAC", "MCC", word, perl = TRUE)
    word <- gsub("KN", "NN", word, perl = TRUE)
    word <- gsub("K", "C", word, perl = TRUE)
    word <- gsub("^PF", "FF", word, perl = TRUE)
    word <- gsub("PH", "FF", word, perl = TRUE)
    word <- gsub("SCH", "SSS", word, perl = TRUE)

    ## Translate last characters of name: EE to Y, IE to Y, DT, RT, RD,
    ## NT, ND to D
    word <- gsub("EE$", "Y", word, perl = TRUE)
    word <- gsub("IE$", "Y", word, perl = TRUE)
    word <- gsub("DT$", "D", word, perl = TRUE)
    word <- gsub("RT$", "D", word, perl = TRUE)
    word <- gsub("RD$", "D", word, perl = TRUE)
    word <- gsub("NT$", "D", word, perl = TRUE)
    word <- gsub("ND$", "D", word, perl = TRUE)

    ## First character of key = first character of name.
    first <- substr(word, 1, 1)
    word <- substr(word, 2, nchar(word))

    ## EV to AF else A, E, I, O, U to A
    word <- gsub("EV", "AF", word, perl = TRUE)
    word <- gsub("E|I|O|U", "A", word, perl = TRUE)

    ## Q to G, Z to S, M to N
    word <- gsub("Q", "G", word, perl = TRUE)
    word <- gsub("Z", "S", word, perl = TRUE)
    word <- gsub("M", "N", word, perl = TRUE)

    ## KN to N else K to C
    ## SCH to SSS, PH to FF
    ## Rules are implemented as part of opening block

    ## H: if the previous or next character is a non-vowel, replace H with the previous character
    word <- gsub("([^AEIOU])H", "\\1", word, perl = TRUE)
    word <- gsub("(.)H[^AEIOU]", "\\1", word, perl = TRUE)

    ## W: if the previous character is a vowel, replace W with A
    word <- gsub("([AEIOU])W", "A", word, perl = TRUE)

    ## If last character is S, remove it
    word <- gsub("S$", "", word, perl = TRUE)

    ## If last characters are AY, replace with Y
    word <- gsub("AY$", "Y", word, perl = TRUE)

    ## Remove duplicate consecutive characters
    word <- gsub("([A-Z])\\1+", "\\1", word, perl = TRUE)

    ## If last character is A, remove it
    word <- gsub("A$", "", word, perl = TRUE)

    ## Append word except for first character to first
    word <- paste(first, word, sep = "")

    ## Truncate to requested length
    word <- substr(word, 1, maxCodeLen)

    return(word)
}

Add warnings to Soundex

Rewrite the unit tester
Add new test cases
Rewrite the code to process warnings
