namae's Introduction

Namae (名前)

Namae is a parser for human names. It recognizes personal names of various cultural backgrounds and tries to split them into their component parts (e.g., given and family names, honorifics etc.).

Quickstart

  1. Install the namae gem (or add it to your Gemfile):

     $ gem install namae
    
  2. Start parsing names! Namae expects you to pass in a string and it returns a list of parsed names:

     require 'namae'
    
     names = Namae.parse 'Yukihiro "Matz" Matsumoto'
     #-> [#<Name family="Matsumoto" given="Yukihiro" nick="Matz">]
    
  3. Use the name objects to access the individual parts:

     matz = names[0]
    
     matz.nick
     #-> "Matz"
    
     matz.family
     #-> "Matsumoto"
    
     matz.initials
     #-> "Y.M."
    
     matz.initials :expand => true
     #-> "Y. Matsumoto"
    
     matz.initials :dots => false
     #-> "YM"
    

Rails

It is easy to integrate Namae into your Rails project. There are two typical cases where this might be useful: you want to store individual parts of a person's name in your database but want to provide your users with a single input field; or you keep personal names in a single database column but your application occasionally requires access to individual parts.

For the latter use case, there is a straightforward way to add Namae to your Rails model:

class Person < ActiveRecord::Base
  attr_accessible :name

  delegate :family, :initials, :to => :namae

  private

  def namae
    @namae ||= Namae::Name.parse(name)
  end
end

In this minimal example, we use the method Namae::Name.parse, which always returns a single Name instance, and delegate the readers for the name parts we are interested in to this instance.

Format and Examples

Namae recognizes names in two basic formats, internally referred to as display-order and sort-order. For example, the following names are written in display-order:

Namae.parse 'Charles Babbage'
#-> [#<Name family="Babbage" given="Charles">]

Namae.parse 'Mr. Alan M. Turing'
#-> [#<Name family="Turing" given="Alan M." appellation="Mr.">]

Namae.parse 'Yukihiro "Matz" Matsumoto'
#-> [#<Name family="Matsumoto" given="Yukihiro" nick="Matz">]

Namae.parse 'Augusta Ada King and Lord Byron'
#-> [#<Name family="King" given="Augusta Ada">, #<Name family="Byron" title="Lord">]

Namae.parse 'Sir Isaac Newton'
#-> [#<Name family="Newton" given="Isaac" title="Sir">]

Namae.parse 'Prof. Donald Ervin Knuth'
#-> [#<Name family="Knuth" given="Donald Ervin" title="Prof.">]

Namae.parse 'Ms. Sofia Kovaleskaya'
#-> [#<Name family="Kovaleskaya" given="Sofia" appellation="Ms.">]

Namae.parse 'Countess Ada Lovelace'
#-> [#<Name family="Lovelace" given="Ada" title="Countess">]

Namae.parse 'Ken Griffey Jr.'
#-> [#<Name family="Griffey" given="Ken" suffix="Jr.">]

Or in sort-order:

Namae.parse 'Turing, Alan M.'
#-> [#<Name family="Turing" given="Alan M.">]

You can also mix sort- and display-order in the same expression:

Namae.parse 'Torvalds, Linus and Alan Cox'
#-> [#<Name family="Torvalds" given="Linus">, #<Name family="Cox" given="Alan">]

Typically, sort-order names are easier to parse, because the syntax is less ambiguous. For example, multiple family names are always possible in sort-order:

Namae.parse 'Carreño Quiñones, María-Jose'
#-> [#<Name family="Carreño Quiñones" given="María-Jose">]

Whilst in display-order, multiple family names are only supported when the name contains a particle or a nickname.

Namae tries to detect common particles using the :uppercase_particle lexer pattern. If you prefer to always include particles with the family name, you can set the :include_particle_in_family parser option.

Namae.parse 'Ludwig von Beethoven'
#-> [#<Name family="Beethoven" given="Ludwig" particle="von">]

Namae.options[:include_particle_in_family] = true
Namae.parse 'Ludwig von Beethoven'
#-> [#<Name family="von Beethoven" given="Ludwig">]

Configuration

You can tweak some of Namae's parse rules by configuring the parser's options. Take a look at Namae.options to see your current settings. If you want to change the default settings for all parsers, you can run Namae.configure which will yield the default options (make sure to change the configuration before using the parser).

A Note On Thread Safety

When using the top-level parse functions, Namae will re-use a thread-local parser instance (Namae::Parser.instance); the instance is created using the current default options (Namae::Parser.defaults). If you need more control, you are encouraged to create individual parser instances using Namae::Parser.new.

Rationale

Parsing human names is at once too easy and too hard. When working in the confines of a single language or culture it is often a trivial task that does not warrant a dedicated software package; when working across different cultures, languages, or scripts, however, it may quickly become unrealistic to devise a satisfying, one-size-fits-all solution. In languages like Japanese or Chinese, for instance, the issue of word segmentation alone is probably more difficult than name parsing itself.

Having said that, Namae is based on the rules used by BibTeX to format names and can therefore be used to parse names of most languages using Latin script. The long-term goal is to support as many languages and scripts as possible without the need for sophisticated or large dictionary-based language-detection or word-segmentation features.

For further reading, see the W3C's primer on Personal Names Around the World.

Development

The Namae source code is hosted on GitHub. You can check out a copy of the latest code using Git:

$ git clone https://github.com/berkmancenter/namae.git

To get started, generate the parser and run all tests:

$ cd namae
$ bundle install
$ bundle exec rake features
$ bundle exec rake spec

If you've found a bug or have a question, please open an issue on the issue tracker. Or, for extra credit, clone the Namae repository, write a failing example, fix the bug and submit a pull request.

Credits

Namae was written as a part of a Google Summer of Code project. Thanks Google!

Copyright

Copyright (c) 2013-2020 Sylvester Keil

Copyright (c) 2012 President and Fellows of Harvard College.

Namae is dual licensed under the AGPL and a BSD-style license.

namae's People

Contributors

benbalter, hackling, inukshuk, lostmahbles, matthewstephens, stuzart, terencedignon

namae's Issues

The Honorable title

The Honorable and some other titles are not handled correctly. Is there a way to override racc rules?

Namae.parse('The Honorable Chelsea Block')
=> [#<Name family="Block" given="The Honorable Chelsea">]

Option to disable specific parse rules?

Hello! It would be very helpful if specific rules could be disabled entirely. Similar to #12, I'm getting nil back for someone whose last name is Herr. Our names are strictly some combination of first/last, with no titles or appellations, so it would be extremely helpful if we could just disable some rules entirely.

Namae isn't thread safe

This causes issues, especially if you are using a threaded webserver like puma https://github.com/puma/puma

The following code reproduces the error:

require 'namae'

name1_s = "Foo Bar"
name2_s = "Baz"

name1 = Namae.parse(name1_s).first
r1 = [name1.given, name1.family]

name2 = Namae.parse(name2_s).first
r2 = [name2.given, name2.family]

$issues = []

def verify(str, expected)
  name = Namae.parse(str).first
  if name == nil
    $issues << {error: "Parser returned nil"}
    return
  end

  result = [name.given, name.family]

  if result != expected
    $issues << {
      name: str,
      expected: expected,
      result: result,
    }
  end
end

N = 10_000

[[name1_s, r1], [name2_s, r2]].map do |str, expected|
  Thread.new do
    N.times do
      verify(str, expected)
    end
  end
end.each(&:join)

require "pp"
pp $issues

Example output:

[{:name=>"Baz", :expected=>["Baz", nil], :result=>["Foo", "Bar"]},
 {:name=>"Foo Bar", :expected=>["Foo", "Bar"], :result=>["Baz", nil]},
 {:name=>"Baz", :expected=>["Baz", nil], :result=>[" Ba", "r"]},
 {:error=>"Parser returned nil"}]
[{:name=>"Foo Bar", :expected=>["Foo", "Bar"], :result=>["Baz", nil]},
 {:name=>"Baz", :expected=>["Baz", nil], :result=>["Bar", nil]},
 {:name=>"Foo Bar", :expected=>["Foo", "Bar"], :result=>["Foo", "Baz"]},
 {:error=>"Parser returned nil"},
 {:error=>"Parser returned nil"},
 {:name=>"Foo Bar", :expected=>["Foo", "Bar"], :result=>["Foo", nil]}]

Presumably this is because it uses a stateful singleton.

include Singleton

What is dropping particle

I can see in the members that dropping_particle is one of the things that can be parsed, but I'm unable to find out what it means. Could you please give me sample names that have one, so that I can handle this case as well?

Fails to detect name parts if suffix precedes comma

These "last_name suffix, first_name" names generate an empty array.
Gump Jr., Bubba
Gump Jr., Bubba B.
Gump II, Bubba
Gump Jr., B.B.

However, when split-reverse-joined, the names parse correctly.
Bubba Gump Jr.
Bubba B. Gump Jr.
Bubba Gump II
B.B. Gump Jr.

This is not foolproof, because some names are formatted "first_name last_name, suffix". My conversion code below would actually make the suffix the first name.

The code I used for correcting the above names:

name = 'Gump Jr., Bubba B.'
coerced_name = name.split(',').reverse.join(' ').squish
namae = Namae.parse(coerced_name)
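
A slightly more defensive variant of this workaround can be sketched in plain Ruby. It only reorders the input when the part before the comma ends in a suffix, so "first_name last_name, suffix" input is left alone, and it avoids String#squish, which comes from ActiveSupport. The method name to_display_order and the suffix list are assumptions made for this sketch; they are not part of Namae.

```ruby
# Suffixes recognized by this sketch (an assumption, not Namae's own list).
SORT_ORDER_SUFFIX = /\A(?:JR|Jr|jr|SR|Sr|sr|[IVX]{2,})\.?\z/

# Rewrites "Family Suffix, Given" as "Given Family Suffix".
# Any other input is returned unchanged.
def to_display_order(name)
  head, tail = name.split(',', 2).map(&:strip)
  return name if tail.nil? || tail.empty?
  # Only reorder when the last word before the comma looks like a suffix.
  return name unless head.split.last.to_s.match?(SORT_ORDER_SUFFIX)

  "#{tail} #{head}"
end

puts to_display_order('Gump Jr., Bubba B.')
puts to_display_order('Bubba Gump, Jr.')
```

The result could then be passed to Namae.parse as in the snippet above.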

Can't parse name with uppercase symbols.

Example
Namae.parse 'Benedict XVI, Pope'
=> []

With title case it works:
Namae.parse 'Benedict Xvi, Pope'
=> [#<Name family="Benedict Xvi" given="Pope">]

Support for suffixes

name = Namae.parse "Joe Smith Jr."
[#<Name family="Jr." given="Joe Smith">]

I'd expect the output to be

[#<Name family="Smith" given="Joe" suffix="Jr">]

I looked at the parser, and it looks like this is planned, but not implemented?

Sort Order Breaks on Suffixes

Version 0.9.1 seems to incorrectly parse names with suffixes when they are given in sort-order.

E.g.

Namae.parse("Griffey, Ken Jr")  # => nil

Prefixed and Compound Word Surnames

Namae doesn't seem to work correctly with prefixed and compound word family names. For example:

>> Namae::Name.parse("Justin Du Bois")
 => #<Name family="Bois" given="Justin Du"> 

More about prefixed and compound family names here: http://www.barbarahenritze.com/index.php/genealogical-research/genealogy-articles/32-prefixes-suffixes-hyphenates-compound-words-and-titles

I'll try to come up with a fix at some point but I figured I'd open up an issue to get the ball rolling.

Can't parse lowercase given names.

Real world examples:

[32] pry(main)> Namae.parse("bell hooks").first.given
=> nil
[33] pry(main)> Namae.parse("danah boyd").first.given
=> nil

Can't parse names where nickname precedes rest of name

Example Name:
'John' Johnathan Q. Public

Expected Behavior:
[#<Name family="Public" given="Johnathan Q." nick="John">]

Actual Behavior:
[]

Summary: None of the name is parsed if a nickname precedes the other name attributes.

Reproduced on ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin18]

How married are you to that AGPL license?

My employer can't continue to use this gem due to the license. Would you consider changing it to just a BSD license or perhaps something else? I know it's a long shot but it doesn't hurt to ask~ Apologies if this is inappropriate.

Suffix with comma separated name edge case

If the suffix is attached to the family name in sort-order, the name does not resolve. Example:

Namae.parse 'DOMINIC G LEWIS JR'
=> [#<struct Namae::Name family="LEWIS", given="DOMINIC G", suffix="JR", particle=nil, dropping_particle=nil, nick=nil, appellation=nil, title=nil>]

Namae.parse 'LEWIS JR, DOMINIC G'
=> []

Chinese Names

Does Namae support Chinese names?
Take, for example, "Xi Jinping", where Xi is the family name and Jinping is the given name.

Family name missing on first record if it exists only after a separator

Input

Namae.parse('Larry and Irene Smeltzer')

Expected Output:

# => [#<Name family="Smeltzer" given="Larry">, #<Name family="Smeltzer" given="Irene">]

Actual Output:

#=> [#<Name given="Larry">, #<Name family="Smeltzer" given="Irene">]

Proposed Solution:
If two names are separated by and or & (probably not ;), and one of the two names does not have a family name it should inherit the family name of the other record. I'm going to attempt to solve this myself but the autogenerated racc file is a little intimidating.
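
As an alternative to changing the grammar, the proposal could be approximated as a post-processing step. The sketch below uses a stand-in Struct (SketchName) in place of the parsed name objects, and inherit_family is a hypothetical helper, not part of Namae. One limitation: after parsing, the separator is no longer known, so this version cannot distinguish "and" from ";".

```ruby
# Stand-in for parsed name objects; only the two fields used here.
SketchName = Struct.new(:given, :family)

# Gives each entry without a family name the family name of the
# next entry in the list that has one. Entries are not mutated.
def inherit_family(names)
  names.each_with_index.map do |name, index|
    next name if name.family

    donor = names.drop(index + 1).find(&:family)
    donor ? SketchName.new(name.given, donor.family) : name
  end
end

parsed = [SketchName.new('Larry', nil), SketchName.new('Irene', 'Smeltzer')]
inherit_family(parsed).each { |n| puts "#{n.given} #{n.family}" }
```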

Gender pronouns

Hi. First, thank you for this great project. It has already helped me a lot.
I have now, for the first time, come across an issue:
When people add gender pronouns (e.g. (he/him)) to their names, parsing fails altogether.

Example:

2.7.4 :003 > Namae.parse("Max Power (he/him, er/ihm)")
 => []
2.7.4 :004 > Namae.parse("Max Power (er/ihm)")
 => []
2.7.4 :005 > Namae.parse("Max Power")
 =>
[#<struct Namae::Name
  family="Power",
  given="Max",
  suffix=nil,
  particle=nil,
  dropping_particle=nil,
  nick=nil,
  appellation=nil,
  title=nil>]

I read about adding custom suffixes, but then I thought that this might become its own part.
I am currently trying to build that in and open a PR, but it is going slowly.

So very open for hints or other opinions.
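
Until the parser handles this itself, one possible workaround is to split the pronouns off before parsing. The following is a minimal sketch in plain Ruby; split_pronouns and the pattern are assumptions for illustration. It treats a trailing parenthesized group as pronouns only if it contains a slash, so a nickname in parentheses is left in place.

```ruby
# A trailing "(...)" group that contains at least one slash.
TRAILING_PRONOUNS = %r{\s*\(([^()]*/[^()]*)\)\s*\z}

# Returns [name_without_pronouns, pronouns_or_nil].
def split_pronouns(input)
  match = input.match(TRAILING_PRONOUNS)
  return [input.strip, nil] unless match

  [match.pre_match.strip, match[1].strip]
end

name, pronouns = split_pronouns('Max Power (he/him, er/ihm)')
puts name
puts pronouns
```

The first element could then be handed to Namae.parse, with the pronouns stored separately.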

Common Surname Particles

There are many names that use capitalized surname particles: De, De La, Des, Di, St, etc. that should be included in the family name. (https://academia.stackexchange.com/questions/15326/how-to-deal-with-particles-in-a-last-name-in-a-reference-list)

Currently something like "Carlos De Silva" would be parsed as #<Name family="Silva" given="Carlos De" particle=nil>

I'm happy to make a try at updating the grammar (knew I should've paid more attention in Prog Lang) but saw a comment in #33 that might make me think you'd prefer this to be post-processed?

incorrectly parsed name: "laxmi zasdfasdf"

I ran across this name (though this is an anonymized version, obviously) that isn't parsing correctly:

irb(main):006:0> Namae.parse("laxmi zasdfasdf").first
=> #<Name family="zasdfasdf" particle="Laxmi">

I would've expected:

irb(main):006:0> Namae.parse("laxmi zasdfasdf").first
=> #<Name family="zasdfasdf" given="laxmi">

Parse fails on the surname Lord

Discovered this edge case when I encountered an input of "Jeffrey Lord":

Namae::Name.parse("Jeffrey Lord")
=> #<struct Namae::Name family=nil, given=nil, suffix=nil, particle=nil, dropping_particle=nil, nick=nil, appellation=nil, title=nil>
Namae::Name.parse("Jeffrey P. Lord")
=> #<struct Namae::Name family="P.", given="Jeffrey", suffix=nil, particle=nil, dropping_particle=nil, nick=nil, appellation=nil, title="Lord">

This is presumably because of "lord" being one of the supported titles. It's the only one that's also a common surname.

It works fine in "last, first" format:

Namae::Name.parse("Lord, Jeffrey")
=> #<struct Namae::Name family="Lord", given="Jeffrey", suffix=nil, particle=nil, dropping_particle=nil, nick=nil, appellation=nil, title=nil>

But of course it would be great not to have to worry about it when you can't control your input format.

Really helpful gem though, thanks for your work!

Suffixes of V (fifth) or X (tenth) are not parsed correctly

If a name has a suffix of V (fifth), it is considered a family name, not the suffix:

>> Namae.parse("Adam Burren V")
=> [#<Name family="V" given="Adam Burren">]

This is because the suffix regex is:

/\s*\b(JR|Jr|jr|SR|Sr|sr|[IVX]{2,})(\.|\b)/

The part that accounts for roman numeral suffixes is [IVX]{2,}, which looks for 2 or more characters, while V or X would only be one.

Perhaps this is intentional, because looking for a single character may be problematic and cause a lot of false positives, but I wanted to create an issue for it and see.
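
The behaviour of the pattern quoted above can be checked in isolation, without the parser. In this sketch, suffix_in is a hypothetical helper that returns the first capture group; the regular expression itself is copied from this issue.

```ruby
# The suffix pattern as quoted in this issue.
ISSUE_SUFFIX_PATTERN = /\s*\b(JR|Jr|jr|SR|Sr|sr|[IVX]{2,})(\.|\b)/

# Returns the matched suffix, or nil if the pattern finds none.
def suffix_in(name)
  match = name.match(ISSUE_SUFFIX_PATTERN)
  match && match[1]
end

p suffix_in('Adam Burren III')
p suffix_in('Adam Burren V')
```

A single "V" or "X" never satisfies the {2,} quantifier, and because the pattern is case-sensitive a lowercase "iv" is not matched either.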

Lists of names with variable separators

Is it outside the scope of namae to parse lists of names? For example, I have variably structured lists of names such as

  • John D. Smith; Jack R. Johnson; Emily Tanner
  • Smith, John D.; Johnson, Jack R.; Tanner, Emily
  • John D. Smith, Jack R. Johnson & Emily Tanner
  • C. Foster; C. Hamel, C. Desroches
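
One way to handle such lists outside the parser is to split on the unambiguous separators first and parse each piece on its own. The sketch below is plain Ruby; split_names is a hypothetical helper. It deliberately leaves commas alone, because a comma may separate two names or the two halves of a sort-order name, and telling those apart needs more context than a regular expression has.

```ruby
# Splits on ";", "&" and the standalone word "and". Commas are kept.
def split_names(list)
  list.split(/\s*(?:;|&|\band\b)\s*/).map(&:strip).reject(&:empty?)
end

puts split_names('Smith, John D.; Johnson, Jack R.; Tanner, Emily')
```

With input such as "John D. Smith, Jack R. Johnson & Emily Tanner", the first two names stay joined by their comma and would still need separate handling.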

Namae doesn't parse PHD titles

We're using Faker to generate our test names, and I just noticed that Namae cannot parse a PHD title.

Namae.parse("Bernardo Franecki, PhD")

returns an empty array.

After a bit of googling, looks like this is a tangible use case that Namae should handle.
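
As a stopgap, a trailing credential could be removed before the string reaches the parser. This is a minimal sketch; the list of credentials and the helper name split_credential are assumptions, and a real list would need to be longer.

```ruby
# Credentials this sketch knows about, only when they follow a final comma.
TRAILING_CREDENTIAL = /,\s*(PhD|Ph\.D\.|ESQ|Esq\.?)\s*\z/

# Returns [name_without_credential, credential_or_nil].
def split_credential(input)
  match = input.match(TRAILING_CREDENTIAL)
  return [input.strip, nil] unless match

  [match.pre_match.strip, match[1]]
end

p split_credential('Bernardo Franecki, PhD')
```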

Bunch of names failing to parse

Allow me to drop a bunch of real-world examples of names which failed to parse, in case they may be helpful in further perfecting the parsing logic. Perhaps eventually these cases can be accounted for?

P.S. these are 53 names failing to parse out of about 100,000 which Namae successfully parsed, apparently, so overall pretty good name parsing logic you got there 😉👌👌👏👏

Mr. Gunner Richard Dornonville de la Co Jr.
Lt Col Marc Lamar Warren
Mr. Leroy Davis, Jr.
Mr. Jonathan H (Jason) Warner
Ms. Elizabeth (Lisa) Carol Vega
Mr. James N. Altiere, III
Mr. Pei-Yuan (Ken) Wu
Lt Col Walt L Trierweiler
Mr. Gillis E (Beau) Powell III
Mr. John Allen Madden, Jr.
Mr. William Thomas Haruch-Roshko, Jr.
Ms. Natalie Lord
Mr. Gary E. Simmons, Jr.
Ms. Yeong (Susan) Joo Kang
Mr. David Cantor
Mr. Bernard Pastor
Mr. Michael von Krogh Foster II
Mr. John Stanley Lord Jr.
Ms. Bettina Cantor Hollo
Mr. Miguel ngel Arn-Roque
Mr. John C. Scarborough, Jr.
Mr. Otto Carlos de Cordoba Jr.
Mr. Domingo Gamboa De la Fuente Jr.
Mr. Harold (Gus) Fletcher
Mr. Carlos Manuel de Cespedes III
Mr. Jorge Victor de Ona Jr.
Mr. Carlos Manuel de la Cruz III
Ms. Suzette Lord
Ms. Jeanette Pastor Rodriguez
Mr. Robert Emmett McNally, IV
Mr. Michael Jr Arguez
Mr. Joseph David Coronato, Jr.
Ms. Maria Lord
Ms. June (Hua) Zhou
Mr. Jerry E. Nichols, Jr.
Mr. Frederick (Fritz) Gray
Mr. Maj Vasigh
Ms. Yamila Pastor DeHombre
Lt Col Ronald Steven Frankel
Ms. Danielle Lord
Mr. Christopher Lord
Mr. Guillermo E deJesus Gomez Jr.
Mr. Michael S ('Mike') Hagen
Mr. Henry Cantor Cohen
Mr. William Charles Richardson,III
Mr. E G -Dan- Boone
Lt Col Edwin D Selby
Mr. Stephen deHart Schwarz II
Mr. Robert L Lord Jr.
Mr. Thomas Scott Smith, III
Mr. Richard Elder Crum
Mr. Jos Antonio Isasi, II

Parse middle initials?

It would be helpful when trying to compare/match names using family and given that any middle initials are separated from the given.

Is it possible to add functionality to parse away middle initials separately?

ESQ suffix breaks name parsing

> Namae.parse('BRYANT H DUNIVAN JR, ESQ')
=> []
> Namae.parse('JAMES W GOVIN, ESQ')
=> [#<Name family="JAMES W GOVIN" given="ESQ">]

Parsing Exchange-formatted names?

Exchange uses "Lastname Firstname prefixes", which is probably the most braindead format they could think of, but is there a way to hint Namae to parse these?

middle names

namae = Namae.parse( 'Homer Jay Simpson')
=> family="Simpson", given="Homer Jay"

Is it possible to tell Namae that the second name is a middle name instead of a given name? I understand that technically, it is a given name. Just looking for a way to parse the middle name in one pass.

In the meantime, this code works by making the assumption that a person has only one first name.

first_name, middle_name = namae.given.split(' ', 2) if namae.given
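
The same idea can be wrapped in a small nil-safe helper. In this sketch, split_given is a hypothetical method that takes the given-name string (for example the value of Name#given) and keeps the assumption stated above: the first word is the first name and everything after it is treated as middle names or initials.

```ruby
# Returns [first_name, middle_names_or_nil]; both are nil for blank input.
def split_given(given)
  return [nil, nil] if given.nil? || given.strip.empty?

  first, middle = given.strip.split(/\s+/, 2)
  [first, middle]
end

p split_given('Homer Jay')
p split_given('Alan M.')
```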

How do I configure the parser at run time?

This is probably a question more than it is an issue. Once I figure this out, I can create a PR to update the documentation.

I'm unable to find any example of how to configure the Namae parser at run time. I've tried:

Namae.parse('Bob Bailey iv') #=> [#<struct Namae::Name family="iv", given="Bob Bailey", suffix=nil, ...
Namae::Parser.new.parse('Bob Bailey iv') #=> [#<struct Namae::Name family="iv", given="Bob Bailey", suffix=nil, ...
# make the suffix case insensitive
Namae.configure { |config| config[:suffix] = /\s*\b(JR|Jr|jr|SR|Sr|sr|[IVX]{2,})(\.|\b)/i }
Namae.parse('Bob Bailey iv') #=> no change
Namae::Parser.new.parse('Bob Bailey iv') #=> [#<struct Namae::Name family="Bailey", given="Bob", suffix="iv", ...

Shouldn't Namae.parse be using the new configuration? It seems like Namae.options and Thread.current[:namae].options do not get updated.

However, Namae::Parser.defaults does get updated. So, assigning Thread.current[:namae] = Namae::Parser.new after the configure call then makes Namae.parse work with the new configuration. Is this assignment required?

Thanks for any help.
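
The behaviour described in this issue is what one would expect from a cached, thread-local instance that copies the defaults when it is created. The following is a plain-Ruby sketch of that pattern using a stand-in class (SketchParser). It illustrates the mechanism only and is an assumption about, not a copy of, Namae's internals.

```ruby
# Hypothetical stand-in that mimics a thread-local, cached parser.
class SketchParser
  @defaults = { suffix: :strict }

  class << self
    attr_reader :defaults

    # One cached instance per thread, built from the defaults on first use.
    def instance
      Thread.current[:sketch_parser] ||= new
    end
  end

  attr_reader :options

  def initialize
    # Snapshot: later changes to the defaults do not reach this instance.
    @options = self.class.defaults.dup
  end
end

SketchParser.instance                              # first use caches an instance
SketchParser.defaults[:suffix] = :relaxed          # "configure" afterwards
stale = SketchParser.instance.options[:suffix]     # still the old value

Thread.current[:sketch_parser] = SketchParser.new  # replace the cached instance
fresh = SketchParser.instance.options[:suffix]     # now the new value

p stale, fresh
```

Under this model, changing the defaults before the first parse, or creating a new parser after changing them, are the two ways to make the new settings take effect. This matches the README's advice to change the configuration before using the parser.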
