
name-parser's Introduction

THE ICONIC Name Parser


Purpose

This is a universal, language-independent name parser.

Its purpose is to split a single string containing a full name, possibly including salutation, initials, suffixes etc., into meaningful parts like firstname, lastname, initials, and so on.

It is mostly tailored towards English names but works pretty well with non-English names as long as they use Latin spelling.

E.g. Mr Anthony R Von Fange III is parsed to

  • salutation: Mr.
  • firstname: Anthony
  • initials: R
  • lastname: von Fange
  • suffix: III

This package has been used by The Iconic in production for years, successfully processing hundreds of thousands of customer names.

Features

Supported patterns

This parser is able to handle name patterns with and without commas:

... [firstname] ... [lastname] ...
... [lastname] ..., ... [firstname] ...
... [lastname] ..., ... [firstname] ..., [suffix]

Supported parts

  • salutations (e.g. Mr, Mrs, Dr, etc.)
  • first name
  • middle names
  • initials (single letters, possibly followed by a dot)
  • nicknames (parts within parentheses, brackets etc.)
  • last names (also supports prefixes like von, de etc.)
  • suffixes (Jr, Senior, 3rd, PhD, etc.)

Other features

  • multi-language support for salutations, suffixes and lastname prefixes
  • customizable nickname delimiters
  • customizable normalisation of all output strings (original values remain accessible)
  • customizable whitespace

Examples

More than 60 different successfully parsed name patterns can be found in the parser unit test.

Setup

composer require theiconic/name-parser

Usage

Basic usage

<?php

$parser = new TheIconic\NameParser\Parser();

$name = $parser->parse($input);

echo $name->getSalutation();
echo $name->getFirstname();
echo $name->getLastname();
echo $name->getMiddlename();
echo $name->getNickname();
echo $name->getInitials();
echo $name->getSuffix();

print_r($name->getAll());

echo $name;

An empty string is returned for missing parts.

Special part retrieval features

Explicit last name parts

You can retrieve last name prefixes and pure last names separately with

echo $name->getLastnamePrefix();
echo $name->getLastname(true); // true enables strict mode: returns the pure last name only, without prefixes

Nick names with normalized wrapping

By default, getNickname() returns the pure string of nicknames. However, you can pass true to have the same normalised parentheses wrapping applied as in echo $name:

echo $name->getNickname(); // The Giant
echo $name->getNickname(true); // (The Giant)

Re-print given name in the order as entered

You can re-print the parts that form a given name (that is, first name, middle names and any initials) in the order in which they were entered, while still applying normalisation, via getGivenName():

echo $name->getGivenName(); // J. Peter M.

Re-print full name (actual name parts only)

You can re-print the full name, that is the given name as above followed by any last name parts (excluding any salutations, nick names or suffixes) via getFullName():

echo $name->getFullName(); // J. Peter M. Schluter

Setting Languages

$parser = new TheIconic\NameParser\Parser([
    new TheIconic\NameParser\Language\English(), // default
    new TheIconic\NameParser\Language\German(),
]);

Setting nickname delimiters

$parser = new TheIconic\NameParser\Parser();
$parser->setNicknameDelimiters(['(' => ')']);

Setting whitespace characters

$parser = new TheIconic\NameParser\Parser();
$parser->setWhitespace("\t _.");

Limiting the position of salutations

$parser = new TheIconic\NameParser\Parser();
$parser->setMaxSalutationIndex(2);

This will require salutations to appear within the first two words of the given input string. The default is half the number of words in the input string, meaning that the salutation may effectively occur anywhere within the first half of the name parts.

Adjusting combined initials support

$parser = new TheIconic\NameParser\Parser();
$parser->setMaxCombinedInitials(3);

Combined initials are combinations of several uppercase letters without separating spaces, e.g. DJ or J.T. The parser will treat such sequences of uppercase letters (with optional dots) as combined initials and parse them into individual initials. This value sets the maximum number of uppercase letters in a single name part that are recognised as combined initials. Parts with more letters than the specified maximum will not be parsed into initials and hence will most likely be parsed into first or middle names.

The default value is 2.

To disable combined initials support, set this value to 1.
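
To make the rule above concrete, here is a minimal sketch of the splitting behaviour as described in this section. It is an illustration only and not the parser's actual implementation; the function name splitCombinedInitials and its return convention (null for "not combined initials") are assumptions made for this example.

```php
<?php

/**
 * Hypothetical helper illustrating the documented rule, not library code.
 * Returns the individual letters if $part looks like combined initials
 * with at most $max uppercase letters, or null otherwise.
 */
function splitCombinedInitials(string $part, int $max = 2): ?array
{
    // Accept only uppercase letters, each optionally followed by a dot.
    if (!preg_match('/^(?:\p{Lu}\.?)+$/u', $part)) {
        return null;
    }

    // Collect the uppercase letters, ignoring the dots.
    preg_match_all('/\p{Lu}/u', $part, $matches);
    $letters = $matches[0];

    // A single letter is a plain initial; too many letters is treated as a name.
    if (count($letters) < 2 || count($letters) > $max) {
        return null;
    }

    return $letters;
}

echo implode(' ', splitCombinedInitials('J.T.') ?? []), "\n"; // J T
```

With the default maximum of 2, a part such as ABC is rejected by this sketch, and a maximum of 1 rejects every multi-letter part, which mirrors how the setting disables the feature.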

Tips

Provide clean input strings

If your input string consists of more than just the name and directly related bits like salutations, suffixes etc., any additional parts can easily confuse the parser. It is therefore recommended to pre-process any non-clean input to isolate the name before passing it to the parser.
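
What counts as "non-clean" depends entirely on your data source, so the following is only a sketch of one possible pre-processing step. The helper name isolateName and the specific rules (dropping e-mail addresses and standalone numbers) are assumptions for illustration and are not part of this package.

```php
<?php

/**
 * Hypothetical pre-processing helper: removes obvious non-name debris
 * before the string is handed to the parser.
 */
function isolateName(string $raw): string
{
    // Drop e-mail addresses, with or without surrounding angle brackets.
    $clean = preg_replace('/<?[^\s<>@]+@[^\s<>@]+>?/u', ' ', $raw) ?? $raw;

    // Drop standalone numbers (e.g. customer ids) but keep tokens like "3rd".
    $clean = preg_replace('/(?<!\S)\d+(?!\S)/u', ' ', $clean) ?? $clean;

    // Collapse repeated whitespace and trim the ends.
    $clean = preg_replace('/\s+/u', ' ', $clean) ?? $clean;

    return trim($clean);
}

echo isolateName('Mr John  Smith <john.smith@example.com> 12345'), "\n"; // Mr John Smith
```

The cleaned string can then be passed to parse() as in the basic usage example.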

Multi-pass parsing

We have not played with this, but you may be able to improve results by chaining several parses in sequence. E.g.

$parser = new Parser();
$name = $parser->parse($input);
$name = $parser->parse((string) $name);
...

You can even compose your new input string from individual parts of a previous pass.

Dealing with names in different languages

The parser is primarily built around the patterns of English names but tries to be compatible with names in other languages. Problems occur with different salutations, last name prefixes, suffixes etc. or in some cases even with the parsing order.

To solve problems with salutations, last name prefixes and suffixes you can create a separate language definition file and inject it when instantiating the parser, see 'Setting Languages' above and compare the existing language files as examples.

To deal with parsing order you may want to reformat the input string, e.g. by simply splitting it into words and reversing their order. You can even let the parser run over the original string and then over the reversed string and then pick the best results from either of the two resulting name objects. E.g. the salutation from the one and the lastname from the other.
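
The word-reversal step mentioned above could look like the sketch below. The helper name reverseNameWords is an assumption for this example; it only prepares a second input string and does not decide which result is better.

```php
<?php

/**
 * Hypothetical helper: reverses the word order of a name string.
 */
function reverseNameWords(string $input): string
{
    // Split on runs of whitespace and drop empty fragments.
    $words = preg_split('/\s+/u', trim($input), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    return implode(' ', array_reverse($words));
}

echo reverseNameWords('Nguyễn Quốc Thái'), "\n"; // Thái Quốc Nguyễn
```

You would then call parse() once with the original string and once with the reversed string, and combine parts from the two resulting name objects as described.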

The name parser has no built-in language detection. However, you may already ask the user for their nationality in the same form. If you do, you may want to narrow the language definition files passed into the parser to the given language, perhaps with a fallback like English. You can also use this information to prepare the input string as outlined above.

Alternatively, Patrick Schur has a PHP language detection library that seems to deliver astonishing results. It won't give you much luck if you run it over the name input string only, but if you have any more text from the person in their actual language, you could use this to detect the language and then proceed as above.

Gender detection

Gender detection is outside the scope of this project. Detecting the gender from a name often requires large lists of first name to gender mappings.

However, you can use this parser to extract salutation, first name and nick names from the input string and then use these to implement gender detection using another package (e.g. this one) or service.

Having fun with normalisation

Writing different language files can not only be useful for parsing, but you can remap the normalised versions of salutations, prefixes and suffixes to transform them into something totally different.

E.g. you could map Ms. to princess of the kingdom of and then output the parts in appropriate order to build a pipeline that automatically transforms e.g. Ms. Louisa Lichtenstein into Louisa, princess of the kingdom of Lichtenstein. Of course, this is a silly and rather contrived example, but you get the gist.

Of course, this can also be used in more useful ways, e.g. to spell out abbreviated titles, like Prof. as Professor.

License

THE ICONIC Name Parser library for PHP is released under the MIT License.

name-parser's People

Contributors

codeduck42, estringana, francislavoie, joeynovak, lexmark-haputman, udlobster, wyrfel


name-parser's Issues

Lastname is one of the recognised prefixes

Thanks for your work on this library - much appreciated.

We had a failure today for "yumeng du" as a name.
Looking at the parts array, the 'du' is not ascribed to a part.
I'm guessing that this is because it's one of the recognised prefixes?

Issue with initials before and after firstname

I wanted to use this library for cleanup of publication authors and map them to a common format like

$name->getLastname() . ', ' . $name->getFirstname() . ' ' . $name->getMiddlename() . ' ' . $name->getInitials()

This works flawlessly for nearly every name. But as my own second given name is my primary one, I often don't spell out my first one, like:
Schuler, J. Peter M.
which results in

Array
(
    [firstname] => Peter
    [initials] => J. M.
    [lastname] => Schuler
)

and thus I can't retrieve the correct order of names.

I understand that this is quite a special case, as Germany is one of the few countries where multiple first names are possible as well as middle names, and where any of the first names can be used. Additionally, up until a few years ago there was a concept of a primary first name, which in my case is/was my second one, hence that kind of writing.

Currently I don't understand all of the parsing, but might supply a pull request if I get this to work while still fulfilling the other test cases.

Fails to parse name with lastname first and middlename last

The test case:

            [
                'Smith, John Eric',
                [
                    'lastname' => 'Smith',
                    'firstname' => 'John',
                    'middlename' => 'Eric',
                ]
            ]

It doesn't parse Eric as a part, it gets skipped.

I tried making a fix, but gave up after a bit. First, I noticed that the counts of parts in MiddlenameMapper is 2, i.e. ["John", "Eric"] so the first condition there makes it quit out.

I changed that check to < 2 to get further. Next, it fails in the mapFrom loop because $k = $start; $k < $length - 1; $k++ becomes $k = 0; 0 < 1, so it only does one iteration and misses the Eric part.

I tried to make that loop not miss the last part, but obviously that will make it fail for every other test case, so I gave up there.

Any idea how this case could be supported? My initial thought is the fact that Lastname was already parsed should be passed in to MiddlenameMapper, so it knows to not skip the last part, but I'm not sure how that information should be passed down the line.

Issue with parsing Vietnamese names

Hello there,

I have a name like this:

Nguyễn Quốc Thái

After parsing, the full name I got was "NguyễN QuốC TháI"; notice how the casing is messed up.

One more thing is that Vietnamese names are written like this: Lastname MiddleName Firstname, how should this be handled?

Prefixes do not get recognized

While adding the Dutch language the van ’t last name prefix could not get detected.

Adding it to a new language's suffixes will not detect it either, so this has something to do with the parsing internals.

Did add a test in draft PR #35

Test data

Expected: Charlotte van ’t Wout

Actual: Charlotte Van ’T Wout

UTF-8 encoding issue

If you pass a name like this 'Ren\xE9e Osborne', parse() will accept this input, but it will trigger an Error:
Error: Return value of TheIconic\NameParser\Part\AbstractPart::camelcase() must be of the type string, null returned

To prevent this I'm using ...->parse(utf8_encode($nameString));, but you should probably cover this issue in your repo.
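
Note that utf8_encode() is deprecated as of PHP 8.2 and always assumes ISO-8859-1 input, so it would double-encode strings that are already valid UTF-8. A possible caller-side alternative is sketched below. It assumes the mbstring extension is available and that invalid input is ISO-8859-1; the helper name toUtf8 is made up for this example.

```php
<?php

/**
 * Hypothetical helper: returns $input unchanged if it is valid UTF-8,
 * otherwise converts it from an assumed fallback encoding.
 */
function toUtf8(string $input, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // Leave valid UTF-8 untouched to avoid double encoding.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    return mb_convert_encoding($input, 'UTF-8', $fallbackEncoding);
}

echo toUtf8("Ren\xE9e Osborne"), "\n"; // Renée Osborne
```

The result can then be passed to parse() in place of the raw string.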

[Feature request] Be able to access the lastname prefix + lastname separately

I really like how the name-parser can recognise the lastname prefix correctly. Now I would like to be able to get this prefix and the lastname separately.
This could be a setting (maybe related to the language setting, where in Dutch the lastname prefix is a separate part of the name (and for example sorting goes by last name, without the prefix).)

Example:

Frank van Delft (Dutch name), is recognised as:

TheIconic\NameParser\Name Object
(
    [parts:protected] => Array
        (
            [0] => TheIconic\NameParser\Part\Firstname Object
                (
                    [value:protected] => Frank
                )

            [1] => TheIconic\NameParser\Part\LastnamePrefix Object
                (
                    [normalized:protected] => van
                    [value:protected] => Van
                )

            [2] => TheIconic\NameParser\Part\Lastname Object
                (
                    [value:protected] => Delft
                )

        )

)

Doing:

echo $name->getFirstname() . "\n";
echo $name->getLastnamePrefix() . "\n";
echo $name->getLastname() . "\n";

Gives:

Frank
van
Delft

Multibyte special character messes up initials

We parsed the author name from a page, where the input name is © Pavel Voinau.

Expected result

No initials should be returned; getInitials() should return null.

Actual result

When dumping the value, name-parser exports "Â ©"


How to fix

  • make initial parser multibyte safe
  • do not parse copyright sign

Overriding or appending to arrays (such as salutations)

This library is amazing! Thanks so much for providing it!

It's the best one I've found.

I was wondering if there is an easy way to initialize the library with configurable settings, such as if I want to change the salutations that I see in https://github.com/theiconic/name-parser/blob/master/src/Language/English.php

E.g. I may want to add "Prof." as a salutation.

Or maybe there are occasions where I'd want to customize the other constant arrays there.

Other than forking your repo and editing the code, I haven't figured it out yet.

Thanks again!

Lastname Prefix not recognised if no first name

Input: Vincent Van Gogh
Expected: Vincent van Gogh
Actual: Vincent van Gogh

Input: Mr Vincent Van Gogh
Expected: Mr Vincent van Gogh
Actual: Mr. Vincent van Gogh

Input: Mr Van Gogh
Expected: Mr van Gogh
Actual: Mr. Van Gogh

I would have expected this to be detected as a salutation and lastname.

Is this still maintained?

We find this library useful but there are some things we would like to fix / fixes by others we would like to use. These are mostly in the open PR queue but it looks like the last commit was in 2019.

Are you still maintaining this? Interested in co-maintainers? Happy for this to be forked?

Output in vCard (RFC6350) matching form

Following on from this question, I would really appreciate a feature that solves my needs.
My idea is that a "vCard function" returns an array whose keys are vCard property names and whose values are the corresponding name parts.
Steps:

  1. Try to figure out whether it's a company or a person.
    From my point of view, only a language-related list of keywords (user extendable) can solve this.
    If this matches, then return 'FN' and 'ORG' containing the name string as before.
  2. If the name string is obviously not a company name, then try to split it into its components.
  3. Rearrange the components and return them as 'FN', 'N' or maybe 'NICKNAME'.

Here is my attempt from last night (draft):

    /**
     * get an array of name properties with vCard property as key
     *
     * @param string $realName
     * @return array
     */
    public function getNameProperties(string $realName)
    {
        $nameParts = explode(',', $realName);                       // "lastname, firstname"
        if (count($nameParts) == 2) {                               // it's a person
            $nameParts  = $this->parser->parse($realName);
            $salutation = $nameParts->getSalutation();
            $firstName  = $nameParts->getFirstname();
            $lastName   = $nameParts->getLastname();
            $middleName = $nameParts->getMiddlename();
            $nickName   = $nameParts->getNickname();
            $initials   = $nameParts->getInitials();
            $suffix     = $nameParts->getSuffix();
            if (!empty($middleName) && empty($initials)) {
                $additionalName = $middleName;
            } elseif (empty($middleName) && !empty($initials)) {
                $additionalName = $initials;
            } elseif (empty($middleName) && empty($initials)) {
                $additionalName = '';
            } else {
                $additionalName = implode(',', [$middleName, $initials]);
            }
            $names = implode(';', [$lastName, $firstName, $additionalName, $salutation, $suffix]);
            $fullName = implode(' ', [$salutation, $firstName, $additionalName, $lastName]);
            if (!empty($suffix)) {
                $fullName = $fullName .', '. $suffix;
            }
            $fullName = preg_replace('/\s+/', ' ', $fullName);
            $company = '';
        } else {                                                    // it's a company
            $names = '';
            $nickName = '';
            $fullName = $realName;
            $company = $realName;
        }

        return [
            'N'        => $names,
            'FN'       => $fullName,
            'NICKNAME' => $nickName,
            'ORG'      => $company,
        ];
    }

Surname Prefix

The Iconic parser correctly parses a name in this format (name surname), for example:
Giulio Di Marco -> name = Giulio surname = Di Marco

but when the order of name and surname is inverted (surname name) like:

Di Marco Giulio -> name = Di surname = Marco Giulio
The problem seems to be the prefix in the surname.

Any fix ?

Nickname after complete name means last name is parsed as middle name

If the nickname is after the complete name, the last name is marked as the middle name (and no last name is marked).

Charles Dixon (20th century) is parsed as:

TheIconic\NameParser\Name Object
(
    [parts:protected] => Array
        (
            [0] => TheIconic\NameParser\Part\Firstname Object
                (
                    [value:protected] => Charles
                )

            [1] => TheIconic\NameParser\Part\Middlename Object
                (
                    [value:protected] => Dixon
                )

            [2] => TheIconic\NameParser\Part\Nickname Object
                (
                    [value:protected] => 20th
                )

            [3] => TheIconic\NameParser\Part\Nickname Object
                (
                    [value:protected] => century
                )

        )

)

Minor Bug in resolving full name if suffix matches a salutation

The following Name: "PAUL M LEWIS MR"

returns the following

Array
(
    [0] => PAUL
    [1] => TheIconic\NameParser\Part\Initial Object
        (
            [value:protected] => M
        )

    [2] => LEWIS
    [3] => TheIconic\NameParser\Part\Salutation Object
        (
            [normalized:protected] => Mr.
            [value:protected] => MR
        )

)

This breaks parsing, and getFirstname() and getLastname() return nothing.

the issue seems to be the following line
https://github.com/theiconic/name-parser/blob/master/src/Language/English.php#L41

This is probably going to happen with any of the salutations if they come at the end.

Edit:

Another Example:

"SUJAN MASTER"
"JAMES J MA"
"PETER K MA"

Capitalized names have strange behavior

Found some bugs having to do with capitalized names and multiple middle names. Sharing some examples here, it seems the initials get incorporated somehow.

Names for testing:

  • SOFIA GARCIA DE LA MANCHA
  • DA LAT
  • JUANITA MARIA DE SUR
  • Garcia Marques, Gabriel

The last example should be listed as "Last, First" so "Gabriel" should just be the first name and "Garcia Marques" should be listed as the last name.

Output:

(
    [firstname] => Sofia
    [middlename] => Garcia
    [initials] => D E L A
    [lastname] => Mancha
)
(
    [firstname] => D
    [initials] => A
    [lastname] => Lat
)
(
    [firstname] => Juanita
    [middlename] => Maria
    [initials] => D E
    [lastname] => Sur
)
(
    [firstname] => Garcia Gabriel
    [lastname] => Marques
)

Test script:

<?php

require_once __DIR__ . '/vendor/autoload.php';

$parser = new TheIconic\NameParser\Parser();
$namesToTest = array( 'SOFIA GARCIA DE LA MANCHA', 'DA LAT', 'JUANITA MARIA DE SUR', 'Garcia Marques, Gabriel' );

foreach ( $namesToTest as $input ) {

    $name = $parser->parse( $input );
    echo $name->getSalutation();
    echo $name->getFirstname();
    echo $name->getLastname();
    echo $name->getMiddlename();
    echo $name->getNickname();
    echo $name->getInitials();
    echo $name->getSuffix();

    print_r( $name->getAll() );

    echo $name;
}

Support Spanish-language names

I see support for German, is there anyone working on adding Spanish? Often there are multiple surnames:

Juan de Jesús López Ortíz

GIVENNAME: "Juan"
SURNAME: "de Jesús"
SURNAME: "López"
SURNAME: "Ortíz"

José Guadalupe Jiménez Montoya

GIVENNAME: "José"
GIVENNAME: "Guadalupe"
SURNAME: "Jiménez"
SURNAME: "Montoya"

I'm not sure if this library can support that, but figured I'd ask.

Last, First, Suffix pattern is not parsed properly

One common (if slightly archaic) way of listing names is as follows:

Lastname, Firstname (optional middle initial or name), Suffix

for example:

Tiptree, James, Jr.
which the parser parses as:

[
lastname => string (13) "Tiptree James"
suffix => string (2) "Jr"
]

and

Miller, Walter M., Jr.
which the parser parses as:

[
firstname => string (6) "Walter"
lastname => string (9) "Miller M."
suffix => string (2) "Jr"
]

Interestingly, if you remove the second comma, the names behave differently.

Tiptree, James Jr. still fails (it drops the suffix):

[
firstname => string (5) "James"
lastname => string (7) "Tiptree"
]

Miller, Walter M. Jr. correctly parses into firstname/initial/lastname/suffix, as shown:

[
firstname => string (6) "Walter"
lastname => string (6) "Miller"
initials => string (2) "M."
suffix => string (2) "Jr"
]

So, definitely a few problems to fix here!
