carltonnorthern / nicknames


A CSV file with US given names (first name) and their associated nicknames or diminutive names.

License: Apache License 2.0

Java 3.52% Perl 17.16% Python 70.51% R 8.82%

nicknames's Introduction

Nicknames

A hand-curated CSV file containing English given names (first names) and their associated nicknames.

There are Python, SQL, Java, Perl, and R parsers provided for convenience.

This is a relatively large list with roughly 1100 canonical names. Any help from people to clean this list up and add to it is greatly appreciated. The first name in a line is the canonical name, and the rest are nicknames for that name.
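The line layout described above can be read with Python's standard csv module. The following is a minimal sketch, not the packaged parser: the function name load_nicknames and the inline sample rows are assumptions for illustration (the rows mirror a few lines quoted in an issue further down this page).

```python
import csv
import io


def load_nicknames(lines):
    """Map each canonical name to the set of nicknames listed after it."""
    lookup = {}
    for row in csv.reader(lines):
        # Normalize case and whitespace, and drop empty cells.
        names = [cell.strip().lower() for cell in row if cell.strip()]
        if not names:
            continue  # skip blank lines
        canonical, nicknames = names[0], names[1:]
        # A canonical name may appear on more than one line, so merge.
        lookup.setdefault(canonical, set()).update(nicknames)
    return lookup


# Hypothetical sample standing in for the real CSV file.
sample = io.StringIO("russ,russell\nrussell,russ,rusty\nrusty,russell\n")
lookup = load_nicknames(sample)
print(sorted(lookup["russell"]))
```

To read the real file instead of the sample, pass an open file handle, for example the result of open("names.csv", newline="").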

This lookup file was initially created by mining this genealogy page from the Center for African American Research, Inc. Because the lookup originates from a dataset used for genealogy purposes, it includes old names that aren't commonly used these days, as well as recent ones. Examples are "gregory", "greg" and "geoffrey", "geoff". There was also a significant effort to make it machine readable: entries were separated with commas, and human conventions were removed, so that a shorthand like "rickie(y)" became two separate names, "rickie" and "ricky". Due to the source of the original data, the dataset is heavily biased towards traditionally African American names. Names from other groups may or may not be present.

This project was created by the Old Dominion University Web Science and Digital Libraries Research Group. More information about the creation of this lookup can be found in this blog post.

Python API

The Python parser is available on PyPI from

pip install nicknames

and then you can do:

from nicknames import NickNamer

nn = NickNamer()

# Get the nicknames for a given name as a set of strings
nicks = nn.nicknames_of("Alexander")
assert isinstance(nicks, set)
assert "al" in nicks
assert "alex" in nicks

# Note that the relationship isn't symmetric: al is a nickname for alexander,
# but alexander is not a nickname for al.
assert "alexander" not in nn.nicknames_of("al")

# Capitalization is ignored and leading and trailing whitespace is ignored
assert nn.nicknames_of("alexander") == nn.nicknames_of(" ALEXANDER ")

# Queries that aren't found return an empty set
assert nn.nicknames_of("not a name") == set()

# The other useful thing is to go the other way, nickname to canonical:
# It acts very similarly to nicknames_of.
can = nn.canonicals_of("al")
assert isinstance(can, set)
assert "alexander" in can
assert "alex" in can

assert "al" not in nn.canonicals_of("alexander")

# You can combine these to see if two names are interchangeable:
union = nn.nicknames_of("al") | nn.canonicals_of("al")
are_interchangeable = "alexander" in union

For more advanced usage, such as loading your own data, read the source code.

nicknames's Issues

The JavaParser has the wrong type definition for the dimNames map

Currently the map is declared with:

public Map<String, String> dimNames = new HashMap<String, String>();

but that won't compile because the code needs to store a map keyed on String but with a List of Strings being the value type. So this is the correct declaration:

Map<String, List<String>> dimNames = new HashMap<>();

Nice list - what is the best way for suggesting new aliases/nicknames?

Documentation out of date?

Hi,

Thanks for creating this useful package. It seems the documentation is out of date or out of sync with the package in pypi:

Python 3.9.12 (main, Apr  5 2022, 01:53:17)
>>> from us_nicknames import NickNamer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'NickNamer' from 'us_nicknames'

I do have the latest one (pip installed today)

>>> import us_nicknames
>>> us_nicknames.__version__
'0.1.2'

Adding Latino/Latina names

Not an issue per se, but would you consider adding Latino/Latina names to this package?

Make release on PyPI

Hi! This looks to me to be one of the better maintained datasets of diminutive names on GitHub. It could be easier to use in python if this was actually released on PyPI so people could do a pip install nicknames (surprisingly this package name isn't taken? Could definitely choose another name too.)

If I open a PR for this, would you be open to it? I'd add a GitHub Action similar to this one that would build and release the wheels automatically on a git tag. Your day-to-day admin overhead would be minimal: you'd just have to set up a PyPI account and add the access token to this repo's Secrets once. I can help with this too if you want.

Thank you!

How is this file structured?

I don't understand how this file is structured. If I want to find the name associated with "Dicky," how would I do that aside from looking through the file manually?
And why are some names spread across multiple lines? Shouldn't each name appear on only one line? Example:

russ,russell
russell,russ,rusty
rusty,russell

Shouldn't those closely associated names be on one line?

I tried the Perl script out but all it did was display how many names had more than 5 mentions in the file.
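One way to answer the lookup question without scanning the file by hand is to build a reverse index from nickname to canonical name. This is only a sketch under the assumption that the first cell of each line is the canonical name and the remaining cells are its nicknames; the function name build_reverse_index is hypothetical, and the sample rows are the three lines quoted above. Because each line is directional, the same name can appear on several lines in different roles.

```python
import csv
import io
from collections import defaultdict


def build_reverse_index(lines):
    """Map each nickname to the set of canonical names it is listed under."""
    index = defaultdict(set)
    for row in csv.reader(lines):
        names = [cell.strip().lower() for cell in row if cell.strip()]
        if len(names) < 2:
            continue  # nothing to index on blank or single-name lines
        canonical, nicknames = names[0], names[1:]
        for nick in nicknames:
            index[nick].add(canonical)
    return index


# The three example lines from this issue.
sample = io.StringIO("russ,russell\nrussell,russ,rusty\nrusty,russell\n")
index = build_reverse_index(sample)
print(sorted(index["rusty"]))
```

With the full file loaded the same way, index.get("dicky", set()) would return whichever canonical names list "dicky" as a nickname.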

add a license

'Ed' list does not have 'Edward'

I'm opening this because 'alex' has 'alexander', so the missing 'edward' under 'ed' seems like an oversight.

Problems with names.csv

There are some problems with names.csv.

For example, Jon is the shortened form of Jonathan. But it is not mentioned in names.csv.

https://en.wikipedia.org/wiki/Jon

Karon has last name in csv

karonhappuck

see: https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup/blob/master/names.csv#L573

Additional nicknames and name variants to add

traci = tracy (which you already have), tracie
falon = fallon, Fal, Fall, Fallie, Fally, Falcon, Lon, Lonnie (https://momlovesbest.com/fallon-name-meaning)
hillary = hilary
toni = tony, antonia, etc.
lindsay = lindsey, lindsie, lindsy
garrett = Barrett, Gare, Garrison, Gars, Gary, Jerry, Rhett, Variations: Garratt, Garret, Garrod, Jarrett, Jared, Jarratt, Jerrold (https://momlovesbest.com/garrett-name-meaning)
gareth = gary, gare
dacia = Daycia, Daisha, Dacya
marc = mark, marcus, etc.
sheri = sherry, sherryl, sheryl, sherri, cheri, cherie, etc.
dianne = diane, dian
angelika = angelica
miguel = Miguell, Miguael, Miguaell, Miguail, Miguaill, Miguayl, Miguayll = michael/mick (spanish version)
monika = monica, monique
michele = michelle
shelley = sheley, michelle, shellie, etc.
hayley = hailey, haylee, etc.
karl = carl
rosemary = rosemarie, marie, mary, rose, etc.
jalen = Jay, Jaye, Len, Lenny, Lennie, Jaylin, Alen, Al, Jaylen, Jaelen, Jaelin, Jaelyn, Jailyn, Jaylyn
rachael = rachel
kellie = kelli, kelly, kelley
kalli = kali, cali
jodi = jody
lori = lorrie, laurie, lorelei, etc.
shawn = shaun
allen = allan, alan, al
erika = erica
marcia = marcie, marsha
dona = donna
kristi = kristy, Christy, christine, christina, krista, etc.
norman = norm
chelsie = chelsey
stephine = stephanie, stephany, stephani
audree = audrey
kerri = kerry
fiona = fionna
savanna = savannah
bryanna = brianna, bri, briana, etc.
jaine = jane, jayne
leilani = lani
jesse = jessica, jess, jessie
abby = abbie
glenn = glen
carri = carrie, kari, kara
donn = don, donald
kym = kymberly, kim, kimberly, kimberli
gerri, geri = geraldine
nichole = nicky, nicki, nicholette, nicci, nicole
jamey = jaime, jamie
tami = tammie, tammy
derek = derick, derrick, derrek, rick, etc.
jenni = jennie, jenny
karin = karen
gabriela = gabriella
marni = marnie
dena = deena, dina, adina, adena
brittnie = brittany
juston = justin
lesli = leslie, lesley, les
kev = kevin
aga = athaga
carla = karla, carly
tiffanee = tiffany
staci = stacy, stacey, stacie
sara = sarah
katia = kate, katie
terri = teri, terrie, terry
ashly = ashley
jeanie = jeannie
matt = matthew, matthews
jillian = jill
laurel = laurie

(these all came from a registration list I'm working on)

remove doctor,namegivento

See mdahlman@fb82614

Consider renaming repo to `us-nicknames`

The existing name is quite verbose. Renaming would also make it match the Python package name. I think all existing links should be redirected.

allie for allison

Thanks for this -- really finding it useful.

I found an instance of Allie used for Allison in my dataset.

Patch for /trunk/names.csv

Adding to Margaret and William.

Original issue reported on code.google.com by [email protected] on 2 Mar 2014 at 11:46

Attachments:

names.csv.patch

BUG: can't instantiate default nicknamer twice in a row

Running

NickNamer()
NickNamer()

gives

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nickcrews/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.9/site-packages/nicknames/__init__.py", line 36, in __init__
    nickname_lookup = _lookup_nicknames_default()
  File "/Users/nickcrews/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.9/site-packages/nicknames/__init__.py", line 120, in _lookup_nicknames_default
    with DEFAULT_NICKNAME_RESOURCE as f:
  File "/Users/nickcrews/.pyenv/versions/3.9.4/lib/python3.9/contextlib.py", line 115, in __enter__
    del self.args, self.kwds, self.func
AttributeError: args

because of the way the package resource is used. Investigating.
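The traceback fails inside contextlib's __enter__, which is consistent with a single context manager object being entered twice. That is a hypothesis, not a confirmed diagnosis. The sketch below reproduces the general behavior with a hypothetical open_resource function standing in for however the package opens its bundled CSV: a context manager produced by @contextmanager is single-use, so storing one at module level breaks on the second `with`, while creating a fresh one per call does not.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_resource():
    # Hypothetical stand-in for opening the packaged CSV resource.
    f = io.StringIO("russell,russ,rusty\n")
    try:
        yield f
    finally:
        f.close()


SHARED = open_resource()  # one context manager object, created once


def load_shared():
    with SHARED as f:  # only works the first time it is entered
        return f.read()


def load_fresh():
    with open_resource() as f:  # a new context manager on every call
        return f.read()


first = load_shared()
try:
    load_shared()
    second_failed = False
except (AttributeError, RuntimeError):
    # The exact exception type depends on the Python version.
    second_failed = True

print(second_failed)
print(load_fresh() == load_fresh())
```

If this is indeed the cause, building the resource context manager inside the loading function rather than once at import time would be one possible fix.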

How to deal with nickname-canonical-nickname transitive links (jon-johnathon-john)

There are 4 possible combos of "formalness", and how we typically treat them:

(formal, formal) (jonathon, johnathon): we usually include these
(formal, casual) (johnathon, john): we usually include these
(casual, formal) (john, johnathon): we almost never include these
(casual, casual) (john, jon): we are very inconsistent on how we include these. E.g. (jon, john) is present, but (abbie, abbey) is not.

So in order to catch the (abbie, abbey) case, someone would need to do the abbie->abbigail lookup, and then the abbigail->abbey lookup, e.g.:

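from nicknames import NickNamer

nn = NickNamer()

def are_aliases(n1, n2):
    # Two-hop lookup: n1 -> its canonical names -> their nicknames.
    for canon in nn.canonicals_of(n1):
        if n2 in nn.nicknames_of(canon):
            return True
    return False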

I'm thinking of some use cases; ideally all of them could be supported. Here is my take on expected behavior:

canonicals_of(jonathon) should just be {johnathon}, no jon or john included.
nicknames_of(jonathon) should be {johnathon, jon, john}
canonicals_of(jon): should this be merely {johnathon, jonathon}, or should it also include {john}?
nicknames_of(john) should be {jon}

What do you think of these test cases and expected outputs? Once we know the expected outputs, that can inform what data representation we should use.

If we went with my suggestion of listing individual pairs, then we could annotate the pairs with their level of casualness. But that is a whole other level of subjectivity we may want to avoid.
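To make the pair idea concrete, here is a rough sketch of what such a representation could look like. Everything in it is an assumption for discussion: the PAIRS list, the "kind" labels, and the build_lookups helper are hypothetical and not part of the package. The pairs are chosen only so that the lookups reproduce the expected outputs listed above.

```python
from collections import defaultdict

# Hypothetical (canonical, nickname, kind) triples; the labels are illustrative.
PAIRS = [
    ("johnathon", "jonathon", "formal-formal"),
    ("jonathon", "johnathon", "formal-formal"),
    ("jonathon", "jon", "formal-casual"),
    ("jonathon", "john", "formal-casual"),
    ("johnathon", "jon", "formal-casual"),
    ("john", "jon", "casual-casual"),
]


def build_lookups(pairs, kinds=None):
    """Return (nicknames_of, canonicals_of) dicts, optionally filtered by kind."""
    nicknames_of = defaultdict(set)
    canonicals_of = defaultdict(set)
    for canonical, nickname, kind in pairs:
        if kinds is not None and kind not in kinds:
            continue  # skip pairs whose kind was not requested
        nicknames_of[canonical].add(nickname)
        canonicals_of[nickname].add(canonical)
    return nicknames_of, canonicals_of


# All pairs: canonicals_of["jon"] also contains "john".
nicks, canons = build_lookups(PAIRS)
# Without casual-casual pairs: only the formal names remain.
_, strict_canons = build_lookups(PAIRS, kinds={"formal-formal", "formal-casual"})

print(sorted(canons["jon"]))
print(sorted(strict_canons["jon"]))
```

Filtering by kind lets the caller decide the open question about canonicals_of(jon) instead of baking one answer into the data.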

@carltonnorthern I'd love your thoughts here if you have the time. Thanks!

Get PyPI tokens set up

@carltonnorthern verify your email on test pypi, I tried adding you and got "User 'carlton.northern' does not have a verified primary email address and cannot be added as a Owner for project"
@NickCrews add carlton as a co-owner
@carltonnorthern set up API tokens for test and prod pypi. This is quite easy, should only take 5 minutes for each
- Read https://pypi.org/help/#apitoken
- Make a token
- on this github repo, go to its Settings, then on the sidebar "Secrets" > "repository". On the new page click "New repository secret" and then you have to name them PYPI_API_TOKEN and TEST_PYPI_API_TOKEN so they match https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup/blob/d2c7c72fa1324db0a339bf5daf04e518034a3574/.github/workflows/ci.yml#L90-L98

Create better SQL resources for names.csv

I'll start by saying that having names-mysql.sql is far better than not having it. Thanks to the guys that created it.
But there are a few aspects I don't like about it. I'm thinking of adding some improvements. I would very much like feedback from others about what would be most useful. Here's my mini-spec:

A SQL neutral (ANSI) version of the file would be appropriate here rather than MySQL or other technology-specific version.
Include a version exactly parallel to names.csv. i.e. it would have exactly the same number of rows.
Include a normalized version. Following the model of names-mysql.sql available today.
Include a python script for regenerating these files whenever names.csv is updated.

Ideas for the file names:

names.sql
names-normalized.sql
generate-sql.py
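As a starting point for discussion, a generator script for the normalized file might look roughly like this. It is a sketch only: the table name, column names, and VARCHAR sizes are assumptions and are not taken from names-mysql.sql, and the sample rows stand in for names.csv.

```python
import csv
import io


def sql_quote(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def generate_normalized_sql(lines, table="nicknames"):
    """Yield SQL statements with one (canonical, nickname) row per pair."""
    yield (
        f"CREATE TABLE {table} ("
        "canonical VARCHAR(64) NOT NULL, "
        "nickname VARCHAR(64) NOT NULL, "
        "PRIMARY KEY (canonical, nickname));"
    )
    seen = set()  # avoid duplicate rows that would violate the primary key
    for row in csv.reader(lines):
        names = [cell.strip().lower() for cell in row if cell.strip()]
        for nickname in names[1:]:
            pair = (names[0], nickname)
            if pair in seen:
                continue
            seen.add(pair)
            yield (
                f"INSERT INTO {table} (canonical, nickname) "
                f"VALUES ({sql_quote(pair[0])}, {sql_quote(pair[1])});"
            )


# Hypothetical sample standing in for names.csv.
sample = io.StringIO("russ,russell\nrussell,russ,rusty\nrusty,russell\n")
statements = list(generate_normalized_sql(sample))
print("\n".join(statements))
```

The statements use only CREATE TABLE and single-row INSERT, which keeps the output close to the vendor-neutral goal in the spec above.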