
carltonnorthern / nicknames


A CSV file with US given names (first name) and their associated nicknames or diminutive names.

License: Apache License 2.0

Java 3.52% Perl 17.16% Python 70.51% R 8.82%

nicknames's Introduction


Nicknames

A hand-curated CSV file containing English given names (first names) and their associated nicknames.

There are Python, SQL, Java, Perl, and R parsers provided for convenience.

This is a relatively large list with roughly 1100 canonical names. Any help from people to clean this list up and add to it is greatly appreciated. The first name in a line is the canonical name, and the rest are nicknames for that name.
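As a rough illustration of that layout, here is a minimal sketch of reading such lines into a lookup table. It assumes the file has no header row and follows the "canonical first, nicknames after" layout described above; `load_nicknames` and the two sample rows are hypothetical and are not the bundled parser or real file contents.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; the real file is much larger.
SAMPLE = "gregory,greg\ngeoffrey,geoff\n"


def load_nicknames(csv_text):
    """Map each canonical name to the set of nicknames on its line."""
    lookup = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        names = [name.strip().lower() for name in row if name.strip()]
        if not names:
            continue  # skip blank lines
        canonical, *nicks = names
        lookup[canonical].update(nicks)
    return dict(lookup)


lookup = load_nicknames(SAMPLE)
```

Using sets means a canonical name that appears on more than one line simply accumulates its nicknames.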

This lookup file was initially created by mining this genealogy page from the Center for African American Research, Inc. Because the lookup originates from a dataset used for genealogy purposes, there are old names that aren't commonly used these days, but there are recent ones as well. Examples are "gregory", "greg" and "geoffrey", "geoff". There was also a significant effort to make it machine readable, i.e. separating names with commas and removing human conventions: for example, "rickie(y)" had to be split into two different names, "rickie" and "ricky". Due to the source of the original data, the dataset is heavily biased towards traditionally African American names. Names from other groups may or may not be present.

This project was created by Old Dominion University - Web Science and Digital Libraries Research Group. More information about the creation of this lookup can be found in this blog post about the creation of this library.

Python API

The Python parser is available on PyPI:

pip install nicknames

and then you can do:

from nicknames import NickNamer

nn = NickNamer()

# Get the nicknames for a given name as a set of strings
nicks = nn.nicknames_of("Alexander")
assert isinstance(nicks, set)
assert "al" in nicks
assert "alex" in nicks

# Note that the relationship isn't symmetric: al is a nickname for alexander,
# but alexander is not a nickname for al.
assert "alexander" not in nn.nicknames_of("al")

# Capitalization is ignored and leading and trailing whitespace is ignored
assert nn.nicknames_of("alexander") == nn.nicknames_of(" ALEXANDER ")

# Queries that aren't found return an empty set
assert nn.nicknames_of("not a name") == set()

# The other useful thing is to go the other way, nickname to canonical:
# It acts very similarly to nicknames_of.
can = nn.canonicals_of("al")
assert isinstance(can, set)
assert "alexander" in can
assert "alex" in can

assert "al" not in nn.canonicals_of("alexander")

# You can combine these to see if two names are interchangeable:
union = nn.nicknames_of("al") | nn.canonicals_of("al")
are_interchangeable = "alexander" in union

For more advanced usage, such as loading your own data, read the source code.

nicknames's People

Contributors

bvallerand, c0bra, calvinwu4, carltonnorthern, clankjak, corepattern, cwag03, danielleevandenbosch, ericbn, f0rk, gpilgrim2670, iternovtsii, jimeisenhauer, jonathan-bevis1, kenlewis, kevinoid, kstephens6054, lockster99, louking, mdahlman, nickcrews, rileynetadmin, roachbones, roberto14, rterwedo, staff0rd, tamirble


nicknames's Issues

The JavaParser has the wrong type definition for the dimNames map

Currently the map is declared with:

public Map<String, String> dimNames = new HashMap<String, String>();

but that won't compile because the code needs to store a map keyed on String but with a List of Strings being the value type. So this is the correct declaration:

Map<String, List<String>> dimNames = new HashMap<>();

Nice list - what is the best way for suggesting new aliases/nicknames?

Documentation out of date?

Hi,

Thanks for creating this useful package. It seems the documentation is out of date or out of sync with the package in pypi:

Python 3.9.12 (main, Apr  5 2022, 01:53:17)
>>> from us_nicknames import NickNamer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'NickNamer' from 'us_nicknames' 

I do have the latest one (pip installed today)

>>> import us_nicknames
>>> us_nicknames.__version__
'0.1.2'

Make release on PyPI

Hi! This looks to me to be one of the better maintained datasets of diminutive names on GitHub. It could be easier to use in python if this was actually released on PyPI so people could do a pip install nicknames (surprisingly this package name isn't taken? Could definitely choose another name too.)

If I open a PR for this, would you be open to it? I'd add a GitHub Action similar to this one that would build and release the wheels automatically on a git tag. Your day-to-day admin overhead would be minimal: you'd just have to set up a PyPI account and add the access token to this repo's Secrets once. I can help with this too if you want.

Thank you!

How is this file structured?

I don't understand how this file is structured. If I want to find the name associated with "Dicky," how would I do that aside from looking through the file manually?
And why are some names spread across multiple lines? Shouldn't each name appear on only one line? Example:

russ,russell
russell,russ,rusty
rusty,russell

Shouldn't those closely associated names be on one line?

I tried the Perl script out but all it did was display how many names had more than 5 mentions in the file.
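One way to answer the lookup question without scanning the file by hand is to invert it: index every nickname by the canonical names whose lines contain it. The sketch below assumes the "canonical first, nicknames after" layout and uses only the three rows quoted above; `build_reverse_index` and `canonicals_of` are hypothetical helpers written for this illustration.

```python
import csv
import io
from collections import defaultdict

# The three rows quoted in the question above.
ROWS = "russ,russell\nrussell,russ,rusty\nrusty,russell\n"


def build_reverse_index(csv_text):
    """Map each nickname to the set of canonical names that list it."""
    index = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        names = [name.strip().lower() for name in row if name.strip()]
        if not names:
            continue
        canonical, *nicks = names
        for nick in nicks:
            index[nick].add(canonical)
    return index


index = build_reverse_index(ROWS)


def canonicals_of(name):
    """Return every canonical name whose line lists `name` as a nickname."""
    return set(index.get(name.strip().lower(), set()))
```

With this index, a query for "rusty" returns the canonical name "russell", and a name such as "Dicky" could be looked up the same way against the full file. It also shows why a name can appear on several lines: each line is keyed by a different canonical name.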

Additional nicknames and name variants to add

traci = tracy (which you already have), tracie
falon = fallon, Fal, Fall, Fallie, Fally, Falcon, Lon, Lonnie (https://momlovesbest.com/fallon-name-meaning)
hillary = hilary
toni = tony, antonia, etc.
lindsay = lindsey, lindsie, lindsy
garrett = Barrett, Gare, Garrison, Gars, Gary, Jerry, Rhett, Variations: Garratt, Garret, Garrod, Jarrett, Jared, Jarratt, Jerrold (https://momlovesbest.com/garrett-name-meaning)
gareth = gary, gare
dacia = Daycia, Daisha, Dacya
marc = mark, marcus, etc.
sheri = sherry, sherryl, sheryl, sherri, cheri, cherie, etc.
dianne = diane, dian
angelika = angelica
miguel = Miguell, Miguael, Miguaell, Miguail, Miguaill, Miguayl, Miguayll = michael/mick (spanish version)
monika = monica, monique
michele = michelle
shelley = sheley, michelle, shellie, etc.
hayley = hailey, haylee, etc.
karl = carl
rosemary = rosemarie, marie, mary, rose, etc.
jalen = Jay, Jaye, Len, Lenny, Lennie, Jaylin, Alen, Al, Jaylen, Jaelen, Jaelin, Jaelyn, Jailyn, Jaylyn
rachael = rachel
kellie = kelli, kelly, kelley
kalli = kali, cali
jodi = jody
lori = lorrie, laurie, lorelei, etc.
shawn = shaun
allen = allan, alan, al
erika = erica
marcia = marcie, marsha
dona = donna
kristi = kristy, Christy, christine, christina, krista, etc.
norman = norm
chelsie = chelsey
stephine = stephanie, stephany, stephani
audree = audrey
kerri = kerry
fiona = fionna
savanna = savannah
bryanna = brianna, bri, briana, etc.
jaine = jane, jayne
leilani = lani
jesse = jessica, jess, jessie
abby = abbie
glenn = glen
carri = carrie, kari, kara
donn = don, donald
kym = kymberly, kim, kimberly, kimberli
gerri, geri = geraldine
nichole = nicky, nicki, nicholette, nicci, nicole
jamey = jaime, jamie
tami = tammie, tammy
derek = derick, derrick, derrek, rick, etc.
jenni = jennie, jenny
karin = karen
gabriela = gabriella
marni = marnie
dena = deena, dina, adina, adena
brittnie = brittany
juston = justin
lesli = leslie, lesley, les
kev = kevin
aga = athaga
carla = karla, carly
tiffanee = tiffany
staci = stacy, stacey, stacie
sara = sarah
katia = kate, katie
terri = teri, terrie, terry
ashly = ashley
jeanie = jeannie
matt = matthew, matthews
jillian = jill
laurel = laurie

(these all came from a registration list I'm working on)

allie for allison

Thanks for this -- really finding it useful.

I found an instance of Allie used for Allison in my dataset.

BUG: can't instantiate default nicknamer twice in a row

Running

NickNamer()
NickNamer()

gives

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nickcrews/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.9/site-packages/nicknames/__init__.py", line 36, in __init__
    nickname_lookup = _lookup_nicknames_default()
  File "/Users/nickcrews/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.9/site-packages/nicknames/__init__.py", line 120, in _lookup_nicknames_default
    with DEFAULT_NICKNAME_RESOURCE as f:
  File "/Users/nickcrews/.pyenv/versions/3.9.4/lib/python3.9/contextlib.py", line 115, in __enter__
    del self.args, self.kwds, self.func
AttributeError: args

because of the way the package resource is used. Investigating.
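The `del self.args, self.kwds, self.func` frame in the traceback belongs to the generator-based context manager in `contextlib`, which can only be entered once. Below is a minimal sketch of that failure mode and one possible fix, assuming the default resource is such a context manager created once at import time. `open_resource`, `SHARED`, and `load` are hypothetical stand-ins, not the package's code.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_resource():
    # Hypothetical stand-in for opening the packaged CSV file.
    f = io.StringIO("russell,russ,rusty\n")
    try:
        yield f
    finally:
        f.close()


# A single context manager object shared at module level: fine the first time.
SHARED = open_resource()
with SHARED as f:
    first = f.read()

# Entering the same object again fails, because it is single-use.
try:
    with SHARED as f:
        f.read()
    second_failed = False
except (AttributeError, RuntimeError):
    second_failed = True


def load():
    # Possible fix: create a fresh context manager on every call.
    with open_resource() as f:
        return f.read()


reloaded_twice = load() == load()
```

Calling the factory inside `load` keeps each use independent, so constructing the object twice in a row would no longer share any state.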

How to deal with nickname-canonical-nickname transitive links (jon-johnathon-john)

There are 4 possible combos of "formalness", and how we typically treat them:

  1. (formal, formal) (jonathon, johnathon): we usually include these
  2. (formal, casual) (johnathon, john): we usually include these
  3. (casual, formal) (john, johnathon): we almost never include these
  4. (casual, casual) (john, jon): we are very inconsistent about whether we include these, e.g. (jon, john) is present, but (abbie, abbey) is not.

So in order to catch the (abbie, abbey) case, someone would need to do the abbie->abbigail lookup, and then the abbigail->abbey lookup, e.g.:

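from nicknames import NickNamer

nn = NickNamer()

def are_aliases(n1, n2):
    # True if some canonical name of n1 also lists n2 as a nickname
    for canon in nn.canonicals_of(n1):
        if n2 in nn.nicknames_of(canon):
            return True
    return False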

I'm thinking of some use cases; ideally all of them could be supported. Here is my take on expected behavior:

  • canonicals_of(jonathon) should just be {johnathon}, no jon or john included.
  • nicknames_of(jonathon) should be {johnathon, jon, john}
  • canonicals_of(jon): should this be merely {johnathon, jonathon}, or should it also include {john}?
  • nicknames_of(john) should be {jon}

What do you think of these test cases and expected outputs? Once we know the expected outputs, that can inform what data representation we should use.

If we went with my suggestion of listing individual pairs, then we could annotate the pairs with their level of casualness. But that is a whole other level of subjectivity we may want to avoid.
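To make the pair idea concrete, here is a rough sketch of a pair-based representation with both lookup directions derived from it. The pairs are toy data chosen for illustration, not rows from names.csv, and `PAIRS`, `nicknames`, `canonicals`, and this `are_aliases` are hypothetical names. A casualness annotation could be a third field on each pair; it is left out here.

```python
from collections import defaultdict

# Hypothetical toy data: (canonical, nickname) pairs, not taken from names.csv.
PAIRS = [
    ("abbigail", "abbie"),
    ("abbigail", "abbey"),
    ("johnathon", "jon"),
    ("johnathon", "john"),
]

nicknames = defaultdict(set)   # canonical -> nicknames
canonicals = defaultdict(set)  # nickname -> canonicals
for canonical, nick in PAIRS:
    nicknames[canonical].add(nick)
    canonicals[nick].add(canonical)


def are_aliases(n1, n2):
    """Two-hop check: n1 -> a shared canonical name -> n2."""
    return any(n2 in nicknames[c] for c in canonicals.get(n1, ()))
```

Under this toy data the (abbie, abbey) case is caught through the shared canonical name even though no direct (abbie, abbey) pair is stored.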

@carltonnorthern I'd love your thoughts here if you have the time. Thanks!

Get PyPI tokens set up

Create better SQL resources for names.csv

I'll start by saying that having names-mysql.sql is far better than not having it. Thanks to the guys that created it.
But there are a few aspects I don't like about it. I'm thinking of adding some improvements. I would very much like feedback from others about what would be most useful. Here's my mini-spec:

  • A SQL neutral (ANSI) version of the file would be appropriate here rather than MySQL or other technology-specific version.
  • Include a version exactly parallel to names.csv. i.e. it would have exactly the same number of rows.
  • Include a normalized version. Following the model of names-mysql.sql available today.
  • Include a python script for regenerating these files whenever names.csv is updated.

Ideas for the file names:

  • names.sql
  • names-normalized.sql
  • generate-sql.py
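A possible starting point for generate-sql.py, as a sketch only: it emits the normalized form, one row per (canonical, nickname) pair. The table name, column names, and `generate_normalized_sql` are assumptions made for illustration, and the input is assumed to follow the "canonical first, nicknames after" layout with no header row.

```python
import csv
import io


def generate_normalized_sql(csv_text, table="nicknames"):
    """Return SQL statements with one (canonical, nickname) row per pair."""
    statements = [
        f"CREATE TABLE {table} ("
        "canonical_name VARCHAR(255) NOT NULL, "
        "nickname VARCHAR(255) NOT NULL);"
    ]
    for row in csv.reader(io.StringIO(csv_text)):
        names = [name.strip().lower() for name in row if name.strip()]
        if len(names) < 2:
            continue  # nothing to pair up on this line
        canonical, *nicks = names
        for nick in nicks:
            # Double single quotes so names containing an apostrophe stay valid.
            c = canonical.replace("'", "''")
            n = nick.replace("'", "''")
            statements.append(
                f"INSERT INTO {table} (canonical_name, nickname) "
                f"VALUES ('{c}', '{n}');"
            )
    return statements


sample = "russ,russell\nrussell,russ,rusty\nrusty,russell\n"
statements = generate_normalized_sql(sample)
```

The statements stick to plain CREATE TABLE and INSERT syntax so they are not tied to MySQL; writing them to names-normalized.sql would be a matter of joining the list with newlines.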
