Russian names parsers, gender identification and processing tools

License: BSD 3-Clause "New" or "Revised" License


Russian Names

russiannames is a Python 3 library for parsing Russian first names, surnames and middle names, identifying a person's gender from the full name, and detecting the format in which the name is written. It uses MongoDB as a backend to speed up name parsing.

Documentation

Documentation is built automatically and can be found at https://russiannames.readthedocs.org/en/latest/

Installation

To install the Python library, run pip install russiannames, or run python setup.py install from a source checkout.

To use the database you need a MongoDB instance. Download and unpack the database dump archive from https://github.com/datacoon/russiannames/blob/master/data/bson/db_dump_bson.zip and use the mongorestore command to restore the names database, which contains 3 collections: names, surnames and midnames.
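After restoring the dump, it can be useful to confirm that all three collections are present. The following is a minimal sketch, not part of russiannames: the helper names are hypothetical, and the database name and connection URI are assumptions that depend on how the dump was restored on your machine.

```python
REQUIRED_COLLECTIONS = ("names", "surnames", "midnames")


def missing_collections(existing):
    """Return the required collections that are absent from `existing`."""
    present = set(existing)
    return [name for name in REQUIRED_COLLECTIONS if name not in present]


def check_database(db_name, uri="mongodb://localhost:27017"):
    """Connect to MongoDB and list the required collections that are missing.

    `db_name` and `uri` are assumptions: pass the database the dump was
    restored into and the address your MongoDB instance listens on.
    """
    from pymongo import MongoClient  # imported lazily; requires pymongo

    client = MongoClient(uri, serverSelectionTimeoutMS=2000)
    try:
        return missing_collections(client[db_name].list_collection_names())
    finally:
        client.close()
```

An empty list returned by check_database means the restore produced all three collections.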

Features

Database of names used for identification

  • 375449 surnames - collection: surnames
  • 32134 first names - collection: names
  • 48274 midnames - collection: midnames

Detailed database statistics by gender and collection

collection    total     males     females   universal or unidentified
names         32134     19297     8278      1196
midnames      48274     30114     16143     0
surnames      375274    124662    111534    38827

Supports 13 formats of writing Russian full names

Format   Example                      Description
f        Ольга                        only first name
s        Петров                       only surname
Fs       О. Сидорова                  first letter of first name and full surname
sF       Николаев С.                  full surname and first letter of first name
sf       Абрамов Семен                full surname and full first name
fs       Соня Камиуллина              full first name and full surname
fm       Иван Петрович                full first name and full middle name
SFM      М.Д.М.                       first letters of surname, first name and middle name
FMs      А.Н. Егорова                 first letters of first and middle names and full surname
sFM      Николаенко С.П.              full surname and first letters of first and middle names
sfM      Петракова Зинаида М.         full surname, full first name and first letter of middle name
sfm      Казаков Ринат Артурович      full name as surname, first name and middle name
fms      Светлана Архиповна Волкова   full name as first name, middle name and surname
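Reading the table, each letter of a format code appears to name one part (s for surname, f for first name, m for middle name), with lowercase meaning the part is written in full and uppercase meaning only its initial. That convention is inferred from the examples above, not taken from the library source. A small sketch that expands a code under this assumption (describe_format and PARTS are hypothetical names, not part of russiannames):

```python
PARTS = {"s": "surname", "f": "first name", "m": "middle name"}


def describe_format(code):
    """Expand a format code such as 'sFM' into (part, form) pairs.

    Assumes lowercase letters mean the part is written in full and
    uppercase letters mean only its initial is given.
    """
    described = []
    for letter in code:
        part = PARTS.get(letter.lower())
        if part is None:
            raise ValueError(f"unknown format letter: {letter!r}")
        form = "initial" if letter.isupper() else "full"
        described.append((part, form))
    return described
```

For example, describe_format("sFM") yields the surname in full followed by the first and middle name initials, matching the table row for sFM.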

Supports ethnic identification of names

9 ethnic types are supported for first names, surnames and middle names

key     name (en)          name (rus)
arab    Arabic             Арабское
arm     Armenian           Армянское
geor    Georgian           Грузинское
germ    German             Немецкие
greek   Greek              Греческие
jew     Jewish             Еврейские
polsk   Polish             Польские
slav    Slavic (Russian)   Славянские
tur     Turkic             Тюркские (тюркоязычные)

Limitations

  • very rare first names, surnames or middle names may not be parsed
  • ethnic identification is still at an early stage

Speed optimization

  • preconfigured and preindexed MongoDB collections are used

Usage and Examples

Parse name and identify gender

Parses a name and returns the format, surname, first name, middle name, a parsed flag (True/False) and the gender

>>> from russiannames.parser import NamesParser
>>> parser = NamesParser()
>>> parser.parse('Нигматуллин Ринат Ахметович')
{'format': 'sfm', 'sn': 'Нигматуллин', 'fn': 'Ринат', 'mn': 'Ахметович', 'gender': 'm', 'text': 'Нигматуллин Ринат Ахметович', 'parsed': True}
>>> parser.parse('Петрова C.Я.')
{'format': 'sFM', 'sn': 'Петрова', 'fn_s': 'C', 'mn_s': 'Я', 'gender': 'f', 'text': 'Петрова C.Я.', 'parsed': True}
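As the two results show, full parts are returned under sn, fn and mn, while initials appear under fn_s and mn_s. A minimal sketch of a helper that handles both shapes when building a display string; display_name is a hypothetical name, not part of russiannames, and it only covers the keys visible in the sample output above:

```python
def display_name(result):
    """Build 'Surname First Middle' from a parse result dictionary.

    Uses the full part when present and falls back to the initial
    (with a trailing dot). Returns None if the name was not parsed.
    Only keys seen in the sample output are handled.
    """
    if not result.get("parsed"):
        return None
    pieces = []
    for full_key, initial_key in (("sn", None), ("fn", "fn_s"), ("mn", "mn_s")):
        if result.get(full_key):
            pieces.append(result[full_key])
        elif initial_key and result.get(initial_key):
            pieces.append(result[initial_key] + ".")
    return " ".join(pieces)
```

Checking the parsed flag first keeps callers from formatting results for input the parser could not handle.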

The gender field can have one of the following values:

  • m: Male
  • f: Female
  • u: Unknown / unidentified
  • -: Impossible to identify
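The codes above can be turned into readable labels with a plain lookup. This is an illustrative sketch only; GENDER_LABELS and gender_label are hypothetical names, and the fallback for unexpected codes is an assumption about how a caller might want to handle them.

```python
GENDER_LABELS = {
    "m": "Male",
    "f": "Female",
    "u": "Unknown / unidentified",
    "-": "Impossible to identify",
}


def gender_label(result):
    """Return a readable label for the 'gender' field of a result dictionary."""
    code = result.get("gender")
    # Codes not listed in the documentation are reported rather than hidden.
    return GENDER_LABELS.get(code, f"Unrecognized code: {code!r}")
```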

Ethnic identification (experimental)

Takes a surname, first name and middle name and tries to identify the person's ethnic affiliation

>>> from russiannames.parser import NamesParser
>>> parser = NamesParser()
>>> parser.classify('Нигматуллин', 'Ринат', 'Ахметович')
{'ethnics': ['tur'], 'gender': 'm'}
>>> parser.classify('Алексеева', 'Ольга', 'Ивановна')
{'ethnics': ['slav'], 'gender': 'f'}
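The ethnics list contains the short keys from the table of ethnic types. A small sketch that maps those keys to English names for display; ETHNIC_NAMES and ethnic_labels are hypothetical names, not part of russiannames, and unknown keys are passed through unchanged as a precaution.

```python
ETHNIC_NAMES = {
    "arab": "Arabic",
    "arm": "Armenian",
    "geor": "Georgian",
    "germ": "German",
    "greek": "Greek",
    "jew": "Jewish",
    "polsk": "Polish",
    "slav": "Slavic (Russian)",
    "tur": "Turkic",
}


def ethnic_labels(classification):
    """Map the keys in a classify() result to English names.

    Keys missing from ETHNIC_NAMES are returned as-is.
    """
    return [ETHNIC_NAMES.get(key, key) for key in classification.get("ethnics", [])]
```

Because identification is experimental, callers may also want to treat an empty ethnics list as "not identified" instead of an error.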

Supported languages

  • Russian

Requirements

  • pymongo
  • click

Related projects
