GithubHelp home page GithubHelp logo

shukkkur / gender-prediction-using-sound Goto Github PK

View Code? Open in Web Editor NEW
9.0 1.0 0.0 798 KB

Analyzing the gender distribution of children's book writers and use sound to match names to gender.

License: Creative Commons Attribution 4.0 International

Jupyter Notebook 100.00%
case-study fuzzy prediction pandas numpy matplotlib data-manipulation

gender-prediction-using-sound's Introduction

Gender-Prediction-using-Sound

Forks Stars Watchers Last Commit

The same name can be spelled out in a many ways, for example, Marc and Mark. Sound can, therefore, be a better way to match names than spelling.In this project, I will use the Python package Fuzzy to find out the genders of authors that have appeared in the New York Times Best Seller list for Children's Picture books.

1. Sound it out!

Grey and Gray. Colour and Color. Words like these have been the cause of many heated arguments between Brits and Americans.
One way to tackle this challenge is to write a program that checks if two strings sound the same, instead of checking for equivalence in spellings. We'll do that here using fuzzy name matching.

print(fuzzy.nysiis('gray'))
>>> GRY

fuzzy.nysiis('colour') == fuzzy.nysiis('color')
>>> True

2. Authoring the authors

Let's begin by reading in the data on the best selling authors from 2008 to 2017.

author_df = pd.read_csv('datasets/nytkids_yearly.csv', delimiter=';')

first_name = []
for name in author_df['Author']:
    first_name.append(name.split()[0])
author_df['first_name'] = first_name

author_df.head()
Year Book Title Author Besteller this year first_name
0 2017 DRAGONS LOVE TACOS Adam Rubin 49 Adam
1 2017 THE WONDERFUL THINGS YOU WILL BE Emily Winfield Martin 48 Emily
2 2017 THE DAY THE CRAYONS QUIT Drew Daywalt 44 Drew
3 2017 ROSIE REVERE, ENGINEER Andrea Beaty 38 Andrea
4 2017 ADA TWIST, SCIENTIST Andrea Beaty 28 Andrea

3. Time to bring on the phonics!

When we were young children, we were taught to read using phonics; sounding out the letters that compose words. So let's relive history and do that again, but using python this time.

nysiis_name = []
for name in author_df['first_name']:
    nysiis_name.append(fuzzy.nysiis(name))

author_df['nysiis_name'] = nysiis_name

author_df.head()
Year Book Title Author Besteller this year first_name nysiis_name
0 2017 DRAGONS LOVE TACOS Adam Rubin 49 Adam ADAN
1 2017 THE WONDERFUL THINGS YOU WILL BE Emily Winfield Martin 48 Emily ENALY
2 2017 THE DAY THE CRAYONS QUIT Drew Daywalt 44 Drew DR
3 2017 ROSIE REVERE, ENGINEER Andrea Beaty 38 Andrea ANDR
4 2017 ADA TWIST, SCIENTIST Andrea Beaty 28 Andrea ANDR

4. The inbetweeners

We'll use babynames_nysiis.csv, a dataset that is derived from the Social Security Administration’s baby name data, to identify author genders. The dataset contains unique NYSIIS versions of baby names, and also includes the percentage of times the name appeared as a female name (perc_female) and the percentage of times it appeared as a male name (perc_male).

babies_df = pd.read_csv('datasets/babynames_nysiis.csv', delimiter=';')

gender = []
for i in range(len(babies_df)):
    if babies_df.iloc[i]['perc_male'] > babies_df.iloc[i]['perc_female']:
        gender.append('M')
    elif babies_df.iloc[i]['perc_male'] < babies_df.iloc[i]['perc_female']:
        gender.append('F')
    else:
        gender.append('N')
        
babies_df['gender'] = gender

babies_df.head()
babynysiis perc_female perc_male gender
0 NaN 62.50 37.50 F
1 RAX 63.64 36.36 F
2 ESAR 44.44 55.56 M
3 DJANG 0.00 100.00 M
4 PARCAL 25.00 75.00 M

5. Playing matchmaker

Now that we have identified the likely genders of different names, let's find author genders by searching for each author's name in the babies_df DataFrame, and extracting the associated gender.

def locate_in_list(a_list, element):
    loc_of_name = a_list.index(element) if element in a_list else -1
    return(loc_of_name)

author_gender = []
for name in author_df['nysiis_name']:
    nloc = locate_in_list(list(babies_df['babynysiis']), name)
    if nloc == -1:
        author_gender.append('Unknown')
    else:
        author_gender.append(babies_df['gender'][nloc])

author_df['author_gender'] = author_gender

author_df['author_gender'].value_counts()
F 395
M 191
Unknown 9
N 8
Name: author_gender, dtype: int64

6. Tally up

From the results above see that there are more female authors on the New York Times best seller's list than male authors. Our dataset spans 2008 to 2017. Let's find out if there have been changes over time.

years = sorted(author_df.Year.unique())

males_by_yr = []
females_by_yr = []
unknown_by_yr = []

for year in years:
    males_by_yr.append(len(author_df[(author_df['author_gender']=='M') & (author_df['Year']==year)]))
    females_by_yr.append(len(author_df[(author_df['author_gender']=='F') & (author_df['Year']==year)]))
    unknown_by_yr.append(len(author_df[(author_df['author_gender']=='Unknown') & (author_df['Year']==year)]))

males_by_yr 

>>> [8, 19, 27, 21, 21, 11, 21, 18, 25, 20]

7. Foreign-born authors?

Our gender data comes from social security applications of individuals born in the US. Hence, one possible explanation for why there are "unknown" genders associated with some author names is because these authors were foreign-born.

years_shifted = list(np.array(years) + 0.25)

plt.bar(years, males_by_yr, width=0.25, color='lightblue')
plt.bar(years_shifted, females_by_yr, width=0.25, color='pink')

plt.xlabel('years')
plt.show()

gender-prediction-using-sound's People

Contributors

shukkkur avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.