NWMSU-Capstone-Project

Author: Brady Monks

Walk through the AthleteDatabase.csv file with me as I perform analysis on it and look for distinct patterns.

Overview

Using Python, I wanted to find where athletes are being produced among the three major sporting leagues in the United States - Major League Baseball, National Basketball Association, and National Football League. By performing EDA on the data and introducing a new ranking metric, I was able to get quality results from a large amount of data.

Data

Exploratory Data Analysis

I started with the EDA.ipynb notebook, as I wanted to get a general idea of the data I was using. I was able to put together some insightful visuals that gave me a clearer picture of the data set. My analysis included code from pandas, matplotlib, plotly, seaborn, and scipy.

Categorical Frequencies

Since my data was primarily categorical, I wanted to see what kind of data was popping up the most for each feature, so I created a table that showed the top 10 most frequent values for each feature.
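A minimal sketch of how such a frequency table could be built with pandas. The small dataframe stands in for AthleteDatabase.csv, and the helper name `top_values` and the exact column names are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd

# Toy stand-in for AthleteDatabase.csv; column names and rows are assumptions.
df = pd.DataFrame({
    "League": ["NFL", "NFL", "NBA", "MLB", "NFL", "NBA"],
    "College": ["Alabama", "Alabama", "Duke", "LSU", "LSU", "Duke"],
    "BirthPlace": ["Houston, TX", "Miami, FL", "Chicago, IL",
                   "Houston, TX", "Houston, TX", "Miami, FL"],
})

def top_values(frame, n=10):
    """Stack the n most frequent values of every column into one long table."""
    parts = []
    for col in frame.columns:
        counts = frame[col].value_counts().head(n)  # sorted, most frequent first
        parts.append(pd.DataFrame({
            "Feature": col,
            "Value": counts.index,
            "Count": counts.values,
        }))
    return pd.concat(parts, ignore_index=True)

top10 = top_values(df)
print(top10)
```

Keeping the result in long form (one row per feature and value) makes it easy to filter to a single feature or to pivot it into a side-by-side table later.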

Summary Statistics

Bar Charts

I started by looking at the league breakdown among Conference and Birth State. Using pyplot from matplotlib, I created some stacked bar charts.
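A stacked bar chart of this kind could be drawn roughly as follows. This is a sketch on made-up rows, assuming columns named `Conference` and `League`; the actual notebook may build its charts differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows; the real data comes from AthleteDatabase.csv.
df = pd.DataFrame({
    "League": ["NFL", "NFL", "NBA", "MLB", "NFL", "NBA", "MLB"],
    "Conference": ["SEC", "SEC", "ACC", "SEC", "Big Ten", "ACC", "Big Ten"],
})

# One row per conference, one column per league.
counts = pd.crosstab(df["Conference"], df["League"])

fig, ax = plt.subplots(figsize=(8, 4))
bottom = pd.Series(0, index=counts.index)  # running height of each stack
for league in counts.columns:
    ax.bar(counts.index, counts[league], bottom=bottom, label=league)
    bottom = bottom + counts[league]

ax.set_xlabel("Conference")
ax.set_ylabel("Athletes")
ax.set_title("Conference X League Breakdown")
ax.legend(title="League")
fig.tight_layout()
```

The same loop works for Birth State by swapping the column passed to `pd.crosstab`.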

Conference X League Breakdown

State X League Breakdown

Contingency Tables

To transform these graphs to see the actual data, I created some contingency tables. I sorted them by actual athlete production so that the top producers are first. To create these, I used the crosstab function from pandas.
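As an illustration, a contingency table sorted by total production can be sketched with `pd.crosstab` like this. The rows are invented and the `Total` column name is an assumption.

```python
import pandas as pd

# Toy rows standing in for the athlete data.
df = pd.DataFrame({
    "League": ["NFL", "NFL", "NBA", "MLB", "NFL", "NBA", "MLB"],
    "Conference": ["SEC", "SEC", "ACC", "SEC", "Big Ten", "ACC", "Big Ten"],
})

# Count athletes for every Conference/League pair.
table = pd.crosstab(df["Conference"], df["League"])

# Add a row total and put the biggest producers first.
table["Total"] = table.sum(axis=1)
table = table.sort_values("Total", ascending=False)
print(table)
```

`pd.crosstab` also accepts `margins=True`, but that adds a totals row as well as a totals column, so computing the total by hand keeps the sort simple.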

Conference X League Breakdown

State X League Breakdown

Interactive Charts

Using plotly, I was able to create an interactive map of the U.S. that makes it easier to take a deeper dive into the data.

Heat Map

United States Heat Map

Summary Statistics

I also used pandas to produce some summary statistics. I used the .describe() method to see the most frequent value for each sport among the "College" and "BirthPlace" features.
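For categorical columns, `.describe()` reports the count, the number of unique values, the most frequent value (`top`), and how often it occurs (`freq`). A sketch of applying it per league, on invented rows:

```python
import pandas as pd

# Toy rows; each league has one clearly most common college and birthplace.
df = pd.DataFrame({
    "League": ["NFL", "NFL", "NFL", "NBA", "NBA", "NBA", "MLB", "MLB", "MLB"],
    "College": ["Alabama", "Alabama", "LSU", "Duke", "Duke", "Kentucky",
                "LSU", "LSU", "Vanderbilt"],
    "BirthPlace": ["Houston, TX", "Houston, TX", "Miami, FL",
                   "Chicago, IL", "Chicago, IL", "Houston, TX",
                   "Miami, FL", "Miami, FL", "Chicago, IL"],
})

# One row per league; columns are (feature, statistic) pairs.
summary = df.groupby("League")[["College", "BirthPlace"]].describe()
print(summary)
```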

Summary Statistics

Chi-Squared

I also ran a Chi-Squared test on these three features - BirthPlace, College, Conference - against League to see what the association was like. All three tests indicated a strong association between the feature and League.

For BirthPlace, the statistic was 20,646 with a P-value of 5.77328e-169 and 15,284 Degrees of Freedom.

Conference had a statistic of 1,122 and a P-value of 6.309067e-205 to go with 46 Degrees of Freedom.

College had a statistic of 19,392 and a P-value that was reported as 0, with 2,922 Degrees of Freedom.
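A test like this can be sketched with `scipy.stats.chi2_contingency` applied to a contingency table. The rows below are invented, so the numbers will not match the results above. For a table with r feature categories and 3 leagues, the degrees of freedom are (r - 1) x (3 - 1); under that assumption, 46 degrees of freedom would correspond to 24 conference categories.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy rows standing in for the athlete data.
df = pd.DataFrame({
    "League": ["NFL"] * 6 + ["NBA"] * 6 + ["MLB"] * 6,
    "Conference": ["SEC", "SEC", "SEC", "SEC", "ACC", "Big Ten",
                   "ACC", "ACC", "ACC", "ACC", "SEC", "Big Ten",
                   "Big Ten", "Big Ten", "Big Ten", "SEC", "ACC", "SEC"],
})

# Observed counts: one row per conference, one column per league.
observed = pd.crosstab(df["Conference"], df["League"])

# Returns the test statistic, p-value, degrees of freedom, and expected counts.
stat, p_value, dof, expected = chi2_contingency(observed)
print(f"statistic={stat:.3f}, p-value={p_value:.4g}, dof={dof}")
```

With very large statistics the p-value can underflow and print as 0, so it is worth reading it alongside the statistic and degrees of freedom rather than on its own.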

New Score

In my NewScore.ipynb notebook, I wanted to introduce a new metric that would allow me to score and rank cities, colleges, and combinations of both to help deduce the best options for each professional league. I only scored the features "College" and "BirthPlace".

To start, I loaded the CSV and wrote some code that grouped by League and College, giving me the total count for each college in each league. I then added another column that divided each college's athlete count in a specific league by the total number of athletes in that league, giving me the percentage of that league made up by that college. From there, I sorted the dataframe by League and Percentage and ranked the rows within each league, giving me the top rank for each league and going down from there.
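The count, percentage, and rank steps described above can be sketched in pandas as follows. The rows are invented, and the column names `Count`, `Percentage`, and `Rank` as well as the tie-handling rule (`method="min"`) are assumptions.

```python
import pandas as pd

# Toy rows standing in for AthleteDatabase.csv.
df = pd.DataFrame({
    "League": ["NFL", "NFL", "NFL", "NFL", "NBA", "NBA", "NBA",
               "MLB", "MLB", "MLB"],
    "College": ["Alabama", "Alabama", "Alabama", "LSU", "Duke", "Duke",
                "Kentucky", "LSU", "LSU", "Vanderbilt"],
})

# Athletes per (League, College) pair.
ranked = df.groupby(["League", "College"]).size().reset_index(name="Count")

# Share of each league made up by each college, in percent.
league_totals = ranked.groupby("League")["Count"].transform("sum")
ranked["Percentage"] = ranked["Count"] / league_totals * 100

# Sort and rank within each league; rank 1 is the top producer.
ranked = ranked.sort_values(["League", "Percentage"], ascending=[True, False])
ranked["Rank"] = (
    ranked.groupby("League")["Percentage"]
    .rank(method="min", ascending=False)
    .astype(int)
)
print(ranked)
```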

College

Ranks & Percents

Then I filtered that dataframe to just give me all the schools that appeared in the top 10 of any league to be included for a graph.

I have experience using ggplot in R to put together faceted graphs, and the plotnine package offers the same approach in Python, so I used many imports from plotnine to achieve this.

Top 10 College

Red bars indicate that school is in the top 10 for that League, while gray bars indicate that school is in the top 10 for another sport, just not the one with the bar.
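The filtering step and the red/gray distinction can be sketched like this, assuming a ranked dataframe with `League`, `College`, and `Rank` columns. The helper name `top_n_subset` and the `InTopN` and `BarColor` columns are hypothetical, and the toy call uses `n=1` so the effect is visible on a few rows.

```python
import pandas as pd

# Toy per-league rankings (invented for illustration).
ranked = pd.DataFrame({
    "League": ["NFL", "NFL", "NFL", "NBA", "NBA", "NBA"],
    "College": ["Alabama", "Duke", "Rice", "Duke", "Alabama", "Rice"],
    "Rank": [1, 2, 3, 1, 2, 3],
})

def top_n_subset(frame, n=10):
    """Keep every row of any school that is top-n in at least one league."""
    top_schools = frame.loc[frame["Rank"] <= n, "College"].unique()
    subset = frame[frame["College"].isin(top_schools)].copy()
    # True when the school is top-n in this row's own league.
    subset["InTopN"] = subset["Rank"] <= n
    subset["BarColor"] = subset["InTopN"].map({True: "red", False: "gray"})
    return subset

subset = top_n_subset(ranked, n=1)
print(subset)
```

A flag column like this can then be mapped to the bar fill in a faceted plot, with one panel per league.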

Final College Rankings

In order to achieve a final ranking for all the colleges, I needed a pivot table that included a row for each college and its ranking in each League. From there, I added another column that summed the rankings to give me a score. Using principles from golf (lowest score wins), I then sorted by the new summed column and ranked the schools from there. These new scores indicate which schools were near the top for everything, or possibly near the top for one or two leagues while lacking in the other(s).
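A sketch of the pivot-and-sum scoring on invented rankings. How colleges that are missing from a league are handled is not stated above, so filling the gap with one place worse than that league's last rank is purely an assumption here, as are the `Score` and `FinalRank` column names.

```python
import pandas as pd

# Toy per-league rankings; Duke has no MLB entry.
ranked = pd.DataFrame({
    "League": ["NFL", "NFL", "NFL", "NBA", "NBA", "NBA", "MLB", "MLB"],
    "College": ["Alabama", "LSU", "Duke", "Duke", "LSU", "Alabama",
                "LSU", "Alabama"],
    "Rank": [1, 2, 3, 1, 2, 3, 1, 2],
})

# One row per college, one column per league.
pivot = ranked.pivot(index="College", columns="League", values="Rank")
league_cols = list(pivot.columns)

# Assumption: a missing rank counts as one worse than the league's last place.
pivot = pivot.fillna(pivot.max() + 1)

# Golf-style score: sum of ranks, lowest total wins.
pivot["Score"] = pivot[league_cols].sum(axis=1)
final = pivot.sort_values("Score")
final["FinalRank"] = final["Score"].rank(method="min").astype(int)
print(final)
```

The choice of fill value matters: a harsher penalty for missing leagues favors schools that appear everywhere, while a lenient one favors specialists.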

Final College Ranks

BirthPlace

I performed the same transformation using the BirthPlace feature and got the following results.

Ranks & Percents - BirthPlace

Top 10 - BirthPlace

Final BirthPlace Ranks

Finding Results

In my FindingResults.ipynb notebook, I wanted to get actionable insight by generating some BirthPlace-College pipelines.

I started by creating a function that I would use in a for loop to generate scores for every BirthPlace-College combination in each sport. Using pandas and itertools, I built the list of combinations needed to apply the function within the for loop.

Using the FinalCollegeRankings.csv and FinalBirthPlaceRankings.csv files I created in the NewScore notebook, I combined the BirthPlace and College rankings for each League into a new score and then re-ranked the list, giving me the combinations with the lowest scores in each League and overall.
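The combination scoring can be sketched with `itertools.product`. The layout of the two ranking tables (one row per name, one rank column per league), the function name `combo_score`, and all values below are assumptions for illustration; the real CSV files may be organized differently.

```python
import itertools
import pandas as pd

# Hypothetical stand-ins for FinalCollegeRankings.csv and
# FinalBirthPlaceRankings.csv: index = name, columns = per-league rank.
college_ranks = pd.DataFrame(
    {"NFL": [1, 2], "NBA": [2, 1]}, index=["Alabama", "Duke"]
)
birthplace_ranks = pd.DataFrame(
    {"NFL": [1, 2], "NBA": [1, 2]}, index=["Houston, TX", "Chicago, IL"]
)

def combo_score(league, birthplace, college):
    """Sum the BirthPlace rank and College rank for one league (lower is better)."""
    return int(birthplace_ranks.loc[birthplace, league]
               + college_ranks.loc[college, league])

rows = []
for league in college_ranks.columns:
    # Every BirthPlace-College pairing for this league.
    for birthplace, college in itertools.product(
        birthplace_ranks.index, college_ranks.index
    ):
        rows.append({
            "League": league,
            "BirthPlace": birthplace,
            "College": college,
            "Score": combo_score(league, birthplace, college),
        })

combos = pd.DataFrame(rows)
# Re-rank within each league; rank 1 is the lowest combined score.
combos["Rank"] = combos.groupby("League")["Score"].rank(method="min").astype(int)
combos = combos.sort_values(["League", "Rank"]).reset_index(drop=True)
print(combos)
```

Because the number of pairings grows as the product of the two lists, it may be worth restricting each list to its top-ranked entries before generating combinations on the full data.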

Top BirthPlace + College Combos

This repo will maintain the code and CSV file that I will be using alongside any extra files to be used.
