Predicting-Adult-Census-Income

This repo explores various machine learning methods to predict income for a test set by training models on known census data.

Formatting the dataset

  • Before tackling the problem itself, I used pandas to construct the training dataframe from 'train.data' and 'train.demo', and the test dataframe from 'test.data'.

  • This logic lives in the function formatDataset in the code. There, I used pandas to read the data and then converted the non-numeric columns to numeric values using discrete mappings so they could be used for classification; this is essentially categorical encoding. I also converted the integer values to floats to make the whole dataset uniform. A rough sketch of this step is shown after this list.

  • The function returns X, the original dataset, and y (where applicable).
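Below is a rough sketch of what this formatting step could look like. The column names, label column, and mappings here are illustrative assumptions, not the exact ones used in formatDataset:

```python
import pandas as pd

def format_dataset(filename, label_col='income'):
    """Read a data file and return X, the original dataframe, and y (None if absent).

    The label column name and the file layout are assumptions for this sketch.
    """
    original = pd.read_csv(filename)
    df = original.copy()

    # Categorical encoding: map each non-numeric column to discrete integer codes.
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].astype('category').cat.codes

    # Cast every column to float so the whole dataset is uniform.
    df = df.astype(float)

    if label_col in df.columns:
        y = df[label_col]
        X = df.drop(columns=[label_col])
    else:
        y = None
        X = df
    return X, original, y
```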

Approach

Since the y variable wasn't available for the test dataset, I came up with two options to test my algorithms before submission: split the training data using scikit-learn's train_test_split, or use K-Fold cross-validation.

While cross-validation seemed the more obvious and better choice, I still tried train_test_split.

1. Test-Train Split

  • I tested this technique on various models - RandomForestClassifier, DecisionTreeClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier, KNeighborsClassifier and ExtraTreeClassifier.

Outcome

  • I used the same method to test all of the algorithms: a for loop sweeps through different values of each hyperparameter to tune it to its best value, and accuracy scores are then calculated to compare any differences or improvements (see the sketch after this list).

  • However, with this method I was only able to get slightly above 83% accuracy on the held-out test split (which was with RandomForestClassifier).
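A minimal sketch of this train_test_split workflow, assuming X and y have already been produced by the formatting step above; the model and candidate values below are illustrative, not the exact sweep used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out part of the labelled training data, since the real test labels are unavailable.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)

# Sweep one hyperparameter and compare validation accuracies.
for n_estimators in [50, 100, 200, 400]:          # illustrative candidate values
    model = RandomForestClassifier(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    print(f'n_estimators={n_estimators}: accuracy={acc:.4f}')
```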

2. K-Fold Cross Validation

  • This method certainly improved my accuracies: I was able to cross 87% accuracy with more than one model. Using scikit-learn's documentation on cross-validation (https://scikit-learn.org/stable/modules/cross_validation.html), I optimized the algorithms for their best accuracies.
  • A split of 5 folds (cv=5) seemed to produce the best accuracies.
  • For every algorithm, I tuned the hyperparameters by running each one through a for loop over a range of values, then optimizing each carefully, sometimes to 9 decimal places.
  • Each time a hyperparameter was optimized, the scores went up; I then repeated the same process for all of the hyperparameters.

I used this method to test RandomForestClassifier, DecisionTreeClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier, KNeighborsClassifier and ExtraTreeClassifier.
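A minimal sketch of this cross-validation comparison, following the linked scikit-learn documentation; the models are shown with default hyperparameters here just to illustrate the pattern:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, BaggingClassifier)
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# X, y: the formatted training features and labels (e.g. from the formatting step above).
models = {
    'RandomForest': RandomForestClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'GradientBoosting': GradientBoostingClassifier(),
    'Bagging': BaggingClassifier(),
    'KNeighbors': KNeighborsClassifier(),
    'ExtraTree': ExtraTreeClassifier(),
}

# 5-fold cross-validation (cv=5) on the full training set for each model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f'{name}: mean accuracy = {scores.mean():.4f}')
```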

Outcome

  • Ultimately, the GradientBoostingClassifier produced the best accuracies.
  • I tuned the classifier extensively, checking the accuracy after every change to a hyperparameter value (a sketch of this tuning loop appears after the note below).

NOTE: I intentionally did not set the random_state parameter. While this changed the accuracies by a few decimal points on every run, settling on a single value would have been nearly impossible since it can literally be any integer. This also means the accuracies I obtained cannot be replicated exactly, but they can be reproduced to within a few decimal places.
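As an illustration only, here is a sketch of how one GradientBoostingClassifier hyperparameter could be tuned this way; the candidate values are assumptions, and random_state is deliberately left unset, per the note above:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

best_score, best_lr = 0.0, None
# Sweep one hyperparameter at a time and keep the value with the best CV score;
# the same pattern is then repeated for the other hyperparameters.
for lr in [0.05, 0.1, 0.15, 0.2]:                        # illustrative candidate values
    model = GradientBoostingClassifier(learning_rate=lr)  # random_state left unset, per the note
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_lr = score, lr
print(f'best learning_rate = {best_lr}, cross-validation accuracy = {best_score:.4f}')
```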

Reasoning for outcome

  • It quickly became obvious that, with such a variety of feature ranges and with each feature having little or no dependence on the others, RandomForestClassifier and DecisionTreeClassifier would be the top choices.
  • As tuning a single classifier saturated, I realized that certain features would require ensemble methods to get better overall accuracies.
  • Gradient Boosting and AdaBoost both produced good accuracies, with AdaBoost reaching cross-validation scores around 86% and GradientBoosting around 87%.

What you need to run:

  • Python 3
  • scikit-learn
  • pandas

How to run:

  • Clone the repo to your local directory
  • Verify that all the files are in the same folder and there are no subfolders
  • Open main.py and run

References:

David Quigley, CU Boulder
