hassanmahmoodkhan / a-data-driven-analysis-of-global-income-distribution

This project aims to predict the income bracket of individuals based on a variety of features, and presents a holistic comparative analysis between multiple machine learning algorithms through hyperparameter optimization on a binary classification problem.

License: MIT License

Jupyter Notebook 97.99% Python 2.01%
classification hyperparameter-optimization income-prediction machine-learning k-nearest-neighbor-classifier random-forest support-vector-machine

A Data-Driven Analysis of Global Income Distribution: Modeling and Analysis

Income disparity is a significant issue that affects the global population at varying levels, and efforts have been made to curb it and improve the socio-economic fabric of our societies. This project aims to predict the income bracket of individuals based on a variety of features, and presents a holistic comparative analysis of multiple machine learning algorithms, tuned through hyperparameter optimization, on a binary classification problem. The model attempts to predict whether (Y/N) the income of an individual with given attributes (features) exceeds $50,000 per annum. Three supervised, non-parametric algorithms have been evaluated: K-Nearest Neighbor, Support Vector Machine, and Random Forest.

Dataset

The results are obtained on the Adult Data Set, available at the UCI Machine Learning Repository. The model is trained on 80% of the dataset and validated on the remaining 20%.
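
The 80/20 split described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the array shapes, the `random_state` value, and the use of `stratify=y` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Adult data (assumption): 100 rows, 3 numeric
# features, and an imbalanced binary target with 25% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 25 + [0] * 75)

# Hold out 20% of the rows for validation. stratify=y keeps the class ratio
# the same in both parts; the README does not say whether this was done.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs, which helps when comparing several models on the same held-out rows.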

The data set is described as having the following characteristics:

  • 48842 instances
  • 8 categorical attributes and 6 continuous
  • 3620 instances with missing values
  • Target variable: income (>50K, <=50K)

The feature set is as follows:

Feature set

Feature Selection and Engineering

The correlation matrix for the continuous features compared with the target variable is shown below:

Correlation Matrix

I have dropped the categorical feature 'education' from the dataset, since it carries the same information as 'education-num', with the latter imposing ordinality. The features 'capital-gain' and 'capital-loss' are highly skewed, so to reduce skewness I have taken the square root of every value of these features.
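
A minimal sketch of these two steps with pandas is given below. The three sample rows are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (values are made up, column names follow the dataset).
df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters"],
    "education-num": [13, 9, 14],
    "capital-gain": [0, 2174, 14084],
    "capital-loss": [0, 0, 1902],
})

# 'education' duplicates 'education-num', so only the ordinal version is kept.
df = df.drop(columns=["education"])

# Square-root transform to compress the long right tail of both features.
for col in ["capital-gain", "capital-loss"]:
    df[col] = np.sqrt(df[col])

print(df)
```

The square root is defined at zero, which matters here because most rows have no capital gain or loss at all.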

Data Preprocessing

I have employed the following steps to transform the dataset into a more representative form:

  1. Missing Data Imputation
  2. Label Encoding
  3. One-Hot Encoding
  4. Feature Scaling
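
The four steps above could be wired together with scikit-learn roughly as shown below. This is a sketch under assumptions, not the notebook's code: the README does not state the imputation strategy, which columns are label-encoded, or which scaler is used, so most-frequent imputation, label encoding of the target only, and `StandardScaler` are illustrative choices, and the sample rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Made-up sample rows; 'workclass' contains one missing value.
X = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours-per-week": [40, 13, 40, 40],
    "workclass": pd.Series(["State-gov", np.nan, "Private", "Private"], dtype=object),
})
y = pd.Series(["<=50K", ">50K", "<=50K", ">50K"])

# Label encoding (assumed to apply to the target): "<=50K" -> 0, ">50K" -> 1.
y_encoded = LabelEncoder().fit_transform(y)

categorical = ["workclass"]
numeric = ["age", "hours-per-week"]

preprocess = ColumnTransformer([
    # Steps 1 and 3: impute missing categories, then one-hot encode them.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
    # Step 4: scale continuous features to zero mean and unit variance.
    ("num", StandardScaler(), numeric),
])

X_ready = preprocess.fit_transform(X)
print(X_ready.shape, y_encoded)
```

Scaling matters most for K-Nearest Neighbor and Support Vector Machine, which both rely on distances between samples; Random Forest is largely insensitive to it.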

Model Building & Training

There are three machine learning algorithms employed for this project:

  1. K-Nearest Neighbor
  2. Support Vector Machine
  3. Random Forest

I have employed stratified k-fold cross validation, a variation of k-fold cross validation that ensures each fold has approximately the same percentage of samples from each target class, thus addressing the dataset imbalance to an extent. Evaluating on held-out folds also helps to detect overfitting and gives a better estimate of how well a model generalizes. Furthermore, the performance of a model depends significantly on the values of its hyperparameters. I have used GridSearchCV to search all combinations of hyperparameter values in a predefined grid and determine the optimal values for each of the three models.
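
The combination of stratified k-fold cross validation and grid search can be sketched as below for the K-Nearest Neighbor model. The synthetic data, the number of folds, and the parameter grid are assumptions chosen to keep the example small; the grids used in the notebook are not given in this README.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced binary problem standing in for the Adult data (assumption).
X, y = make_classification(
    n_samples=200, n_features=6, weights=[0.75, 0.25], random_state=0
)

# Each of the 5 folds keeps roughly the same class ratio as the full set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical grid: 3 x 2 = 6 candidate settings, each scored on all 5 folds.
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other two models by swapping in `SVC` or `RandomForestClassifier` and a grid over their own hyperparameters.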

Model Evaluation

Each model has been assessed based on these evaluation metrics:

  1. Accuracy
  2. Confusion Matrix
  3. Receiver Operating Characteristic (ROC) Curve
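
As a small illustration of how these three metrics are computed with scikit-learn, consider the sketch below. The labels, scores, and the 0.5 decision threshold are made up for the example and are not results from this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve

# Made-up ground truth and predicted probabilities of the positive class (>50K).
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# Hard predictions at an assumed threshold of 0.5.
y_pred = [int(s >= 0.5) for s in y_score]

acc = accuracy_score(y_true, y_pred)               # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)              # rows: true class, columns: predicted class
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve

print(acc)
print(cm)
print(auc)
```

Accuracy and the confusion matrix depend on the chosen threshold, while the ROC curve and its AUC summarize performance across all thresholds, which makes them more informative on an imbalanced target.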

A comparison of the predictive accuracy obtained here with values reported in the literature is presented in the table below:

Comparative Analysis

The Random Forest classifier is the best performer of the three classifiers, with the highest classification accuracy of 86.70% and an AUC score of 0.917.
