vanshika97 / ml_incomeprediction

TCD ML Comp. 2019/20 - Income Prediction (Ind.)

Jupyter Notebook 100.00%
machine-learning trinity-college-dublin xgboost-python target-encoding polynomial-regression income-prediction rmse-score kaggle outliers

ml_incomeprediction's Introduction

Machine Learning Comp. 2019/20 - Income Prediction

This repository contains my solution to the 2019/20 income prediction competition for Trinity College Dublin's CS7CS4 Machine Learning module.

Requirements

Python 3 (Jupyter Notebook)

Problem Statement

Train basic machine learning models on a labelled dataset to predict the target feature 'Income in EUR' for a set of unlabelled data.

Approach

I started by exploring the training data. I used the pandas describe() function for summary statistics and the seaborn visualization library to plot any patterns in the data.
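A minimal sketch of this first look at the data, assuming a small hypothetical DataFrame in place of the competition's training file. Only 'Income in EUR' is a real column name from the problem statement; the other columns, the values, and the helper plot_income_distribution are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the training data; only 'Income in EUR'
# is a column name taken from the problem statement.
train = pd.DataFrame({
    "Age": [23, 35, 47, 52, 29, 61],
    "Profession": ["clerk", "engineer", "engineer", None, "clerk", "doctor"],
    "Income in EUR": [21000.0, 54000.0, 61000.0, 48000.0, 25000.0, 250000.0],
})

# Summary statistics for the numeric columns (count, mean, std, quartiles).
summary = train.describe()
print(summary)

# Missing values per column, useful before deciding how to impute.
missing = train.isna().sum()
print(missing)


def plot_income_distribution(df):
    """Draw a histogram of the target with seaborn (imported lazily)."""
    import seaborn as sns  # optional plotting dependency

    return sns.histplot(df["Income in EUR"], bins=20)
```

Calling plot_income_distribution(train) in a notebook cell would render the histogram; a long right tail there is the kind of pattern that motivates the outlier check described later.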

Data Cleaning

  • NULL values: I filled all missing values with the mode of the respective column. The training set was not large, so dropping rows with missing values could have discarded important data.
  • Encoding: Categorical features were transformed using target encoding, because one-hot encoding sharply increased the number of columns and made it difficult for the models to produce good results.
  • Noise:
  • Outlier Detection: I identified outliers at the 0.01 quantile using the quantile function. However, my experience in this assignment showed that the outliers carry important information, and discarding them caused the model to overfit excessively.
  • Feature Reduction/Selection: I dropped 2 features, 'Hair Color' and 'Gender', as they were of least importance. Dropping them hardly affected my RMSE score, but it did speed up model training.
  • Scaling/Normalisation:
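The cleaning steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the toy DataFrame and its 'Profession' and 'Age' columns are hypothetical, target encoding is taken to mean "replace each category with the mean income of that category", and the outlier rule is read as flagging values below the 0.01 quantile or above the 0.99 quantile.

```python
import pandas as pd

TARGET = "Income in EUR"

# Hypothetical toy data; only the target column name comes from the README.
train = pd.DataFrame({
    "Profession": ["clerk", "engineer", None, "engineer", "clerk", "engineer"],
    "Age": [23.0, 35.0, 47.0, None, 29.0, 35.0],
    TARGET: [20000.0, 50000.0, 40000.0, 60000.0, 30000.0, 70000.0],
})

# 1. Fill missing values with the mode (most frequent value) of each column.
feature_cols = ["Profession", "Age"]
modes = {col: train[col].mode().iloc[0] for col in feature_cols}
filled = train.fillna(value=modes)

# 2. Target-encode the categorical column: each category becomes the mean
#    target value observed for that category in the training data.
category_means = filled.groupby("Profession")[TARGET].mean()
global_mean = filled[TARGET].mean()
filled["Profession_te"] = filled["Profession"].map(category_means)

# Categories never seen in training fall back to the global mean.
new_professions = pd.Series(["clerk", "pilot"])
encoded_new = new_professions.map(category_means).fillna(global_mean)

# 3. Flag (rather than drop) rows outside the assumed 0.01 / 0.99 quantiles.
lower, upper = filled[TARGET].quantile([0.01, 0.99])
is_outlier = ~filled[TARGET].between(lower, upper)

print(filled)
print(encoded_new.tolist())
print(int(is_outlier.sum()), "rows flagged as outliers")
```

Flagging outliers instead of deleting them keeps the option of training with them, which matches the observation that they turned out to matter. When target encoding is used with cross-validation, the category means are usually computed inside each training fold so that the validation targets do not leak into the features.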

Model Selection

  • Linear Regression: This was the first model I trained. I used it with one-hot encoded data and it gave an RMSE of ~79,000.
  • Polynomial Regression: I tried polynomial regression to improve on the linear regression model. It did not provide any significant improvement.
  • Random Forest: This was the best performing model on one-hot encoded data. It reduced my RMSE from 79,000 to 76,000. Later, with target encoding, it again performed well but not better than XGBoost.
  • SVM: The Support Vector Machine was comparatively slow to train and did not perform well with either of the two encoding methods.
  • XGBoost: This was by far the best fit for the problem. It was fast to train and gave exceptional training results, reducing my RMSE from 76,000 to 60,000 with target encoding.
  • Decision Trees: A standalone decision tree did not help with this problem, even though the tree ensembles (Random Forest and XGBoost) performed well.
  • KNN: I believe the results for this model were skewed, even after GridSearchCV optimization, because the data is synthetic.
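A comparison like the one above can be set up by scoring each candidate with the same cross-validated RMSE. The sketch below assumes scikit-learn and uses synthetic data generated on the spot, so its numbers have nothing to do with the RMSE values reported in this README; the feature matrix, coefficients, and model settings are all illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only, not the competition data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 30000 + 8000 * X[:, 0] + 5000 * X[:, 1] ** 2 + rng.normal(scale=1000, size=200)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

rmse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so RMSE is exposed as a negative value.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    rmse[name] = -scores.mean()
    print(f"{name}: mean RMSE over 5 folds = {rmse[name]:.0f}")
```

Using one scoring function and one fold count for every model keeps the comparison fair; other regressors can be added to the models dictionary without changing the loop.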

Optimization Strategies

I used cross-validation and grid search to limit overfitting and to find the best parameter combination for the problem at hand. I also experimented with several models. The best results came from transforming the data with PolynomialFeatures and training XGBoost on the transformed features.
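A rough sketch of that final setup, assuming scikit-learn's Pipeline and GridSearchCV. To keep the example dependent on scikit-learn alone, GradientBoostingRegressor stands in for XGBoost here; xgboost.XGBRegressor follows the same estimator interface and should be usable in its place if xgboost is installed. The data, the parameter grid, and the fold count are illustrative assumptions, not the values used in the notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with an interaction term (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = 40000 + 6000 * X[:, 0] * X[:, 1] + 3000 * X[:, 2] + rng.normal(scale=500, size=150)

pipeline = Pipeline([
    # Expand the features with squares and pairwise products.
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    # Stand-in for XGBoost; swap in xgboost.XGBRegressor if available.
    ("model", GradientBoostingRegressor(random_state=0)),
])

# Hypothetical grid; step names are prefixed with "<step>__".
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [2, 3],
}

search = GridSearchCV(
    pipeline, param_grid, cv=3, scoring="neg_root_mean_squared_error"
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated RMSE:", -search.best_score_)
```

Putting PolynomialFeatures inside the pipeline means the expansion is refitted on each training fold, so the cross-validated score reflects the whole transformation plus model rather than the model alone.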
