kurteulau / pump-it-up

Predicts operational status of wells as part of Driven Data's Pump It Up data science competition.

License: MIT License

Jupyter Notebook 99.23% Python 0.39% HTML 0.34% CSS 0.04%

Pump It Up

Problem Description

Predict the operating condition of 14,850 unlabelled waterpoints given a labelled dataset of 54,900 waterpoints. Features in the labelled dataset include the location, installer, construction year, and type of well. The provided dataset contains missing and incorrect values. This project is an entry in the Pump It Up competition held by Driven Data.

A full description from Driven Data can be found here.

Datasets

The competition rules prohibit third parties from publishing the data. You can find the problem description at Driven Data using this link; once you log in or create an account, you can access the datasets.

Workflow of notebooks

  1. The EDA_clean.ipynb notebook contains plots and steps taken to clean the data. Some values are also imputed. Please note that several functions used in this notebook have been stored in imputing_functions.py in order to declutter the notebook. The data is then exported as a .csv file.

  2. train_catboost.ipynb and train_rf.ipynb import the .csv file from EDA_clean.ipynb and train a CatBoost and Random Forest model, respectively.

  3. eval_catboost.ipynb and eval_rf.ipynb export a .csv file in the format required by the competition. They both require that the evaluation dataset provided by Driven Data be cleaned with EDA_clean.ipynb. The evaluation dataset does not include labels, as these are stored internally at Driven Data and are used to calculate the accuracy of submitted models.
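As a rough illustration of the export in step 3, the sketch below writes predictions to a two-column CSV. It is not the code from the eval notebooks: the column names `id` and `status_group`, the label strings, the helper `build_submission`, and the sample ids are all assumptions made for the example, so check them against the submission format file supplied by Driven Data.

```python
import io

import pandas as pd


def build_submission(ids, predictions):
    """Return a two-column frame: one row per waterpoint id with its predicted label.

    Hypothetical helper; the column names are assumed, not taken from the notebooks.
    """
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"id": list(ids), "status_group": list(predictions)})


# Made-up ids and labels, for illustration only.
submission = build_submission(
    [101, 102, 103],
    ["functional", "non functional", "functional needs repair"],
)

# Write to an in-memory buffer here; pass a file path instead to create a real file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters because an extra unnamed index column would change the shape of the file that the competition expects.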



Notes

Currently, the Random Forest model performs better on the training dataset but significantly worse during evaluation compared to the CatBoost model.

I have used GridSearchCV on both Random Forest and CatBoost models to tune hyperparameters but did not achieve an increase in accuracy.
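For readers unfamiliar with this kind of tuning, here is a minimal sketch of a grid search over a Random Forest using scikit-learn. It runs on synthetic data, not the competition data, and the parameter grid, the three-class setup, and the split sizes are assumptions chosen only to keep the example small. It also prints training accuracy next to held-out accuracy, which is one simple way to see the kind of gap described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the cleaned data (three classes is an assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Stratify so each split keeps roughly the same class proportions.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Deliberately tiny, hypothetical grid; a real search would cover more values.
param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

# The refitted best model is scored on data it has seen and on data it has not.
train_acc = search.score(X_train, y_train)
hold_acc = search.score(X_hold, y_hold)
print(search.best_params_)
print(f"cv={search.best_score_:.3f} train={train_acc:.3f} held-out={hold_acc:.3f}")
```

A large difference between the training figure and the held-out figure usually points to overfitting, which limiting `max_depth` or raising `min_samples_leaf` can sometimes reduce.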

Next steps (when I have the time) could include:

  • better imputation of missing data, especially construction year, population, and other numerical features
  • using regular expressions to match funders and installers rather than just forcing all to lower case and removing double spaces
  • assessing why the Random Forest model does well in training and then loses around 18% accuracy during evaluation, despite the use of cross-validation and stratification during train_test_split
  • incorporating an ensemble method to increase accuracy
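The regular-expression idea in the list above could look something like the sketch below. It is only an illustration of the approach, not code from this repository: the functions `normalise_name` and `canonical_name`, the alias patterns, and the sample strings are all hypothetical, and a real alias table would have to be built by inspecting the actual funder and installer values.

```python
import re


def normalise_name(raw):
    """Lower-case, turn punctuation into spaces, and collapse repeated whitespace."""
    text = raw.lower()
    # Any run of characters other than a-z or 0-9 becomes a single space.
    # Note: this also drops accented letters, which may not suit every dataset.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


# Hypothetical alias table: each pattern maps spelling variants to one canonical name.
ALIASES = [
    (re.compile(r"^(gov|govt|government)( of tanzania)?$"), "government"),
    (re.compile(r"^world ?vision$"), "world vision"),
]


def canonical_name(raw):
    """Return the canonical name for a known variant, else the normalised input."""
    name = normalise_name(raw)
    for pattern, canonical in ALIASES:
        if pattern.match(name):
            return canonical
    return name


# Made-up inputs showing variants collapsing to the same value.
for sample in ["Govt.", "Government Of  Tanzania", "WorldVision", "  DANIDA "]:
    print(repr(sample), "->", repr(canonical_name(sample)))
```

Normalising first and matching second keeps the patterns short, since they never need to handle case, punctuation, or stray spaces.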
