
titanic-kaggle

From the Kaggle competition description: "The sinking of the RMS Titanic is one of the most infamous shipwrecks in history... In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy."

This is an introductory Kaggle challenge where the goal is to predict which passengers survived the sinking of the Titanic based on a set of attributes of the passengers, including name, gender, age, and more.

Feature engineering

After taking an initial stab at feature engineering, I took some ideas from Megan Risdal and piplimoon. One of the fun parts of this challenge was seeing all of the creative ideas that others have come up with. To summarise what I did (a rough sketch of a few of these steps follows the list):

  • Extracted a person's title (Mr, Mrs, Miss, Col, etc.) from the person's name
  • Created a family size feature by adding up the number of siblings/spouses and parents/children on board
  • Created a family variable from people's last names and their family size - since non-related people can share last names, last name + family size should be a good proxy for a specific family
  • Used the ticket feature (where multiple people can share a ticket) only for cases where a ticket was shared by two or more people across the training and test sets (this results in a bit of bleeding between the training and test sets)
  • Figured out which deck a person's cabin was on from the cabin feature
  • Used one-hot encoding to create dummies for categorical features
  • Used the fancyimpute package to impute missing values using MICE
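For illustration, below is a minimal pandas sketch of a few of these steps (title extraction, family size, the last-name + family-size proxy, and deck), assuming the standard Kaggle Titanic column names (Name, SibSp, Parch, Cabin, Sex, Embarked). The derived column names (Title, FamilySize, Family, Deck) are illustrative, not necessarily those used in this repository, and the MICE imputation step is omitted.

    import pandas as pd

    # Assumes train.csv / test.csv with the standard Kaggle Titanic columns
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    combined = pd.concat([train, test], sort=False)

    # Title: the word before the first period, e.g. 'Braund, Mr. Owen Harris' -> 'Mr'
    combined['Title'] = combined['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

    # Family size: number of siblings/spouses plus parents/children on board
    combined['FamilySize'] = combined['SibSp'] + combined['Parch']

    # Family proxy: last name + family size, to separate unrelated people who share a last name
    combined['LastName'] = combined['Name'].str.split(',').str[0]
    combined['Family'] = combined['LastName'] + '_' + combined['FamilySize'].astype(str)

    # Deck: the first letter of the cabin, where a cabin is recorded
    combined['Deck'] = combined['Cabin'].str[0]

    # One-hot encode the categorical features
    dummies = pd.get_dummies(combined[['Title', 'Deck', 'Sex', 'Embarked']], dummy_na=True)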

Modeling

I used 5-fold grid search to choose hyperparameters and perform model selection. I tried logistic regression, KNN, random forest, SVM, and gradient boosted trees; all performed reasonably well (accuracy in the ~ .78 - .82 range) except KNN. My best public leaderboard score, ~ .825, came from a majority-voting ensemble of the four reasonably-performing models, with the random forest given 2 of the 5 votes.
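As a rough sketch of this approach in scikit-learn (not the repository's actual code), the snippet below shows a 5-fold grid search for one model and a hard-voting ensemble in which the random forest is weighted so that it gets 2 of the 5 votes. X_train, y_train, X_test, and the parameter grid are placeholders.

    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  VotingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # 5-fold grid search to tune one of the models (illustrative parameter grid)
    rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                             param_grid={'n_estimators': [200, 500],
                                         'max_depth': [4, 6, None]},
                             cv=5, scoring='accuracy')
    rf_search.fit(X_train, y_train)

    # Majority-voting ensemble: the random forest's weight of 2 gives it
    # 2 of the 5 total votes
    ensemble = VotingClassifier(
        estimators=[('logit', LogisticRegression()),
                    ('rf', rf_search.best_estimator_),
                    ('svm', SVC()),
                    ('gbt', GradientBoostingClassifier())],
        voting='hard',
        weights=[1, 2, 1, 1])
    ensemble.fit(X_train, y_train)
    predictions = ensemble.predict(X_test)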

To run

Uses Python 2.7; tested on Ubuntu 14.04 LTS.

python project.py --name <FILE-NAME>

Arguments (a sketch of the argument parsing follows the list):

  • --name required, name of the resulting .csv file to create
  • --findhyperparameters optional flag. Without it, the script uses pre-optimized hyperparameters; with it, grid search is run to optimize the hyperparameters, which takes roughly 1 - 1.5 hours depending on your machine.
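A hypothetical sketch of how these arguments could be parsed with argparse (the actual project.py may differ):

    import argparse

    parser = argparse.ArgumentParser(description='Titanic Kaggle submission generator')
    parser.add_argument('--name', required=True,
                        help='name of the resulting .csv file to create')
    parser.add_argument('--findhyperparameters', action='store_true',
                        help='run grid search to optimize hyperparameters '
                             'instead of using the pre-optimized ones')
    args = parser.parse_args()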
