
42elenz / disasterpipeline


This project forked from florianbindereif/disasterpipeline


Project from Udacity nano degree


disasterpipeline's Introduction

Project Description

The goal of this project was to build a model to classify Twitter messages that are sent during disasters. Messages are categorized into 35 pre-defined categories, such as Aid Related, Medical Help, and Search And Rescue. As a result, the messages can be routed to the appropriate disaster relief agencies.

The data set was provided by Figure Eight and contains real messages that were sent during disaster events. Steps included building a basic ETL and Machine Learning pipeline as well as a web app. Text preprocessing was done using tokenizing and lemmatizing. The multi-label classifier was built using the pipeline features of Scikit Learn. Grid Search Cross Validation was used to tune the hyperparameters. The web app is able to process text messages, classify them according to the model, and display statistics using graphical plots.
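The steps above (text preprocessing, a multi-label pipeline, and a grid search) can be sketched roughly as follows. This is a minimal illustration and not the project's actual training script: the regex tokenizer is a stand-in for the real tokenizing and lemmatizing step, the choice of RandomForestClassifier, the parameter grid, the toy messages, and the two label columns are all assumptions made for the example.

```python
import re

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def tokenize(text):
    """Lowercase and split into alphanumeric tokens.

    Stand-in only: the project describes tokenizing plus lemmatizing,
    which this simplified function does not reproduce.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_model():
    """Build a text pipeline wrapped in a small grid search (assumed grid)."""
    pipeline = Pipeline([
        ("vect", CountVectorizer(tokenizer=tokenize)),
        ("tfidf", TfidfTransformer()),
        # One classifier per category column (multi-label setup).
        ("clf", MultiOutputClassifier(
            RandomForestClassifier(n_estimators=10, random_state=0))),
    ])
    param_grid = {"clf__estimator__n_estimators": [5, 10]}  # hypothetical grid
    return GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_micro")


# Made-up toy data; columns are [aid_related, medical_help].
MESSAGES = [
    "we need water and food",
    "people are injured and need a doctor",
    "send food and blankets please",
    "my brother needs medical help",
    "water supplies are running out",
    "the hospital has no medicine",
    "we are hungry send rice",
    "urgent need for a nurse and bandages",
]
LABELS = np.array([
    [1, 0], [1, 1], [1, 0], [1, 1],
    [1, 0], [0, 1], [1, 0], [1, 1],
])

search = build_model()
search.fit(MESSAGES, LABELS)
print(search.best_params_)
print(search.predict(["we need a doctor"]))
```

Wrapping the whole pipeline in the grid search means the vectorizer is refit inside every cross-validation fold, so vocabulary from the held-out fold does not leak into training.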


File Descriptions

  • data/process_data.py - The ETL script
  • models/train_classifier.py - The ML script
  • app/run.py - The server for the website
  • app/templates - The website HTML/CSS files
  • data/*.csv - The dataset

Installation

pip install -r requirements.txt

Instructions:

  1. Run the following commands in the project's root directory to set up your database and model.
    • To run the ETL pipeline that cleans the data and stores it in a database: python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
    • To run the ML pipeline that trains the classifier and saves it: python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
  2. Run the following command in the app's directory to start the web app: python run.py
  3. Go to http://0.0.0.0:3001/


Further Discussion:

Initial exploration of the data showed an imbalance in the distributions of the classes.

This leads to two main problems:

  1. Evaluation of the model outcomes

  2. Training of our model

  1. Evaluation: Accuracy cannot be used to evaluate model predictions on imbalanced classes, because it does not distinguish between the numbers of correctly classified examples of different classes. For an imbalanced dataset, the F1 score is a more appropriate metric. It is the harmonic mean of precision and recall. If the classifier predicts the minority class but the predictions are wrong, false positives increase, so precision and therefore the F1 score will be low. Likewise, if the classifier identifies the minority class poorly, i.e. more examples of this class are wrongly predicted as the majority class, false negatives increase, so recall and the F1 score will be low. The F1 score only increases if both the number and the quality of the predictions improve. Therefore, I used the F1 score to evaluate the model's performance.

  2. Training: The minority class is harder to predict because, by definition, there are few examples of it. This makes it more challenging for a model to learn the characteristics of examples from this class and to differentiate them from the majority classes.
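The difference between accuracy and the F1 score on imbalanced data can be illustrated with a small worked example. The numbers below are invented for illustration (5 positive examples out of 100) and are not results from this project; the helper names are likewise hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented data: 5 minority-class examples followed by 95 majority-class examples.
y_true = [1] * 5 + [0] * 95

# A classifier that never predicts the minority class.
always_negative = [0] * 100

# A classifier that finds 3 of the 5 positives and raises 2 false alarms.
some_recall = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

print(accuracy(y_true, always_negative), f1(y_true, always_negative))
print(accuracy(y_true, some_recall), f1(y_true, some_recall))
```

The first classifier reaches 95% accuracy while being useless for the minority class, which its F1 score of zero makes visible. The second classifier's accuracy is barely different, but its F1 score reflects that it actually detects some minority examples.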

Acknowledgements

Udacity Data Scientist Nanodegree Program

This project was done together with Florian Binderreif.
