andrewrreed / cml_amp_canceled_flight_prediction

This project forked from cloudera/cml_amp_canceled_flight_prediction

Perform analytics on a large airline dataset with Spark and build an XGBoost model to predict flight cancellations.

License: Apache License 2.0

CSS 2.59% HTML 3.48% JavaScript 19.72% Shell 0.04% Python 31.16% Jupyter Notebook 43.01%

Canceled Flight Prediction

This project is a Cloudera Machine Learning (CML) Applied Machine Learning Prototype and has all the code and data needed to deploy an end-to-end machine learning project on a running CML instance.

The primary goal of this repository is to build a gradient boosted (XGBoost) classification model to predict the likelihood of a flight being canceled based on years of historical records. To achieve that goal, this project demonstrates the end-to-end processing needed to take two large, raw datasets and transform them into a clean, unified dataset for model training and inference using Spark on CML. Additionally, this project deploys a hosted model and front-end application to allow users to interact with the trained model.

The two datasets used in this project come from Kaggle and the Bureau of Transportation Statistics.

Project Structure

The project is organized with the following folder structure:

.
├── code/           # Backend scripts and notebooks needed to create project artifacts
├── data/           # A post-processed sample of the full dataset used for model training
├── app/            # Assets needed to support the front end application
├── images/         # A collection of images referenced in project docs
├── models/         # Directory to hold trained models
├── cdsw-build.sh   # Shell script used to build environment for experiments and models
├── README.md
├── LICENSE.txt
└── requirements.txt

By following the notebooks, scripts, and documentation in the code directory, you will understand how to perform similar tasks on CML, as well as how to use the platform's major features to your advantage. These features include:

  • Data ingestion, cleaning, and processing with Spark
  • Hive table creation and querying
  • Streamlined model development
  • Point-and-click model deployment to a RESTful API endpoint
  • Application hosting for deploying frontend ML applications
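As an illustration of the RESTful endpoint mentioned above, here is a minimal sketch of how a client might prepare a prediction request. It assumes the deployed model accepts a JSON envelope containing an access key and a `request` object; the URL, access key, and feature names below are placeholders and not taken from this project, so check the sample request shown on your model's page in CML before relying on it.

```python
import json
import urllib.request


def build_prediction_request(endpoint_url, access_key, features):
    """Build (but do not send) a POST request for a deployed model.

    The {"accessKey": ..., "request": ...} envelope is an assumption about
    the endpoint's expected format, not something this README specifies.
    """
    payload = {"accessKey": access_key, "request": features}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_prediction_request(
    "https://modelservice.example.com/model",  # placeholder URL
    "YOUR_ACCESS_KEY",  # placeholder access key
    {"carrier": "AA", "origin": "JFK", "dest": "LAX"},  # hypothetical features
)
# Sending is left to the caller, e.g. urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the sketch runnable without a live workspace.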

We focus on working within CML and using what it has to offer, glossing over the details that are simply standard data science. In particular, we pay special attention to data ingestion and processing at scale with Spark.

Deploying on CML

There are three ways to launch this prototype on CML:

  1. From Prototype Catalog - Navigate to the Prototype Catalog on a CML workspace, select the "Airline Delay Prediction" tile, click "Launch as Project", click "Configure Project"

  2. As ML Prototype - In a CML workspace, click "New Project", add a Project Name, select "ML Prototype" as the Initial Setup option, copy in the repo URL, click "Create Project", click "Configure Project"

  3. Manual Setup - In a CML workspace, click "New Project", add a Project Name, select "Git" as the Initial Setup option, copy in the repo URL, click "Create Project". Then, follow the steps listed in this document in order.

If you deploy this project as an Applied ML Prototype (AMP) (options 1 or 2 above), you must specify whether to run it with STORAGE_MODE set to local or external. External mode requires external storage to be configured on your CML workspace and triggers the project to ingest, process, and store ~20 GB of raw data using Spark. Local mode bypasses the data ingestion and manipulation steps and instead uses the data/preprocessed_flight_data.tgz file to train a model and deploy the application. While running the project as an AMP will install, set up, and build all project artifacts for you, it may still be instructive to review the documentation and files in the code directory.
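The branching between the two modes can be sketched roughly as follows. This is an illustration only, not the project's actual setup code: it assumes STORAGE_MODE is exposed as an environment variable, and the function names (`resolve_storage_mode`, `unpack_preprocessed`, `prepare_data`) are hypothetical.

```python
import os
import tarfile

VALID_MODES = ("local", "external")


def resolve_storage_mode(env=None):
    """Read and validate STORAGE_MODE (assumed to be an environment variable)."""
    env = os.environ if env is None else env
    mode = env.get("STORAGE_MODE", "").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"STORAGE_MODE must be one of {VALID_MODES}, got {mode!r}")
    return mode


def unpack_preprocessed(archive="data/preprocessed_flight_data.tgz", dest="data"):
    """Extract the bundled sample archive and return the member names."""
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def prepare_data(env=None, archive="data/preprocessed_flight_data.tgz", dest="data"):
    """Return (mode, extracted_files); external mode extracts nothing here."""
    mode = resolve_storage_mode(env)
    if mode == "local":
        return mode, unpack_preprocessed(archive, dest)
    # external: the Spark ingestion and processing steps would run instead
    # (not shown in this sketch).
    return mode, []
```

Calling `prepare_data()` from the project root in local mode would unpack the sample archive into `data/`; in external mode the sketch returns immediately and leaves the heavy lifting to the Spark pipeline described in the code directory.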
