GithubHelp home page GithubHelp logo

nyctaxi's Introduction

NYC Taxi Data Analysis

This document contains the descriptions of all the code components present in our system, and how they are meant to be used.

###The Data: The NYC taxi data was originally through a Freedom of Information Law (FOIL) request from the New York City Taxi & Limousine Commission (NYCT&L). The dataset is available for download publicly at multiple hosting sites. You can download the data from any of the following links:
http://www.andresmh.com/nyctaxitrips/
OR
https://uofi.app.box.com/NYCtaxidata

###./k-means/
This folder contains the PySpark code to perform K-means clustering on the txi data, using MLlib. Your system needs to have Spark installed and working to run this. Alternative, it can also be deployed as a spark-job on EMR clusters.

###./*.ipynb
The .ipynb files are the IPython notebooks, which are used for data exploration and cleaning, as well as analysis of trends observed in the data. The system needs to have Anaconda installed to run the IPython notebooks. They are split into multiple ode chunks, with chunk performing a particular task. This allows us to execute one task at a time, rather than all at once. These notebooks are also used to create some visualization charts to describe the general trends observed in the data.

###./visualizations/ This folder contains the visulization code used to generate the Google Map visuals, depicting the most popular taxi-routes in New York City. It leverages JavaScript and HTML, as well as Google Maps API.

###AWS You would first need to launch an AWS EMR cluster comfigured with Spark and Ganglia. This project utilized 20 nodes (1 master and 19 slaves) which were of the m2.xlarge configuration. Next the Spark environment needs to be set up in the local machine and you would need to get the URL of the master node.

Next you would need to upload the data for the trip records to S3. This is achieved using AWS CLI tool for this project.

Post this step, you can run the following command to run the Spark job on your EMR clusters

bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 --master spark://url-of-emr-master-node nyctaxikmeans.py

nyctaxi's People

Watchers

 avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.