This project forked from vida-nyu/aws_taxi
Sample scripts to analyze taxi data on Amazon AWS

License: MIT License

Python 99.67% Shell 0.33%

NYC Taxi Analysis

Sample scripts to analyze taxi data on Amazon AWS

Instructions

  1. Create an Amazon EMR cluster with the following configuration. Pay particular attention to the bootstrap action, because the sample scripts depend on it:

     * Termination protection: Yes
     * Logging: Enabled (enter the S3 bucket where the log files should be stored)
     * Hadoop distribution: Amazon AMI 3.3.1
     * Bootstrap action: The sample scripts use the Python rtree library, which
     Amazon AMI 3.3.1 does not have installed, so it must be installed when the
     cluster starts. Click 'Add bootstrap action' -> Custom action ->
     Configure and add, then enter the following in 'S3 location': s3://mda2014/rtree.sh
     * Steps: Do not add any steps at this point
     * Cluster Auto-terminate: No
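
For readers who prefer scripting over the console, the configuration above can be sketched as the keyword arguments of boto3's EMR `run_job_flow` call. This is a minimal sketch, not part of the repository: the cluster name, the `logs/` prefix, the instance type and count, and the default role names are all assumptions, and the request has not been run against a live account.

```python
def build_cluster_request(log_bucket, bootstrap_script="s3://mda2014/rtree.sh"):
    """Return keyword arguments mirroring the console settings listed above."""
    return {
        "Name": "nyc-taxi-analysis",                      # hypothetical cluster name
        "LogUri": "s3://{0}/logs/".format(log_bucket),    # Logging: Enabled (prefix is an assumption)
        "AmiVersion": "3.3.1",                            # Hadoop distribution: Amazon AMI 3.3.1
        "Instances": {
            "MasterInstanceType": "m3.xlarge",            # assumption, not from the README
            "SlaveInstanceType": "m3.xlarge",             # assumption, not from the README
            "InstanceCount": 3,                           # assumption, not from the README
            "KeepJobFlowAliveWhenNoSteps": True,          # Cluster Auto-terminate: No
            "TerminationProtected": True,                 # Termination protection: Yes
        },
        "BootstrapActions": [
            {
                "Name": "Install rtree",
                "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []},
            }
        ],
        "Steps": [],                                      # no steps at this point
        "JobFlowRole": "EMR_EC2_DefaultRole",             # assumption: default EMR roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("emr").run_job_flow(**build_cluster_request("your-bucket"))
```

Building the request as a plain dictionary keeps the settings easy to inspect before anything is launched.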
    
  2. Clone this repository and upload the neighborhoods and yearplot scripts to your bucket on S3. For example:

     * neighborhoods: s3://mda2014/neighborhoods
     * yearplot: s3://mda2014/yearplot
    
  3. To run the neighborhoods script, add a streaming step to your cluster with the following settings.

     Replace mda2014 with your bucket name everywhere except in Input:
     * Mapper: s3://mda2014/neighborhoods/mapper.py
     * Reducer: s3://mda2014/neighborhoods/reducer.py
     * Input: s3://mda2014/taxi/trip/
     * Output: s3://mda2014/output1
     * Arguments: -D mapred.reduce.tasks=1 -files s3://mda2014/neighborhoods/mapper.py,s3://mda2014/neighborhoods/reducer.py,s3://mda2014/neighborhoods/shapefile.py,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.shp,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.prj,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.shp.xml,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.shx,s3://mda2014/neighborhoods/ZillowNeighborhoods-NY.dbf
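
The Arguments value is long and easy to mistype when substituting a bucket name by hand. The sketch below builds it from the file names listed above; the helper name `streaming_arguments` and its parameters are hypothetical, and it assumes the files were uploaded under a `neighborhoods` prefix as in step 2.

```python
# File names taken from the Arguments line above, in the same order.
NEIGHBORHOOD_FILES = [
    "mapper.py",
    "reducer.py",
    "shapefile.py",
    "ZillowNeighborhoods-NY.shp",
    "ZillowNeighborhoods-NY.prj",
    "ZillowNeighborhoods-NY.shp.xml",
    "ZillowNeighborhoods-NY.shx",
    "ZillowNeighborhoods-NY.dbf",
]


def streaming_arguments(bucket, prefix="neighborhoods", reduce_tasks=1):
    """Build the Arguments string for the neighborhoods streaming step."""
    paths = ["s3://{0}/{1}/{2}".format(bucket, prefix, name) for name in NEIGHBORHOOD_FILES]
    return "-D mapred.reduce.tasks={0} -files {1}".format(reduce_tasks, ",".join(paths))


if __name__ == "__main__":
    # Paste the printed line into the Arguments field of the step.
    print(streaming_arguments("your-bucket"))
```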
    

    Wait for the step to finish, then download all of its output files and merge them into a single file named output.txt
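
One way to do the merge is sketched below. It assumes the step's output has already been downloaded into a local directory and that the files follow Hadoop's usual `part-*` naming; the function name `merge_parts` is hypothetical and is not part of the repository.

```python
import glob
import os
import shutil


def merge_parts(download_dir, merged_path="output.txt"):
    """Concatenate downloaded part-* files, in name order, into merged_path.

    Returns the number of part files merged. Other files in the directory
    (for example a _SUCCESS marker) are ignored.
    """
    parts = sorted(glob.glob(os.path.join(download_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, merged)
    return len(parts)


# Usage, assuming the output was downloaded to ./output1:
#   merge_parts("output1", "output.txt")
```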

    To generate the plot, run:

     python plot_results.py output.txt <location_of_output_plot>
    
  4. To run the yearplot script, add a streaming step to your cluster with the following settings.

     Replace mda2014 with your bucket name everywhere except in Input:
     * Mapper: s3://mda2014/yearplot/mapper.py
     * Reducer: s3://mda2014/yearplot/reducer.py
     * Input: s3://mda2014/taxi/trip/
     * Output: s3://mda2014/output2
     * Arguments: -D mapred.reduce.tasks=1
    

    Wait for the step to finish, then download all of its output files and merge them into a single file named output.txt

    To generate the plot, run:

     python plot_results.py output.txt <location_of_output_plot>
    
  5. Remember to terminate the cluster after use. Because termination protection was enabled in step 1, turn it off before terminating.
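
If the cluster is managed from a script rather than the console, termination can be sketched with two boto3 EMR client calls: `set_termination_protection` to lift the protection set in step 1, followed by `terminate_job_flows`. The function name `terminate_cluster` and the cluster id shown in the usage note are hypothetical.

```python
def terminate_cluster(emr_client, cluster_id):
    """Disable termination protection, then request termination of the cluster.

    emr_client is expected to behave like boto3.client("emr").
    """
    emr_client.set_termination_protection(
        JobFlowIds=[cluster_id],
        TerminationProtected=False,
    )
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])


# Usage (requires boto3 and AWS credentials; the id is a placeholder):
#   import boto3
#   terminate_cluster(boto3.client("emr"), "j-XXXXXXXXXXXXX")
```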

Author

Huy T. Vo

Contributors

Tuan-Anh Hoang-Vu
