kristiyanto / venturo


An Insight Data Science Project | Kafka, Spark Streaming, ElasticSearch, Flask, Bootstrap

Home Page: http://venturo.kristiyanto.me

Jupyter Notebook 27.61% Python 9.14% JavaScript 5.42% CSS 2.13% HTML 55.66% Shell 0.05%
streaming-data real-time-processing real-time-analytics data-engineering

VENTURO: an Insight data science project

Swipe right to your next adventurous ride

Venturo is a ridesharing platform built around common destinations, e.g. to complement city passes. Users select multiple destinations and are matched with a driver and other users. It currently serves two cities: Chicago and New York City.

Demo

Demo available at: http://venturo.kristiyanto.me

Data

The platform takes two data streams, both in JSON format:

  1. From the passenger: {id, current location, status, destinations ..}

  2. From the driver: {id, current location, status, destination, passenger, ..}

Click here for the complete schema. The data is generated by placing both drivers and passengers on a map and simulating movement toward their destinations within 1-2 hops (Euclidean).
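The simulation described above can be illustrated with a minimal sketch. This is not the project's generator: the function names, the `location` field layout, and the step size are assumptions made for illustration, and only the top-level fields (id, location, status, destinations) come from the stream description.

```python
import json
import math


def step_toward(current, destination, step):
    """Move a (lat, lon) point toward a destination by at most `step`.

    Distance is plain Euclidean distance on the coordinates, which is a
    simplification that ignores the curvature of the Earth.
    """
    dlat = destination[0] - current[0]
    dlon = destination[1] - current[1]
    dist = math.hypot(dlat, dlon)
    if dist <= step:
        # Close enough: snap to the destination instead of overshooting.
        return destination
    scale = step / dist
    return (current[0] + dlat * scale, current[1] + dlon * scale)


def passenger_message(passenger_id, location, status, destinations):
    """Serialize one passenger update as JSON (field layout is hypothetical)."""
    return json.dumps({
        "id": passenger_id,
        "location": {"lat": location[0], "lon": location[1]},
        "status": status,
        "destinations": destinations,
    })
```

Calling `step_toward` repeatedly and emitting a message after each call would produce a moving stream of locations for one simulated passenger or driver.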

Matching

The matching is heuristic and driver-centric:

  1. If a driver is available (idle), scan for passengers nearby.
  2. Dispatch the driver to pick up the passenger.
  3. Switch the status to 'On trip' and head to the destination. Along the way, unless the cab is full, the driver continuously scans for other nearby passengers with common destinations and re-routes if necessary.
  4. Once the cab arrives, the driver's status is set back to idle so it can pick up other passengers. Passengers are removed 2 hours later.
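The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's Spark job: the cab capacity, the status strings, the dictionary fields, and the choice of the nearest candidate are all hypothetical.

```python
import math

CAB_CAPACITY = 4  # assumption: the source does not state the cab size


def distance(a, b):
    """Euclidean distance between two (lat, lon) tuples."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def find_match(driver, passengers, radius):
    """Return the nearest eligible waiting passenger for a driver, or None."""
    if len(driver["passengers"]) >= CAB_CAPACITY:
        return None  # a full cab stops scanning
    candidates = []
    for p in passengers:
        if p["status"] != "waiting":
            continue
        if distance(driver["location"], p["location"]) > radius:
            continue
        # An idle driver takes any nearby passenger; a driver already on a
        # trip only takes passengers who share its current destination.
        if driver["status"] == "on_trip" and driver["destination"] not in p["destinations"]:
            continue
        candidates.append(p)
    if not candidates:
        return None
    return min(candidates, key=lambda p: distance(driver["location"], p["location"]))
```

Picking the nearest candidate is one simple heuristic; the original project may rank candidates differently.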

Dashboard and Queries

The dashboard puts the drivers and passengers on the map in real time within a 1-hour window. Since the data is simulated, some passengers and drivers may appear in unlikely places (e.g. a river or lake).

Green dots represent passengers; blue dots represent drivers. Green inside blue represents passenger(s) on a trip. The stats:

  • Total cabs: total active drivers (including idle, on-trip, and dispatched drivers)
  • Idle driver: cabs without a passenger
  • Total passenger: total passengers (including passengers on a trip, waiting, or being picked up, etc.)
  • On going trips: total cabs with passengers heading to a destination
  • Passenger waiting: total passengers not assigned to any driver
  • Average waiting time: the average time, in seconds, from when a passenger starts waiting to when the passenger gets into the car.
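As an illustration of the last statistic, the average waiting time could be computed as below. The field names `requested_at` and `picked_up_at` (timestamps in seconds) are assumptions, not part of the published schema.

```python
def average_waiting_seconds(passengers):
    """Average seconds between the start of waiting and pickup.

    Passengers who have not been picked up yet are skipped.
    """
    waits = [
        p["picked_up_at"] - p["requested_at"]
        for p in passengers
        if p.get("picked_up_at") is not None
    ]
    return sum(waits) / len(waits) if waits else 0.0
```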

Other queries performed internally:

  • Check nearby passengers for driver assignments
  • Current location to perform simulation (called by Kafka Producer)

Architecture

Ingestion layer

Data streams for passengers and drivers are generated separately in Python (Kafka producers) and ingested into Kafka as different topics. Kafka is set with a 4-hour retention policy.

Secor is used to save all raw streams to Amazon S3 for later use (batch processing, replay, forensics, or analytics).

Stream processing

Received JSON streams are processed and transformed in Spark Streaming with a 3-second window, consuming data streams from both drivers and passengers. Every incoming message is subject to a sanity check, e.g. a driver's reported status is compared with the previous status, to anticipate latency.
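A sanity check of this kind can be sketched as a table of allowed status transitions. The status names and the transition table here are assumptions for illustration; the project's actual rules are not listed in this document.

```python
# Hypothetical driver state machine: each status maps to the statuses
# that may legitimately follow it in the next message.
ALLOWED_TRANSITIONS = {
    "idle": {"idle", "dispatched"},
    "dispatched": {"dispatched", "on_trip", "idle"},
    "on_trip": {"on_trip", "idle"},
}


def is_consistent(previous_status, reported_status):
    """Return True if the reported status may follow the previous one."""
    if previous_status is None:
        return True  # first message from this driver, nothing to compare
    return reported_status in ALLOWED_TRANSITIONS.get(previous_status, set())


def drop_stale(messages, last_known):
    """Keep only messages whose status is consistent with the last known one.

    `last_known` maps driver id to the previously stored status.
    """
    return [
        m for m in messages
        if is_consistent(last_known.get(m["id"]), m["status"])
    ]
```

A late message that still reports a driver as idle after the stored state has moved to 'on trip' would be dropped by such a filter rather than overwriting newer state.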

Sink

Elasticsearch is used as the buffer/transactional interface for the resulting messages. Elasticsearch is also called by Spark to enable assignments. Elasticsearch geo-location and boolean queries are leveraged.
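A query combining the geo-location and boolean features mentioned above might look like the following sketch. It only builds the request body in the Elasticsearch query DSL (`bool`, `term`, `geo_distance`); the field names `status`, `destinations`, and `location` are assumptions, and `location` would need to be mapped as a `geo_point` for the filter to work.

```python
def nearby_waiting_passengers_query(lat, lon, radius="1km", destination=None):
    """Build a query body for waiting passengers within `radius` of a point.

    If `destination` is given, only passengers listing it are matched.
    """
    must = [{"term": {"status": "waiting"}}]
    if destination is not None:
        must.append({"term": {"destinations": destination}})
    return {
        "query": {
            "bool": {
                "must": must,
                "filter": {
                    "geo_distance": {
                        "distance": radius,
                        "location": {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }
```

The resulting dictionary can be passed as the search body to an Elasticsearch client or sent as JSON to the `_search` endpoint of the relevant index.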

Output

The output is served as an API using Flask, intended mainly for the dashboard and for sending results back to passengers and drivers. For the dashboard, Bootstrap2, jQuery, and Leaflet are used to prettify the output.

Infrastructure

Hosted on Amazon EC2 with 3 m3.large instances for Spark processing and 4 m3.medium instances for other services (multitenant).

Performance

The platform was tested to handle 3,000 active drivers and 5,000 passengers in a continuous data stream on the aforementioned infrastructure.

Docker Approach

An alternative approach was also evaluated: containerizing the Kafka consumer and running it on Google Container Engine on Google Cloud Platform.

About

An Insight Data Science project by Daniel Kristiyanto.

Palo Alto, California -- Autumn, 2016.
