snuspl / pluto

MIST: High-performance IoT Stream Processing

License: Apache License 2.0

Java 99.79% Shell 0.21%
stream-processing stream internet-of-things data-stream

pluto's Introduction

MIST: High-performance IoT Stream Processing

MIST is a stream processing system that is optimized to handle large numbers of IoT stream queries. MIST is built on top of Apache REEF.

Requirements

  • Java 1.8
  • Maven

How to build and run MIST

  1. Build MIST:

    git clone https://github.com/snuspl/mist
    cd mist
    export MIST_HOME=`pwd`    # set $MIST_HOME
    mvn clean install
    
  2. Run MIST:

    ./bin/start.sh
    [-? Print help]
    -num_threads The number of MIST threads
    -num_master_cores The number of cores of MIST master
    -num_task_cores The number of cores of MIST task 
    [-num_tasks The number of mist tasks]
    [-port The port number of the RPC server that communicates with MIST client]
    [-runtime Type of the MIST Driver runtime (yarn or local; the default is local)]
    -master_mem_size The size of MIST master memory (MB)
    -task_mem_size The size of MIST task memory (MB) 
    

MIST examples

HelloMist

  1. Description

    The example shows how to implement a simple stateless query and submit it to MIST. The code sets up the environment (hostnames and ports of MIST, the source, and the sink), generates a simple stateless query, and submits it to MIST. The query reads strings from a source server, filters the strings that start with "HelloMIST:", trims the "HelloMIST:" prefix from the filtered strings, and sends them to a sink server. A minimal sketch of this query logic appears after the run steps below.

  2. Run HelloMist

    # 0. Build MIST first! (See above)
    # 1. Run MIST
    ./bin/start.sh -num_threads 1 -num_master_cores 1 -num_task_cores 1 -task_mem_size 1024 -master_mem_size 256
    # 2. Launch source server (You can simply use netcat)
     nc -lk 20331
    # 3. Run a script for HelloMist
    ./bin/run_example.sh HelloMist
    [-? Print help]
    [-d Address of running MIST driver in the form of hostname:port]
    [-s Address of running source server in the form of hostname:port]
    
    # 4. Publish a data stream
     nc -lk 20331
     HelloMIST: hello!  (The HelloMist query will then filter this string and print "hello!" to the console)
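
For reference, the block below is a minimal, MIST-independent sketch of the query logic described above (keep lines with the "HelloMIST:" prefix, strip the prefix, forward the rest). The class and method names are illustrative; the actual example uses the MIST query API.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class HelloMistLogic {
      private static final String PREFIX = "HelloMIST:";

      // Keep only lines starting with "HelloMIST:", strip the prefix, and trim whitespace.
      static List<String> process(final List<String> lines) {
        return lines.stream()
            .filter(s -> s.startsWith(PREFIX))
            .map(s -> s.substring(PREFIX.length()).trim())
            .collect(Collectors.toList());
      }

      public static void main(final String[] args) {
        // Prints [hello!]
        System.out.println(process(Arrays.asList("HelloMIST: hello!", "ignore me")));
      }
    }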
    

Multi-query submission

To submit stream queries, application developers first submit their jar files to MIST (The HelloMist example automatically submits the jar file and query to MIST).

# submit jar file
./bin/submit_jar.sh ./mist-examples/target/mist-examples-0.2-SNAPSHOT.jar

Then, it will return the app identifier

App identifier: 0

We can use this app identifier to submit multiple stream queries of the app. We will use EMQ as the MQTT message broker. Please download EMQ and start it:

emqttd start

The default port on which emqttd subscribes and publishes data is 1883. We then submit a stream query that filters data according to a user-defined prefix. This query subscribes to and publishes the data stream from/to EMQ.

./bin/run_example.sh MqttFilterApplication -source_topic /src/1 -sink_topic /sink/1 -filtered_string "HelloMIST:" -app_id 0

Then, the query will subscribe to the /src/1 EMQ topic, filter data that has HelloMIST: as a prefix, and publish the data to the /sink/1 EMQ topic.

To publish and subscribe to the topics, we can use mosquitto_pub and mosquitto_sub. We can subscribe to the /sink/1 topic by:

mosquitto_sub -h 127.0.0.1 -t /sink/1

We can publish to the /src/1 topic by:

mosquitto_pub -h 127.0.0.1 -t /src/1 -m "HelloMIST: hello!"
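
If a programmatic client is preferred over the mosquitto CLI, the following is a rough sketch using the Eclipse Paho Java client (a third-party library, not part of MIST); the broker address and topic match the example above.

    import org.eclipse.paho.client.mqttv3.MqttClient;
    import org.eclipse.paho.client.mqttv3.MqttException;
    import org.eclipse.paho.client.mqttv3.MqttMessage;
    import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

    public class MqttPublishExample {
      public static void main(final String[] args) throws MqttException {
        // Connect to the local EMQ broker on the default port (1883).
        final MqttClient client = new MqttClient("tcp://127.0.0.1:1883",
            MqttClient.generateClientId(), new MemoryPersistence());
        client.connect();
        // Publish a message that the running MqttFilterApplication query should pass through.
        client.publish("/src/1", new MqttMessage("HelloMIST: hello!".getBytes()));
        client.disconnect();
      }
    }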

We can submit another stream query that filters HelloWorld:

./bin/run_example.sh MqttFilterApplication -source_topic /src/2 -sink_topic /sink/2 -filtered_string "HelloWorld:" -app_id 0

pluto's People

Contributors

bellatoris, bgchun, bokyoungbin, differentsc, junggil, kyungissac, pigbug419, sanha, seojangho, taegeonum, taehunkim

pluto's Issues

Implement simple word generator-aggregator model on REEF

As the first step of the MIST project, we need to implement a simple word generator-aggregator stream processing application on REEF. This issue is important for the two reasons below.

  • We can measure the number of streaming queries which can be processed in the environment without ZooKeeper.
  • We can practice development in the REEF environment (especially Tang and Wake) and get used to some useful tools for building MIST, like RemoteManager, NetworkService, etc. We can also think about how we can provide a simple API that hides the complex parts of REEF.

Implement input data transformation interface

The internal query should have concrete information about how data should be transformed. We will address this issue by providing an interface for (1) setting the input data from issue #11, (2) registering arbitrary transforming functions to the query, and (3) chaining those functions in order. In one query, multiple chains could be used in the case of a join operation.

For (2), we could leverage lambda functions to keep the interface simple.
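
A minimal sketch of chaining lambda-based transformations in order, using plain java.util.function types (illustrative only, not the actual MIST interface):

    import java.util.function.Function;

    public final class TransformChainExample {
      public static void main(final String[] args) {
        // Two user-registered transformations, chained in registration order.
        final Function<String, String[]> split = line -> line.split(" ");
        final Function<String[], Integer> count = words -> words.length;
        final Function<String, Integer> chain = split.andThen(count);
        System.out.println(chain.apply("a simple streaming line")); // prints 4
      }
    }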

Task: Support auto-parallelism in operators

We need to think about how to parallelize operators. There are two ways:

  1. Statically decide the degree of parallelism when generating physical plans, with cost-based optimization(?)
  2. Dynamically change the degree of parallelism by monitoring the operator stages and detecting bottlenecks.
    This topic needs discussion.

First, we can explicitly configure the operator parallelism before doing auto-parallelism.

SSM: Read/Write states

Implement getState() and setState() to allow query elements to read from and write to the SSM DB.
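
A hypothetical sketch of such an interface (the identifier and value types are assumptions for illustration):

    public interface StateStore {
      // Read the state of a query element from the SSM DB, or null if absent.
      Object getState(String elementId);

      // Write the state of a query element back to the SSM DB.
      void setState(String elementId, Object state);
    }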

Query-evaluator scheduling

In addition to scheduling within an Evaluator, we need to assign each streaming query to an appropriate evaluator.

This scheduling can be tricky, so we can start from simple approaches (e.g., random assignment or round-robin) and measure the performance. After that, we can improve the performance by leveraging performance metrics (CPU, memory, ...) for query assignment.

API: Make NCS source configuration builder for providing stream source configuration

We need a source configuration builder that lets users provide configurations for stream sources.

It will store its configuration in the form of a key-value map, and each type of source will have different setter methods.

This is not related to Tang Configuration and ConfigurationBuilder.
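
A rough sketch of what such a builder could look like for a text-socket source, backed by a key-value map (the class, key, and setter names are illustrative assumptions):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public final class TextSocketSourceConfigurationBuilder {
      private final Map<String, Object> conf = new HashMap<>();

      // Source-type-specific setters; other source types would expose their own.
      public TextSocketSourceConfigurationBuilder setHostName(final String hostName) {
        conf.put("source.hostname", hostName);
        return this;
      }

      public TextSocketSourceConfigurationBuilder setPort(final int port) {
        conf.put("source.port", port);
        return this;
      }

      // Return an immutable view of the collected key-value configuration.
      public Map<String, Object> build() {
        return Collections.unmodifiableMap(conf);
      }
    }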

API: Support KAFKA for data source

Messaging systems (like Kafka, Kinesis, MQTT, ...) are widely used as data sources in stream processing. We are planning to support Kafka first and make the others available in the future.

Should be addressed with #11

Implement a basic physical planner which supports a DAG operation

The physical planner converts the logical plan into a physical plan according to 1) the structure of the logical plan (DAG), 2) element sharing, and 3) parallelism.

First, we consider 1).
We need to implement a basic physical planner which just converts the logical plan into a physical plan.
The planner should decide which elements are executed in synchronous or asynchronous stages.
If the elements are executed linearly, the planner simply allocates them to synchronous stages (in the same thread).
If there are branches in the logical plan, the planner allocates asynchronous stages to the downstream elements (in different threads).
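
As a rough illustration of this rule (not the actual planner code), a vertex with at most one downstream element keeps its child in the same synchronous stage, while a branching vertex starts asynchronous stages for its children; the adjacency-list DAG representation here is an assumption.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    final class StageDecision {
      enum StageType { SYNCHRONOUS, ASYNCHRONOUS }

      // Decide how the downstream elements of a vertex are staged, given an adjacency list.
      static StageType downstreamStageType(final Map<String, List<String>> adjacency,
                                           final String vertex) {
        final List<String> children =
            adjacency.getOrDefault(vertex, Collections.emptyList());
        // Linear chain: run downstream elements in the same thread (synchronous stage).
        // Branch: run downstream elements in different threads (asynchronous stages).
        return children.size() <= 1 ? StageType.SYNCHRONOUS : StageType.ASYNCHRONOUS;
      }
    }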

Task: Implement basic stream operators

MIST should support various streaming operations, such as filter, flatMap, map, and reduce.
In MistTask, these operations are represented as operators.
As a stepping stone, we need to implement basic stream operators: MapOperator, FilterOperator, and so on.
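
A minimal sketch of what such operators might look like (the interface and method names are illustrative; the real operators would also need to emit outputs downstream and manage state):

    import java.util.function.Function;
    import java.util.function.Predicate;

    interface Operator<I, O> {
      O process(I input);
    }

    final class MapOperator<I, O> implements Operator<I, O> {
      private final Function<I, O> mapFunc;
      MapOperator(final Function<I, O> mapFunc) { this.mapFunc = mapFunc; }
      @Override public O process(final I input) { return mapFunc.apply(input); }
    }

    final class FilterOperator<I> implements Operator<I, I> {
      private final Predicate<I> filterFunc;
      FilterOperator(final Predicate<I> filterFunc) { this.filterFunc = filterFunc; }
      // Return the input if it passes the predicate, or null to drop it.
      @Override public I process(final I input) { return filterFunc.test(input) ? input : null; }
    }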

Task: Profiling physical operators

We need to measure each physical operator's

  1. resource usage
  2. rate of input/output flow

to figure out the behavior of operators and user queries. Maybe this can be used for 1) load balancing and 2) auto-parallelism.

SSM: Putting queries into the SSM

Currently, we have only thought about putting query elements' states into the SSM; we should also consider putting the Query itself into the SSM (for query statistics, etc.).

Support multiple queries in one Evaluator

To increase the number of queries that can be processed on one machine, we need to unify the runtime environment for many queries to reduce the overhead of maintaining a large number of queries. Currently, a REEF Evaluator is a separate process in one container from the RM, so we need to process multiple queries in a single Evaluator.

We need to address those particular issues below.

  • Put multiple queries in one Task via user API (issue #5)
  • Separate each query processing in one REEF evaluator and schedule them within the evaluator
  • Leverage disk space to store a large number of query states

Leverage disk for storing query states

To process multiple queries on one machine, we need to store the states of those queries. However, memory may not be enough to store all the states. This can happen with some queries, such as online ML (which needs to store many parameters).

The basic approach is to store a query's state on disk when the query is inactive, and to reload it into memory and process it when data comes in. However, reading data from disk is slow, so processing time can increase. Because of that, we need to address the issues below.

  • Which data should stay in memory? Which criteria (query SLA, data incoming frequency, ...) should we use?
  • Can we apply batched processing in some cases? I.e., stack multiple data items and update the state in one go while the state is available in memory. This can reduce the number of disk reads/writes, but it will delay state updates.
  • Can we predict the data arrival time and apply pre-fetching for some queries?

Task: Implement mist executor

As a first step, we need to implement a simple MIST executor whose scheduling policy is FIFO.
One executor consists of a queue, a thread, and a scheduler. The thread fetches tasks from the queue and processes them.

We will make the scheduling policy pluggable so that it can be changed easily.

Designing an optimal scheduling policy is future work.
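
A bare-bones sketch of such an executor, assuming a FIFO queue drained by one worker thread (class and method names are assumptions, not the actual MIST code):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public final class FifoExecutor {
      private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
      private final Thread worker;
      private volatile boolean running = true;

      public FifoExecutor() {
        worker = new Thread(() -> {
          try {
            while (running) {
              // Fetch the next task in FIFO order and process it.
              queue.take().run();
            }
          } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
        worker.start();
      }

      public void submit(final Runnable task) {
        queue.add(task);
      }

      public void shutdown() {
        running = false;
        worker.interrupt();
      }
    }

A pluggable scheduler would replace the plain FIFO queue order with its own policy for picking the next task.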

Explore existing streaming API

Before we design our user API, we need to fully understand the APIs of state-of-the-art stream processing systems.

To begin with, we decided to explore the user API of Apache Flink and Spark Streaming. This exploration should begin after we get sufficient background on stream processing frameworks.

API: Support NCS for data source

Our API needs to support network input connections for fetching data. For the network implementation, we can leverage the NCS implementation from Wake.

Should be addressed with issue #11

WordCounter on Spark Streaming

We have already investigated the limitations of micro-streams in current stream systems such as Storm and REEF.

We found that we also have to perform a similar investigation on Apache Spark Streaming.

We will implement WordGenerator and WordAggregator, and measure the limit on the number of queries processed at the same time.

Implement on-demand query state loading

As a first step of Issue #7, we will implement the on-demand query state loading feature on MIST. In this feature

  • Query state will be stored on local disk
  • When data arrives, the state will be loaded into memory and the updated state will be saved back to disk.

We'll measure the performance of this approach and find ways to improve it.

Shorten the gap of generating words

The basic experiments show that the most important bottleneck for micro-streams in the current REEF system is memory limitation.

However, CPU can also be a bottleneck if queries require more computation.

In this experiment, we examine this by shortening the word generation interval from 1 second to lower values.

Task: Implement a simple round-robin allocator

MistTask allocates OperatorChains to executors.
Then, the executors run and activate the dedicated OperatorChains when their inputs are sent.

As the first step of the above logic, we will implement a simple round-robin allocator
that assigns OperatorChains in a round-robin way.
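
A minimal sketch of the allocation policy itself, with MIST's OperatorChain and Executor types replaced by generic placeholders (illustrative only):

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    public final class RoundRobinAllocator<C, E> {
      private final List<E> executors;
      private final AtomicInteger next = new AtomicInteger(0);

      public RoundRobinAllocator(final List<E> executors) {
        this.executors = executors;
      }

      // Pick the next executor in round-robin order for the given operator chain.
      public E allocate(final C operatorChain) {
        final int index = Math.floorMod(next.getAndIncrement(), executors.size());
        return executors.get(index);
      }
    }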

API: Implement generic interface for outputting result

For outputting the result, we need a generic interface.

  • How do we represent the result?
  • Where do we output the result?

By separating those two things, we have more flexibility in defining output methods.
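
A hypothetical sketch of this separation, with one interface for how the result is represented and one for where it is written (all names are illustrative):

    // How the result is represented (e.g., serialized to bytes).
    interface Encoder<T> {
      byte[] encode(T result);
    }

    // Where the result is written (e.g., a socket, a file, a message broker).
    interface Sink {
      void write(byte[] encoded);
    }

    final class ResultWriter<T> {
      private final Encoder<T> encoder;
      private final Sink sink;

      ResultWriter(final Encoder<T> encoder, final Sink sink) {
        this.encoder = encoder;
        this.sink = sink;
      }

      // Encode the result, then hand it to the configured destination.
      void output(final T result) {
        sink.write(encoder.encode(result));
      }
    }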

Add checkstyle to mist

To automatically check code readability and formatting, we need to add checkstyle to MIST.

Implement low-level API

We need to have a low-level representation of MIST query. This should be done before issue #5.

I don't think we need to have a data-flow representation as Storm does, because we assume that each query is processed on one machine. So, I think we can focus on the issues below when designing internal APIs.

  1. How can we fetch the data?
  2. How can we transform the query state for the given data?
  3. How can we output the result?

SSM: Memory caching

The SSM should control which states go to disk and which remain in memory.
We need to decide what kind of caching strategy to use.

Implement internal query interface

We need to implement an internal operation representation inside the MIST query. Specifically, we need to define a procedure for how a single data item is processed inside MIST.

We should have a clear definition of ...

  • How is stream input data represented inside the query? (#15)
  • How is input data transformed? (#16)
  • How is state changed by input data? Which window should be used for managing the state? (#17)

Each will be addressed in a separate issue.

Review necessary papers

To get background on stream processing, we need to review some important papers. The papers I consider important are listed below.

Taehun and I will review those papers soon and have a short discussion on them. This issue will remain open in case we need more papers to read.

Please comment on this issue if you come up with more papers worth reading.

API: Support user-defined functions

To support various operations that cannot be expressed via basic operators, we need to provide an interface for defining UDFs in the MIST API.

Design user API

We need a good user API for the system. In particular, we need to hide the internal implementation of MIST, including REEF components. I think we should also hide the data-flow model underneath streaming queries and provide a data-centric interface for users.

To be more specific, we need to provide the three features below via our MIST API.

  • Data fetching from diverse sources (e.g. HDFS, KAFKA, ...)
  • Data transformation within queries. It should be explicit and general enough to support various operations.
  • Convenient output of query results

Issue #3 should be addressed before this issue.

Define input data stream inside the internal query

We need to define a way for the input data stream to be represented inside the internal query interface.

I think we can use a simple tuple interface with each field's name defined for this. We do not have to define a specific key here.
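
A minimal sketch of such a tuple, assuming a field-name-to-value map and no designated key field (illustrative only):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public final class NamedTuple {
      // Field values keyed by field name; insertion order is preserved.
      private final Map<String, Object> fields = new LinkedHashMap<>();

      public NamedTuple set(final String fieldName, final Object value) {
        fields.put(fieldName, value);
        return this;
      }

      public Object get(final String fieldName) {
        return fields.get(fieldName);
      }
    }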

Simple word counter with one task

As the first step of the ISSUE #1 , I will implement a simple word counter with one task.

The task performs both word generating and aggregating.

I'm going to implement it, starting with HelloREEF example REEF application.

Task: Convert Logical plan to physical plan

A logical plan is represented by a data flow, which is generated by the user-level API.
A physical plan is represented by actual MIST operators, with the degree of parallelism set.

We should convert a logical plan to a physical plan in the MIST task.

API: Make basic MISTStream interface

We need to make a basic MISTStream interface to represent user-created data streams. The basic MISTStream interfaces are:

  • SourceStream
  • OperatorStream

For those streams, the MIST system needs to know the type and other necessary information, so MISTStream and its derived classes should have methods for that information.
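
A rough sketch of this hierarchy (the method names are assumptions; only the type information is shown):

    interface MISTStream<T> {
      // Type information the system needs about this stream.
      Class<T> getType();
    }

    // A stream created directly from an external source.
    interface SourceStream<T> extends MISTStream<T> {
    }

    // A stream produced by applying an operator to an upstream MISTStream.
    interface OperatorStream<I, O> extends MISTStream<O> {
      MISTStream<I> getInputStream();
    }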

Implement internal query state transformation interface

After data is transformed, the internal query state should be updated by the newly arrived data. For that, we should provide an interface for

(1) defining the internal state,
(2) an update function taking the old state and the input data, and
(3) defining window information (window size & slide interval).
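
A hypothetical sketch combining the three pieces above (the names and the millisecond-based window parameters are assumptions):

    import java.util.function.BiFunction;

    public final class StatefulWindowSpec<S, I> {
      private final S initialState;                       // (1) internal state definition
      private final BiFunction<S, I, S> updateFunction;   // (2) new state from old state + input
      private final long windowSizeMs;                    // (3) window size
      private final long slideIntervalMs;                 //     and slide interval

      public StatefulWindowSpec(final S initialState, final BiFunction<S, I, S> updateFunction,
                                final long windowSizeMs, final long slideIntervalMs) {
        this.initialState = initialState;
        this.updateFunction = updateFunction;
        this.windowSizeMs = windowSizeMs;
        this.slideIntervalMs = slideIntervalMs;
      }

      public S initialState() { return initialState; }
      public long windowSizeMs() { return windowSizeMs; }
      public long slideIntervalMs() { return slideIntervalMs; }

      // Produce the new state from the old state and an input item.
      public S update(final S oldState, final I input) {
        return updateFunction.apply(oldState, input);
      }
    }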
