snuspl / pluto

MIST: High-performance IoT Stream Processing

License: Apache License 2.0

Java 99.79% Shell 0.21%
stream-processing stream internet-of-things data-stream

pluto's Introduction

MIST: High-performance IoT Stream Processing

MIST is a stream processing system that is optimized to handle large numbers of IoT stream queries. MIST is built on top of Apache REEF.

Requirements

  • Java 1.8
  • Maven

How to build and run MIST

  1. Build MIST:

    git clone https://github.com/snuspl/mist
    cd mist
    export MIST_HOME=`pwd`    # set $MIST_HOME
    mvn clean install
    
  2. Run MIST:

    ./bin/start.sh
    [-? Print help]
    -num_threads The number of MIST threads
    -num_master_cores The number of cores of MIST master
    -num_task_cores The number of cores of MIST task 
    [-num_tasks The number of mist tasks]
    [-port The port number of the RPC server that communicates with MIST client]
    [-runtime Type of the MIST Driver runtime (yarn or local; the default is local)]
    -master_mem_size The size of MIST master memory (MB)
    -task_mem_size The size of MIST task memory (MB) 
    

MIST examples

HelloMist

  1. Description

    The example shows how to implement a simple stateless query and submit it to MIST. The code sets up the environment (hostnames and ports of MIST, the source, and the sink), generates a simple stateless query, and submits it to MIST. The query reads strings from a source server, filters the strings that start with "HelloMIST:", trims the "HelloMIST:" prefix from the filtered strings, and sends them to a sink server. A minimal sketch of this query logic appears after the run steps below.

  2. Run HelloMist

    # 0. Build MIST first! (See above)
    # 1. Run MIST
    ./bin/start.sh -num_threads 1 -num_master_cores 1 -num_task_cores 1 -task_mem_size 1024 -master_mem_size 256
    # 2. Launch source server (You can simply use netcat)
     nc -lk 20331
    # 3. Run a script for HelloMist
    ./bin/run_example.sh HelloMist
    [-? Print help]
    [-d Address of running MIST driver in the form of hostname:port]
    [-s Address of running source server in the form of hostname:port]
    
    # 4. Publish a data stream
     nc -lk 20331
     HelloMIST: hello!  (The HelloMist query will then filter this string and print "hello!" to the console)
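
For reference, the block below is a minimal, MIST-independent sketch of the query logic described above (keep lines with the "HelloMIST:" prefix, strip the prefix, forward the rest). The class and method names are illustrative; the actual example uses the MIST query API.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class HelloMistLogic {
      private static final String PREFIX = "HelloMIST:";

      // Keep only lines starting with "HelloMIST:", strip the prefix, and trim whitespace.
      static List<String> process(final List<String> lines) {
        return lines.stream()
            .filter(s -> s.startsWith(PREFIX))
            .map(s -> s.substring(PREFIX.length()).trim())
            .collect(Collectors.toList());
      }

      public static void main(final String[] args) {
        // Prints [hello!]
        System.out.println(process(Arrays.asList("HelloMIST: hello!", "ignore me")));
      }
    }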
    

Multi-query submission

To submit stream queries, application developers first submit their jar files to MIST (The HelloMist example automatically submits the jar file and query to MIST).

# submit jar file
./bin/submit_jar.sh ./mist-examples/target/mist-examples-0.2-SNAPSHOT.jar

Then, it will return the app identifier

App identifier: 0

We can use this app identifier to submit multiple stream queries of the app. We will use EMQ as the MQTT message broker. Please download EMQ and start it:

emqttd start

The default port on which emqttd subscribes and publishes data is 1883. We then submit a stream query that filters data according to a user-defined prefix. This query subscribes to and publishes the data stream from/to EMQ.

./bin/run_example.sh MqttFilterApplication -source_topic /src/1 -sink_topic /sink/1 -filtered_string "HelloMIST:" -app_id 0

Then, the query will subscribe to the /src/1 EMQ topic, filter data that has HelloMIST: as a prefix, and publish the data to the /sink/1 EMQ topic.

To publish and subscribe to the topics, we can use mosquitto_pub and mosquitto_sub. We can subscribe to the /sink/1 topic by:

mosquitto_sub -h 127.0.0.1 -t /sink/1

We can publish to the /src/1 topic by:

mosquitto_pub -h 127.0.0.1 -t /src/1 -m "HelloMIST: hello!"
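
If a programmatic client is preferred over the mosquitto CLI, the following is a rough sketch using the Eclipse Paho Java client (a third-party library, not part of MIST); the broker address and topic match the example above.

    import org.eclipse.paho.client.mqttv3.MqttClient;
    import org.eclipse.paho.client.mqttv3.MqttException;
    import org.eclipse.paho.client.mqttv3.MqttMessage;
    import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

    public class MqttPublishExample {
      public static void main(final String[] args) throws MqttException {
        // Connect to the local EMQ broker on the default port (1883).
        final MqttClient client = new MqttClient("tcp://127.0.0.1:1883",
            MqttClient.generateClientId(), new MemoryPersistence());
        client.connect();
        // Publish a message that the running MqttFilterApplication query should pass through.
        client.publish("/src/1", new MqttMessage("HelloMIST: hello!".getBytes()));
        client.disconnect();
      }
    }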

We can submit another stream query that filters HelloWorld:

./bin/run_example.sh MqttFilterApplication -source_topic /src/2 -sink_topic /sink/2 -filtered_string "HelloWorld:" -app_id 0

pluto's People

Contributors

bellatoris, bgchun, bokyoungbin, differentsc, junggil, kyungissac, pigbug419, sanha, seojangho, taegeonum, taehunkim

pluto's Issues

Implement simple word generator-aggregator model on REEF

As the first step of the MIST project, we need to implement a simple word generator-aggregator stream processing application on REEF. This issue is important for the two reasons below.

  • We can measure the number of streaming queries which can be processed in the environment without ZooKeeper.
  • We can practice development in the REEF environment (especially Tang and Wake) and get used to some useful tools for building MIST, like RemoteManager, NetworkService, etc. We can also think about how we can provide a simple API that hides the complex parts of REEF.

Implement input data transformation interface

The internal query should have concrete information about how data should be transformed. We will address this issue by providing an interface for (1) setting the input data from issue #11, (2) registering arbitrary transforming functions to the query, and (3) chaining those functions in order. In one query, multiple chains could be used in the case of a join operation.

For (2), we could leverage lambda functions to keep the interface simple.
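
A minimal sketch of chaining lambda-based transformations in order, using plain java.util.function types (illustrative only, not the actual MIST interface):

    import java.util.function.Function;

    public final class TransformChainExample {
      public static void main(final String[] args) {
        // Two user-registered transformations, chained in registration order.
        final Function<String, String[]> split = line -> line.split(" ");
        final Function<String[], Integer> count = words -> words.length;
        final Function<String, Integer> chain = split.andThen(count);
        System.out.println(chain.apply("a simple streaming line")); // prints 4
      }
    }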

Task: Support auto-parallelism in operators

We need to think about how to parallelize operators. There are two ways:

  1. Statically decide the degree of parallelism when generating physical plans, with cost-based optimization(?)
  2. Dynamically change the degree of parallelism by monitoring the operator stages and detecting bottlenecks.
    This topic needs discussion.

First, we can explicitly configure the operator parallelism before doing auto-parallelism.

SSM: Read/Write states

Implement getState() and setState() to allow query elements to read from and write to the SSM DB.
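
A hypothetical sketch of such an interface (the identifier and value types are assumptions for illustration):

    public interface StateStore {
      // Read the state of a query element from the SSM DB, or null if absent.
      Object getState(String elementId);

      // Write the state of a query element back to the SSM DB.
      void setState(String elementId, Object state);
    }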

Query-evaluator scheduling

In addition to scheduling within an Evaluator, we need to assign each streaming query to an appropriate evaluator.

This scheduling can be tricky, so we can start from simple approaches (e.g., random assignment or round-robin) and measure the performance. After that, we can improve the performance by leveraging performance metrics (CPU, memory, ...) for query assignment.

API: Make NCS source configuration builder for providing stream source configuration

We need a source configuration builder that lets users provide configurations for stream sources.

It will store its configuration in the form of a key-value map, and each type of source will have different setter methods.

This is not related to Tang Configuration and ConfigurationBuilder.
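
A rough sketch of what such a builder could look like for a text-socket source, backed by a key-value map (the class, key, and setter names are illustrative assumptions):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public final class TextSocketSourceConfigurationBuilder {
      private final Map<String, Object> conf = new HashMap<>();

      // Source-type-specific setters; other source types would expose their own.
      public TextSocketSourceConfigurationBuilder setHostName(final String hostName) {
        conf.put("source.hostname", hostName);
        return this;
      }

      public TextSocketSourceConfigurationBuilder setPort(final int port) {
        conf.put("source.port", port);
        return this;
      }

      // Return an immutable view of the collected key-value configuration.
      public Map<String, Object> build() {
        return Collections.unmodifiableMap(conf);
      }
    }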

API: Support KAFKA for data source

Messaging systems (like Kafka, Kinesis, MQTT, ...) are widely used as data sources in stream processing. We are planning to support Kafka first and make the others available in the future.

Should be addressed with #11

Implement a basic physical planner which supports a DAG operation

The physical planner converts the logical plan into a physical plan according to 1) the structure of the logical plan (DAG), 2) element sharing, and 3) parallelism.

First, we consider 1).
We need to implement a basic physical planner which just converts the logical plan into a physical plan.
The planner should decide which elements are executed in synchronous or asynchronous stages.
If the elements are executed linearly, the planner simply allocates them to synchronous stages (in the same thread).
If there are branches in the logical plan, the planner allocates asynchronous stages to the downstream elements (in different threads).
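
As a rough illustration of this rule (not the actual planner code), a vertex with at most one downstream element keeps its child in the same synchronous stage, while a branching vertex starts asynchronous stages for its children; the adjacency-list DAG representation here is an assumption.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    final class StageDecision {
      enum StageType { SYNCHRONOUS, ASYNCHRONOUS }

      // Decide how the downstream elements of a vertex are staged, given an adjacency list.
      static StageType downstreamStageType(final Map<String, List<String>> adjacency,
                                           final String vertex) {
        final List<String> children =
            adjacency.getOrDefault(vertex, Collections.emptyList());
        // Linear chain: run downstream elements in the same thread (synchronous stage).
        // Branch: run downstream elements in different threads (asynchronous stages).
        return children.size() <= 1 ? StageType.SYNCHRONOUS : StageType.ASYNCHRONOUS;
      }
    }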

Task: Implement basic stream operators

MIST should support various streaming operations, such as filter, flatMap, map, and reduce.
In MistTask, these operations are represented as operators.
As a stepping stone, we need to implement basic stream operators: MapOperator, FilterOperator, and so on.
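
A minimal sketch of what such operators might look like (the interface and method names are illustrative; the real operators would also need to emit outputs downstream and manage state):

    import java.util.function.Function;
    import java.util.function.Predicate;

    interface Operator<I, O> {
      O process(I input);
    }

    final class MapOperator<I, O> implements Operator<I, O> {
      private final Function<I, O> mapFunc;
      MapOperator(final Function<I, O> mapFunc) { this.mapFunc = mapFunc; }
      @Override public O process(final I input) { return mapFunc.apply(input); }
    }

    final class FilterOperator<I> implements Operator<I, I> {
      private final Predicate<I> filterFunc;
      FilterOperator(final Predicate<I> filterFunc) { this.filterFunc = filterFunc; }
      // Return the input if it passes the predicate, or null to drop it.
      @Override public I process(final I input) { return filterFunc.test(input) ? input : null; }
    }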

Task: Profiling physical operators

We need to measure each physical operator's

  1. resource usage
  2. rate of input/output flow

to figure out the behavior of operators and user queries. Maybe this can be used for 1) load balancing and 2) auto-parallelism.

SSM: Putting queries into the SSM

Currently, we have only thought about putting query elements' states into the SSM; we should also consider putting the Query itself into the SSM (for query statistics, etc.).

Support multiple queries in one Evaluator

To increase the number of queries that can be processed on one machine, we need to unify the runtime environment for many queries to reduce the overhead of maintaining a large number of queries. Currently, a REEF Evaluator is a separate process in one container from the RM, so we need to process multiple queries in a single Evaluator.

We need to address those particular issues below.

  • Put multiple queries in one Task via user API (issue #5)
  • Separate each query processing in one REEF evaluator and schedule them within the evaluator
  • Leverage disk space to store a large number of query states

Leverage disk for storing query states

To process multiple queries on one machine, we need to store the states of those queries. However, memory may not be enough to store all the states. This can happen with some queries, such as online ML (which needs to store many parameters).

The basic approach is to store a query's state on disk when the query is inactive, and to reload it into memory and process it when data comes in. However, reading data from disk is slow, so processing time can increase. Because of that, we need to address the issues below.

  • Which data should stay in memory? Which criteria (query SLA, data incoming frequency, ...) should we use?
  • Can we apply batched processing in some cases? I.e., stack multiple data items and update the state in one go while the state is available in memory. This can reduce the number of disk reads/writes, but it will delay state updates.
  • Can we predict the data arrival time and apply pre-fetching for some queries?

Task: Implement mist executor

As a first step, we need to implement a simple MIST executor whose scheduling policy is FIFO.
One executor consists of a queue, a thread, and a scheduler. The thread fetches tasks from the queue and processes them.

We will make the scheduling policy pluggable so that it can be changed easily.

Designing an optimal scheduling policy is future work.
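
A bare-bones sketch of such an executor, assuming a FIFO queue drained by one worker thread (class and method names are assumptions, not the actual MIST code):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public final class FifoExecutor {
      private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
      private final Thread worker;
      private volatile boolean running = true;

      public FifoExecutor() {
        worker = new Thread(() -> {
          try {
            while (running) {
              // Fetch the next task in FIFO order and process it.
              queue.take().run();
            }
          } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
        worker.start();
      }

      public void submit(final Runnable task) {
        queue.add(task);
      }

      public void shutdown() {
        running = false;
        worker.interrupt();
      }
    }

A pluggable scheduler would replace the plain FIFO queue order with its own policy for picking the next task.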

Explore existing streaming API

Before we design our user API, we need to fully understand the APIs of state-of-the-art stream processing systems.

To begin with, we decided to explore the user API of Apache Flink and Spark Streaming. This exploration should begin after we get sufficient background on stream processing frameworks.

API: Support NCS for data source

Our API needs to support network input connections for fetching data. For the network implementation, we can leverage the NCS implementation from Wake.

Should be addressed with issue #11

WordCounter on Spark Streaming

We have already investigated the limitations of micro-streams in current stream systems such as Storm and REEF.

We found that we also have to perform a similar investigation on Apache Spark Streaming.

We will implement WordGenerator and WordAggregator, and measure the limit on the number of queries processed at the same time.

Implement on-demand query state loading

As a first step of Issue #7, we will implement the on-demand query state loading feature on MIST. In this feature

  • Query state will be stored on local disk
  • When data arrives, the state will be loaded into memory and the updated state will be saved back to disk.

We'll measure the performance of this approach and find ways to improve it.

Shorten the gap of generating words

The basic experiments show that the most important bottleneck for micro-streams in the current REEF system is memory limitation.

However, CPU can also be a bottleneck if queries require more computation.

In this experiment, we examine this by shortening the word generation interval from 1 second to lower values.

Task: Implement a simple round-robin allocator

MistTask allocates OperatorChains to executors.
Then, the executors run and activate the dedicated OperatorChains when their inputs are sent.

As the first step of the above logic, we will implement a simple round-robin allocator
that assigns OperatorChains in a round-robin way.
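
A minimal sketch of the allocation policy itself, with MIST's OperatorChain and Executor types replaced by generic placeholders (illustrative only):

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    public final class RoundRobinAllocator<C, E> {
      private final List<E> executors;
      private final AtomicInteger next = new AtomicInteger(0);

      public RoundRobinAllocator(final List<E> executors) {
        this.executors = executors;
      }

      // Pick the next executor in round-robin order for the given operator chain.
      public E allocate(final C operatorChain) {
        final int index = Math.floorMod(next.getAndIncrement(), executors.size());
        return executors.get(index);
      }
    }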

API: Implement generic interface for outputting result

For outputting the result, we need a generic interface.

  • How do we represent the result?
  • Where do we output the result?

By separating those two things, we have more flexibility in defining output methods.
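
A hypothetical sketch of this separation, with one interface for how the result is represented and one for where it is written (all names are illustrative):

    // How the result is represented (e.g., serialized to bytes).
    interface Encoder<T> {
      byte[] encode(T result);
    }

    // Where the result is written (e.g., a socket, a file, a message broker).
    interface Sink {
      void write(byte[] encoded);
    }

    final class ResultWriter<T> {
      private final Encoder<T> encoder;
      private final Sink sink;

      ResultWriter(final Encoder<T> encoder, final Sink sink) {
        this.encoder = encoder;
        this.sink = sink;
      }

      // Encode the result, then hand it to the configured destination.
      void output(final T result) {
        sink.write(encoder.encode(result));
      }
    }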

Add checkstyle to mist

To automatically check code readability and formatting, we need to add checkstyle to MIST.

Implement low-level API

We need to have a low-level representation of MIST query. This should be done before issue #5.

I don't think we need to have a data-flow representation as Storm does, because we assume that each query is processed on one machine. So, I think we can focus on the issues below when designing internal APIs.

  1. How can we fetch the data?
  2. How can we transform the query state for the given data?
  3. How can we output the result?

SSM: Memory caching

The SSM should control which states go to disk and which remain in memory.
We need to decide what kind of caching strategy to use.

Implement internal query interface

We need to implement an internal operation representation inside the MIST query. Specifically, we need to define a procedure for how a single data item is processed inside MIST.

We should have a clear definition of ...

  • How is stream input data represented inside the query? (#15)
  • How is input data transformed? (#16)
  • How is state changed by input data? Which window should be used for managing the state? (#17)

Each will be addressed in a separate issue.

Review necessary papers

To get background on stream processing, we need to review some important papers. The papers I consider important are listed below.

Taehun and I will review those papers soon and have a short discussion on them. This issue will remain open in case we need more papers to read.

Please comment on this issue if you come up with more papers worth reading.

API: Support user-defined functions

To support various operations that cannot be expressed via basic operators, we need to provide an interface for defining UDFs in the MIST API.

Design user API

We need a good user API for the system. In particular, we need to hide the internal implementation of MIST, including REEF components. I think we should also hide the data-flow model underneath streaming queries and provide a data-centric interface for users.

To be more specific, we need to provide the three features below via our MIST API.

  • Data fetching from diverse sources (e.g. HDFS, KAFKA, ...)
  • Data transformation within queries. It should be explicit and general enough to support various operations.
  • Convenient output of query results

Issue #3 should be addressed before this issue.

Define input data stream inside the internal query

We need to define a way for the input data stream to be represented inside the internal query interface.

I think we can use a simple tuple interface with each field's name defined for this. We do not have to define a specific key here.
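
A minimal sketch of such a tuple, assuming a field-name-to-value map and no designated key field (illustrative only):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public final class NamedTuple {
      // Field values keyed by field name; insertion order is preserved.
      private final Map<String, Object> fields = new LinkedHashMap<>();

      public NamedTuple set(final String fieldName, final Object value) {
        fields.put(fieldName, value);
        return this;
      }

      public Object get(final String fieldName) {
        return fields.get(fieldName);
      }
    }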

Simple word counter with one task

As the first step of the ISSUE #1 , I will implement a simple word counter with one task.

The task performs both word generating and aggregating.

I'm going to implement it, starting with HelloREEF example REEF application.

Task: Convert Logical plan to physical plan

A logical plan is represented by a data flow, which is generated by the user-level API.
A physical plan is represented by actual MIST operators, with the degree of parallelism set.

We should convert a logical plan to a physical plan in the MIST task.

API: Make basic MISTStream interface

We need to make a basic MISTStream interface to represent user-created data streams. The basic MISTStream interfaces are:

  • SourceStream
  • OperatorStream

For those streams, the MIST system needs to know the type and other necessary information, so MISTStream and its derived classes should have methods for that information.
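
A rough sketch of this hierarchy (the method names are assumptions; only the type information is shown):

    interface MISTStream<T> {
      // Type information the system needs about this stream.
      Class<T> getType();
    }

    // A stream created directly from an external source.
    interface SourceStream<T> extends MISTStream<T> {
    }

    // A stream produced by applying an operator to an upstream MISTStream.
    interface OperatorStream<I, O> extends MISTStream<O> {
      MISTStream<I> getInputStream();
    }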

Implement internal query state transformation interface

After data is transformed, the internal query state should be updated by the newly arrived data. For that, we should provide an interface for

(1) defining the internal state,
(2) an update function taking the old state and the input data, and
(3) defining window information (window size & slide interval).
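
A hypothetical sketch combining the three pieces above (the names and the millisecond-based window parameters are assumptions):

    import java.util.function.BiFunction;

    public final class StatefulWindowSpec<S, I> {
      private final S initialState;                       // (1) internal state definition
      private final BiFunction<S, I, S> updateFunction;   // (2) new state from old state + input
      private final long windowSizeMs;                    // (3) window size
      private final long slideIntervalMs;                 //     and slide interval

      public StatefulWindowSpec(final S initialState, final BiFunction<S, I, S> updateFunction,
                                final long windowSizeMs, final long slideIntervalMs) {
        this.initialState = initialState;
        this.updateFunction = updateFunction;
        this.windowSizeMs = windowSizeMs;
        this.slideIntervalMs = slideIntervalMs;
      }

      public S initialState() { return initialState; }
      public long windowSizeMs() { return windowSizeMs; }
      public long slideIntervalMs() { return slideIntervalMs; }

      // Produce the new state from the old state and an input item.
      public S update(final S oldState, final I input) {
        return updateFunction.apply(oldState, input);
      }
    }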
