calrissian / flowmix

Flowmix is a flexible event processing engine for Apache Storm. It supports complex correlations of events via sliding/tumbling windows. It allows parallel streams to be processed that can be grouped together in different ways.

License: Apache License 2.0

Java 100.00%

flowmix's Issues

Off-heap storage of cached windows.

Garbage collection will ultimately become a problem as a user's data grows. While having a good stress testing suite will help us better place numbers with the claims, eventually, garbage collection of the longer-lived on-heap items like the windows and caches, is going to become a problem. We should investigate using different off-heap solutions like memcached which will allow the preservation of locality.

FilterOp should take a Filter class and not just assume a QueryBuilder

Suppose we have a filter that looks like this:

public interface Filter {
    boolean accept(Event event);
}

And suppose we have a CriteriaFilter:

public class CriteriaFilter implements Filter {

   private Criteria criteria;

   public CriteriaFilter(Node node) {
      this.criteria = nodeToCriteria(node);
   }

   public boolean accept(Event event) {
      return criteria.matches(event);
   }
}
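
A minimal sketch of how an operator could hold any filter rather than assuming a QueryBuilder. The names `SimpleEvent`, `EventFilter`, and `FilterOpSketch` are hypothetical stand-ins written for illustration; they are not Flowmix or Mango classes.

```java
import java.util.Map;

// Hypothetical stand-in for an event: a flat map of attribute names to values.
final class SimpleEvent {
    final Map<String, Object> attributes;

    SimpleEvent(Map<String, Object> attributes) {
        this.attributes = attributes;
    }
}

// The filter contract proposed in the issue, written against the stand-in event type.
interface EventFilter {
    boolean accept(SimpleEvent event);
}

// Hypothetical operator that accepts any EventFilter instead of a QueryBuilder.
final class FilterOpSketch {
    private final EventFilter filter;

    FilterOpSketch(EventFilter filter) {
        this.filter = filter;
    }

    // Returns true when the event should continue down the stream.
    boolean passes(SimpleEvent event) {
        return filter.accept(event);
    }
}
```

With this shape, a criteria-based filter becomes just one implementation among many, and a lambda such as `e -> "people".equals(e.attributes.get("type"))` works as well.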

LongSumAggregator's internal fields should be protected so that subclasses can access them

flowmix design doc

I am writing a Chinese-language document about Flowmix's design (not totally finished yet): http://zqhxuyuan.github.io/2015/07/26/2015-09-11-Flowmix-CEP/
Hopefully it can help someone.

@cjnolet After digging into the Flowmix source code, I also have a question:

AggregatorWindow is composed of an Aggregator and a Window.
The Aggregator is responsible for storing the aggregate variables, while the Window stores the original Events.
Normally there is a PartitionOp before the AggregatorOp to do a group-by operation,
and partitioning ensures that one partition corresponds to one Window.
Suppose a Window stores at most 1000 events, there are 1000 partitions, and one event takes 1 KB.
Then the windows' memory in AggregateBolt is 1000 partitions * 1000 events * 1 KB = 1 GB.
So that is why the Aggregator stores temporary variables, which are enough to compute the aggregate result.
My question is: if the Aggregator's temporary variables are good enough, why do we need the Window's events?

Support punctuation-style control signals

http://pic.dhe.ibm.com/infocenter/streams/v2r0/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.streams.spl-language-specification.doc%2Fdoc%2Fpunctuation.html

Custom expire for the guava cache backing the windows

Currently, the expiration for windows that haven't been written to in a significant amount of time is 60 minutes. It would be useful if the user could set their own expiration on a per-stream basis for both the read expiration and the write expiration.

Direct support for new Event's Type field

The new version of Mango modifies the Event object to add a "type" field. By default, not specifying a type makes it a blank string. I think Flowmix could make use of this type- specifically in things like filters and stream splitting. I imagine a SQL dialect that could treat event types like tables:

SELECT * from people WHERE ....

Basically this would create a new stream with a filter that would send all events of type "people" into a flow where another filter downstream would only return those people matching the contents of the WHERE clause.

Exception stream for debugging/troubleshooting

A common practice in streams processing frameworks is to use a special stream for collecting error information. Without assuming anything about how the information is going to be used, it would be extremely useful to package up exceptions thrown in bolts, aggregators, etc... and send them to a special output in which users can plug in listeners to do stuff.

To optimize resource usage, FlowmixFactory should allow the user to turn on which functions they want

For instance, if someone just wants to use filter, select, aggregate, why should all the other flow operators be running?

Allow events to be broadcast to components which are listening.

This could make an interesting publish/subscribe system where different types of events can be sent to different areas of a flow (across streams, etc...) so that feedback loops of control information can be shared.

MockFlowLoaderSpout never adjusts loaded to true

Hash function used to create partitions should use Guava's hash builders.

Specifically, the hash function should be changed to something that prefers more speed- like Murmur hash.

Join Streams

A solid way to join streams would be to window the left-hand side and join each Event on the right-hand side against the left-hand side's window.
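
A minimal sketch of that idea, assuming events are plain maps, a count-based window on the left-hand side, and an equality match on a single shared attribute. `WindowedJoinSketch`, `onLeft`, `onRight`, and the `joinKey` parameter are hypothetical names, not part of the Flowmix API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical windowed join: left-hand events are buffered, right-hand events probe the buffer.
final class WindowedJoinSketch {
    private final Deque<Map<String, Object>> leftWindow = new ArrayDeque<>();
    private final int maxSize;      // count-based eviction threshold (assumed)
    private final String joinKey;   // attribute both sides must share (assumed)

    WindowedJoinSketch(int maxSize, String joinKey) {
        this.maxSize = maxSize;
        this.joinKey = joinKey;
    }

    // Buffer a left-hand event, evicting the oldest one when the window is full.
    void onLeft(Map<String, Object> event) {
        if (leftWindow.size() == maxSize) {
            leftWindow.removeFirst();
        }
        leftWindow.addLast(event);
    }

    // Join a right-hand event against every buffered left-hand event with the same key value.
    List<Map<String, Object>> onRight(Map<String, Object> event) {
        List<Map<String, Object>> joined = new ArrayList<>();
        Object key = event.get(joinKey);
        for (Map<String, Object> left : leftWindow) {
            if (key != null && key.equals(left.get(joinKey))) {
                Map<String, Object> merged = new HashMap<>(left);
                merged.putAll(event);   // right-hand values win on attribute name clashes
                joined.add(merged);
            }
        }
        return joined;
    }
}
```

Only the left-hand side consumes memory here; right-hand events are matched and then dropped, which yields inner-join behaviour.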

Flow object needs a kryo serializer.

Provide typed Aggregator abstraction to make it easier to write aggregators that deal with single types (numerical functions, etc..).

JoinBolt didn't deal with buffer==null

I see that the logic around cacheWindow (buffersForRule) and window (buffer) differs between AggregatorBolt, SortBolt, and JoinBolt. In JoinBolt, there is no processing for situation ① below (line 176):

if (buffersForRule != null) {
    buffer = buffersForRule.getIfPresent(flowInfo.getPartition());
    if (buffer != null) {    // if we have a buffer already, process it
        if (op.getEvictionPolicy() == Policy.TIME)
            buffer.timeEvict(op.getEvictionThreshold());
    }
    // Here no else ....  ①
} else {
    // new buffersForRule and buildWindow, then put to buffer and buffersForRule...
}

But AggregatorBolt and SortBolt both have this logic:

                if (cacheWindow != null) { 
                    window = cacheWindow.getIfPresent(flowInfo.getPartition());
                    if (window != null) {    // if we have a buffer already, process it
                        if (op.getEvictionPolicy() == Policy.TIME)
                            window.timeEvict(op.getEvictionThreshold());
                    } else {
                        //....
                    }
                } else {
                    //....
                }

Can you explain why the else branch isn't needed here?
As I understand it, each partition should be given its own window/buffer.

Unit tests for all aggregators

Right now, there is a generic integration test for the Aggregation API but it does not test each aggregator. With the addition of the avg aggregator and the other two aggregators, it would be a good idea to get some tests in place sooner rather than later.

Aggregator to support "rates" of counts/sums

This aggregator would ideally build on top of the sum and count aggregators to hold state about when the aggregator last fired so that it can divide the count/sum by the difference in time to figure out the average rate over that time period.
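
A minimal sketch of such an aggregator, assuming the rate is defined as the count divided by the seconds elapsed since the last trigger. `RateAggregatorSketch` and its methods are hypothetical and do not implement Flowmix's Aggregator interface; timestamps are passed in explicitly so the logic is easy to test.

```java
// Hypothetical rate aggregator: counts events and reports events per second since the last trigger.
final class RateAggregatorSketch {
    private long count;
    private long lastFiredMillis;

    RateAggregatorSketch(long startMillis) {
        this.lastFiredMillis = startMillis;
    }

    // Called once per event that enters the window.
    void added() {
        count++;
    }

    // Called when the window triggers; returns the rate and resets state for the next period.
    double fire(long nowMillis) {
        long elapsed = nowMillis - lastFiredMillis;
        double rate = elapsed > 0 ? count * 1000.0 / elapsed : 0.0;
        count = 0;
        lastFiredMillis = nowMillis;
        return rate;
    }
}
```

The same shape would work for a rate of sums by accumulating a value in `added` instead of incrementing a counter.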

Join operator should take a predicate to determine a successful join

The InfoSphere Streams API uses a user-built predicate to determine, between an input and output port, whether or not two tuples qualify for a join. Flowmix can do the same thing given a user specified Criteria. The criteriaFromNode(new QueryBuilder()....) API should work just fine for this.

Implementations of all FlowOps need to provide equals() and hashCode() methods

Light Aggregator (not keeping the whole field group)

Right now the aggregators keep the field group and do the math on the fly. I've seen the following approach in Tibco Streambase to process aggregation:
1- the tuple (as a group of attributes/columns, basically a map) gets into the aggregator and only the aggregator input fields are considered; the rest of the tuple is forgotten
2- the value gets processed into simple math: add, subtract. This is done in a deconstructed manner, e.g. avg is just one Long/BigInteger count and one Long/BigInteger sum. The actual field gets forgotten at this time.
3- when the aggregation window closes (emits, etc.) the 'heavy' math is done: multiplication, division, etc.

I've seen millions of tuples get processed into this type of aggregator with very low CPU and memory consumption and attack ships on fire off the shoulder of Orion
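
The three steps above can be sketched for an average as follows. `LightAverageSketch` is a hypothetical class, not a Flowmix aggregator, and it uses `long` for the running state; a real implementation might prefer BigInteger if overflow is a concern.

```java
// Hypothetical "light" average: keeps only a running count and sum, never the events themselves.
final class LightAverageSketch {
    private long count;
    private long sum;

    // Steps 1 and 2: take only the input field's value and fold it into simple running state.
    void add(long value) {
        count++;
        sum += value;
    }

    // Step 3: the division happens once, when the window closes or emits.
    double emit() {
        double average = count == 0 ? 0.0 : (double) sum / count;
        count = 0;
        sum = 0;
        return average;
    }
}
```

Memory use is constant per aggregator regardless of how many tuples pass through, which matches the low consumption described above.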

Rename Tuple to Attribute

As discussed in the tickets for the Mango project, Tuple will be getting renamed to Attribute in the Mango 2 release. This will affect this project as it stays up to date with Mango.

FlowmixFactory should have a method to take a bolt for loading events

Flowmix has always assumed that the spout itself can serve up collections of events directly. So in order to make something like a JMS spout return Collection, it needs to be extended and the nextTuple() method overridden. This doesn't allow as much reusability as, say, a bolt after the spout that converts the contents of the tuples from the spout into a Collection

Sort operator

The ability to sort a stream allows this to be used for analytics. Sort stream semantics could follow the rules put forth by InfoSphere Streams' sort relational operator

Spout to parse flows dynamically (i.e. passed in through webapp or JMS queue and serialized, etc..)

Investigate possibility of placing API on top of Samza

Samza is a streaming framework that claims it can decouple downstream processors from upstream, therefore eliminating the possibility of back pressure in a flow of data. Though I have not tested this, it could be the next thing and it would be useful if the Flowmix API could be made portable enough to implement on top of it.

Flowmix API on top of Apache Spark

We need a stress testing suite to show the true capability of Flowmix as well as to surface any bottlenecks that could seriously hinder the ability to process flows in real time.

MockEventGeneratorSpout should allow a list of events to be set in the constructor so that they can be more tuned to a use-case

Simple aggregators

Some standard aggregators that could be pretty simple to implement:

Min
Max
Average
Std Dev
Median
All stats (min, max, avg, std dev, median, etc...)
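
A minimal sketch of the "all stats" case, computed in one pass over a sorted copy of the values. `SimpleStatsSketch` is a hypothetical class, and it assumes the population standard deviation (dividing by n), which the issue does not specify.

```java
import java.util.Arrays;

// Hypothetical all-in-one statistics holder over a non-empty array of values.
final class SimpleStatsSketch {
    final double min;
    final double max;
    final double average;
    final double stdDev;
    final double median;

    SimpleStatsSketch(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        min = sorted[0];
        max = sorted[n - 1];

        double sum = 0;
        for (double v : sorted) {
            sum += v;
        }
        average = sum / n;

        double squares = 0;
        for (double v : sorted) {
            squares += (v - average) * (v - average);
        }
        stdDev = Math.sqrt(squares / n);   // population standard deviation (assumed)

        // Middle element for odd n, mean of the two middle elements for even n.
        median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```

Unlike min, max, and average, the median needs every value in the window, so it cannot be kept as constant-size running state.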

Binary streams demonstration

Sound and image analysis can be done efficiently in streams processing using algorithms to create clustered aggregations (k-means). It would be extremely useful to demonstrate this using flowbox.

Join operator should perform all possible joins

Currently, the Join operator only performs an inner join and passes all joined Events down the same output stream. All of the possible inner/outer join combinations should be implemented.

Further, the join operator does not use a supplied predicate to determine the nature of the join. This needs to be added as well.

Release Flowmix 0.2.0

This version of Flowmix bumps the version of storm to 0.9.2- giving it the ability to run with Netty instead of ZMQ.

Also- an API package has been established separate from Core. Though things may be moved around a bit more before 1.0.0, the purpose is to allow users to use API without depending on Core- meaning core can change as we need it to but API should remain consistent and go through a deprecation cycle for changes.

Common functionality in bolts needs to be placed in helper classes

There's some pretty common blocks of code that exist in pretty much all the bolts. This code needs to be refactored into a helper class so that it's in one place instead of many.

Stream SQL dialect to provide a higher level API around the builder DSL.

As mentioned by @Kr77, this is for users that may know SQL but not necessarily know code. The dialect is explained here: http://sqlstream.com/docs/

Aggregator window's window eviction fires unexpectedly because the window is only ever written to the cache once.

Fix is to add the window to the cache after every event encountered.

Support for filtering by visibilities

It's often important that a flow itself be associated with particular visibilities so that it's not possible for Events to propagate through those flows. This feature would support them as a first-class attribute to be used for filtering.

The TickSpout is only emitting ticks in nextTuple() which means it's very likely the ticks could get out of sync.

A better option would be to have a TimerTask running that is emitting tuples directly instead of using nextTuple(). This would guarantee tick tuples are always prioritized. It's possible it could cause issues in cases where there is significant back-pressure... but we'll need to verify this through some stress tests.
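
A minimal sketch of a timer-driven tick source using `java.util.Timer`. `TimerTickSketch` and its `sink` callback are hypothetical; in a real spout the sink would have to hand ticks to Storm safely, which this sketch does not address.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Hypothetical tick source driven by a TimerTask rather than by nextTuple() polling.
final class TimerTickSketch {
    private final Timer timer = new Timer("tick-timer", true);   // daemon thread
    private final AtomicLong ticks = new AtomicLong();

    // Emits an increasing tick number to the sink at a fixed rate.
    void start(long periodMillis, LongConsumer sink) {
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                sink.accept(ticks.incrementAndGet());
            }
        }, periodMillis, periodMillis);
    }

    // Stops the timer; no further ticks are scheduled.
    void stop() {
        timer.cancel();
    }

    long ticksEmitted() {
        return ticks.get();
    }
}
```

`scheduleAtFixedRate` schedules each run relative to the first one, so a delayed tick is followed by catch-up runs rather than a permanently shifted schedule.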

AggregatedEvent's constructor taking in the event's previous stream as an argument, is never used.

Stream outputs need to be managed internally so that streams can be "exported" to other streams

This is easier than it sounds. By default, streams end up at the "output" stream where the downstream bolt handles everything. The design here is to have the following pattern in the builder:

new FlowBuilder()
.stream("stream1")
.endStream("stream2")
.stream("stream2")
.endStream()

This ultimately says "Start with stream1 and send results of stream1 to stream2, then to the output" (since no specific output stream is declared in stream2).

Solidify API and decouple from Core

In theory- API is only code that we are exposing to users that we are guaranteeing to keep compatible with some deprecation cycle for large changes.

Core should rely on API but a user should never be directly interacting with Core. There are still some Core classes that are being exposed to users.

Ability to orchestrate a window based on "control events"

InfoSphere Streams calls this Punctuation. It would be useful if a window's eviction or trigger could be based on an event that's received on a control stream.

Move criteria to mango

Select and Sort operators should take varargs instead of .field() and .sortBy().

This makes the code more useful for dynamic fields

Event tuples with the same key, but wrong size

I tested the Event API, but an error happens on the last line:

        Event event = new BaseEvent(UUID.randomUUID().toString(), System.currentTimeMillis());
        event.put(new Tuple("key1", "val1"));
        event.put(new Tuple("key2", "val2"));
        event.put(new Tuple("key3", "val3"));

        event.put(new Tuple("key1", "val11"));

        Collection<Tuple> tuples = event.getAll("key1");
        assertEquals(2, tuples.size());

        event.put(new Tuple("key2", "val2"));
        Collection<Tuple> tuples2 = event.getAll("key2");
        assertEquals(1, tuples2.size());

The first assertEquals passes, which I understand, because key1 has two values, which makes two tuples.
But the second assertEquals fails:

java.lang.AssertionError: 
Expected :1
Actual   :2

Because of this API error, a test case in JoinBoltIT fails (other test cases may fail too).
Then I printed tuples2:

[Tuple{key='key2', value=val2, metadata={}}, Tuple{key='key2', value=val2, metadata={}}]

These two tuples are identical and should be merged into one Tuple.

SplitterOp to allow a stream to go to multiple different streams

Currently, there's no easy way to split off a single stream into many other streams before the std output point occurs. A SplitterOp would be useful to allow:

A single stream to split off into 1 or many other streams
A single stream to split off into other streams AND pass some data through the rest of the current stream.
A single stream to feed some data back through that same stream (possibly for enrichment purposes).

Allow time-based triggered windows to be synchronized

Currently, all windows are independent with regard to time. Storm manages ticking so that seconds can be incremented at the window level but, for instance, if an aggregator window should trigger every 50 seconds, the windows will start counting those 50 seconds from when the window opened instead of following a global 50-second interval set by the bolt.

There are cases where you'd want both, but specifically it would be nice to be able to lock the aggregators to the bolt's counter so that they all trigger at the same time.
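
The difference between the two modes can be sketched as two trigger checks, assuming the bolt keeps a global tick counter that increments once per second. `SyncedTriggerSketch` and both method names are hypothetical.

```java
// Hypothetical trigger checks contrasting per-window timing with a bolt-wide schedule.
final class SyncedTriggerSketch {

    // Independent: fires every `interval` ticks counted from when the window itself opened.
    static boolean firesIndependently(long openedAtTick, long currentTick, long interval) {
        long age = currentTick - openedAtTick;
        return age > 0 && age % interval == 0;
    }

    // Synchronized: fires on the bolt's global interval boundaries, regardless of open time.
    static boolean firesSynchronized(long currentTick, long interval) {
        return currentTick > 0 && currentTick % interval == 0;
    }
}
```

For a window opened at tick 20 with a 50-second interval, the independent check fires at ticks 70, 120, and so on, while the synchronized check fires at 50, 100, and so on for every window in the bolt.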

The Event loader component should use the same parallelism as the rest of the topology

It should really be assumed that a parallelizable queue is being used that can round-robin data into the spouts. I would rather have this be the default rather than have the default become a bottleneck.

[StopGate] not closing the gate.

This was verified using a time_delta_lt activation policy with a time-based open policy.
