calrissian / flowmix

Flowmix is a flexible event processing engine for Apache Storm. It supports complex correlations of events via sliding/tumbling windows. It allows parallel streams to be processed that can be grouped together in different ways.

License: Apache License 2.0

Java 100.00%

flowmix's Issues

Off-heap storage of cached windows.

Garbage collection will ultimately become a problem as a user's data grows. A good stress-testing suite will help us put numbers behind that claim, but eventually garbage collection of the longer-lived on-heap items, such as the windows and caches, is going to become a problem. We should investigate off-heap solutions such as memcached that still allow locality to be preserved.

FilterOp should take a Filter class and not just assume a QueryBuilder

Suppose we have a filter that looks like this:

public interface Filter {
    boolean accept(Event event);
}

And suppose we have a CriteriaFilter:

public class CriteriaFilter implements Filter {

   private Criteria criteria;

   public CriteriaFilter(Node node) {
      this.criteria = nodeToCriteria(node);
   }

   public boolean accept(Event event) {
      return criteria.matches(event);
   }
}
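
A minimal sketch of how a FilterOp could depend only on the Filter interface. It assumes a simplified event modelled as a Map; the FilterOp constructor, the passes() method, and the demo class are hypothetical and are not taken from the Flowmix code base.

```java
import java.util.Map;

// Hypothetical stand-in for the Filter interface proposed above.
// Events are modelled as plain maps here; the real Event class is richer.
interface Filter {
    boolean accept(Map<String, Object> event);
}

// Hypothetical FilterOp that knows nothing about QueryBuilder or Criteria.
// A criteria-based filter becomes just one possible Filter implementation.
final class FilterOp {
    private final Filter filter;

    FilterOp(Filter filter) {
        this.filter = filter;
    }

    // Returns true when the event should continue down the stream.
    boolean passes(Map<String, Object> event) {
        return filter.accept(event);
    }
}

class FilterOpDemo {
    public static void main(String[] args) {
        // Filter has a single method, so a lambda is enough for simple cases.
        FilterOp adultsOnly = new FilterOp(e -> ((Integer) e.get("age")) >= 18);
        System.out.println(adultsOnly.passes(Map.of("age", 30)));
        System.out.println(adultsOnly.passes(Map.of("age", 12)));
    }
}
```

With this shape, CriteriaFilter and any user-written filter can be passed to the same operator.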

flowmix design doc

I am writing a design document about Flowmix in Chinese (not completely finished yet): http://zqhxuyuan.github.io/2015/07/26/2015-09-11-Flowmix-CEP/
Hopefully it can help someone.

@cjnolet After digging into the Flowmix source code, I also have a question:

AggregatorWindow is composed of an Aggregator and a Window.
The Aggregator is responsible for storing the aggregate variables, while the Window stores the original Events.
Normally a PartitionOp comes before the AggregatorOp to do a group-by operation,
and partitioning ensures that each partition corresponds to exactly one Window.
Suppose a Window stores at most 1000 events, there are 1000 partitions, and one event takes 1 KB.
Then the windows in AggregateBolt take 1000 partitions * 1000 KB = about 1 GB of memory.
That is why the Aggregator stores temporary variables, which are enough to compute the aggregate result.
My question is: if the Aggregator's temporary variables are good enough, why do we need the Window's events?

Custom expire for the guava cache backing the windows

Currently, the expiration for windows that haven't been written to in a significant amount of time is 60 minutes. It would be useful if the user could set their own expiration on a per-stream basis for both the read expiration and the write expiration.

Direct support for new Event's Type field

The new version of Mango modifies the Event object to add a "type" field. By default, not specifying a type makes it a blank string. I think Flowmix could make use of this type, specifically in things like filters and stream splitting. I imagine a SQL dialect that could treat event types like tables:

SELECT * from people WHERE ....

Basically this would create a new stream with a filter that sends all events of type "people" into a flow, where another filter downstream would return only those people matching the contents of the WHERE clause.

Exception stream for debugging/troubleshooting

A common practice in stream processing frameworks is to use a special stream for collecting error information. Without assuming anything about how the information is going to be used, it would be extremely useful to package up exceptions thrown in bolts, aggregators, etc. and send them to a special output where users can plug in listeners to handle them.
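
A minimal sketch of the idea, independent of Storm. The ErrorStream, ErrorRecord, and guard names are hypothetical; in a real topology the records would more likely be emitted on a dedicated Storm stream than passed to in-process listeners.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical error channel: work is run through guard(), and any runtime
// exception is packaged up and handed to whatever listeners are registered.
final class ErrorStream {

    // What gets published: where the failure happened and what was thrown.
    static final class ErrorRecord {
        final String source;
        final Throwable cause;

        ErrorRecord(String source, Throwable cause) {
            this.source = source;
            this.cause = cause;
        }
    }

    private final List<Consumer<ErrorRecord>> listeners = new CopyOnWriteArrayList<>();

    void addListener(Consumer<ErrorRecord> listener) {
        listeners.add(listener);
    }

    // Runs the work; on failure, publishes a record instead of rethrowing.
    void guard(String source, Runnable work) {
        try {
            work.run();
        } catch (RuntimeException e) {
            ErrorRecord error = new ErrorRecord(source, e);
            for (Consumer<ErrorRecord> listener : listeners) {
                listener.accept(error);
            }
        }
    }
}
```

The sketch makes no assumption about what listeners do with a record; logging, metrics, or a dead-letter store would all fit.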

Join Streams

A solid way to join streams would be to window the left-hand side (LHS) and join each Event on the right-hand side (RHS) against the events in that window.
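
A minimal sketch of that approach, assuming events are plain maps, the window is count-based, and the join is an equality match on a single field. The WindowedJoin class and its method names are hypothetical, not Flowmix's JoinBolt.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical windowed join: LHS events are buffered, RHS events probe the buffer.
final class WindowedJoin {
    private final String joinField;
    private final int maxWindowSize;
    private final Deque<Map<String, Object>> leftWindow = new ArrayDeque<>();

    WindowedJoin(String joinField, int maxWindowSize) {
        if (maxWindowSize <= 0) {
            throw new IllegalArgumentException("maxWindowSize must be positive");
        }
        this.joinField = joinField;
        this.maxWindowSize = maxWindowSize;
    }

    // LHS events are only buffered; the oldest one is evicted when the window is full.
    void onLeft(Map<String, Object> event) {
        if (leftWindow.size() == maxWindowSize) {
            leftWindow.removeFirst();
        }
        leftWindow.addLast(event);
    }

    // Each RHS event is compared with every buffered LHS event (inner join).
    List<Map<String, Object>> onRight(Map<String, Object> event) {
        List<Map<String, Object>> joined = new ArrayList<>();
        Object key = event.get(joinField);
        for (Map<String, Object> left : leftWindow) {
            if (key != null && key.equals(left.get(joinField))) {
                Map<String, Object> merged = new HashMap<>(left);
                merged.putAll(event); // RHS values win on conflicting keys
                joined.add(merged);
            }
        }
        return joined;
    }
}
```

An RHS event that arrives after its LHS counterpart has been evicted produces no output, so the window size bounds both memory use and how late a match can arrive.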

JoinBolt didn't deal with buffer==null

I see that the logic around cacheWindow (buffersForRule) and window (buffer) differs between AggregatorBolt, SortBolt, and JoinBolt. In JoinBolt, nothing is done in the situation marked ① below (line 176):

  if (buffersForRule != null) {
        buffer = buffersForRule.getIfPresent(flowInfo.getPartition());
        if (buffer != null) {    // if we have a buffer already, process it
              if(op.getEvictionPolicy() == Policy.TIME)
                    buffer.timeEvict(op.getEvictionThreshold());
        }
       //Here no else ....  ①
 } else {
       //new buffersForRule and buildWindow, then put to buffer and buffersForRule...
 }

But AggregatorBolt and SortBolt both have this logic:

                if (cacheWindow != null) { 
                    window = cacheWindow.getIfPresent(flowInfo.getPartition());
                    if (window != null) {    // if we have a buffer already, process it
                        if (op.getEvictionPolicy() == Policy.TIME)
                            window.timeEvict(op.getEvictionThreshold());
                    } else {
                        //....
                    }
                } else {
                    //....
                }

Can you explain why the else branch isn't needed here?
As I understand it, different partitions should be given different windows/buffers.

Unit tests for all aggregators

Right now, there is a generic integration test for the Aggregation API, but it does not test each aggregator. With the addition of the avg aggregator and the other two aggregators, it would be a good idea to get some tests in place sooner rather than later.

Aggregator to support "rates" of counts/sums

This aggregator would ideally build on top of the sum and count aggregators and hold state about when the aggregator last fired, so that it can divide the count/sum by the elapsed time to figure out the average rate over that time period.
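
A minimal sketch of such a rate calculation, assuming timestamps in milliseconds and a rate expressed per second. RateAggregator and its methods are hypothetical and do not implement Flowmix's Aggregator interface.

```java
// Hypothetical rate aggregator: accumulates a sum (use add(1) for counts)
// and remembers when it last fired so it can report a per-second rate.
final class RateAggregator {
    private long sum;
    private long lastEmitMillis;

    RateAggregator(long startMillis) {
        this.lastEmitMillis = startMillis;
    }

    void add(long value) {
        sum += value;
    }

    // Rate = accumulated sum / elapsed seconds since the previous emit.
    // The state is reset so the next emit covers only the new period.
    double emit(long nowMillis) {
        long elapsedMillis = nowMillis - lastEmitMillis;
        double rate = elapsedMillis > 0 ? sum * 1000.0 / elapsedMillis : 0.0;
        sum = 0;
        lastEmitMillis = nowMillis;
        return rate;
    }
}
```

For example, 50 events counted over 10 seconds would be reported as 5.0 per second.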

Join operator should take a predicate to determine a successful join

The InfoSphere Streams API uses a user-built predicate to determine, between an input and output port, whether or not two tuples qualify for a join. Flowmix can do the same thing given a user specified Criteria. The criteriaFromNode(new QueryBuilder()....) API should work just fine for this.

Light Aggregator (not keeping the whole field group)

Right now the aggregators keep the field group and do the math on the fly. I've seen the following approach in Tibco Streambase to process aggregation:
1- the tuple (a group of attributes/columns, basically a map) gets into the aggregator, only the aggregator's input fields are considered, and the rest of the tuple is forgotten
2- the value gets processed with simple math: add, subtract. This is done in a deconstructed manner, e.g. avg is just one Long/BigInteger count and one Long/BigInteger sum. The actual field gets forgotten at this time.
3- when the aggregation window closes (emits, etc.) the 'heavy' math is done: multiplication, division, etc.

I've seen millions of tuples get processed into this type of aggregator with very low CPU and memory consumption and attack ships on fire off the shoulder of Orion
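
The three steps above can be sketched for an average as follows. This is a hedged illustration only: tuples are modelled as plain maps, and the LightAverage class is hypothetical rather than part of Flowmix or Streambase.

```java
import java.util.Map;

// Hypothetical "light" average: keeps only a count and a sum, never the tuples.
final class LightAverage {
    private final String inputField;
    private long count;
    private long sum;

    LightAverage(String inputField) {
        this.inputField = inputField;
    }

    // Steps 1 and 2: read only the input field, fold it into the running
    // count and sum with cheap additions, and keep no reference to the tuple.
    void add(Map<String, Object> tuple) {
        Object value = tuple.get(inputField);
        if (value instanceof Number) {
            count++;
            sum += ((Number) value).longValue();
        }
    }

    // Step 3: the division happens once, when the window emits.
    double emit() {
        double average = count == 0 ? 0.0 : (double) sum / count;
        count = 0;
        sum = 0;
        return average;
    }
}
```

Memory use per partition is constant in this scheme, regardless of how many tuples pass through before the window emits.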

Rename Tuple to Attribute

As discussed in the tickets for the Mango project, Tuple will be getting renamed to Attribute in the Mango 2 release. This will affect this project as it stays up to date with Mango.

FlowmixFactory should have a method to take a bolt for loading events

Flowmix has always assumed that the spout itself can serve up collections of events directly. So in order to make something like a JMS spout return a Collection of events, the spout needs to be extended and its nextTuple() method overridden. This doesn't allow as much reusability as, say, a bolt after the spout that converts the contents of the spout's tuples into a Collection.

Sort operator

The ability to sort a stream allows this to be used for analytics. Sort stream semantics could follow the rules put forth by InfoSphere Streams' sort relational operator

Investigate possibility of placing API on top of Samza

Samza is a streaming framework that claims it can decouple downstream processors from upstream ones, thereby eliminating the possibility of back pressure in a flow of data. Though I have not tested this, it could be the next thing, and it would be useful if the Flowmix API could be made portable enough to implement on top of it.

Simple aggregators

Some standard aggregators that could be pretty simple to implement:

  • Min
  • Max
  • Average
  • Std Dev
  • Median
  • All stats (min, max, avg, std dev, median, etc...)
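
A minimal sketch of an "all stats" calculation over the values of one window. The Stats class is hypothetical, and the choice of population (rather than sample) standard deviation is an assumption; the issue does not say which is intended.

```java
import java.util.Arrays;

// Hypothetical summary of one window's values.
final class Stats {
    final double min;
    final double max;
    final double mean;
    final double stdDev;  // population standard deviation (assumption)
    final double median;

    Stats(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        min = sorted[0];
        max = sorted[sorted.length - 1];

        double total = 0;
        for (double v : sorted) {
            total += v;
        }
        mean = total / sorted.length;

        double squaredDiffs = 0;
        for (double v : sorted) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        stdDev = Math.sqrt(squaredDiffs / sorted.length);

        // Median: middle value, or the mean of the two middle values.
        int mid = sorted.length / 2;
        median = sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

Min, max, average, and standard deviation can also be maintained incrementally from a few running totals, whereas an exact median needs the window's values, as in this sketch.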

Binary streams demonstration

Sound and image analysis can be done efficiently in stream processing using algorithms to create clustered aggregations (k-means). It would be extremely useful to demonstrate this using flowbox.

Join operator should perform all possible joins

Currently, the Join operator only performs an inner join and passes all joined Events down the same output stream. All of the possible inner/outer join combinations should be implemented.

Further, the join operator does not use a supplied predicate to determine the nature of the join. This needs to be added as well.

Release Flowmix 0.2.0

This version of Flowmix bumps the version of Storm to 0.9.2, giving it the ability to run with Netty instead of ZMQ.

Also, an API package has been established separate from Core. Though things may be moved around a bit more before 1.0.0, the purpose is to allow users to use API without depending on Core, meaning Core can change when we need it to but API should remain consistent and go through a deprecation cycle for changes.

Support for filtering by visibilities

It's often important that a flow itself be associated with particular visibilities so that it's not possible for Events to propagate through those flows. This feature would support them as a first-class attribute to be used for filtering.

Stream outputs need to be managed internally so that streams can be "exported" to other streams

This is easier than it sounds. By default, streams end up at the "output" stream where the downstream bolt handles everything. The design here is to have the following pattern in the builder:

new FlowBuilder()
.stream("stream1")
.endStream("stream2")
.stream("stream2")
.endStream()

This ultimately says "Start with stream1 and send the results of stream1 to stream2, then to the output" (since no specific output stream is declared in stream2).

Solidify API and decouple from Core

In theory, API is only the code that we are exposing to users and that we are guaranteeing to keep compatible, with some deprecation cycle for large changes.

Core should rely on API but a user should never be directly interacting with Core. There are still some Core classes that are being exposed to users.

Event tuples with the same key: wrong size

I tested the Event API, but an error happens on the last line:

        Event event = new BaseEvent(UUID.randomUUID().toString(), System.currentTimeMillis());
        event.put(new Tuple("key1", "val1"));
        event.put(new Tuple("key2", "val2"));
        event.put(new Tuple("key3", "val3"));

        event.put(new Tuple("key1", "val11"));

        Collection<Tuple> tuples = event.getAll("key1");
        assertEquals(2, tuples.size());

        event.put(new Tuple("key2", "val2"));
        Collection<Tuple> tuples2 = event.getAll("key2");
        assertEquals(1, tuples2.size());

The first assertEquals passes, as expected, because key1 has two values, which makes two tuples.
But the second assertEquals fails:

java.lang.AssertionError: 
Expected :1
Actual   :2

Because of this API error, the test case in JoinBoltIT fails (other test cases may fail too).
Then I printed tuples2:

[Tuple{key='key2', value=val2, metadata={}}, Tuple{key='key2', value=val2, metadata={}}]

These two tuples are identical, so they should be merged into one Tuple.
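
The difference between the observed and the expected behaviour can be illustrated with a simplified multi-valued store. This is only a sketch: MultiValueStore is hypothetical, it stores plain strings instead of Tuples, and it makes no claim about how Mango's BaseEvent keeps its tuples internally.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

// Hypothetical key -> values store with two possible duplicate policies.
final class MultiValueStore {
    private final Map<String, Collection<String>> values = new HashMap<>();
    private final boolean mergeDuplicates;

    MultiValueStore(boolean mergeDuplicates) {
        this.mergeDuplicates = mergeDuplicates;
    }

    void put(String key, String value) {
        Collection<String> forKey = values.get(key);
        if (forKey == null) {
            if (mergeDuplicates) {
                forKey = new LinkedHashSet<>();  // set semantics: equal values collapse
            } else {
                forKey = new ArrayList<>();      // list semantics: every put is kept
            }
            values.put(key, forKey);
        }
        forKey.add(value);
    }

    Collection<String> getAll(String key) {
        return values.getOrDefault(key, Collections.emptyList());
    }
}
```

With list semantics, putting ("key2", "val2") twice yields two entries, which matches the output printed above; with set semantics the second put is absorbed, which is the behaviour this issue asks for. Set semantics for real Tuples would rely on Tuple defining equals() and hashCode() over key, value, and metadata.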

SplitterOp to allow a stream to go to multiple different streams

Currently, there's no easy way to split off a single stream into many other streams before the std output point occurs. A SplitterOp would be useful to allow:

  1. A single stream to split off into 1 or many other streams
  2. A single stream to split off into other streams AND pass some data through the rest of the current stream.
  3. A single stream to feed some data back through that same stream (possibly for enrichment purposes).

Allow time-based triggered windows to be synchronized

Currently, all windows are independent with regard to time. Storm manages a tick so that seconds can be incremented at the window level but, for instance, if an aggregator window should trigger every 50 seconds, the windows will start counting those 50 seconds from when each window opened instead of following a global 50-second interval set by the bolt.

There are cases where you'd want both, but specifically it would be nice to be able to lock the aggregators to the bolt's counter so that they all trigger at the same time.
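
A minimal sketch contrasting the two behaviours, assuming times are whole seconds on a counter shared by the bolt and that nowSecs is never earlier than openedAtSecs. The TriggerTimes class and both method names are hypothetical.

```java
// Hypothetical helper that computes when a time-triggered window fires next.
final class TriggerTimes {

    // Current behaviour: each window counts from the moment it was opened.
    static long nextIndependent(long openedAtSecs, long nowSecs, long intervalSecs) {
        long elapsed = nowSecs - openedAtSecs;
        return openedAtSecs + (elapsed / intervalSecs + 1) * intervalSecs;
    }

    // Proposed behaviour: every window fires on the next multiple of the
    // interval on the bolt's shared counter, regardless of when it opened.
    static long nextSynchronized(long nowSecs, long intervalSecs) {
        return (nowSecs / intervalSecs + 1) * intervalSecs;
    }
}
```

With a 50-second interval and the counter at 30, windows opened at seconds 7 and 23 would fire at 57 and 73 independently, but both at 50 when synchronized.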
