
htm-community / flink-htm


Distributed, streaming anomaly detection and prediction with HTM in Apache Flink

License: GNU Affero General Public License v3.0

Scala 44.77% Java 53.68% Shell 1.55%

flink-htm's People

Contributors

eronwright, gitter-badger, rmetzger


flink-htm's Issues

Author needs a new github pic.

All of life is out of balance and seems far away due to vision impairment occurring when viewing the author's profile! Please help...

Develop a Socrata connector

Let's develop a connector to read Socrata streams.

Consider using the soda-java library, but be aware that its dependencies are fairly old (Jackson 1, Jersey 1). We would probably need to shade the dependencies, update them, or both.

Alternatively, we could write something from scratch (consumer-side only, to begin with). The River source is written with Spray, which is easy to use. There are quite a few soda model classes (for geo data, timestamps, etc) and query classes that would need to be re-implemented.

Support network 'reset'

It is possible to send a reset to HTM to indicate that a temporal sequence has ended. Investigate how to facilitate a reset in the flink-htm API.

A simple option would be to expose the network object to direct manipulation in the select function. I would prefer a more filter-like approach, such as HTMStream::resetOn(f: ResetFunction).

Is it better to reset before or after an event?
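
To make the before/after question concrete, here is a minimal sketch of the filter-like approach in plain Java. Nothing here is flink-htm or htm.java API: StubNetwork, ResettingProcessor, and the resetBefore flag are hypothetical names used only for illustration, and the stub merely logs calls instead of running a real network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for an HTM network; it only records the calls it receives.
final class StubNetwork {
    final List<String> log = new ArrayList<>();

    void compute(String event) {
        log.add("compute:" + event);
    }

    void reset() {
        log.add("reset");
    }
}

// Hypothetical wrapper: resets the network whenever the predicate matches an event.
final class ResettingProcessor {
    private final StubNetwork network;
    private final Predicate<String> resetOn;
    private final boolean resetBefore;

    ResettingProcessor(StubNetwork network, Predicate<String> resetOn, boolean resetBefore) {
        this.network = network;
        this.resetOn = resetOn;
        this.resetBefore = resetBefore;
    }

    void process(String event) {
        boolean shouldReset = resetOn.test(event);
        if (shouldReset && resetBefore) {
            network.reset(); // the matching event starts a new sequence
        }
        network.compute(event);
        if (shouldReset && !resetBefore) {
            network.reset(); // the matching event ends the current sequence
        }
    }
}
```

With resetBefore set to false, the matching event is treated as the last element of its sequence; with true, it is treated as the first element of the next one. Either choice fits behind the same predicate-based API.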

Java - Overridden select() function defined in HTM.learn().select() is never called.

Template:

DataStream<T> result = HTM.learn(stream, new Network())
    .select(new InferenceSelectFunction<R, T>() {
        @Override
        public T select(Tuple2<R, NetworkInference> inference) throws Exception {
            return inference.f1.getAnomalyScore();
        }
    });

The overridden select is never called; I verified this by following the call hierarchy. As a result, no output is produced at all. I did verify, however, that the anomaly score is computed correctly internally.

I tried to resolve it myself without much success. Please fix it.

Support Flink Checkpointing of HTM State

Flink's reliability is based on checkpointing of operator state. In flink-htm the critical state is that of the HTM network.

The operator code is well-prepared to support checkpointing. A state holder is allocated from Flink (see L63 of KeyedHTMInferenceOperator) into which the network instance is stored. Flink expects to use the serializer passed to it, and will automatically serialize the network at the appropriate time. The issue is that the network is not actually serializable (as of htm.java 0.6.5), and the job will crash if checkpointing is enabled.

Stretch goal: Flink supports both synchronous and asynchronous checkpointing. In the latter case, the operator is free to process elements while the state is being checkpointed. Depending on how the network state is ultimately checkpointed, consider supporting asynchronous checkpointing.

Add unit tests for HTM operator

Flink provides a unit test framework for operators - or at least an integration test using an embedded Flink instance.

  • Use a mock Network.
  • Enhance the existing test (HTMIntegrationTest) to validate the output.
  • Test keyed streams.
  • Test the 'encoder input function' code (GenerateEncoderInputFunction), which takes a user type (i.e. case class, POJO, tuple as described by the subclasses of CompositeType) and produces a map of input data for the multi-encoder. Test the various kinds of user types, and test the value conversion logic (e.g. int -> double for ScalarEncoder).
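
As a rough illustration of the value-conversion case in the last item, the sketch below widens numeric fields to double and leaves other values untouched. It is not the actual GenerateEncoderInputFunction: EncoderInputSketch and toEncoderInput are hypothetical names, and the assumption that every numeric field should become a Double is made only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of flink-htm: builds an encoder input map
// in which every numeric field is widened to a Double.
final class EncoderInputSketch {
    static Map<String, Object> toEncoderInput(Map<String, Object> fields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Number) {
                // int, long, float, ... -> double (assumed requirement of a scalar encoder)
                out.put(entry.getKey(), Double.valueOf(((Number) value).doubleValue()));
            } else {
                out.put(entry.getKey(), value);
            }
        }
        return out;
    }
}
```

A unit test for the real function could follow the same shape: feed in a record with an int field and assert that the resulting map holds a Double for that key.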

Implement event-time ordering

It is possible for events to arrive out-of-order into the HTM operator. We should reorder the events using an internal queue, using the watermark to make progress.

See the flink-cep library for an example of this: CEPPatternOperator.
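
The reordering idea can be sketched without any Flink dependency: buffer events in a priority queue ordered by timestamp, and release everything at or below the watermark whenever it advances. TimestampedEvent and EventTimeReorderBuffer are hypothetical names for this sketch; a real operator would also need to keep the queue in checkpointed state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical event type carrying an event-time timestamp.
final class TimestampedEvent {
    final long timestamp;
    final String payload;

    TimestampedEvent(long timestamp, String payload) {
        this.timestamp = timestamp;
        this.payload = payload;
    }
}

// Hypothetical buffer: holds out-of-order events until the watermark passes them.
final class EventTimeReorderBuffer {
    private final PriorityQueue<TimestampedEvent> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.timestamp, b.timestamp));

    void add(TimestampedEvent event) {
        queue.add(event);
    }

    // Returns, in timestamp order, every buffered event with timestamp <= watermark.
    List<TimestampedEvent> advanceWatermark(long watermark) {
        List<TimestampedEvent> ready = new ArrayList<>();
        while (!queue.isEmpty() && queue.peek().timestamp <= watermark) {
            ready.add(queue.poll());
        }
        return ready;
    }
}
```

The events returned by advanceWatermark are the ones that would be handed to the network, since a watermark signals that no earlier events are still expected.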

Support stateful `select` function

The Flink API supports stateful mapper functions and so should the HTM DSL.

An important scenario is to store predictions over time for comparison purposes with later events, to calculate an error rate (for example). See the HotGym example.

The workaround at this time is to use a trivial select function followed by mapWithState.
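
The kind of state involved in the prediction-comparison scenario can be sketched in plain Java as follows. PredictionErrorTracker is a hypothetical class, not taken from the HotGym example, and mean absolute error is just one possible choice of error measure. In a Flink job, these fields would live in operator state rather than in a plain object.

```java
// Hypothetical state holder: remembers the previous prediction and compares it
// with the next actual value to maintain a running mean absolute error.
final class PredictionErrorTracker {
    private Double previousPrediction; // null until the first prediction is stored
    private double absoluteErrorSum;
    private long comparisons;

    // Call once per event with the observed value and the prediction for the next event.
    double update(double actual, double predictionForNext) {
        if (previousPrediction != null) {
            absoluteErrorSum += Math.abs(actual - previousPrediction);
            comparisons++;
        }
        previousPrediction = predictionForNext;
        return meanAbsoluteError();
    }

    double meanAbsoluteError() {
        return comparisons == 0 ? 0.0 : absoluteErrorSum / comparisons;
    }
}
```

A stateful select function would do the same bookkeeping per key, which is what the mapWithState workaround currently provides.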

Support unbounded rivers

RiverSource[T] should provide a continuous stream of data, rather than terminating as it does now.

Ideally we would still provide a batch mode of operation too.

  1. unbounded - the paging iterator should poll the stream on the interval specified in the meta.
  2. bounded - the paging iterator should read until a given point in event time, then terminate. All streams should end at the same (event) time - "now" by default.

Rename `learn` function

The learn function that flink-htm adds to DataStream is the primary API that inserts HTM into a dataflow graph. I consider the name of the function to be an open issue, since learn has another meaning in an HTM context.

The function name must be a verb, similar to existing functions like map, filter, and groupBy.

I suggest that the function name be grok.

Old:

val anomalyScores = env
      .addSource(nycTraffic)
      .keyBy("streamId")
      .learn(network)
      .select(...)

New:

val anomalyScores = env
      .addSource(nycTraffic)
      .keyBy("streamId")
      .grok(network)
      .select(...)
