Flink stream reconciliation

A Flink job which reads an event stream and an audit stream and reconciles them, checking that every event ID in the audit stream is present in the event stream.

Requirements:

  • it must emit regular % completeness metrics every o minutes - both for the current time period and any previous periods which need adjusting
  • it must emit a missing event when an ID is missing from the event stream within n minutes of the watermark
  • it must emit a found event when an ID which has been reported as missing is found within m minutes of the watermark
  • it will allow for n minutes of out-of-orderness in either stream
  • it must deal with late events on both the event stream and the audit stream - late events within m minutes of the watermark should trigger updates to metrics and to missing and found events.

So we have the following configurable time periods:

  • o is the interval for metrics reporting (in event time)
  • n is the time after which to report missing events (and the amount of out-of-orderness to allow) (in event time)
  • m is the time period within which corrections can be made (and which determines state retention) (in event time)
  • z is the processing time period to wait before starting to emit results if the event time watermark stops advancing.
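
As a sketch, the four periods could be carried in one configuration object (the class and field names here are illustrative, not part of the design):

```java
import java.time.Duration;

/** Illustrative grouping of the four configurable time periods. */
public class ReconConfig {
    public final Duration metricsInterval;   // o: metrics reporting interval (event time)
    public final Duration missingAfter;      // n: report missing after this delay; also the allowed out-of-orderness
    public final Duration correctionWindow;  // m: window for corrections; bounds state retention
    public final Duration idleTimeout;       // z: wall-clock wait if the watermark stops advancing

    public ReconConfig(Duration o, Duration n, Duration m, Duration z) {
        this.metricsInterval = o;
        this.missingAfter = n;
        this.correctionWindow = m;
        this.idleTimeout = z;
    }
}
```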

Design

(design diagram)

(data model diagram)

(outer join diagram)

Implementation thoughts

Key the control event stream on topic + source name + ID so that the join operator state is just two booleans.
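
A minimal sketch of what that could look like with a KeyedCoProcessFunction, assuming both streams are keyed by topic + source name + ID (the Event, AuditEvent and ReconResult types are hypothetical placeholders):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Per-key state is just two flags: have we seen the event, and the audit record.
public class ReconJoin extends KeyedCoProcessFunction<String, Event, AuditEvent, ReconResult> {
    private final long nMillis; // n, in milliseconds
    private transient ValueState<Boolean> seenEvent;
    private transient ValueState<Boolean> seenAudit;

    public ReconJoin(long nMillis) { this.nMillis = nMillis; }

    @Override
    public void open(Configuration parameters) {
        seenEvent = getRuntimeContext().getState(new ValueStateDescriptor<>("seenEvent", Boolean.class));
        seenAudit = getRuntimeContext().getState(new ValueStateDescriptor<>("seenAudit", Boolean.class));
    }

    @Override
    public void processElement1(Event e, Context ctx, Collector<ReconResult> out) throws Exception {
        seenEvent.update(true);
        // If this key was already reported missing, a "found" correction would be emitted here.
    }

    @Override
    public void processElement2(AuditEvent a, Context ctx, Collector<ReconResult> out) throws Exception {
        seenAudit.update(true);
        // Arm an event-time timer for watermark + n; if the event has not
        // arrived when it fires, emit a missing event.
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + nMillis);
    }
}
```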

Ideally events on the audit topic should not be batched - each audit event should be a separate event. If we do this it means the windowing of the metrics, timeliness of exceptions raised etc. can be completely configured in the reconciliation code. I can't think of any reasons the audit topic should be batchy (other than maybe it makes the next point a little easier).

Advantage of not batching reconciliation - streams can be partitioned by topic and event ID, making the rec more horizontally scalable than it would be if we reconciled arbitrary batches of references.

If we re-emit a metric event for a time range it should replace the previous metric event. I.e. the time range for metric events must be deterministic.
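
For example, flooring each timestamp to the metrics interval o gives every record in a period the same window start, so a re-emitted metric carries the same key and replaces its predecessor (a sketch):

```java
/** Deterministic metric windows: every timestamp in [start, start + o) maps to the same start. */
public final class MetricWindows {
    public static long windowStart(long timestampMillis, long intervalMillis) {
        return timestampMillis - (timestampMillis % intervalMillis);
    }
}
```

A metric event keyed on (topic, source name, windowStart) then uniquely identifies its time range.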

Audit topic needs a schema definition. It should include, as a minimum: the source event time timestamp, the timestamp the audit event was emitted, the event reference (to join on), the source system name (name + ID must be unique), and possibly a free-text field for context to aid identification of breaks. The contract is that the ID, system name and event time timestamp must be identical for matching events.
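
A sketch of that minimum schema as a plain class (field names are illustrative; the real definition would live in a schema registry):

```java
/** Minimum audit event contract (illustrative field names). */
public class AuditEvent {
    public long eventTimestamp;    // source event time; must match the event stream exactly
    public long emittedTimestamp;  // when the audit event itself was emitted
    public String eventReference;  // the ID to join on
    public String sourceSystem;    // source system name; (name, ID) must be unique
    public String context;         // optional free text to aid identification of breaks
}
```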

How do we find the timestamp and reference in the event stream? We could either configure this for each schema, or use Kafka headers and e.g. the CloudEvents spec.
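
If the CloudEvents Kafka protocol binding were used, the ID and timestamp travel as ce_-prefixed headers in binary mode, so they can be read without knowing the payload schema. A sketch (error handling mostly omitted):

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.header.Headers;

/** Reads join fields from CloudEvents binary-mode Kafka headers; null if absent. */
public final class CloudEventHeaders {
    private static String header(Headers headers, String key) {
        Header h = headers.lastHeader(key);
        return h == null ? null : new String(h.value(), StandardCharsets.UTF_8);
    }

    public static String eventId(Headers headers)   { return header(headers, "ce_id"); }
    public static String eventTime(Headers headers) { return header(headers, "ce_time"); } // RFC 3339
}
```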

Need to deal with dupes. Anything upstream assuming it can do idempotent writes will cause dupes. We may need a configurable ability to ignore dupes within the m time period (and possibly to emit dupe metrics?).
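
Dedup could be a small keyed filter in front of the join that remembers each key for the m period - a sketch, again with a hypothetical Event type and assuming records carry event-time timestamps:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Drops duplicates of the same (topic, source, ID) key within the m window.
public class DedupeFilter extends KeyedProcessFunction<String, Event, Event> {
    private final long mMillis; // m, in milliseconds
    private transient ValueState<Boolean> seen;

    public DedupeFilter(long mMillis) { this.mMillis = mMillis; }

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void processElement(Event e, Context ctx, Collector<Event> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            // Clear the flag once the correction window has passed.
            ctx.timerService().registerEventTimeTimer(ctx.timestamp() + mMillis);
            out.collect(e);
        }
        // else: duplicate within m; optionally count it towards a dupe metric.
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
        seen.clear();
    }
}
```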

Need to deal with filtering and splitting. Filtering can be handled if the application emits no-op events for filtered inputs. Splitting (1 input topic, n output) can be built into the design of the reconciliation.

The join function needs to have a processing time timeout - if the audit or event topic stops for any reason (and therefore stops the watermark advancing) it needs to start generating metrics and exceptions after some configurable wall-clock time period. The idle sources feature will deal with idle partitions, but it is not clear whether it stops an entirely stalled stream from blocking progress. This watermark assigner has a processing time timeout - it is recommended by David Anderson on Stack Overflow.
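
For reference, the built-in idleness handling looks roughly like this (a sketch; AuditEvent is the hypothetical type from above). It only releases idle splits, so if every input stalls at once the join still needs its own processing-time timer as the z fallback:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public final class Watermarks {
    /**
     * Allows n of out-of-orderness; splits silent for the idleness period are
     * marked idle and stop holding back the combined watermark. A wall-clock
     * fallback inside the join, e.g.
     *   ctx.timerService().registerProcessingTimeTimer(
     *       ctx.timerService().currentProcessingTime() + zMillis);
     * is still needed if all inputs stall together.
     */
    public static WatermarkStrategy<AuditEvent> auditWatermarks(Duration n, Duration idleness) {
        return WatermarkStrategy.<AuditEvent>forBoundedOutOfOrderness(n).withIdleness(idleness);
    }
}
```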

The pattern should recommend sending heartbeat messages on both the event and inventory message streams.

How to deal with aggregation, where lots of events -> a single projection? We could include a list of message IDs in the projection, but this could break if the list gets too long. If the source has the projection (like SSENG has all obligations) we could use that as the source of the IDs. Or we consider this out of scope and simply check that all events entered the projection.

Inventory messages are produced with a role: Producer Expected, Consumer Expected or Actual. The expected inventories would only come from external sources/sinks; the actual would be generated by reading topics. Configuration is needed to associate inventories for reconciliation.
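
A sketch of the role distinction, plus a hypothetical pairing entry associating an expected inventory with its actual counterpart:

```java
/** Roles carried on inventory messages (terminology from the notes above). */
enum InventoryRole {
    PRODUCER_EXPECTED, // declared by an external source
    CONSUMER_EXPECTED, // declared by an external sink
    ACTUAL             // generated by reading the topics themselves
}

/** Hypothetical configuration entry pairing inventories for reconciliation. */
record ReconPairing(String expectedInventory, String actualInventory) {}
```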
