xmidt-org / ears

Event Async Receiver Service (EARS)

License: Apache License 2.0

Go 99.07% Shell 0.62% Dockerfile 0.07% Makefile 0.13% HTML 0.10%

ears's Introduction

Event Async Routing Service (EARS)

Summary

A simple scalable routing service to usher events from an input plugin (for example, Kafka) to an output plugin (for example, AWS SQS). As an event passes through EARS, it may be filtered or transformed depending on the configuration details of a route and the event payload. Routes can be dynamically added and removed using a simple REST API and modifications to the routing table are quickly synchronized across an EARS cluster.

EARS is designed to eventually replace EEL, offering new features such as quotas and rate limiting as well as highly dynamic routes while still supporting filtering and transformation capabilities similar to EEL.

EARS comes with a set of standard plugins to support some of the most common message protocols including Webhook, Kafka, SQS, Kinesis etc. but also makes the development of third party plugins easy.

Our Kanban Board can be found here.

User Guide

References

Contributing

Refer to CONTRIBUTING.md.


ears's Issues

Performance Tests

Timestamp Filter

Filter for dropping "old" events. At a minimum, this filter requires two configuration options: the path to the timestamp field in the event payload and a TTL cutoff value. Additional parameters may be useful to configure the format and/or resolution of the event timestamp.

Transformation Filter

Simple JSON-to-JSON transformation capabilities, sufficient, for example, to construct a Gears message envelope.

documentation file names

https://github.com/xmidt-org/ears/blob/main/docs/EarsStyleGuides.md

Suggestion: Adopt a URL friendly file-name-here.md convention for consistency.

Kinesis Receiver

define best pattern(s) for structure initialization

Overview

Philosophies

This also relates to:

Private structures
Interfaces
Public structure variables

Default Object Should Be Useful

var wg sync.WaitGroup

wg.Add(1)
wg.Done()
wg.Wait()

(&http.Client{}).Get("http://example.com")

(&http.Client{Timeout: 5 * time.Second}).Get("http://example.com")

Functional Options

Background:

https://dave.cheney.net/2014/10/17/functional-options-for-friendly-apis

type MyStruct struct {}

// Functional options pattern uses the "New" function
func NewMyStruct(options ...func(*MyStruct)) (*MyStruct, error) {}

func main() {
   ms, _ := NewMyStruct(
       WithSomeOption(2),
       WithAnotherOption("all the things"),
   )

   ms.DoWork()
}

Thoughts

Fully public structures with no default values seem to work well for "configuration" objects. They are just data payload objects, and each default should simply be the zero value of the field's type.
Objects with behavior
- Using Functional Options seems to be a cleaner way of changing behavior.
  - There may be a number of state values that need to be set based on the new value
  - There may be an order of operations that needs to be kept in order to have state values accurate
  - The structure of the class does not need to be known. In the HTTP case, you need to know the hierarchy of values
  - It ensures that the object is properly initialized before any of its functions are used. Function gatekeeping then only needs to check, for example, an .isInitialized variable
  - Validation can be done at object creation. This would be able to catch things like setting a port value to a number that's too big. Any value that has constraints whose domain is smaller than the value type itself can be checked at initialization, rather than run time.
  - Functional Options can support "late setting" of values. For instance t := NewThing(); t.WithPort(1234);
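The points above about validation at creation time can be sketched as follows. Server, WithPort, WithName, and the default values are hypothetical names chosen for illustration; options here return an error, which is one common variation of the pattern rather than the only one.

```go
package main

import (
	"errors"
	"fmt"
)

// Server is a hypothetical object with behavior. Its fields are private, so
// callers can only configure it through options.
type Server struct {
	port int
	name string
}

// Option mutates a Server during construction and may reject bad input.
type Option func(*Server) error

// WithPort validates the port at creation time rather than at run time.
func WithPort(port int) Option {
	return func(s *Server) error {
		if port < 1 || port > 65535 {
			return fmt.Errorf("port %d out of range [1, 65535]", port)
		}
		s.port = port
		return nil
	}
}

// WithName sets a descriptive name and rejects the empty string.
func WithName(name string) Option {
	return func(s *Server) error {
		if name == "" {
			return errors.New("name must not be empty")
		}
		s.name = name
		return nil
	}
}

// NewServer applies defaults first, then each option in order, and stops at
// the first option that fails.
func NewServer(opts ...Option) (*Server, error) {
	s := &Server{port: 8080, name: "default"}
	for _, opt := range opts {
		if err := opt(s); err != nil {
			return nil, err
		}
	}
	return s, nil
}

func main() {
	s, err := NewServer(WithPort(1234), WithName("demo"))
	fmt.Println(s.port, s.name, err)

	_, err = NewServer(WithPort(70000))
	fmt.Println(err)
}
```

Because NewServer returns either a fully valid object or an error, methods on Server do not need to re-check these constraints later.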

References should be generic & point to example based URLs

Here are a few examples that should refer to something like example or example.com instead of the word comcast.

ears/internal/pkg/app/docs/doc.go

Line 22 in fbe38ce

// Host: ears.comcast.com

ears/internal/pkg/app/testdata/simpleRouteNoReceiver.json

Line 3 in fbe38ce

"orgId" : "comcast",

Ack: Errors should not have a string message in the struct

The error message should be a constant and not up to the error creator. This would change the code to look like the following:

type TimeoutError struct {}

func (e *TimeoutError) Error() string {
	return "TimeoutError"
}

ack.errFn(&TimeoutError{})

As for the actual string value, Go convention is not to capitalize error strings. If you decide to use Timeout reached as a more verbose message, write it as timeout reached; this avoids the capitalization inconsistencies that arise when error messages are composed.

Going beyond that to the Go 1.13+ error handling pattern, it would be better to make this an error that wraps the original error. The common practice is to have an Err error member value and to implement the Unwrap method. This would change Line 141 to look like:

ack.errFn(&TimeoutError{ Err: ctx.Err() } )

EARS does not use default configuration

EARS does not fall back to a default configuration when it cannot find a setting in the config file or on the command line. Instead, it errors out.

Build the binary & push to ECR

Match & Filter

Filters to allow filtering and matching of incoming events based on configurable event patterns. Event patterns should support wild cards and a simple boolean OR operator.

Gears event processing

Evaluate the arrange project for configs, logs, gorilla mux

We are using two of the three referenced projects (viper, gorilla/mux). Zap and zerolog are more or less the same. arrange also supports UberFX bootstrapping:

https://github.com/xmidt-org/arrange

Error formatting documentation

Agreed upon format:

MyError (code=42, k1=v1, k2=v2): wrapped Err.Error() goes here

Background Info:

type MyError struct {
	Err    error
	Code   int
	Values map[string]interface{}
}

If Code and Values didn't exist, I would expect something like:

MyError: wrapped Err.Error() goes here

If I wanted to add in Code, here are some possible ways it could be printed.

#1 --  MyError code:42: wrapped Err.Error() goes here
#2 --  MyError code=42: wrapped Err.Error() goes here
#3 --  MyError (code:42): wrapped Err.Error() goes here
#4 --  MyError (code=42): wrapped Err.Error() goes here
#5 --  MyError: code:42, err:wrapped Err.Error() goes here
#6 --  MyError: code=42, err=wrapped Err.Error() goes here

And then when expanding to the additional values:

#1 -- MyError code:42 k1:v1 k2:v2: wrapped Err.Error() goes here
#2 -- MyError code=42 k1=v1 k2=v2: wrapped Err.Error() goes here
#3 -- MyError (code:42 k1:v1 k2:v2): wrapped Err.Error() goes here
#4 -- MyError (code=42 k1=v1 k2=v2): wrapped Err.Error() goes here
#5 -- MyError: code:42, k1:v1, k2:v2, err:wrapped Err.Error() goes here
#6 -- MyError: code=42, k1=v1, k2=v2, err=wrapped Err.Error() goes here

Test Result Visualizations

stretch: trend over time

Gears integration / Knowledge Transfer

Flow Editor needs visual components in its component library for each of the sender and receiver plugins supported by EARS (KDS, Kafka, SQS, HTTP etc.). The components must offer parameters for all user-controlled configuration options (e.g. Kafka topic name, HTTP endpoint etc.). Some limited support for filter chain configuration will also be needed; for example, single-instance match filters and filter filters should be supported. Also, Gears must call an EARS API to create a route upon flow deployment and another EARS API to remove the route upon undeploy. The EARS Hackweek POC has already laid some groundwork for this process, so we may be able to reuse and improve this existing solution.

Kinesis Sender

AWS Kinesis sender plugin

Receiver Fanout

The Goals (are there more?)

1. Under normal operation, slower-processing routes should not affect the speed of fast-processing routes.
2. Under normal operation, a receiver should be able to ingest events at an optimal rate.
3. In cases where one or more routes are congested or backed up (abnormal), degraded performance is OK as long as:
- There are alerts to notify Support/DevOps
- EARS does not blow up (panic, OOM, etc.)
- The performance impact is limited to the particular receiver

Proposal 1

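Code snippet in question (non-working function, for demo only)

func (a *PerReceiver) next(ctx context.Context, event Event) error {
    wg := &sync.WaitGroup{}
    for _, r := range a.routes {
        wg.Add(1)
        // pass r as an argument so each goroutine gets its own route
        go func(r Route) {
            defer wg.Done()
            r.Process(ctx, event)
        }(r)
    }
    // block until every route has finished processing the event
    wg.Wait()
    return nil
}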

The idea with the above code is that all routes are processed in parallel and that they are not subject to the processing time of the other routes.

However, it does not meet goal number 2. Because the next function only returns when every route finishes processing (whatever that means), the latency of the next function is now bounded by the slowest processing route. Assuming that the receiver calls next sequentially for every event, the receiver's ingestion rate is also bounded by the latency of the next function.

Observation

The next function should return as fast as possible. Its latency should be such that it is as fast or faster than the event ingestion rate. For example, if a receiver's event ingestion rate is 1000 events/sec, then the latency of the next function should be no greater than 1ms.

Proposal 2

What if we remove the WaitGroup? The next function will now return very quickly.

func (a *PerReceiver) next(ctx context.Context, event Event) error {
    for _, r := range a.routes {
        // fire and forget: nothing bounds the number of goroutines
        go func(r Route) {
            r.Process(ctx, event)
        }(r)
    }
    return nil
}

However, it does not meet goal number 3. If a route becomes congested (r.Process takes a long time to return), goroutines pile up faster than they finish. At some point, we may run out of memory and EARS will crash.

Observation

Unbounded goroutine creation (goroutines started without a coordination construct such as a WaitGroup) will impact system stability.

Proposal 3

What if we cap the number of goroutines per receiver (or per route or per tenant)?

A per receiver implementation:

func (a *PerReceiver) next(ctx context.Context, event Event) error {
    for _, r := range a.routes {
        // block while the per-receiver goroutine cap is reached
        a.countCond.L.Lock() // a.countCond is a *sync.Cond
        for a.count >= MAX_COUNT {
            a.countCond.Wait()
        }
        a.count++
        go func(r Route) {
            r.Process(ctx, event)

            // decrement the goroutine count and wake one waiter
            a.countCond.L.Lock()
            a.count--
            a.countCond.Signal()
            a.countCond.L.Unlock()
        }(r)
        a.countCond.L.Unlock()
    }
    return nil
}

Observation

Should be able to meet all of our goals
Goroutines per route: This will, to some extent, prevent an abnormal route from affecting normal routes within the same receiver. At some point, if the back-pressure is too much, it may still affect all routes of the same receiver source. It will not affect routes with a different receiver source.
Goroutines shared between routes in the same receiver source: An abnormal route will saturate all goroutines and affect normal routes of the same receiver source. It will not affect routes with a different receiver source.
Goroutines shared between routes in the same tenant: An abnormal route will saturate all goroutines and affect all routes within the tenant.
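For comparison, the same per-receiver cap can be sketched with a buffered channel used as a counting semaphore instead of sync.Cond. This is a swapped-in technique, not the proposal's code. Event, Route, slowRoute, drain, and runDemo are stand-ins invented for this example.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// Event and Route are minimal stand-ins for the EARS types (assumptions).
type Event struct{ Payload string }

type Route interface {
	Process(ctx context.Context, event Event)
}

// PerReceiver caps in-flight goroutines with a buffered channel.
type PerReceiver struct {
	routes []Route
	sem    chan struct{} // capacity is the maximum number of goroutines
}

func (a *PerReceiver) next(ctx context.Context, event Event) error {
	for _, r := range a.routes {
		select {
		case a.sem <- struct{}{}: // acquire a slot; blocks while the cap is reached
		case <-ctx.Done():
			return ctx.Err()
		}
		go func(r Route) {
			defer func() { <-a.sem }() // release the slot
			r.Process(ctx, event)
		}(r)
	}
	return nil
}

// drain waits for all in-flight goroutines by acquiring every slot.
func (a *PerReceiver) drain() {
	for i := 0; i < cap(a.sem); i++ {
		a.sem <- struct{}{}
	}
}

// slowRoute simulates a congested route and records peak concurrency.
type slowRoute struct{ inFlight, peak *int32 }

func (s slowRoute) Process(ctx context.Context, _ Event) {
	n := atomic.AddInt32(s.inFlight, 1)
	for {
		p := atomic.LoadInt32(s.peak)
		if n <= p || atomic.CompareAndSwapInt32(s.peak, p, n) {
			break
		}
	}
	time.Sleep(2 * time.Millisecond)
	atomic.AddInt32(s.inFlight, -1)
}

// runDemo pushes events through a capped receiver and returns the peak
// number of goroutines that were processing at the same time.
func runDemo(maxInFlight, routes, events int) int32 {
	var inFlight, peak int32
	a := &PerReceiver{sem: make(chan struct{}, maxInFlight)}
	for i := 0; i < routes; i++ {
		a.routes = append(a.routes, slowRoute{&inFlight, &peak})
	}
	for i := 0; i < events; i++ {
		_ = a.next(context.Background(), Event{Payload: "e"})
	}
	a.drain()
	return atomic.LoadInt32(&peak)
}

func main() {
	fmt.Println("peak in-flight goroutines:", runDemo(2, 4, 3))
}
```

The select on ctx.Done() also lets a blocked next call give up when the caller cancels, which the sync.Cond version cannot do without extra plumbing.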

Proposal 4

What about channels and worker pools? (at receiver, route, or tenant level).

The implementation below has a dedicated buffer channel and worker pool at receiver level:

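type Work struct {
    ctx   context.Context // carried along so the worker can pass it to Process
    r     Route
    event Event
}

func (a *PerReceiver) next(ctx context.Context, event Event) error {
    for _, r := range a.routes {
        work := &Work{
            ctx:   ctx,
            r:     r,
            event: event,
        }
        a.ch <- work // a.ch is a buffered channel; this blocks only when the buffer is full
    }
    return nil
}

func (a *PerReceiver) worker() {
    for work := range a.ch {
        work.r.Process(work.ctx, work.event)
    }
}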

Observation

Should be able to meet all our goals
Trade-offs between proposals 3 and 4:
- pre-allocated goroutines vs. on-demand goroutines
- a buffered channel may absorb temporary downstream congestion without impacting the receiver ingestion rate

Return 401 instead of 500 for bad API requests

The EARS REST API does not always return 401 for bad requests; sometimes it returns 500. Here are some test cases that need to be fixed:

addroutebadname
addroutenoreceiver
addroutenosender
addroutenouser

Deploy resources and container to ECS with rollback capability

Context propagation and cancelling

We follow a standard of passing in a ctx context.Context into our functions so that:

The transaction can be cancelled
The context can be used for tracing/metrics

For long running contexts (such as receiver.Receive) as well as context that may need to be shared among several receivers, we need to figure out the best approach for managing those contexts when it comes to cancelling and metrics. After some of the dust settles, we should spend some time mapping the context object throughout the system to ensure we're on the same page for how it behaves and how we've implemented it across the system.

SQS Receiver

Management UI

Kafka Receiver

Kafka receiver plugin

Event Routing Pipeline

Event Pipeline Iteration 3

Event Pipeline Iteration 2

Event Pipeline Iteration 1

Struct/Data validation - which tools to use

Background

There are a number of cases where we need to apply some sort of data validation:

API requests
Messages passed into the system
Configurations loaded from yaml docs

Each may have unique constraints that should be addressed. For instance:

Case	Constraints
API	One definition to expose (e.g. via OpenAPI spec file) as well as validate Should be lightweight and fast Should be able to drive API error messages and/or codes
Incoming Messages	Should be lightweight and fast Should be able to produce specific messages and/or codes Should be easy for a customer to configure the validator
Plugin Configurations	Should be able to produce specific messages and/or codes Should be easy for a plugin developer to specify the constraints (stretch) Should be easy for a configuration interface for a plugin to generate a form that enforces validation

Research

A quick search produced the following results

Thoughts

Putting too many validation tags on a struct can make the struct messy (also, struct tags do not support newlines in the definition)
Using struct tags will make it hard to keep the OpenAPI spec in sync with what has been coded
It would be nice to have access to common types, like UUID, email, URL validation

Originally I thought we could lean into json schema (since it can apply to YAML docs as well), but I'm curious how json schema would go about addressing common types, such as UUID, email, URLs.

Message Acknowledgements

Some thoughts on acks. Like other services, there should be a way to guarantee at once delivery. But in the case of bad/slow HTTP targets, you don't want to back up the pipeline (or maybe you do?). An HTTP publisher may need to push to something reliable (kafka) and return an ack. Then another process consumes those individual requests and pushes to the intended target.

Unpack (split?) Filter

Filter component with the ability to split an incoming event into zero, one or more outgoing events. Typically the splitting is performed on an array embedded in the event payload. Thus a configuration parameter is required to define the location of the array in the payload. This is an MVP feature for the Gears integration.

Research panic handling for goroutines

We need to see how we (may) need to deal with panics in goroutines and how they affect the app.

Tracing

Intermittent unit test failure in multiRouteAABBAB test.

The unit tests pass most of the time, but occasionally, we will see the following error:
handlers_v1_test.go:296: multiRouteAABBAB test: check events sent error: unexpected number of events in sender 8 (10)

The error is intermittent, and there is currently no known way to reliably reproduce the issue. On GitHub, it often happens on the first go test run after significant code changes. Locally, it also happens after significant code changes have been synced down from GitHub.

Persistence Layer

Persistence layer for the routing table and other EARS configuration parameters. We are currently favoring a relational database solution for this purpose, possibly AWS Aurora because of its efficient querying capabilities. This persistence layer will be the central source of truth for all EARS service instances. The database needs to be configured to perform continuous replication into a backup database for resilience and recovery purposes. We expect the number of entries in the routing table not to exceed several thousand items.

EARS Test Harness (EARTH)

Http Receiver

simple file based secrets management

Routing Table Synchronization

Simple solution not based on sophisticated consensus algorithm. Instead (1) do a full sync from Aurora whenever any change is detected via a last modified timestamp or (2), slightly more efficiently, fan out delta information from the acting EARS node to all other EARS nodes via Redis pub/sub or Kafka topic or (3) a combination of (1) and (2).

Change filterer -> filter

I've had to ask the "do I add er" question too many times.

Proposal:

type NewPluginer interface { NewPlugin() (Pluginer, error) }
type NewReceiverer interface { NewReceiver() (Receiver, error) }
type NewSenderer interface { NewSender() (Sender, error) }
type NewFilterer interface { NewFilter() (Filter, error) }  // WAS: NewFilterer() (Filterer, error)

type Receiver interface { Receive() }
type Sender interface { Send() }
type Filter interface { Filter() }  // WAS Filterer interface

The naming conventions align better if we drop the extra er from all the Filter types

Http Sender

Http sender plugin

Plugins should have main function declared

For all test plugins and files with a package main, there needs to be a main() function declared. Otherwise go build ./... will fail for those files.

EARS REST API and OpenAPI doc

EARS core REST API. At a minimum the EARS API must support AddRoute(), RemoveRoute() and GetRoute(). GetRoute() must support various search criteria as well as returning the entire routing table.

RouteConfig struct {
  OrgId        string       `json:"orgId,omitempty"`        // org ID for quota and rate limiting
  AppId        string       `json:"appId,omitempty"`        // app ID for quota and rate limiting
  UserId       string       `json:"userId,omitempty"`       // user ID / author of route
  Name         string       `json:"name,omitempty"`         // optional unique name for route
  Source       *Plugin      `json:"source,omitempty"`       // pointer to source plugin instance
  Destination  *Plugin      `json:"destination,omitempty"`  // pointer to destination plugin instance
  FilterChain  *FilterChain `json:"filterChain,omitempty"`  // optional list of filter plugins that will be applied in order to perform arbitrary filtering and transformation functions
  DeliveryMode string       `json:"deliveryMode,omitempty"` // possible values: fire_and_forget, at_least_once, exactly_once
  Debug        bool         `json:"debug,omitempty"`        // if true generate debug logs and metrics for events taking this route
  Ts           int          `json:"ts,omitempty"`           // timestamp when route was created or updated
}

Plugin struct {
  Type          string                 `json:"type,omitempty"`       // plugin or filter type, e.g. kafka, kds, sqs, webhook, filter
  Version       string                 `json:"version,omitempty"`    // plugin version
  SOName        string                 `json:"soName,omitempty"`     // name of shared library file implementing this plugin
  Params        map[string]interface{} `json:"params,omitempty"`     // plugin specific configuration parameters
  Mode          string                 `json:"mode,omitempty"`       // plugin mode, one of input, output and filter
  State         string                 `json:"state,omitempty"`      // plugin operational state including running, stopped, error etc. (filter plugins are always in state running)
  Name          string                 `json:"name,omitempty"`       // descriptive plugin name
  Encodings     []string               `json:"encodings,omitempty"`  // list of supported encodings
}

Metrics

Decide best locking patterns/style guide

For example, when you need to lock and unlock explicitly in order to avoid deadlocks. Maybe some general rules:

Default:

thing.Lock()
defer thing.Unlock()

If you have to make the lock region as small as possible, pulling it into a visible block may allow programmers to easily verify that all locks have been unlocked in the block:

// some code goes here, then use curly brace to indicate a lock block

{
  thing.Lock()
  if thing.done {
    thing.Unlock()
    return
  }
  thing.Unlock()
}

// continue on

Finally, if you're accessing another object that needs to do locking (even if it's an internal object in the package you're working on), it's best to provide some sort of function that will do the locking for you. For instance:

if thing.Done() {
  return
}

Kafka Sender

SQS Sender

SQS sender plugin.

Multi-Tenancy, Rate Limiting, Quotas

Plugin Framework

Investigate, design and implement the EARS plugin framework. The framework will use Go's dynamic plugin concept and is inspired by watermill.io.

Plugins will integrate with the EARS routing table and event pipeline implementation. Plugins will use hashes over user configuration (instead of random IDs) in order to determine if two plugin (configurations) are equivalent or not. Plugins can then implement the concept of stream sharing which is essential for performance reasons (a million subscribers to the same Kafka queue should not result in a million Kafka plugin instances).
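One way to derive such a hash is sketched below: serialize the configuration with encoding/json, which emits map keys in sorted order, and hash the bytes with SHA-256. PluginConfig and its fields are simplified stand-ins, and the choice of SHA-256 over JSON is an assumption rather than the EARS implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// PluginConfig is a simplified stand-in for a plugin's user configuration.
type PluginConfig struct {
	Type   string                 `json:"type"`
	Params map[string]interface{} `json:"params"`
}

// Hash returns a hex digest that is equal for equivalent configurations.
func (c PluginConfig) Hash() (string, error) {
	// json.Marshal sorts map keys, so insertion order does not matter.
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a := PluginConfig{Type: "kafka", Params: map[string]interface{}{"topic": "t1", "brokers": "b:9092"}}
	b := PluginConfig{Type: "kafka", Params: map[string]interface{}{"brokers": "b:9092", "topic": "t1"}}
	c := PluginConfig{Type: "kafka", Params: map[string]interface{}{"brokers": "b:9092", "topic": "t2"}}

	ha, _ := a.Hash()
	hb, _ := b.Hash()
	hc, _ := c.Hash()

	fmt.Println(ha == hb) // same configuration, so the plugin instance can be shared
	fmt.Println(ha == hc) // different topic, so a separate instance is needed
}
```

A plugin manager could then key its instance map by this hash to implement stream sharing.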

Routing Table Manager

In memory EARS routing table manager used in each active EARS instance. The routing manager acts as the interface between the EARS service and the persistence layer. The routing manager will also trigger synchronization tasks among EARS instances.

// A RoutingTableManager supports CRUD operations on an EARS routing table
RoutingTableManager interface {
  RouteNavigator
  RouteModifier
  Hasher
  Validater
}

RouteModifier interface {
  AddRoute(ctx context.Context, entry *Route) error             // idempotent operation to add a routing entry to a local routing table
  RemoveRoute(ctx context.Context, entry *Route) error          // idempotent operation to remove a routing entry from a local routing table
  ReplaceAllRoutes(ctx context.Context, entries []*Route) error // replace complete local routing table
}

// A RouteNavigator allows searching for routes using various search criteria
RouteNavigator interface {
  GetAllRoutes(ctx context.Context) ([]*Route, error)                                  // obtain complete local routing table
  GetRouteCount(ctx context.Context) int                                               // get current size of routing table
  GetRoutesBySourcePlugin(ctx context.Context, plugin Pluginer) ([]*Route, error)      // get all routes for a specific source plugin
  GetRoutesByDestinationPlugin(ctx context.Context, plugin Pluginer) ([]*Route, error) // get all routes for a specific destination plugin
  GetRoutesForEvent(ctx context.Context, event *Event) ([]*Route, error)               // get all routes for a given event (and source plugin)
}

Immediately split (double memory usage, burn up CPU time for deep copy)
Split on write (V8 Engine style) -- That way, we don't need to do work if we're only reading the event
- If we split on write, maybe the only thing we need to do is return a shell object that points to the original event (**event) with some minimal data to know that we are not yet "dirty" with a write and thus need to do some work on a write

I think for now, this technique is hidden behind event.Event.Dup() ( https://github.com/xmidt-org/ears/blob/plugin-manager/pkg/event/types.go#L39 )
