xmidt-org / ears

Event Async Receiver Service (EARS)

License: Apache License 2.0

Go 99.07% Shell 0.62% Dockerfile 0.07% Makefile 0.13% HTML 0.10%

ears's Introduction

Event Async Routing Service (EARS)

Summary

A simple scalable routing service to usher events from an input plugin (for example, Kafka) to an output plugin (for example, AWS SQS). As an event passes through EARS, it may be filtered or transformed depending on the configuration details of a route and the event payload. Routes can be dynamically added and removed using a simple REST API and modifications to the routing table are quickly synchronized across an EARS cluster.

EARS is designed to eventually replace EEL, offering new features such as quotas and rate limiting as well as highly dynamic routes while still supporting filtering and transformation capabilities similar to EEL.

EARS comes with a set of standard plugins to support some of the most common message protocols including Webhook, Kafka, SQS, Kinesis etc. but also makes the development of third party plugins easy.

Our Kanban Board can be found here.

User Guide

References

Contributing

Refer to CONTRIBUTING.md.

ears's People

Contributors

becca-boo-boo, boriwo, denopink, dependabot[bot], gtrevg, holidaymike, kcajmagic, kristinapathak, one111eric, plaxomike, schmidtw, toyangxia, tsasser05, wingdog, zeushammer

ears's Issues

Timestamp Filter

Filter for dropping "old" events. This filter requires at a minimum two configuration options: the path to the timestamp field in the event payload and a TTL cutoff value. Additional parameters may be useful to configure the format and/or resolution of the event timestamp.

Transformation Filter

Simple JSON-to-JSON transformation capabilities, sufficient, for example, to construct a Gears message envelope.

define best pattern(s) for structure initialization

Overview

Philosophies

This also relates to:

  • Private structures
  • Interfaces
  • Public structure variables

Default Object Should Be Useful

var wg sync.WaitGroup

wg.Add(1)
wg.Done()
wg.Wait()
(&http.Client{}).Get("http://example.com")

(&http.Client{Timeout: 5 * time.Second}).Get("http://example.com")

Functional Options

Background:

type MyStruct struct {}

// Functional options pattern uses the "New" function
func NewMyStruct(options ...func(*MyStruct)) (*MyStruct, error) {}

func main() {
   ms, _ := NewMyStruct(
       WithSomeOption(2),
       WithAnotherOption("all the things"),
   )

   ms.DoWork()
}
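
A fuller version of the snippet above might look like the following. It is a sketch, not the project's implementation: the private fields, the default values, and the choice to let each option return an error (so that validation happens at construction time) are assumptions made for illustration. The snippet above uses plain func(*MyStruct) options instead.

```go
package main

import (
	"errors"
	"fmt"
)

// MyStruct keeps its state private; callers change it only through options.
type MyStruct struct {
	count int
	label string
}

// Option mutates a MyStruct during construction and may reject bad values.
// Assumption: options return an error, unlike the bare signature above.
type Option func(*MyStruct) error

// WithSomeOption sets the count and validates it at construction time.
func WithSomeOption(n int) Option {
	return func(m *MyStruct) error {
		if n < 0 {
			return errors.New("count must not be negative")
		}
		m.count = n
		return nil
	}
}

// WithAnotherOption sets the label.
func WithAnotherOption(s string) Option {
	return func(m *MyStruct) error {
		m.label = s
		return nil
	}
}

// NewMyStruct starts from smart defaults and applies the options in order.
func NewMyStruct(options ...Option) (*MyStruct, error) {
	ms := &MyStruct{count: 1, label: "default"}
	for _, opt := range options {
		if err := opt(ms); err != nil {
			return nil, err
		}
	}
	return ms, nil
}

// DoWork uses the fully initialized state.
func (m *MyStruct) DoWork() string {
	return fmt.Sprintf("%s x%d", m.label, m.count)
}

func main() {
	ms, err := NewMyStruct(
		WithSomeOption(2),
		WithAnotherOption("all the things"),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(ms.DoWork()) // all the things x2
}
```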

Categories

There are possibly just two categories of structures:

  • Configuration Objects

    • Have no methods on them (other than serialize/deserialize helpers)
    • Have only public variables so that users can set them
    • Default values are the type's default value -- users of the configuration must interpret these values and decide what to do
  • Behavior Objects

    • Have methods on the struct that do work
    • Most likely will have private state that cannot and should not be accessed
    • Need to provide smart default values
    • Need to provide a way to override default values

Thoughts

  • Fully public structures with no default values seem to work well for "configuration" objects. They are just data payload objects and default values should be the variable type's default value
  • Objects with behavior
    • Using Functional Options seems to be a cleaner way of changing behavior.
      • There may be a number of state values that need to be set based on the new value
      • There may be an order of operations that needs to be kept in order to keep state values accurate
      • The structure of the class does not need to be known. In the HTTP case, you need to know the hierarchy of values
      • It ensures that the object is properly initialized before any functions are used. Function gatekeeping only needs to check, for example, an .isInitialized variable
      • Validation can be done at object creation. This would catch things like setting a port value to a number that's too big. Any value whose valid domain is smaller than that of its type can be checked at initialization rather than at run time.
      • Functional Options can support "late setting" of values. For instance t := NewThing(); t.WithPort(1234);

Ack: Errors should not have a string message in the struct

The error message should be a constant and not up to the error creator. This would change the code to look like the following:

type TimeoutError struct {}

func (e *TimeoutError) Error() string {
	return "TimeoutError"
}
ack.errFn(&TimeoutError{})

In regard to the actual string value, the Go community recommends not capitalizing error strings. If you decide to use "Timeout reached" as a more verbose message, the recommendation is to write it as "timeout reached", which avoids the capitalization inconsistencies that arise when composing error messages.

Going beyond that to the Go 1.13+ error handling pattern, it would be better to make this an error that wraps the original error. The common practice is to have an Err error member value and to implement the Unwrap method. This would change Line 141 to look like:

ack.errFn(&TimeoutError{ Err: ctx.Err() } )

Match & Filter

Filters to allow filtering and matching of incoming events based on configurable event patterns. Event patterns should support wild cards and a simple boolean OR operator.

Error formatting documentation

Agreed upon format:

MyError (code=42, k1=v1, k2=v2): wrapped Err.Error() goes here

Background Info:

type MyError struct {
	Err    error
	Code   int
	Values map[string]interface{}
}

If Code and Values didn't exist, I would expect something like:

MyError: wrapped Err.Error() goes here

If I wanted to add in Code, here are some possible ways it could be printed.

#1 --  MyError code:42: wrapped Err.Error() goes here
#2 --  MyError code=42: wrapped Err.Error() goes here
#3 --  MyError (code:42): wrapped Err.Error() goes here
#4 --  MyError (code=42): wrapped Err.Error() goes here
#5 --  MyError: code:42, err:wrapped Err.Error() goes here
#6 --  MyError: code=42, err=wrapped Err.Error() goes here

And then when expanding to the additional values:

#1 -- MyError code:42 k1:v1 k2:v2: wrapped Err.Error() goes here
#2 -- MyError code=42 k1=v1 k2=v2: wrapped Err.Error() goes here
#3 -- MyError (code:42 k1:v1 k2:v2): wrapped Err.Error() goes here
#4 -- MyError (code=42 k1=v1 k2=v2): wrapped Err.Error() goes here
#5 -- MyError: code:42, k1:v1, k2:v2, err:wrapped Err.Error() goes here
#6 -- MyError: code=42, k1=v1, k2=v2, err=wrapped Err.Error() goes here

Gears integration / Knowledge Transfer

Flow Editor needs visual components in its component library for each of the sender and receiver plugins supported by EARS (KDS, Kafka, SQS, HTTP etc.). The components must offer parameters for all user-controlled configuration options (e.g. Kafka topic name, HTTP endpoint etc.). Some limited support for filter chain configuration will also be needed; for example, single instance match filters and filter filters should be supported. Also, Gears must call an EARS API to create a route upon flow deployment and another EARS API to remove the route upon undeploy. The EARS Hackweek POC has already laid some groundwork for this process, so we may be able to reuse and improve this existing solution.

Receiver Fanout

The Goals (are there more?)

  1. Under normal operation, slower processing routes should not affect the speed of fast processing routes
  2. Under normal operation, a receiver should be able to ingest events at an optimal rate
  3. In cases where one or more routes are congested or backed up (abnormal), it is OK to have degraded performance as long as:
    • There are alerts to notify Support/DevOps
    • EARS does not blow up (panic, OOM, etc.)
    • The performance impact is limited to the particular receiver

Proposal 1

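Code snippet in question (a non-working sketch, for demonstration only):

func (a *PerReceiver) next(ctx context.Context, event Event) error {
    wg := &sync.WaitGroup{}
    for _, r := range a.routes {
        wg.Add(1)
        // pass r as an argument so each goroutine works on its own route
        go func(r Route) {
            defer wg.Done()
            r.Process(ctx, event)
        }(r)
    }
    wg.Wait() // blocks until every route has finished processing
    return nil
}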

The idea with the above code is that all routes are processed in parallel and that they are not subject to the processing time of the other routes.

However, it does not meet goal number 2. Because the next function only returns when every route finishes processing (whatever that means), the latency of the next function is now bounded by the slowest processing route. Assuming that the receiver calls next sequentially for every event, the receiver's ingestion rate is also bounded by the latency of the next function.

Observation

The next function should return as fast as possible. Its latency should be such that it is as fast or faster than the event ingestion rate. For example, if a receiver's event ingestion rate is 1000 events/sec, then the latency of the next function should be no greater than 1ms.

Proposal 2

What if we remove the WaitGroup? The next function will now return very quickly.

func (a *PerReceiver) next(ctx context.Context, event Event) error {
    for _, r := range a.routes {
        // fire and forget: nothing waits for or limits these goroutines
        go func(r Route) {
            r.Process(ctx, event)
        }(r)
    }
    return nil
}

However, it does not meet goal number 3. If a route becomes congested (r.Process takes a long time to return), goroutines will accumulate faster than they finish. At some point, we may run out of memory and EARS will crash.

Observation

Unbounded goroutine creation (goroutines launched without a coordination construct such as a WaitGroup) will impact system stability.

Proposal 3

What if we cap the number of goroutines per receiver (or per route or per tenant)?

A per receiver implementation:

func (a *PerReceiver) next(ctx context.Context, event Event) error {
    for _, r := range a.routes {
        // block while the maximum number of goroutines per receiver is reached
        a.countCond.L.Lock() // a.countCond is a *sync.Cond
        for a.count >= MAX_COUNT {
            a.countCond.Wait()
        }
        a.count++
        a.countCond.L.Unlock()

        go func(r Route) {
            r.Process(ctx, event)

            // decrement the goroutine count and wake one waiter
            a.countCond.L.Lock()
            a.count--
            a.countCond.Signal()
            a.countCond.L.Unlock()
        }(r)
    }
    return nil
}

Observation

  • Should be able to meet all of our goals
  • Goroutines per route: This will, to some extent, prevent an abnormal route from affecting normal routes within the same receiver. At some point, if the back-pressure is too high, it may still affect all routes of the same receiver source. It will not affect routes with a different receiver source.
  • Goroutines shared between routes in the same receiver source: An abnormal route will saturate all goroutines and affect normal routes of the same receiver source. It will not affect routes with a different receiver source.
  • Goroutines shared between routes in the same tenant: An abnormal route will saturate all goroutines and affect all routes within the tenant

Proposal 4

What about channels and worker pools? (at receiver, route, or tenant level).

The implementation below has a dedicated buffer channel and worker pool at receiver level:

type Work struct {
    ctx   context.Context
    r     Route
    event Event
}

func (a *PerReceiver) next(ctx context.Context, event Event) error {
    for _, r := range a.routes {
        work := &Work{
            ctx:   ctx,
            r:     r,
            event: event,
        }
        a.ch <- work // a.ch is a buffered channel; this blocks only when the buffer is full
    }
    return nil
}

func (a *PerReceiver) worker() {
    for work := range a.ch {
        work.r.Process(work.ctx, work.event)
    }
}

Observation

  • Should be able to meet all our goals
  • Trade-offs between proposals 3 and 4 are:
    • pre-allocated goroutines vs. on-demand goroutines
    • a buffered channel may absorb temporary downstream congestion without impacting the receiver ingestion rate

Return 401 instead of 500 for bad API requests

The EARS REST API does not always return 401 for bad requests; sometimes it returns 500. Here are some test cases that need to be fixed:

  • addroutebadname
  • addroutenoreceiver
  • addroutenosender
  • addroutenouser

Context propagation and cancelling

We follow a standard of passing in a ctx context.Context into our functions so that:

  • The transaction can be cancelled
  • The context can be used for tracing/metrics

For long-running contexts (such as receiver.Receive) as well as contexts that may need to be shared among several receivers, we need to figure out the best approach for managing those contexts when it comes to cancelling and metrics. After some of the dust settles, we should spend some time mapping the context object throughout the system to ensure we're on the same page about how it behaves and how we've implemented it across the system.

Struct/Data validation - which tools to use

Background

There are a number of cases where we need to apply some sort of data validation:

  • API requests
  • Messages passed into the system
  • Configurations loaded from yaml docs

Each may have unique constraints that should be addressed. For instance:

Case / Constraints

API
  • One definition to expose (e.g. via OpenAPI spec file) as well as validate
  • Should be lightweight and fast
  • Should be able to drive API error messages and/or codes

Incoming Messages
  • Should be lightweight and fast
  • Should be able to produce specific messages and/or codes
  • Should be easy for a customer to configure the validator

Plugin Configurations
  • Should be able to produce specific messages and/or codes
  • Should be easy for a plugin developer to specify the constraints
  • (stretch) Should be easy for a configuration interface for a plugin to generate a form that enforces validation

Research

A quick search produced the following results

Thoughts

  • Putting too many validation tags on a struct can make the struct messy (also, struct tags do not support newlines in the definition)
  • Using struct tags will make it hard to keep the OpenAPI spec in sync with what has been coded
  • It would be nice to have access to common types, like UUID, email, URL validation

Originally I thought we could lean into JSON Schema (since it can apply to YAML docs as well), but I'm curious how JSON Schema would go about addressing common types, such as UUIDs, emails, and URLs.

Message Acknowledgements

Some thoughts on acks. Like other services, there should be a way to guarantee at once delivery. But in the case of bad/slow HTTP targets, you don't want to back up the pipeline (or maybe you do?). An HTTP publisher may need to push to something reliable (kafka) and return an ack. Then another process consumes those individual requests and pushes to the intended target.

[Figure: pipeline diagram (drawio)]

Unpack (split?) Filter

Filter component with the ability to split an incoming event into zero, one or more outgoing events. Typically the splitting is performed on an array embedded in the event payload. Thus a configuration parameter is required to define the location of the array in the payload. This is an MVP feature for the Gears integration.
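
A minimal sketch of the splitting step, assuming a JSON payload and a top-level key that names the array. The function split and the key "batch" are hypothetical; a real filter would likely accept a nested path.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// split returns one payload per element of the array stored under key.
// An empty array yields zero outgoing events.
func split(payload []byte, key string) ([]json.RawMessage, error) {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(payload, &obj); err != nil {
		return nil, err
	}
	raw, ok := obj[key]
	if !ok {
		return nil, fmt.Errorf("no array found at %q", key)
	}
	var items []json.RawMessage
	if err := json.Unmarshal(raw, &items); err != nil {
		return nil, err
	}
	return items, nil
}

func main() {
	events, err := split([]byte(`{"batch":[{"id":1},{"id":2}]}`), "batch")
	if err != nil {
		panic(err)
	}
	for _, e := range events {
		fmt.Println(string(e))
	}
}
```

Using json.RawMessage avoids decoding and re-encoding each element, so the outgoing payloads keep their original bytes.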

Intermittent unit test failure in multiRouteAABBAB test.

The unit tests pass most of the time, but occasionally, we will see the following error:
handlers_v1_test.go:296: multiRouteAABBAB test: check events sent error: unexpected number of events in sender 8 (10)

The error is intermittent, and there is currently no known way to reliably reproduce the issue. On GitHub, it often happens on the first go test run after significant code changes. Locally, it also happens when significant code changes are synced down from GitHub.

Persistence Layer

Persistence layer for the routing table and other EARS configuration parameters. We are currently favoring a relational database solution for this purpose, possibly AWS Aurora because of its efficient querying capabilities. This persistence layer will be the central source of truth for all EARS service instances. The database needs to be configured to perform continuous replication into a backup database for resilience and recovery purposes. We expect the number of entries in the routing table not to exceed several thousand items.

Routing Table Synchronization

Simple solution not based on sophisticated consensus algorithm. Instead (1) do a full sync from Aurora whenever any change is detected via a last modified timestamp or (2), slightly more efficiently, fan out delta information from the acting EARS node to all other EARS nodes via Redis pub/sub or Kafka topic or (3) a combination of (1) and (2).

[Figure: routing table sync]

Change filterer -> filter

I've had to ask the "do I add er" question too many times.

Proposal:

type NewPluginer interface { NewPlugin() (Pluginer, error) }
type NewReceiverer interface { NewReceiver() (Receiver, error) }
type NewSenderer interface { NewSender() (Sender, error) }
type NewFilterer interface { NewFilter() (Filter, error) }  // WAS: NewFilterer() (Filterer, error)
type Receiver interface { Receive() }
type Sender interface { Send() }
type Filter interface { Filter() }  // WAS Filterer interface

The naming conventions align better if we drop the extra er from all the Filter types

EARS REST API and OpenAPI doc

EARS core REST API. At a minimum the EARS API must support AddRoute(), RemoveRoute() and GetRoute(). GetRoute() must support various search criteria as well as returning the entire routing table.

type RouteConfig struct {
  OrgId        string       `json:"orgId,omitempty"`        // org ID for quota and rate limiting
  AppId        string       `json:"appId,omitempty"`        // app ID for quota and rate limiting
  UserId       string       `json:"userId,omitempty"`       // user ID / author of route
  Name         string       `json:"name,omitempty"`         // optional unique name for route
  Source       *Plugin      `json:"source,omitempty"`       // pointer to source plugin instance
  Destination  *Plugin      `json:"destination,omitempty"`  // pointer to destination plugin instance
  FilterChain  *FilterChain `json:"filterChain,omitempty"`  // optional list of filter plugins that will be applied in order to perform arbitrary filtering and transformation functions
  DeliveryMode string       `json:"deliveryMode,omitempty"` // possible values: fire_and_forget, at_least_once, exactly_once
  Debug        bool         `json:"debug,omitempty"`        // if true generate debug logs and metrics for events taking this route
  Ts           int          `json:"ts,omitempty"`           // timestamp when route was created or updated
}
type Plugin struct {
  Type          string                 `json:"type,omitempty"`       // plugin or filter type, e.g. kafka, kds, sqs, webhook, filter
  Version       string                 `json:"version,omitempty"`    // plugin version
  SOName        string                 `json:"soName,omitempty"`     // name of shared library file implementing this plugin
  Params        map[string]interface{} `json:"params,omitempty"`     // plugin specific configuration parameters
  Mode          string                 `json:"mode,omitempty"`       // plugin mode, one of input, output and filter
  State         string                 `json:"state,omitempty"`      // plugin operational state including running, stopped, error etc. (filter plugins are always in state running)
  Name          string                 `json:"name,omitempty"`       // descriptive plugin name
  Encodings     []string               `json:"encodings,omitempty"`  // list of supported encodings
}

Decide best locking patterns/style guide

For example, when you need to lock and unlock explicitly in order to avoid deadlocks. Maybe some general rules:

Default:

thing.Lock()
defer thing.Unlock()

If you have to make the lock region as small as possible, pulling it into a visible block may allow programmers to easily verify that all locks have been unlocked in the block:

// some code goes here, then use curly brace to indicate a lock block

{
  thing.Lock()
  if thing.done {
    thing.Unlock()
    return
  }
  thing.Unlock()
}

// continue on

Finally, if you're accessing another object that needs to do locking (even if it's an internal object in the package you're working on), it's best to provide some sort of function that will do the locking for you. For instance:

if thing.Done() {
  return
}

Plugin Framework

Investigate, design and implement the EARS plugin framework. The EARS plugin framework will use the Go dynamic plugin concept and is inspired by watermill.io.

Plugins will integrate with the EARS routing table and event pipeline implementation. Plugins will use hashes over user configuration (instead of random IDs) in order to determine if two plugin (configurations) are equivalent or not. Plugins can then implement the concept of stream sharing which is essential for performance reasons (a million subscribers to the same Kafka queue should not result in a million Kafka plugin instances).
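
One possible way to derive such a hash is sketched below. The function configHash and the choice of SHA-256 over a JSON encoding are assumptions for illustration, not the EARS implementation. The sketch relies on encoding/json writing map keys in sorted order, so two configurations with the same content produce the same bytes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// configHash returns a hex digest over the plugin type and its parameters.
// Hypothetical helper: equal configurations yield equal hashes.
func configHash(pluginType string, params map[string]interface{}) (string, error) {
	b, err := json.Marshal(struct {
		Type   string                 `json:"type"`
		Params map[string]interface{} `json:"params"`
	}{Type: pluginType, Params: params})
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := configHash("kafka", map[string]interface{}{"topic": "events", "brokers": "b1:9092"})
	b, _ := configHash("kafka", map[string]interface{}{"brokers": "b1:9092", "topic": "events"})
	c, _ := configHash("kafka", map[string]interface{}{"topic": "other", "brokers": "b1:9092"})

	// Routes a and b could share one Kafka plugin instance; c needs its own.
	fmt.Println(a == b, a == c) // true false
}
```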

Further reading:

Routing Table Manager

In-memory EARS routing table manager used in each active EARS instance. The routing manager acts as the interface between the EARS service and the persistence layer. The routing manager will also trigger synchronization tasks among EARS instances.

// A RoutingTableManager supports CRUD operations on an EARS routing table
type RoutingTableManager interface {
  RouteNavigator
  RouteModifier
  Hasher
  Validater
}
type RouteModifier interface {
  AddRoute(ctx context.Context, entry *Route) error             // idempotent operation to add a routing entry to a local routing table
  RemoveRoute(ctx context.Context, entry *Route) error          // idempotent operation to remove a routing entry from a local routing table
  ReplaceAllRoutes(ctx context.Context, entries []*Route) error // replace complete local routing table
}
// A RouteNavigator allows searching for routes using various search criteria
type RouteNavigator interface {
  GetAllRoutes(ctx context.Context) ([]*Route, error)                                  // obtain complete local routing table
  GetRouteCount(ctx context.Context) int                                               // get current size of routing table
  GetRoutesBySourcePlugin(ctx context.Context, plugin Pluginer) ([]*Route, error)      // get all routes for a specific source plugin
  GetRoutesByDestinationPlugin(ctx context.Context, plugin Pluginer) ([]*Route, error) // get all routes for a specific destination plugin
  GetRoutesForEvent(ctx context.Context, event *Event) ([]*Route, error)               // get all routes for a given event (and source plugin)
}

event cloning optimization

There are times where we want to "split" an event. A couple options:

  • Immediately split (double memory usage, burn up CPU time for deep copy)
  • Split on write (V8 Engine style) -- That way, we don't need to do work if we're only reading the event
    • If we split on write, maybe the only thing we need to do is return a shell object that points to the original event (**event) with some minimal data to know that we are not yet "dirty" with a write and thus need to do some work on a write

I think for now, this technique is hidden behind event.Event.Dup() ( https://github.com/xmidt-org/ears/blob/plugin-manager/pkg/event/types.go#L39 )
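
The split-on-write option could look roughly like the sketch below. The Event type here is a stand-in with a map payload and a shared flag, not the real event.Event, and the sketch is not safe for concurrent use. Readers that obtain nested values through Get must also treat them as read-only.

```go
package main

import "fmt"

// Event is a hypothetical copy-on-write wrapper around a JSON-like payload.
type Event struct {
	payload map[string]interface{}
	shared  bool // true while payload may be shared with another Event
}

func NewEvent(payload map[string]interface{}) *Event {
	return &Event{payload: payload}
}

// deepCopy clones nested maps and slices; other values are copied as-is.
func deepCopy(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		m := make(map[string]interface{}, len(t))
		for k, val := range t {
			m[k] = deepCopy(val)
		}
		return m
	case []interface{}:
		s := make([]interface{}, len(t))
		for i, val := range t {
			s[i] = deepCopy(val)
		}
		return s
	default:
		return v
	}
}

// Dup is cheap: both events point at the same payload until one of them writes.
func (e *Event) Dup() *Event {
	e.shared = true
	return &Event{payload: e.payload, shared: true}
}

// Get reads without copying.
func (e *Event) Get(key string) interface{} { return e.payload[key] }

// Set copies the payload first if it may still be shared.
func (e *Event) Set(key string, value interface{}) {
	if e.shared {
		e.payload = deepCopy(e.payload).(map[string]interface{})
		e.shared = false
	}
	e.payload[key] = value
}

func main() {
	orig := NewEvent(map[string]interface{}{"a": 1})
	dup := orig.Dup()
	fmt.Println(dup.Get("a")) // 1, no copy made yet

	dup.Set("a", 2) // the deep copy happens here
	fmt.Println(orig.Get("a"), dup.Get("a")) // 1 2
}
```

In this simple version the event that writes second may make one copy that was not strictly needed; avoiding that would require reference counting.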
