The go-aws-msg from zerofox-oss

Telemetry!

Verifiability

Are there any easy wins to increase operational visibility and let clients easily determine whether go-aws-msg is providing its intended value? Could defining go-aws-msg's mission help guide telemetry choices? (IMO something like "efficiently consume, process, and finish a bounded number of SQS messages", but I really don't know :p) Assuming something like this is the case, which metrics would show whether go-aws-msg is fulfilling its purpose?

Thinking maybe:

latency - how long a message takes to be handled (histogram: median, 95th, 99th, and 100th percentiles)
traffic / throughput - how many messages per interval are being handled
errors - message handling results (good / bad)
saturation - when is work waiting on a resource that isn't available? (looks like only handler concurrency could become saturated?)
(these also just so happen to be Google SRE's "Four Golden Signals" :p )

How would these metrics be gathered, and where do they come from? Who is responsible for gathering and reporting them?

Why Telemetry?

Exposing telemetry should allow for easy visibility into the operation of go-aws-msg. Emitting telemetry instead of relying on centralized SQS metrics provides flexibility and allows easier verification of server processes. It also gives insight into server performance locally, and allows for metrics which may not be provided out of the box by AWS CloudWatch. Having actionable, application-emitted metrics will pave the way for monitoring and alerting, e.g. alert if n errors occur in an interval, if n percent of message Receivers result in errors, or if the server spends n seconds waiting.

Also, having a flexible interface for emitting these types of metrics is essential for performance testing and capacity planning. How many instances of a server are necessary to handle 1,000 messages / minute? 10,000 / minute? 100,000? Having telemetry would help with local tuning to determine what concurrency levels are necessary.

Emitting telemetry should help provide context around issues when they occur.

What?

Latency

receive - histogram of time spent receiving in the client code
message processing - histogram of total time spent processing message in server implementation from start to ack

Traffic / throughput

SQS receives / interval - how many SQS receive calls the server makes
messages received / interval (measured before or after the maxConcurrentReceives wait?)

Errors

Saturation

Blocked message receiver? How long is spent blocking?

Who?

Who should be responsible for making these calls? Does go-aws-msg have any obligation to provide server-level metrics so clients can introspect? Should the server provide hooks for telemetry and leave the rest up to clients? Should clients be 100% responsible for the first iteration? If clients are 100% responsible, they can't get at the time spent blocking on saturation or the SQS receive errors from inside their Receive method.

Where?

Where should these calls take place? Should the server have some sort of metadata structure and expose a couple of public methods to access it? Could there be some sort of interface and hooks to let clients configure which type of metric implementation they would like (e.g. logging, statsd, Prometheus, New Relic, etc.)?

type Telemetry interface {
	Timing(name string, value float64, rate float64, tags []string)
	Increment(name string, value float64, rate float64, tags []string)
}

Maybe offer a logging-based telemetry implementation by default, but allow clients to configure their own?

https://github.com/zerofox-oss/go-aws-msg/blob/master/sqs/server.go#L33

type Server struct {
  ...
  telemetry Telemetry
}

Resulting in calls like:

start := time.Now()
// Take a slot from the buffered channel; this blocks while every slot is in use.
s.maxConcurrentReceives <- struct{}{}
// Report how long the server waited for a free slot.
elapsed := time.Since(start)
s.telemetry.Timing("receive_saturation_wait", elapsed.Seconds(), ..., ...)

Rate-limiting

For some microservices it might be beneficial to rate-limit the number of active goroutines based on time. For example, if I have a Server which interacts with a third-party API that is limited to 10 calls/sec, I would only want to serve 10 messages/second (assuming each message uses 1 API call).

With the current primitives, this rate limiting would have to be done at the Receiver level, which would introduce blocking by means of mutexes or channels - that's probably not the most efficient. We should consider adding this capability to the Server so we can limit the number of active Receivers based on time.

Allow STS Federation Tokens

We'd like to be able to use STS tokens to grant us access to SNS/SQS resources. These are short-lived credentials that last for at most 36h (if using Federation tokens). In order to do that we need to allow AWS_SESSION_TOKEN to be provided.
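As a rough illustration of the shape of the change, the sketch below reads the three standard AWS credential environment variables and treats a non-empty AWS_SESSION_TOKEN as a sign of temporary credentials. The envCredentials type and loadEnvCredentials helper are hypothetical; they are not part of go-aws-msg or the AWS SDK.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// envCredentials is a hypothetical holder for credentials read from the environment.
type envCredentials struct {
	AccessKeyID     string
	SecretAccessKey string
	SessionToken    string // empty for long-lived credentials
}

// Temporary reports whether a session token was supplied.
func (c envCredentials) Temporary() bool {
	return c.SessionToken != ""
}

// loadEnvCredentials reads credentials through getenv (os.Getenv in production).
// The session token is optional; the key ID and secret are required.
func loadEnvCredentials(getenv func(string) string) (envCredentials, error) {
	c := envCredentials{
		AccessKeyID:     getenv("AWS_ACCESS_KEY_ID"),
		SecretAccessKey: getenv("AWS_SECRET_ACCESS_KEY"),
		SessionToken:    getenv("AWS_SESSION_TOKEN"),
	}
	if c.AccessKeyID == "" || c.SecretAccessKey == "" {
		return envCredentials{}, errors.New("AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set")
	}
	return c, nil
}

func main() {
	creds, err := loadEnvCredentials(os.Getenv)
	if err != nil {
		fmt.Println("credentials error:", err)
		return
	}
	fmt.Println("temporary credentials:", creds.Temporary())
}
```

With aws-sdk-go v1, the three values could then be handed to credentials.NewStaticCredentials(id, secret, token), which accepts the session token as its third argument.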

Better tests

After spending some time away from the initial implementation of the mocks, I think that it's time for them to be revisited. I don't like that I'm basically defining my own behavior for SQS.

Ideally we could run these tests against an SQS docker image...
