Hassle-free queries on Amazon CloudWatch Logs Insights in Go

License: MIT License

cloudwatch cloudwatch-logs cloudwatchlogs insights aws golang go

incite's Introduction

A native Go library to streamline and supercharge your interactions with the AWS CloudWatch Logs Insights service using minimalist Go paradigms.


Incite makes it easier to write code to query your logs using AWS CloudWatch Logs Insights, and makes it possible to use Insights to query massive, arbitrary amounts of log data reliably.

Features

  • Streaming. AWS CloudWatch Logs Insights makes you poll your queries, requiring boilerplate code that is hard to write efficiently. Incite does the polling for you and lets you simply read your query results from a stream.
  • Auto-Chunking. Every CloudWatch Logs Insights query is limited to 10,000 results. If your query exceeds 10K results, AWS advises you to break it into smaller time ranges. Incite does this chunking automatically and merges the results of all chunks into one convenient stream. Use the Chunk field in QuerySpec to enable chunking.
  • Dynamic Splitting. Since v1.2.0, Incite can dynamically detect when a query chunk exceeds the 10K result limit, split that chunk into sub-chunks, and re-query the sub-chunks, all automatically and without intervention. Use the SplitUntil field in QuerySpec to enable dynamic splitting (a combined Chunk/SplitUntil sketch appears after this list).
  • Multiplexing. Incite efficiently runs multiple queries at the same time and is smart enough to do this without getting throttled or going over your CloudWatch Logs service quota limits.
  • Unmarshalling. The CloudWatch Logs Insights API can only give you unstructured key/value string pairs, requiring you to write boilerplate code to put your results into a useful structure for analysis. Incite lets you unmarshal your results into maps or structs using a single function call. Incite supports tag-based field mapping just like encoding/json. (And it supports json:"..." tags as well as its native incite:"..." tags, right out of the box!)
  • Go Native. Incite gives you a more Go-friendly coding experience than the AWS SDK for Go, including getting rid of unnecessary pointers and using standard types like time.Time.
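
The Chunk and SplitUntil features mentioned above can be combined in a single QuerySpec. Here is a minimal sketch; the field names are taken from QuerySpec examples elsewhere in this document, while the query text, log group, time range, and durations are illustrative values only:

spec := incite.QuerySpec{
	Text:       "fields @timestamp, @message | filter @message =~ /foo/",
	Start:      start,            // e.g. 24 hours ago
	End:        end,              // e.g. time.Now()
	Groups:     []string{"/my/log/group"},
	Limit:      incite.MaxLimit,  // maximum results per chunk (splitting only matters at the max)
	Chunk:      15 * time.Minute, // run the query as separate 15-minute chunks
	SplitUntil: time.Minute,      // assumption: keep splitting maxed-out chunks down to 1-minute sub-chunks
}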

Getting Started

Get the code

$ go get github.com/gogama/incite

Concepts

  • For quick prototyping and scripting type work, simplify your life by using the global Query function.
  • When you need finer control over what your app is doing, create a new QueryManager using NewQueryManager and query it using its Query method.
  • To read all the results from a stream, use the global ReadAll function.
  • To unmarshal the results into a structure of your choice, use the global Unmarshal function.

A simple app

package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
	"github.com/gogama/incite"
)

func main() {
	// Use the AWS SDK for Go to get the CloudWatch API actions Incite needs.
	// For simplicity, we assume that the correct AWS region and credentials are
	// already set in the environment.
	a := cloudwatchlogs.New(session.Must(session.NewSession()))

	// Create a QueryManager. An alternative to using a QueryManager is just
	// using the global scope Query function.
	m := incite.NewQueryManager(incite.Config{Actions: a})
	defer func() {
		_ = m.Close()
	}()

	// Look at the last 15 minutes.
	end := time.Now().Truncate(time.Millisecond)
	start := end.Add(-15*time.Minute)

	// Query the results.
	s, err := m.Query(incite.QuerySpec{
		Text:   "fields @timestamp, @message | filter @message =~ /foo/ | sort @timestamp desc",
		Start:  start,
		End:    end,
		Groups: []string{"/my/log/group"},
		Limit:  100,
	})
	if err != nil {
		return
	}
	data, err := incite.ReadAll(s)
	if err != nil {
		return
	}

	// Unpack the results into a structured format.
	var v []struct {
		Timestamp time.Time `incite:"@timestamp"`
		Message   string    `incite:"@message"`
	}
	err = incite.Unmarshal(data, &v)
	if err != nil {
		return
	}

	// Print the results!
	fmt.Println(v)
}

Compatibility

Works with all Go versions 1.14 and up, and AWS SDK for Go V1 versions 1.21.6 and up.

Related

Official AWS documentation: Analyzing log data with CloudWatch Logs Insights. Find Insights' query syntax documentation here and the API reference here (look for StartQuery, GetQueryResults, and StopQuery).

License

This project is licensed under the terms of the MIT License.

Acknowledgements

Developer happiness on this project was embiggened by JetBrains, which generously donated an open source license for their lovely GoLand IDE. Thanks JetBrains!

incite's People

Contributors

alvin-rw, gogama, vcschapp


incite's Issues

Align chunk time ranges with chunk granularity

User Story

As an application writer, I want my Insights queries to have optimal performance. For this reason, I want Incite to issue CWL Insights queries with time ranges that are aligned to my chunk granularity.

Background

Many map-reduce type big data applications partition their data into time buckets, for example five minutes, one hour, or one day. With these types of applications one may get better query performance by touching only one bucket rather than two. And certainly, if one issues 100 queries, it is preferable to touch only 100 buckets, one per query, rather than two buckets per query (200 in total).

The current logic is to just blindly add Chunk to each chunk's end time to get the next start time and, if Start and Chunk aren't aligned, to use a smaller duration for the last chunk so it doesn't overrun End.

Change

The new logic should make the first chunk the one which has the odd size, so that the start of the second chunk, as well as each successive chunk, aligns with the chunk granularity.
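
To make the proposed behavior concrete, here is a minimal sketch of the alignment calculation using the standard time package; the function name is invented for illustration and this is not Incite's actual code:

// firstChunkEnd returns the end of the first (possibly short) chunk so that
// every subsequent chunk boundary falls on a multiple of the chunk duration.
func firstChunkEnd(start time.Time, chunk time.Duration) time.Time {
	aligned := start.Truncate(chunk) // round start down to a chunk boundary
	if aligned.Equal(start) {
		return start.Add(chunk) // start is already aligned: first chunk is full-size
	}
	return aligned.Add(chunk) // otherwise the first chunk is the odd-sized one
}

For example, with Start at 10:07 and Chunk of 15 minutes, the first chunk would cover 10:07 to 10:15, and every later chunk would start on a 15-minute boundary.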

How to run this?

I would really like to try this out!

I've initialized a new project with go mod init, pasted the "simple app" code into a main.go file, and changed the Groups value to match my log group.
I have a working AWS shell environment: I can run aws s3 ls and get correct data back.
When I run go run main.go, nothing happens and I don't get any output.

Can you please give me slightly more newbie-friendly instructions?
Thank you!

Fuzzy JSON decoding should try to unpack timestamps as time.Time

User Story

As a Go programmer I want the data types extracted using incite.Unmarshal to be useful Go types wherever possible.

Specifically, if I use the fuzzy JSON decoding by providing a []map[string]interface{}, I would like the standard CloudWatch Logs Insights timestamp fields @timestamp and @ingestionTime to be extracted into time.Time.

Comments

This should ideally be a simple local edit to the decodeColAsJSONFuzzy function.
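
A minimal sketch of the kind of conversion being requested. The helper name is invented, and the timestamp layout is an assumption about how Insights renders @timestamp and @ingestionTime values; it is not taken from Incite's code:

// tryTimestamp attempts to interpret a result field value as a CloudWatch
// Logs Insights timestamp and, if successful, returns it as a time.Time.
func tryTimestamp(s string) (time.Time, bool) {
	// Assumed layout, e.g. "2021-10-01 17:15:00.000".
	t, err := time.Parse("2006-01-02 15:04:05.000", s)
	return t, err == nil
}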

Number of records returned is different from when manually querying Logs Insights

User Story

When I use incite to query CloudWatch Logs, the number of logs returned is different from when I query Logs Insights manually.

Background

This is my QuerySpec:

incite.QuerySpec{
	Text:       queryString,
	Start:      time.Date(year, month, day, hour, min, sec, nsec, location),
	End:        start.Add(5 * time.Minute),
	Groups:     []string{"aws/log/group"},
	Limit:      incite.MaxLimit,
	Chunk:      1 * time.Minute,
	SplitUntil: 1 * time.Minute,
}

After querying, I use the encoding/csv package to write the results to a .csv file. This is my code snippet, where dataSlice contains the slices of strings from the log query result.

f, err := os.Create(fileName)
if err != nil {
	fmt.Println(err.Error())
	return
}
// Close the file only after confirming it was created successfully.
defer f.Close()

w := csv.NewWriter(f)
defer w.Flush()

for _, record := range dataSlice {
	err = w.Write(record)
	if err != nil {
		fmt.Println(err.Error())
		return
	}
}

When I use Logs Insights to query manually, it returns 13,006 records, whereas incite only returns 12,767.
The result is consistent when I rerun the program.

Split chunks if their queries time out

User Story

As an application developer, I want increased confidence that I'm getting all available results from Insights even when my queries process so much data that they time out.

Notes

  1. This feature would build on the existing splitting added in issue #3. Instead of only being able to split when we get MaxLimit results, we also split when we get a timeout error back from CWL Insights on a GetQueryResults poll.
  2. Part of this feature should include allowing arbitrary client-side timeouts for chunks. The current CWL timeout is 15m, which is a very long time to wait for feedback of this nature. When running a Query, customers may want to set lower timeouts so they can split faster and have some hope of quickly whittling oversize time ranges down to achievable levels.

Open Questions

  1. Should a new field be added to QuerySpec to request this, or should it happen automatically whenever splitting due to result count is enabled? (A hypothetical field sketch follows this list.)
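
If a new field were added, it might look like the following purely hypothetical sketch, in the same style as the other field sketches in these issues (the name ChunkTimeout is invented for illustration):

type QuerySpec struct {
        ...

        // ChunkTimeout (hypothetical) sets a client-side deadline for
        // each chunk. A chunk still running after this duration would be
        // treated like a chunk that hit the result limit: split into
        // sub-chunks and re-queried.
        ChunkTimeout time.Duration
}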

Progress statistics in `Stats`

User Story

As an interactive application writer, I want to be able to share a "progress bar" or similar information about how quickly Insights queries are progressing, and how much work remains, with my users.

Notes

  1. If we make the counter fields into float64 instead of int we can reflect partial chunks from splitting in the fractional part.

Task list

  1. Add fields to the Stats structure.
  2. Update QueryManager code to track new Stats.
  3. Update Stream code to track new Stats.

Stats field sketch

type Stats struct {
        // ChunksBlocked indicates the number of requested chunks
        // belonging to the stream or query manager which are still
        // waiting to be submitted to the CloudWatch Logs service.
        ChunksBlocked         int
        // ChunksSubmitted indicates the number of requested chunks
        // belonging to the stream or query manager which have been
        // submitted to the CloudWatch Logs service but which haven't
        // yet progressed to a terminal state.
        ChunksSubmitted       int
        // ChunksPartlyCompleted indicates the number of requested
        // chunks belonging to the stream or query manager for which
        // some partial results, but not all results, are available from
        // the CloudWatch Logs service.
        ChunksPartlyCompleted int
        // ChunksCompleted indicates the number of requested chunks
        // belonging to the stream or query manager which are in a
        // completed state because all results have been received.
        ChunksCompleted       int
        // ChunksCancelled indicates the number of requested chunks
        // belonging to the stream or query manager which are in a
        // terminal state because they were cancelled.
        ChunksCancelled       int
        // ChunksFailed indicates the number of requested chunks
        // belonging to the stream or query manager which are in a
        // terminal state because they failed.
        ChunksFailed          int
}

Log chunk query ID

User Story

As an application writer, I want to be able to debug issues that occur when Incite is calling the CloudWatch Logs Insights service.

Background

The logChunk method in incite.go for some reason does not log the chunk's query ID, even when it is available. This makes it hard to cross-reference log messages. For example, if the chunk has the terminal status Failed, Incite 1.0 logs something like this:

2021/10/07 23:58:12 incite: QueryManager(0xc0156e0270) unexpected terminal status chunk "[some query here]" [2021-10-01 17:15:00 +0000 UTC..2021-10-01 17:30:00 +0000 UTC): Failed

Then for good measure, it tries to cancel the chunk. If the cancellation fails, it logs something like this:

2021/10/07 23:58:12 incite: QueryManager(0xc0156e0270) failed to cancel chunk "[some query here]" [2021-10-01 17:30:00 +0000 UTC..2021-10-01 17:45:00 +0000 UTC): error from CloudWatch Logs: AccessDeniedException: User: arn:aws:sts::000000000000:assumed-role/ReadOnly/some-person is not authorized to perform: logs:StopQuery on resource: arn:aws:logs:eu-central-1:000000000000:log-group::log-stream:
status code: 400, request id: 1b37f3d9-0099-4561-a914-43866aa232b0

Are they the same chunk? Probably, but I have no way of knowing if it's a big application that's querying lots of things in lots of accounts simultaneously.

Solution

Amend logChunk so it adds the query ID if it's available.
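
A rough sketch of the intent, assuming the chunk's query ID is available as a string when the log message is built; the helper name and parameters below are invented and do not come from incite.go:

// chunkLogString formats the chunk description used in log messages,
// appending the CloudWatch Logs query ID when it is known.
func chunkLogString(text, timeRange, queryID string) string {
	if queryID == "" {
		return fmt.Sprintf("chunk %q %s", text, timeRange)
	}
	return fmt.Sprintf("chunk %q (query ID %s) %s", text, queryID, timeRange)
}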

Retry transient network errors

User Story

As a developer building analytics using Incite, I want confidence that my CWL Insights queries will not be aborted due to a transient network failure.

Background

I typically use Incite with a "non-retrying" AWS CloudWatch Logs client because Incite handles the retry. It is good at identifying transient errors from the service HTTP layer. However, it is not so good at identifying issues at the transport layer of the stack, for example timeouts and connection resets.

I have seen the below error a few times now. It looks like it might be a connection reset by peer or maybe just the remote service prematurely writing a newline into the response stream. (Probably related to non-graceful issues during service deployment.)

RequestError: send request failed
caused by: Post "https://logs.us-east-2.amazonaws.com/": EOF

This has caused tools built on Incite to prematurely error out. It should be gracefully retried...

Improve documentation about chunks, chunked queries, and Incite differentiators

User Story

As an application developer, I want to understand how chunked queries work so I can better understand whether they are suitable for my use case.

Notes

Chunks and chunked queries

In issue #16, @lloyd-fftrf gave some feedback on improving the documentation around chunked queries. He wrote:

Can you help me explain what's the difference between Chunk and SplitUntil? I've tried reading the documentation but I don't seem to get the difference

...

... For me, I got confused at the first sentence of the Chunk documentation. I kind of not sure what chunked query meant. I think it would be better if the definition of the chunked query is written right after the first sentence. Like below:

// Chunk optionally requests a chunked query and indicates the chunk
// size.

// In a chunked query, each chunk is sent to the CloudWatch Logs
// service as a separate Insights query. This can help large queries
// complete before the CloudWatch Logs query timeout of 15 minutes,
// and can increase performance because chunks can be run in parallel.

// If Chunk is zero, negative, or greater than the difference ...

👆 Here the middle paragraph is @lloyd-fftrf's suggestion for improvement.

Incite differentiators

The Incite README.md sometimes buries the lede or doesn't explain Incite's benefits well enough. For example, it doesn't clarify that you don't need to poll, that it makes the query concurrency limit go away, or that it makes the result limit go away. It should be re-examined with a critical eye and a view to making the benefits pop.

Split chunks when max result limit is reached

User Story

As an application developer, I want increased confidence that I'm getting all available results from Insights even when my queries return more results for a chunk than Insights can provide.

Open questions

  1. One-time splits or recursive splits?

Task list

  1. Update QuerySpec structure to allow user to request splitting of chunks.
  2. Update Stats structure to provide visibility into any chunk splitting that occurred.
  3. Implement chunk splitting.

New structure fields sketch

type QuerySpec struct {
        ...

        // Split specifies how Incite should automatically split chunks
        // which produce the maximum number of results that CloudWatch
        // Logs Insights can provide (MaxResults).
        //
        // If zero, no chunk splitting is done and when a chunk produces
        // MaxResults results those results are put into the result
        // stream and the chunk is considered complete.
        //
        // If a positive number, when a chunk produces MaxResults
        // results, the chunk is split into 2^Split sub-chunks of
        // equal size and the sub-chunks are re-queried.
        //
        // - Limit must be set to max.
        // - Incompatible with Preview.
        Split int

        // ReSplit specifies how many recursive chunk splits Incite should
        // do if sub-chunks resulting from a chunk split themselves produce
        // more than the maximum number of results that CloudWatch
        // Insights can provide.
        //
        // If zero, no recursive splits are done. If Split is also zero, no splits
        // are done at all, and if Split is positive then at most one split is
        // performed.
        //
        // If a positive number then at most that many recursive splits of
        // sub-chunks will be done after the initial split.
        ReSplit int
}

type Stats struct {
        ...

        // ChunksSplit counts the number of chunks which were split into
        // smaller chunks because the initial chunk contains MaxResults
        // results. A zero value indicates no chunks were split in the
        // stream or query manager.
        ChunksSplit int

        ...
}

Dynamically adapt to noisy neighbors

User Story

As an application developer, I want my applications to be resilient to noisy neighbor effects. My log queries should not fail because other users of my AWS account are performing Insights log queries at the same time.

Background

Although Incite is currently very resilient to temporary service errors (it will currently retry up to 10 temporary failures to start a chunk query and 10 temporary errors to poll a chunk for results), persistent lack of query capacity and/or persistent throttling by the AWS CloudWatch Logs Insights service can eventually cause a chunk to fail completely, and when that happens it brings down the query.

Tasks

  1. Introduce a dynamic adaptive adjustment to mgr's Parallel value that goes down, to a minimum value of, say, 1, in response to service quota limit errors and goes back up at a more gradual rate, to a maximum value of Parallel, in response to successful chunk starts.
  2. Introduce a dynamic adaptive RPS adjustment to mgr's request regulator so the requests per second go down in response to throttling errors and rise back up at a more gradual rate in response to non-throttled API calls. (A sketch of one possible adjustment mechanism follows this list.)
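
One common shape for such adjustments is additive-increase/multiplicative-decrease (AIMD). The following is a minimal sketch under that assumption, using the standard math package; the type, method names, and constants are invented for illustration and this is not Incite's code:

// adaptive is a hypothetical AIMD controller for an effective limit such
// as parallelism or requests per second.
type adaptive struct {
	value, min, max float64
}

// onThrottled backs off quickly in response to a quota or throttling error.
func (a *adaptive) onThrottled() {
	a.value = math.Max(a.min, a.value/2)
}

// onSuccess recovers gradually in response to successful, unthrottled calls.
func (a *adaptive) onSuccess() {
	a.value = math.Min(a.max, a.value+0.1)
}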

Refresh service quota limits (by 9/30/2022)

User Story

As an application developer, I want to be able to maximize the throughput of my log processing using Incite.

  1. Therefore I want Incite to use the most up-to-date service quota limits for the CloudWatch Logs Insights API.
  2. Furthermore, if my service quota limit is "soft" (can be increased on request to AWS), I want to be able to consume all of my increased capacity without being limited by Incite.

Notes

The CloudWatch Logs Service Quota Limits page is here. Check if things have changed.

As of v1.3.2, Incite has:

Note that as of 2022-07-26, the Service Quota Limits page says the query concurrency quota can be INCREASED on request to AWS support. Therefore Incite's hard-limit approach to concurrency is too restrictive and should be fixed.

Stream freezes if enough of its chunks fail to start

Background

This is related to the bug I just fixed, in which, after the v1.3.0 concurrency refactor, the query manager in mgr.go was not correctly respecting its own Config.Parallel configuration. This could result in the query manager trying to start more queries than there is available service quota in CWL, resulting in transient errors like:

LimitExceededException: Account maximum query concurrency limit of
    [10] reached. (Service: AWSLogs; Status Code: 400; Error Code:
    LimitExceededException; Request ID: ...

The query manager was correctly retrying these so you wouldn't even notice them unless you looked at the logs, but...

Bugs

Main Bug - starter

Relatedly, the chunk start worker (starter.go) has a bug where if it gets a transient failure to start a query, it doesn't set the chunk error.

The result is that if the chunk retry limit is exceeded within the starter, the chunk goes back to the query manager without any error set on it. Because the chunk state is still starting, the manager does the right thing and goes through the killStream flow, but since there is no error, killStream does not actually cause the stream to die, with the result that anyone blocked on a stream read will wait forever instead of getting an error message.

Second Bug - poller

There is an analogous bug in the chunk poll worker (poller.go) where it handles temporary errors.

Speculation on Fixing

To my mind the following fixes need to be done:

  1. Slight restructure of start/poll manipulations so they always set c.err if there's an error, even if it's transient.
  2. The worker loop should nil out the chunk error before running the manipulation (see the sketch after this list).
  3. Obviously unit tests should exercise this scenario, both holistically (ideally via a scenario test) and for each worker individually.
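
A rough sketch of points 1 and 2 together, with invented names (w, in, out, manipulate) since the real code lives in starter.go and poller.go; this is the shape of the fix, not Incite's actual worker loop:

// Hypothetical worker loop: clear any stale error before each attempt and
// always record the manipulation's outcome, even when it is transient, so
// the query manager sees an error on a chunk that exhausted its retries.
for c := range w.in {
	c.err = nil             // point 2: reset before running the manipulation
	c.err = w.manipulate(c) // point 1: always set c.err, even for transient errors
	w.out <- c
}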

Option to receive chunks in sorted order

User Story

As an application writer, I would like to be able to receive all my query results sorted in ascending order of time.

Notes

  • A query writer can indicate sort @timestamp asc.
  • In multi-chunk queries, the chunks are started in time order.
  • However, in multi-chunk queries, the completed chunks are sent to the stream in the order completion occurs, which may not be time order.

Task List

  1. Add field to QuerySpec to request this, or perhaps request turning it off instead, and have it on by default.
  2. Add some kind of queuing system for completed chunks that are "early" and can't be sent to the stream because a predecessor chunk isn't finished yet.

Example

type QuerySpec struct {
        ...
        // Jumble indicates that results from completed chunks may be
        // sent to the stream as soon as they are finished even if the
        // results from an earlier chunk have not been sent yet.
        //
        // Setting Jumble to true may improve performance of
        // applications which don't need the results in time order, or
        // which need to post-process the results in any case.
        Jumble bool
        ...
}

Retry failed queries

User Story

As an application writer, I want my application to be resilient to transient failures that Insights does not protect me from.

Background

As of 2021-10-07, an Insights query which is accepted, syntactically valid, and running against valid log groups may simply "fail" for unexplained reasons. When this happens, the CloudWatch Logs Insights service puts the query into status "Failed" and the QueryManager will log:

2021/10/07 23:46:58 incite: QueryManager(0xc00063ba00) unexpected terminal status chunk "[some query here]" [2021-10-02 17:00:00 +0000 UTC..2021-10-02 17:15:00 +0000 UTC): Failed

The stream will then return an error to the same effect.

The customer should be protected from these transient failures to some degree.

Task List

  1. Add a configuration item in Config that specifies a finite number of times to allow retrying a chunk (a field sketch follows this list).
    • Could be a fixed number per chunk.
    • Could also be a throttled global number per QueryManager to try to prevent retry storms during big outage type events.
  2. Implement the code in QueryManager.
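
A hypothetical shape for the configuration item in task 1, in the sketch style used elsewhere in these issues (the field name ChunkMaxRetry is invented for illustration):

type Config struct {
        ...

        // ChunkMaxRetry (hypothetical) is the maximum number of times the
        // QueryManager will restart a chunk whose Insights query ends in
        // the Failed status before surfacing the error on the stream.
        // Zero preserves the current behavior of no retries.
        ChunkMaxRetry int
}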

Provide transparency when chunk result counts are maxed out

User Story

As an application developer, I want to know when my query may be at risk of incomplete results due to the CloudWatch Logs Insights 10K result limit.

Task list

  1. Add a counter field to the Stats structure.
  2. Update QueryManager code to increment the counter.

Stats counter field sketch

type Stats struct {
        // ChunksMaxed counts the number of chunks belonging to the
        // stream or query manager for which the chunk contained the
        // maximum number of results (MaxResults) even after splitting.
        //
        // This field can be used to assess whether, and to what extent,
        // the returned results may be incomplete.
        //
        // (By making it a floating point value, it can possibly
        // reflect splitting, i.e. if a quartered chunk still maxes out
        // we can have ChunksMaxed = 0.25.)
        ChunksMaxed float64
}

Retry API calls when the CWL API response payload can't be deserialized

User Story

As a developer building analytics using Incite, I want confidence that my CWL Insights queries will not be aborted due to a transient network failure.

In particular, I don't want my queries to fail due to an error like the one below if retrying the CloudWatch Logs API request that produced it would have succeeded:

ERROR [incite: query ID "7e623cab-90dc-4417-97ac-d5e728c57ae8" had unexpected error [query text "<some query>"]: SerializationError: failed to unmarshal response error
        status code: 503, request id: 744CFBE1FEEAB934
caused by: UnmarshalError: error message missing]

Details

Having seen this error several times, it is my belief that the above UnmarshalError represents some kind of transient HTTP problem that would succeed on retry. In this instance it seems the CloudWatch Logs service wanted to return HTTP 503 service unavailable but for some reason:

  • either the response payload containing the error message JSON got truncated, leading to a failure to deserialize; or
  • the HTTP 503 error emanated from a component that erroneously doesn't produce a proper response body.

In fact, looking at it more closely, the message emanates from unmarshal_error.go (here), which in turn comes from unmarshal.go noting an io.EOF and consequently returning the UnmarshalError with the message "error message missing" (see here).

So the root cause was the remote host/load balancer closing the connection and/or sending back an empty response body.

Fix

This can be fixed by slightly enhancing isTransient.
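
A minimal sketch of the kind of additional check being suggested. The function name is invented, the real isTransient is internal to Incite, and treating every SerializationError as retryable is an assumption based on the error shown above:

// isTransientSerialization reports whether an error looks like the
// "failed to unmarshal response error" case described in this issue.
func isTransientSerialization(err error) bool {
	if ae, ok := err.(awserr.Error); ok && ae.Code() == "SerializationError" {
		// Assumption: a response whose error body could not be parsed
		// (for example a truncated body producing io.EOF) is safe to retry.
		return true
	}
	return false
}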

Don't allow start of time range to equal end

Description

With the dynamic chunk splitting feature active, and Parallel greater than 1, it is possible to get into a situation where mgr.getNextReadyChunk() creates a chunk with the "null" time range QuerySpec.End ... QuerySpec.End (at mgr.go lines 132-134).

This bug is described in detail by @artificial-aidan and @pierre-samsara in PR #24.

This results in the StartQuery request to the CloudWatch Logs Insights service failing with the error:

[2023-01-25 23:00:00 +0000 UTC..2023-01-25 23:00:00 +0000 UTC): InvalidParameterException: End time cannot be less than Start time (Service: AWSLogs; Status Code: 400; Error Code: InvalidParameterException;

And this error kills the entire query.

History

This bug was either introduced in v1.2.0 ("Dynamic chunk splitting and progress stats") or in v1.3.0 (Better query performance through higher concurrency).

Cause

Splitting chunks causes mgr.n to increase, which depending on the vicissitudes of parallelism can result in mgr.next < mgr.n and the stream being pushed back into the priority queue in cases where there are no more chunks available to lazily create. So the next chunk from stream.nextChunkRange() ends up having an empty time range because of this code.

Really the root cause is that mgr.n is trying to capture two distinct concepts at the same time:

  1. How many generation 0 chunks need to be lazily created from a stream living in the priority-ordered stream heap mgr.pq? (A static number.)
  2. How many total chunks are known to be needed to complete the stream at the current time? (A dynamic number which increases with chunk splitting.)

Possible Solution

One option is to create a new field mgr.last to do the job mgr.n was originally meant to do, namely provide the constant number of original chunks requested before splitting. Chunk splitting caused mgr.n to stealthily take on a second duty and that should be reversed.

Use DescribeQueries rather than GetQueryResults to poll for status

User Story

As an application developer, I want my log queries to run as fast as possible.

Background

Currently (as of v1.2.0 and soon to be v1.3.0), Incite uses the CloudWatch Logs GetQueryResults API operation to poll the status of running queries.

This is limiting in several ways:

  • The service quota limit for GetQueryResults is 5 TPS and this value cannot be increased.
  • Only one running CWL query (Incite chunk) at a time may be polled with this operation, which means:
    • at most 5 different CWL queries (Incite chunks) may be polled per second;
    • if Incite is consuming 100% of CWL capacity, it takes a full two-second cycle to poll all running chunks at full parallelism.

CloudWatch Logs recently introduced a new DescribeQueries API operation. The RPS limit doesn't seem to be onboarded to service quotas yet, but because the operation is capable of returning the status of many CWL queries (Incite chunks) at once, there is an opportunity to reduce the latency between when a query changes status and when Incite learns about it, even at 5 TPS.

Notes

  • If we do this work, note that preview-mode streams would still have to poll with GetQueryResults.
  • The work is a fairly big rock since it would entail:
    • creating another worker sub-type and differentiating between polling for status changes and polling for results; and
    • rewriting zillions of mocks.
  • The DescribeQueries API isn't super-ergonomic for our use case, because while it has some filters, you can't combine the filters in a way that's efficient. So implementing this in a way that scales to many queries in a QM would require making the polling code pretty clever, and that is yet more work.

Refactor concurrency model for performance and comprehension

User Story

  1. As an application developer using Incite to support my app, I want the best local app performance possible subject only to CloudWatch Logs service limitations.
  2. As an Incite developer I want to understand how the heck it works.

Background

The current QueryManager concurrency design is over-complicated and likely makes inefficient use of available CWL resources.

  1. The current design is a bit hard to grok since the outer loop() basically tries to do one action per iteration and tries to prioritize starting chunks over polling results.
  2. As an example of inefficient resource usage: if you start a query with, say, 10 startable chunks, your quota limit is 5 TPS, and the Insights web service is capable of sub-200 ms responses, then it will take about 2 seconds to start all the chunks, and during this period polling is throttled. If one of those chunks finishes in that time (or, for previewing, has intermediate results), those results will not be processed.

Proposal

Idea

Notionally there are now four queues in the QueryManager:

  1. The priority queue of unfinished *stream values (basically queries with unstarted chunks).
  2. The ready queue of startable high-priority chunks (fresh chunks, restarts, and split results). (For some reason this is a *ring.Ring in the code but probably could be a straight slice.)
  3. The polling ring of running chunks.
  4. The cancel queue of chunks that need to be cancelled due to a Close() call or another chunk in the same query dying unexpectedly.

Roughly speaking, the idea is to have a separate goroutine for each job, each owning the data structures that make sense for it (a rough wiring sketch follows this list). Something like:

  1. A starter goroutine owns the priority queue, the ready queue, and making StartQuery calls. It will block if the size of the polling ring plus the cancel queue is bigger than Parallel.
  2. A stopper goroutine owns the cancel queue which is probably a simple chan *chunk and making StopQuery calls. It is usually blocked but springs into action when chunks need to be cancelled, and it has the ability to unblock the starter goroutine anytime the sum of the cancel queue and the polling ring drops below Parallel.
  3. A poller goroutine owns the polling ring and making GetQueryResults calls. Based on the outcome of these calls it may mark a stream as being in error status (the starter goroutine will kill it), rotate the polling ring, remove a chunk from the polling ring, and possibly add one or more chunks to the back (front?) of the ready queue.
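
To make the data flow concrete, here is a very rough wiring sketch with invented names (chunk, starter, stopper, poller, and the channels); nothing below describes Incite's actual code:

// Hypothetical wiring between the proposed goroutines.
ready := make(chan *chunk)   // startable chunks, owned by the starter
polling := make(chan *chunk) // running chunks, consumed by the poller
cancel := make(chan *chunk)  // chunks to stop, consumed by the stopper

go starter(ready, polling, cancel) // owns the priority queue, ready queue, and StartQuery calls
go stopper(cancel)                 // owns the cancel queue and StopQuery calls; unblocks the starter
go poller(polling, ready, cancel)  // owns the polling ring and GetQueryResults calls; may re-queue split chunks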

Out of Scope

Building on the above separation of concerns, an even crazier improvement would be to run each individual web service call in its own sub-goroutine forked off by the concern-owning goroutine. For example, the starter goroutine could fork off a new goroutine for each StartQuery call. This would potentially further improve parallelism by, for example, letting us start chunks up to maximum capacity within the first half second rather than taking more than two seconds.

This would be nice for a follow-up issue if we can get the main refactor done.
