snowplow-devops / kinsumer

This project is forked from twitchscience/kinsumer.

License: Other

Go 99.38% Makefile 0.62%

kinsumer's Introduction

Kinsumer

Native Go consumer for AWS Kinesis streams.

Rationale

There are several very good ways to consume Kinesis streams, primarily the Amazon Kinesis Client Library, and it is recommended that you investigate it as an option.

Kinsumer is designed for a cluster of Go clients where each client consumes from multiple shards. It aims to be at-least-once, with a strong effort to be exactly-once. By design, Kinsumer does not attempt to keep shards on a specific client and will shuffle them around as needed.

Behavior

Kinsumer targets a specific Kinesis-consuming use case: you need multiple clients, each handling multiple shards, and you do not care which shard is consumed by which client.

Kinsumer rebalances shards across clients whenever it detects that the list of shards or the list of clients has changed, and it does not attempt to keep shards on the same client.

If you are running multiple Kinsumer apps against a single stream, make sure to increase the throttleDelay to at least 50ms + (200ms * <the number of reader apps>). Note that Kinesis does not support more than two readers per writer on a fully utilized stream, so make sure you have enough stream capacity.
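
For example, with three reader apps the delay should be at least 650ms. A minimal sketch of deriving and applying this value - assuming the fork keeps the upstream Config API (NewConfig, WithThrottleDelay) and that the import path below matches the fork's module path:

package example

import (
    "time"

    "github.com/snowplow-devops/kinsumer"
)

// throttleDelayFor applies the rule of thumb above: 50ms plus 200ms per reader app.
func throttleDelayFor(readerApps int) time.Duration {
    return 50*time.Millisecond + time.Duration(readerApps)*200*time.Millisecond
}

// configFor builds a Config with the derived throttle delay.
func configFor(readerApps int) kinsumer.Config {
    return kinsumer.NewConfig().WithThrottleDelay(throttleDelayFor(readerApps))
}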

Example

See cmd/noopkinsumer for a fully working example of a kinsumer client.
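
The rough shape of such a client is sketched below. This assumes the upstream API surface (New, Run, Next, Stop) and the fork's module path; stream, application and client names are placeholders, so treat it as an outline rather than the reference implementation in cmd/noopkinsumer.

package main

import (
    "log"

    "github.com/snowplow-devops/kinsumer"
)

func main() {
    k, err := kinsumer.New("my-stream", "my-app", "client-1", kinsumer.NewConfig())
    if err != nil {
        log.Fatal(err)
    }
    if err := k.Run(); err != nil {
        log.Fatal(err)
    }
    defer k.Stop()

    for {
        record, err := k.Next() // blocks until a record arrives or the client stops
        if err != nil {
            log.Fatal(err)
        }
        if record == nil {
            return // a nil record means the client has stopped
        }
        // process record ([]byte) here
    }
}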

Testing

Testing with local test servers

By default the tests look for a DynamoDB server at localhost:4567 and a Kinesis server at localhost:4568.

For example, using kinesalite and dynalite:

kinesalite --port 4568 --createStreamMs 1 --deleteStreamMs 1 --updateStreamMs 1 --shardLimit 1000 &
dynalite --port 4567 --createTableMs 1 --deleteTableMs 1 --updateTableMs 1 &

Then go test ./...

Testing with real AWS resources

It's possible to run the tests against real AWS resources, but the tests create and destroy resources, which can be finicky and potentially expensive.

Make sure you have your credentials set up in a way that aws-sdk-go is happy with, or run on an EC2 instance.

Then go test . -dynamo_endpoint= -kinesis_endpoint= -resource_change_timeout=30s

kinsumer's People

Contributors

brentnd, colmsnowplow, dcelasun, fugitech, garethlewin, jbeemster, matwalsh, mjperrone, slydon, sytten, taraspos, teamdoug, willsewell

Forkers

svitcov

kinsumer's Issues

Update integration testing setup

The upstream repo has integration tests but no baked-in infrastructure to run them, and the tests are skipped in CI.

Add localstack and a Makefile, and resolve any issues with running the integration tests.

Add unit tests for shard consumer issues

Once the major race-condition-related issues in #9 were addressed, we were still left with failing integration tests, primarily TestKinsumer and TestSplit.

These all seem to be related to how shard_consumer behaves - create unit tests to isolate those behaviours, investigate the remaining issues, and prove out solutions.

Add check for kinesis scaling action

Currently we rebalance shards any time we notice the Kinesis shard count change. We have noticed that at high shard counts (100+), Kinesis scaling can take a very long time - up to a couple of hours - before the shard count is stable again.

This causes kinsumer to repeatedly stop and restart itself, which can be an issue for apps using it.

One solution to work around this is to check the stream status at the same time that we check the shard count, and only rebalance once it's done scaling.
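
A sketch of that check using aws-sdk-go (v1); streamActive is a hypothetical helper, not part of the current codebase - the idea is that rebalancing would be deferred until it returns true:

package kinsumer

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/kinesis"
    "github.com/aws/aws-sdk-go/service/kinesis/kinesisiface"
)

// streamActive reports whether the stream has finished scaling.
func streamActive(k kinesisiface.KinesisAPI, stream string) (bool, error) {
    out, err := k.DescribeStreamSummary(&kinesis.DescribeStreamSummaryInput{
        StreamName: aws.String(stream),
    })
    if err != nil {
        return false, err
    }
    // The status is UPDATING while a resharding operation is in progress.
    return aws.StringValue(out.StreamDescriptionSummary.StreamStatus) == kinesis.StreamStatusActive, nil
}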

Add fix for duplicates caused by consume exiting too early

TestSplit sometimes results in duplicates. It's difficult to reproduce the issue directly in that integration test, but we consistently hit it over the course of ~7 runs of the test in CI. Locally, it occurs more often when resources are constrained.

Deep-dive investigation and tests to isolate the related behaviours (see #10) reveal that it occurs in TestSplit when shards merge, because the shard count changes slowly over time and there is a chance that, for a period of time, different clients are operating on the same shards.

This alone doesn't produce duplicates; however, when a k.stop() request is sent, the consume function exits immediately and pauses all checkpointing.

When this stop request happens after a record has already been pushed to k.records (and therefore out to the user), but before its checkpointer.update() function is called, that record is processed again by the next client to take ownership of the shard.

In fact, even when the shard doesn't change owner, the same problem exists if we stop and restart consuming the shard between pushing to k.records and calling checkpointer.update(). Both of these scenarios can be isolated with (relatively convoluted) unit tests.

We might expect that the deferred call to release() on exit mitigates this, but it does not - I'm not sure whether this is because it evaluates the checkpointer too early, or because a call to update() doesn't update the sequence number in time.

The solution to mitigate this issue is to amend the behaviour when a stop request comes in. We break from both the loop which executes commits/shard refreshes, and the loop which iterates the shard and returns records. Then, before returning, we start a new commit loop which commits as soon as we recognise a new sequence number, and exits only after a timeout.

This may introduce some latency when shard ownership changes - however such shard ownership changes are hopefully rare enough for this impact to be tolerable. This latency impact can also be managed by correlating the timeout to the now configurable maxAgeForClientRecord.
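
A hypothetical sketch of that drain behaviour (latestSequence, commit and drainTimeout are illustrative names, not the fork's actual identifiers):

package kinsumer

import "time"

// drainCommits keeps committing newly-seen sequence numbers after a stop request,
// instead of exiting immediately, and returns only once the timeout expires.
func drainCommits(latestSequence func() string, commit func(seq string) error, drainTimeout time.Duration) {
    deadline := time.NewTimer(drainTimeout)
    defer deadline.Stop()
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()

    committed := ""
    for {
        select {
        case <-deadline.C:
            return // exit only after the timeout
        case <-ticker.C:
            if seq := latestSequence(); seq != "" && seq != committed {
                if err := commit(seq); err == nil {
                    committed = seq // commit as soon as we recognise a new sequence number
                }
            }
        }
    }
}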

Add unbecomingLeader flag to avoid race condition in unbecomeLeader().

It is possible for unbecomeLeader() to be called more than once. If it is called a second time while the first call is waiting for the waitgroup to finish, then leaderLost will not be nil, and it will attempt to close the leaderLost channel a second time, resulting in the app panicking.

We can resolve this by adding an unbecomingLeader flag to the client. If unbecomingLeader is true, we wait for it to become false before proceeding with the unbecomeLeader() function.
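
A minimal sketch of that guard (field names mirror the issue text, and a second caller simply returns rather than waiting, which is a simplification of the proposed fix - this is not the actual implementation):

package kinsumer

import "sync"

type client struct {
    mu               sync.Mutex
    unbecomingLeader bool
    leaderLost       chan struct{}
    leaderWG         sync.WaitGroup
}

func (c *client) unbecomeLeader() {
    c.mu.Lock()
    if c.leaderLost == nil || c.unbecomingLeader {
        // Either we were never leader, or another call is already tearing down.
        c.mu.Unlock()
        return
    }
    c.unbecomingLeader = true
    c.mu.Unlock()

    close(c.leaderLost) // guarded by the flag, so this runs at most once
    c.leaderWG.Wait()   // wait for the leadership goroutines to exit

    c.mu.Lock()
    c.leaderLost = nil
    c.unbecomingLeader = false
    c.mu.Unlock()
}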

Resolve race conditions in shard ownership

Because a record's 'staleness' interval is tied to the shardCheckFrequency, when we set this frequency to a low number, we can capture a shard, and have another client claim ownership of the shard before it's released.

We can resolve this race condition by decoupling the shardCheckFrequency from the maxAgeForClientRecord, since it is reasonable that we may take longer than 2.5 seconds to process a record when we poll for records every half-second.

We should also investigate whether maxAgeForLeaderRecord has a similar issue.
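
To illustrate the coupling (example values, not necessarily the defaults):

package main

import (
    "fmt"
    "time"
)

func main() {
    // With the current coupling, polling every half-second means any record that
    // takes longer than 2.5 seconds to process makes the client look stale.
    shardCheckFrequency := 500 * time.Millisecond
    coupledMaxAge := 5 * shardCheckFrequency
    fmt.Println(coupledMaxAge) // 2.5s

    // Decoupled: maxAgeForClientRecord would be configured independently of the poll frequency.
    decoupledMaxAge := 30 * time.Second
    fmt.Println(decoupledMaxAge)
}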

Gracefully handle ownership clash errors

When a change in shards or clients is detected, shard ownership changes.

If an old checkpointer attempts to claim ownership of a shard after the cutoff for the old checkpointer to release, we can hit a scenario where the old checkpointer is still running but cannot commit or release the checkpoint - because another client owns it now.

Current behaviour is to return an error in these cases, with the intention that we reboot the entire app. We can mitigate against such clashes with the measures outlined in #3 (comment), but this issue arises from timings which depend on the runtime, so those alone cannot guarantee that we avoid the issue.

The suggested fix is to keep track of the last action internally in the checkpointer, and to ignore ConditionalCheckFailed exceptions when committing or releasing a checkpoint if the last update is older than the cutoff.

Note that this seems liable to result in duplicates, depending on timings. Ideally we find a way to avoid those duplicates, but the existing strategy of throwing an error and rebooting in these cases is also liable to produce duplicates.
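
A sketch of that check using aws-sdk-go (v1); ownershipClashIsExpected, lastUpdate and cutoff are illustrative names, not the existing fields:

package kinsumer

import (
    "time"

    "github.com/aws/aws-sdk-go/aws/awserr"
    "github.com/aws/aws-sdk-go/service/dynamodb"
)

// ownershipClashIsExpected treats a conditional-check failure as benign when this
// checkpointer hasn't successfully updated the row within the cutoff, because by
// then another client legitimately owns the shard.
func ownershipClashIsExpected(err error, lastUpdate time.Time, cutoff time.Duration) bool {
    if aerr, ok := err.(awserr.Error); ok &&
        aerr.Code() == dynamodb.ErrCodeConditionalCheckFailedException {
        return time.Since(lastUpdate) > cutoff
    }
    return false
}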

Update checkpointer to commit periodically when there's no data

When a client consumes a shard which has no data, kinsumer's behaviour is indistinguishable from the scenario where the client has hit some problem and the data is stale - another client will seize ownership of the shard, and clients consuming empty shards will stop registering their existence in DDB. This causes repeated redistribution of shards across consumers.

Ideally we avoid this behaviour - while it's uncommon for shards to be empty, in low-latency use cases with scale fluctuations we can hit this scenario, and it will cause a problematic amount of stopping and starting of consumers. These restarts increase the likelihood of duplicates and other shard-ownership problems when data does start to flow again.

We can avoid it by modifying the commit() mechanism to register itself occasionally when there is no data coming from the iterator.

This, however, should only happen when we're not receiving data to push into the k.records channel. Otherwise, in legitimate cases where a client is too slow or has some other issue, other clients won't have a chance to forcefully take ownership of the shard.

The behaviour should therefore be updated as follows:
When:

  • we have no new sequence number
    AND
  • it's been more than maxAgeForClientRecord/2 since our last commit
    AND
  • it's been more than maxAgeForClientRecord/2 since the last record was pushed to k.records

Then we should commit to DDB.

This way, ownership can change when it should, but lack of data doesn't cause issues.
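
A sketch of that condition (lastCommit and lastRecordPushed are illustrative stand-ins for the checkpointer's state):

package kinsumer

import "time"

// shouldCommitWithoutData implements the three-part condition above: no new sequence
// number, and both the last commit and the last record pushed to k.records are older
// than half of maxAgeForClientRecord.
func shouldCommitWithoutData(haveNewSequenceNumber bool, lastCommit, lastRecordPushed time.Time, maxAgeForClientRecord time.Duration) bool {
    idle := maxAgeForClientRecord / 2
    return !haveNewSequenceNumber &&
        time.Since(lastCommit) > idle &&
        time.Since(lastRecordPushed) > idle
}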

Add configuration of clientRecordMaxAge

As described in #3, to reduce the chance of ownership clashes, and to accommodate a model where we have separated Next() from checkpointing, we should allow configuration of the clientRecordMaxAge. The default should be the pre-existing behaviour of 5 * shardCheckFrequency.
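
A sketch of what such an option could look like, mirroring the style of the existing Config builders; the Config struct and all names here are stand-ins, not the fork's actual implementation:

package kinsumer

import "time"

// Config is a stand-in for the real kinsumer Config.
type Config struct {
    shardCheckFrequency time.Duration
    clientRecordMaxAge  *time.Duration
}

// WithClientRecordMaxAge sets how stale a client record may be before the client is
// considered gone; leaving it unset keeps the pre-existing default.
func (c Config) WithClientRecordMaxAge(maxAge time.Duration) Config {
    c.clientRecordMaxAge = &maxAge
    return c
}

func (c Config) clientRecordMaxAgeOrDefault() time.Duration {
    if c.clientRecordMaxAge != nil {
        return *c.clientRecordMaxAge
    }
    return 5 * c.shardCheckFrequency // the pre-existing behaviour
}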

Modify getClients() and shard refresh behaviour to avoid ownership clashes

On every shardChange ticker we refresh the shards and assess whether any ownership changes have occurred.

There are two central problems that may occur when this happens.

Firstly, refreshShards calls getClients, which returns a list of clients, filtering out any client that hasn't updated the clients DDB table within maxAgeForClientRecord. As described in #3, when we set low frequency/maxAge values this can produce incorrect results: the other clients may still exist, but they simply haven't had time to register with the table, since they need time to process the data plus time for their shardChangeTicker to register them.

We can mitigate this by adding one shardCheckFrequency to the cutoff when getting clients, which ensures that active clients have time to register with the clients table.

Secondly, refreshShards registers with the clients table, then calls getClients with that filter. Occasionally, when the cutoff is very low and the app is misbehaving (a poor connection to DDB, or performance issues in the overall app, for example), this can cause refreshShards to update the clients table and then not get the current client back when it reads the list from DDB. We can prevent this by using the time at which refreshShards started as the reference point for the filter, rather than the time at which we get the clients.
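
A sketch of both mitigations combined (clientCutoff and refreshStartedAt are illustrative names):

package kinsumer

import "time"

// clientCutoff widens the staleness cutoff by one shardCheckFrequency and anchors it
// to the time the refresh started, rather than the time getClients runs.
func clientCutoff(refreshStartedAt time.Time, maxAgeForClientRecord, shardCheckFrequency time.Duration) time.Time {
    // Clients whose DDB record is older than this are treated as gone.
    return refreshStartedAt.Add(-(maxAgeForClientRecord + shardCheckFrequency))
}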
