mailgun / gubernator

High Performance Rate Limiting MicroService and Library

License: Apache License 2.0

Makefile 0.93% Go 92.34% Shell 0.41% Dockerfile 0.34% Python 3.13% Smarty 0.79% HCL 2.07%
rate-limiting rate-limiter golang golang-library microservice grpc cloudnative

gubernator's Introduction

Gubernator Logo
Distributed Rate Limiting Service

DEVELOPMENT ON GUBERNATOR HAS MOVED TO A NEW HOME AT gubernator-io/gubernator

v2.4.0 is the final version available from this repo; all new features and bug fixes will land in the new repo.

Gubernator

Gubernator is a distributed, high performance, cloud native and stateless rate-limiting service.

Features

  • Gubernator evenly distributes rate limit requests across the entire cluster, which means you can scale the system by simply adding more nodes.
  • Gubernator doesn’t rely on external caches like Memcached or Redis, so there is no deployment synchronization with a dependent service. This makes dynamically growing or shrinking the cluster in an orchestration system like Kubernetes or Nomad trivial.
  • Gubernator holds no state on disk; its configuration is passed to it by the client on a per-request basis.
  • Gubernator provides both GRPC and HTTP access to the API.
  • It can be run as a sidecar to services that need rate limiting, or as a separate service.
  • It can be used as a library to implement a domain-specific rate limiting service.
  • Supports optional eventually consistent rate limit distribution for extremely high throughput environments. (See the GLOBAL behavior section in architecture.md.)
  • Gubernator is the English pronunciation of "governor" in Russian; it also sounds cool.

Stateless configuration

Gubernator is stateless in that it doesn’t require disk space to operate. No configuration or cache data is ever synced to disk. This is because every request to gubernator includes the config for the rate limit. At first this might seem like unnecessary overhead on each request. In reality, a rate limit config is made up of only four 64-bit integers.
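
As a rough illustration only (not Gubernator's actual generated types), the per-request configuration boils down to a handful of integer fields, mirroring the request shown in the ProtoBuf example below:

package main

import "fmt"

// Illustrative sketch only: these field names mirror the request fields shown
// in the ProtoBuf example below, not Gubernator's actual protobuf types.
type rateLimitConfig struct {
	Hits      int64 // hits applied by this request
	Limit     int64 // total hits allowed per duration
	Duration  int64 // length of the window in milliseconds
	Algorithm int64 // 0 = token bucket, 1 = leaky bucket
	Behavior  int64 // bit flags such as BATCHING or GLOBAL
}

func main() {
	cfg := rateLimitConfig{Hits: 1, Limit: 10, Duration: 1000}
	fmt.Printf("%+v\n", cfg) // the whole per-request config is a handful of integers
}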

Quick Start

# Download the docker-compose file
$ curl -O https://raw.githubusercontent.com/mailgun/gubernator/master/docker-compose.yaml
# Run the docker container
$ docker-compose up -d

Now you can make rate limit requests via curl

# Hit the HTTP API at localhost:9080 (GRPC is at 9081)
$ curl http://localhost:9080/v1/HealthCheck

# Make a rate limit request
$ curl http://localhost:9080/v1/GetRateLimits \
  --header 'Content-Type: application/json' \
  --data '{
    "requests": [
        {
            "name": "requests_per_sec",
            "uniqueKey": "account:12345",
            "hits": "1",
            "limit": "10",
            "duration": "1000"
        }
    ]
}'

ProtoBuf Structure

An example rate limit request sent via GRPC might look like the following

rate_limits:
    # Scopes the request to a specific rate limit
  - name: requests_per_sec
    # A unique_key that identifies this instance of a rate limit request
    unique_key: account_id=123|source_ip=172.0.0.1
    # The number of hits we are requesting
    hits: 1
    # The total number of requests allowed for this rate limit
    limit: 100
    # The duration of the rate limit in milliseconds
    duration: 1000
    # The algorithm used to calculate the rate limit
    # 0 = Token Bucket
    # 1 = Leaky Bucket
    algorithm: 0
    # The behavior of the rate limit in gubernator.
    # 0 = BATCHING (Enables batching of requests to peers)
    # 1 = NO_BATCHING (Disables batching)
    # 2 = GLOBAL (Enable global caching for this rate limit)
    behavior: 0

An example response would be

rate_limits:
    # The status of the rate limit.  OK = 0, OVER_LIMIT = 1
  - status: 0,
    # The current configured limit
    limit: 10,
    # The number of requests remaining
    remaining: 7,
    # A unix timestamp in milliseconds of when the bucket will reset, or if 
    # OVER_LIMIT is set it is the time at which the rate limit will no 
    # longer return OVER_LIMIT.
    reset_time: 1551309219226,
    # Additional metadata about the request the client might find useful
    metadata:
      # This is the name of the coordinator that rate limited this request
      "owner": "api-n03.staging.us-east-1.mailgun.org:9041"

Rate Limit Algorithms

Gubernator currently supports 2 rate limit algorithms.

  1. Token Bucket: the implementation starts with an empty bucket, then each Hit adds a token to the bucket until the bucket is full. Once the bucket is full, requests return OVER_LIMIT until the reset_time is reached, at which point the bucket is emptied and requests return UNDER_LIMIT again. This algorithm is useful for enforcing very bursty limits (i.e., applications where a single request can add more than one hit to the bucket, or non-network-based queuing systems). The downside of this implementation is that once you have hit the limit, no more requests are allowed until the configured rate limit duration resets the bucket to zero.

  2. Leaky Bucket is implemented similarly to Token Bucket in that OVER_LIMIT is returned when the bucket is full. However, tokens leak from the bucket at a consistent rate, calculated as duration / limit (a simplified leak calculation is sketched below). This algorithm is useful for metering, as the bucket leaks, allowing traffic to continue without having to wait for the configured rate limit duration to reset the bucket to zero.
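
The following is a minimal sketch (not Gubernator's actual implementation) of the leak calculation described above: with a limit of 100 hits per 1000 ms, one token leaks back every duration / limit = 10 ms.

package main

import (
	"fmt"
	"time"
)

// leakyRemaining is a simplified sketch of the leak math described above, not
// Gubernator's code: tokens drain back into "remaining" at a rate of one token
// every duration/limit milliseconds.
func leakyRemaining(remaining, limit, durationMs int64, lastUpdated, now time.Time) int64 {
	rate := float64(durationMs) / float64(limit) // milliseconds per leaked token
	elapsed := float64(now.Sub(lastUpdated).Milliseconds())
	remaining += int64(elapsed / rate)
	if remaining > limit {
		remaining = limit
	}
	return remaining
}

func main() {
	start := time.Now()
	// limit=100 per 1000ms: one token every 10ms, so roughly 25 tokens drain
	// back after 250ms of no traffic.
	fmt.Println(leakyRemaining(0, 100, 1000, start, start.Add(250*time.Millisecond)))
}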

Performance

In our production environment, for every request to our API we send 2 rate limit requests to gubernator for evaluation: one to rate limit the HTTP request itself, and the other to rate limit the number of recipients a user can send an email to within a specific duration. Under this setup a single gubernator node fields over 2,000 requests a second, with most batched responses returned in under 1 millisecond.

requests graph

Peer requests forwarded to owning nodes typically respond in under 30 microseconds.

peer requests graph

NOTE The above graphs only report the slowest request within the 1 second sample time. So you are seeing the slowest requests that gubernator fields to clients.

Gubernator allows users to choose non-batching behavior, which would further reduce latency for client rate limit requests. However, because of throughput requirements, our production environment uses Behavior=BATCHING with the default 500 microsecond window. In production we have observed batch sizes of 1,000 during peak API usage. Other users who don’t have the same high traffic demands could disable batching and would see lower latencies, but at the cost of throughput.

Gregorian Behavior

Users may choose a behavior called DURATION_IS_GREGORIAN which changes the meaning of the Duration field. When Behavior is set to DURATION_IS_GREGORIAN, the rate limit is reset whenever the end of the selected Gregorian calendar interval is reached.

This is useful when you want to impose daily or monthly limits on a resource. With this behavior, the limit on the resource resets at the end of the day or month, regardless of when the first rate limit request was received by Gubernator.

Given the following Duration values

  • 0 = Minutes
  • 1 = Hours
  • 2 = Days
  • 3 = Weeks
  • 4 = Months
  • 5 = Years

Examples when using Behavior = DURATION_IS_GREGORIAN (a request sketch follows this list)

  • If Duration = 2 (Days), the rate limit will reset to Current = 0 at the end of the day on which the rate limit was created.
  • If Duration = 0 (Minutes), the rate limit will reset to Current = 0 at the end of the minute in which the rate limit was created.
  • If Duration = 4 (Months), the rate limit will reset to Current = 0 at the end of the month in which the rate limit was created.
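
For instance, a monthly per-account limit might be requested with a payload like the sketch below, modeled on the JSON examples above. Here duration is 4, meaning the Gregorian interval is Months; whether the HTTP gateway accepts the behavior as the enum name shown here or as its numeric value should be confirmed against the proto definitions.

{
  "requests": [
    {
      "name": "emails_per_month",
      "uniqueKey": "account:12345",
      "hits": "1",
      "limit": "1000",
      "duration": "4",
      "behavior": "DURATION_IS_GREGORIAN"
    }
  ]
}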

Reset Remaining Behavior

Users may add the behavior Behavior_RESET_REMAINING to the rate check request. This resets the rate limit as if it had just been created on its first use.

When using Reset Remaining, the Hits field should be 0.
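
A hedged example of such a request, modeled on the payloads above (the exact enum encoding accepted by the HTTP gateway should be confirmed against the proto definitions), with Hits set to 0 as noted:

{
  "requests": [
    {
      "name": "requests_per_sec",
      "uniqueKey": "account:12345",
      "hits": "0",
      "limit": "10",
      "duration": "1000",
      "behavior": "RESET_REMAINING"
    }
  ]
}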

Drain Over Limit Behavior

Users may add the behavior Behavior_DRAIN_OVER_LIMIT to the rate check request. A GetRateLimits call drains the remaining counter on the first over-limit event. Successive GetRateLimits calls then return a zero remaining counter rather than any residual value. This behavior works best with the token bucket algorithm, because the Remaining counter will stay zero after an over-limit event until reset time, whereas the leaky bucket algorithm will immediately update Remaining to a non-zero value.

This facilitates scenarios that require an over limit event to stay over limit until the rate limit resets. This approach is necessary if a process must make two rate checks, before and after a process, and the Hit amount is not known until after the process.

  • Before process: Call GetRateLimits with Hits=0 to check the value of Remaining counter. If Remaining is zero, it's known that the rate limit is depleted and the process can be aborted.
  • After process: Call GetRateLimits with a user specified Hits value. If the call returns over limit, the process cannot be aborted because it had already completed. Using DRAIN_OVER_LIMIT behavior, the Remaining count will be drained to zero.

Once an over limit occurs in the "After" step, successive processes will detect the over limit state in the "Before" step.
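
A rough sketch of this before/after pattern against the HTTP API might look like the following. It is illustrative only: the endpoint and payload shape come from the examples above, while the rateCheck helper, the limit/duration values, and the "DRAIN_OVER_LIMIT" enum string are assumptions to be verified against the proto definitions.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// rateCheck posts a single rate limit request to the HTTP gateway and returns
// the first response object. The payload mirrors the /v1/GetRateLimits
// examples above; the behavior enum string is an assumption.
func rateCheck(hits int) (map[string]any, error) {
	payload := map[string]any{
		"requests": []map[string]any{{
			"name":      "expensive_job",
			"uniqueKey": "account:12345",
			"hits":      fmt.Sprint(hits),
			"limit":     "10",
			"duration":  "60000",
			"behavior":  "DRAIN_OVER_LIMIT", // assumed enum name
		}},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post("http://localhost:9080/v1/GetRateLimits", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Responses []map[string]any `json:"responses"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	// Sketch only: assumes the server returned at least one response object.
	return out.Responses[0], nil
}

func main() {
	// Before: hits=0 only inspects the counter; abort if already depleted.
	// Note the gateway may omit "remaining" entirely when it is zero.
	before, err := rateCheck(0)
	if err != nil {
		log.Fatal(err)
	}
	if rem, ok := before["remaining"]; !ok || rem == "0" {
		fmt.Println("rate limit depleted, aborting")
		return
	}

	// ... run the process, then report the actual cost after the fact.
	after, err := rateCheck(3)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("post-process status:", after["status"])
}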

Gubernator as a library

If you are using Golang, you can use Gubernator as a library. This is useful if you wish to implement a rate limit service with your own company-specific model on top. We do this internally here at Mailgun with a service we creatively called ratelimits, which keeps track of the limits imposed on a per-account basis. In this way you can utilize the power and speed of Gubernator but still layer business logic and integrate domain-specific problems into your rate limiting service.

When you use the library, your service becomes a full member of the cluster, participating in the same consistent hashing and caching as a standalone Gubernator server would. All you need to do is provide the GRPC server instance and tell Gubernator where the peers in your cluster are located. The cmd/gubernator/main.go is a great example of how to use Gubernator as a library.

Optional Disk Persistence

While the Gubernator server currently doesn't directly support disk persistence, the Gubernator library does provide interfaces through which library users can implement persistence. The Gubernator library has two interfaces available for disk persistence. Depending on the use case an implementor can implement the Loader interface and only support persistence of rate limits at startup and shutdown, or users can implement the Store interface and Gubernator will continuously call OnChange() and Get() to keep the in memory cache and persistent store up to date with the latest rate limit data. Both interfaces can be implemented simultaneously to ensure data is always saved to persistent storage.

For those who choose to implement the Store interface, it is not required to store ALL the rate limits received via OnChange(). For instance, if you only wish to persist rate limits with durations longer than a minute, a day or a month, calls to OnChange() can check the duration of a rate limit and persist only those rate limits whose durations exceed a self-determined threshold.
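
As a very rough illustration of that filtering idea, a Store implementation that persists only long-duration rate limits might look like the sketch below. The method signature here is hypothetical; the real Store interface in the Gubernator library defines the exact arguments, so only the filtering logic is the point.

package main

import "fmt"

// Hypothetical sketch only: the real gubernator.Store interface defines the
// exact method signatures; this just illustrates persisting only rate limits
// with long durations, as described above.
type longDurationStore struct {
	minDurationMs int64
	db            map[string][]byte // stand-in for a real persistent backend
}

// OnChange mimics (with a simplified, made-up signature) the hook Gubernator
// calls when a rate limit changes.
func (s *longDurationStore) OnChange(key string, durationMs int64, state []byte) {
	if durationMs < s.minDurationMs {
		return // short-lived limits stay only in the in-memory cache
	}
	s.db[key] = state
}

func main() {
	s := &longDurationStore{minDurationMs: 60 * 60 * 1000, db: map[string][]byte{}}
	s.OnChange("account:12345_requests_per_sec", 1000, []byte("state"))          // skipped
	s.OnChange("account:12345_requests_per_day", 24*60*60*1000, []byte("state")) // persisted
	fmt.Println("persisted keys:", len(s.db))
}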

API

All methods are accessed via GRPC but are also exposed via HTTP using the GRPC Gateway

Health Check

Health check returns unhealthy in the event a peer is reported by etcd or Kubernetes as up, but the server instance is unable to contact that peer via its advertised address.

GRPC
rpc HealthCheck (HealthCheckReq) returns (HealthCheckResp)
HTTP
GET /v1/HealthCheck

Example response:

{
  "status": "healthy",
  "peer_count": 3
}

Get Rate Limit

Rate limits can be applied or retrieved using this interface. If the client makes a request to the server with hits: 0, then the current state of the rate limit is retrieved but not incremented.

GRPC
rpc GetRateLimits (GetRateLimitsReq) returns (GetRateLimitsResp)
HTTP
POST /v1/GetRateLimits

Example Payload

{
  "requests": [
    {
      "name": "requests_per_sec",
      "uniqueKey": "account:12345",
      "hits": "1",
      "limit": "10",
      "duration": "1000"
    }
  ]
}

Example response:

{
  "responses": [
    {
      "status": "UNDER_LIMIT",
      "limit": "10",
      "remaining": "9",
      "reset_time": "1690855128786",
      "error": "",
      "metadata": {
        "owner": "gubernator:81"
      }
    }
  ]
}

Deployment

NOTE: Gubernator uses etcd, Kubernetes or round-robin DNS to discover peers and establish a cluster. If you don't have any of these, the docker-compose method is the simplest way to try gubernator out.

Docker with existing etcd cluster
$ docker run -p 8081:81 -p 9080:80 -e GUBER_ETCD_ENDPOINTS=etcd1:2379,etcd2:2379 \
   ghcr.io/mailgun/gubernator:latest

# Hit the HTTP API at localhost:9080
$ curl http://localhost:9080/v1/HealthCheck
Kubernetes
# Download the kubernetes deployment spec
$ curl -O https://raw.githubusercontent.com/mailgun/gubernator/master/k8s-deployment.yaml

# Edit the deployment file to change the environment config variables
$ vi k8s-deployment.yaml

# Create the deployment (includes headless service spec)
$ kubectl create -f k8s-deployment.yaml
Round-robin DNS

If your DNS service supports auto-registration, for example AWS Route53 service discovery, you can use the same fully-qualified domain name both to let your business logic containers or instances find gubernator and to let gubernator containers/instances find each other.

TLS

Gubernator supports TLS for both HTTP and GRPC connections. You can see an example with self-signed certs by running docker-compose-tls.yaml

# Run docker compose
$ docker-compose -f docker-compose-tls.yaml up -d

# Hit the HTTP API at localhost:9080 (GRPC is at 9081)
$ curl --cacert certs/ca.cert --cert certs/gubernator.pem --key certs/gubernator.key  https://localhost:9080/v1/HealthCheck

Configuration

Gubernator is configured via environment variables with an optional --config flag which takes a file of key/values and places them into the local environment before startup.

See the example.conf for all available config options and their descriptions.
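
A minimal config file sketch is shown below, using only variables that appear elsewhere in this README; example.conf remains the authoritative list of options and defaults.

# Sketch only; see example.conf for all available options and their defaults.
# Discover peers via an existing etcd cluster (GUBER_ETCD_ENDPOINTS is also
# shown in the Deployment section below).
GUBER_ETCD_ENDPOINTS=etcd1:2379,etcd2:2379
# Enable verbose logging while evaluating
GUBER_DEBUG=true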

Architecture

See architecture.md for a full description of the architecture and the inner workings of gubernator.

Monitoring

Gubernator publishes Prometheus metrics for realtime monitoring. See prometheus.md for details.

OpenTelemetry Tracing (OTEL)

Gubernator supports OpenTelemetry. See tracing.md for details.


gubernator's Issues

Is it possible to change some logic for ETCD only setups

I noticed that when using ETCD only for building up the cluster, the IsOwner property is not set, but I can find the "Owner" logic in the local clustering and k8s setups.

Is it possible to add some logic for checking whether the current peer is the "Owner", so that we don't need to get the rate limit via a GRPC call, which could use more CPU time than a simple method call?

The code I'm referring to is the following:

				// If our server instance is the owner of this rate limit
				if peer.info.IsOwner {
					// Apply our rate limit algorithm to the request
					inOut.Out, err = s.getRateLimit(inOut.In)
					if err != nil {
						inOut.Out = &RateLimitResp{
							Error: fmt.Sprintf("while applying rate limit for '%s' - '%s'", globalKey, err),
						}
					}
				} else {
					if HasBehavior(inOut.In.Behavior, Behavior_GLOBAL) {
						inOut.Out, err = s.getGlobalRateLimit(inOut.In)
						if err != nil {
							inOut.Out = &RateLimitResp{Error: err.Error()}
						}

						// Inform the client of the owner key of the key
						inOut.Out.Metadata = map[string]string{"owner": peer.info.Address}

						out <- inOut
						return nil
					}

					// Make an RPC call to the peer that owns this rate limit
					inOut.Out, err = peer.GetPeerRateLimit(ctx, inOut.In)
					if err != nil {
						if IsNotReady(err) {
							attempts++
							goto getPeer
						}
						inOut.Out = &RateLimitResp{
							Error: fmt.Sprintf("while fetching rate limit '%s' from peer - '%s'", globalKey, err),
						}
					}

					// Inform the client of the owner key of the key
					inOut.Out.Metadata = map[string]string{"owner": peer.info.Address}
				}

Ability to reset the rate limit for a key

I have a use case where we'd like to reset the rate limit for a key to 0. I'm currently emulating this by doing a 0 hit request to check the current value, and then doing another request to subtract this.

This method isn't safe under concurrency, however, since concurrent attempts at this could subtract it multiple times.

Feat: support curl in docker image or a /gubernator command for self healthcheck purpose

Docker supports healthchecks, which are very useful for detecting dead nodes and automatically removing them from the cluster (may be a network issue or disk issue). Currently the image is based on the scratch image, which does not contain any command other than /gubernator. I think it would be great if gubernator supported a command like /gubernator -healthcheck that exits with code 0 on a healthy node and code 1 otherwise (docker swarm seems to only support 0 and 1).

Workaround solution: build another image and add curl/wget to it.

Question: On global behavior sideeffects

See https://github.com/mailgun/gubernator/blob/master/architecture.md#side-effects-of-global-behavior

It states: Using GLOBAL can increase the amount of traffic per rate limit request if the cluster is large enough

I assume that with 'the cluster' you mean the cluster of gubernator instances.

Can you give a more detailed indication of what you mean by 'large'? Are we talking about tens or hundreds of gubernator instances? Is it still practical to run 20 instances of gubernator in global mode, or would that already kill the network? Is there a practical limit to the number of gubernator instances running in global mode?

Would be nice if that could be explained in the docs.

TestGlobalRateLimits test fails with DataRace when run with -race option

 WARNING: DATA RACE

 Write at 0x00c000ebde3c by goroutine 91:
   github.com/mailgun/gubernator.(*globalManager).updatePeers()
       /builds/volterra/ves.io/gubernator/gubernator.go:423 +0x1cb
   github.com/mailgun/gubernator.(*globalManager).runBroadcasts.func1()
       /builds/volterra/ves.io/gubernator/global.go:182 +0x287
   github.com/mailgun/holster.(*WaitGroup).Until.func1()
       /go/pkg/mod/github.com/mailgun/[email protected]+incompatible/waitgroup.go:66 +0x55
 Previous read at 0x00c000ebde3c by goroutine 475:
   github.com/mailgun/gubernator.tokenBucket()
       /builds/volterra/ves.io/gubernator/algorithms.go:139 +0x11de
   github.com/mailgun/gubernator.(*Instance).getRateLimit()
       /builds/volterra/ves.io/gubernator/gubernator.go:302 +0x378
   github.com/mailgun/gubernator.(*Instance).GetPeerRateLimits()
       /builds/volterra/ves.io/gubernator/gubernator.go:275 +0x132
   github.com/mailgun/gubernator._PeersV1_GetPeerRateLimits_Handler()
       /builds/volterra/ves.io/gubernator/peers.pb.go:182 +0x2fc
   google.golang.org/grpc.(*Server).processUnaryRPC()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:966 +0x92f
   google.golang.org/grpc.(*Server).handleStream()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:1245 +0x132b
   google.golang.org/grpc.(*Server).serveStreams.func1.1()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:685 +0xc8
 Goroutine 91 (running) created at:
   github.com/mailgun/holster.(*WaitGroup).Until()
       /go/pkg/mod/github.com/mailgun/[email protected]+incompatible/waitgroup.go:64 +0xbb
   github.com/mailgun/gubernator.(*globalManager).runBroadcasts()
       /builds/volterra/ves.io/gubernator/global.go:162 +0x176
   github.com/mailgun/gubernator.newGlobalManager()
       /builds/volterra/ves.io/gubernator/global.go:58 +0x3b4
   github.com/mailgun/gubernator.New()
       /builds/volterra/ves.io/gubernator/gubernator.go:64 +0x2d1
   github.com/mailgun/gubernator/cluster.StartInstance()
       /builds/volterra/ves.io/gubernator/cluster/cluster.go:114 +0x8d
   github.com/mailgun/gubernator/cluster.StartWith()
       /builds/volterra/ves.io/gubernator/cluster/cluster.go:81 +0x243
   github.com/mailgun/gubernator_test.TestMain()
       /builds/volterra/ves.io/gubernator/functional_test.go:36 +0x84
   main.main()
       _testmain.go:82 +0x223
 Goroutine 475 (finished) created at:
   google.golang.org/grpc.(*Server).serveStreams.func1()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:683 +0xb8
   google.golang.org/grpc/internal/transport.(*http2Server).operateHeaders()
       /go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:419 +0x14e7
   google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams()
       /go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:459 +0x39f
   google.golang.org/grpc.(*Server).serveStreams()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:681 +0x19a
   google.golang.org/grpc.(*Server).handleRawConn.func1()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:643 +0x50
 ==================
 ==================
 WARNING: DATA RACE
 Write at 0x00c000ebde20 by goroutine 91:
   github.com/mailgun/gubernator.(*globalManager).updatePeers()
       /builds/volterra/ves.io/gubernator/global.go:201 +0x1e7
   github.com/mailgun/gubernator.(*globalManager).runBroadcasts.func1()
       /builds/volterra/ves.io/gubernator/global.go:182 +0x287
   github.com/mailgun/holster.(*WaitGroup).Until.func1()
       /go/pkg/mod/github.com/mailgun/[email protected]+incompatible/waitgroup.go:66 +0x55
 Previous read at 0x00c000ebde20 by goroutine 475:
   github.com/mailgun/gubernator.tokenBucket()
       /builds/volterra/ves.io/gubernator/algorithms.go:149 +0x1333
   github.com/mailgun/gubernator.(*Instance).getRateLimit()
       /builds/volterra/ves.io/gubernator/gubernator.go:302 +0x378
   github.com/mailgun/gubernator.(*Instance).GetPeerRateLimits()
       /builds/volterra/ves.io/gubernator/gubernator.go:275 +0x132
   github.com/mailgun/gubernator._PeersV1_GetPeerRateLimits_Handler()
       /builds/volterra/ves.io/gubernator/peers.pb.go:182 +0x2fc
   google.golang.org/grpc.(*Server).processUnaryRPC()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:966 +0x92f
   google.golang.org/grpc.(*Server).handleStream()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:1245 +0x132b
   google.golang.org/grpc.(*Server).serveStreams.func1.1()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:685 +0xc8
 Goroutine 91 (running) created at:
   github.com/mailgun/holster.(*WaitGroup).Until()
       /go/pkg/mod/github.com/mailgun/[email protected]+incompatible/waitgroup.go:64 +0xbb
   github.com/mailgun/gubernator.(*globalManager).runBroadcasts()
       /builds/volterra/ves.io/gubernator/global.go:162 +0x176
   github.com/mailgun/gubernator.newGlobalManager()
       /builds/volterra/ves.io/gubernator/global.go:58 +0x3b4
   github.com/mailgun/gubernator.New()
       /builds/volterra/ves.io/gubernator/gubernator.go:64 +0x2d1
   github.com/mailgun/gubernator/cluster.StartInstance()
       /builds/volterra/ves.io/gubernator/cluster/cluster.go:114 +0x8d
   github.com/mailgun/gubernator/cluster.StartWith()
       /builds/volterra/ves.io/gubernator/cluster/cluster.go:81 +0x243
   github.com/mailgun/gubernator_test.TestMain()
       /builds/volterra/ves.io/gubernator/functional_test.go:36 +0x84
   main.main()
       _testmain.go:82 +0x223
 Goroutine 475 (finished) created at:
   google.golang.org/grpc.(*Server).serveStreams.func1()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:683 +0xb8
   google.golang.org/grpc/internal/transport.(*http2Server).operateHeaders()
       /go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:419 +0x14e7
   google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams()
       /go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:459 +0x39f
   google.golang.org/grpc.(*Server).serveStreams()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:681 +0x19a
   google.golang.org/grpc.(*Server).handleRawConn.func1()
       /go/pkg/mod/google.golang.org/[email protected]/server.go:643 +0x50

 ==================
 --- FAIL: TestGlobalRateLimits (1.01s)
     testing.go:853: race detected during execution of test

Bug: division by 0 with leaky bucket when rate is higher than duration

On release 0.8, when the rate (2000) is higher than the duration (1000), using leaky bucket gives the following error.

time="2020-05-27T10:45:00Z" level=info msg="Gubernator Listening on 0.0.0.0:81 ..." category=server
time="2020-05-27T10:45:00Z" level=info msg="HTTP Gateway Listening on 0.0.0.0:80 ..." category=server
time="2020-05-27T10:45:07Z" level=info msg="Peers updated" category=gubernator peers="[{172.17.0.4:81 true}]"
panic: runtime error: integer divide by zero
goroutine 91 [running]:
github.com/mailgun/gubernator.leakyBucket(0x0, 0x0, 0x1715200, 0xc000297d60, 0xc0003d2940, 0x0, 0x0, 0x0)
	/go/src/algorithms.go:235 +0xc84
github.com/mailgun/gubernator.(*Instance).getRateLimit(0xc0002d65b0, 0xc0003d2940, 0x0, 0x0, 0x0)
	/go/src/gubernator.go:304 +0x110
github.com/mailgun/gubernator.(*Instance).GetRateLimits.func1.1(0x13e18e0, 0xc000462ba0, 0x0, 0x0)
	/go/src/gubernator.go:170 +0x673
github.com/mailgun/holster.(*FanOut).Run.func1(0xc000468ae0, 0x13e18e0, 0xc000462ba0, 0xc0003d2980)
	/go/pkg/mod/github.com/mailgun/[email protected]+incompatible/fanout.go:66 +0x3e
created by github.com/mailgun/holster.(*FanOut).Run
	/go/pkg/mod/github.com/mailgun/[email protected]+incompatible/fanout.go:65 +0x7b

This is caused by the fact that integers are used and when duration (1000) is lower than rate (2000) this results in a 0 value which is later used in a division.

This could be fixed by using floats. See the embarrassingly ugly sed statements below.

sed -i 's/rate := duration \/ r.Limit/rate := float64(duration) \/ float64(r.Limit)/g' ./gubernator/algorithms.go
sed -i 's/rate = d \/ r.Limit/rate = float64(d) \/ float64(r.Limit)/g' ./gubernator/algorithms.go
sed -i 's/leak := int64(elapsed \/ rate)/leak := int64(float64(elapsed) \/ rate)/g' ./gubernator/algorithms.go
sed -i 's/ResetTime: now + rate,/ResetTime: now + int64(rate),/g' ./gubernator/algorithms.go

Allow multiple behavior flags in request

When I saw that the behavior "flags" were bit fields I assumed that I could combine them:

    // Behavior: no_batching=1, global=2, duration_is_gregorian=4, reset_remaining=8, multi_region=16
    {
        "requests":[{
                "name": "name",
                "unique_key": "key",
                "hits": 1,
                "duration": 60000,
                "limit": 10,
                "behavior": 1 | 2
        }]
    }

But I get:

{"responses":[{"error":"behavior DURATION_IS_GREGORIAN is set; but `Duration` is not a valid gregorian interval","metadata":{"owner":"gubernator-1:91"}}]}

I want to be able to set the behavior in a single request. If I'm rate limiting something bursty that has a config expiring regularly then I need special handling to make sure that the behavior settings haven't reset. If I can send the complete behavior in a single request it simplifies the client.

I don't know how hard or easy this would be or if it's desirable within the vision of this project but I could possibly help.

Support use zookeeper as peer discovery

First of all, thank you for your work. This project is so awesome! 👍

After reading the Gubernator documentation, I know it can use etcd or Kubernetes to discover peers and establish a cluster. Do you have any plans to add ZooKeeper peer discovery support in the near future? Thanks!

Failure in TestHTTPSClientAuth

Seen with the latest code (using go 1.16.6).

nsheth-mbp:gubernator[86505]git remote -vv
origin https://github.com/mailgun/gubernator.git (fetch)
origin https://github.com/mailgun/gubernator.git (push)
nsheth-mbp:gubernator[86506]git branch
* master
nsheth-mbp:gubernator[86507]git log -n1
commit fa69d95 (HEAD -> master, tag: v2.0.0-rc.2, origin/master, origin/HEAD)
Merge: cc98f3a 92b79a8
Author: Maxim Vladimirskiy [email protected]
Date: Mon Jul 5 11:07:27 2021 +0300

Merge pull request #96 from mailgun/maxim/develop

PIP-1319: Use the latest protobuf

nsheth-mbp:gubernator[86508]go test -v -vet off -run TestHTTPSClientAuth
INFO[0000] GRPC Listening on 127.0.0.1:9990 ... instance="127.0.0.1:9990"
INFO[0000] HTTP Gateway Listening on 127.0.0.1:9980 ... instance="127.0.0.1:9990"
INFO[0000] GRPC Listening on 127.0.0.1:9991 ... instance="127.0.0.1:9991"
INFO[0000] HTTP Gateway Listening on 127.0.0.1:9981 ... instance="127.0.0.1:9991"
INFO[0000] GRPC Listening on 127.0.0.1:9992 ... instance="127.0.0.1:9992"
INFO[0000] HTTP Gateway Listening on 127.0.0.1:9982 ... instance="127.0.0.1:9992"
INFO[0000] GRPC Listening on 127.0.0.1:9993 ... instance="127.0.0.1:9993"
INFO[0000] HTTP Gateway Listening on 127.0.0.1:9983 ... instance="127.0.0.1:9993"
INFO[0000] GRPC Listening on 127.0.0.1:9994 ... instance="127.0.0.1:9994"
INFO[0000] HTTP Gateway Listening on 127.0.0.1:9984 ... instance="127.0.0.1:9994"
INFO[0000] GRPC Listening on 127.0.0.1:9995 ... instance="127.0.0.1:9995"
INFO[0000] HTTP Gateway Listening on 127.0.0.1:9985 ... instance="127.0.0.1:9995"
INFO[0000] GRPC Listening on 127.0.0.1:9890 ... instance="127.0.0.1:9890"
INFO[0000] HTTP Gateway Listening on 127.0.0.1:9880 ... instance="127.0.0.1:9890"
INFO[0000] GRPC Listening on 127.0.0.1:9891 ... instance="127.0.0.1:9891"
INFO[0000] HTTP Gateway Listening on 127.0.0.1:9881 ... instance="127.0.0.1:9891"
INFO[0000] GRPC Listening on 127.0.0.1:9892 ... instance="127.0.0.1:9892"
INFO[0000] HTTP Gateway Listening on 127.0.0.1:9882 ... instance="127.0.0.1:9892"
INFO[0000] GRPC Listening on 127.0.0.1:9893 ... instance="127.0.0.1:9893"
INFO[0000] HTTP Gateway Listening on 127.0.0.1:9883 ... instance="127.0.0.1:9893"
=== RUN TestHTTPSClientAuth
INFO[0000] Detected TLS Configuration category=gubernator
INFO[0000] GRPC Listening on 127.0.0.1:9695 ... category=gubernator
INFO[0000] GRPC Gateway Listening on 127.0.0.1:52208 ... category=gubernator
INFO[0000] HTTPS Gateway Listening on 127.0.0.1:9685 ... category=gubernator
INFO[0000] http: TLS handshake error from 127.0.0.1:52209: EOF category=gubernator
tls_test.go:289:
Error Trace: tls_test.go:289
Error: Not equal:
expected: "{"status":"healthy","message":"","peerCount":0}"
actual : "{"status":"healthy", "message":"", "peerCount":0}"

                            Diff:
                            --- Expected
                            +++ Actual
                            @@ -1 +1 @@
                            -{"status":"healthy","message":"","peerCount":0}
                            +{"status":"healthy", "message":"", "peerCount":0}
            Test:           TestHTTPSClientAuth

INFO[0000] HTTP Gateway close for 127.0.0.1:9685 ... category=gubernator
INFO[0000] GRPC close for 127.0.0.1:9695 ... category=gubernator
INFO[0000] GRPC close for 127.0.0.1:52208 ... category=gubernator
--- FAIL: TestHTTPSClientAuth (0.17s)
FAIL
exit status 1
FAIL github.com/mailgun/gubernator/v2 0.573s

nsheth-mbp:gubernator[86510]go version
go version go1.16.6 darwin/amd64

nsheth-mbp:gubernator[86511]uname -a
Darwin nsheth-mbp.local 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64 x86_64

All tests pass with the following change:

diff --git a/tls_test.go b/tls_test.go
index e9af60f..4a485dd 100644
--- a/tls_test.go
+++ b/tls_test.go
@@ -286,5 +286,5 @@ func TestHTTPSClientAuth(t *testing.T) {
        require.NoError(t, err)
        b, err := ioutil.ReadAll(resp.Body)
        require.NoError(t, err)
-       assert.Equal(t, `{"status":"healthy","message":"","peerCount":0}`, string(b))
+       assert.Equal(t, `{"status":"healthy", "message":"", "peerCount":0}`, string(b))
 }

Connecting to Gubernator on Kubernetes

We want to use Gubernator on our project and have implemented it with local builds using docker-compose. Now we are trying to set up a gubernator cluster on our dev Kubernetes environment to test it out further.

We are running into some issues though where we get connection refused on our http calls to Gubernator

Logs from our ingest service which tries to connect to gubernator

14:34:23.375 ERROR i.o.u.service.http.HttpErrorHandler - Unexpected error for POST /api/v3/data/ingest/int-oblx-monitor?timestampPrecision=milliseconds
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: gubernator-http.gubernator.svc.cluster.local/10.100.135.70:9080                                                                                             
Caused by: java.net.ConnectException: Connection refused at java.base/sun.nio.ch.Net.pollConnect(Native Method)

Logs from our Gubernator cluster, currently only 1 node active to make debugging easier

time="2021-01-12T14:18:40Z" level=info msg="Gubernator 1.0.0-rc.3 (amd64/linux)"
time="2021-01-12T14:18:40Z" level=debug msg="Debug enabled"
time="2021-01-12T14:18:40Z" level=debug msg="K8s peer pool config found"
time="2021-01-12T14:18:40Z" level=debug msg="(gubernator.DaemonConfig) {\n GRPCListenAddress: (string) (len=12) \"localhost:81\",\n HTTPListenAddress: (string) (len=12) \"localhost:80\",\n AdvertiseAddress: (string) (len=12) \"localhost:81\",\n CacheSize: (int) 50000,\n Behaviors: (gubernator.BehaviorConfig) {\n  BatchTimeout: (time.Duration) 0s,\n  BatchWait: (time.Duration) 0s,\n  BatchLimit: (int) 0,\n  GlobalSyncWait: (time.Duration) 0s,\n  GlobalTimeout: (time.Duration) 0s,\n  GlobalBatchLimit: (int) 0,\n  MultiRegionSyncWait: (time.Duration) 0s,\n  MultiRegionTimeout: (time.Duration) 0s,\n  MultiRegionBatchLimit: (int) 0\n },\n DataCenter: (string) \"\",\n PeerDiscoveryType: (string) (len=3) \"k8s\",\n EtcdPoolConf: (gubernator.EtcdPoolConfig) {\n  AdvertiseAddress: (string) (len=12) \"localhost:81\",\n  Client: (*clientv3.Client)(<nil>),\n  OnUpdate: (gubernator.UpdateFunc) <nil>,\n  KeyPrefix: (string) (len=17) \"/gubernator-peers\",\n  EtcdConfig: (*clientv3.Config)(0xc00029c420)({\n   Endpoints: ([]string) (len=1 cap=1) {\n    (string) (len=14) \"localhost:2379\"\n   },\n   AutoSyncInterval: (time.Duration) 0s,\n   DialTimeout: (time.Duration) 5s,\n   DialKeepAliveTime: (time.Duration) 0s,\n   DialKeepAliveTimeout: (time.Duration) 0s,\n   MaxCallSendMsgSize: (int) 0,\n   MaxCallRecvMsgSize: (int) 0,\n   TLS: (*tls.Config)(<nil>),\n   Username: (string) \"\",\n   Password: (string) \"\",\n   RejectOldCluster: (bool) false,\n   DialOptions: ([]grpc.DialOption) <nil>,\n   LogConfig: (*zap.Config)(<nil>),\n   Context: (context.Context) <nil>,\n   PermitWithoutStream: (bool) false\n  }),\n  Logger: (logrus.FieldLogger) <nil>,\n  DataCenter: (string) \"\"\n },\n K8PoolConf: (gubernator.K8sPoolConfig) {\n  Logger: (logrus.FieldLogger) <nil>,\n  OnUpdate: (gubernator.UpdateFunc) <nil>,\n  Namespace: (string) (len=10) \"gubernator\",\n  Selector: (string) (len=14) \"app=gubernator\",\n  PodIP: (string) (len=10) \"10.244.2.8\",\n  PodPort: (string) (len=2) \"81\"\n },\n MemberListPoolConf: (gubernator.MemberListPoolConfig) {\n  MemberListAddress: (string) (len=14) \"localhost:7946\",\n  AdvertiseAddress: (string) (len=12) \"localhost:81\",\n  KnownNodes: ([]string) {\n  },\n  OnUpdate: (gubernator.UpdateFunc) <nil>,\n  NodeName: (string) \"\",\n  Logger: (logrus.FieldLogger) <nil>,\n  DataCenter: (string) \"\"\n },\n Picker: (gubernator.PeerPicker) <nil>,\n Logger: (logrus.FieldLogger) <nil>,\n TLS: (*gubernator.TLSConfig)(<nil>)\n}\n"
time="2021-01-12T14:18:40Z" level=info msg="GRPC Listening on localhost:81 ..." category=gubernator
time="2021-01-12T14:18:40Z" level=debug msg="Queue (Add) 'gubernator/gubernator-http' - <nil>" category=gubernator
time="2021-01-12T14:18:40Z" level=debug msg="Queue (Add) 'gubernator/gubernator' - <nil>" category=gubernator
time="2021-01-12T14:18:40Z" level=info msg="HTTP Gateway Listening on localhost:80 ..." category=gubernator
time="2021-01-12T14:18:41Z" level=debug msg="Queue (Update) 'gubernator/gubernator-http' - <nil>" category=gubernator
time="2021-01-12T14:18:41Z" level=debug msg="Fetching peer list from endpoints API" category=gubernator
time="2021-01-12T14:18:41Z" level=debug msg="Peer: {DataCenter: HTTPAddress: GRPCAddress:10.244.2.8:81 IsOwner:true}\n" category=gubernator
time="2021-01-12T14:18:41Z" level=debug msg="Peer: {DataCenter: HTTPAddress: GRPCAddress:10.244.2.8:81 IsOwner:true}\n" category=gubernator
time="2021-01-12T14:18:41Z" level=debug msg="peers updated" category=gubernator peers="[{  10.244.2.8:81 true} {  10.244.2.8:81 true}]"
time="2021-01-12T14:18:41Z" level=debug msg="Queue (Update) 'gubernator/gubernator' - <nil>" category=gubernator

We deployed using the default k8s-deployment.yaml from the repository, we then added a non-headless service as well as the standard headless one and used that one for the URL but still the same results.

Thoughts

  • Is it possible that the HTTP server is not started correctly or bound to the correct address? The logs say it is listening on localhost:80, but in further debug log messages the HTTPAddress is blank, like this: time="2021-01-12T13:39:04Z" level=debug msg="Peer: {DataCenter: HTTPAddress: GRPCAddress:10.244.3.110:81 IsOwner:false}\n" category=gubernator
  • Do we need to uncomment the port description in the default k8s-deployment file? Have tried both with and without but no difference so far

Our current k8s-deployment.yaml which includes the extra service

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gubernator
  namespace: gubernator
  labels:
    app: gubernator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gubernator
  template:
    metadata:
      labels:
        app: gubernator
    spec:
      serviceAccountName: gubernator
      containers:
        - image: thrawn01/gubernator:latest
          imagePullPolicy: IfNotPresent
          ports:
            - name: grpc-port
              containerPort: 81
            - name: http-port
              containerPort: 80
          name: gubernator
          env:
          - name: GUBER_K8S_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: GUBER_K8S_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          # Use the k8s API for peer discovery
          - name: GUBER_PEER_DISCOVERY_TYPE
            value: "k8s"
          # This should match the port number GRPC is listening on
          # as defined by `containerPort`
          - name: GUBER_K8S_POD_PORT
            value: "81"
          # The selector used when listing endpoints. This selector
          # should only select gubernator peers.
          - name: GUBER_K8S_ENDPOINTS_SELECTOR
            value: "app=gubernator"
          # Enable debug for diagnosing issues
          - name: GUBER_DEBUG
            value: "true"
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: gubernator
  labels:
    app: gubernator
spec:
  clusterIP: None
  # ports:
  # - name: grpc-port
  #   targetPort: 81
  #   protocol: TCP
  #   port: 81
  # - name: http-port
  #   targetPort: 80
  #   protocol: TCP
  #   port: 80
  selector:
    app: gubernator
---
apiVersion: v1
kind: Service
metadata:
  name: gubernator-http
  labels:
    app: gubernator
spec:
  ports:
  - name: http-port
    targetPort: http-port
    protocol: TCP
    port: 9080
  selector:
    app: gubernator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gubernator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: get-endpoints
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: get-endpoints
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: get-endpoints
subjects:
- kind: ServiceAccount
  name: gubernator

TestLeakyBucket & TestTokenBucket fail occasionally.

=== RUN TestTokenBucket
--- FAIL: TestTokenBucket (0.03s)
functional_test.go:141:
Error Trace: functional_test.go:141
Error: Not equal:
expected: 0
actual : 1
Test: TestTokenBucket
=== RUN TestLeakyBucket
--- FAIL: TestLeakyBucket (0.06s)
functional_test.go:202:
Error Trace: functional_test.go:202
Error: Not equal:
expected: 0
actual : 1
Test: TestLeakyBucket
Messages: 2
functional_test.go:202:
Error Trace: functional_test.go:202
Error: Not equal:
expected: 1
actual : 2
Test: TestLeakyBucket
Messages: 3

Active expiration for cache keys

Currently the expiration of keys is determined passively (i.e. whenever we try to access a particular key having InvalidAt or ExpireAt < now). But this effectively means that a key that is never accessed again may reside in memory forever, which may be a bottleneck in scenarios where we limit requests that are not repeated frequently and the gubernator node isn't restarted for a long period of time.

Also, this may make us consider other parameters like maxmemory and various eviction policies, etc.

Do we have any planned way to mitigate this (or does it already exist and I missed it)?

document error response

The API documentation does not state anything on the format of errors that can be returned when doing calls to gubernator.

On an error you can still get an HTTP 200 back, but a totally different response (for instance the one below, seen during the restart of a rate limiter pod).

Would be good to document the possible error formats.

{"responses":[
    {"error":"while finding peer that owns rate limit 'xxxx' - 'unable to pick a peer; pool is empty'"},
    {"error":"while finding peer that owns rate limit 'yyyy' - 'unable to pick a peer; pool is empty'"}
]}

Leaky bucket remaining gets reset to limit every duration. Question on behavior change

I was doing some testing against the latest version and I noticed a change to how the leaky bucket algorithm is limiting requests over time. It seems that every duration ms the remaining resets back to limit. The result is that the client can make almost 2*limit hits per duration.

For my test I did one request per 100ms with 1 hit, a limit of 3 and a duration of 1000ms. I can see that roughly every 1000ms gubernator responds with "remaining":2 even though 100ms prior, it responded with OVER_LIMIT

{"limit":3,"remaining":2,"reset_time":1640135081466}
{"limit":3,"remaining":1,"reset_time":1640135081905}
{"limit":3,"reset_time":1640135082341}
{"status":1,"limit":3,"reset_time":1640135082446}
{"limit":3,"reset_time":1640135082552}
{"status":1,"limit":3,"reset_time":1640135082655}
{"status":1,"limit":3,"reset_time":1640135082763}
{"status":1,"limit":3,"reset_time":1640135082869}
{"limit":3,"reset_time":1640135082976}
{"status":1,"limit":3,"reset_time":1640135083079}
{"status":1,"limit":3,"reset_time":1640135083185}
{"limit":3,"remaining":2,"reset_time":1640135082627}
{"limit":3,"remaining":1,"reset_time":1640135083064}
{"limit":3,"reset_time":1640135083503}
{"status":1,"limit":3,"reset_time":1640135083608}
{"limit":3,"reset_time":1640135083715}
{"status":1,"limit":3,"reset_time":1640135083820}
{"status":1,"limit":3,"reset_time":1640135083926}
{"status":1,"limit":3,"reset_time":1640135084034}
{"limit":3,"reset_time":1640135084143}
{"status":1,"limit":3,"reset_time":1640135084248}
{"status":1,"limit":3,"reset_time":1640135084354}
{"limit":3,"remaining":2,"reset_time":1640135083790}
{"limit":3,"remaining":1,"reset_time":1640135084227}
{"limit":3,"reset_time":1640135084663}
{"status":1,"limit":3,"reset_time":1640135084769}
{"limit":3,"reset_time":1640135084876}
{"status":1,"limit":3,"reset_time":1640135084982}
{"status":1,"limit":3,"reset_time":1640135085088}
{"status":1,"limit":3,"reset_time":1640135085197}
{"limit":3,"reset_time":1640135085300}
{"status":1,"limit":3,"reset_time":1640135085405}
{"status":1,"limit":3,"reset_time":1640135085508}
{"limit":3,"remaining":2,"reset_time":1640135084947}
{"limit":3,"remaining":1,"reset_time":1640135085386}
{"limit":3,"reset_time":1640135085823}
{"status":1,"limit":3,"reset_time":1640135085927}
{"limit":3,"reset_time":1640135086033}
{"status":1,"limit":3,"reset_time":1640135086140}
{"status":1,"limit":3,"reset_time":1640135086247}
{"status":1,"limit":3,"reset_time":1640135086351}
{"limit":3,"reset_time":1640135086459}
{"status":1,"limit":3,"reset_time":1640135086565}
{"status":1,"limit":3,"reset_time":1640135086668}
{"limit":3,"remaining":2,"reset_time":1640135086108}
{"limit":3,"remaining":1,"reset_time":1640135086546}
{"limit":3,"reset_time":1640135086985}
{"status":1,"limit":3,"reset_time":1640135087090}
0-1  83.3%  ███▋  5
1-2  66.7%  ██▉   4
2-3  83.3%  ███▋  5
3-4  83.3%  ███▋  5
4-5  66.7%  ██▉   4
23 requests in 5.008026 seconds. 25 requests blocked.

From my reading of the code the bucket has always had this resetting behavior, but the expiration was so long and you couldn't observe it easily. That's because prior to this change: ca70e94#diff-d0d6d4bbf5e9163a8ce16f17cbdf20904932307c2b182a93de16a4e90026c766 the bucket expired now*duration instead of now+duration.

-		c.UpdateExpiration(r.HashKey(), now*duration)
+		rl.ResetTime = now + (rl.Limit - rl.Remaining) * int64(rate)
+		c.UpdateExpiration(r.HashKey(), now + duration)

My expectation of the leaky bucket algorithm is that after remaining drops to 0, it will empty at a rate of one hit per duration/limit milliseconds. Is that expectation correct, or is this cache expiration behavior expected?

go get doesn't work - broken k8s.io import

Hey! 👋
When trying to use go get I currently get these failures. I believe updating the Kubernetes dependencies should help.
Let me know if I should go ahead and do the PR or you want to do it.


../../../../../pkg/mod/k8s.io/[email protected]/discovery/discovery_client.go:29:2: module github.com/googleapis/gnostic@latest found (v0.5.3), but does not contain package github.com/googleapis/gnostic/OpenAPIv2
../../../../../pkg/mod/k8s.io/[email protected]/kubernetes/scheme/register.go:26:2: module k8s.io/api@latest found (v0.19.4), but does not contain package k8s.io/api/auditregistration/v1alpha1
Error: get command failed: get: 0: getting v1.0.0-rc.3: go: finding module for package k8s.io/api/auditregistration/v1alpha1
go: finding module for package github.com/googleapis/gnostic/OpenAPIv2
../../../../../pkg/mod/k8s.io/[email protected]/discovery/discovery_client.go:29:2: module github.com/googleapis/gnostic@latest found (v0.5.3), but does not contain package github.com/googleapis/gnostic/OpenAPIv2
../../../../../pkg/mod/k8s.io/[email protected]/kubernetes/scheme/register.go:26:2: module k8s.io/api@latest found (v0.19.4), but does not contain package k8s.io/api/auditregistration/v1alpha1

Update grpc-gateway module

Hello and first thank you for your great work.

When I use gubernator as a client library I get this warning message:

WARNING: Package "github.com/golang/protobuf/protoc-gen-go/generator" is deprecated.
	A future release of golang/protobuf will delete this package,
	which has long been excluded from the compatibility promise.

Since this is a known issue, I guess the problem is here, and as described in the original issue, grpc-gateway should be updated to a version later than v1.14.5.

Goroutine leak in `Instance.SetPeers`

I'm currently testing a Kubernetes deployment of gubernator, and noticed the number of goroutines are growing linearly over time. Here's the graph we're seeing from our monitoring system.

(screenshot: goroutine count graph from our monitoring system)

This also results in a similar leak in memory. In order to diagnose this, I've deployed a version of gubernator with pprof endpoints enabled, and found that goroutines grow in 3 functions:

  1. PeerClient.run
  2. Interval.run
  3. grpc.Dial

The root cause seems to be that in Instance.SetPeer, a new PeerClient is created for every PeerInfo without reusing any existing PeerClients. This causes a goroutine leak linearly proportional to the number of peers. In addition, there is no shutdown code for removed peers, so this code should leak a goroutine for every peer that is removed.

I suspect that this was exposed by some weird interaction with the Kubernetes integration and our test environment, since I see peer update logs every few minutes.

I have a fork where I'm testing the following changes that I would be happy to send a pull request for:

  1. Add a Shutdown method to PeerClient that will close the request queue
  2. Change PeerClient.run to send any enqueued requests when the request queue is closed
  3. Change PeerClient.run to call interval.Stop on return
  4. Reuse existing PeerClients inside Instance.SetPeers
  5. Shutdown any removed PeerClients inside Instance.SetPeers

Feature RFC: Sliding Window Rate Limiting

Hi team,

thanks for all the great work so far, the design of gubernator looks rock solid and sensible so far!

We're currently evaluating the project for a number of use cases (Golang / Python shop here as well), and so far I see the Token Bucket and Leaky Bucket algorithms already implemented, both of them having some drawbacks.

Therefore, we wanted to ask if there is interest in having a sliding window algorithm implemented as well. If the maintainers think this is wanted and feasible, and the algorithm is reasonably pluggable, we'd be happy to contribute to getting this implemented, by means of reviews, sponsorship and/or investing our own development resources.

Some background, citing from https://konghq.com/blog/how-to-design-a-scalable-rate-limiting-algorithm/ :

Sliding Window

This is a hybrid approach that combines the low processing cost of the fixed window algorithm, and the improved boundary conditions of the sliding log. Like the fixed window algorithm, we track a counter for each fixed window. Next, we account for a weighted value of the previous window’s request rate based on the current timestamp to smooth out bursts of traffic. For example, if the current window is 25% through, then we weight the previous window’s count by 75%. The relatively small number of data points needed to track per key allows us to scale and distribute across large clusters.

We recommend the sliding window approach because it gives the flexibility to scale rate limiting with good performance. The rate windows are an intuitive way to present rate limit data to API consumers. It also avoids the starvation problem of leaky bucket, and the bursting problems of fixed window implementations.

docker-compose.yml is missing etcd service

The current docker-compose.yaml (https://raw.githubusercontent.com/mailgun/gubernator/master/docker-compose.yaml) is missing the etcd service settings.

The etcd service used to be:

  etcd:
    image: quay.io/coreos/etcd:v3.2
    command: >
      /usr/local/bin/etcd
      -name etcd0
      -advertise-client-urls http://etcd:2379
      -listen-client-urls http://0.0.0.0:2379
      -initial-advertise-peer-urls http://0.0.0.0:2380
      -listen-peer-urls http://0.0.0.0:2380
      -initial-cluster-token etcd-cluster-1
      -initial-cluster etcd0=http://0.0.0.0:2380
      -initial-cluster-state new  

Also the wording for "Docker compose" at https://hub.docker.com/r/thrawn01/gubernator perhaps needs to be updated too.

Thanks for the great service!

Cross-DC discovery + updates?

@thrawn01 The architecture docs don’t seem to cover this. How is multi-geo/multi-cluster intended to work? Are you doing full peering across all of your clusters?

Also, would be interesting to understand better why you chose this route rather than simply shipping a rate limiting library or sidecar hitting a pre-existing memory cache. By “no deployment synchronization with a dependent service” are you referring to data structure changes?

how could i change the Limit before Cache expire?

For request below:

{
  "requests": [
    {
      "name": "requests_per_sec",
      "unique_key": "account.id=1235",
      "hits": 1,
      "duration": 60000,
      "limit": 100,
      "behavior": 1
    }
  ]
}

i get response:

{
  "responses": [
    {
      "limit": "100",
      "remaining": "99",
      "reset_time": "1577360227155",
      "metadata": {
        "owner": "127.0.0.1:8880"
      }
    }
  ]
}

then, before the reset_time is reached, I submit a request with the same unique_key but a different limit of 10:

{
  "requests": [
    {
      "name": "requests_per_sec",
      "unique_key": "account.id=1235",
      "hits": 1,
      "duration": 60000,
      "limit": 10, 
      "behavior": 1
    }
  ]
}

the limit in the response did not change; it's still 100:

{
  "responses": [
    {
      "limit": "100",
      "remaining": "98",
      "reset_time": "1577360227155",
      "metadata": {
        "owner": "127.0.0.1:8880"
      }
    }
  ]
}

So how can I change the Limit before the reset_time?

'reset_time' not always included in the response

There is a small discrepancy in return value between 0 = Token Bucket and 1 = Leaky Bucket algorithm.

When choosing Leaky Bucket, no reset_time is returned in the response for the first few requests; in fact, not until after the request limit is hit for the first time.
Example:

200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '11', 'content-length': '0'}) → b''
200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '10', 'content-length': '0'}) → b''
200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '9', 'content-length': '0'}) → b''
200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '8', 'content-length': '0'}) → b''
200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '7', 'content-length': '0'}) → b''
200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '6', 'content-length': '0'}) → b''
200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '5', 'content-length': '0'}) → b''
200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '4', 'content-length': '0'}) → b''
200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '3', 'content-length': '0'}) → b''
200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '2', 'content-length': '0'}) → b''
200: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '1', 'content-length': '0'}) → b''
429: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '0', 'content-length': '66'}) → b'{"id": "too_many_requests", "message": "API Rate limit exceeded."}'
429: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '0', 'ratelimit-reset': '1573555218615', 'content-length': '66'}) → b'{"id": "too_many_requests", "message": "API Rate limit exceeded."}'
429: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '0', 'ratelimit-reset': '1573555218643', 'content-length': '66'}) → b'{"id": "too_many_requests", "message": "API Rate limit exceeded."}'
429: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '0', 'ratelimit-reset': '1573555218662', 'content-length': '66'}) → b'{"id": "too_many_requests", "message": "API Rate limit exceeded."}'
429: Headers({'ratelimit-limit': '12', 'ratelimit-remaining': '0', 'ratelimit-reset': '1573555218690', 'content-length': '66'}) → b'{"id": "too_many_requests", "message": "API Rate limit exceeded."}'

When choosing Token Bucket reset_time is always returned.

This behavior makes sense if you think about how it is implemented. But might be worth noting in the README / documentation.

How to apply the rate_limits rules?

I've checked how gubernator is used.
I see there is a k8s-deployment.yaml for deploying gubernator itself to the K8S cluster.
But how can I deploy the rate limit rules, and where do they get deployed to? Is there a YAML example?

Thanks
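
As the examples elsewhere in this document show, the limit, duration and algorithm travel inside every GetRateLimits request, so there is no separate rule object to deploy to the cluster; the "rule" lives in the calling service (code, flags, a ConfigMap, wherever you keep config). A minimal Go sketch of a client carrying its own rule (names and values are illustrative):

package main

import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
	"os"
)

// A "rule" is just the per-request config gubernator expects.
type rule struct {
	Name      string `json:"name"`
	UniqueKey string `json:"unique_key"`
	Hits      int    `json:"hits"`
	Limit     int    `json:"limit"`
	Duration  int    `json:"duration"` // milliseconds
}

func check(r rule) (*http.Response, error) {
	payload, _ := json.Marshal(map[string][]rule{"requests": {r}})
	return http.Post("http://localhost:9080/v1/GetRateLimits",
		"application/json", bytes.NewReader(payload))
}

func main() {
	// Enforce "100 requests per minute per account" without deploying any
	// rule objects to the cluster; only gubernator itself is deployed.
	resp, err := check(rule{
		Name:      "requests_per_min",
		UniqueKey: "account:12345",
		Hits:      1,
		Limit:     100,
		Duration:  60000,
	})
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body) // prints limit/remaining/reset_time
}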

Peer detection not detecting POD restarts and incorrect rate limit results

  • Minikube + locally built gubernator 0.7.1 docker container
  • In the default namespace with cluster admin permissions, a deployment with 3 instances of gubernator (https://github.com/mailgun/gubernator#kubernetes). No other changes were made to the deployment.

So far all is fine...

  • Calling /v1/HealthCheck on all 3 endpoints will return {"status":"healthy","peer_count":3}
  • Ratelimit requests (token) will give the expected result (remaining/throttled)
  • Running kubectl logs <pod> -f on 2 of the 3 pods will show something like the following:
time="2019-12-13T12:51:54Z" level=debug msg="Debug enabled"
time="2019-12-13T12:51:54Z" level=debug msg="K8s peer pool config found"
(main.ServerConfig) {
 GRPCListenAddress: (string) (len=10) "0.0.0.0:81",
 EtcdAdvertiseAddress: (string) (len=12) "127.0.0.1:81",
 HTTPListenAddress: (string) (len=10) "0.0.0.0:80",
 EtcdKeyPrefix: (string) (len=17) "/gubernator-peers",
 CacheSize: (int) 50000,
 EtcdConf: (clientv3.Config) {
  Endpoints: ([]string) (len=1 cap=1) {
   (string) (len=14) "localhost:2379"
  },
  AutoSyncInterval: (time.Duration) 0s,
  DialTimeout: (time.Duration) 5s,
  DialKeepAliveTime: (time.Duration) 0s,
  DialKeepAliveTimeout: (time.Duration) 0s,
  MaxCallSendMsgSize: (int) 0,
  MaxCallRecvMsgSize: (int) 0,
  TLS: (*tls.Config)(<nil>),
  Username: (string) "",
  Password: (string) "",
  RejectOldCluster: (bool) false,
  DialOptions: ([]grpc.DialOption) <nil>,
  Context: (context.Context) <nil>
 },
 Behaviors: (gubernator.BehaviorConfig) {
  BatchTimeout: (time.Duration) 0s,
  BatchWait: (time.Duration) 0s,
  BatchLimit: (int) 0,
  GlobalSyncWait: (time.Duration) 0s,
  GlobalTimeout: (time.Duration) 0s,
  GlobalBatchLimit: (int) 0
 },
 K8PoolConf: (gubernator.K8sPoolConfig) {
  OnUpdate: (gubernator.UpdateFunc) <nil>,
  Namespace: (string) (len=7) "default",
  Selector: (string) (len=14) "app=gubernator",
  PodIP: (string) (len=11) "172.17.0.18",
  PodPort: (string) (len=2) "81",
  Enabled: (bool) true
 }
}
time="2019-12-13T12:51:54Z" level=info msg="Gubernator Listening on 0.0.0.0:81 ..." category=server
time="2019-12-13T12:51:54Z" level=debug msg="Queue (Add) 'default/gubernator' - %!s(<nil>)"
time="2019-12-13T12:51:54Z" level=debug msg="Fetching peer list from endpoints API"
time="2019-12-13T12:51:54Z" level=info msg="Peers updated" category=gubernator peers="[]"
time="2019-12-13T12:51:54Z" level=info msg="HTTP Gateway Listening on 0.0.0.0:80 ..." category=server
time="2019-12-13T12:51:55Z" level=debug msg="Queue (Update) 'default/gubernator' - %!s(<nil>)"
time="2019-12-13T12:51:55Z" level=debug msg="Fetching peer list from endpoints API"
time="2019-12-13T12:51:55Z" level=debug msg="Peer: {Address:172.17.0.19:81 IsOwner:false}\n"
time="2019-12-13T12:51:55Z" level=info msg="Peers updated" category=gubernator peers="[{172.17.0.19:81 false}]"
time="2019-12-13T12:51:57Z" level=debug msg="Queue (Update) 'default/gubernator' - %!s(<nil>)"
time="2019-12-13T12:51:57Z" level=debug msg="Fetching peer list from endpoints API"
time="2019-12-13T12:51:57Z" level=debug msg="Peer: {Address:172.17.0.18:81 IsOwner:true}\n"
time="2019-12-13T12:51:57Z" level=debug msg="Peer: {Address:172.17.0.19:81 IsOwner:false}\n"
time="2019-12-13T12:51:57Z" level=info msg="Peers updated" category=gubernator peers="[{172.17.0.18:81 true} {172.17.0.19:81 false}]"
time="2019-12-13T12:51:58Z" level=debug msg="Queue (Update) 'default/gubernator' - %!s(<nil>)"
time="2019-12-13T12:51:58Z" level=debug msg="Fetching peer list from endpoints API"
time="2019-12-13T12:51:58Z" level=debug msg="Peer: {Address:172.17.0.13:81 IsOwner:false}\n"
time="2019-12-13T12:51:58Z" level=debug msg="Peer: {Address:172.17.0.18:81 IsOwner:true}\n"
time="2019-12-13T12:51:58Z" level=debug msg="Peer: {Address:172.17.0.19:81 IsOwner:false}\n"
time="2019-12-13T12:51:58Z" level=info msg="Peers updated" category=gubernator peers="[{172.17.0.13:81 false} {172.17.0.18:81 true} {172.17.0.19:81 false}]"

Now.......

  • kill the pod where you are not monitoring the logs.
  • The 2 remaining PODs will detect the peer deletion but will never detect the 3rd POD being restarted.
  • Only the following is shown in the logs
time="2019-12-13T12:53:25Z" level=debug msg="Queue (Update) 'default/gubernator' - %!s(<nil>)"
time="2019-12-13T12:53:25Z" level=debug msg="Fetching peer list from endpoints API"
time="2019-12-13T12:53:25Z" level=debug msg="Peer: {Address:172.17.0.13:81 IsOwner:false}\n"
time="2019-12-13T12:53:25Z" level=debug msg="Peer: {Address:172.17.0.18:81 IsOwner:true}\n"
time="2019-12-13T12:53:25Z" level=info msg="Peers updated" category=gubernator peers="[{172.17.0.13:81 false} {172.17.0.18:81 true}]"

Calling /v1/HealthCheck on all 3 endpoints will return {"status":"healthy","peer_count":2} on the 2 remaining PODs and {"status":"healthy","peer_count":3} on the restarted POD.

This is already messing up the GetRateLimits requests.

To make it worse.......

Wait for the following message to appear in the 2 PODs that were not killed.

W1213 13:17:56.913021       1 reflector.go:302] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: watch of *v1.Endpoints ended with: too old resource version: 398370 (399065)

After this message, delete 1 of the 2 remaining PODs. I would expect that the other POD would detect the deletion, but this is also not the case.

I would expect that in both POD deletion scenarios (with and without the "watch of *v1.Endpoints ended with: too old resource version" message) /v1/HealthCheck on all 3 POD instances would always return {"status":"healthy","peer_count":3} (except while a POD is actually restarting), and more importantly that the rate limit requests would keep returning the correct results (remaining/throttled) after POD restarts.

Without this fixed, running more than 1 instance of gubernator does not work on Kubernetes.
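
For what it's worth, one way to make the peer discovery resilient to dropped watches (including the "too old resource version" case) is to build it on a client-go SharedInformer, which re-lists and re-watches on its own, rather than a raw watch. A rough sketch, not gubernator's actual code; the namespace and selector mirror the config dump above:

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// updatePeers is a stand-in for rebuilding the peer list from the Endpoints
// object (hypothetical helper, not part of gubernator).
func updatePeers(obj interface{}) {
	if ep, ok := obj.(*corev1.Endpoints); ok {
		fmt.Printf("endpoints %s/%s changed, %d subsets\n", ep.Namespace, ep.Name, len(ep.Subsets))
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// SharedInformers transparently re-list and re-watch when the API server
	// drops the watch, so peer updates keep flowing after pod restarts.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		time.Minute, // periodic resync as a safety net
		informers.WithNamespace("default"),
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.LabelSelector = "app=gubernator"
		}),
	)

	inf := factory.Core().V1().Endpoints().Informer()
	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { updatePeers(obj) },
		UpdateFunc: func(_, obj interface{}) { updatePeers(obj) },
		DeleteFunc: func(obj interface{}) { updatePeers(obj) },
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}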

Feature: optional metric on current throttling status per key

We are evaluating gubernator and so far so good.
In our business case we manage the tenants on our platform, and the number of tenants is not too large. We can set throttling limits for each tenant but would also like to warn a tenant before they go over their limit.

As gubernator already exposes Prometheus metrics, it would be nice if there were metrics like:

  • "name" + "unique_key" => "limit"
  • "name" + "unique_key" => "remaining"
    OR
  • "name" + "unique_key" => "percentage used"

Based on that, an alert manager could send alerts to tenants when they reach a certain percentage.

I can imagine this metric only makes sense in certain business cases, and to avoid consuming too much memory it should probably be optional when implemented.
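
If this were implemented, it could be as small as an opt-in gauge labelled by name and unique_key that is updated on every GetRateLimits call. A rough sketch with the Prometheus Go client (not gubernator's actual metrics; the metric name and labels are made up):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical opt-in gauge; cardinality grows with the number of unique
// keys, which is why it should stay behind a flag as suggested above.
var remainingGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "gubernator_rate_limit_remaining",
		Help: "Remaining hits for a rate limit, labelled by name and unique key.",
	},
	[]string{"name", "unique_key"},
)

func init() {
	prometheus.MustRegister(remainingGauge)
}

// recordRemaining would be called after each rate limit check when the
// feature is enabled.
func recordRemaining(name, uniqueKey string, remaining int64) {
	remainingGauge.WithLabelValues(name, uniqueKey).Set(float64(remaining))
}

func main() {
	recordRemaining("requests_per_sec", "tenant:acme", 42)
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}

An Alertmanager rule on such a gauge could then notify tenants before they hit the wall.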

K8S Yaml does not properly grant permissions for the kubernetes API

Looks like gubernator is trying to use the "default" SA generated for whatever namespace it's deployed to.

E0827 18:25:17.473059 1 reflector.go:125] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:guber:default" cannot list resource "endpoints" in API group "" in the namespace "guber"

The K8S yaml needs to be updated to generate a ServiceAccount with the proper role bindings.

peerCount: 0 in a healthy cluster

In two kinds of deployments:

  • etcd
  • my custom #99

I'm seeing that gubernator instances are clearly talking to each other (remaining is decreased on peer nodes), and yet the health check returns {"status":"healthy","message":"","peerCount":0}

JSON response spelling

While working on #99 I made a gubernator docker image via:

  • git clone
  • make docker

This image produces response JSON in a slightly different format than documented:

{
  responses: [
  {
    error: '', 
    limit: '100', 
    metadata: {"owner":"'10.125.2.248:81'"}, 
    remaining: '99', 
    resetTime: '1628846037905', 
    status: 'UNDER_LIMIT'
  }
  ]
}

There appears to be exactly 1 difference vs. gubernator from docker hub:
reset_time became resetTime (snake_case to camelCase).

I'm not sure why... I guess it's the gRPC magic?
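
That is indeed the protobuf JSON behaviour: the canonical proto3 JSON mapping turns field names into camelCase (resetTime), and a marshaler has to be told explicitly to keep the original snake_case names. A small Go illustration of the switch that controls it, using protojson on an arbitrary bundled message (gubernator's gateway may be using the older jsonpb equivalent, OrigName, but the idea is the same):

package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/descriptorpb"
)

func main() {
	// Any message with a multi-word field shows the difference; descriptorpb
	// is used here only because it ships with the protobuf library.
	msg := &descriptorpb.FileDescriptorProto{
		Name:             proto.String("example.proto"),
		PublicDependency: []int32{0},
	}

	// Default proto3 JSON mapping: the field comes out as "publicDependency",
	// which is the same effect that turns reset_time into resetTime.
	def, _ := protojson.Marshal(msg)
	fmt.Println(string(def))

	// Opting into the original .proto names keeps "public_dependency",
	// matching the documented reset_time spelling.
	orig, _ := protojson.MarshalOptions{UseProtoNames: true}.Marshal(msg)
	fmt.Println(string(orig))
}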

(unrelated: the README.md top-level key is rate_limits in the GRPC sample, but responses in the JSON sample 🤷🏿)

Feature Request: Atomic Chaining of Limiters

Sometimes when using multiple resources at the same time I want to make sure they are all clear before beginning the work. For example before calling an external API, I want to make sure my own server has capacity to handle the work AND I want to make sure not to abuse the limits of the external API.

I would like to be able to include a chain (list) of limiters in one request. Gubernator would atomically determine if each limiter is UNDER_LIMIT, and only then increment all the hits. If any one of the requested limits is OVER_LIMIT, then the request fails and no hits are incremented.

The response would probably be a Boolean flag of whether the overall request is successful and a list of responses for each limiter requested. The longest reset_time for a failed limiter would be useful as well.

The closest way to achieve this currently is querying each limiter with zero hits to see if they are all open, and only then sending another request to deduct hits from each limiter; the problem is that this operation is not atomic, so the limiters could have filled up between the two requests.
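
For reference, the two-step workaround described above looks roughly like this against the HTTP API (request shape taken from the examples in this document; limiter names are made up), with the race called out where it happens:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type req struct {
	Name      string `json:"name"`
	UniqueKey string `json:"unique_key"`
	Hits      int    `json:"hits"`
	Limit     int    `json:"limit"`
	Duration  int    `json:"duration"`
}

type resp struct {
	Responses []struct {
		Status string `json:"status"` // empty/UNDER_LIMIT or OVER_LIMIT
	} `json:"responses"`
}

func getRateLimits(reqs []req) (resp, error) {
	var out resp
	body, _ := json.Marshal(map[string][]req{"requests": reqs})
	r, err := http.Post("http://localhost:9080/v1/GetRateLimits",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return out, err
	}
	defer r.Body.Close()
	return out, json.NewDecoder(r.Body).Decode(&out)
}

func main() {
	limiters := []req{
		{Name: "my_server_capacity", UniqueKey: "global", Limit: 100, Duration: 1000},
		{Name: "external_api", UniqueKey: "vendor:foo", Limit: 10, Duration: 1000},
	}

	// Step 1: probe with zero hits to see whether every limiter is open.
	probe, err := getRateLimits(limiters)
	if err != nil {
		panic(err)
	}
	for _, r := range probe.Responses {
		if r.Status == "OVER_LIMIT" {
			fmt.Println("at least one limiter is full; not starting work")
			return
		}
	}

	// Step 2: actually consume a hit from each limiter. Another client may
	// have filled a bucket between step 1 and step 2, which is exactly the
	// race the atomic chaining feature would remove.
	for i := range limiters {
		limiters[i].Hits = 1
	}
	if _, err := getRateLimits(limiters); err != nil {
		panic(err)
	}
	fmt.Println("hits recorded; safe to start work")
}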

Implementers howto

Create an implementers README with information on how to properly implement and manage a gubernator persistent store.

/v1/HealthCheck dials default EtcdAdvertiseAddress even though only kubernetes is configured.

http://localhost:65111/v1/HealthCheck

{"error":"all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:81: connect: connection refused"","message":"all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:81: connect: connection refused"","code":14}

My env configuration

Environment:
  K8S_NODE_NAME:                  (v1:spec.nodeName)
  K8S_POD_NAME:                   (v1:metadata.name)
  K8S_POD_ID:                     (v1:metadata.uid)
  K8S_NAMESPACE:                  (v1:metadata.namespace)
  K8S_LABELS_APP:                 (v1:metadata.labels['app'])
  K8S_CONTAINER_NAME:            gubernator
  GUBER_DEBUG:                   true
  GUBER_HTTP_ADDRESS:            0.0.0.0:65111
  GUBER_GRPC_ADDRESS:            0.0.0.0:65110
  GUBER_CACHE_SIZE:              50000
  GUBER_BATCH_TIMEOUT:           50ms
  GUBER_BATCH_LIMIT:             1000
  GUBER_BATCH_WAIT:              500ns
  GUBER_K8S_NAMESPACE:            (v1:metadata.namespace)
  GUBER_K8S_POD_IP:               (v1:status.podIP)
  GUBER_K8S_POD_PORT:            65110
  GUBER_K8S_ENDPOINTS_SELECTOR:  app=gubernator

gubernator SSL client auth not working as expected

Thank you for writing this really cool opensource tool.

We are attempting to use gubernator for our own rate limiting purposes and ran into an issue with client SSL authentication. This is on a cluster of 2 gubernator instances running behind a load balancer. The SSL cert is issued for the name of the VIP ($fqdn). When one tries to connect to the VIP with a POST against https://$fqdn/v1/GetRateLimits, the following is returned...

{
    "responses": [
        {
            "status": "UNDER_LIMIT",
            "limit": "0",
            "remaining": "0",
            "reset_time": "0",
            "error": "while fetching rate limit 'requests_per_sec_abc' from peer - 'rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: cannot validate certificate for 10.26.42.0 because it doesn't contain any IP SANs\"'",
            "metadata": {}
        }
    ]
}

The environment is as follows

      - GUBER_HTTP_ADDRESS: 0.0.0.0:8080
      - GUBER_GRPC_ADDRESS: 0.0.0.0:8081
      - GUBER_ADVERTISE_ADDRESS: 10.26.42.0:8081
      - GUBER_MEMBERLIST_KNOWN_NODES: $vm1_name,$vm2_name2
      - GUBER_TLS_CA: path_to_intermediate_chain_crt
      - GUBER_TLS_KEY: path_to_key
      - GUBER_TLS_CERT: path_to_crt
      - GUBER_TLS_CLIENT_AUTH: require-any-cert

The 10.26.42.0 address mentioned in the error is the IP address directly bound to the VM the service is running on, but it should not be authenticating against that address; I think the app should be authenticating against the name (the $fqdn) provided in the HTTPS request. Is this a bug, or am I doing something wrong?
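
From the error text, the failing handshake is on the peer-to-peer hop: each node dials the other peers at the advertise address (an IP such as 10.26.42.0), so certificate verification runs against that IP rather than the $fqdn used in the HTTPS request, and it fails because the cert carries no IP SANs. One hedged workaround is to issue the node certificates with both the DNS name and the peer IPs as SANs; a minimal Go sketch of generating such a cert (self-signed purely for illustration, key output omitted):

package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"net"
	"os"
	"time"
)

func main() {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}

	tmpl := x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "gubernator.example.com"}, // stand-in for the $fqdn
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(365 * 24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
		// DNS SAN for clients hitting the VIP by name...
		DNSNames: []string{"gubernator.example.com"},
		// ...and IP SANs for the peer-to-peer dials to the advertise addresses
		// (illustrative addresses, use the real node IPs).
		IPAddresses: []net.IP{net.ParseIP("10.26.42.0"), net.ParseIP("10.26.42.1")},
	}

	der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
}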

v2.0.0 Release Roadmap

Release Candidate 1 is available here with the following TODOs implemented

TODO

  • #55 Refactor public ABI interface to be consistent with new features.
  • #50 /v1/HealthCheck dials default EtcdAdvertiseAddress even though only kubernetes is configured
  • #72 etcd now correctly identifies IsOwner when updating the peers list
  • #57 TestLeakyBucket & TestTokenBucket fail occasionally.
  • #59 Support TLS for GRPC rate-limiter request interface (#76)
  • #74 v1.0.0. Performance Review
  • #77 Fix Leaky bucket bug
  • Test and document multi-region support
  • Deploy and testing

v2.0.0. Performance Review

Purpose

  • In production we are seeing 300ms response times during very high volumes. (Response times are usually in the 2-5ms range)
  • Profile the distribution of hit updates when using Behavior=GLOBAL. Compare our implementation with that of https://ipfs.io.

TODO

  • Profile running gubernators in production.
  • Profile GLOBAL update behavior

fatal error: concurrent map read and map write

I was testing the GLOBAL behavior using version 0.4.1, and could reliably trigger the following panic when testing with 4 threads accessing the same key using the GRPC client.

fatal error: concurrent map read and map write
goroutine 1016 [running]:
runtime.throw(0x1562c2b, 0x21)
	/usr/local/go/src/runtime/panic.go:617 +0x72 fp=0xc0003f0558 sp=0xc0003f0528 pc=0x42c8f2
runtime.mapaccess2(0x1379ec0, 0xc0002dd380, 0xc0003f05d8, 0xc0003f05e8, 0x40969b)
	/usr/local/go/src/runtime/map.go:472 +0x284 fp=0xc0003f05a0 sp=0xc0003f0558 pc=0x40d6d4
github.com/mailgun/gubernator/cache.(*LRUCache).Get(0xc00031a050, 0x1337920, 0xc00043c3c0, 0x1540d2d, 0x1, 0xc000053980)
	/src/cache/lru.go:106 +0x5c fp=0xc0003f05f8 sp=0xc0003f05a0 pc=0x8583bc
github.com/mailgun/gubernator.(*Instance).getGlobalRateLimit(0xc0001be460, 0xc00094b100, 0x1b, 0xc0001b8540, 0x0)
	/src/gubernator.go:178 +0x103 fp=0xc0003f0678 sp=0xc0003f05f8 pc=0x1255453
github.com/mailgun/gubernator.(*Instance).GetRateLimits.func1.1(0x1426900, 0xc00093dae0, 0x0, 0x0)
	/src/gubernator.go:134 +0x240 fp=0xc0003f0780 sp=0xc0003f0678 pc=0x125f3c0
github.com/mailgun/holster.(*FanOut).Run.func1(0xc0009557a0, 0x1426900, 0xc00093dae0, 0xc00094b140)
	/go/pkg/mod/github.com/mailgun/[email protected]+incompatible/fanout.go:66 +0x3e fp=0xc0003f07c0 sp=0xc0003f0780 pc=0x8577fe

This results in the server constantly logging the following error message and, as a result, it no longer serves requests.

level=error msg="error sending global hits to 'host:81'" category=global-manager error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
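
The stack trace points at an unsynchronized map read in the LRU cache while another goroutine (the GLOBAL async sender) is writing to it. The usual fix is to guard every cache access with a mutex, or to funnel all access through a single goroutine. A generic sketch of the mutex approach, not gubernator's actual cache code:

package main

import (
	"fmt"
	"sync"
)

// guardedCache is an illustrative wrapper: every read and write goes through
// one RWMutex so concurrent goroutines cannot race on the underlying map.
type guardedCache struct {
	mu    sync.RWMutex
	items map[string]int64
}

func newGuardedCache() *guardedCache {
	return &guardedCache{items: make(map[string]int64)}
}

func (c *guardedCache) Get(key string) (int64, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.items[key]
	return v, ok
}

func (c *guardedCache) Add(key string, v int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = v
}

func main() {
	c := newGuardedCache()
	var wg sync.WaitGroup
	// Hammer the same key from 4 goroutines, mirroring the reported test.
	// With the mutex in place `go run -race` stays quiet; without it this
	// is the "concurrent map read and map write" crash above.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int64) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.Add("account_id=123", n)
				c.Get("account_id=123")
			}
		}(int64(i))
	}
	wg.Wait()
	fmt.Println("done")
}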

TestPeerClientShutdown/No_batching fails with DataRace

==================
 WARNING: DATA RACE
 Write at 0x00c000713fa8 by goroutine 504:
   sync.(*WaitGroup).Wait()
       /usr/local/go/src/internal/race/race.go:41 +0xee
   github.com/mailgun/holster.(*WaitGroup).Wait()
       /go/pkg/mod/github.com/mailgun/[email protected]+incompatible/waitgroup.go:99 +0x4f
   github.com/mailgun/holster.(*WaitGroup).Stop()
       /go/pkg/mod/github.com/mailgun/[email protected]+incompatible/waitgroup.go:81 +0x86
   github.com/mailgun/gubernator.(*Interval).Stop()
       /builds/volterra/ves.io/gubernator/interval.go:58 +0x3e
   github.com/mailgun/gubernator.(*PeerClient).run()
       /builds/volterra/ves.io/gubernator/peers.go:192 +0x3e2
 Previous read at 0x00c000713fa8 by goroutine 484:
   sync.(*WaitGroup).Add()
       /usr/local/go/src/internal/race/race.go:37 +0x18d
   github.com/mailgun/holster.(*WaitGroup).Until()
       /go/pkg/mod/github.com/mailgun/[email protected]+incompatible/waitgroup.go:63 +0x8f
   github.com/mailgun/gubernator.(*Interval).run()
       /builds/volterra/ves.io/gubernator/interval.go:45 +0xbd
 Goroutine 504 (running) created at:
   github.com/mailgun/gubernator.NewPeerClient()
       /builds/volterra/ves.io/gubernator/peers.go:73 +0x1c6
   github.com/mailgun/gubernator_test.TestPeerClientShutdown.func1()
       /builds/volterra/ves.io/gubernator/peers_test.go:44 +0x116
   testing.tRunner()
       /usr/local/go/src/testing/testing.go:909 +0x199
 Goroutine 484 (finished) created at:
   github.com/mailgun/gubernator.NewInterval()
       /builds/volterra/ves.io/gubernator/interval.go:40 +0x136
   github.com/mailgun/gubernator.(*PeerClient).run()
       /builds/volterra/ves.io/gubernator/peers.go:179 +0x6e
 ==================
 ==================
 WARNING: DATA RACE
 Write at 0x00c000196e98 by goroutine 482:
   sync.(*WaitGroup).Wait()
       /usr/local/go/src/internal/race/race.go:41 +0xee
   github.com/mailgun/gubernator.(*PeerClient).Shutdown.func2()
       /builds/volterra/ves.io/gubernator/peers.go:281 +0x3e
 Previous read at 0x00c000196e98 by goroutine 340:
   sync.(*WaitGroup).Add()
       /usr/local/go/src/internal/race/race.go:37 +0x18d
   github.com/mailgun/gubernator.(*PeerClient).GetPeerRateLimits()
       /builds/volterra/ves.io/gubernator/peers.go:104 +0xd6
   github.com/mailgun/gubernator.(*PeerClient).GetPeerRateLimit()
       /builds/volterra/ves.io/gubernator/peers.go:84 +0x12d
   github.com/mailgun/gubernator_test.TestPeerClientShutdown.func1.1()
       /builds/volterra/ves.io/gubernator/peers_test.go:54 +0x177
 Goroutine 482 (running) created at:
   github.com/mailgun/gubernator.(*PeerClient).Shutdown()
       /builds/volterra/ves.io/gubernator/peers.go:280 +0x160
   github.com/mailgun/gubernator_test.TestPeerClientShutdown.func1()
       /builds/volterra/ves.io/gubernator/peers_test.go:67 +0x2b4
   testing.tRunner()
       /usr/local/go/src/testing/testing.go:909 +0x199
 Goroutine 340 (running) created at:
   github.com/mailgun/gubernator_test.TestPeerClientShutdown.func1()
       /builds/volterra/ves.io/gubernator/peers_test.go:51 +0x253
   testing.tRunner()
       /usr/local/go/src/testing/testing.go:909 +0x199
 ==================

Multitenancy

Howdy! Very cool project, looks well built and a real credit to you.

Feel free to resolve this immediately, this is all very conversational, not an issue as such.

Thoughts on the following kind of setup?....

Centralised cluster. 100 teams each with their own microservice all use the centrally managed cluster.

Fairness and (maybe) scaling problems to consider.

Do you have large central deployments like that? 100s of nodes in a cluster, each with many GBs of memory used? Or would noisy neighbours and excessive network load from globals become a problem?

There are a lot of "it depends" in this conversation :-) .... Just trying to get a sense of the thing.

Out-of-date container registry image

The easy way to reproduce this is to simply create a rate limit of 1/s and then send 10/s+ (anything > 1/s should be sufficient). Once this is done, the rate limit will allow a single success (UNDER_LIMIT) and then all subsequent requests will fail.

Cannot upgrade gubernator?

In my package I'm trying to update gubernator from 2.0.0-rc10 to 2.0.0-rc15:

(screenshots omitted)

However, this breaks my build:

Error: vendor/k8s.io/client-go/discovery/discovery_client.go:32:2: cannot find package "." in:
	/home/runner/work/sandcastle/sandcastle/vendor/github.com/googleapis/gnostic/openapiv2
make[1]: *** [Makefile:156: test-customers-functional] Error 1

It looks like it can't find googleapis/gnostic? So I tried adding this inside go.mod:

replace github.com/googleapis/gnostic => github.com/google/gnostic v0.5.5

but this doesn't seem to have any effect. :(

Issue with leaky bucket algorithm

I tried the leaky bucket algorithm, but it seems the tokens don't leak from the bucket at a consistent rate while we add new hits. When we don't add new hits, it's OK: the number of remaining tokens increases at the expected rate.

Maybe I misunderstood the leaky bucket algorithm, or there is something strange going on.

An example may make this more explicit.

I use a leaky bucket with 10 requests allowed per 30 seconds and add a new hit every second. I expect the remaining tokens to increase by one at t+3s, t+6s and t+9s, but that does not seem to be the case.

You can reproduce this problem with the following script:

for i in 0 1 2 3 4 5 6 7 8 9 10; do curl -X POST http://localhost:9080/v1/GetRateLimits --data '{"requests":[{"name":"test", "unique_key": "testkey1", "hits":1, "duration":30000, "limit":10, "algorithm":"LEAKY_BUCKET"}]}'; echo; sleep 1; done
{"responses":[{"limit":"10","remaining":"9","reset_time":"3000","metadata":{"owner":"192.168.16.5:81"}}]}
{"responses":[{"limit":"10","remaining":"8","reset_time":"1603905514100","metadata":{"owner":"192.168.16.5:81"}}]}
{"responses":[{"limit":"10","remaining":"7","reset_time":"1603905515137","metadata":{"owner":"192.168.16.5:81"}}]}
{"responses":[{"limit":"10","remaining":"6","reset_time":"1603905516170","metadata":{"owner":"192.168.16.5:81"}}]}
{"responses":[{"limit":"10","remaining":"5","reset_time":"1603905517204","metadata":{"owner":"192.168.16.5:81"}}]}
{"responses":[{"limit":"10","remaining":"4","reset_time":"1603905518238","metadata":{"owner":"192.168.16.5:81"}}]}
{"responses":[{"limit":"10","remaining":"3","reset_time":"1603905519270","metadata":{"owner":"192.168.16.5:81"}}]}
{"responses":[{"limit":"10","remaining":"2","reset_time":"1603905520307","metadata":{"owner":"192.168.16.5:81"}}]}
{"responses":[{"limit":"10","remaining":"1","reset_time":"1603905521341","metadata":{"owner":"192.168.16.5:81"}}]}
{"responses":[{"limit":"10","reset_time":"1603905522373","metadata":{"owner":"192.168.16.5:81"}}]}
{"responses":[{"status":"OVER_LIMIT","limit":"10","reset_time":"1603905523406","metadata":{"owner":"192.168.16.5:81"}}]}

We can see that the number of remaining tokens always decreases, and the first "reset_time" isn't a valid Unix timestamp.

If I run the same test but wait 4 seconds between each new hit, the remaining tokens recover correctly between each call.

for i in 0 1 2 3 4; do curl -X POST http://localhost:9080/v1/GetRateLimits --data '{"requests":[{"name":"test", "unique_key": "testkey2", "hits":1, "duration":30000, "limit":10, "algorithm":"LEAKY_BUCKET"}]}'; echo; sleep 4; done
{"responses":[{"limit":"10","remaining":"9","reset_time":"3000","metadata":{"owner":"192.168.16.4:81"}}]}
{"responses":[{"limit":"10","remaining":"9","reset_time":"1603905601743","metadata":{"owner":"192.168.16.4:81"}}]}
{"responses":[{"limit":"10","remaining":"9","reset_time":"1603905605775","metadata":{"owner":"192.168.16.4:81"}}]}
{"responses":[{"limit":"10","remaining":"9","reset_time":"1603905609808","metadata":{"owner":"192.168.16.4:81"}}]}
{"responses":[{"limit":"10","remaining":"9","reset_time":"1603905613840","metadata":{"owner":"192.168.16.4:81"}}]}

Don't hesitate to ask if you need more information. I'm running the docker-compose setup with the latest docker image.
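
For comparison, here is the arithmetic the report is expecting: with limit=10 and duration=30000ms, one token should leak out of the bucket every duration/limit = 3 seconds, so adding one hit per second should only fill the bucket by a net of about 2 tokens every 3 seconds. A small Go sketch of that idealized model (a reference calculation, not gubernator's implementation):

package main

import "fmt"

func main() {
	const (
		limit      = 10
		durationMs = 30000
		// One token leaks out every durationMs/limit ms, i.e. every 3s here.
		leakEveryMs = durationMs / limit
	)

	for t := 0; t <= 10000; t += 1000 {
		hits := t/1000 + 1        // one hit per second, starting at t=0
		leaked := t / leakEveryMs // idealized total leakage since t=0
		level := hits - leaked    // how full the bucket is
		fmt.Printf("t=%2ds hits=%2d leaked=%d remaining=%d\n",
			t/1000, hits, leaked, limit-level)
	}
	// In this model remaining holds steady at t=3s, 6s and 9s (a token has
	// leaked back) and never reaches 0 within the 11 calls, unlike the
	// observed output above which hits OVER_LIMIT on the 11th call.
}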
