kuadrant / limitador-operator Goto Github PK

View Code? Open in Web Editor NEW

6.0 6.0 13.0 21.27 MB

License: Apache License 2.0

Dockerfile 0.73% Makefile 8.58% Go 89.59% Shell 1.10%

limitador-operator's Introduction

kuadrant

limitador-operator's People

Contributors

Stargazers

Watchers

Forkers

davidor krishvoor mikenairn rahulanand16nov kevfan adam-cattermole grzpiotrowski boomatang pwright jasonmadigan

limitador-operator's Issues

Summary

This is an attempt at streamlining and formalizing the addition of features to Kuadrant and its components: Authorino, Limitador and possible more to come. It describes a process, Request For Comment (RFC). That process would be followed by contributors to Kuadrant in order to add features to the platform. The process aims at enabling the teams to deliver better defined software with a better experience for our end-users.

Motivation

As I went through the process of redefining the syntax for Conditions in Limitador, I found it hard to seed people's mind with the problem's space as I perceived it. I started by asking questions on the issues itself that didn't get the traction I had hoped for until the PR was eventually opened.
This process should help the author to consider the proposed change in its entirety: the change, its pros & cons, its documentation and the error cases. Making it easier for reviewers to understand the impact of the change being considered.
Further more, this keeps a "written" record of "a decision log" of how a feature came to be. It would help the ones among us who tend to forget about things, but would be of incommensurate value for future contributors wanting to either understand a feature deeply or build upon certain features to enable a new one.

Guide-level explanation

A contributor would start by following the template for a new Request For Comment (RFC). Eventually opening a pull request with the proposed change explained. At which point it automatically becomes a point of discussion for the next upcoming technical discussion weekly call.
Anyone is free to add ideas, raise issues, point out possible missing bits in the proposal before the call on the PR itself. The outcome of the technical discussion call is recorded on the PR as well, for future reference.
Once the author feels the proposal is in a good shape and has addressed the comments provided by the team and community, they can label the RFC as FCP, entering the Final Comment Period_. From that point on, there is another week left for commenters to express any remaining concerns. After which, the RFC is merged and going into active status, ready for implementation.

Reference-level explanation

Creating a Kuadrant/rfcs repository, with the README below and a template to start a new RFC from:

See the README.md and 0000-template.md files below for more details.

Drawbacks

The process proposed here adds overhead to addition of new features onto our stack. It will require more upfront specification work. It may require doing a few proof of concepts along the initial authoring, to enable the author to better understand the problem space.

Rationale and alternatives

What we've done until now, investigations have been less formal, but I'm unsure how much of their value got properly and entirely captured. By formalizing the process and having a clear outcome: a implementable piece of documentation, that address all aspects of the user's experience look like a better result.

Prior art

The entire idea isn't new. This very proposal is based on prior art by rust-lang and pony-lang. This process isn't perfect, but has been proven over and over again to work.

Unresolved questions

A week for the FCP seems a lot and very little at the same time… should we revisit this?
Having two core team member accepting an RFC is… acceptable? Should it be more? Less?
Should this all go under Kuadrant/rfcs?
What does it mean for kcp-glbc ?

Future possibilities

I certainly see this process itself evolving overtime. I like to think that this process can itself be supporting its future changes…

README.md

Kuadrant RFCs

The RFC (Request For Comments) process is aiming at providing a consistent and well understood way of adding new features or introducing breaking changes in the Kuadrant stack. It provides a mean for all stakeholders and community at large to provide feedback and be confident about the evolution of our solution.

Many, if not most of the changes will not require to follow this process. Bug fixes, refactoring, performance improvements or documentation additions/improvements can be implemented using the tradition PR (Pull Request) model straight to the targeted repositories on Github.

Additions or any other changes that impact the end user experience will need to follow this process.

When is an RFC required?

This process is meant for any changes that affect the user's experience in any way: addition of new APIs, changes to existing APIs - whether they are backwards compatible or not - and any other change to behaviour that affects the user of any components of Kuadrant.

API additions;
API changes;
… any change in behaviour.

When no RFC is required?

bugfixes;
refactoring;
performance improvements.

The RFC process

The first step in adding a new feature to Kuadrant, or a starting a major change, is the having a RFC merged into the repository. One the file has been merged, the RFC is considered active and ready to be worked on.

Fork the RFC repo
Use the template 0000-template.md to copy and rename it into the rfcs directory. Change the template suffix to something descriptive. But this is still a proposal and as no assigned RFC number to it yet.
Fill the template out. Try to be as thorough as possible. While some sections may not apply to your feature/change request, try to complete as much as possible, as this will be the basis for further discussions.
Submit a pull request for the proposal. That's when the RFC is open for actual comments by other members of team and the broader community.
The PR is to be handled just like a "code PR", wait on people's review and integrate the feedback provided. These RFCs can also be discussed during our weekly technical call meeting, yet the summary would need to be captured on the PR.
How ever much the orignal proposal changes during this process, never force push or otherwise squash the history, or even rebase your branch. Try keeping the commit history as clean as possible on that PR, in order to keep a trace of how the RFC evolved.
Once all point of views have been shared and input has been integrated in the PR, the author can push the RFC in the final comment period (FCP) which lasts a week. This is the last chance for anyone to provide input. If during the FCP, consensus cannot be reached, it can be decided to extend that period by another week. Consensus is achieved by getting two approvals from the core team.
As the PR is merged, it gets a number assigned, making the RFC active.
If on the other hand the consensus is to not implement the feature as discussed, the PR is closed.

The RFC lifecycle

Open: A new RFC as been submitted as a proposal
FCP: Final comment period of one week for last comments
Active: RFC got a number assigned and is ready for implementation with the work tracked in an issue, which summarizes the state of the implementation work.

Implementation

The work is itself tracked in a "master" issue with all the individual, manageable implementation tasks tracked.
The state of that issue is initially "open" and ready for work, which doesn't mean it'd be worked on immediately or by the RFC's author. That work will be planned and integrated as part of the usual release cycle of the Kuadrant stack.

Amendments

It isn't expected for an RFC to change, once it has become active. Minor changes are acceptable, but any major change to an active RFC should be treated as an independent RFC and go through the cycle described here.

EOF

0000-template.md

RFC Template

Feature Name: (fill me in with a unique ident, new_feature)
Start Date: (fill me in with today's date, YYYY-MM-DD)
RFC PR: Kuadrant/rfcs#0000
Rust Issue: Kuadrant/rfcs#0000

Summary

One paragraph explanation of the feature.

Motivation

Why are we doing this? What use cases does it support? What is the expected outcome?

Guide-level explanation

Explain the proposal as if it was implemented and you were teaching it to Kuadrant user. That generally means:

Introducing new named concepts.
Explaining the feature largely in terms of examples.
Explaining how a user should think about the feature, and how it would impact the way they already use Kuadrant. It should explain the impact as concretely as possible.
If applicable, provide sample error messages, deprecation warnings, or migration guidance.
If applicable, describe the differences between teaching this to existing and new Kuadrant users.

Reference-level explanation

This is the technical portion of the RFC. Explain the design in sufficient detail that:

Its interaction with other features is clear.
It is reasonably clear how the feature would be implemented.
How error would be reported to the users.
Corner cases are dissected by example.

The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.

Drawbacks

Why should we not do this?

Rationale and alternatives

Why is this design the best in the space of possible designs?
What other designs have been considered and what is the rationale for not choosing them?
What is the impact of not doing this?

Prior art

Discuss prior art, both the good and the bad, in relation to this proposal.
A few examples of what this can include are:

Does another project have a similar feature?
What can be learned from it? What's good? What's less optimal?
Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background.

This section is intended to encourage you as an author to think about the lessons from other tentatives - successful or not, provide readers of your RFC with a fuller picture.

Note that while precedent set by other projects is some motivation, it does not on its own motivate an RFC.

Unresolved questions

What parts of the design do you expect to resolve through the RFC process before this gets merged?
What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?

Future possibilities

Think about what the natural extension and evolution of your proposal would be and how it would affect the platform and project as a whole. Try to use this section as a tool to further consider all possible interactions with the project and its components in your proposal. Also consider how this all fits into the roadmap for the project and of the relevant sub-team.

This is also a good place to "dump ideas", if they are out of scope for the RFC you are writing but otherwise related.

Note that having something written down in the future-possibilities section is not a reason to accept the current or a future RFC; such notes should be in the section on motivation or rationale in this or subsequent RFCs. The section merely provides additional information.

EOF

Production-ready: Configure possible different Image

In 3scale SaaS we have been using successfully limitador for a couple of years together with Redis, to protect all our public endpoints. However:

We are using an old image community image
Yamls are managed individually via ArgoCD

We would like to update how we manage limitador application, and use the most recommended limitador setup using limitador-operator, with a production-ready grade.

Current limitador-operator:

Do not permit to configure an image/tag/pullSecretName via CR
Operator image repository is harcoded to quay.io/kuadrant/limitador
Operator image tag can be overriden on a weird way: setting in the Operator Subscription CR the envvar RELATED_IMAGE_LIMITADOR.
- However, only a different tag inside same image repo quay.io/kuadrant/limitador can be used because it is harcoded

Desired features :

With the possibility of using productized images which are in private image repositories (so requires a pullSecretName reference pointing to a secret holding the private image repo credentials), the image/tag/pullSecretName should be able to be configured via CR to override default values
Not very important immediately as we will use community images from OperatorHub installation, but eventually I guess would be required if it is needed to use productized images elsewhere

Possible CR config

apiVersion: limitador.kuadrant.io/v1alpha1
kind: Limitador
metadata:
  name: limitador-sample
spec:
  image:
    name: brew.registry.redhat.io/rh-osbs/3scale-mas-limitador-rhel8
    tag: 1.2.0-2
    pullSecretName: brew-pull-secret  # this secret holds the private image repo credentials

Which should create something like:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: limitador
spec:
...
  template:
    spec:
      imagePullSecrets:
        - name: brew-pull-secret
...
      containers:
        - name: limitador
          image: brew.registry.redhat.io/rh-osbs/3scale-mas-limitador-rhel8:1.2.0-2

Enhance status reconciling

Instead of requeue, the controller should watch for owned (referenceOwner) deployments and when the deployment says it is available, a new reconciliation loop will be triggered. If for some reason, the Limitador becomes unavailable (because it crashes or whatever), its controller will never know and the status will be available until something changes the spec of the Limitador CR.

Allow to configure the limits storage in Limitador

Limitador supports several storage backends for the rate limits (in-memory, redis, wasm-compatible, cached redis).
To start with, it'd be good to be able to choose between in-memory and redis in the Limitador CR.

dep: use go1.19

Upgrade go.mod and various workflows to go 1.19 to align with most other kuadrant projects

workflow: github.com/mikefarah/yq/v4@latest requires go1.20

Build bundle workflow is failing due the the latest yq version requires go1.20

go: downloading golang.org/x/xerrors v0.0.0-20220609144429-65e65417b02f
# github.com/mikefarah/yq/v4/pkg/yqlib
Error: /home/runner/go/pkg/mod/github.com/mikefarah/yq/[email protected]/pkg/yqlib/encoder_lua.go:139:[29](https://github.com/Kuadrant/limitador-operator/actions/runs/5831295861/job/15815415261#step:4:30): undefined: strings.CutPrefix
Error: /home/runner/go/pkg/mod/github.com/mikefarah/yq/[email protected]/pkg/yqlib/encoder_lua.go:237:29: undefined: strings.CutPrefix
note: module requires Go 1.20
make: *** [Makefile:131: /home/runner/work/limitador-operator/limitador-operator/bin/yq] Error 1
Error: Process completed with exit code 2.

Instead of the always using the latest version, we should pin in to the last supporting version of go1.19 (v4.34.2 was last working before this issue occurred) or alternatively we can upgrade to go1.20

limitador-operator/Makefile

Line 131 in 5381563

$(call go-install-tool,$(YQ),github.com/mikefarah/yq/v4@latest)

workflow: update actions/checkout & actions/setup-go

actions/checkout should be updated to v3
- https://github.com/actions/checkout/releases/tag/v3.5.3
actions/setup-go should be updated to v4
- https://github.com/actions/setup-go#v4

Limits at Limitador CR

Context

After a redefinition of the authority of the limits, Kuadrant/limitador#74, the limits configuration will live as a local file in the pod. This issue is about the control plane of limitador and how the limits get their way to the local file in the pod.

Source of the config map

Limitador operator will reconcile a configmap to be mounted as local file in all the replica pods of limitador. Where are those limits coming from? Currently the limitador's operator reads RateLimit CRs and reconciles with limitador using the HTTP endpoint. The association of RateLimit CR with a limitador instance is currently hardcoded in the limitador operator. The RateLimit CRs need to be created in the same namespace of the limitador's pod and the service name and the port are hardcoded.

In order to make the limits configuration flexible and with a clear association of which limits are applied to which limitador instances, the proposal is about setting the limits to the Limitador CRD. For example:

---                                                       
apiVersion: limitador.kuadrant.io/v1alpha1                
kind: Limitador                                           
metadata:                                                 
  name: limitador                                         
spec:                                                     
  replicas: 1                                             
  version: "0.4.0"     
  limits:
  - conditions: ["get-toy == yes"]
    max_value: 2
    namespace: toystore-app
    seconds: 30
    variables: []
  - conditions:
    - "admin == yes"
    max_value: 2
    namespace: toystore-app
    seconds: 30
    variables: []
  - conditions: ["vhaction == yes"]
    max_value: 6
    namespace: toystore-app
    seconds: 30
    variables: []

the limitador's operator would be responsible of reconciling the content of spec.limits with the content config map mounted in the limitador's pod.

Kuadrant context

Kuadrant users define their limits in the kuadrant API: RateLimitPolicy.

Kuadrant installation owns a Limitador deployment with at least one pod running. This limitador deployment is managed via a Limitador CR. The namespace and names of this Limitador CR is known by the kuadrant controller (kuadrant control plane).

Thus, when a user creates a RateLimitPolicy and adds some limits in it, the following happens behind the scenes.

a) The kuadrant-controller reads the RLP and reconciles the limits with the list in the owned Limitador CR following kuadrant rules. As an example for those rules, the namespace will be set by kuadrant and not exposed in the RLP. When one limit is added/updated/removed in the RLP, that limit is added/updated/removed from the Limitador CR.

b) The limitador's operator will reconcile the limits in the Limitador CR with a configmap that gets mounted in the deployment as a local file for the limitador process. The limitador's operator gets notifications when the Limtador CR changes. When one limit is added/updated/removed in the Limitador's CR, that limit is added/updated/removed from the config/map. That effectively changes the content of the local file.

cc @alexsnaps @didierofrivia @rahulanand16nov

Kubebuilder doesn't support darwin/arm64

The kubebuilder-tools does not support dawrin/arm64 just yet, we need to do a workaround until this is fixed: kubernetes-sigs/controller-runtime#1657

Setup CodeCov integration

https://app.codecov.io/gh/Kuadrant

Storage redis URL from Secret is transformed into plain container command leaking possible password

I have been successfully testing limitador-operator v0.6.0, and I have identified a possible not intended credentials leak in the deployment container command.

I deployed the following CR called cluster:

apiVersion: limitador.kuadrant.io/v1alpha1
kind: Limitador
metadata:
  name: cluster
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                app: limitador
                limitador-resource: cluster
            topologyKey: kubernetes.io/hostname
          weight: 100
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                app: limitador
                limitador-resource: cluster
            topologyKey: topology.kubernetes.io/zone
          weight: 99
  limits:
    - conditions: []
      max_value: 400
      namespace: kuard
      seconds: 1
      variables:
        - per_hostname_per_second_burst
  listener:
    grpc:
      port: 8081
    http:
      port: 8080
  pdb:
    maxUnavailable: 1
  replicas: 3
  resourceRequirements:
    limits:
      cpu: 500m
      memory: 64Mi
    requests:
      cpu: 250m
      memory: 32Mi
  storage:
    redis:
      configSecretRef:
        name: redisconfig

Then redis storage is configured on an external secret with the connection string set at URL, and I guess it is a secret and not a configmap, because connection string to connect to redis might have use user/password:

apiVersion: v1
kind: Secret
metadata:
  name: redisconfig
stringData:
  URL: redis://127.0.0.1/a # Redis URL of its running instance
type: Opaque

However, instead of mounting the secret on the deployment and extract the URL into possibly an ENVVAR, it is taking the URL from the secret, and configure it directly on the container command showing its plain value (even if it possibly has a secret password):

          command:
            - limitador-server
            - /home/limitador/etc/limitador-config.yaml
            - redis
            - 'redis://redis:6379'

My recommendation would be to extract its value like any standard deployment and inject it on an ENVVAR maybe, something similar to:

          env:
            - name: URL
              valueFrom:
                secretKeyRef:
                  name: limits-config-cluster
                  key: URL

And then, you will need also to update how to use its value from the container command.

test: integration test flaky on updating resource

Sometimes integration test can fail due to resource conflict when updating resource as the controller can be still reconciling the resource from the creation event

  [FAILED] Expected success, but got an error:
      <*errors.StatusError | 0xc0003570e0>: 
      Operation cannot be fulfilled on limitadors.limitador.kuadrant.io "a6bde1452-63e2-4061-bf3b-db842720cee8": the object has been modified; please apply your changes to the latest version and try again
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Operation cannot be fulfilled on limitadors.limitador.kuadrant.io \"a6bde1452-63e2-4061-bf3b-db842720cee8\": the object has been modified; please apply your changes to the latest version and try again",
              Reason: "Conflict",
              Details: {
                  Name: "a6bde1452-63e2-4061-bf3b-db842720cee8",
                  Group: "limitador.kuadrant.io",
                  Kind: "limitadors",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 409,
          },
      }
  In [It] at: /home/runner/work/limitador-operator/limitador-operator/controllers/limitador_controller_test.go:333 @ 09/13/23 13:15:21.778

Workflow job where this happened:

https://github.com/Kuadrant/limitador-operator/actions/runs/6172979775/job/16754312293

Limitador Service Settings

When a Limitador object is created, the operator creates a Deployment where the actual Limitador Instance resides, and a Service that exposes it. The values of port and name are hardcoded, thus making it impossible to define specific values.
It should be possible to set which ports and protocols will be set to the Limitador Service and also possible to spawn multiple Services by namespace.

Issues with make commands

There are a couple of issues with the make commands with newer go versions (>1.16):

go get won't install the binaries, needs to be replaced by go install
kustomize version fails to install with go install

go: sigs.k8s.io/kustomize/kustomize/[email protected] (in sigs.k8s.io/kustomize/kustomize/[email protected]):
	The go.mod file for the module providing named packages contains one or
	more exclude directives. It must not contain directives that would cause
	it to be interpreted differently than if it were the main module.

operator-sdk signature not properly formatted

Primary key fingerprint: 3B2F 1481 D146 2380 80B3  46BB 0529 96E2 A20B 5C7E
     Subkey fingerprint: 8613 DB87 A5BA 825E F3FD  0EBE 2A85 9D08 BF98 86DB
sha256sum: 'standard input': no properly formatted checksum lines found

install-operator-sdk.sh fails with macos old default bash

bash: syntax error near unexpected token `;;'

Different naming convention on resources created by limitador-operator

I have been successfully testing limitador-operator v0.6.0, and I have identified some inconsistencies with resource name's created by the operator.

I deployed the following CR called cluster:

apiVersion: limitador.kuadrant.io/v1alpha1
kind: Limitador
metadata:
  name: cluster
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                app: limitador
                limitador-resource: cluster
            topologyKey: kubernetes.io/hostname
          weight: 100
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                app: limitador
                limitador-resource: cluster
            topologyKey: topology.kubernetes.io/zone
          weight: 99
  limits:
    - conditions: []
      max_value: 400
      namespace: kuard
      seconds: 1
      variables:
        - per_hostname_per_second_burst
  listener:
    grpc:
      port: 8081
    http:
      port: 8080
  pdb:
    maxUnavailable: 1
  replicas: 3
  resourceRequirements:
    limits:
      cpu: 500m
      memory: 64Mi
    requests:
      cpu: 250m
      memory: 32Mi
  storage:
    redis:
      configSecretRef:
        name: redisconfig

And I saw that most created resources follow's naming convention of limitador- prefix plus $CR_NAME:

We know it is a limitador resource (in general), thanks to the limitador- prefix
Specifically we know it belongs to cluster instance, thanks to $CR_NAME

In that particual case it would be limitador-cluster, so these are 2 of the created resources:

Service name: limitador-cluster
PDB: limitador-cluster

Actually this same logic is applied to all label selectors, where there are 2 labels:

            labelSelector:
              matchLabels:
                app: limitador
                limitador-resource: cluster

However, there are 2 cases in which this naming convention is not addressed:

Deployment: cluster (without limitador- prefix)
- From my point of view, it is really important having easily the possibility to identify what does a pod by the name, in that case as limitador- prefix is not being added, having pods whose name is just the CR_NAME (that can be anything) can be missleading
Configmap: limits-config-cluster (without limitador- prefix)
- From my point of view, it would be more easier knowing the purpose of the configmap if the same limitador- prefix is used to all created resources, including this configmap
- Suggested name: limitador-limits-config-cluster

Limitador config file

After a few changes on Limitador, it doesn't provide the http endpoints to set up the limits and only listen to changes on a config file. This config file will be mounted when deployed and will be provided by a ConfigMap which is reconciled by the limitador-controller reading from the Limitador CR Spec.limits.

This is more or less how this ConfigMap should look like:

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: limitador
  name: envoy
data:
  limitador-config.yaml: |
  limits:
  - conditions: ["get-toy == yes"]
    max_value: 2
    namespace: toystore-app
    seconds: 30
    variables: []
  - conditions:
    - "admin == yes"
    max_value: 2
    namespace: toystore-app
    seconds: 30
    variables: []
  - conditions: ["vhaction == yes"]
    max_value: 6
    namespace: toystore-app
    seconds: 30
    variables: []

CI/CD workflows

We want to improve automation in all repos for the Kuadrant components. We're aiming for:

good coverage of automation tasks related to code style, testing, CD/CD (image builds, releases), etc
consistency across components
automation as manageable code – i.e. less mouse clicks across scaterred UI "settings" pages and more Gitops, more YAMLs hosted as part of a code base.

As part of a preliminary investigation (Kuadrant/kuadrant-operator#21) of the current state of such automation, the following desired workflows and corresponding status for the Limitador Operator repo were identified. Please review the list below.

A. Linters & Code style
- A1. Built-in auto-format (go fmt, go vet, cargo fmt)
- A2. 3rd-party linters (Reviewdog, golangci-lint, Clippy - Trailing whitespaces, EOF newline, Language-specific)
- A3. Spelling (client9/misspell)
- A4. Language (Woke)
B. Tests
- B1. Unit tests (go test, cargo test)
- B2. End-to-end tests (build an image → deploy → run integration tests)
- B3. Smoke tests (deploy image from quay.io/kuadrant → run integration tests)
- B4. Performance & Benchmarks
C. CI/CD
- C1. Build and push image to quay.io/kuadrant
- C2. Deploy (shared Kubernetes/OpenShift cluster in the cloud)
D. Release
- D1. Release (tag, version)
E. Code scan (vulnerability, dependency updates)
- E1. Dependabot alerts
- E2. Dependabot version updates
- E3. Dependabot security updates
- E4. Code vulnerability scan (CodeQL, Red Hat Dependency Analytics)
- E5. Static Code Analysis (SonarQube)
F. Badges
- F1. Test status & coverage
- F2. Code analysis (dependencies, security)

Workflows do not have to be implemented exactly as in the list. The list is just a driver for the kind of tasks we want to cover. Each component should assess it as it makes sense considering the component's specificities. More details in the original epic: Kuadrant/kuadrant-operator#21.

You may also want to use this issue to reorganize how current workflows are implemented, thus helping us make the whole thing consistent across components.

For an example of how Authorino and Authorino Operator intend to organise this for Golang code bases, see respectively Kuadrant/authorino#351 (comment) and Kuadrant/authorino-operator#96 (comment).

Improve Observability of Limitador resources in kubectl output

Enhance the observability of Limitador resources by adding custom printcolumn annotations to the CRD. This will allow key status and configuration details to be easily displayed in the kubectl get output.

Deployment reconcile inconsistencies for sidecars

Current state allows the user to add sidecar containers to the limitador deployment that is being managed by limitador-operator. There are two states that can be created from adding sidecars.

State One

When the sidecar is defined as the first the container in the deployment, the operator will update the configuration to have the values expected for limitador. It however does not change the name for the container.

This means the user defined container will not be created and two containers for limitador will be created in the same pod. This causes a conflict on ports and is not be in an error state.

State Two

If the user defines the sidecar second container in the list after the limitador configuration, the side car is created as expected. This method works and the limitador-operator does not override the sidecar configuration.

Sidecar creation is depended on ordering of containers in the deployment configuration.

Authorino Deployment reconcile for comparison

If a user tries to add a sidecar to the Authorino deployment in any order, the authorino-operator reverts the changes and removes any user defined configuration.

Expected behaviour

For consistency between products the expected behaviour would be to revert any user defined configuration changes to limitador deployment CRs.

Production-ready: Configure Pod Affinity

In 3scale SaaS we have been using successfully limitador for a couple of years together with Redis, to protect all our public endpoints. However:

We are using an old image community image
Yamls are managed individually via ArgoCD

We would like to update how we manage limitador application, and use the most recommended limitador setup using limitador-operator, with a production-ready grade.

Current limitador-operator:

Do not configure any pod affinity by default
Do not permit to configure any pod affinity via CR

Desired features:

Permit to configure pod affinity via CR
Being thought the operator to possibly having a single limitador pod running at once, maybe pod affinity should not be configured by default

3scale SaaS specific example

Example of pod affinity used in 3scale SaaS production to manage between 3,500 and 5,500 requests/second with 3 limitador pods (selector labels need to coincide with the labels managed right now by limitador-operator):

...
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: limitador
                topologyKey: kubernetes.io/hostname
            - weight: 99
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: limitador
                topologyKey: topology.kubernetes.io/zone
...

That way, we "try" (preferred) to balance the 3 limitador pods to be distributed on different worker nodes (hostname) from different AWS Availability Zones (zone) having a good fault tolerant High Availability, but without forcing it, which means that if by some reason kube-scheduler can not satisfy this distribution because current nodes are quite full of resources usage... kube-scheduler will try to satisfy this pod distribution on best effort mode with no guarantee, but at least guaranteeing that the 3 pods will be scheduled elsewhere

Production-ready: Configure PDB

In 3scale SaaS we have been using successfully limitador for a couple of years together with Redis, to protect all our public endpoints. However:

We are using an old image community image
Yamls are managed individually via ArgoCD

We would like to update how we manage limitador application, and use the most recommended limitador setup using limitador-operator, with a production-ready grade.

Current limitador-operator:

Do not configure PDB by default
Do not permit to configure PDB via CR

Desired features:

Permit to configure PDB via CR
Being thought the operator to possibly having a single limitador pod running at once, maybe PDB should not be enabled by default
PDB helps when there is more than 1 replica, because it ensures that upon a cluster maintenance where nodes are being updated one after one, there is always a minimum/maximum number of pod replicas providing service so not having a downtime

3scale SaaS specific example

Example of PDB used in 3scale SaaS production to manage between 3,500 and 5,500 requests/second with 3 limitador pods (selector labels need to coincide with the labels managed right now by limitador-operator):

kind: PodDisruptionBudget
apiVersion: policy/v1
metadata:
  name: limitador
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: limitador
  maxUnavailable: 1

Possible CR config

apiVersion: limitador.kuadrant.io/v1alpha1
kind: Limitador
metadata:
  name: limitador-sample
spec:
  pdb:
    maxUnavailable: 1
    minAvailable: 2  # Note this field is mutually exclusive setting with "minAvailable", normally better use maxUnavailable, only one of them can be used at the same time

Example how we externalize PDB config in 3scale SaaS Operator CR.

https://github.com/3scale-ops/saas-operator/blob/main/docs/api-reference/reference.asciidoc#k8s-api-github-com-3scale-saas-operator-api-v1alpha1-poddisruptionbudgetspec

Production-ready: Configure deployment resources

In 3scale SaaS we have been using successfully limitador for a couple of years together with Redis, to protect all our public endpoints. However:

We are using an old image community image
Yamls are managed individually via ArgoCD

We would like to update how we manage limitador application, and use the most recommended limitador setup using limitador-operator, with a production-ready grade.

Current limitador-operator

Do not configure any resources cpu/memory requests/limits by default
Do not permit to configure any resources cpu/memory requests/limits via CR

Desired features

Configure resources cpu/memory requests/limits by default
Permit to modify (or remove) the resources cpu/memory requests/limits via CR

3scale SaaS specific example

Example of resources used currently in 3scale SaaS production to manage between 3,500 and 5,500 requests/second with 3 limitador pods:

          resources:
            requests:
              cpu: 250m
              memory: 32Mi
            limits:
              cpu: 500m
              memory: 64Mi

Real resources usage:

Memory: 35MB (100% stable)
CPU: 150m-350m (average 200m)
Suffered CPU throttling with cpu.limit=500m? Never, meanign that 50m is a good default cpu.limit that can "guarantee" with a decent traffic 0 cpu throttling

CPU graphs current 3scale SaaS CPU usage:

Memory graphs current 3scale SaaS memory usage:

Flaky test: delete a rate limit

Unfortunately I realized after merging. The test passed in the PR and locally but failed when merged to the main branch.

https://github.com/3scale-labs/limitador-operator/blob/bfa64bdf8595d085368a1da5593d81d2946589e0/controllers/ratelimit_controller_test.go#L149

Production-ready: Configure Observability

In 3scale SaaS we have been using successfully limitador for a couple of years together with Redis, to protect all our public endpoints. However:

We are using an old image community image
Yamls are managed individually via ArgoCD

We would like to update how we manage limitador application, and use the most recommended limitador setup using limitador-operator, with a production-ready grade.

Current limitador-operator (at least the version `0.4.0` that we use):

Provides a few prometheus metrics in the HTTP port
Do not create a prometheus PodMonitor by default
Do not create a GrafanaDashboard by default
Do not permit to create a prometetheus PodMonitor via CR
Do not permit to create a GrafanaDashboard via CR

Desired features:

Permit to create a prometheus PodMonitor via CR
Permit to create a GrafanaDashboard via CR
Being observability something optional, might not be enabled by default

3scale SaaS specific example

Example of the PodMonitor used in 3scale SaaS production to manage between 3,500 and 5,500 requests/second with 3 limitador pods (selector labels need to coincide with the labels managed right now by limitador-operator):

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: limitador
spec:
  podMetricsEndpoints:
    - interval: 30s
      path: /metrics
      port: http
      scheme: http
  selector:
    matchLabels:
      app.kubernetes.io/name: limitador

Possible CR config

Both PodMonitor and GrafanaDashboard should be able to be customized via CR, but use default sane values if they are enabled, so you dont need to provide all the config if you dont want, and want to trust on defaults.

PodMonitor possible customization:
- enabled: true/false
- interval: how often prometheus-operator will scrape limitador pods (have an impact on prometheus memory/timeseries database sizes)
- labelSelector: sometimes prometheus-operator is configured to scrape PodMonitors/ServiceMonitors with specific label selectors
GrafanaOperator possible customization:
- enabled: true/false
- labelSelector: sometimes grafana-operator is configured to scrape GrafanaDashboards with specific label selectors

apiVersion: limitador.kuadrant.io/v1alpha1
kind: Limitador
metadata:
  name: limitador-sample
spec:
  podMonitor:
    enabled: true  # by default it is false, so does not create a PodMonitor
    interval: 30s # by default it is 30 if not defined
    labelSelector: XX ## by default not define any label/selector
    ...  ## maybe in the future permit to override more PodMonitor fields if needed, dont think anymore is needed by now
  grafanaDashboard:
    enabled: true
    labelSelector: XX ## by default not define any label/selector

The initial dashboard would be provided by us initially (3scale SRE), can be embedded into operator as an asset, like done with 3scale-operator.

Current Dashboard screenshots including limitador metrics by limitador_namespace (the app being limited), and also pods, resources cpu/mem/net metrics:

PrometheusRules (aka prometheus alerts)

Regarding PrometheusRules (prometheus alerts), my advise is to not embed them into the operator, but provide in the repo a yaml with an example of possible alerts that can be deployed, tuned... by the app administrator if needed.

Example:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: limitador
spec:
  groups:
    - name: limitador.rules
      rules:
        - alert: LimitadorJobDown
          annotations:
            message: Prometheus Job {{ $labels.job }} on {{ $labels.namespace }} is DOWN
          expr: up{job=~".*limitador.*"} == 0
          for: 5m
          labels:
            severity: critical

        - alert: LimitadorPodDown
          annotations:
            message: Limitador pod {{ $labels.pod }} on {{ $labels.namespace }} is DOWN
          expr: limitador_up == 0
          for: 5m
          labels:
            severity: critical

Deploy limitador passing info via command line params instead of env vars

Currently the deployment adds env vars to configure the pod

containers:
        - env:
            - name: RUST_LOG
              value: info
            - name: LIMITS_FILE
              value: /home/limitador/etc/limitador-config.yaml

Use the CLI parameters instead:

Limitador Server v1.0.0-dev (28a77d29)  debug build
The Kuadrant team - github.com/Kuadrant
Rate Limiting Server

USAGE:
    limitador-server [OPTIONS] <LIMITS_FILE> [STORAGE]

ARGS:
    <LIMITS_FILE>    The limit file to use

OPTIONS:
    -b, --rls-ip <ip>              The IP to listen on for RLS [default: 0.0.0.0]
    -p, --rls-port <port>          The port to listen on for RLS [default: 8081]
    -B, --http-ip <http_ip>        The IP to listen on for HTTP [default: 0.0.0.0]
    -P, --http-port <http_port>    The port to listen on for HTTP [default: 8080]
    -l, --limit-name-in-labels     Include the Limit Name in prometheus label
    -v                             Sets the level of verbosity
    -h, --help                     Print help information
    -V, --version                  Print version information

STORAGES:
    memory          Counters are held in Limitador (ephemeral)
    redis           Uses Redis to store counters
    redis_cached    Uses Redis to store counters, with an in-memory cache

Manifests.yaml for deploying the operator

When generating the manifests, a multi-document single manifest.yaml file can be generated (and commited), allowing to ease installing the Limitador Operator without having to clone the repo. Similarly, by adding as well Namespace, Deployment, etc (i.e. the resources produced by make deploy) to that same or a second manifest.yaml file, deploying the Limitador Operator can be done directly from one YAML file hosted remotely.

Usually the only customization involved when deploying is the operator image, which can default to either latest or to any last released version of the operator available from the registry (quay.io/kuadrant/limitador-operator), instead of controller:latest, currently hard-coded and only meaningful for devs workflow building the operator locally.

This would be analogous to https://github.com/Kuadrant/authorino-operator/blob/b66abee89a325819442c07af5f36aa05b4eba30d/Makefile#L72-L73 (generates config/install/manifests.yaml) and https://github.com/Kuadrant/authorino-operator/blob/b66abee89a325819442c07af5f36aa05b4eba30d/Makefile#L161 (generates config/deploy/manifests.yaml).

In the exemplified case of Authorino Operator, it's more complicated because it even downloads manifests hosted in the main Authorino repo (e.g. for the AuthConfig CRD). However, for Limitador Operator, this is not needed, as the repo has all it needs to generate the manifests, making it even simpler to implement.

kuadrant / limitador-operator Goto Github PK

limitador-operator's Introduction

kuadrant

limitador-operator's People

Contributors

Stargazers

Watchers

Forkers

limitador-operator's Issues

Summary

Motivation

Guide-level explanation

Reference-level explanation

Drawbacks

Rationale and alternatives

Prior art

Unresolved questions

Future possibilities

Kuadrant RFCs

When is an RFC required?

When no RFC is required?

The RFC process

The RFC lifecycle

Implementation

Amendments

RFC Template

Summary

Motivation

Guide-level explanation

Reference-level explanation

Drawbacks

Rationale and alternatives

Prior art

Unresolved questions

Future possibilities

Current limitador-operator:

Desired features :

Possible CR config

Context

Source of the config map

Kuadrant context

Deployment reconcile inconsistencies for sidecars

State One

State Two

Authorino Deployment reconcile for comparison

Expected behaviour

Current limitador-operator:

Desired features:

3scale SaaS specific example

Current limitador-operator:

Desired features:

3scale SaaS specific example

Possible CR config

Current limitador-operator

Desired features

3scale SaaS specific example

CPU graphs current 3scale SaaS CPU usage:

Memory graphs current 3scale SaaS memory usage:

Current limitador-operator (at least the version 0.4.0 that we use):

Desired features:

3scale SaaS specific example

Possible CR config

PrometheusRules (aka prometheus alerts)

Recommend Projects

Recommend Topics

Recommend Org

Jobs

Current limitador-operator (at least the version `0.4.0` that we use):