
This GKE controller provides simple custom integration between GKE and GCLB.

License: Apache License 2.0


gke-autoneg-controller's Introduction

Autoneg GKE controller


autoneg provides simple custom integration between GKE and Google Cloud Load Balancing (both external and internal). autoneg is a GKE-specific Kubernetes controller which works in conjunction with the GKE Network Endpoint Group (NEG) controller to manage integration between your Kubernetes service endpoints and GCLB backend services.

GKE users may wish to register NEG backends from multiple clusters into the same backend service, orchestrate advanced deployment strategies in a custom or centralized fashion, or offer the same service via both a protected public endpoint and a more permissive internal endpoint. autoneg enables these use cases.

How it works

autoneg depends on the GKE NEG controller to manage the lifecycle of NEGs corresponding to your GKE services. autoneg will associate those NEGs as backends to the GCLB backend service named in the autoneg configuration.

Since autoneg depends explicitly on the GKE NEG controller, it inherits the same scope: autoneg acts only on Kubernetes Services that have been annotated with autoneg configuration, and does not make any changes in response to Pods or Deployments. Only changes to the Service will trigger any action by autoneg.

On deleting the Service object, autoneg will deregister NEGs from the specified backend service, and the GKE NEG controller will then delete the NEGs.

Using Autoneg

Two annotations are required in your Kubernetes service definition:

  • cloud.google.com/neg enables the GKE NEG controller; configure it to create standalone NEGs (via exposed_ports)
  • controller.autoneg.dev/neg specifies name and other configuration
    • Previous versions used the anthos.cft.dev/autoneg annotation; it is still supported but deprecated and will be removed in a future release.

Example annotations

metadata:
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{},"443":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"http-be","max_rate_per_endpoint":100}],"443":[{"name":"https-be","max_connections_per_endpoint":1000}]}}'
    # For L7 ILB (regional) backends
    # controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"http-be","region":"europe-west4","max_rate_per_endpoint":100}],"443":[{"name":"https-be","region":"europe-west4","max_connections_per_endpoint":1000}]}}'

Once configured, autoneg will detect the NEGs that are created by the GKE NEG controller, and register them with the backend service specified in the autoneg configuration annotation.

Only the NEGs created by the GKE NEG controller will be added to or removed from your backend service. This mechanism should be safe to use across multiple clusters.

By default, autoneg initializes the capacityScaler to 1, meaning the new backend receives a proportional share of traffic according to its maximum rate or connections per endpoint configuration. You can change this default by supplying the initial_capacity option, which is useful for steering traffic in blue/green deployment scenarios. The capacityScaler mechanism can be used to manage traffic shifting in use cases such as deployment or failover.
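
For example, in a blue/green scenario where another cluster's backends are already registered with the backend service, the new cluster's service can register itself with zero initial capacity and be ramped up later. A minimal sketch, reusing the http-be backend service from the example above; the port and rate values are illustrative assumptions:

metadata:
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
    # Start at 0% capacity; an operator later raises the capacityScaler to shift traffic.
    controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"http-be","max_rate_per_endpoint":100,"initial_capacity":0}]}}'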

Autoneg Configuration

Specify options to configure the backends representing the NEGs that will be associated with the backend service. These options correspond to fields in the backends section of the backend service REST resource definition; only the options listed here are available in autoneg.

Options

Autoneg annotation options

  • name: optional. The name of the backend service to register backends with.
    • If the --enable-custom-service-names flag (which defaults to true) is set to false, any name values specified in new autoneg annotations are ignored and the template-generated names are used instead.
    • For the old anthos.cft.dev/autoneg annotation, the default name is the service name.
    • If name is omitted, it defaults to a value generated according to --default-backendservice-name.
  • region: optional. Used to specify that this is a regional backend service.
  • max_rate_per_endpoint: required/optional. Integer specifying the maximum request rate a pod can handle. Specify either the rate option or the connections option, not both.
  • max_connections_per_endpoint: required/optional. Integer specifying the maximum number of connections a pod can handle. Specify either the rate option or the connections option, not both.
  • initial_capacity: optional. Integer configuring the initial capacityScaler, expressed as a percentage between 0 and 100. If set to 0, the backend service will not receive any traffic until an operator or other service adjusts the capacityScaler setting. Note that unless the backend service already has existing backends, you cannot set initial_capacity to zero (at least one backend must have a capacity higher than zero).
  • capacity_scaler: optional. autoneg manages the capacityScaler setting if both this option and the controller.autoneg.dev/sync: '{"capacity_scaler":true}' annotation are set on the service. Note that a capacityScaler value updated out of band (e.g. via gcloud) will not be overridden until you change capacity_scaler (or another value) in the service configuration. See the sketch after this list.
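
As a sketch of the last two options, the annotations below ask autoneg to manage the capacityScaler and hold it at 50%. This assumes capacity_scaler takes a percentage between 0 and 100, like initial_capacity; the http-be backend service name is reused from the earlier example:

metadata:
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
    # capacity_scaler value assumed to be a percentage (0-100), as with initial_capacity.
    controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"http-be","max_rate_per_endpoint":100,"capacity_scaler":50}]}}'
    # Opt autoneg in to managing the capacityScaler setting.
    controller.autoneg.dev/sync: '{"capacity_scaler":true}'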

Controller parameters

The controller parameters can be customized by editing the controller Deployment (see the sketch after this list).

  • --enable-custom-service-names: optional. Enables defining the backend service name in the autoneg annotation. If set to false (via --enable-custom-service-names=false), the name option in the autoneg annotation is ignored and the backend service name is determined by --default-backendservice-name (see below).
  • --default-backendservice-name: optional. Sets the backend service name template used when name is not specified or when --enable-custom-service-names is set to false. The template defaults to {name}-{port} and may contain {namespace}, {name}, {port} and {hash}; the non-hash values are truncated evenly if the full name would exceed 63 characters. {hash} is generated from the full-length namespace, name and port to avoid name collisions after truncation.
  • --max-rate-per-endpoint: optional. Sets a default value for max-rate-per-endpoint that can be overridden by user config. Defaults to 0.
  • --max-connections-per-endpoint: optional. Same as above but for connections.
  • --always-reconcile: optional. Makes it possible to reconcile periodically even if the status annotations don't change. Defaults to false.
  • --reconcile-period: optional. Sets a reconciliation duration if always-reconcile mode is on. Defaults to 10 hours.
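
As a rough sketch, these flags are passed as arguments to the manager container in the controller Deployment. The excerpt below shows only the container spec, and the flag values are illustrative assumptions, not defaults:

spec:
  containers:
    - name: manager
      command:
        - /manager
      args:
        # Illustrative values; adjust to your environment.
        - --enable-custom-service-names=false
        - "--default-backendservice-name={namespace}-{name}-{port}-{hash}"
        - --max-rate-per-endpoint=100
        - --always-reconcile=true
        - --reconcile-period=1h  # duration format assumed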

IAM considerations

As autoneg calls GCP APIs, you must ensure that the controller is authorized to call them. To follow the principle of least privilege, it is recommended that you configure your cluster with Workload Identity so that autoneg's permissions are limited to those of a dedicated GCP service account. If you choose not to use Workload Identity, you will need to create your GKE cluster with the "cloud-platform" scope.

Security considerations

  • Since the GKE cluster requires IAM permissions to manipulate backend services in the project, users may be able to register their services with any available backend service in the project. You can restrict which backend services may be used by disabling --enable-custom-service-names and customizing the backend service name template.

Installation

First, set up the GCP resources necessary to support Workload Identity by running the script:

PROJECT_ID=myproject deploy/workload_identity.sh

If you are using Shared VPC, ensure that the autoneg-system service account has the compute.networkUser role in the Shared VPC host project:

gcloud projects add-iam-policy-binding \
  --role roles/compute.networkUser \
  --member "serviceAccount:autoneg-system@${PROJECT_ID}.iam.gserviceaccount.com" \
  ${HOST_PROJECT_ID}

Lastly, on each cluster in your project where you'd like to install autoneg (version v1.1.0), run these two commands:

kubectl apply -f deploy/autoneg.yaml

kubectl annotate sa -n autoneg-system autoneg-controller-manager \
  iam.gke.io/gcp-service-account=autoneg-system@${PROJECT_ID}.iam.gserviceaccount.com

This will create all the Kubernetes resources required to support autoneg and annotate the autoneg-controller-manager service account in the autoneg-system namespace to associate it with a GCP service account using Workload Identity.

Installation via Terraform

You can use the Terraform module in terraform/autoneg to deploy Autoneg in a GKE cluster of your choice. An end-to-end example is provided in the terraform/test directory as well (simply set your project_id).

Example:

provider "google" {
}

provider "kubernetes" {
  cluster_ca_certificate = "..."
  host                   = "..."
  token                  = "..."
}

module "autoneg" {
  source = "github.com/GoogleCloudPlatform/gke-autoneg-controller//terraform/autoneg"

  project_id = "your-project-id"
  
  # NOTE: You may need to build your own image if you rely on features merged between releases, and do
  # not wish to use the `latest` image.
  controller_image = "ghcr.io/googlecloudplatform/gke-autoneg-controller/gke-autoneg-controller:v1.1.0"
}

Installation via Helm charts

A Helm chart is also provided in deploy/chart and via the https://googlecloudplatform.github.io/gke-autoneg-controller/ Helm repository.

You can also use it with Terraform like this:

module "autoneg" {
  source = "github.com/GoogleCloudPlatform/gke-autoneg-controller//terraform/gcp?ref=master"

  project_id         = module.project.project_id
  service_account_id = "autoneg"
  workload_identity = {
    namespace       = "autoneg-system"
    service_account = "autoneg-controller-manager"
  }
  # To add shared VPC configuration, also set shared_vpc variable
}

resource "helm_release" "autoneg" {
  name       = "autoneg"
  chart      = "autoneg-controller-manager"
  repository = "https://googlecloudplatform.github.io/gke-autoneg-controller/"
  namespace  = "autoneg-system"

  create_namespace = true

  set {
    name  = "createNamespace"
    value = false
  }
  
  set {
    name  = "serviceAccount.annotations.iam\\.gke\\.io/gcp-service-account"
    value = module.autoneg.service_account_email
  }

  set {
    name  = "serviceAccount.automountServiceAccountToken"
    value = true
  }
}

Customizing your installation

autoneg is based on Kubebuilder, and as such, you can customize and deploy autoneg according to the Kubebuilder "Run It On the Cluster" section of the Quick Start. autoneg does not define a CRD, so you can skip any Kubebuilder steps involving CRDs.

The included deploy/autoneg.yaml is the default output of Kubebuilder's make deploy step, coupled with a public image.

Do keep in mind the additional configuration to enable Workload Identity.
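
If you manage the manifests yourself, the Workload Identity binding applied by the kubectl annotate command above can instead be expressed directly on the ServiceAccount. A minimal sketch (substitute your own project ID):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: autoneg-controller-manager
  namespace: autoneg-system
  annotations:
    # Associates this Kubernetes service account with the GCP service account via Workload Identity.
    iam.gke.io/gcp-service-account: autoneg-system@PROJECT_ID.iam.gserviceaccount.com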

gke-autoneg-controller's People

Contributors

agh95, arrase, clintonb, darkstarmv, dependabot[bot], derektamsen, erikjoh, fbaier-fn, fdfzcq, ferrastas, glerchundi, golemiso, ichbinfrog, jawnsy, ksemele, pburgisser, rojomisin, romankor, rosmo, same-id, stekole, yuriykishchenko


gke-autoneg-controller's Issues

Question: "Successfully Reconciled" for non-NEG services ?

Hello,
What is the meaning of the following log, provided that the bar service in foo namespace is not a Network Endpoint Group?

DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "service", "request": "foo/bar}

Configurable default capacityScaler

Hello,

Thanks for this excellent project!

At Prefect, we are using this controller to perform traffic splitting and blue/green deployments. As far as I can tell, the default (and hard-coded) behavior of this controller is such that traffic will be immediately load balanced between all participating services once the health checks succeed, and we would instead prefer to gradually shift traffic over by configuring the split ratio.

This seems to be the relevant code:

return compute.Backend{
	Group:              group,
	BalancingMode:      "RATE",
	MaxRatePerEndpoint: s.AutonegConfig.BackendServices[port][name].Rate,
	CapacityScaler:     1,
}

Would you accept a PR that adds configurability for the InitialCapacity value? We could store that in the existing AutonegConfig object.

I have two use cases in mind for this feature:

  • Bringing up an application and cluster in a new zone/region
  • Creating a new cluster for a blue/green deployment

In both cases, we want to gradually shift some traffic to the new cluster and monitor error rates.

With the current behavior, if we use the same connection rate settings for the service, then bringing up a new cluster would take an equal proportion of traffic (e.g. live cluster A processing 100% of requests, bringing up a new cluster B and attaching to the same NEG will result in a 50%/50% split.) We would like to begin with 100%/0% split, gradually increase the proportion of traffic that Cluster B handles, and then gradually decrease the proportion of traffic that Cluster A handles, to safely transition.

Removing config annotation does not deregister backends

Removing config annotation anthos.cft.dev/autoneg still left the anthos.cft.dev/autoneg-status annotation, and also generated a controller.autoneg.dev/neg-status annotation.

I would expect autoneg to deregister backends, remove the status annotation, and remove the finalizer.

Flaky TestReconcileStatuses because maps don't guarantee order

make test fails occasionally for https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/controllers/autoneg_test.go#L248-L258 because we transform a slice to map then back to a slice (see https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/controllers/autoneg.go#L141) which can mess up the order assumed in the test https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/controllers/autoneg_test.go#L157-L170.

So if we compare the returned slice, sometimes the group of zone2 comes before zone1 and will have the test fail.

I propose modifying isEqual like so:

func (b Backends) isEqual(ob Backends) bool {
	if b.name != ob.name {
		return false
	}
	newB := map[string]compute.Backend{}
	for _, be := range b.backends {
		rb := relevantCopy(be)
		if _, ok := newB[rb.Group]; ok {
			return false
		}
		newB[rb.Group] = rb
	}
	newOB := map[string]compute.Backend{}
	for _, be := range ob.backends {
		rb := relevantCopy(be)
		if _, ok := newOB[rb.Group]; ok {
			return false
		}
		newOB[rb.Group] = rb
	}
	return reflect.DeepEqual(newB, newOB)
}

Multiple NEG/Backend support

A GKE service can have multiple NEGs created for different ports but there isn't a way to associate a specific NEG with a specific backend.

For example, a service exposes two ports. Port 443 for HTTP2/grpc and port 8443 HTTP1.1 for metrics and diagnostics endpoints. Two NEGs can be created for those two ports but there is no way to associate the port 443 NEG with backendA and the port 8443 NEG with backendB.

They must be behind separate backends since they use different protocols (HTTP2 and HTTP) to talk from the backend service to the NEG instances.

Observing ACCESS_TOKEN_SCOPE_INSUFFICIENT when creating service

I have a service defined with

apiVersion: v1
kind: Service
metadata:
  name: frontend-svc
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"443":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"443":[{"name":"https-be","max_connections_per_endpoint":1000}]}}'
spec:
  selector:
    app: frontend-app
  type: NodePort
  ports:
    - protocol: TCP
      port: 443
      targetPort: 3000

when I ran the kubectl command to create it, I observe the following events:

Events:
  Type     Reason        Age                 From                Message
  ----     ------        ----                ----                -------
  Normal   Sync          32s                 autoneg-controller  Synced NEGs for "default/frontend-svc" as backends to backend service "https-be" (port 443)
  Normal   Create        23s                 neg-controller      Created NEG "k8s1-1f4ed5c4-default-frontend-svc-443-9757dbe8" for default/frontend-svc-k8s1-1f4ed5c4-default-frontend-svc-443-9757dbe8--/443-3000-GCE_VM_IP_PORT-L7 in "us-central1-f".
  Warning  BackendError  11s (x13 over 32s)  autoneg-controller  googleapi: Error 403: Request had insufficient authentication scopes.
Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
    "domain": "googleapis.com",
    "metadatas": {
      "method": "compute.v1.BackendServicesService.Get",
      "service": "compute.googleapis.com"
    },
    "reason": "ACCESS_TOKEN_SCOPE_INSUFFICIENT"
  }
]

I can see however that the autoneg IAM role has the permission to perform this operation included:

$ gcloud iam roles describe autoneg --project=$PROJECT_ID
etag: REDACTED
includedPermissions:
- compute.backendServices.get
- compute.backendServices.update
- compute.healthChecks.useReadOnly
- compute.networkEndpointGroups.use
- compute.regionBackendServices.get
- compute.regionBackendServices.update
- compute.regionHealthChecks.useReadOnly
name: projects/${PROJECT_ID}/roles/autoneg
stage: ALPHA
title: autoneg

Any suggestions on how to debug and resolve? What makes this acutely frustrating is that there is no mention of these IAM issues in any of the GCP, GKE, autoneg docs or community forums.

Terraform configuration unusable for multiple clusters

Right now the controller is installed via the Terraform module like this:

module "autoneg" {
  source = "github.com/GoogleCloudPlatform/gke-autoneg-controller//terraform/autoneg"

  project_id = "your-project-id"
}

where the only configurable thing is the project_id.

It creates a custom role for the controller, so the module is unusable when installed on multiple clusters in the same project: the second installation fails because it tries to create the same custom role again.

Use specific service account instead of `default`

This makes it explicit that there is a service account just for the autoneg pod.

Additionally, when provisioning with terraform there is no way to update the annotation on the default service account. A new service account must be created and managed by terraform to add the annotation.

NEGs not being deregistered on Service deletion

I'm experimenting with the following simple nginx service:

apiVersion: v1
kind: Service
metadata:
  name: neg-demo-svc
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"some-backend-service","region":"europe-west2","max_rate_per_endpoint":100}]}}'
spec:
  type: ClusterIP
  selector:
    app: nginx
  ports:
  - port: 80
    protocol: TCP

I have NEG creation and association with the Backend Service working correctly. However, when I'm deleting the service, I'm running into problems. Based on the autoneg-controller logs, it seems to deregister the NEGs successfully:

2022-02-25T16:45:12.987+0100    DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Service","namespace":"default","name":"neg-demo-svc","uid":"affa73d3-10e3-4698-ae02-d1c1c22ad748","apiVersion":"v1","resourceVersion":"420760"}, "reason": "Delete", "message": "Deregistered NEGs for \"default/neg-demo-svc\" from backend service \"argon-gke-general-blue-01-psc-backend-service\" (port 80)"}

I believe that this is related to this logic here:

var intendedBEKeys []string
for k := range intendedBE {
	intendedBEKeys = append(intendedBEKeys, k)
}
sort.Strings(intendedBEKeys)

In this case, intendedBEKeys is empty, which means we're going to skip the entire loop that checks for differences. Shouldn't it also iterate on keys from actualBE to detect the removal of ports?

Please let me know if I'm missing something. Is there anything else I might provide?

Custom role error when recreating project

We have a GCP project that we use to test new infrastructure changes, once the test is done, we tear down all the infrastructure.
We recently switched to using the module provided in the project and we are facing some issues when recreating the custom role.

Received unexpected error:
    FatalError{Underlying: error while running command: exit status 1;
    Error: Error creating the custom project role projects/<project-id>/roles/autonegRegional: googleapi: Error 400: You can't create a role_id (autonegRegional) which has been marked for deletion., failedPrecondition

      with module.dwam_test.module.autoneg[0].module.gcp.google_project_iam_custom_role.autoneg,
      on .terraform/modules/dwam_test.autoneg/terraform/gcp/main.tf line 57, in resource "google_project_iam_custom_role" "autoneg":
      57: resource "google_project_iam_custom_role" "autoneg" {
    }

Reconcile error: network endpoint group in a specific zone not found

We deployed autoneg in one of our clusters running GKE Autopilot. When a workload running on it is scheduled to only some of the available zones, the NEGs are not created in all zones.

That means that autoneg will fail and stop reconciling.

The actual failing log part:

2021-10-22T17:55:52.092Z	INFO	controllers.Service	Applying intended status	{"service": "envoy/envoy", "status": {"backend_services":{"8000":{"myproduct":{"name":"myproduct","max_connections_per_endpoint":1000}}},"network_endpoint_groups":{"8000":"k8s1-4fd3dc4c-envoy-envoy-8000-44f9746b","8001":"k8s1-4fd3dc4c-envoy-envoy-8001-7508741b"},"zones":["europe-west1-b","europe-west1-c","europe-west1-d"]}}
2021-10-22T17:55:52.762Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "service", "request": "envoy/envoy", "error": "googleapi: Error 404: The resource 'projects/myproduct-dev/zones/europe-west1-c/networkEndpointGroups/k8s1-4fd3dc4c-envoy-envoy-8000-44f9746b' was not found, notFound"}

The question is, should autoneg tolerate missing network endpoint groups in some but not in all available zones?

The `README` and `workload_identity.sh` script contain several errors

The following errors are in the README/script

  1. The README specifies you should run "PROJECT=xyz deploy/workload_identity.sh", but the script itself expects an environment variable named "PROJECT_ID"
  2. If role "autoneg" does not exist, "gcloud iam roles update" fails. You should use "gcloud iam roles create" instead.

Manually removed backend does not add itself back

Hello,

After performing some tests, I've noticed that if a backend added to a backend service by the autoneg controller is manually removed, it is not added back; that is to say, the controller does not continuously check that the desired state matches the actual state.

It would be desirable if autoneg periodically reconciled the actual state against the desired state.

Steps to reproduce:

  1. run autoneg controller and sync a NEG to a backend service
  2. manually remove the backend NEG from the backend service, observe that it does not get added back by the autoneg controller

NEGs are not deleted during service deletion

I'm having an issue with autoneg v0.9.9. Everything seemed to be working fine until I deleted a service with autoneg annotations.

I can see deregistering message in logs

DEBUG	events	Normal	{"object": {"kind":"Service","namespace":"test","name":"test-marianh","uid":"d362ac9e-f5dd-48ee-a167-91abf6c1be0a","apiVersion":"v1","resourceVersion":"14606392"}, "reason": "Delete", "message": "Deregistered NEGs for \"test/test-marianh\" from backend service \"autoneg-w1\" (port 80)"}"

but these NEGs will never be deleted or detached from backend service.

$ gcloud compute backend-services list --project=anthos-mesh-sandbox-1b827f9b 
NAME        BACKENDS                                                                                                                                                                                                                                            PROTOCOL
autoneg-w1  europe-west1-b/networkEndpointGroups/k8s1-6b097e81-test-test-marianh-80-1330420d,europe-west1-c/networkEndpointGroups/k8s1-6b097e81-test-test-marianh-80-1330420d,europe-west1-d/networkEndpointGroups/k8s1-6b097e81-test-test-marianh-80-1330420d  TCP

$ gcloud compute network-endpoint-groups list --project=anthos-mesh-sandbox-1b827f9b  | grep marianh
k8s1-6b097e81-test-test-marianh-80-1330420d                      europe-west1-b  GCE_VM_IP_PORT  0
k8s1-6b097e81-test-test-marianh-80-1330420d                      europe-west1-c  GCE_VM_IP_PORT  0
k8s1-6b097e81-test-test-marianh-80-1330420d                      europe-west1-d  GCE_VM_IP_PORT  1

Is there anything I can check to find out why these NEGs persist after service deletion, or is there a cookbook for correctly deleting a service with a NEG annotation?

googleapi: Error 412: Precondition Failed, conditionNotMet

Hi,
I am getting the following error with autoneg controller.
Any idea what the missing precondition could be?

autoneg-controller-manager manager 2019-12-10T19:58:09.561Z    INFO    controllers.Service     Applying intended status        {"service": "foo/bar", "status": {"name":"bar","max_rate_per_endpoint":1000,"network_endpoint_groups":{"80":"k8s1-foo-bar-80-12345a"},"zones":["us-central1-a"]}}
autoneg-controller-manager manager 2019-12-10T19:58:09.990Z    ERROR   controller-runtime.controller   Reconciler error        {"controller": "service", "request": "foo/bar", "error": "googleapi: Error 412: Precondition Failed, conditionNotMet"}

A backend service named bar exists in the project, it has 0 backends at this time.

cc @soellman 🙏

Additional pod and container hardening

We're using the following settings for our deployment:

Container security context:

          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            privileged: false
            readOnlyRootFilesystem: true
            runAsUser: 65532
            runAsGroup: 65532
            runAsNonRoot: true
            seccompProfile:
              type: RuntimeDefault

Pod security context:

      securityContext:
        runAsUser: 65532
        runAsGroup: 65532
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault

Using these as the default would be useful, as it makes this controller installable into a wider number of clusters (e.g. clusters with restrictive admission controllers)

How to deploy?

How exactly does one deploy this into one's clusters and make use of it? In particular with a multi cluster ingress (although that part may be out of scope).

The object has been modified; please apply your changes to the latest version and try again

When I deploy a k8s service with the NEG and autoneg annotations, the NEGs are not registered at the backend service with the following error message:

  Type     Reason        Age                  From                Message
  ----     ------        ----                 ----                -------
  Normal   Create        7m54s                neg-controller      Created NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" for default/app-catfood-k8s1-85a2e695-default-app-catfood-80-fb89a260--web/80-80-GCE_VM_IP_PORT-L7 in "europe-west1-b".
  Normal   Create        7m51s                neg-controller      Created NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" for default/app-catfood-k8s1-85a2e695-default-app-catfood-80-fb89a260--web/80-80-GCE_VM_IP_PORT-L7 in "europe-west1-c".
  Normal   Create        7m49s                neg-controller      Created NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" for default/app-catfood-k8s1-85a2e695-default-app-catfood-80-fb89a260--web/80-80-GCE_VM_IP_PORT-L7 in "europe-west1-d".
  Normal   Attach        7m45s (x2 over 27h)  neg-controller      Attach 1 network endpoint(s) (NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" in zone "europe-west1-b")
  Warning  BackendError  7m44s (x2 over 26h)  autoneg-controller  Operation cannot be fulfilled on services "app-catfood": the object has been modified; please apply your changes to the latest version and try again

When I update the service again, the NEGs are registered:

Normal Sync 1s (x2 over 26h) autoneg-controller Synced NEGs for "default/app-catfood" as backends to backend service "catfood"

I expect the autoneg controller to eventually synchronise to the desired situation without manual intervention.

error CONNECTION balancing mode is not supported for protocol HTTP, invalid

I'm having an issue with autoneg v0.9.9.

When setting the following annotation on the service

annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"gke-development-ingress-backend-web","zones":["us-central1-c"]}]}}'

The service is generating the errors below

Events:
  Type     Reason        Age                From                Message
  ----     ------        ----               ----                -------
  Normal   Sync          32s                autoneg-controller  Synced NEGs for "traefik/traefik" as backends to backend service "gke-development-ingress-backend-web" (port 80)
  Normal   Create        25s (x2 over 15m)  neg-controller      Created NEG "k8s1-c278397d-traefik-traefik-80-230eaa9a" for traefik/traefik-k8s1-c278397d-traefik-traefik-80-230eaa9a--web/80-web-GCE_VM_IP_PORT-L7 in "us-central1-b".
  Warning  BackendError  25s (x9 over 32s)  autoneg-controller  googleapi: Error 404: The resource 'projects/feltboard-terraform/zones/us-central1-c/networkEndpointGroups/k8s1-c278397d-traefik-traefik-80-230eaa9a' was not found, notFound
  Normal   Create        13s (x2 over 15m)  neg-controller      Created NEG "k8s1-c278397d-traefik-traefik-80-230eaa9a" for traefik/traefik-k8s1-c278397d-traefik-traefik-80-230eaa9a--web/80-web-GCE_VM_IP_PORT-L7 in "us-central1-c".
  Normal   Attach        10s (x2 over 14m)  neg-controller      Attach 1 network endpoint(s) (NEG "k8s1-c278397d-traefik-traefik-80-230eaa9a" in zone "us-central1-b")
  Warning  BackendError  3s (x7 over 11m)   autoneg-controller  googleapi: Error 400: Invalid value for field 'resource.backends[0]': '{  "resourceGroup": "https://www.googleapis.com/compute/v1/projects/xx/zones/us-cen...'. CONNECTION balancing mode is not supported for protocol HTTP, invalid

This prevents me from using the dynamic-backend example described in https://github.com/terraform-google-modules/terraform-google-lb-http/tree/v7.0.0/examples/dynamic-backend

Ultimately, I want to expose a Traefik ingress (the web and websecure services) through a Google Cloud Load Balancer.

Mixed logging when configuring structured logging

With the following settings:

          command:
            - /manager
          args:
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
            - --leader-elect
            - --zap-devel=false
            - --zap-encoder=json
            - --zap-time-encoding=rfc3339

The resulting logs are a mix of structured and unstructured logs. In particular, the Kubernetes client and leader election log messages are human-readable, which means that Cloud Logging will be confused about whether these logs are JSON or text.

I0718 03:39:16.676513       1 request.go:601] Waited for 1.004034269s due to client-side throttling, not priority and fairness, request: GET:https://10.30.180.1:443/apis/vulnerabilities.protect.gke.io/v1?timeout=32s
{"level":"info","ts":"2023-07-18T03:39:17Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2023-07-18T03:39:17Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2023-07-18T03:39:17Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2023-07-18T03:39:17Z","msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
I0718 03:39:17.791176       1 leaderelection.go:248] attempting to acquire leader lease autoneg-system/9fe89c94.controller.autoneg.dev...
E0718 03:39:17.801515       1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:21.244438       1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:24.292654       1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:28.356441       1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:32.195193       1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"

Unused ConfigMap in deploy manifest

The deploy manifest includes a ConfigMap that does not appear to be used (it is not mounted to the Deployment). Can it be removed?

apiVersion: v1
data:
  controller_manager_config.yaml: |
    # Copyright 2021 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: controller-runtime.sigs.k8s.io/v1alpha1
    kind: ControllerManagerConfig
    health:
      healthProbeBindAddress: :8081
    metrics:
      bindAddress: 127.0.0.1:8080
    webhook:
      port: 9443
    leaderElection:
      leaderElect: true
      resourceName: 9fe89c94.controller.autoneg.dev
kind: ConfigMap
metadata:
  labels:
    app: autoneg
  name: autoneg-manager-config
  namespace: autoneg-system

Specify non-root user in deployment

The deployment spec should specify the non-root user in the security context following the #20 change:

securityContext:        
  privileged: false        
  allowPrivilegeEscalation: false        
  runAsUser: 1002        
  runAsNonRoot: true

deploy/autoneg.yaml referring incorrect kubernetes service account

Hi Team,

I was using kubectl to install the GKE controller with the manifest provided in the repo at https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/deploy/autoneg.yaml. I found that it creates a new service account named autoneg and references that service account in the Deployment later in the file, but the RoleBinding and ClusterRoleBinding use the default service account instead, which is incorrect. I spent a lot of time investigating before finding this issue. Please update the autoneg.yaml file accordingly.

Regards,
Sreenivas

Update guide or description of compatibility guarantees

Pull request #46 changes some flags, which is a breaking change unless users re-deploy using the Deployment manifest. I'd propose either:

  • Adding a warning that breaking changes can occur, to set user expectations; or,
  • Defining a support strategy (e.g. support both flags for some period of time, deprecating the old one, etc)

Prior to the linked pull request, the code looked like this:

flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,
"Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.")
flag.StringVar(&serviceNameTemplate, "default-backendservice-name", "{name}-{port}",
"A naming template consists of {namespace}, {name}, {port} or {hash} separated by hyphens, "+
"where {hash} is the first 8 digits of a hash of other given information")
flag.BoolVar(&allowServiceName, "enable-custom-service-names", true, "Enable using custom service names in autoneg annotation.")

Afterwards, it now looks like this:

flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.")
flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
flag.BoolVar(&enableLeaderElection, "leader-elect", false,
"Enable leader election for controller manager. "+
"Enabling this will ensure there is only one active controller manager.")
flag.StringVar(&serviceNameTemplate, "default-backendservice-name", "{name}-{port}",
"A naming template consists of {namespace}, {name}, {port} or {hash} separated by hyphens, "+
"where {hash} is the first 8 digits of a hash of other given information")
flag.BoolVar(&allowServiceName, "enable-custom-service-names", true, "Enable using custom service names in autoneg annotation.")

Overall, the goal is just to ensure that users and developers have aligned expectations. Perhaps users should not expect stability given that we are pre-1.0. In this case, users upgrading from 0.9.8 to 0.9.9 will have to update their deployment manifests.

Run as non-root user instead of root

It is not clear that there is any need to run this container as root user. This should be updated to run as non-root or have explicit documentation as to why a root user is required for the container.

New configuration annotations

With #34 merged, we have a new configuration annotation (controller.autoneg.dev/neg) that exposes the full configuration surface as json.

I've been considering a simpler interface for those who don't need that full flexibility, specifically targeting (what I assume is) the basic case of a single-port service, needing only two configuration options:
controller.autoneg.dev/backend-service:<name> (optional, defaulting to the name of the k8s service)
controller.autoneg.dev/max-rate-per-endpoint:<number> OR controller.autoneg.dev/max-connections-per-endpoint:<integer>

Thoughts?

Configuring region for Regional Backend config may be unnecessary and confusing

I missed this when looking at #34 but I don't think there's any opportunity to make choices for the backend region.

An Internal HTTP LB (regional backend service) can only include Backends from the same region. So a Backend Service in us-west1 can only have Backends also in us-west1. Trying to configure autoneg on a us-west1 cluster to add instances to a us-east1 Backend Service will not work, I believe.

I addressed this by pulling the cluster location (region) from metadata: http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster-location so that users aren't in the position of needing to magically know the only value that will actually work for region and having to explicitly configure it for each annotation.

One thing that could make this configuration required is if an Internal HTTP LB at some point in the future allows Backends to be added to a Backend Service in a different region. If that happens, it would become necessary to specify the Backend Service region in the annotation.

Trying to limit autoneg controller to multiple individual namespaces in the same cluster.

I tried removing all cluster roles, but the controller does not seem to support filtering based on namespace, or maybe I am configuring it incorrectly:

message: "pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:autoneg-per-ns:default" cannot list resource "services" in API group "" at the cluster scope"
source: "reflector.go:126"

Workload identity 401 invalid credentials error

The autoneg-controller-manager pod returns the following error when using workload identity:

2022-07-27T18:52:41.739Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "service", "request": "<name_space>/<service>", "error": "googleapi: Error 401: Invalid Credentials, authError"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88

The default service account has the workload identity annotation and the gcp service account has the correct role binding.

ReconcileBackends succeeds although some operations fail

We ran into an issue that the autoneg controller reports backends being added to NEGs although some patch operations fail with RESOURCE_NOT_READY.

It seems that applying patches to reconcile backends is handled optimistically, leading to a wrong reconciliation result.

For now I applied a workaround by retrying the patch in that case; see fbaier-fn@225576d. I would be more than happy to contribute my fix.

Update Ginkgo due to deprecation warning

When running tests locally (via go test ./...), I see the following output:

You're using deprecated Ginkgo functionality:
=============================================
Ginkgo 2.0 is under active development and will introduce several new features, improvements, and a small handful of breaking changes.
A release candidate for 2.0 is now available and 2.0 should GA in Fall 2021.  Please give the RC a try and send us feedback!
  - To learn more, view the migration guide at https://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md
  - For instructions on using the Release Candidate visit https://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md#using-the-beta
  - To comment, chime in at https://github.com/onsi/ginkgo/issues/711

  You are using a custom reporter.  Support for custom reporters will likely be removed in V2.  Most users were using them to generate junit or teamcity reports and this functionality will be merged into the core reporter.  In addition, Ginkgo 2.0 will support emitting a JSON-formatted report that users can then manipulate to generate custom reports.

  If this change will be impactful to you please leave a comment on https://github.com/onsi/ginkgo/issues/711
  Learn more at: https://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md#removed-custom-reporters

To silence deprecations that can be silenced set the following environment variable:
  ACK_GINKGO_DEPRECATIONS=1.16.5

Failed to pull image from docker.pkg.github.com: no basic auth credentials

Hi,

Thank you for the nice tool, successfully using it for NEG autopopulation.

After trying to switch to the most recent published release, it is not possible to pull the image from the registry:

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  19m                   default-scheduler  Successfully assigned autoneg-system/autoneg-controller-manager-758bc7f94b-tzvz9 to gke-d003-sb2-k8s-euwe1-node-pool-1-e0234f93-8c4q
  Normal   Pulling    19m                   kubelet            Pulling image "gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0"
  Normal   Started    19m                   kubelet            Started container kube-rbac-proxy
  Normal   Pulled     19m                   kubelet            Successfully pulled image "gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0" in 1.270493418s
  Normal   Created    19m                   kubelet            Created container kube-rbac-proxy
  Warning  Failed     18m (x3 over 19m)     kubelet            Failed to pull image "docker.pkg.github.com/googlecloudplatform/gke-autoneg-controller/gke-autoneg-controller:0.9.1": rpc error: code = Unknown desc = Error response from daemon: Get https://docker.pkg.github.com/v2/googlecloudplatform/gke-autoneg-controller/gke-autoneg-controller/manifests/0.9.1: no basic auth credentials

It should not be behind basic auth, should it?
Shouldn't it be published to the gcr.io registry instead?

Controller manager service account forbidden listing api resources

I ran into an issue where the autoneg controller service account is forbidden listing several kubernetes api resources. The autoneg-controller-manager pod is returning the following errors:

E0727 18:09:24.847396       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot list resource "services" in API group "" at the cluster scope
E0727 18:09:24.891173       1 leaderelection.go:306] error retrieving resource lock autoneg-system/controller-leader-election-helper: configmaps "controller-leader-election-helper" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "configmaps" in API group "" in the namespace "autoneg-system"

This occurs because the autoneg-controller-manager service account is autoneg. However, the RBAC role bindings reference the default service account instead of the autoneg service account used by the Deployment.

Unauthorized metrics endpoint

Hello,

In the deploy YAML file I see that the service has the following annotation:

prometheus.io/scrape: "true"

But when my prometheus tries to scrape metrics, it says Unauthorized

I tried to check it with curl and got the same

curl -k https://10.1.41.10:8443/metrics
Unauthorized

Could you please explain why the /metrics endpoint requires authorization, and how to disable it?

By the way, I did not find a list of the exposed metrics.

Could you please provide one?

Thanks.

GCLB backends are not populated by AutoNEG controller: RESOURCE_NOT_READY

Hi,

I am trying to use AutoNEG controller in Workload Identity mode (configured according to the manual), and I face a problem with auto populating the GCLB backends.

I noticed that shortly after creating GKE services, the backends show the NEGs in GC console, but then they disappear and never show up again.

The sequence of events according to Stackdriver logs:
https://pastebin.com/FKdhqYTy

I can see BAD_REQUEST and RESOURCE_NOT_READY errors there.

I have AutoNEG controller working correctly in neighbor projects, but using cloud access scopes. It's just for extra context, I am not sure if this particular issue is related to Workload Identity setup or not.

Inside the affected NEG I can see an undefined health status (screenshot omitted).

However I think (though not sure) that this is because it is "Not used yet" by the backend, so it's a consequence.

Also the problem persists, even if I remove GKE services and create them again - so it is not any kind of a race with LB backend service creation.

Any idea what goes wrong here?

Also posted: https://serverfault.com/questions/1036096/gclb-backends-are-not-populated-by-autoneg-controller-resource-not-ready
