googlecloudplatform / gke-autoneg-controller

This GKE controller provides simple custom integration between GKE and GCLB.

License: Apache License 2.0


gke-autoneg-controller's Issues

deploy/autoneg.yaml refers to the wrong Kubernetes service account

Hi Team,

I was installing the GKE controller with kubectl using the manifest provided in the repo at https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/deploy/autoneg.yaml. The manifest creates a new service account named autoneg and references that service account in the Deployment later in the file, but the RoleBinding and ClusterRoleBinding reference the default service account instead, which is incorrect. I spent a lot of time investigating before finding this. Please update the autoneg.yaml file, attached as a reference.

Regards,
Sreenivas
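
For reference, a minimal sketch of what the corrected bindings could look like, assuming the service account created by the manifest is named autoneg in the autoneg-system namespace (the role and binding names below are illustrative, not taken from the manifest):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: autoneg-manager-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: autoneg-manager-role
subjects:
  - kind: ServiceAccount
    name: autoneg              # was: default
    namespace: autoneg-system

The same subject change would apply to the RoleBinding in the manifest.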

error CONNECTION balancing mode is not supported for protocol HTTP, invalid

I'm having an issue with autoneg v0.9.9.

When I set the following annotations on the service:

annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"gke-development-ingress-backend-web","zones":["us-central1-c"]}]}}'

The service generates the events below:

Events:
  Type     Reason        Age                From                Message
  ----     ------        ----               ----                -------
  Normal   Sync          32s                autoneg-controller  Synced NEGs for "traefik/traefik" as backends to backend service "gke-development-ingress-backend-web" (port 80)
  Normal   Create        25s (x2 over 15m)  neg-controller      Created NEG "k8s1-c278397d-traefik-traefik-80-230eaa9a" for traefik/traefik-k8s1-c278397d-traefik-traefik-80-230eaa9a--web/80-web-GCE_VM_IP_PORT-L7 in "us-central1-b".
  Warning  BackendError  25s (x9 over 32s)  autoneg-controller  googleapi: Error 404: The resource 'projects/feltboard-terraform/zones/us-central1-c/networkEndpointGroups/k8s1-c278397d-traefik-traefik-80-230eaa9a' was not found, notFound
  Normal   Create        13s (x2 over 15m)  neg-controller      Created NEG "k8s1-c278397d-traefik-traefik-80-230eaa9a" for traefik/traefik-k8s1-c278397d-traefik-traefik-80-230eaa9a--web/80-web-GCE_VM_IP_PORT-L7 in "us-central1-c".
  Normal   Attach        10s (x2 over 14m)  neg-controller      Attach 1 network endpoint(s) (NEG "k8s1-c278397d-traefik-traefik-80-230eaa9a" in zone "us-central1-b")
  Warning  BackendError  3s (x7 over 11m)   autoneg-controller  googleapi: Error 400: Invalid value for field 'resource.backends[0]': '{ "resourceGroup": "https://www.googleapis.com/compute/v1/projects/xx/zones/us-cen...'. CONNECTION balancing mode is not supported for protocol HTTP, invalid

This prevents me from using the dynamic-backend example described at https://github.com/terraform-google-modules/terraform-google-lb-http/tree/v7.0.0/examples/dynamic-backend.

Ultimately, I want to expose a Traefik ingress (the web and websecure services) through a Google Cloud load balancer (GCLB).
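
One possible workaround sketch, not verified: CONNECTION balancing is only valid for TCP/SSL backend services, so an HTTP backend service needs rate-based balancing. The annotation format used elsewhere in these issues accepts max_rate_per_endpoint, which may steer the controller to RATE mode; the value 100 and the second zone are illustrative:

annotations:
  cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
  controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"gke-development-ingress-backend-web","zones":["us-central1-b","us-central1-c"],"max_rate_per_endpoint":100}]}}'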

Mixed logging when configuring structured logging

With the following settings:

          command:
            - /manager
          args:
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
            - --leader-elect
            - --zap-devel=false
            - --zap-encoder=json
            - --zap-time-encoding=rfc3339

The resulting logs are a mix of structured and unstructured logs. In particular, the Kubernetes client and leader election log messages are human-readable, which means that Cloud Logging will be confused about whether these logs are JSON or text.

I0718 03:39:16.676513       1 request.go:601] Waited for 1.004034269s due to client-side throttling, not priority and fairness, request: GET:https://10.30.180.1:443/apis/vulnerabilities.protect.gke.io/v1?timeout=32s
{"level":"info","ts":"2023-07-18T03:39:17Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2023-07-18T03:39:17Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2023-07-18T03:39:17Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2023-07-18T03:39:17Z","msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
I0718 03:39:17.791176       1 leaderelection.go:248] attempting to acquire leader lease autoneg-system/9fe89c94.controller.autoneg.dev...
E0718 03:39:17.801515       1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:21.244438       1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:24.292654       1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:28.356441       1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:32.195193       1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"

Specify non-root user in deployment

The deployment spec should specify the non-root user in the security context following the #20 change:

securityContext:
  privileged: false
  allowPrivilegeEscalation: false
  runAsUser: 1002
  runAsNonRoot: true

Unauthorized metrics endpoint

Hello,

In the deploy YAML file, I see that the service has the following annotation:

prometheus.io/scrape: "true"

But when my Prometheus tries to scrape the metrics, it gets Unauthorized.

I tried to check with curl and got the same result:

curl -k https://10.1.41.10:8443/metrics
Unauthorized

Could you please explain why the /metrics endpoint requires authorization and how to disable it?

By the way, I did not find a list of the exposed metrics. Could you please provide one?

Thanks.

Controller manager service account forbidden listing api resources

I ran into an issue where the autoneg controller service account is forbidden from listing several Kubernetes API resources. The autoneg-controller-manager pod is returning the following errors:

E0727 18:09:24.847396       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot list resource "services" in API group "" at the cluster scope
E0727 18:09:24.891173       1 leaderelection.go:306] error retrieving resource lock autoneg-system/controller-leader-election-helper: configmaps "controller-leader-election-helper" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "configmaps" in API group "" in the namespace "autoneg-system"

This occurs because the autoneg-controller-manager Deployment uses the autoneg service account, but the RBAC role bindings reference the default service account instead of the autoneg account used by the Deployment.

New configuration annotations

With #34 merged, we have a new configuration annotation (controller.autoneg.dev/neg) that exposes the full configuration surface as json.

I've been considering a simpler interface for those who don't need that full flexibility, specifically targeting (what I assume is) the basic case of a single-port service, needing only two configuration options:
controller.autoneg.dev/backend-service:<name> (optional, defaulting to the name of the k8s service)
controller.autoneg.dev/max-rate-per-endpoint:<number> OR controller.autoneg.dev/max-connections-per-endpoint:<integer>

Thoughts?
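
To make the proposal concrete, a sketch of how the two options might look on a single-port service; these annotations are only proposed here, not implemented:

apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
    # Proposed: overrides the default backend service name (my-service)
    controller.autoneg.dev/backend-service: my-backend-service
    # Proposed: exactly one of the two capacity options
    controller.autoneg.dev/max-rate-per-endpoint: "100"
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      protocol: TCP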

Observing ACCESS_TOKEN_SCOPE_INSUFFICIENT when creating service

I have a service defined with

apiVersion: v1
kind: Service
metadata:
  name: frontend-svc
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"443":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"443":[{"name":"https-be","max_connections_per_endpoint":1000}]}}'
spec:
  selector:
    app: frontend-app
  type: NodePort
  ports:
    - protocol: TCP
      port: 443
      targetPort: 3000

When I run the kubectl command to create it, I observe the following events:

Events:
  Type     Reason        Age                 From                Message
  ----     ------        ----                ----                -------
  Normal   Sync          32s                 autoneg-controller  Synced NEGs for "default/frontend-svc" as backends to backend service "https-be" (port 443)
  Normal   Create        23s                 neg-controller      Created NEG "k8s1-1f4ed5c4-default-frontend-svc-443-9757dbe8" for default/frontend-svc-k8s1-1f4ed5c4-default-frontend-svc-443-9757dbe8--/443-3000-GCE_VM_IP_PORT-L7 in "us-central1-f".
  Warning  BackendError  11s (x13 over 32s)  autoneg-controller  googleapi: Error 403: Request had insufficient authentication scopes.
Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
    "domain": "googleapis.com",
    "metadatas": {
      "method": "compute.v1.BackendServicesService.Get",
      "service": "compute.googleapis.com"
    },
    "reason": "ACCESS_TOKEN_SCOPE_INSUFFICIENT"
  }
]

However, I can see that the autoneg IAM role includes the permission needed for this operation:

$ gcloud iam roles describe autoneg --project=$PROJECT_ID
etag: REDACTED
includedPermissions:
- compute.backendServices.get
- compute.backendServices.update
- compute.healthChecks.useReadOnly
- compute.networkEndpointGroups.use
- compute.regionBackendServices.get
- compute.regionBackendServices.update
- compute.regionHealthChecks.useReadOnly
name: projects/${PROJECT_ID}/roles/autoneg
stage: ALPHA
title: autoneg

Any suggestions on how to debug and resolve this? What makes it especially frustrating is that these IAM issues are not mentioned in any of the GCP, GKE, or autoneg docs, or in community forums.
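
One note that may help others: "Request had insufficient authentication scopes" refers to the OAuth access scopes of the credentials the controller runs with (e.g. the node pool's scopes), not to IAM role permissions, which is why the custom role above looks fine. A hedged sketch of the Workload Identity route mentioned in other issues here, assuming a Google service account autoneg@PROJECT_ID.iam.gserviceaccount.com exists, holds the autoneg role, and has been granted roles/iam.workloadIdentityUser for the Kubernetes service account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: autoneg
  namespace: autoneg-system
  annotations:
    # Binds this Kubernetes service account to the Google service account;
    # the GSA name here is an assumption for illustration.
    iam.gke.io/gcp-service-account: autoneg@PROJECT_ID.iam.gserviceaccount.com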

NEGs are not deleted during service deletion

I'm having an issue with autoneg v0.9.9. Everything seemed to be working fine until I deleted a service with the autoneg annotations.

I can see the deregistration message in the logs:

DEBUG	events	Normal	{"object": {"kind":"Service","namespace":"test","name":"test-marianh","uid":"d362ac9e-f5dd-48ee-a167-91abf6c1be0a","apiVersion":"v1","resourceVersion":"14606392"}, "reason": "Delete", "message": "Deregistered NEGs for \"test/test-marianh\" from backend service \"autoneg-w1\" (port 80)"}

but the NEGs are never deleted or detached from the backend service.

$ gcloud compute backend-services list --project=anthos-mesh-sandbox-1b827f9b 
NAME        BACKENDS                                                                                                                                                                                                                                            PROTOCOL
autoneg-w1  europe-west1-b/networkEndpointGroups/k8s1-6b097e81-test-test-marianh-80-1330420d,europe-west1-c/networkEndpointGroups/k8s1-6b097e81-test-test-marianh-80-1330420d,europe-west1-d/networkEndpointGroups/k8s1-6b097e81-test-test-marianh-80-1330420d  TCP

$ gcloud compute network-endpoint-groups list --project=anthos-mesh-sandbox-1b827f9b  | grep marianh
k8s1-6b097e81-test-test-marianh-80-1330420d                      europe-west1-b  GCE_VM_IP_PORT  0
k8s1-6b097e81-test-test-marianh-80-1330420d                      europe-west1-c  GCE_VM_IP_PORT  0
k8s1-6b097e81-test-test-marianh-80-1330420d                      europe-west1-d  GCE_VM_IP_PORT  1

Is there anything I can check to find out why these NEGs persist after service deletion, or is there a cookbook for correctly deleting a service that has the NEG annotation?

Configuring region for Regional Backend config may be unnecessary and confusing

I missed this when looking at #34 but I don't think there's any opportunity to make choices for the backend region.

An Internal HTTP LB (regional backend service) can only include Backends from the same region. So a Backend Service in us-west1 can only have Backends also in us-west1. Trying to configure autoneg on a us-west1 cluster to add instances to a us-east1 Backend Service will not work, I believe.

I addressed this by pulling the cluster location (region) from metadata (http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster-location), so that users aren't in the position of needing to magically know the only value that will actually work for the region and having to explicitly configure it in each annotation.

One thing that might make this configuration required is the possibility that an Internal HTTP LB could, at some point in the future, allow Backends to be added to a Backend Service in a different region. If that happens, it would become necessary to specify the Backend Service region in the annotation.

GCLB backends are not populated by AutoNEG controller: RESOURCE_NOT_READY

Hi,

I am trying to use the AutoNEG controller in Workload Identity mode (configured according to the manual), and I'm facing a problem with auto-populating the GCLB backends.

I noticed that shortly after the GKE services are created, the backends show the NEGs in the GCP console, but then they disappear and never show up again.

The sequence of events according to Stackdriver logs:
https://pastebin.com/FKdhqYTy

I can see BAD_REQUEST and RESOURCE_NOT_READY errors there.

For extra context: I have the AutoNEG controller working correctly in neighboring projects, but those use cloud access scopes. I am not sure whether this particular issue is related to the Workload Identity setup or not.

Inside the affected NEG I can see an undefined health status (screenshot omitted).

However, I think (though I'm not sure) that this is because the NEG is "Not used yet" by the backend, so it is a consequence rather than a cause.

The problem also persists if I remove the GKE services and create them again, so it is not a race with the LB backend service creation.

Any idea what goes wrong here?

Also posted: https://serverfault.com/questions/1036096/gclb-backends-are-not-populated-by-autoneg-controller-resource-not-ready

Custom role error when recreating project

We have a GCP project that we use to test new infrastructure changes; once the test is done, we tear down all of the infrastructure.
We recently switched to using the Terraform module provided in this project, and we are facing issues when recreating the custom role.

Received unexpected error:
FatalError{Underlying: error while running command: exit status 1;
Error: Error creating the custom project role projects/<project-id>/roles/autonegRegional: googleapi: Error 400: You can't create a role_id (autonegRegional) which has been marked for deletion., failedPrecondition

  with module.dwam_test.module.autoneg[0].module.gcp.google_project_iam_custom_role.autoneg,
  on .terraform/modules/dwam_test.autoneg/terraform/gcp/main.tf line 57, in resource "google_project_iam_custom_role" "autoneg":
  57: resource "google_project_iam_custom_role" "autoneg" {
}

Unused ConfigMap in deploy manifest

The deploy manifest includes a ConfigMap that does not appear to be used (it is not mounted to the Deployment). Can it be removed?

apiVersion: v1
data:
  controller_manager_config.yaml: |
    # Copyright 2021 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: controller-runtime.sigs.k8s.io/v1alpha1
    kind: ControllerManagerConfig
    health:
      healthProbeBindAddress: :8081
    metrics:
      bindAddress: 127.0.0.1:8080
    webhook:
      port: 9443
    leaderElection:
      leaderElect: true
      resourceName: 9fe89c94.controller.autoneg.dev
kind: ConfigMap
metadata:
  labels:
    app: autoneg
  name: autoneg-manager-config
  namespace: autoneg-system

Trying to limit autoneg controller to multiple individual namespaces in the same cluster.

I tried removing all the cluster roles, but the controller does not appear to support filtering by namespace; or maybe I am configuring it incorrectly.

message: "pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:autoneg-per-ns:default" cannot list resource "services" in API group "" at the cluster scope"
source: "reflector.go:126"

How to deploy?

How exactly does one deploy this into one's clusters and make use of it? In particular with a multi-cluster ingress (although that part may be out of scope).

Removing config annotation does not deregister backends

Removing the config annotation anthos.cft.dev/autoneg left the anthos.cft.dev/autoneg-status annotation in place, and also generated a controller.autoneg.dev/neg-status annotation.

I would expect autoneg to deregister backends, remove the status annotation, and remove the finalizer.

The object has been modified; please apply your changes to the latest version and try again

When I deploy a k8s service with the NEG and autoneg annotations, the NEGs are not registered with the backend service, and the following error message appears:

​ Type     Reason        Age                  From                Message
  ----     ------        ----                 ----                -------
  Normal   Create        7m54s                neg-controller      Created NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" for default/app-catfood-k8s1-85a2e695-default-app-catfood-80-fb89a260--web/80-80-GCE_VM_IP_PORT-L7 in "europe-west1-b".
  Normal   Create        7m51s                neg-controller      Created NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" for default/app-catfood-k8s1-85a2e695-default-app-catfood-80-fb89a260--web/80-80-GCE_VM_IP_PORT-L7 in "europe-west1-c".
  Normal   Create        7m49s                neg-controller      Created NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" for default/app-catfood-k8s1-85a2e695-default-app-catfood-80-fb89a260--web/80-80-GCE_VM_IP_PORT-L7 in "europe-west1-d".
  Normal   Attach        7m45s (x2 over 27h)  neg-controller      Attach 1 network endpoint(s) (NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" in zone "europe-west1-b")
  Warning  BackendError  7m44s (x2 over 26h)  autoneg-controller  Operation cannot be fulfilled on services "app-catfood": the object has been modified; please apply your changes to the latest version and try again

When I update the service again, the NEGs are registered:

Normal Sync 1s (x2 over 26h) autoneg-controller Synced NEGs for "default/app-catfood" as backends to backend service "catfood"

I expect the autoneg controller to eventually synchronise to the desired situation without manual intervention.

Update Ginkgo due to deprecation warning

When running tests locally (via go test ./...), I see the following output:

You're using deprecated Ginkgo functionality:
=============================================
Ginkgo 2.0 is under active development and will introduce several new features, improvements, and a small handful of breaking changes.
A release candidate for 2.0 is now available and 2.0 should GA in Fall 2021.  Please give the RC a try and send us feedback!
  - To learn more, view the migration guide at https://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md
  - For instructions on using the Release Candidate visit https://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md#using-the-beta
  - To comment, chime in at https://github.com/onsi/ginkgo/issues/711

  You are using a custom reporter.  Support for custom reporters will likely be removed in V2.  Most users were using them to generate junit or teamcity reports and this functionality will be merged into the core reporter.  In addition, Ginkgo 2.0 will support emitting a JSON-formatted report that users can then manipulate to generate custom reports.

  If this change will be impactful to you please leave a comment on https://github.com/onsi/ginkgo/issues/711
  Learn more at: https://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md#removed-custom-reporters

To silence deprecations that can be silenced set the following environment variable:
  ACK_GINKGO_DEPRECATIONS=1.16.5

Reconcile error: network endpoint group in a specific zone not found

We deployed autoneg in one of our clusters running GKE Autopilot. When a workload is scheduled to only some zones rather than all of them, the NEGs are not created in every zone.

That means that autoneg will fail and stop reconciling.

The failing part of the log:

2021-10-22T17:55:52.092Z	INFO	controllers.Service	Applying intended status	{"service": "envoy/envoy", "status": {"backend_services":{"8000":{"myproduct":{"name":"myproduct","max_connections_per_endpoint":1000}}},"network_endpoint_groups":{"8000":"k8s1-4fd3dc4c-envoy-envoy-8000-44f9746b","8001":"k8s1-4fd3dc4c-envoy-envoy-8001-7508741b"},"zones":["europe-west1-b","europe-west1-c","europe-west1-d"]}}
2021-10-22T17:55:52.762Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "service", "request": "envoy/envoy", "error": "googleapi: Error 404: The resource 'projects/myproduct-dev/zones/europe-west1-c/networkEndpointGroups/k8s1-4fd3dc4c-envoy-envoy-8000-44f9746b' was not found, notFound"}

The question is, should autoneg tolerate missing network endpoint groups in some but not in all available zones?

Manually removed backend does not add itself back

Hello,

After performing some tests, I've noticed that if a backend added to a backend service by the autoneg controller is manually removed, it is not added back; that is to say, the controller does not continuously check that the desired state is actually in place.

It would be desirable for the autoneg system to periodically verify that the desired state is real.

Steps to reproduce:

  1. run autoneg controller and sync a NEG to a backend service
  2. manually remove the backend NEG from the backend service, observe that it does not get added back by the autoneg controller

Run as non-root user instead of root

It is not clear that there is any need to run this container as the root user. It should be updated to run as non-root, or explicit documentation should be added explaining why a root user is required.

ReconcileBackends succeeds although some operations fail

We ran into an issue where the autoneg controller reports backends as successfully added even though some patch operations fail with RESOURCE_NOT_READY.

It seems that applying patches to reconcile backends is handled optimistically, leading to a wrong reconciliation result.

For now I have applied a workaround that retries the patch in that case; see fbaier-fn@225576d. I would be more than happy to contribute this fix.

Flaky TestReconcileStatuses because maps don't guarantee order

make test fails occasionally for https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/controllers/autoneg_test.go#L248-L258 because we transform a slice into a map and then back into a slice (see https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/controllers/autoneg.go#L141), which can change the order assumed by the test (https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/controllers/autoneg_test.go#L157-L170).

So when we compare the returned slices, the group for zone2 sometimes comes before zone1, and the test fails.

I propose modifying isEqual along these lines:

func (b Backends) isEqual(ob Backends) bool {
	if b.name != ob.name {
		return false
	}
	newB := map[string]compute.Backend{}
	for _, be := range b.backends {
		rb := relevantCopy(be)
		if _, ok := newB[rb.Group]; ok {
			return false
		}
		newB[rb.Group] = rb
	}
	newOB := map[string]compute.Backend{}
	for _, be := range ob.backends {
		rb := relevantCopy(be)
		if _, ok := newOB[rb.Group]; ok {
			return false
		}
		newOB[rb.Group] = rb
	}
	return reflect.DeepEqual(newB, newOB)
}

Terraform configuration unusable for multiple clusters

Right now the controller is installed via the Terraform module like this:

module "autoneg" {
  source = "github.com/GoogleCloudPlatform/gke-autoneg-controller//terraform/autoneg"

  project_id = "your-project-id"
}

where the only configurable value is the project_id.

The module creates a custom role for the controller, so it is unusable when installing on multiple clusters in the same project: the second installation fails because it tries to create the same custom role again.

Question: "Successfully Reconciled" for non-NEG services ?

Hello,
What is the meaning of the following log line, given that the bar service in the foo namespace is not a NEG (Network Endpoint Group) service?

DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "service", "request": "foo/bar"}

Additional pod and container hardening

We're using the following settings for our deployment --

Container security context:

          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            privileged: false
            readOnlyRootFilesystem: true
            runAsUser: 65532
            runAsGroup: 65532
            runAsNonRoot: true
            seccompProfile:
              type: RuntimeDefault

Pod security context:

      securityContext:
        runAsUser: 65532
        runAsGroup: 65532
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault

Using these as the defaults would be useful, as it makes the controller installable into a wider range of clusters (e.g. clusters with restrictive admission controllers).

NEGs not being deregistered on Service deletion

I'm experimenting with the following simple nginx service:

apiVersion: v1
kind: Service
metadata:
  name: neg-demo-svc
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"some-backend-service","region":"europe-west2","max_rate_per_endpoint":100}]}}'
spec:
  type: ClusterIP
  selector:
    app: nginx
  ports:
  - port: 80
    protocol: TCP

I have NEG creation and association with the Backend Service working correctly. However, when I'm deleting the service, I'm running into problems. Based on the autoneg-controller logs, it seems to deregister the NEGs successfully:

2022-02-25T16:45:12.987+0100    DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Service","namespace":"default","name":"neg-demo-svc","uid":"affa73d3-10e3-4698-ae02-d1c1c22ad748","apiVersion":"v1","resourceVersion":"420760"}, "reason": "Delete", "message": "Deregistered NEGs for \"default/neg-demo-svc\" from backend service \"argon-gke-general-blue-01-psc-backend-service\" (port 80)"}

I believe that this is related to this logic here:

var intendedBEKeys []string
for k := range intendedBE {
	intendedBEKeys = append(intendedBEKeys, k)
}
sort.Strings(intendedBEKeys)

In this case, intendedBEKeys is empty, which means we're going to skip the entire loop that checks for differences. Shouldn't it also iterate on keys from actualBE to detect the removal of ports?

Please let me know if I'm missing something. Is there anything else I might provide?

Update guide or description of compatibility guarantees

Pull request #46 changes some flags, which is a breaking change unless users re-deploy using the Deployment manifest. I'd propose either:

  • Adding a warning that breaking changes can occur, to set user expectations; or,
  • Defining a support strategy (e.g. support both flags for some period of time, deprecating the old one, etc)

Prior to the linked pull request, the code looked like this:

flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,
"Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.")
flag.StringVar(&serviceNameTemplate, "default-backendservice-name", "{name}-{port}",
"A naming template consists of {namespace}, {name}, {port} or {hash} separated by hyphens, "+
"where {hash} is the first 8 digits of a hash of other given information")
flag.BoolVar(&allowServiceName, "enable-custom-service-names", true, "Enable using custom service names in autoneg annotation.")

Afterwards, it now looks like this:

flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.")
flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
flag.BoolVar(&enableLeaderElection, "leader-elect", false,
"Enable leader election for controller manager. "+
"Enabling this will ensure there is only one active controller manager.")
flag.StringVar(&serviceNameTemplate, "default-backendservice-name", "{name}-{port}",
"A naming template consists of {namespace}, {name}, {port} or {hash} separated by hyphens, "+
"where {hash} is the first 8 digits of a hash of other given information")
flag.BoolVar(&allowServiceName, "enable-custom-service-names", true, "Enable using custom service names in autoneg annotation.")

Overall, the goal is just to ensure that users and developers have aligned expectations. Perhaps users should not expect stability given that we are pre-1.0. In this case, users upgrading from 0.9.8 to 0.9.9 will have to update their deployment manifests.

Use specific service account instead of `default`

This makes it explicit that there is a service account just for the autoneg pod.

Additionally, when provisioning with Terraform there is no way to update the annotation on the default service account; a new service account must be created and managed by Terraform in order to add the annotation.
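
A minimal sketch of what a dedicated account could look like, assuming the names used elsewhere in these issues (autoneg in the autoneg-system namespace):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: autoneg
  namespace: autoneg-system
  # With Workload Identity, the iam.gke.io/gcp-service-account annotation
  # would be added here by Terraform instead of on the default account.

The Deployment would then set spec.template.spec.serviceAccountName: autoneg, and the RoleBinding/ClusterRoleBinding subjects would reference the same account.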

googleapi: Error 412: Precondition Failed, conditionNotMet

Hi,
I am getting the following error from the autoneg controller.
Any idea what the missing precondition could be?

autoneg-controller-manager manager 2019-12-10T19:58:09.561Z    INFO    controllers.Service     Applying intended status        {"service": "foo/bar", "status": {"name":"bar","max_rate_per_endpoint":1000,"network_endpoint_groups":{"80":"k8s1-foo-bar-80-12345a"},"zones":["us-central1-a"]}}
autoneg-controller-manager manager 2019-12-10T19:58:09.990Z    ERROR   controller-runtime.controller   Reconciler error        {"controller": "service", "request": "foo/bar", "error": "googleapi: Error 412: Precondition Failed, conditionNotMet"}

A backend service named bar exists in the project; it has 0 backends at this time.

cc @soellman 🙏

The `README` and `workload_identity.sh` script contain several errors

The following errors are in the README and the script:

  1. The README specifies you should run "PROJECT=xyz deploy/workload_identity.sh", but the script itself expects an environment variable named "PROJECT_ID"
  2. If role "autoneg" does not exist, "gcloud iam roles update" fails. You should use "gcloud iam roles create" instead.

Failed to pull image from docker.pkg.github.com: no basic auth credentials

Hi,

Thank you for the nice tool, successfully using it for NEG autopopulation.

After trying to switch to the most recently published release, I found that it is not possible to pull the image from the registry:

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  19m                   default-scheduler  Successfully assigned autoneg-system/autoneg-controller-manager-758bc7f94b-tzvz9 to gke-d003-sb2-k8s-euwe1-node-pool-1-e0234f93-8c4q
  Normal   Pulling    19m                   kubelet            Pulling image "gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0"
  Normal   Started    19m                   kubelet            Started container kube-rbac-proxy
  Normal   Pulled     19m                   kubelet            Successfully pulled image "gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0" in 1.270493418s
  Normal   Created    19m                   kubelet            Created container kube-rbac-proxy
  Warning  Failed     18m (x3 over 19m)     kubelet            Failed to pull image "docker.pkg.github.com/googlecloudplatform/gke-autoneg-controller/gke-autoneg-controller:0.9.1": rpc error: code = Unknown desc = Error response from daemon: Get https://docker.pkg.github.com/v2/googlecloudplatform/gke-autoneg-controller/gke-autoneg-controller/manifests/0.9.1: no basic auth credentials

It shouldn't be behind basic auth, should it?
Shouldn't it be published to the gcr.io registry instead?

Configurable default capacityScaler

Hello,

Thanks for this excellent project!

At Prefect, we are using this controller to perform traffic splitting and blue/green deployments. As far as I can tell, the default (and hard-coded) behavior of this controller is such that traffic will be immediately load balanced between all participating services once the health checks succeed, and we would instead prefer to gradually shift traffic over by configuring the split ratio.

This seems to be the relevant code:

return compute.Backend{
	Group:              group,
	BalancingMode:      "RATE",
	MaxRatePerEndpoint: s.AutonegConfig.BackendServices[port][name].Rate,
	CapacityScaler:     1,
}

Would you accept a PR that adds configurability for the InitialCapacity value? We could store that in the existing AutonegConfig object.

I have two use cases in mind for this feature:

  • Bringing up an application and cluster in a new zone/region
  • Creating a new cluster for a blue/green deployment

In both cases, we want to gradually shift some traffic to the new cluster and monitor error rates.

With the current behavior, if we use the same connection rate settings for the service, bringing up a new cluster immediately takes an equal proportion of traffic (e.g. with live cluster A processing 100% of requests, bringing up a new cluster B and attaching to the same NEG results in a 50%/50% split). We would like to begin with a 100%/0% split, gradually increase the proportion of traffic that cluster B handles, and correspondingly decrease the proportion that cluster A handles, to transition safely.
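
To illustrate the ask, a purely hypothetical annotation shape; the initial_capacity field does not exist today and is only a sketch of what the configurable value could look like:

annotations:
  controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"web","max_rate_per_endpoint":100,"initial_capacity":0}]}}'
  # initial_capacity (hypothetical) would set capacityScaler on the new
  # backends, so cluster B could start at 0% and be ramped up gradually.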

Workload identity 401 invalid credentials error

The autoneg-controller-manager pod returns the following error when using workload identity:

2022-07-27T18:52:41.739Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "service", "request": "<name_space>/<service>", "error": "googleapi: Error 401: Invalid Credentials, authError"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88

The default Kubernetes service account has the workload identity annotation, and the GCP service account has the correct role binding.

Multiple NEG/Backend support

A GKE service can have multiple NEGs created for different ports but there isn't a way to associate a specific NEG with a specific backend.

For example, a service exposes two ports: 443 for HTTP/2 (gRPC) and 8443 for HTTP/1.1 metrics and diagnostics endpoints. Two NEGs can be created for those two ports, but there is no way to associate the port 443 NEG with backendA and the port 8443 NEG with backendB.

They must be behind separate backend services, since the backend services use different protocols (HTTP/2 and HTTP) to talk to the NEG endpoints.
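
For what it's worth, the per-port controller.autoneg.dev/neg annotation shown in other issues here looks like it could express this mapping; a sketch, assuming backendA and backendB already exist as backend services with the appropriate protocols (untested):

annotations:
  cloud.google.com/neg: '{"exposed_ports": {"443":{},"8443":{}}}'
  controller.autoneg.dev/neg: '{"backend_services":{"443":[{"name":"backendA","max_rate_per_endpoint":100}],"8443":[{"name":"backendB","max_rate_per_endpoint":100}]}}'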
