googlecloudplatform / gke-autoneg-controller
This GKE controller provides simple custom integration between GKE and GCLB.
License: Apache License 2.0
Hi Team,
I was using kubectl to install the GKE controller using the manifest provided in the repo here: https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/deploy/autoneg.yaml. I found that you are creating a new service account named autoneg and referencing that service account name in the Deployment later in the file; but the RoleBinding and ClusterRoleBinding use the default service account instead, which is incorrect. I spent a lot of time investigating before finding this issue. Please update the autoneg.yaml file; it is attached as a reference.
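A corrected binding should name the autoneg service account rather than default. A minimal sketch (the Role and binding names here are assumptions, not taken from the manifest):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: autoneg-leader-election-rolebinding  # assumed name
  namespace: autoneg-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: autoneg-leader-election-role         # assumed name
subjects:
- kind: ServiceAccount
  name: autoneg        # was "default" in the shipped manifest
  namespace: autoneg-system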
Regards,
Sreenivas
I'm having an issue with autoneg v0.9.9.
When setting the following annotations on the service:
annotations:
  cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
  controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"gke-development-ingress-backend-web","zones":["us-central1-c"]}]}}'
the service generates the errors below:
Events:
Type     Reason        Age                From                Message
----     ------        ----               ----                -------
Normal   Sync          32s                autoneg-controller  Synced NEGs for "traefik/traefik" as backends to backend service "gke-development-ingress-backend-web" (port 80)
Normal   Create        25s (x2 over 15m)  neg-controller      Created NEG "k8s1-c278397d-traefik-traefik-80-230eaa9a" for traefik/traefik-k8s1-c278397d-traefik-traefik-80-230eaa9a--web/80-web-GCE_VM_IP_PORT-L7 in "us-central1-b".
Warning  BackendError  25s (x9 over 32s)  autoneg-controller  googleapi: Error 404: The resource 'projects/feltboard-terraform/zones/us-central1-c/networkEndpointGroups/k8s1-c278397d-traefik-traefik-80-230eaa9a' was not found, notFound
Normal   Create        13s (x2 over 15m)  neg-controller      Created NEG "k8s1-c278397d-traefik-traefik-80-230eaa9a" for traefik/traefik-k8s1-c278397d-traefik-traefik-80-230eaa9a--web/80-web-GCE_VM_IP_PORT-L7 in "us-central1-c".
Normal   Attach        10s (x2 over 14m)  neg-controller      Attach 1 network endpoint(s) (NEG "k8s1-c278397d-traefik-traefik-80-230eaa9a" in zone "us-central1-b")
Warning  BackendError  3s (x7 over 11m)   autoneg-controller  googleapi: Error 400: Invalid value for field 'resource.backends[0]': '{ "resourceGroup": "https://www.googleapis.com/compute/v1/projects/xx/zones/us-cen...'. CONNECTION balancing mode is not supported for protocol HTTP, invalid
This prevents me from using the dynamic-backend example as described in https://github.com/terraform-google-modules/terraform-google-lb-http/tree/v7.0.0/examples/dynamic-backend.
Ultimately, I want to expose a Traefik ingress (its web and websecure services) through a GCP load balancer.
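The CONNECTION balancing mode error suggests the backend service uses an HTTP protocol, which Google Cloud only allows with RATE (or UTILIZATION) balancing. Adding a rate setting to the annotation should make the controller use RATE mode; a sketch, with an illustrative rate value:

annotations:
  cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
  controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"gke-development-ingress-backend-web","zones":["us-central1-c"],"max_rate_per_endpoint":100}]}}'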
With the following settings:
command:
- /manager
args:
- --metrics-bind-address=:8080
- --health-probe-bind-address=:8081
- --leader-elect
- --zap-devel=false
- --zap-encoder=json
- --zap-time-encoding=rfc3339
The resulting logs are a mix of structured and unstructured logs. In particular, the Kubernetes client and leader election log messages are human-readable, which means that Cloud Logging will be confused about whether these logs are JSON or text.
I0718 03:39:16.676513 1 request.go:601] Waited for 1.004034269s due to client-side throttling, not priority and fairness, request: GET:https://10.30.180.1:443/apis/vulnerabilities.protect.gke.io/v1?timeout=32s
{"level":"info","ts":"2023-07-18T03:39:17Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2023-07-18T03:39:17Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2023-07-18T03:39:17Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2023-07-18T03:39:17Z","msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
I0718 03:39:17.791176 1 leaderelection.go:248] attempting to acquire leader lease autoneg-system/9fe89c94.controller.autoneg.dev...
E0718 03:39:17.801515 1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:21.244438 1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:24.292654 1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:28.356441 1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
E0718 03:39:32.195193 1 leaderelection.go:330] error retrieving resource lock autoneg-system/9fe89c94.controller.autoneg.dev: leases.coordination.k8s.io "9fe89c94.controller.autoneg.dev" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "autoneg-system"
gke-autoneg-controller/config/default/kustomization.yaml
Lines 29 to 30 in 0287655
The deployment spec should specify a non-root user in the security context, following the #20 change:
securityContext:
  privileged: false
  allowPrivilegeEscalation: false
  runAsUser: 1002
  runAsNonRoot: true
Hello,
in the deploy YAML file I see that the service has the following annotation:
prometheus.io/scrape: "true"
But when my Prometheus tries to scrape metrics, it gets Unauthorized.
I tried to check it with curl and got the same:
curl -k https://10.1.41.10:8443/metrics
Unauthorized
Could you please explain why the /metrics endpoint requires authorization, and how to disable it?
By the way, I did not find a list of the exposed metrics. Could you please provide one?
Thanks.
I ran into an issue where the autoneg controller service account is forbidden from listing several Kubernetes API resources. The autoneg-controller-manager pod is returning the following errors:
E0727 18:09:24.847396 1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot list resource "services" in API group "" at the cluster scope
E0727 18:09:24.891173 1 leaderelection.go:306] error retrieving resource lock autoneg-system/controller-leader-election-helper: configmaps "controller-leader-election-helper" is forbidden: User "system:serviceaccount:autoneg-system:autoneg" cannot get resource "configmaps" in API group "" in the namespace "autoneg-system"
This occurs because the autoneg-controller-manager service account is autoneg. However, the RBAC role bindings for the service account reference default instead of the autoneg account used by the Deployment.
With #34 merged, we have a new configuration annotation (controller.autoneg.dev/neg) that exposes the full configuration surface as JSON.
I've been considering a simpler interface for those who don't need that full flexibility, specifically targeting (what I assume is) the basic case of a single-port service, needing only two configuration options:
- controller.autoneg.dev/backend-service: <name> (optional, defaulting to the name of the k8s service)
- controller.autoneg.dev/max-rate-per-endpoint: <number> OR controller.autoneg.dev/max-connections-per-endpoint: <integer>
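On a single-port service, that might look like the following (values illustrative):

annotations:
  controller.autoneg.dev/backend-service: my-backend
  controller.autoneg.dev/max-rate-per-endpoint: "100"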
Thoughts?
I have a service defined with:
apiVersion: v1
kind: Service
metadata:
  name: frontend-svc
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"443":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"443":[{"name":"https-be","max_connections_per_endpoint":1000}]}}'
spec:
  selector:
    app: frontend-app
  type: NodePort
  ports:
  - protocol: TCP
    port: 443
    targetPort: 3000
When I run the kubectl command to create it, I observe the following events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Sync 32s autoneg-controller Synced NEGs for "default/frontend-svc" as backends to backend service "https-be" (port 443)
Normal Create 23s neg-controller Created NEG "k8s1-1f4ed5c4-default-frontend-svc-443-9757dbe8" for default/frontend-svc-k8s1-1f4ed5c4-default-frontend-svc-443-9757dbe8--/443-3000-GCE_VM_IP_PORT-L7 in "us-central1-f".
Warning BackendError 11s (x13 over 32s) autoneg-controller googleapi: Error 403: Request had insufficient authentication scopes.
Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
    "domain": "googleapis.com",
    "metadatas": {
      "method": "compute.v1.BackendServicesService.Get",
      "service": "compute.googleapis.com"
    },
    "reason": "ACCESS_TOKEN_SCOPE_INSUFFICIENT"
  }
]
I can see, however, that the autoneg IAM role includes the permission needed to perform this operation:
$ gcloud iam roles describe autoneg --project=$PROJECT_ID
etag: REDACTED
includedPermissions:
- compute.backendServices.get
- compute.backendServices.update
- compute.healthChecks.useReadOnly
- compute.networkEndpointGroups.use
- compute.regionBackendServices.get
- compute.regionBackendServices.update
- compute.regionHealthChecks.useReadOnly
name: projects/${PROJECT_ID}/roles/autoneg
stage: ALPHA
title: autoneg
Any suggestions on how to debug and resolve this? What makes it acutely frustrating is that there is no mention of these IAM issues in any of the GCP, GKE, or autoneg docs, or in community forums.
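For what it's worth, ACCESS_TOKEN_SCOPE_INSUFFICIENT usually points at the node pool's OAuth access scopes rather than at IAM: the IAM role is never consulted if the access token itself lacks the compute scope. One way to sidestep node scopes entirely is Workload Identity, annotating the controller's Kubernetes service account with the GCP service account that holds the autoneg role. A sketch (the GSA name is an assumption):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: autoneg
  namespace: autoneg-system
  annotations:
    iam.gke.io/gcp-service-account: autoneg@${PROJECT_ID}.iam.gserviceaccount.com  # assumed GSA name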
I'm having an issue with autoneg v0.9.9. Everything seemed to be working fine until I deleted a service with autoneg annotations.
I can see the deregistration message in the logs:
DEBUG events Normal {"object": {"kind":"Service","namespace":"test","name":"test-marianh","uid":"d362ac9e-f5dd-48ee-a167-91abf6c1be0a","apiVersion":"v1","resourceVersion":"14606392"}, "reason": "Delete", "message": "Deregistered NEGs for \"test/test-marianh\" from backend service \"autoneg-w1\" (port 80)"}
but these NEGs are never deleted or detached from the backend service.
$ gcloud compute backend-services list --project=anthos-mesh-sandbox-1b827f9b
NAME BACKENDS PROTOCOL
autoneg-w1 europe-west1-b/networkEndpointGroups/k8s1-6b097e81-test-test-marianh-80-1330420d,europe-west1-c/networkEndpointGroups/k8s1-6b097e81-test-test-marianh-80-1330420d,europe-west1-d/networkEndpointGroups/k8s1-6b097e81-test-test-marianh-80-1330420d TCP
$ gcloud compute network-endpoint-groups list --project=anthos-mesh-sandbox-1b827f9b | grep marianh
k8s1-6b097e81-test-test-marianh-80-1330420d europe-west1-b GCE_VM_IP_PORT 0
k8s1-6b097e81-test-test-marianh-80-1330420d europe-west1-c GCE_VM_IP_PORT 0
k8s1-6b097e81-test-test-marianh-80-1330420d europe-west1-d GCE_VM_IP_PORT 1
Is there anything I can check to find out why these NEGs persist after service deletion, or is there a cookbook for correctly deleting a service with a NEG annotation?
I missed this when looking at #34, but I don't think there's any opportunity to make choices for the backend region.
An Internal HTTP LB (regional backend service) can only include backends from the same region. So a Backend Service in us-west1 can only have backends also in us-west1. Trying to configure autoneg on a us-west1 cluster to add instances to a us-east1 Backend Service will not work, I believe.
I addressed this by pulling the cluster location (region) from metadata (http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster-location), so that users aren't in the position of needing to magically know the only value that will actually work for the region and having to explicitly configure it in each annotation.
One thing that could make this configuration required is if an Internal HTTP LB at some point in the future allows backends to be added to a Backend Service in a different region. If that happens, then it would become necessary to specify the Backend Service region in the annotation.
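For context, the json annotation can carry an explicit region key for a regional backend service, as seen in another report in this collection; a sketch with an illustrative backend name:

annotations:
  cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
  controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"my-internal-backend","region":"us-west1","max_rate_per_endpoint":100}]}}'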
Hi,
I am trying to use the AutoNEG controller in Workload Identity mode (configured according to the manual), and I face a problem with auto-populating the GCLB backends.
I noticed that shortly after creating the GKE services, the backends show the NEGs in the GCP console, but then they disappear and never show up again.
The sequence of events according to Stackdriver logs:
https://pastebin.com/FKdhqYTy
I can see BAD_REQUEST and RESOURCE_NOT_READY errors there.
I have the AutoNEG controller working correctly in neighboring projects, but using cloud access scopes; that is just extra context, and I am not sure whether this particular issue is related to the Workload Identity setup.
Inside the affected NEG I can see an undefined health status. However, I think (though I am not sure) that this is because the NEG is "Not used yet" by the backend, so it is a consequence rather than a cause.
Also, the problem persists even if I remove the GKE services and create them again, so it is not any kind of race with LB backend service creation.
Any idea what goes wrong here?
Also posted: https://serverfault.com/questions/1036096/gclb-backends-are-not-populated-by-autoneg-controller-resource-not-ready
We have a GCP project that we use to test new infrastructure changes; once a test is done, we tear down all the infrastructure.
We recently switched to using the Terraform module provided in this project, and we are facing issues when recreating the custom role.
Received unexpected error:
FatalError{Underlying: error while running command: exit status 1;
Error: Error creating the custom project role projects/<project-id>/roles/autonegRegional: googleapi: Error 400: You can't create a role_id (autonegRegional) which has been marked for deletion., failedPrecondition

  with module.dwam_test.module.autoneg[0].module.gcp.google_project_iam_custom_role.autoneg,
  on .terraform/modules/dwam_test.autoneg/terraform/gcp/main.tf line 57, in resource "google_project_iam_custom_role" "autoneg":
  57: resource "google_project_iam_custom_role" "autoneg" {
The deploy manifest includes a ConfigMap that does not appear to be used (it is not mounted into the Deployment). Can it be removed?
gke-autoneg-controller/deploy/autoneg.yaml
Lines 189 to 222 in b8b9a87
I tried removing all cluster roles, but the controller does not support filtering based on namespace, or maybe I am configuring it incorrectly:
message: "pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:autoneg-per-ns:default" cannot list resource "services" in API group "" at the cluster scope"
source: "reflector.go:126"
Currently it appears that the latest tag needs to be used, since there aren't any releases.
anthos.cft.dev is misleading, as autoneg doesn't require Anthos.
How exactly does one deploy this into one's clusters and make use of it? In particular with multi-cluster ingress (although that part may be out of scope).
Removing the config annotation anthos.cft.dev/autoneg still left the anthos.cft.dev/autoneg-status annotation, and also generated a controller.autoneg.dev/neg-status annotation.
I would expect autoneg to deregister backends, remove the status annotation, and remove the finalizer.
When I deploy a k8s service with the NEG and autoneg annotations, the NEGs are not registered with the backend service, and I see the following error message:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Create 7m54s neg-controller Created NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" for default/app-catfood-k8s1-85a2e695-default-app-catfood-80-fb89a260--web/80-80-GCE_VM_IP_PORT-L7 in "europe-west1-b".
Normal Create 7m51s neg-controller Created NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" for default/app-catfood-k8s1-85a2e695-default-app-catfood-80-fb89a260--web/80-80-GCE_VM_IP_PORT-L7 in "europe-west1-c".
Normal Create 7m49s neg-controller Created NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" for default/app-catfood-k8s1-85a2e695-default-app-catfood-80-fb89a260--web/80-80-GCE_VM_IP_PORT-L7 in "europe-west1-d".
Normal Attach 7m45s (x2 over 27h) neg-controller Attach 1 network endpoint(s) (NEG "k8s1-85a2e695-default-app-catfood-80-fb89a260" in zone "europe-west1-b")
Warning BackendError 7m44s (x2 over 26h) autoneg-controller Operation cannot be fulfilled on services "app-catfood": the object has been modified; please apply your changes to the latest version and try again
When I update the service again, the NEGs are registered:
Normal Sync 1s (x2 over 26h) autoneg-controller Synced NEGs for "default/app-catfood" as backends to backend service "catfood"
I expect the autoneg controller to eventually synchronize to the desired state without manual intervention.
When running tests locally (via go test ./...), I see the following output:
You're using deprecated Ginkgo functionality:
=============================================
Ginkgo 2.0 is under active development and will introduce several new features, improvements, and a small handful of breaking changes.
A release candidate for 2.0 is now available and 2.0 should GA in Fall 2021. Please give the RC a try and send us feedback!
- To learn more, view the migration guide at https://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md
- For instructions on using the Release Candidate visit https://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md#using-the-beta
- To comment, chime in at https://github.com/onsi/ginkgo/issues/711
You are using a custom reporter. Support for custom reporters will likely be removed in V2. Most users were using them to generate junit or teamcity reports and this functionality will be merged into the core reporter. In addition, Ginkgo 2.0 will support emitting a JSON-formatted report that users can then manipulate to generate custom reports.
If this change will be impactful to you please leave a comment on https://github.com/onsi/ginkgo/issues/711
Learn more at: https://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md#removed-custom-reporters
To silence deprecations that can be silenced set the following environment variable:
ACK_GINKGO_DEPRECATIONS=1.16.5
When running make docker-build, it fails complaining about a missing api dir.
See https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/Dockerfile#L28: where is this coming from?
Instead of https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/Dockerfile#L19-L29, why not replace it with COPY . . ?
We deployed autoneg in one of our clusters running GKE Autopilot. When a workload is scheduled to only some specific zones, but not all of them, the NEGs are not created in all the zones.
That means that autoneg will fail and stop reconciling.
The actual failing log part:
2021-10-22T17:55:52.092Z INFO controllers.Service Applying intended status {"service": "envoy/envoy", "status": {"backend_services":{"8000":{"myproduct":{"name":"myproduct","max_connections_per_endpoint":1000}}},"network_endpoint_groups":{"8000":"k8s1-4fd3dc4c-envoy-envoy-8000-44f9746b","8001":"k8s1-4fd3dc4c-envoy-envoy-8001-7508741b"},"zones":["europe-west1-b","europe-west1-c","europe-west1-d"]}}
2021-10-22T17:55:52.762Z ERROR controller-runtime.controller Reconciler error {"controller": "service", "request": "envoy/envoy", "error": "googleapi: Error 404: The resource 'projects/myproduct-dev/zones/europe-west1-c/networkEndpointGroups/k8s1-4fd3dc4c-envoy-envoy-8000-44f9746b' was not found, notFound"}
The question is: should autoneg tolerate network endpoint groups that are missing in some, but not all, available zones?
Add tests that prove this case.
Users may want to configure the maxUtilization property in a backend.
Hello,
After performing some tests, I've noticed that if a backend added to a backend service by the autoneg controller is manually removed, it is not added back; that is to say, the controller does not continuously check that the desired state holds.
It would be desirable if the autoneg system periodically checked to make sure that the desired state is real.
Steps to reproduce:
It is not clear that there is any need to run this container as the root user. It should be updated to run as non-root, or there should be explicit documentation as to why a root user is required for the container.
We ran into an issue where the autoneg controller reports backends as added to NEGs even though some patch operations fail with RESOURCE_NOT_READY.
It seems that applying patches to reconcile backends is handled optimistically, leading to a wrong reconciliation result.
For now I applied a workaround by retrying the patch in such cases; see fbaier-fn@225576d. I would be more than happy to contribute my fix.
make test fails occasionally for https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/controllers/autoneg_test.go#L248-L258 because we transform a slice to a map and then back to a slice (see https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/controllers/autoneg.go#L141), which can mess up the order assumed in the test (https://github.com/GoogleCloudPlatform/gke-autoneg-controller/blob/master/controllers/autoneg_test.go#L157-L170).
So if we compare the returned slice, sometimes the group of zone2 comes before zone1, and the test fails.
I propose modifying isEqual along these lines:
func (b Backends) isEqual(ob Backends) bool {
	if b.name != ob.name {
		return false
	}
	// Compare backends as maps keyed by group, so that ordering differences
	// (an artifact of the slice->map->slice transformation) don't matter.
	newB := map[string]compute.Backend{}
	for _, be := range b.backends {
		rb := relevantCopy(be)
		if _, ok := newB[rb.Group]; ok {
			return false // duplicate group: treat as unequal
		}
		newB[rb.Group] = rb
	}
	newOB := map[string]compute.Backend{}
	for _, be := range ob.backends {
		rb := relevantCopy(be)
		if _, ok := newOB[rb.Group]; ok {
			return false
		}
		newOB[rb.Group] = rb
	}
	return reflect.DeepEqual(newB, newOB)
}
Right now the controller is installed via the Terraform module like this:
module "autoneg" {
  source     = "github.com/GoogleCloudPlatform/gke-autoneg-controller//terraform/autoneg"
  project_id = "your-project-id"
}
where the only thing configurable is the project_id.
It creates a custom role for the controller, so the module is completely unusable when installing it on different clusters in the same project; it will fail because it simply tries to create the same custom role again.
Hello,
What is the meaning of the following log, given that the bar service in the foo namespace is not a Network Endpoint Group?
DEBUG controller-runtime.controller Successfully Reconciled {"controller": "service", "request": "foo/bar"}
We're using the following settings for our deployment.
Container security context:
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  privileged: false
  readOnlyRootFilesystem: true
  runAsUser: 65532
  runAsGroup: 65532
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
Pod security context:
securityContext:
  runAsUser: 65532
  runAsGroup: 65532
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
Using these as the defaults would be useful, as it would make this controller installable into a wider range of clusters (e.g. clusters with restrictive admission controllers).
I'm experimenting with the following simple nginx service:
apiVersion: v1
kind: Service
metadata:
  name: neg-demo-svc
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"some-backend-service","region":"europe-west2","max_rate_per_endpoint":100}]}}'
spec:
  type: ClusterIP
  selector:
    app: nginx
  ports:
  - port: 80
    protocol: TCP
I have NEG creation and association with the Backend Service working correctly. However, when I delete the service, I run into problems. Based on the autoneg-controller logs, it seems to deregister the NEGs successfully:
2022-02-25T16:45:12.987+0100 DEBUG controller-runtime.manager.events Normal {"object": {"kind":"Service","namespace":"default","name":"neg-demo-svc","uid":"affa73d3-10e3-4698-ae02-d1c1c22ad748","apiVersion":"v1","resourceVersion":"420760"}, "reason": "Delete", "message": "Deregistered NEGs for \"default/neg-demo-svc\" from backend service \"argon-gke-general-blue-01-psc-backend-service\" (port 80)"}
I believe that this is related to this logic here:
gke-autoneg-controller/controllers/autoneg.go
Lines 270 to 274 in d0e0f48
In this case, intendedBEKeys is empty, which means we skip the entire loop that checks for differences. Shouldn't it also iterate over the keys from actualBE to detect the removal of ports?
Please let me know if I'm missing something. Is there anything else I can provide?
If the backend service referenced in the config does not exist, do not set the finalizer or status, and log a more appropriate event message.
https://book.kubebuilder.io/migration/migration_guide_v2tov3.html
Kubebuilder v3 doesn't add features required by autoneg, but it'd be nice to migrate to v3 to stay current on our dependencies. Migration shouldn't be difficult, as autoneg already implements/satisfies some of the changes/requirements in v3.
Pull request #46 changes some flags, which is a breaking change unless users re-deploy using the Deployment manifest. I'd propose either:
Prior to the linked pull request, the code looked like this:
gke-autoneg-controller/main.go
Lines 57 to 63 in 05c4b1c
Afterwards, it now looks like this:
gke-autoneg-controller/main.go
Lines 62 to 70 in b8b9a87
Overall, the goal is just to ensure that users and developers have aligned expectations. Perhaps users should not expect stability given that we are pre-1.0. In this case, users upgrading from 0.9.8 to 0.9.9 will have to update their deployment manifests.
This is required to use it against an Internal HTTP(S) Load Balancer.
Reproduce: create a service with a backend name that contains an underscore. Now try to clean up. Eventually I tried to delete the entire namespace, but that is stuck waiting on the deletion of the Service, which cannot proceed even though there is nothing to clean up.
This makes it explicit that there is a service account just for the autoneg pod.
Additionally, when provisioning with Terraform there is no way to update the annotation on the default service account. A new service account must be created and managed by Terraform in order to add the annotation.
Hi,
I am getting the following error with autoneg controller.
Any idea what the missing precondition could be?
autoneg-controller-manager manager 2019-12-10T19:58:09.561Z INFO controllers.Service Applying intended status {"service": "foo/bar", "status": {"name":"bar","max_rate_per_endpoint":1000,"network_endpoint_groups":{"80":"k8s1-foo-bar-80-12345a"},"zones":["us-central1-a"]}}
autoneg-controller-manager manager 2019-12-10T19:58:09.990Z ERROR controller-runtime.controller Reconciler error {"controller": "service", "request": "foo/bar", "error": "googleapi: Error 412: Precondition Failed, conditionNotMet"}
A backend service named bar exists in the project; it has 0 backends at this time.
cc @soellman 🙏
The following errors are in the README/script:
1. The README says "PROJECT=xyz deploy/workload_identity.sh", but the script itself expects an environment variable named "PROJECT_ID".
2. "gcloud iam roles update" fails. You should use "gcloud iam roles create" instead.
Hi,
Thank you for the nice tool; we are successfully using it for NEG auto-population.
After trying to switch to the most recently published version as a release, it turns out it is not possible to pull the image from the registry:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19m default-scheduler Successfully assigned autoneg-system/autoneg-controller-manager-758bc7f94b-tzvz9 to gke-d003-sb2-k8s-euwe1-node-pool-1-e0234f93-8c4q
Normal Pulling 19m kubelet Pulling image "gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0"
Normal Started 19m kubelet Started container kube-rbac-proxy
Normal Pulled 19m kubelet Successfully pulled image "gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0" in 1.270493418s
Normal Created 19m kubelet Created container kube-rbac-proxy
Warning Failed 18m (x3 over 19m) kubelet Failed to pull image "docker.pkg.github.com/googlecloudplatform/gke-autoneg-controller/gke-autoneg-controller:0.9.1": rpc error: code = Unknown desc = Error response from daemon: Get https://docker.pkg.github.com/v2/googlecloudplatform/gke-autoneg-controller/gke-autoneg-controller/manifests/0.9.1: no basic auth credentials
It should not be behind basic auth, should it? Shouldn't it be published to the gcr.io registry instead?
I noticed there's a user agent string which is hard-coded:
gke-autoneg-controller/main.go
Line 43 in b8b9a87
The script fails when updating a role that does not exist. Fix at #16.
Hello,
Thanks for this excellent project!
At Prefect, we are using this controller to perform traffic splitting and blue/green deployments. As far as I can tell, the default (and hard-coded) behavior of this controller is such that traffic will be immediately load balanced between all participating services once the health checks succeed, and we would instead prefer to gradually shift traffic over by configuring the split ratio.
This seems to be the relevant code:
gke-autoneg-controller/controllers/autoneg.go
Lines 62 to 66 in 64a7216
Would you accept a PR that adds configurability for the InitialCapacity value? We could store that in the existing AutonegConfig object.
I have two use cases in mind for this feature:
In both cases, we want to gradually shift some traffic to the new cluster and monitor error rates.
With the current behavior, if we use the same connection rate settings for the service, bringing up a new cluster takes an equal proportion of traffic (e.g. with live cluster A processing 100% of requests, bringing up a new cluster B and attaching it to the same NEG results in a 50%/50% split). We would like to begin with a 100%/0% split, gradually increase the proportion of traffic that cluster B handles, and then gradually decrease the proportion that cluster A handles, to transition safely.
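If that were accepted, one shape it could take is an optional per-backend field in the existing json annotation; the field name initial_capacity below is purely hypothetical, invented here for illustration:

annotations:
  controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"my-backend","max_rate_per_endpoint":100,"initial_capacity":0}]}}'

A new cluster attached with an initial capacity of 0 would then receive no traffic until the operator raises its capacity.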
The autoneg-controller-manager pod returns the following error when using workload identity:
2022-07-27T18:52:41.739Z ERROR controller-runtime.controller Reconciler error {"controller": "service", "request": "<name_space>/<service>", "error": "googleapi: Error 401: Invalid Credentials, authError"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
The default service account has the workload identity annotation, and the GCP service account has the correct role binding.
The example in the README uses a Backend Service whose name contains an underscore. However, it is impossible to create a Backend Service with such a name.
As a result nothing happens, but even deleting the Service is no longer possible, because the controller keeps getting stuck on the rejected Backend Service name.
To reproduce, type:
$ git clone git@github.com:GoogleCloudPlatform/gke-autoneg-controller.git
$ cd gke-autoneg-controller
$ docker build -t x .
------
> [builder 7/9] COPY api/ api/:
------
failed to compute cache key: "/api" not found: not found
A GKE service can have multiple NEGs created for different ports, but there isn't a way to associate a specific NEG with a specific backend.
For example, a service exposes two ports: port 443 for HTTP2/gRPC, and port 8443 for HTTP1.1 metrics and diagnostics endpoints. Two NEGs can be created for those two ports, but there is no way to associate the port 443 NEG with backendA and the port 8443 NEG with backendB.
They must be behind separate backend services, since they use different protocols (HTTP2 and HTTP) to talk from the backend service to the NEG instances.
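For what it's worth, the per-port map in the json annotation looks like the natural place for this; a sketch reusing the names above (rate values illustrative):

annotations:
  cloud.google.com/neg: '{"exposed_ports": {"443":{},"8443":{}}}'
  controller.autoneg.dev/neg: '{"backend_services":{"443":[{"name":"backendA","max_rate_per_endpoint":100}],"8443":[{"name":"backendB","max_rate_per_endpoint":100}]}}'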