googlecloudplatform / k8s-node-termination-handler

A solution to gracefully handle GCE VM terminations in kubernetes clusters

License: Apache License 2.0

Languages: Makefile 2.90%, Python 13.18%, Shell 6.84%, Go 75.66%, Dockerfile 1.43%
Topics: kubernetes, gce, gke, node, termination

k8s-node-termination-handler's Introduction

⚠️ Deprecation Notice

As of Kubernetes 1.20, Graceful Node Shutdown replaces the need for GCP Node termination handler. GKE on versions 1.20+ enables Graceful Node Shutdown by default. Refer to the GKE documentation and Kubernetes documentation for more info about Graceful Node Shutdown (docs, blog post).

This is not an official Google Project

Kubernetes on GCP Node Termination Event Handler

This project provides an adapter for translating GCE node termination events into graceful pod terminations in Kubernetes. GCE VMs are typically live migratable. However, Preemptible VMs and VMs with Accelerators are not live migratable and are therefore prone to VM terminations. Do not use this project unless you are managing k8s clusters that run non-migratable VM types.

To deploy this solution to a GKE or a GCE cluster:

kubectl apply -f deploy/

Note: This solution requires kubernetes versions >= 1.11 to work on Preemptible nodes.

The app deployed as part of this solution does the following:

  1. Launches a pod on every node in the cluster; the pod contains the node termination monitoring agent.
  2. The agent in the pod watches for node terminations via the GCE metadata APIs.
  3. Whenever a termination event is observed, the agent does the following:
    1. Taints the node to prevent new pods from being scheduled (see the taint sketch after this list).
    2. Deletes all pods outside the kube-system namespace before deleting the ones in it. Certain system pods, such as logging agents, may need extra time to flush logs prior to termination, which is why pods in the kube-system namespace are deleted last.
    3. Reboots the node if the underlying VM is not a preemptible VM. VMs with Accelerators are expected to handle host maintenance events transparently when restarted, and restarts are generally faster too.
  4. If the underlying node is not scheduled for maintenance, the agent removes any previously applied taints, thereby restoring the node after the termination event.
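
For illustration only, here is a hedged sketch of how step 1 (tainting the node) could be done with client-go. This is not the handler's actual code; the taint key and node name below are made up.

// Hypothetical sketch: apply a NoSchedule taint to a node using client-go.
// The taint key and node name are illustrative, not the handler's real values.
package main

import (
	"context"
	"log"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs as a pod on the node
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName := "example-node" // hypothetical node name
	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    "example.com/impending-node-termination", // illustrative key only
		Effect: v1.TaintEffectNoSchedule,
	})
	if _, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
}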

The agent crashes whenever it encounters an unrecoverable error with the metadata APIs. This agent is not production hardened yet, so use it with caution.
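
As an illustration of the metadata watching mentioned above, the following is a minimal, hedged sketch that polls the GCE metadata server for the instance/preempted flag using the official Go metadata client. The real agent's implementation may differ (for example, it may use the metadata server's wait-for-change semantics rather than polling).

// Assumed approach, not the handler's exact code: poll the metadata server
// and react when the instance is flagged for preemption.
package main

import (
	"log"
	"time"

	"cloud.google.com/go/compute/metadata"
)

func main() {
	for {
		// "instance/preempted" returns "TRUE" once Compute Engine preempts the VM.
		preempted, err := metadata.Get("instance/preempted")
		if err != nil {
			// Mirrors the crash-on-unrecoverable-error behaviour described above.
			log.Fatalf("unrecoverable metadata error: %v", err)
		}
		if preempted == "TRUE" {
			log.Println("termination event observed, starting graceful pod eviction")
			return
		}
		time.Sleep(time.Second)
	}
}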

Graceful terminations for regular pods (Non-system pods)

Pods that are not in the kube-system namespace are referred to as regular pods by this agent. By default, regular pods are deleted immediately, before the system pods. If you want regular pods to be deleted gracefully, add --system-pod-grace-period=n to the arguments according to the following rules:

  • If the targeted VM is a Preemptible VM, specify n with a value from 0s to 14s.
  • If the targeted VM is a regular VM, specify n with a value from 0s to (--regular-vm-timeout / 2) - 1.

If you follow the rules above, VM timeout - system-pod-grace-period will be given as the grace period for deleting regular pods. Note that the VM timeout for a Preemptible VM is 30 seconds.

If you specify 0s, the system pods will be terminated immediately and the regular pods will have about 30 seconds of grace period. If you specify 14s, both system and regular pods will have about 14s of grace period.
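
A minimal sketch of the arithmetic above, assuming the behaviour described in this section (the function and values are illustrative, not the handler's exact code):

package main

import (
	"fmt"
	"time"
)

// regularPodGracePeriod returns the grace period left for regular pods after
// reserving systemPodGracePeriod out of the VM timeout, but only if regular
// pods still keep at least half of the total budget (assumed rule).
func regularPodGracePeriod(vmTimeout, systemPodGracePeriod time.Duration) time.Duration {
	if vmTimeout >= 2*systemPodGracePeriod {
		return vmTimeout - systemPodGracePeriod
	}
	return 0
}

func main() {
	vmTimeout := 30 * time.Second                                 // preemptible VM budget
	fmt.Println(regularPodGracePeriod(vmTimeout, 0*time.Second))  // 30s left for regular pods
	fmt.Println(regularPodGracePeriod(vmTimeout, 14*time.Second)) // 16s left for regular pods
}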

Also, half of the VM timeout (e.g. 15s for the preemptible 30s timeout) cannot itself be used as the maximum value of --system-pod-grace-period, which is why the maximum for Preemptible VMs is 14s rather than 15s.

In addition, if a delete request fails, the agent retries internally with exponential backoff. In that case the grace period is recalculated to account for the elapsed time, which may shorten the effective grace period.


k8s-node-termination-handler's Issues

Error thrown on GKE private cluster

Hi,

I've got your project running nicely on a public cluster, but now I had to create a private one. The setup of the cluster, excluding the part that makes it private, and the deployment are exactly the same as on the public one, yet the node-termination-handler throws this error:

I0227 04:54:26.860229  772575 main.go:72] Excluding pods map[node-termination-handler-bd2d4:kube-system]
F0227 04:54:26.861467  772575 main.go:79] metadata: GCE metadata "instance/scheduling/on-host-maintenance" not defined

Any idea what I'm missing here?

Kind regards,

Eric

Can't find kubeconfig

Thanks very much for your work on this! One problem when I tried to get it going: the node-termination-handler container starts up and produces the following error in the logs:

F0618 16:10:47.732227 23581 main.go:59] Failed to get kubernetes API Server Client. Error: stat /var/lib/kubelet/kubeconfig: no such file or directory

Any ideas?

How to handle the G2 Soft Off signal

We are testing how to use Preemptible VMs in GKE with this project.
The Preemptible VM documentation here says:

Compute Engine sends a preemption notice to the instance in the form of an ACPI G2 Soft Off signal. You can use a shutdown script to handle the preemption notice and complete cleanup actions before the instance stops.

In our test, if we do not prevent systemd from handling this signal, pods could not shut down gracefully. [1]
So my question is: instead of monitoring the instance/preempted event, how about handling the G2 Soft Off signal directly? Would that be a better approach?

Thanks.

[1] https://raw.githubusercontent.com/axot/examples/daisaru11/nginx-stresstest/staging/nginx-metrics/disable-powerkey.yaml

[Question] Handles simulated termination events?

Hi there,

Does this handle simulated terminations triggered by stopping a preemptible instance manually? An initial test seems to suggest not; the node was not being drained.

If not, how can we test terminations?

Thanks,
Jonathan

[Reminder] Update the deploy/k8s.yaml "args" with the "taint" flag when releasing a new image.

Story:
I built my own node-termination-handler from master/head and tried to use deploy/k8s.yaml to bring up the node-termination-handler pods. I encountered the error below:

I0417 20:56:13.190654   40157 main.go:109] Using kube config: &{Host:https://10.11.240.1:443 APIPath: ContentConfig:{AcceptContentTypes: ContentType: GroupVersion:<nil> NegotiatedSerializer:<nil>} Username: Password: BearerToken:eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJub2RlLXRlcm1pbmF0aW9uLWhhbmRsZXItdG9rZW4tNnZtYzYiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoibm9kZS10ZXJtaW5hdGlvbi1oYW5kbGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMjM1YjFkN2YtNjE0OS0xMWU5LTgwYzItNDIwMTBhODAwMWNmIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Omt1YmUtc3lzdGVtOm5vZGUtdGVybWluYXRpb24taGFuZGxlciJ9.W_GIvwGLbd7KIdgDWnRk2dQQJaXoIk0LbyzMcLLWMyS45SOIl8baSPGVMYiMMzcoCWUk03Z7fXIKWtzB2lSxQ7FVw_KAPZW3_JI4PBEI9jb19GZmbF7dsiANFKulArFi0WUGGB-LJ9vPjFTYaKIE3AknQsarql5furB807Pl-TVtwLjw10k6voLnS9sB3EHEW2rV5tHK6IANRF50QVBbyBUgy6rqqEXdK6bE1OCykIhr_qNE31wALMXG0ukemHg33GkSDHHD_M5FrVffFEqlzp3ycViiF_gLvoT7lZhOia0Zevyo7L-_XuW8v4LwR40I8LOqowvEN3wz6gOjP1V-dw Impersonate:{UserName: Groups:[] Extra:map[]} AuthProvider:<nil> AuthConfigPersister:<nil> ExecProvider:<nil> TLSClientConfig:{Insecure:false ServerName: CertFile: KeyFile: CAFile:/var/run/secrets/kubernetes.io/serviceaccount/ca.crt CertData:[] KeyData:[] CAData:[]} UserAgent: Transport:<nil> WrapTransport:<nil> QPS:0 Burst:0 RateLimiter:<nil> Timeout:0s Dial:<nil>}

F0417 20:56:13.191863   40157 main.go:66] Must specify one of taint or annotation

@mindprince helped with debugging this, and we noticed that some new flags were added in recent code but the example in deploy/k8s.yaml is still based on an old version. (It works with the old image because the change is not included, but it will break when a new image is released.)

Suggestion:
This issue is a reminder that when releasing a new image, we also need to update the containers/args with the flag "-taint" or "-annotation"; otherwise the yaml won't work.

e.g.

- image: gcr.io/node-termination-handler@sha256:NEW_SHA
  name: node-termination-handler
  command: ["./node-termination-handler"]
  args: ["--logtostderr", "--exclude-pods=$(POD_NAME):$(POD_NAMESPACE)", "-v=10", "-taint=blah:blah:blah"]

What if containers take more than 30 secs to start?

Hi Team,

There is a scenario where many of our containers may take more than 30 seconds to start. In other words, 30 seconds will probably not be enough for new replicas to start when the VMs receive the preemption signal.

Is it possible to modify the draining_timeout_when_node_expired_ms value to 45 seconds to solve the above problem?

Consider consolidating logging

The logs don't all seem to follow a structured output, which makes it really hard to follow what's happening when you're trying to debug an issue. Would it be possible to unify logging?

Metrics endpoint

Does this expose an OpenMetrics endpoint? If not, that would be a great feature addition.

[Question] Why is the node-termination-handler sometimes not able to delete all the pods?

Hi,

I have preemptible nodes with more than 40 pods.
For some reason the handler is not able to delete all the pods. It starts, and after deleting around 20 pods it stops, with no further logs from that point on.
I also tried deleting the pods myself at the same time that the pod listing in eviction.go:66 takes place, but with no success either.

Thanks for your help

System pods may not shut down gracefully

Thank you for the good tool.

I use this tool on GKE clusters with Preemptible VMs. In my environment I have hit two issues where system pods do not shut down gracefully: when the Preemptible VM's grace period (30 sec) was reached, the system pods were forcibly terminated without a graceful shutdown.

(1) When the GracePeriod of regular pods is 0, the process that waits for regular pods to be deleted may enter an infinite loop.

The GracePeriod of regular pods is treated as 0 if the condition "regular pod timeout" >= 2 * "system pod timeout" is not satisfied.

// Evict regular pods first.
var gracePeriod int64
// Reserve time for system pods if regular pods have adequate time to exit gracefully.
if timeout >= 2*p.systemPodGracePeriod {
	gracePeriod = int64(timeout.Seconds() - p.systemPodGracePeriod.Seconds())
}

If a delete request succeeds but the regular pod is not actually deleted, the agent will wait indefinitely, because a timeout of 0 is interpreted as infinity.

// wait for pods to be actually deleted since deletion is asynchronous & pods have a deletion grace period to exit gracefully.
for _, pod := range pods {
	if err := p.waitForPodNotFound(pod.Name, pod.Namespace, time.Duration(*deleteOptions.GracePeriodSeconds)*time.Second); err != nil {
		glog.Errorf("Pod %q/%q did not get deleted within grace period %d seconds: %v", pod.Namespace, pod.Name, deleteOptions.GracePeriodSeconds, err)
	}
}

// Over very short intervals you may receive no ticks before the channel is
// closed. A timeout of 0 is interpreted as an infinity.
//
// Output ticks are not buffered. If the channel is not ready to receive an
// item, the tick is skipped.
func poller(interval, timeout time.Duration) WaitFunc {

I thought that if the GracePeriod of regular pods is 0 there is no need to wait for the deletion to complete, but if you do that, you could lose the logs of the regular pods, because the system pods (e.g. logging agents) would then be deleted first.

Alternatively, you might want to retry the delete request. Or maybe the current logic is correct as it is, because it doesn't delete the system pods until all the regular pods are deleted. I don't have a good answer.

(2) If the GracePeriod of regular pods is greater than or equal to 1, the GracePeriod of system pods may end up shorter than expected.

If some regular pods take a long time to delete (or get stuck in a state where they are never deleted), system pods may lose their GracePeriod. This is because waiting for regular pod deletion is handled serially, and since each pod has its own timeout, it may take a long time for all pod deletions to complete.

// wait for pods to be actually deleted since deletion is asynchronous & pods have a deletion grace period to exit gracefully.
for _, pod := range pods {
	if err := p.waitForPodNotFound(pod.Name, pod.Namespace, time.Duration(*deleteOptions.GracePeriodSeconds)*time.Second); err != nil {
		glog.Errorf("Pod %q/%q did not get deleted within grace period %d seconds: %v", pod.Namespace, pod.Name, deleteOptions.GracePeriodSeconds, err)
	}
}

It seemed to me that the deletion wait needs to be changed to parallel processing, with the timeout applied as an overall deadline rather than per pod.
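
A minimal sketch of that idea (hypothetical, not a proposed patch): wait for all deletions in parallel against one shared deadline, reusing the waitForPodNotFound helper quoted above. The podEvictor receiver name is made up for illustration, and the snippet assumes the same imports as the surrounding file plus "sync".

// Hypothetical sketch: wait for every pod deletion concurrently, with the
// timeout applied as one overall deadline rather than per pod.
func (p *podEvictor) waitForPodsNotFoundParallel(pods []v1.Pod, timeout time.Duration) {
	deadline := time.Now().Add(timeout)
	var wg sync.WaitGroup
	for _, pod := range pods {
		wg.Add(1)
		go func(name, namespace string) {
			defer wg.Done()
			// Every pod shares the same remaining budget instead of its own timeout.
			if err := p.waitForPodNotFound(name, namespace, time.Until(deadline)); err != nil {
				glog.Errorf("Pod %q/%q was not deleted before the overall deadline: %v", namespace, name, err)
			}
		}(pod.Name, pod.Namespace)
	}
	wg.Wait()
}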

Q: Is this tool discontinued or ready and production safe?

Hi, I would love to use this termination handler, but in the readme is written the following:

This agent is not production hardened yet and so use it with caution.

and the last commit was 9 months ago. Is this project discontinued, or is it ready to be used?

Error thrown on GKE cluster, permission issue

Hi,

I just deployed your application (I created a Helm chart for it), but I get this error with your ClusterRole:

I0114 09:24:30.253282 10558 round_trippers.go:405] GET https://10.92.128.1:443/api/v1/nodes/gke-dashur-dev-dashur-dev-nodepool-2c0fca4c-xjcv 403 Forbidden in 22 milliseconds
I0114 09:24:30.253309 10558 round_trippers.go:411] Response Headers:
I0114 09:24:30.253314 10558 round_trippers.go:414] Content-Length: 409
I0114 09:24:30.253318 10558 round_trippers.go:414] Date: Tue, 14 Jan 2020 09:24:30 GMT
I0114 09:24:30.253322 10558 round_trippers.go:414] Audit-Id: 58e9a508-c00c-4d00-a71e-7a55fbfa1e24
I0114 09:24:30.253326 10558 round_trippers.go:414] Content-Type: application/json
I0114 09:24:30.253329 10558 round_trippers.go:414] X-Content-Type-Options: nosniff
I0114 09:24:30.253355 10558 request.go:874] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes \"gke-dashur-dev-dashur-dev-nodepool-2c0fca4c-xjcv\" is forbidden: User \"system:serviceaccount:kube-system:node-terrmination-handler\" cannot get resource \"nodes\" in API group \"\" at the cluster scope","reason":"Forbidden","details":{"name":"gke-dashur-dev-dashur-dev-nodepool-2c0fca4c-xjcv","kind":"nodes"},"code":403}
I0114 09:24:30.253923 10558 taint.go:81] Failed to remove taint: nodes "gke-dashur-dev-dashur-dev-nodepool-2c0fca4c-xjcv" is forbidden: User "system:serviceaccount:kube-system:node-terrmination-handler" cannot get resource "nodes" in API group "" at the cluster scope
I0114 09:24:30.253947 10558 handler.go:90] Failed to process initial node state - nodes "gke-dashur-dev-dashur-dev-nodepool-2c0fca4c-xjcv" is forbidden: User "system:serviceaccount:kube-system:node-terrmination-handler" cannot get resource "nodes" in API group "" at the cluster scope
F0114 09:24:30.253957 10558 main.go:87] nodes "gke-dashur-dev-dashur-dev-nodepool-2c0fca4c-xjcv" is forbidden: User "system:serviceaccount:kube-system:node-terrmination-handler" cannot get resource "nodes" in API group "" at the cluster scope

It looks like the ClusterRole is still missing a permission needed to access the nodes. Any idea what needs to be added?

Kind regards,

Eric V.

Node Termination handler may still be necessary

The current README states this handler is deprecated in favor of the new Graceful Node Shutdown:

⚠️ Deprecation Notice
As of Kubernetes 1.20, Graceful Node Shutdown replaces the need for GCP Node termination handler. GKE on versions 1.20+ enables Graceful Node Shutdown by default. Refer to the GKE documentation and Kubernetes documentation for more info about Graceful Node Shutdown (docs, blog post).

I have been using the Node Termination handler with GKE < 1.20, using preemptible VMs with GPUs. The handler was needed to avoid a race condition on node restart that sometimes caused pods not to correctly recognize the GPU.

I have moved to GKE 1.21.1-gke.2200 and hit the same error I would get on versions < 1.20 without the Node Termination handler. The error happens only occasionally, so it seems like potentially the same race condition.

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

I filed the following GKE issue.
https://issuetracker.google.com/issues/192809336

For the moment, I would ask that this repo not be deprecated.

Slack URL is specified, but no notifications are received

✗  kubectl -n kube-system exec -it  pod/node-termination-handler-s8nkk -- sh
/app # apk add curl
...
OK: 7 MiB in 19 packages
/app # curl -X POST -H 'Content-type: application/json' --data '{"text":"Manual message from inside pod/node-termination-handler-s8nkk"}' $SLACK_WEBHOOK_URL
ok/app #

The message goes to Slack.

✗  gcloud compute instances simulate-maintenance-event [redacted]-n2d-standard-4-preemptibl-9f1a666f-177n --zone=us-central1-a
Simulating maintenance on instance(s) [https://compute.googleapis.com/compute/v1/projects/[redacted]/zones/us-central1-a/instances/[redacted]-n2d-standard-4-preemptibl-9f1a666f-177n]...done.
✗  kubectl -n kube-system logs -f node-termination-handler-s8nkk
...
 I1119 22:48:46.179414   46739 gceTerminationHandler.go:135] Handling maintenance event with state: "TRUE"                                                                                                                                                                        
I1119 22:48:46.179484   46739 gceTerminationHandler.go:141] Recording impending termination                                                                                                                                                                                      
I1119 22:48:46.179630   46739 handler.go:54] Current node state: {[redacted]-n2d-standard-4-preemptibl-f73d7c1b-t9x5 true 2020-11-19 22:49:16.179487009 +0000 UTC m=+6840.282105137 false}                                                                                      
I1119 22:48:46.179826   46739 handler.go:64] Applying taint prior to handling termination 
...

We have an impending termination, but there is no Slack notification... :(
