
maksim-paskal / aks-node-termination-handler


Gracefully handle Azure Virtual Machines shutdown within Kubernetes

License: Apache License 2.0

Languages: Go 94.16%, Makefile 3.55%, Shell 1.99%, Dockerfile 0.29%
Topics: aks, kubernetes, maintenance-events, spot-instances

aks-node-termination-handler's People

Contributors

maksim-paskal, nclaeys, sarita-maersk, tomaszslawski-tomtom

aks-node-termination-handler's Issues

Feature request: Run in Queue mode similar to AWS NTH

I'm mostly asking to see whether there is interest in this. I haven't dug deep enough to know how complex it might turn out. If there is interest, I'd be happy to help develop the feature.

I think the requirements are pretty simple?

  • Create ASQ
  • Send related node events to the queue. This is where I'm unsure what is possible with Azure
  • Create alternate code paths that watch the queue instead of the local endpoint
  • Continue with the standard taint/drain node process
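
One way to picture the "alternate code paths" step is a small interface that hides where events come from, so the taint/drain logic stays the same in both modes. The Go sketch below is only an illustration: Event, EventSource, memorySource and processAll are hypothetical names, not types from this project, and the in-memory source stands in for a real queue client.

```go
package main

import "fmt"

// Event is a hypothetical, minimal termination event.
type Event struct {
	Node string
	Type string
}

// EventSource hides where events come from: the local metadata endpoint
// or a queue. Both modes would implement this interface.
type EventSource interface {
	// Next returns the next event, or ok=false when none is available.
	Next() (ev Event, ok bool)
}

// memorySource is an in-memory stand-in for a real queue client.
type memorySource struct {
	events []Event
}

func (m *memorySource) Next() (Event, bool) {
	if len(m.events) == 0 {
		return Event{}, false
	}
	ev := m.events[0]
	m.events = m.events[1:]
	return ev, true
}

// processAll reads events until the source is empty and runs the shared
// drain step for each one. It returns the nodes drained, in order.
func processAll(src EventSource, drain func(node string) error) ([]string, error) {
	var drained []string
	for {
		ev, ok := src.Next()
		if !ok {
			return drained, nil
		}
		if err := drain(ev.Node); err != nil {
			return drained, fmt.Errorf("drain %s: %w", ev.Node, err)
		}
		drained = append(drained, ev.Node)
	}
}

func main() {
	queue := &memorySource{events: []Event{
		{Node: "node-a", Type: "Preempt"},
		{Node: "node-b", Type: "Terminate"},
	}}
	drained, err := processAll(queue, func(node string) error {
		fmt.Println("taint and drain", node)
		return nil
	})
	fmt.Println(drained, err)
}
```

In queue mode a single consumer could handle events for any node, whereas the local-endpoint mode only ever sees events for the node it runs on.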

Support Proxy in Webhook client

Hey, I was wondering if the webhook client should already honor the HTTPS_PROXY environment variable or not?
I set

env:
  - name: "HTTPS_PROXY"
    value: "http://someProxy:somePort"
args:
  - "-webhook.url=https://myWebhook"
  - "-webhook.template-file=/files/webhook.json"
  - "-webhook.contentType=application/json"
  - "-webhook.method=POST"
  - "-webhook.timeout=30s"

The variable does seem to be picked up, since the pod cannot start if I do not set NO_PROXY correctly.

However, the webhook requests do not seem to go through the configured proxy.

Can a proxy already be configured for the webhook client, and if so, how?

Otherwise, this would be a feature request, if possible :)
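
For context, in Go's net/http the default transport consults HTTPS_PROXY, HTTP_PROXY and NO_PROXY via http.ProxyFromEnvironment, but a hand-built http.Transport with no Proxy function connects directly. Whether that is what happens in this project is not confirmed here. The sketch below shows how a webhook client could opt in explicitly. newWebhookClient, proxyFor and fixedProxyDemo are hypothetical helpers, and proxy.example:3128 is a placeholder address.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newWebhookClient builds an HTTP client whose transport asks proxy which
// proxy to use for each request. Passing http.ProxyFromEnvironment makes the
// client honor HTTPS_PROXY, HTTP_PROXY and NO_PROXY.
func newWebhookClient(timeout time.Duration, proxy func(*http.Request) (*url.URL, error)) *http.Client {
	return &http.Client{
		Timeout:   timeout,
		Transport: &http.Transport{Proxy: proxy},
	}
}

// proxyFor reports which proxy URL the client would use for target.
// An empty string means a direct connection.
func proxyFor(c *http.Client, target string) (string, error) {
	t, ok := c.Transport.(*http.Transport)
	if !ok || t.Proxy == nil {
		return "", nil
	}
	req, err := http.NewRequest(http.MethodPost, target, nil)
	if err != nil {
		return "", err
	}
	u, err := t.Proxy(req)
	if err != nil || u == nil {
		return "", err
	}
	return u.String(), nil
}

// fixedProxyDemo uses a fixed proxy URL instead of the environment, so the
// result does not depend on the machine running it.
func fixedProxyDemo(proxyURL, target string) (string, error) {
	u, err := url.Parse(proxyURL)
	if err != nil {
		return "", err
	}
	return proxyFor(newWebhookClient(30*time.Second, http.ProxyURL(u)), target)
}

func main() {
	// In a real deployment the proxy would come from the environment.
	client := newWebhookClient(30*time.Second, http.ProxyFromEnvironment)
	fromEnv, err := proxyFor(client, "https://myWebhook")
	fmt.Printf("from environment: %q (err=%v)\n", fromEnv, err)

	fixed, err := fixedProxyDemo("http://proxy.example:3128", "https://myWebhook")
	fmt.Printf("fixed proxy: %q (err=%v)\n", fixed, err)
}
```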

Issue with metric scraping via PodMonitor

First of all thank you for your work on this project!

I wanted to enable metric scraping for Prometheus using a PodMonitor. However, I ran into a problem with the following PodMonitor configuration:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  namespace: kube-system
  name: podmonitor-aks-node-termination-handler
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: aks-node-termination-handler
  podMetricsEndpoints:
  - port: 17923
    path: /metrics
    interval: 15s

The problem is that a properly defined PodMonitor should reference the port by name rather than by number: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#podmetricsendpoint

Could you please add a port definition to the container in the DaemonSet, so that the pod exposes a named port? I'm attaching an example patch for the DaemonSet below:

kubectl patch daemonset aks-node-termination-handler --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/ports", "value": [{"containerPort": 17923, "name": "metrics", "protocol": "TCP"}]}]'

Thank you
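
If the patch is generated from a program rather than typed by hand, building it from typed values avoids shell-quoting mistakes. The Go sketch below produces the same JSON Patch document as the kubectl command above. The type and function names are illustrative only, and the port name "metrics" is the one proposed in this issue, not something the project is known to define.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// containerPort mirrors the fields used in the patch above.
type containerPort struct {
	ContainerPort int    `json:"containerPort"`
	Name          string `json:"name"`
	Protocol      string `json:"protocol"`
}

// patchOp is a single JSON Patch operation.
type patchOp struct {
	Op    string          `json:"op"`
	Path  string          `json:"path"`
	Value []containerPort `json:"value"`
}

// metricsPortPatch returns a JSON Patch that adds one named TCP port to the
// first container of a DaemonSet pod template.
func metricsPortPatch(port int, name string) (string, error) {
	ops := []patchOp{{
		Op:   "add",
		Path: "/spec/template/spec/containers/0/ports",
		Value: []containerPort{
			{ContainerPort: port, Name: name, Protocol: "TCP"},
		},
	}}
	b, err := json.Marshal(ops)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	patch, err := metricsPortPatch(17923, "metrics")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The output can be passed to: kubectl patch ... --type='json' -p='<patch>'
	fmt.Println(patch)
}
```

With a named port in place, the PodMonitor endpoint could refer to it as port: metrics instead of the numeric value.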

OpenShift Support

Hi,
When running on OpenShift, there are no VirtualMachineScaleSets (only VirtualMachines), so the DaemonSet crashes (logs attached below).
Could OpenShift support be added?

{"file":"github.com/maksim-paskal/aks-node-termination-handler/cmd/main.go:55","func":"main.main","level":"info","msg":"Starting 1.0.13-d8d5a71-1707463489...","time":"2024-03-07T10:48:45Z"}
{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/alert/alert.go:29","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/alert.Init","level":"warning","msg":"not sending Telegram message, no token","time":"2024-03-07T10:48:45Z"}
{"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/client/client.go:45","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/client.Init","level":"info","msg":"No kubeconfig file use incluster","time":"2024-03-07T10:48:45Z"}
{"error":"error in getting azure resource name: azure:///subscriptions/dd6b40ef-de5f-4649-95a7-bd2337c71900/resourceGroups/ocp-azure-uat-euw-8npmn-rg/providers/Microsoft.Compute/virtualMachines/master-1: azureProviderID not valid","file":"github.com/maksim-paskal/aks-node-termination-handler/cmd/main.go:86","func":"main.main","level":"fatal","msg":"","time":"2024-03-07T10:48:45Z"}
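
The fatal log line shows a provider ID that ends in /virtualMachines/master-1 with no scale set segment. A parser that accepts both shapes could look roughly like the Go sketch below. This is not the project's code: resourceNameFromProviderID is a hypothetical helper, the scale-set result format (scale set name, underscore, instance ID) follows resource names such as aks-spotcpu2m8-23666972-vmss_862 seen in logs elsewhere on this page, and using the bare VM name for standalone machines is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceNameFromProviderID extracts a resource name from an Azure node
// provider ID. Scale set members become "<scaleSet>_<instanceID>".
// Standalone VMs return the VM name (assumption, see lead-in).
func resourceNameFromProviderID(providerID string) (string, error) {
	const prefix = "azure://"
	if !strings.HasPrefix(providerID, prefix) {
		return "", fmt.Errorf("providerID %q does not start with %q", providerID, prefix)
	}
	parts := strings.Split(strings.Trim(strings.TrimPrefix(providerID, prefix), "/"), "/")

	for i := 0; i+1 < len(parts); i++ {
		switch {
		// .../virtualMachineScaleSets/<name>/virtualMachines/<instance>
		case strings.EqualFold(parts[i], "virtualMachineScaleSets") &&
			i+3 < len(parts) &&
			strings.EqualFold(parts[i+2], "virtualMachines"):
			return parts[i+1] + "_" + parts[i+3], nil
		// .../virtualMachines/<name>
		case strings.EqualFold(parts[i], "virtualMachines"):
			return parts[i+1], nil
		}
	}
	return "", fmt.Errorf("providerID %q has no virtual machine segment", providerID)
}

func main() {
	ids := []string{
		// Shape taken from the log above (standalone VM).
		"azure:///subscriptions/dd6b40ef-de5f-4649-95a7-bd2337c71900/resourceGroups/ocp-azure-uat-euw-8npmn-rg/providers/Microsoft.Compute/virtualMachines/master-1",
		// Hypothetical scale set member.
		"azure:///subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/aks-pool-vmss/virtualMachines/862",
	}
	for _, id := range ids {
		name, err := resourceNameFromProviderID(id)
		fmt.Println(name, err)
	}
}
```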

Enhancement for EventType freeze

Hi,
Currently, when Azure sends an event of type Freeze, aks-node-termination-handler drains all pods and stops watching for new events.
The problem we see is that Azure does not take the worker node down, so the VM scale set does not create a replacement node.
The node remains unschedulable and is still billed.

If possible, for the Freeze event type only: after the drain, keep watching for events, and when a new event indicates the node is back to normal (unfrozen), uncordon the node and continue watching.
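
The requested behaviour can be sketched as a small decision function that runs on every poll. The Go code below is only an illustration under a stated assumption: a Freeze event stops being listed for the node once the freeze is over. Action, ScheduledEvent and nextAction are hypothetical names, not part of this project.

```go
package main

import "fmt"

// Action is what the handler should do after one poll.
type Action string

const (
	ActionNone     Action = "none"
	ActionDrain    Action = "drain"
	ActionUncordon Action = "uncordon"
)

// ScheduledEvent holds only the fields this sketch needs.
type ScheduledEvent struct {
	EventType string
	Resources []string
}

// nextAction decides what to do for one node after one poll.
// cordoned says whether an earlier Freeze already cordoned the node.
// Assumption: the Freeze event is no longer listed once it is over.
func nextAction(cordoned bool, events []ScheduledEvent, resource string) Action {
	pending := false
	for _, e := range events {
		if e.EventType != "Freeze" {
			continue
		}
		for _, r := range e.Resources {
			if r == resource {
				pending = true
			}
		}
	}
	switch {
	case pending && !cordoned:
		return ActionDrain
	case !pending && cordoned:
		return ActionUncordon
	default:
		return ActionNone
	}
}

func main() {
	node := "aks-pool-vmss_1" // hypothetical resource name
	freeze := []ScheduledEvent{{EventType: "Freeze", Resources: []string{node}}}

	fmt.Println(nextAction(false, freeze, node)) // Freeze appears
	fmt.Println(nextAction(true, freeze, node))  // Freeze still listed
	fmt.Println(nextAction(true, nil, node))     // Freeze gone
}
```

Keeping the decision separate from the Kubernetes calls lets the watch loop continue after a drain instead of exiting.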

Missing information

Hello,
What are the requirements for using this tool? For example, do we need to enable "Instance termination" on the scale set of the AKS node pool? If so, how do you recommend enabling it? I do not see an option for it in "az aks create".

Is Slack integration actually working?

Hi, I have set values this way:

aks-node-termination-handler:
  image: image
  imagePullPolicy: Always

  args:
    - "-webhook.url=https://myhook"
    - "-webhook.template='node_termination_event{node=\"{{ .Node }}\"} 1'"
  env: []

  priorityClassName: "system-node-critical"

k get pod
NAME                                 READY   STATUS    RESTARTS   AGE
aks-node-termination-handler-4r776   1/1     Running   0          8m16s
aks-node-termination-handler-g6x25   1/1     Running   0          8m16s
aks-node-termination-handler-ncccj   1/1     Running   0          8m16s
aks-node-termination-handler-tgrzr   1/1     Running   0          8m16s
aks-node-termination-handler-wc2kf   1/1     Running   0          8m17s
aks-node-termination-handler-xc6dt   1/1     Running   0          8m16s
---
k get pod aks-node-termination-handler-tgrzr -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
...
  containers:
  - args:
    - -webhook.url=https://myhook
    - -webhook.template='node_termination_event{node="{{ .Node }}"} 1'
    env:
    - name: MY_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName

In the logs I am getting:

aks-node-termination-handler-r49nz aks-node-termination-handler {"error":"error in sending to webhook: StatusCode=400: http result not OK","file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events/events.go:140","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events.readEndpoint","level":"error","msg":"error in alerts.Send","time":"2024-01-22T13:47:52Z"}

When I try sending a message to the channel via curl:

curl -X POST --data-urlencode "payload={\"channel\": \"#mychannel\", \"username\": \"webhookbot\", \"text\": \"This is posted to #my-channel-here and comes from a bot named webhookbot.\", \"icon_emoji\": \":ghost:\"}" https://myhoook
ok%
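
One difference between the two requests is the body: the curl call sends a Slack-style payload with a text field, while the configured template renders a bare metric-like string wrapped in literal single quotes. Slack incoming webhooks accept a JSON body with a text field, so one possible cause of the 400 is the body format. This has not been verified against the handler. The Go sketch below shows how the same template text could be wrapped into such a body. slackPayload is a hypothetical helper, and .Node is the template field used in the issue.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// slackPayload renders tmpl with the node name and wraps the result in a
// JSON object with a "text" field, as Slack incoming webhooks accept.
func slackPayload(tmpl, node string) (string, error) {
	t, err := template.New("msg").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var msg bytes.Buffer
	if err := t.Execute(&msg, struct{ Node string }{Node: node}); err != nil {
		return "", err
	}
	// json.Marshal takes care of escaping the quotes inside the message.
	body, err := json.Marshal(map[string]string{"text": msg.String()})
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	body, err := slackPayload(`node_termination_event{node="{{ .Node }}"} 1`, "aks-node-1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```

A body like this would be sent with a JSON content type, for example through the -webhook.contentType=application/json flag shown in another issue on this page.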

Duplicated metrics in aks_node_termination_handler_scheduled_events_total

Duplicated Prometheus metric:

aks_node_termination_handler_scheduled_events_total{type="Freeze"} 100

Logs:

{"level":"info","msg":"Excluded event Freeze by user config","time":"2023-06-30T21:47:28Z"}
{"level":"info","msg":"{\"DocumentIncarnation\":7,\"Events\":[{\"EventId\":\"FE88BA46-96D4-432E-8AED-89A0F0E52D99\",\"EventStatus\":\"Scheduled\",\"EventType\":\"Freeze\",\"ResourceType\":\"VirtualMachine\",\"Resources\":[\"aks-spotcpu2m8-23666972-vmss_862\"],\"NotBefore\":\"Fri, 30 Jun 2023 22:00:16 GMT\",\"Description\":\"\",\"EventSource\":\"Platform\",\"DurationInSeconds\":30}]}","time":"2023-06-30T21:47:33Z"}
{"level":"info","msg":"Excluded event Freeze by user config","time":"2023-06-30T21:47:33Z"}
{"level":"info","msg":"{\"DocumentIncarnation\":7,\"Events\":[{\"EventId\":\"FE88BA46-96D4-432E-8AED-89A0F0E52D99\",\"EventStatus\":\"Scheduled\",\"EventType\":\"Freeze\",\"ResourceType\":\"VirtualMachine\",\"Resources\":[\"aks-spotcpu2m8-23666972-vmss_862\"],\"NotBefore\":\"Fri, 30 Jun 2023 22:00:16 GMT\",\"Description\":\"\",\"EventSource\":\"Platform\",\"DurationInSeconds\":30}]}","time":"2023-06-30T21:47:38Z"}
{"level":"info","msg":"Excluded event Freeze by user config","time":"2023-06-30T21:47:38Z"}
{"level":"info","msg":"{\"DocumentIncarnation\":7,\"Events\":[{\"EventId\":\"FE88BA46-96D4-432E-8AED-89A0F0E52D99\",\"EventStatus\":\"Scheduled\",\"EventType\":\"Freeze\",\"ResourceType\":\"VirtualMachine\",\"Resources\":[\"aks-spotcpu2m8-23666972-vmss_862\"],\"NotBefore\":\"Fri, 30 Jun 2023 22:00:16 GMT\",\"Description\":\"\",\"EventSource\":\"Platform\",\"DurationInSeconds\":30}]}","time":"2023-06-30T21:47:43Z"}
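
The logs show the same EventId returned on every poll, five seconds apart, which would inflate a counter that is incremented once per poll. A possible fix is to count each EventId only once. The Go sketch below illustrates that idea with plain maps instead of a Prometheus client. eventCounter and observe are hypothetical names, and the sample document is shortened from the log lines above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sampleDoc is a shortened copy of the document seen in the logs.
const sampleDoc = `{"DocumentIncarnation":7,"Events":[{"EventId":"FE88BA46-96D4-432E-8AED-89A0F0E52D99","EventStatus":"Scheduled","EventType":"Freeze"}]}`

// scheduledEvents holds only the fields needed for counting.
type scheduledEvents struct {
	Events []struct {
		EventID   string `json:"EventId"`
		EventType string `json:"EventType"`
	} `json:"Events"`
}

// eventCounter counts each EventId once, however many polls return it.
// The seen map grows without bound here. A real handler would prune it.
type eventCounter struct {
	seen   map[string]bool
	totals map[string]int
}

func newEventCounter() *eventCounter {
	return &eventCounter{seen: map[string]bool{}, totals: map[string]int{}}
}

// observe parses one polled document and counts only unseen EventIds.
func (c *eventCounter) observe(doc string) error {
	var d scheduledEvents
	if err := json.Unmarshal([]byte(doc), &d); err != nil {
		return err
	}
	for _, e := range d.Events {
		if c.seen[e.EventID] {
			continue
		}
		c.seen[e.EventID] = true
		c.totals[e.EventType]++
	}
	return nil
}

func main() {
	c := newEventCounter()
	// Three polls returning the same document, as in the logs.
	for i := 0; i < 3; i++ {
		if err := c.observe(sampleDoc); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println(c.totals["Freeze"])
}
```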

Windows nodes Support

Hello,

We've been lucky so far on AWS, where aws-handler supports Windows nodes.

We also run some Windows node pools in AKS, so I am wondering whether there are any plans for Windows support. Thanks
