This project forked from medik8s/node-healthcheck-operator

K8s Node Health Check Operator

License: Apache License 2.0

Shell 0.57% Go 91.29% Makefile 7.30% Dockerfile 0.83%

Node Healthcheck Operator

Introduction

Hardware is imperfect, and software contains bugs. When node-level failures such as kernel hangs or dead NICs occur, the work required from the cluster does not decrease - workloads from affected nodes need to be restarted somewhere.

However, some workloads, such as RWO volumes and StatefulSets, may require at-most-one semantics. Failures affecting these kinds of workloads risk data loss and/or corruption if nodes (and the workloads running on them) are assumed to be dead whenever we stop hearing from them. For this reason it is important to know that the node has reached a safe state before initiating recovery of the workload.

Unfortunately, it is not always practical to require admin intervention to confirm the node's true status. To automate the recovery of exclusive workloads, the Medik8s project presents a collection of operators that can be installed on any Kubernetes-based cluster to automate failure detection and fencing / remediation. For more information, visit our homepage.

Failure detection with the Node Healthcheck operator

Handling unhealthy nodes

A Node that stays in an unready state for 5 minutes is an obvious sign that a failure has occurred. However, there may be other criteria or thresholds that are more appropriate based on your particular physical environment, workloads, and tolerance for risk.

The Node Healthcheck operator checks each Node's set of NodeConditions against the criteria and thresholds defined in NodeHealthCheck (NHC) custom resources (CRs).

If the Node is deemed to be in a failed state, and remediation is appropriate, the controller will instantiate a remediation custom resource based on the remediation template(s) defined in the NHC CR. NHC can be configured with a single remediation method, or with a list of remediation methods that are used one after another in a specified order, each with its own timeout.

This template-based mechanism allows cluster admins to use the best remediator for their environment, without NHC having to know them beforehand. Remediators might use e.g. Kubernetes' ClusterAPI, OpenShift's MachineAPI, BMC, Watchdog or software-based reboots for fencing the workloads. For more details see the remediation documentation.

When the Node recovers and becomes healthy again, NHC will delete the remediation CR to signal that node recovery was successful.

Special cases

Control plane problems

Remediation is not always the correct response to a failure. Especially in larger clusters, we want to protect against failures that appear to take out large portions of compute capacity but are really the result of failures on or near the control plane. For this reason, the NHC CR includes the ability to define a minimum number of healthy nodes, by percentage or absolute number. When the cluster is falling short of this threshold, no further remediation will be started.
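The threshold check above can be sketched in Go. This is an illustration only: the string format (`"51%"` or `"3"`), the function names, and the choice to round percentages up are assumptions, not the operator's documented behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// RequiredHealthy converts a hypothetical minimum-healthy setting, given
// either as a percentage ("51%") or an absolute number ("3"), into a node
// count. Percentages are rounded up (an assumption made for this sketch).
func RequiredHealthy(minHealthy string, total int) (int, error) {
	if strings.HasSuffix(minHealthy, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(minHealthy, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %w", minHealthy, err)
		}
		if pct < 0 || pct > 100 {
			return 0, fmt.Errorf("percentage %q out of range", minHealthy)
		}
		// Integer ceiling of total*pct/100 for non-negative operands.
		return (total*pct + 99) / 100, nil
	}
	n, err := strconv.Atoi(minHealthy)
	if err != nil {
		return 0, fmt.Errorf("invalid node count %q: %w", minHealthy, err)
	}
	return n, nil
}

// MayRemediate reports whether new remediations may start: the number of
// healthy nodes must not fall short of the required minimum.
func MayRemediate(minHealthy string, total, healthy int) (bool, error) {
	required, err := RequiredHealthy(minHealthy, total)
	if err != nil {
		return false, err
	}
	return healthy >= required, nil
}

func main() {
	// With 10 nodes and a 51% minimum, 6 healthy nodes are required.
	for _, healthy := range []int{7, 6, 5} {
		ok, err := MayRemediate("51%", 10, healthy)
		fmt.Println(healthy, ok, err)
	}
}
```

With a setting like this, a control plane outage that makes most nodes look unhealthy at once drops the healthy count below the minimum, so no new remediation is started.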

Cluster Upgrades

Cluster upgrades usually involve rebooting worker nodes, mainly to apply OS updates. These nodes might be unhealthy for some time during the reboots. The disruption can also cause other nodes to become overloaded and appear unhealthy while they compensate for the lost compute capacity. Making remediation decisions at this moment may interfere with the upgrade and may even cause it to fail completely. For that reason, NHC will stop remediating new unhealthy nodes when it detects that a cluster is upgrading.

At the moment this is only supported on OpenShift, by monitoring the ClusterVersionOperator.

Manual pausing

Before running cluster upgrades on Kubernetes, or for any other reason, cluster admins can prevent new remediation by pausing the NHC CR.

Further information

For more details about using or contributing to Node Healthcheck, check out our docs.

Help

Please join our Google group to ask questions. When you find a bug, please open an issue in this repository.

Contributors

slintes, rgolangh, openshift-merge-robot, razo7, mshitrit, openshift-ci[bot], clobrano, n1r1, beekhof, dependabot[bot]
