chrishirsch / kuberhealthy-storage-check

A storage check for the kuberhealthy project

License: Apache License 2.0

Dockerfile 0.91% Makefile 0.44% Go 93.65% Shell 4.99%

kuberhealthy-storage-check's Introduction

Kuberhealthy

Please see the parent project (needed to run this check) at Kuberhealthy

Storage Check

This check tests whether a persistent volume claim (PVC) can be created and used within your Kubernetes cluster. It attempts to create a PVC using either the default storage class (SC) or a user-specified one. Once the PVC is created, the check initializes the storage with a known value, discovers the nodes in the cluster, and attempts to use the PVC on each node. If the check can create the PVC, initialize it, and mount and verify the contents of the storage on each discovered (or explicitly allowed/ignored) node, the check succeeds, and you can have confidence in your ability to mount storage on any node that is allowed to be scheduled.

Once the contents of the PVC have been validated, the check Job, the init Job and the PVC will be cleaned up and the check will be marked as successful.

Container resource requests are set to 15 millicores of CPU and 20Mi of memory. The Jobs use the Alpine image alpine:3.11, and the PVC defaults to 1Gi. If the environment variable CHECK_STORAGE_PVC_SIZE is set, its value is used instead of the default.
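As a sketch, the PVC size override goes in the check's container env; the 5Gi value below is purely illustrative:

```yaml
env:
  - name: CHECK_STORAGE_PVC_SIZE
    value: "5Gi"   # illustrative; default is 1Gi
```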

By default, the nodes of the cluster are discovered, and only nodes that are untainted (or whose taints are all specified in CHECK_TOLERATIONS), are in a Ready state, and do not have the master role are used. If nodes need to be ignored for any reason, set the environment variable CHECK_STORAGE_IGNORED_CHECK_NODES to a space- or comma-separated list of node names. If auto-discovery is not desired, set CHECK_STORAGE_ALLOWED_CHECK_NODES to a space- or comma-separated list of the nodes that should be checked. If a node appears in both CHECK_STORAGE_ALLOWED_CHECK_NODES and CHECK_STORAGE_IGNORED_CHECK_NODES, that node is ignored.
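For example, pinning the check to two nodes while skipping a third would look like this in the check's container env (the node names are hypothetical):

```yaml
env:
  - name: CHECK_STORAGE_ALLOWED_CHECK_NODES
    value: "worker-1,worker-2"   # hypothetical node names
  - name: CHECK_STORAGE_IGNORED_CHECK_NODES
    value: "worker-3"            # ignored even though it is also allowed above
```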

By default, the storage check Job and initialize storage check Job will use Alpine's alpine:3.11 image. If a different image is desired, use the environment variable CHECK_STORAGE_IMAGE or CHECK_STORAGE_INIT_IMAGE depending on which image should be changed.

Initializing the storage is straightforward: a file containing storage-check-ok is written to /data/index.html. (There is no particular reason it is named index.html, except perhaps for future additional checks.) If the storage initialization should be done differently, or needs to be more complex, a completely different image can be used as described above (CHECK_STORAGE_INIT_IMAGE). To override the arguments used to create the known data, set the environment variable CHECK_STORAGE_INIT_COMMAND_ARGS, whose default is echo storage-check-ok > /data/index.html && ls -la /data && cat /data/index.html.

Checking the storage is also straightforward (there is a theme here). The check mounts the PVC at /data, cats the /data/index.html file, and pipes the output to grep, looking for storage-check-ok. If it is found, the exit code is 0 and the check passes for that particular node. Because the Pod on the node could mount the storage, see the previously created file, AND see the known contents of the file, the check is OK. For a more advanced check, the entire image can be changed with the environment variable CHECK_STORAGE_IMAGE, or just the command-line arguments via CHECK_STORAGE_COMMAND_ARGS (default: ls -la /data && cat /data/index.html && cat /data/index.html | grep storage-check-ok).
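The default init and check commands can be sketched locally by substituting a temporary directory for the /data PVC mount (the $DATA directory here is just a stand-in for the mounted volume):

```shell
# Stand-in for the PVC mount point normally at /data
DATA=$(mktemp -d)

# Init step: default CHECK_STORAGE_INIT_COMMAND_ARGS, with /data swapped for $DATA
sh -c "echo storage-check-ok > $DATA/index.html && ls -la $DATA && cat $DATA/index.html"

# Check step: default CHECK_STORAGE_COMMAND_ARGS, with /data swapped for $DATA;
# grep exits 0 only if the known value is present, so && gates the pass message
sh -c "ls -la $DATA && cat $DATA/index.html && cat $DATA/index.html | grep storage-check-ok" \
  && echo "check passed"
```

On a real node the same logic runs inside the check Job's container, and the grep exit code decides whether the node passes.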

Custom images can be used for this check and can be specified with the CHECK_STORAGE_IMAGE and CHECK_STORAGE_INIT_IMAGE environment variables as described above. If a custom image requires the use of environment variables, they can be passed down into the custom container by setting the environment variable ADDITIONAL_ENV_VARS to a string of comma-separated values ("X=foo,Y=bar").
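For instance, passing two variables down into a custom container could look like this (the names X and Y are purely illustrative):

```yaml
env:
  - name: ADDITIONAL_ENV_VARS
    value: "X=foo,Y=bar"   # each key=value pair becomes an env var in the check containers
```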

A successful run implies that a PVC was successfully created and Bound, a Storage init Job was able to use the PVC and correctly initialize it with known data, and all schedulable nodes were able to run the check Job with the mounted PVC and validate the contents. A failure implies that an error or timeout occurred anywhere in the PVC request, Init Job creation, Check Job creation, validation of known data, or tear down process -- resulting in an error report to the Kuberhealthy status page.

Storage Check Diagram

Animated Gif generated by: Tall Tweets

Check Steps

This check follows the list of actions in order during the run of the check:

  1. Looks for old storage check job, storage init job, and PVC belonging to this check and cleans them up.
  2. Creates a PVC in the namespace and waits for the PVC to be ready.
  3. Creates a storage init configuration, applies it to the namespace, and waits for the storage init job to come up and initialize the PVC with known data.
  4. Determines which nodes in the cluster will run the storage check, either by auto-discovery or from a list of nodes supplied via the CHECK_STORAGE_IGNORED_CHECK_NODES and CHECK_STORAGE_ALLOWED_CHECK_NODES environment variables. Nodes with taints are not included unless a matching toleration is configured in CHECK_TOLERATIONS.
  5. For each node that needs a check, creates a storage check configuration, applies it to the namespace, and waits for the storage check job to start and validate the contents of storage on each desired node.
  6. Tears everything down once completed.

Check Details

  • Namespace: kuberhealthy
  • Timeout: 15 minutes
  • Check Interval: 30 minutes
  • Check name: storage-check
  • Configurable check environment variables:
    • CHECK_STORAGE_IMAGE: Storage check container image. (default=alpine:3.11)
    • CHECK_STORAGE_INIT_IMAGE: Storage initialization container image. (default=alpine:3.11)
    • CHECK_NAMESPACE: Namespace for the check (default=kuberhealthy).
    • CHECK_STORAGE_ALLOWED_CHECK_NODES: The explicit list of nodes to check (default=auto-discover worker nodes)
    • CHECK_STORAGE_IGNORED_CHECK_NODES: The list of nodes to ignore for the check (default=empty)
    • CHECK_STORAGE_INIT_COMMAND_ARGS: The arguments to the storage check data initialization (default=echo storage-check-ok > /data/index.html && ls -la /data && cat /data/index.html)
    • CHECK_STORAGE_COMMAND_ARGS: The arguments to the storage check command (default=ls -la /data && cat /data/index.html && cat /data/index.html | grep storage-check-ok)
    • CHECK_POD_CPU_REQUEST: Check pod deployment CPU request value. Calculated in decimal SI units (15 = 15m cpu).
    • CHECK_POD_CPU_LIMIT: Check pod deployment CPU limit value. Calculated in decimal SI units (75 = 75m cpu).
    • CHECK_POD_MEM_REQUEST: Check pod deployment memory request value. Calculated in binary SI units (20 * 1024^2 = 20Mi memory).
    • CHECK_POD_MEM_LIMIT: Check pod deployment memory limit value. Calculated in binary SI units (75 * 1024^2 = 75Mi memory).
    • CHECK_TOLERATIONS: Check pod tolerations of node taints. In the format "key=value:effect,key=value:effect". By default no taints are tolerated.
    • ADDITIONAL_ENV_VARS: Comma separated list of key=value variables passed into the pod's containers.
    • SHUTDOWN_GRACE_PERIOD: Amount of time in seconds the shutdown will allow itself to clean up after an interrupt signal (default=30s).
    • DEBUG: Verbose debug logging.
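A sketch combining a toleration with custom resource values, following the raw millicore/byte formats the list above describes (the taint key/value and numbers are illustrative):

```yaml
env:
  - name: CHECK_TOLERATIONS
    value: "dedicated=storage:NoSchedule"   # illustrative taint; format is key=value:effect
  - name: CHECK_POD_CPU_REQUEST
    value: "15"                             # decimal SI: 15 = 15m CPU
  - name: CHECK_POD_MEM_REQUEST
    value: "20971520"                       # binary SI: 20 * 1024^2 = 20Mi
```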

Example KuberhealthyStorageCheck Spec

The following configuration will create a storage check for all non-master nodes except node4 using the VMware vsan-default storage class:

apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: storage-check
  namespace: kuberhealthy
spec:
  runInterval: 5m
  timeout: 10m
  podSpec:
    containers:
    - env: 
        - name: CHECK_STORAGE_NAME
          value: "mysuperfuntime-pv-claim"
        - name: CHECK_STORAGE_PVC_STORAGE_CLASS_NAME
          value: "vsan-default"
        - name: CHECK_STORAGE_IGNORED_CHECK_NODES
          value: "node4"
      image: chrishirsch/kuberhealthy-storage-check:v0.0.2
      imagePullPolicy: IfNotPresent
      name: main
      resources:
        requests:
          cpu: 10m
          memory: 50Mi
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
    restartPolicy: Never
    serviceAccountName: storage-sa
    securityContext:
      runAsUser: 999
      fsGroup: 999
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: storage-sa
  namespace: kuberhealthy
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: storage-role
  namespace: kuberhealthy
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - persistentvolumeclaims
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - "batch"
      - "extensions"
    resources:
      - jobs
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kuberhealthy-storage-cr
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kuberhealthy-storage-crb
roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: kuberhealthy-storage-cr
subjects:
  - kind: ServiceAccount
    name: storage-sa
    namespace: kuberhealthy
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: storage-rb
  namespace: kuberhealthy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: storage-role
subjects:
  - kind: ServiceAccount
    name: storage-sa
    namespace: kuberhealthy

Install

To use the Storage Check with Kuberhealthy, apply the configuration file storage-check.yaml to your Kubernetes cluster. The following command applies the configuration file to your current context:

kubectl apply -f https://raw.githubusercontent.com/ChrisHirsch/kuberhealthy-storage-check/master/deploy/storage-check.yaml

Make sure you are using Kuberhealthy 2.0.0 or later.

The check configuration file contains:

  • KuberhealthyCheck
  • Role
  • Rolebinding
  • ClusterRole
  • ClusterRoleBinding
  • ServiceAccount

The Role, RoleBinding, ClusterRole, ClusterRoleBinding, and ServiceAccount are all required so the check can create and delete PVCs and Jobs in the namespace it is installed in. The assumed default service account does not provide enough permissions for this check to run.


kuberhealthy-storage-check's People

Contributors

chrishirsch, dependabot[bot], ls80, thechristschn


kuberhealthy-storage-check's Issues

Changing CHECK_STORAGE_IMAGE and CHECK_STORAGE_INIT_IMAGE has no effect

I want to have the storage check use our local docker repo (configured as pull through cache for docker hub) instead of pulling from docker hub directly. As per the instructions in the readme I have specified the CHECK_STORAGE_IMAGE and CHECK_STORAGE_INIT_IMAGE environment variables to point to our local repo:

apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: storage-check
  namespace: {{ $.Release.Namespace }}
spec:
  runInterval: 1m
  timeout: 3m
  podSpec:
    containers:
      #updated to use our local artifactory
    - image: local.docker.com/chrishirsch/kuberhealthy-storage-check:v0.0.1
      imagePullPolicy: IfNotPresent
      name: main
      resources:
        requests:
          cpu: 10m
          memory: 50Mi
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
      env:
        - name: CHECK_STORAGE_NAME
          value: rook-ceph-block
        - name: CHECK_STORAGE_PVC_STORAGE_CLASS_NAME
          value: rook-ceph-block
        - name: CHECK_NAMESPACE
          value: {{ $.Release.Namespace }}
        - name: CHECK_STORAGE_IMAGE
          value: local.docker.com/alpine:3.11
        - name: CHECK_STORAGE_INIT_IMAGE
          value: local.docker.com/alpine:3.11
    restartPolicy: Never
    serviceAccountName: storage-sa
    securityContext:
      runAsUser: 999
      fsGroup: 999

But when I look in kubectl I can see it's still pulling "alpine:3.11", not "local.docker.com/alpine:3.11":

> kubectl describe pods rook-ceph-block-check-job-gfhr8
Name:         rook-ceph-block-check-job-gfhr8
...
Containers:
  rook-ceph-block-check-job:
    Container ID:  docker://31930b240aa748805666de729c1def47f2504d46f632f637df49984c3bde457d
    Image:         alpine:3.11
...

storage-check-pvc-init-job constantly failing

Hello,
I'd like to ask about the storage check job. I'm only getting /bin/sh: 1: cannot create /data/index.html: Permission denied from storage-check-pvc-init-job. I tried enabling allowPrivilegeEscalation and it didn't help. Normally we have no problem writing to PVs. I'm thinking about securityContext.readOnlyRootFilesystem, but this switch is quite dangerous for production as it is a global switch.
Is it possible that this test is not compatible with the old StorageClass we are still using?

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: 'true'
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
reclaimPolicy: Delete
volumeBindingMode: Immediate

What am I missing?

EDIT: nope, not working even with securityContext.readOnlyRootFilesystem=false

Init job runs as root

The init job container runs as root and there is no variable to change this. This causes errors on Kubernetes clusters that do not allow pods to run as root.

wrong default imagePullPolicy

The default imagePullPolicy is incorrect, which will result in a very high number of pulls from Docker Hub (which is now rate limited). The good news is that the readme is correct ;-)

nodeSelector for all pods

Hi,

Using EBS volumes in a multi-AZ cluster causes problems when pods are not scheduled in the same zone as the volume.

I tried to get around this by setting up different storageclasses with a topology value limiting it to a known AZ based on the name i.e.
kh-check-ew2 = topology.ebs.csi.aws.com/zone=eu-west-2a

Then I put a nodeSelector in the podSpec of the khcheck object based on the same for the nodes.

The main pod has the nodeSelector, but the init/check pods don't pick up that value and therefore land in different AZs and cannot mount the volume.

Is there any way to get them to use the nodeSelector labels?

Many thanks.

accommodate zone bound PV

Hi,

We use kubernetes.io/aws-ebs as the provisioner with dynamic provisioning. When a PVC is created, a PV is provisioned in a random zone, and we then need to schedule the pod in the same zone as that PV. In the regular workflow this synthetic check should cover, we provision a pod using the PVC with a matchLabel of failure-domain.beta.kubernetes.io/zone=<zone>

We would rather not specify node names with CHECK_STORAGE_ALLOWED_CHECK_NODES or CHECK_STORAGE_IGNORED_CHECK_NODES because node names are dynamic.

one suggestion: instead of static node name,

  • we should instead use the label failure-domain.beta.kubernetes.io/zone

Also, how can we provision only one pod bound to the PVC, instead of running the check on all selected nodes?

Changing CHECK_STORAGE_INIT_COMMAND_ARGS or CHECK_STORAGE_COMMAND_ARGS does not have any effect

Trying to invoke the check with different command args does not work due to the following problem:

diff --git a/cmd/storage-check/storage.go b/cmd/storage-check/storage.go
index d42aefe..3f0955d 100644
--- a/cmd/storage-check/storage.go
+++ b/cmd/storage-check/storage.go
@@ -113,7 +113,7 @@ func initializeStorageConfig(jobName string, pvcName string) *batchv1.Job {
 
        // Make a Pod spec.
        var command = []string{defaultCheckStorageInitCommand}
-       var args = []string{"-c", defaultCheckStorageInitCommandArgs}
+       var args = []string{"-c", checkStorageInitCommandArgs}
        pvc := &corev1.PersistentVolumeClaimVolumeSource{
                ClaimName: pvcName,
        }
@@ -172,7 +172,7 @@ func checkNodeConfig(jobName string, pvcName string, node string) *batchv1.Job {
 
        // Make a Job spec.
        var command = []string{defaultCheckStorageCommand}
-       var args = []string{"-c", defaultCheckStorageCommandArgs}
+       var args = []string{"-c", checkStorageCommandArgs}
        pvc := &corev1.PersistentVolumeClaimVolumeSource{
                ClaimName: pvcName,
        }

Allow VolumeBindingMode=WaitForFirstConsumer on StorageClass

Currently, the main storage check pod waits for the PVC to become Bound. When the selected StorageClass uses WaitForFirstConsumer as its VolumeBindingMode, the check fails with a timeout because the PVC remains in a Pending state.

Can we skip this wait when the StorageClass binding mode is WaitForFirstConsumer?
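For reference, a StorageClass that triggers this behavior looks roughly like the sketch below (the name and provisioner are examples, not from this report):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: wait-for-consumer-example   # hypothetical name
provisioner: ebs.csi.aws.com        # example CSI provisioner
volumeBindingMode: WaitForFirstConsumer   # PVC stays Pending until a pod consumes it
```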
