k8s-failure-reproduction

Directory tree

TBD

Summary of failure cases

Case ID	Keyword	Description	Expected runtime for reproduction
C1	CPU change	Pods consumed high CPU during bootstrapping, leading the HPA to scale up rapidly to max replicas.	540s
C2	Replica field	Not enough replicas because users applied an updated YAML file without defining the number of replicas (1 by default).	160s
C3	PodSpreadConstraint	Configurations of two SPTS constraints caused the 6th pod to fail to be scheduled.	246s
C5	Scheduler + Descheduler	Conflicting configurations of the scheduler and the RemoveDuplicate policy in the descheduler.	150s
C6	Node maintenance	Pods unbalanced after maintenance. Node failures then caused the pod count to drop too low.	115s
C7	Infinite taint loop	Conflicting configurations of node taint and pod NodeName caused a scheduling and eviction loop.	64s
C8	Scheduler + Descheduler	Conflicting descheduler and scheduler configurations caused a scheduling and eviction loop.	210s

How to reproduce

Create a K8S cluster of the right size. The easiest way is to use a kind cluster.
Each failure case directory has a kubectl_command.sh shell script. All you have to do is run it.
Many K8S logs will be stored in the failure case subdirectory. To evaluate the fidelity of Kivi, we use <CURRENT_TIME>-describe_all.log.txt, which later serves as the input file to the Kivi parser.
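
After a run, the failure case subdirectory may contain several describe_all logs from different runs. The following is a minimal sketch of how the newest one could be located before handing it to the Kivi parser. The function name `latest_describe_all_log` is hypothetical, and the sketch selects by file modification time because the exact format of the <CURRENT_TIME> prefix is not specified here.

```python
from pathlib import Path
from typing import Optional


def latest_describe_all_log(case_dir: str) -> Optional[Path]:
    """Return the most recently modified *-describe_all.log.txt in case_dir, or None."""
    candidates = list(Path(case_dir).glob("*-describe_all.log.txt"))
    if not candidates:
        return None
    # Select by modification time; the <CURRENT_TIME> prefix format is not assumed.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Calling it with the path of a failure case directory returns the path of the newest log, or `None` if the script has not produced one yet.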

General flow of `kubectl_command.sh`

The script includes a detailed logging mechanism that records each command's execution time and outcome.

Environment Setup
- Loads a custom time function script for logging timestamps.
Logging Initialization
- Creates a CSV log file to record the sequence of commands and their execution details.
- Starts a Python script (logging_start.py) that launches the background log collection.
Application Deployment
- Deploys a Kubernetes application using a YAML file (deploy_h1-app.yaml).
Additional controller deployment (e.g., deploying autoscaler)
- Applies an autoscaler configuration using another YAML file (autoscaler.yaml).
Monitoring Period
- Includes a long enough sleep duration (e.g., 500 seconds) to observe the events in K8S.
Stop Initial Logging
- Terminates the logging Python process started earlier.
Finishing logging and Cleanup K8S setup
- Runs another Python script (logging_end.py) to conclude the logging session.
- Cleans up by deleting the Kubernetes deployment and the Horizontal Pod Autoscaler (HPA) configuration.

Each step is recorded in the CSV log file with timestamps for both start and end times, and the commands are executed asynchronously or with deliberate pauses (sleep). The logging output is exhaustive; you do not need to look at all of it.
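
The per-step CSV logging described above can be illustrated with a short sketch. It is written in Python for illustration only; the actual kubectl_command.sh is a shell script, and the column names and the helpers `run_step` and `run_steps` are hypothetical.

```python
import csv
import subprocess
import time


def run_step(writer, name, cmd):
    """Run cmd to completion and append one CSV row: step, start, end, return code."""
    start = time.time()
    result = subprocess.run(cmd)
    end = time.time()
    writer.writerow([name, f"{start:.3f}", f"{end:.3f}", result.returncode])
    return result.returncode


def run_steps(steps, log_path):
    """Run (name, command) pairs in order, logging each one to a CSV file."""
    with open(log_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["step", "start", "end", "returncode"])  # assumed header
        for name, cmd in steps:
            run_step(writer, name, cmd)
```

A deployment step could then be passed as `("deploy_app", ["kubectl", "apply", "-f", "deploy_h1-app.yaml"])`, followed by a sleep step for the monitoring period and the cleanup commands.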

What `logging_start.py` does

Import Necessary Libraries
- The script uses time for handling delays and subprocess for executing shell commands.
Function Definition
- print_exec_log: A helper function that prints a formatted log message whenever a command starts. It logs the name of the command and its start time.
Log and Command Paths Setup
- Defines paths to various scripts responsible for clearing logs and capturing system states. These scripts are organized under a main directory specified by log_dir.
Clear Logs
- Executes scripts to clear previous logs from Kubernetes nodes (clear_kubelet_worker_log.sh) and the control plane (clear_scheduler_log.sh). These ensure that only relevant, fresh logs are collected for the session.
Record Start Time
- A script (record_start_time.sh) is prepared but not executed immediately; it is intended to timestamp the start of the logging process.
Log Collection Commands Setup
- Scripts are set up to collect various types of data:
  - Pod, node, deployment, replicaset, and HPA (Horizontal Pod Autoscaler) details.
  - Descriptions of the cluster’s overall state, individual nodes, pods, deployments, and HPAs.
  - Resource usage statistics for pods and nodes.
- These scripts are stored in the log_dir and include operations like getpod, getnode, gethpa, etc.
Execution of Clearing Logs
- The script begins by clearing logs on specified nodes (kind-control-plane, kind-worker, etc.) using the clear_kubelet_worker_log.sh script for each node.
Parallel Command Execution
- Executes the log clearing for the scheduler and then runs all the collection commands in parallel using Python’s subprocess.Popen. Each command execution is logged using print_exec_log.
Process Monitoring
- After starting each subprocess, the script prints the process ID for monitoring purposes. This can be useful for debugging if any subprocess does not perform as expected.
Wait Period
- Ends with a sleep of 1200 seconds (20 minutes), representing the maximum experiment duration. This period allows enough time for all subprocesses to gather data before the script completes.

All asynchronous logging running in the background is terminated by logging/logging_end.py.
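
The parallel launch and later termination of the collection commands can be sketched as follows. This is an illustration under assumptions, not the repository's code: only `print_exec_log` is named in the description above, while `start_collectors` and `stop_collectors` are hypothetical. The sketch also stops the processes from the same Python process, whereas the repository does this from a separate script (logging_end.py).

```python
import subprocess
import time


def print_exec_log(name):
    """Print a formatted message with the command name and its start time."""
    print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] start: {name}", flush=True)


def start_collectors(commands):
    """Launch every command in parallel and return the Popen handles keyed by name."""
    procs = {}
    for name, cmd in commands.items():
        print_exec_log(name)
        proc = subprocess.Popen(cmd)
        # Print the process ID so a misbehaving collector can be inspected.
        print(f"{name} pid={proc.pid}", flush=True)
        procs[name] = proc
    return procs


def stop_collectors(procs, timeout=5.0):
    """Terminate collectors that are still running and wait for them to exit."""
    for proc in procs.values():
        if proc.poll() is None:
            proc.terminate()
    for proc in procs.values():
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Fall back to a hard kill if the process ignores termination.
            proc.kill()
            proc.wait()
```

Here `commands` maps a label such as `getpod` to the argument list of the corresponding collection script.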

How to create KinD cluster

With kind, you can create a K8S cluster on a single host machine. Each node runs as a Docker container.

To test Kivi's minimum scale in Table 2, you can create a cluster with the corresponding number of nodes, or you can drain a node from the cluster when it is not required.

For example, to create a new cluster with 1, 2, or 3 nodes, you can use the following YAML configs in the cluster directory: one_node_cluster.yaml, two_node_cluster.yaml, three_node_cluster.yaml

For more information, please refer to the official tutorial: https://kind.sigs.k8s.io/docs/user/quick-start/
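
For cluster sizes without a ready-made config, a kind config can be generated. The sketch below renders the standard kind multi-node layout (one control-plane node plus a number of workers). The function name `render_kind_config` is hypothetical, and the contents of the repository's own YAML files may differ, including whether their node count includes the control-plane node.

```python
def render_kind_config(workers):
    """Return a kind config with one control-plane node and `workers` worker nodes."""
    if workers < 0:
        raise ValueError("workers must be non-negative")
    lines = [
        "kind: Cluster",
        "apiVersion: kind.x-k8s.io/v1alpha4",
        "nodes:",
        "- role: control-plane",
    ]
    # One list entry per worker node (assumption: workers are added on top
    # of the single control-plane node).
    lines += ["- role: worker"] * workers
    return "\n".join(lines) + "\n"
```

The returned text can be written to a file and passed to `kind create cluster --config <file>`.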

gangmuk / k8s-failure-reproduction