sieve-project / sieve Goto Github PK

View Code? Open in Web Editor NEW

319.0 11.0 20.0 44.37 MB

Automatic Reliability Testing for Kubernetes Controllers and Operators

License: BSD 2-Clause "Simplified" License

Python 59.46% Shell 4.01% Go 36.53%

kubernetes kubernetes-operator operator software-reliability system-reliability

sieve's Introduction

Sieve: Automated Reliability Testing for Kubernetes Controllers/Operators

Overview

The Kubernetes ecosystem has thousands of controller implementations for different applications and platform capabilities. A controller’s correctness is critical as it manages the application's deployment, scaling and configurations. However, a controller's correctness can be compromised by myriad factors, such as asynchrony, unexpected failures, networking issues, and controller restarts. This in turn can lead to severe safety and liveness violations.

Sieve is a tool to help developers test their controllers by deterministically injecting faults and detecting dormant bugs at development time. Sieve does not require the developers to modify the controller and can reliably reproduce the bugs it finds.

To use Sieve, developers need to port their controllers and provide end-to-end test cases (see Getting started for more information). Sieve will automatically instrument the controller by intercepting the event handlers in client-go and controller-runtime. Sieve runs in two stages. In the learning stage, Sieve will run a test case and identify promising points in an execution to inject faults. It does so by analyzing the sequence of events traced by the instrumented controller. The learning produces test plans that are then executed in the testing stage. A test plan tells Sieve of the type of fault and the point in the execution to inject the fault.

The high-level architecture is shown as below.

Note that Sieve is an early stage prototype. The tool might not be user-friendly enough due to potential bugs and lack of documentation. We are working hard to address these issues and add new features. Hopefully we will release Sieve as a production-quality software in the near future.

We welcome any users who want to test their controllers using Sieve and we are more than happy to help you port and test your controllers.

Testing approaches

Approach	Description
Intermediate-state Pattern	Intermediate-state Pattern restarts the controller in the middle of its reconcile loop. After restart, the controller will see a partially updated cluster state (i.e., an intermediate state). If the controller fails to recover from the intermediate state, Sieve recognizes it as a bug.
Unobserved-state Pattern	Unobserved-state pattern manipulates the interleaving between the informer goroutines and the reconciler goroutines in a controller to make the controller miss some particular events received from the apiserver. As controllers are supposed to be fully level-triggered, failing to achieve the desired final state after missing the event indicates a bug.
Stale-state Pattern	Stale-state pattern aims to find bugs in High-Availability clusters where multiple apiservers are running. It redirects a controller to a relatively stale apiserver. Sieve reports a bug if the controller misbehaves after reading stale cluster state.

Pre-requisites for use

Docker daemon must be running (please ensure you can run docker commands without sudo)
A docker repo that you have write access to
python3 installed
go (preferably 1.19.1) installed and $GOPATH set
kind installed and $KUBECONFIG set (Sieve runs tests in a kind cluster)
kubectl installed
python3 installed and dependency packages installed: run pip3 install -r requirements.txt

You can run python3 check_env.py to check whether your environment meets the requirement.

Getting started

Users need to port the controller before testing it with Sieve. Basically, users need to provide the steps to build and deploy the controller and necessary configuration files (e.g., CRD yaml files). We list the detailed porting steps here. We are actively working on simplify the porting process.

Bugs found by Sieve

Sieve has found 46 bugs in 10 different controllers, which are listed here. We also provide steps to reproduce all the intermediate-state/unobserved-states/stale-state bugs found by Sieve. We would appreciate a lot if you mention Sieve and inform us when you report bugs found by Sieve.

Contributing

We welcome all feedback and contributions. Please use Github issues for user questions and bug reports.

Learn more

You can learn more about Sieve from the following references:

Talks:

OSDI 2022 (18 minutes)
KubeCon 2021 (27 minutes)
HotOS 2021 (10 minutes)

Research papers:

Automatic Reliability Testing for Cluster Management Controllers
Xudong Sun, Wenqing Luo, Jiawei Tyler Gu, Aishwarya Ganesan, Ramnatthan Alagappan, Michael Gasch, Lalith Suresh, and Tianyin Xu. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI'22), Carlsbad, CA, USA, Jul. 2022.
Reasoning about modern datacenter infrastructures using partial histories
Xudong Sun, Lalith Suresh, Aishwarya Ganesan, Ramnatthan Alagappan, Michael Gasch, Lilia Tang, and Tianyin Xu. In Proceedings of the 18th Workshop on Hot Topics in Operating Systems (HotOS-XVIII), Virtual Event, May 2021.

Others:

Artifact evaluation

If you are looking for how to reproduce the evaluation results in the paper Automatic Reliability Testing for Cluster Management Controllers, please follow the instructions here.

sieve's People

Contributors

Stargazers

Watchers

sieve's Issues

Experience when working on a fix for time-travel bug in xtradb operator

When I was working on to fix K8SPXC-725 and K8SPXC-763 , I have met the following two issues:

Firstly, sieve saves a snapshot for all the crd config files for a certain operator, and those config files can be easily divergent from the upstream. For example, when I tried to run my fixed xtradb operator again for the same workload, since my fix is checkouted from the main branch, which is the latest version of the xtradb operator, but the crd config files inside sieve is outdated for a previous version, and I need to spend some efforts to make those crd file stored in sieve sync with the upstream so that the xtradb operators can be successfully set up with the matched crd configs.

And this issues will also occur if developers want to replay bugs for their own version operators.

Secondly, also for xtradb, they have some tricky logic to specify an init image (image for some init container) for the xtradbcluster. Basically, if the version specified for xtradbcluster is same as the operator, the init image will be same as the operator image, e.g. xxx/xtradb-operator:time-travel

However, if the version for xtradbcluster (e.g. 1.7) is different from the operator, the init image will be xxx/xtradb-operator:1.7

And it is obvious that we do not build such image as xxx/xtradb-operator:1.7.

In that case, we also need to update all the workload’s config files (which specifies the version of xtradbcluster) to catch up with the version of the operator, so that the init image will be assigned same as the operator, otherwise, we may get some image pulling 404 error.

What creates mask.json?

I've been trying to port a controller to test it using Sieve. I've stumbled across a variety of minor issues, but have thus far managed to overcome them, so I won't talk about them here. Now, however, I'm at an issue that I'm not sure how to resolve. I've been following https://github.com/sieve-project/sieve/blob/main/docs/port.md and am currently at

First run Sieve learning stage

python3 sieve.py -p your-controller -t your-test-case-name -s learn -m learn-twice

Sieve appears to properly deploy my controller and execute my test case. Once the test case finishes executing, however, Sieve fails because it cannot find a mask.json file:

wait for final grace period 50 seconds
Generating controller family list...
Generating state update summary...
Generating end state...
Sanity checking the sieve log log/appian-operator/recreate/learn/learn-once/learn.yaml/sieve-server.log...
[FAIL] cannot find mask.json

The error appears to be stemming from here. What creates this file? Is it possible that an earlier step created it but I've since deleted it? Is the file meant to be created manually? I see no reference to it in the docs and it appears that the file isn't manually created based on its appearance in other examples.

Thanks!

PS: I would share the code, but the operator I'm porting is (currently) closed source. Please let me know what information might be useful and I'll try to share if I can!

Are there any plans to support Node images above v1.18.9?

Hi!
It appears that only v1.18.9 is currently available for kind node images.

https://github.com/sieve-project/sieve/pkgs/container/action%2Fnode

However, the controller using the latest features (SSA, etc.) does not work well with v1.18.9. Are there any plans to support node images with versions higher than v1.18.9?

Thanks!

Cannot set up kind cluster during CI

As mentioned in #114
Need to investigate it later

Explore using API server tracing for additional context

Now that kubernetes/kubernetes#94942 has landed, we might want to think about ways to use such tracing functionality to assist users in replicating/debugging time-travel issues, e.g. by injecting custom trace data, visualizing flows, etc.

Just thinking out loud here, haven‘t spend much time thinking deeper about the applicability of tracing in the various areas of sieve (beyond detection algo).

Multi-controller Support for Running Sieve

⭐ Following up from NA KubeCon 2021 ⭐
Per my discussion with Lalith Suresh and Xudong Sun

Is there a way to test race conditions between multiple controllers running in the same Kubernetes cluster simultaneously with Sieve? Testing each controller independently may work as an alternative approach but it can become impractical given the number of controllers that may be running in a given Kubernetes cluster and the desired number of test workloads to be tested.

For example, testing the HPA controller, VPA controller, and Cluster Autoscaling controller along with a custom controller/operator requires testing each component separately for each test workload you'd like to check.

Cluster creation fails while running Sieve with kapp-controller

I am hitting issues when trying to run Sieve with kapp-controller.

I am able to build the controller image successfully:

$ python3 build.py -c examples/kapp-controller -m all
...

Succeeded
kapp-controller-sha256-47c5a7b5df0fc9142e825b6ce5d767760db91b7d381bd0c2ce4b7fc05256c8ee
Untagged: kbld:kapp-controller-sha256-47c5a7b5df0fc9142e825b6ce5d767760db91b7d381bd0c2ce4b7fc05256c8ee

But running Sieve with kapp-controller in learn mode fails:

$ python3 sieve.py -c examples/kapp-controller -w create -m learn --build-oracle
...
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I0309 00:17:00.337861     217 initconfiguration.go:255] loading configuration from "/kind/kubeadm.conf"
...

[FAIL] kind create cluster --image ghcr.io/sieve-project/action/node:v1.24.10-learn --config kind_configs/kind-1a-2w.yaml
Traceback (most recent call last):
  File "/Users/jshajigeorge/work/sieve/sieve.py", line 264, in setup_kind_cluster
    os_system(
  File "/Users/jshajigeorge/work/sieve/sieve_common/common.py", line 181, in os_system
    raise Exception(
Exception: Failed to execute kind create cluster --image ghcr.io/sieve-project/action/node:v1.24.10-learn --config kind_configs/kind-1a-2w.yaml with return code 1

(full logs attached in kapp-learn.err.txt)

See kubelet-log.txt for the logs exported by kind (kind export logs).

I'm trying this on a Mac

$ sw_vers
ProductName:		macOS
ProductVersion:		13.0.1
BuildVersion:		22A400

Questions related to how to contribute this project

Hi, you mentioned supporting e2e testing framework like https://github.com/kubernetes-sigs/e2e-framework to make writing tests easier in the wish list. We are considering to make some contribution on it but to be honest we don't have much previous experence about this.

Could you provide some suggestions how to do it or what are the difficulties to do it?

In addition, we are looking for some short-term and feasible workload to contribute this project, so if it is possible that you could give us some insights, we would be really happy to hear them!

Wish list: make Sieve more general and ergonomic

Ergonomics

There is a manual porting process to onboard a project to use Sieve. We hope to automate this step by integrating Sieve with existing frameworks like operator-sdk so that users can build controllers that are Sieve-friendly from the very beginning.

Automatically instrument the controller deployment manifest file
Support e2e testing framework like https://github.com/kubernetes-sigs/e2e-framework to make writing tests easier
Integrate with https://github.com/operator-framework/operator-sdk
Better ways to browse test results

Generality

Sieve currently places the instrumentation in controller-runtime to intercept the interactions between the controller and the Kubernetes API. That said, we also need to support controllers that are not built on top of controller-runtime but directly rely on client-go.

For generality, move instrumentation of the controller read/write APIs to client-go
#85
#96
Allow users to specify the boundary of reconcile functions
Instrument the user-specified reconcile function automatically

Miscellaneous

Run Sieve server (i.e., test coordinator and trace collector) as a pod in the Kind cluster

Error while creating cluster with ghcr.io/sieve-project/action/node:v1.18.9-test image

Thanks for documenting the process to run sieve to reproduce the stale-state bug on the RabbitMQ operator.

I've been following the steps and while running the command
python3 sieve.py -c rabbitmq-operator -m test -w recreate -p bug_reproduction_test_plans/rabbitmq-operator-stale-state-1.yaml

I am facing and issue while creating the Kubernetes cluster with the ghcr.io/sieve-project/action/node:v1.18.9-test image.
I am getting the following error:

ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1

And successive attempts failing to connect to kind-external-load-balancer

I0928 21:37:50.729870     139 round_trippers.go:443] GET https://kind-external-load-balancer:6443/healthz?timeout=10s  in 1 milliseconds
I0928 21:37:50.730110     139 request.go:907] Got a Retry-After 1s response for attempt 1 to https://kind-external-load-balancer:6443/healthz?timeout=10s
I0928 21:37:51.731382     139 round_trippers.go:443] GET https://kind-external-load-balancer:6443/healthz?timeout=10s  in 1 milliseconds
I0928 21:37:51.731424     139 request.go:907] Got a Retry-After 1s response for attempt 2 to https://kind-external-load-balancer:6443/healthz?timeout=10s
I0928 21:37:52.732674     139 round_trippers.go:443] GET https://kind-external-load-balancer:6443/healthz?timeout=10s  in 1 milliseconds

Would really appreciate any help regarding this issue. (I am running kind on Ubuntu x86_64)

Add more information about required platform to run Sieve

The README only states that kind is required but it would be good to explicitly state that:

Kubernetes (incl. min_version) is currently required to use/run Sieve
etcd as the backing store is assumed

The second bullet is important IMHO as per my understanding we rely on the monotonicity and total ordering guarantees of etcd (revisions/resourceVersions) for the correctness of the checker. K8s (and derivatives like k3s) allow for custom stores (backends) which might have different semantics or not provide linearizability guarantees.

Please correct me if my understanding of the required behavior of the underlying K8s store to reliably detect bugs with Sieve is correct. If not, bullet two might be obsolete.

Hitting PermissionError when trying to reproduce bug rabbitmq-cluster-operator-782 with Sieve

Hi,

Following the steps in reprod.md to reproduce intermediate-state bug 1 found in rabbitmq-operator (rabbitmq-cluster-operator-782), I am hitting a PermissionError:

$ python3 reproduce_bugs.py -c rabbitmq-operator -b intermediate-state-1
...

wait for final grace period 80 seconds
Traceback (most recent call last):
  File "/Users/jshajigeorge/work/sieve/sieve.py", line 657, in run_test
    run_workload(test_context)
  File "/Users/jshajigeorge/work/sieve/sieve.py", line 550, in run_workload
    os.killpg(streaming.pid, signal.SIGTERM)
PermissionError: [Errno 1] Operation not permitted

Total time: 254.97307181358337 seconds
Please refer to sieve_test_results/rabbitmq-operator-resize-pvc-rabbitmq-operator-intermediate-state-1.yaml.json for more detailed information

Full logs attached in sieve-rabbitmq-782-EPERM.txt

This was run on a Mac OS machine. Version details attached in sw_vers.txt.

All of the bugs are not reproduced (i.e., the reproduced column is False)

Hi! I tried to reproduce the full evaluation of paper, but I got 'False' results for all bugs in ‘reproduced’ column. I checked the test results file and the "exception_message" is like this:

"exception_message": "Traceback (most recent call last):\n File "/home/sieve/.local/lib/python3.8/site-packages/docker/api/client.py", line 214, in _retrieve_server_version\n return self.version(api_version=False)["ApiVersion"]\n File "/home/sieve/.local/lib/python3.8/site-packages/docker/api/daemon.py", line 181, in version\n return self._result(self._get(url), json=True)\n File "/home/sieve/.local/lib/python3.8/site-packages/docker/utils/decorators.py", line 46, in inner\n return f(self, *args, **kwargs)\n File "/home/sieve/.local/lib/python3.8/site-packages/docker/api/client.py", line 237, in _get\n return self.get(url, **self._set_request_timeout(kwargs))\n File "/home/sieve/.local/lib/python3.8/site-packages/requests/sessions.py", line 600, in get\n return self.request("GET", url, **kwargs)\n File "/home/sieve/.local/lib/python3.8/site-packages/requests/sessions.py", line 587, in request\n resp = self.send(prep, **send_kwargs)\n File "/home/sieve/.local/lib/python3.8/site-packages/requests/sessions.py", line 701, in send\n r = adapter.send(request, **kwargs)\n File "/home/sieve/.local/lib/python3.8/site-packages/requests/adapters.py", line 486, in send\n resp = conn.urlopen(\n File "/home/sieve/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 790, in urlopen\n response = self._make_request(\n File "/home/sieve/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 496, in _make_request\n conn.request(\nTypeError: request() got an unexpected keyword argument 'chunked'\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "sieve.py", line 557, in run_test\n setup_cluster(test_context)\n File "sieve.py", line 307, in setup_cluster\n redirect_kubectl()\n File "sieve.py", line 161, in redirect_kubectl\n client = docker.from_env()\n File "/home/sieve/.local/lib/python3.8/site-packages/docker/client.py", line 96, in from_env\n return cl

Do you have any idea why this happened? I don't understand what does "request() got an unexpected keyword argument 'chunked'" means and it seems not because kind cluster accidentally crashes as you mentioned in README.md.

Thanks in advance for any help!

Can sieve be used for performance testing?

First, very nice project, folks!

I'm wondering if I can leverage the tools here to do stress/load testing on controllers and to the correctness under heavy load/traffic? If not, is it something possible to do by modifying some of the code/adding few patches?

Broken link

https://github.com/sieve-project/sieve/blob/main/docs/port.md#test has a broken link to https://github.com/sieve-project/sieve/blob/main/controllers.py

I briefly looked for a file by that name and didnt see it; presumably a rename occured?

Randomness in resource names obstructs more fine-grained oracle implementation

We generate the cluster state history and final cluster state and build our general-purpose oracles (event-oracle and resource-oracle) on top of them. The general-purpose oracles simply compare the history (creation/deletion events) and the final state across learning/testing runs. They do not require any knowledge specific to a controller and can be applied to every controller.

Ideally, we want to compare every field (e.g., replica, status) of every resource (e.g., pod, statefulset). However, we find that in k8s cluster resources tend to have random names. We identified two types of random names so far:

the random name derived from generateName. This usually happens to pods belonging to a deployment. A deployment mydeployment can own a pod with a random name mydeployment-12ab5.
the random name w/o generateName. For example sometimes secret objects have fully random names.

These random names are causing trouble in comparing the state/history: for example, say we observe a secret named mysecret-12ab5 in learning run (but not in testing run) and a secret named mysecret-09zy8 in testing run (but not in learning run), we have no way to tell whether they two logically represent the same resource and should be compared. Simply comparing the two resources that are not logically the same one can lead to confusing false alarms.

Currently, when encountering the above situation, we have to mask the value of such resources and only leverage very limited information (say the number of secrets) in the oracle. We welcome any help that can eliminate the randomness in the resource name, which can help us build better oracles.

Safety check throughout testing procedure

⭐ Following up from NA KubeCon 2021 ⭐
Per my discussion with Lalith Suresh and Xudong Sun

Safety checking throughout the testing process would be invaluable for our controller, which enforces a set order of dependencies between the pods of a group of Deployments.

For example, we have 3 Deployments First, Second, and Third -- where the pods of Second rely on the pods of First being available and the pods of Third rely on the pods of both Second and First.

For simplification it's easiest to imagine that all three deployments have the same number of replicas e.g. 5; however, in reality we calculate this based on a ratio between the deployments*.

Consider the case where each Deployment is expected to have 5 replicas at the end of the roll and the dependency structure is as described above:

Time	First	Second	Third
0	0	0	0
1	1	0	0
2	2	1	0
3	3	1	1
4	4	2	2
5	4	3	3
6	5	3	3
7	5	4	3
8	5	5	4
9	5	5	5

We would like to be able to check that we are not violating this dependency tree while Deployments are becoming available.

For example First can have 7 replicas and Second can have 5 and Third can have 13. The ratio between First and Second would be for every pod of First we can have (1/7 * 5 = ~0.7 ) pods of Second, and similarly for the ratio between Second and Third and First and Third etc...

Events related to services are not detected from API server side

As mentioned in previous meeting, through batch analysis on mongodb operator, there are some cases that have matched for crucial event, but not detect side effect event on API server side (however, we observed side effect issued by the operator).
And all of those failed patterns have the crucial event of service, e.g.

se-name: mongodb-cluster-cfg
se-namespace: default
se-rtype: service
se-etype: ADDED

After dumping all the event keys from here, I observed that keys related to service is like /services/specs/default/mongodb-cluster-cfg or /services/endpoints/default/mongodb-cluster-cfg, and the event resource is parsed as spec / endpoints instead of service according to logic here.

After looking into some source code of k8s, I figure the reason behind this. The basic idea is that k8s will add some special prefix to certain resources as the key stored in etcd. For endpoints, the key is then transformed into services/endpoints, for service, the key is services/specs. As for our matching, we also need to consider for those special prefix.

The current fix is to manually map the services/endpoints to endpoints, and services/specs to services. After the fix, service related side effect event can be successfully detected and sieve can then detect bug related to service ADD / DELETE.

Assertion failed during sanity check of build_causality_graph

Backtrace:

Traceback (most recent call last):
  File "sieve.py", line 714, in <module>
    options.phase,
  File "sieve.py", line 498, in run
    phase,
  File "sieve.py", line 439, in run_test
    oracle_config,
  File "sieve.py", line 312, in check_result
    analyze.analyze_trace(project, log_dir, two_sided=two_sided)
  File "/home/tyler/sieve/analyze.py", line 372, in analyze_trace
    causality_graph = build_causality_graph(event_list, side_effect_list)
  File "/home/tyler/sieve/analyze.py", line 320, in build_causality_graph
    causality_graph.sanity_check()
  File "/home/tyler/sieve/analyze_util.py", line 508, in sanity_check
    > self.event_vertices[i - 1].content.id
AssertionError

How to reproduce

Run learning stage for workload scaleup-scaledown-tserver in yugabyte-operator
python3 sieve.py -p yugabyte-operator -d tylergu1998 -s learn -t scaleup-scaledown-tserver