sieve-project / sieve
Automatic Reliability Testing for Kubernetes Controllers and Operators
License: BSD 2-Clause "Simplified" License
We generate the cluster state history and final cluster state and build our general-purpose oracles (event-oracle and resource-oracle) on top of them. The general-purpose oracles simply compare the history (creation/deletion events) and the final state across learning/testing runs. They do not require any knowledge specific to a controller and can be applied to every controller.
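As a rough illustration of that comparison, here is a minimal sketch with simplified, hypothetical event and state representations (not Sieve's actual data structures):

```python
# Minimal sketch of the general-purpose oracles, assuming events are
# (resource_key, event_type) pairs and states are dicts keyed by resource;
# these representations are simplified for illustration only.
from collections import Counter


def event_oracle(learning_events, testing_events):
    """Compare creation/deletion histories across the learning and testing runs."""
    learned, tested = Counter(learning_events), Counter(testing_events)
    alarms = []
    for key in learned.keys() | tested.keys():
        if learned[key] != tested[key]:
            alarms.append(f"{key}: {learned[key]} occurrences in learning run "
                          f"vs {tested[key]} in testing run")
    return alarms


def resource_oracle(learning_state, testing_state):
    """Compare the final cluster state across the learning and testing runs."""
    return [f"final state of {key} differs across runs"
            for key in learning_state.keys() | testing_state.keys()
            if learning_state.get(key) != testing_state.get(key)]
```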
Ideally, we want to compare every field (e.g., replica, status) of every resource (e.g., pod, statefulset). However, we find that in a k8s cluster, resources tend to have random names. We have identified two types of random names so far:
- Random suffixes produced by generateName. This usually happens to pods belonging to a deployment: a deployment mydeployment can own a pod with a random name mydeployment-12ab5.
- Random names not produced by generateName. For example, sometimes secret objects have fully random names.

These random names cause trouble when comparing the state/history. For example, say we observe a secret named mysecret-12ab5 in the learning run (but not in the testing run) and a secret named mysecret-09zy8 in the testing run (but not in the learning run): we have no way to tell whether the two logically represent the same resource and should be compared. Simply comparing two resources that are not logically the same can lead to confusing false alarms.
Currently, when encountering the above situation, we have to mask the value of such resources and only leverage very limited information (say, the number of secrets) in the oracle. We welcome any help that can eliminate the randomness in resource names, which would help us build better oracles.
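For illustration, a minimal sketch of this kind of masking (the helper name, regex, and owner-prefix idea are assumptions, not Sieve's actual code):

```python
import re

# Names produced via generateName end with a short random suffix, e.g.
# "mydeployment-12ab5"; collapse such names to a stable form so the same
# logical resource can be matched across learning and testing runs.
RANDOM_SUFFIX = re.compile(r"-[a-z0-9]{4,10}$")


def normalize_name(name, owner_prefixes):
    """Map a possibly-randomized resource name to a stable logical name.

    owner_prefixes: stable prefixes we know about (e.g. deployment names);
    anything matching prefix + random suffix is masked.
    """
    for prefix in owner_prefixes:
        if name.startswith(prefix + "-") and RANDOM_SUFFIX.search(name):
            return prefix + "-*"
    return name


# normalize_name("mysecret-12ab5", ["mysecret"]) and
# normalize_name("mysecret-09zy8", ["mysecret"]) both yield "mysecret-*",
# so the two runs can be compared; fully random names still need masking.
```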
Hi,
Following the steps in reprod.md to reproduce intermediate-state bug 1 found in rabbitmq-operator (rabbitmq-cluster-operator-782), I am hitting a PermissionError:
$ python3 reproduce_bugs.py -c rabbitmq-operator -b intermediate-state-1
...
wait for final grace period 80 seconds
Traceback (most recent call last):
File "/Users/jshajigeorge/work/sieve/sieve.py", line 657, in run_test
run_workload(test_context)
File "/Users/jshajigeorge/work/sieve/sieve.py", line 550, in run_workload
os.killpg(streaming.pid, signal.SIGTERM)
PermissionError: [Errno 1] Operation not permitted
Total time: 254.97307181358337 seconds
Please refer to sieve_test_results/rabbitmq-operator-resize-pvc-rabbitmq-operator-intermediate-state-1.yaml.json for more detailed information
Full logs attached in sieve-rabbitmq-782-EPERM.txt
This was run on a Mac OS machine. Version details attached in sw_vers.txt.
I've been trying to port a controller to test it using Sieve. I've stumbled across a variety of minor issues, but have thus far managed to overcome them, so I won't talk about them here. Now, however, I'm at an issue that I'm not sure how to resolve. I've been following https://github.com/sieve-project/sieve/blob/main/docs/port.md and am currently at
First run Sieve learning stage
python3 sieve.py -p your-controller -t your-test-case-name -s learn -m learn-twice
Sieve appears to properly deploy my controller and execute my test case. Once the test case finishes executing, however, Sieve fails because it cannot find a mask.json file:
wait for final grace period 50 seconds
Generating controller family list...
Generating state update summary...
Generating end state...
Sanity checking the sieve log log/appian-operator/recreate/learn/learn-once/learn.yaml/sieve-server.log...
[FAIL] cannot find mask.json
The error appears to be stemming from here. What creates this file? Is it possible that an earlier step created it but I've since deleted it? Is the file meant to be created manually? I see no reference to it in the docs, and based on its appearance in other examples it seems the file isn't created manually.
Thanks!
PS: I would share the code, but the operator I'm porting is (currently) closed source. Please let me know what information might be useful and I'll try to share if I can!
Thanks for documenting the process to run sieve to reproduce the stale-state bug on the RabbitMQ operator.
I've been following the steps and while running the command
python3 sieve.py -c rabbitmq-operator -m test -w recreate -p bug_reproduction_test_plans/rabbitmq-operator-stale-state-1.yaml
I am facing an issue while creating the Kubernetes cluster with the ghcr.io/sieve-project/action/node:v1.18.9-test image.
I am getting the following error:
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
And successive attempts fail to connect to kind-external-load-balancer:
I0928 21:37:50.729870 139 round_trippers.go:443] GET https://kind-external-load-balancer:6443/healthz?timeout=10s in 1 milliseconds
I0928 21:37:50.730110 139 request.go:907] Got a Retry-After 1s response for attempt 1 to https://kind-external-load-balancer:6443/healthz?timeout=10s
I0928 21:37:51.731382 139 round_trippers.go:443] GET https://kind-external-load-balancer:6443/healthz?timeout=10s in 1 milliseconds
I0928 21:37:51.731424 139 request.go:907] Got a Retry-After 1s response for attempt 2 to https://kind-external-load-balancer:6443/healthz?timeout=10s
I0928 21:37:52.732674 139 round_trippers.go:443] GET https://kind-external-load-balancer:6443/healthz?timeout=10s in 1 milliseconds
Would really appreciate any help regarding this issue. (I am running kind on Ubuntu x86_64)
Now that kubernetes/kubernetes#94942 has landed, we might want to think about ways to use such tracing functionality to assist users in replicating/debugging time-travel issues, e.g. by injecting custom trace data, visualizing flows, etc.
Just thinking out loud here, I haven't spent much time thinking more deeply about the applicability of tracing in the various areas of sieve (beyond the detection algo).
Hi! I tried to reproduce the full evaluation from the paper, but I got 'False' results for all bugs in the 'reproduced' column. I checked the test results file and the "exception_message" is like this:
"exception_message": "Traceback (most recent call last):\n File "/home/sieve/.local/lib/python3.8/site-packages/docker/api/client.py", line 214, in _retrieve_server_version\n return self.version(api_version=False)["ApiVersion"]\n File "/home/sieve/.local/lib/python3.8/site-packages/docker/api/daemon.py", line 181, in version\n return self._result(self._get(url), json=True)\n File "/home/sieve/.local/lib/python3.8/site-packages/docker/utils/decorators.py", line 46, in inner\n return f(self, *args, **kwargs)\n File "/home/sieve/.local/lib/python3.8/site-packages/docker/api/client.py", line 237, in _get\n return self.get(url, **self._set_request_timeout(kwargs))\n File "/home/sieve/.local/lib/python3.8/site-packages/requests/sessions.py", line 600, in get\n return self.request("GET", url, **kwargs)\n File "/home/sieve/.local/lib/python3.8/site-packages/requests/sessions.py", line 587, in request\n resp = self.send(prep, **send_kwargs)\n File "/home/sieve/.local/lib/python3.8/site-packages/requests/sessions.py", line 701, in send\n r = adapter.send(request, **kwargs)\n File "/home/sieve/.local/lib/python3.8/site-packages/requests/adapters.py", line 486, in send\n resp = conn.urlopen(\n File "/home/sieve/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 790, in urlopen\n response = self._make_request(\n File "/home/sieve/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 496, in _make_request\n conn.request(\nTypeError: request() got an unexpected keyword argument 'chunked'\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "sieve.py", line 557, in run_test\n setup_cluster(test_context)\n File "sieve.py", line 307, in setup_cluster\n redirect_kubectl()\n File "sieve.py", line 161, in redirect_kubectl\n client = docker.from_env()\n File "/home/sieve/.local/lib/python3.8/site-packages/docker/client.py", line 96, in from_env\n return cl
Do you have any idea why this happened? I don't understand what "request() got an unexpected keyword argument 'chunked'" means, and it does not seem to be because the kind cluster accidentally crashed, as mentioned in README.md.
Thanks in advance for any help!
Hi!
It appears that only v1.18.9 is currently available for kind node images.
https://github.com/sieve-project/sieve/pkgs/container/action%2Fnode
However, controllers using the latest features (SSA, etc.) do not work well with v1.18.9. Are there any plans to support node images with versions higher than v1.18.9?
Thanks!
https://github.com/sieve-project/sieve/blob/main/docs/port.md#test has a broken link to https://github.com/sieve-project/sieve/blob/main/controllers.py
I briefly looked for a file by that name and didn't see it; presumably a rename occurred?
First, very nice project, folks!
I'm wondering if I can leverage the tools here to do stress/load testing on controllers and to check their correctness under heavy load/traffic. If not, is it something that could be done by modifying some of the code or adding a few patches?
Hi, you mentioned in the wish list supporting an e2e testing framework like https://github.com/kubernetes-sigs/e2e-framework to make writing tests easier. We are considering making some contributions to it, but to be honest we don't have much previous experience in this area.
Could you provide some suggestions on how to do it, or on what the difficulties are?
In addition, we are looking for some short-term and feasible work to contribute to this project, so if you could give us some insights, we would be really happy to hear them!
As mentioned in #114
Need to investigate it later
⭐ Following up from NA KubeCon 2021 ⭐
Per my discussion with Lalith Suresh and Xudong Sun
Safety checking throughout the testing process would be invaluable for our controller, which enforces a set order of dependencies between the pods of a group of Deployments.
For example, we have 3 Deployments First, Second, and Third -- where the pods of Second rely on the pods of First being available, and the pods of Third rely on the pods of both Second and First.
For simplification it's easiest to imagine that all three deployments have the same number of replicas e.g. 5; however, in reality we calculate this based on a ratio between the deployments*.
Consider the case where each Deployment is expected to have 5 replicas at the end of the roll and the dependency structure is as described above:
Time | First | Second | Third |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 |
2 | 2 | 1 | 0 |
3 | 3 | 1 | 1 |
4 | 4 | 2 | 2 |
5 | 4 | 3 | 3 |
6 | 5 | 3 | 3 |
7 | 5 | 4 | 3 |
8 | 5 | 5 | 4 |
9 | 5 | 5 | 5 |
We would like to be able to check that we are not violating this dependency tree while Deployments are becoming available.
- For example, First can have 7 replicas, Second can have 5, and Third can have 13. The ratio between First and Second would be: for every pod of First we can have (1/7 * 5 = ~0.7) pods of Second, and similarly for the ratios between Second and Third, First and Third, etc... (A rough sketch of such a check is given below.)
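A minimal sketch of the invariant we have in mind (the function names and the exact ratio rule are our own simplification, not something Sieve provides today):

```python
def allowed_replicas(parent_available, parent_desired, child_desired):
    """How many pods of the dependent Deployment are allowed, scaling the
    parent's available pods by the ratio of desired replica counts."""
    return int(parent_available * child_desired / parent_desired)


def check_dependency_order(available, desired, depends_on):
    """available/desired: dicts keyed by Deployment name.
    depends_on: maps each Deployment to the Deployments it relies on."""
    violations = []
    for child, parents in depends_on.items():
        for parent in parents:
            limit = allowed_replicas(available[parent], desired[parent], desired[child])
            if available[child] > limit:
                violations.append(
                    f"{child} has {available[child]} available pods but only "
                    f"{limit} are allowed given {parent}={available[parent]}")
    return violations


# With 5 desired replicas each, row 4 of the table above passes the check:
# check_dependency_order({"First": 4, "Second": 2, "Third": 2},
#                        {"First": 5, "Second": 5, "Third": 5},
#                        {"Second": ["First"], "Third": ["First", "Second"]})
```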
As mentioned in a previous meeting, through batch analysis on the mongodb operator, there are some cases where the crucial event is matched but the side effect event is not detected on the API server side (however, we observed the side effect issued by the operator).
All of those failed patterns have a crucial event on a service resource, e.g.
se-name: mongodb-cluster-cfg
se-namespace: default
se-rtype: service
se-etype: ADDED
After dumping all the event keys from here, I observed that keys related to service look like /services/specs/default/mongodb-cluster-cfg or /services/endpoints/default/mongodb-cluster-cfg, and the event resource is parsed as spec / endpoints instead of service according to the logic here.
After looking into some source code of k8s, I figured out the reason behind this. The basic idea is that k8s adds a special prefix to certain resources when forming the key stored in etcd. For endpoints, the key is transformed into services/endpoints; for service, the key is services/specs. For our matching, we also need to account for those special prefixes.
The current fix is to manually map services/endpoints to endpoints, and services/specs to services. After the fix, service-related side effect events can be successfully detected, and sieve can then detect bugs related to service ADD / DELETE.
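A minimal sketch of the mapping described above (the helper name is ours; the prefix table only covers the two cases mentioned here, and a complete fix would need the full set of special prefixes the API server uses):

```python
# Special etcd key prefixes that must be translated back to a resource type
# before matching side effect events; only the cases discussed above are listed.
SPECIAL_PREFIXES = {
    "services/specs": "service",
    "services/endpoints": "endpoints",
}


def resource_type_from_etcd_key(key):
    """E.g. "/services/specs/default/mongodb-cluster-cfg" -> "service"."""
    parts = key.strip("/").split("/")
    prefix = "/".join(parts[:2])
    if prefix in SPECIAL_PREFIXES:
        return SPECIAL_PREFIXES[prefix]
    # Default: the first path segment names the resource type.
    return parts[0]
```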
⭐ Following up from NA KubeCon 2021 ⭐
Per my discussion with Lalith Suresh and Xudong Sun
Is there a way to test race conditions between multiple controllers running in the same Kubernetes cluster simultaneously with Sieve? Testing each controller independently may work as an alternative approach but it can become impractical given the number of controllers that may be running in a given Kubernetes cluster and the desired number of test workloads to be tested.
For example, testing the HPA controller, VPA controller, and Cluster Autoscaling controller along with a custom controller/operator requires testing each component separately for each test workload you'd like to check.
Backtrace:
Traceback (most recent call last):
File "sieve.py", line 714, in <module>
options.phase,
File "sieve.py", line 498, in run
phase,
File "sieve.py", line 439, in run_test
oracle_config,
File "sieve.py", line 312, in check_result
analyze.analyze_trace(project, log_dir, two_sided=two_sided)
File "/home/tyler/sieve/analyze.py", line 372, in analyze_trace
causality_graph = build_causality_graph(event_list, side_effect_list)
File "/home/tyler/sieve/analyze.py", line 320, in build_causality_graph
causality_graph.sanity_check()
File "/home/tyler/sieve/analyze_util.py", line 508, in sanity_check
> self.event_vertices[i - 1].content.id
AssertionError
Run learning stage for workload scaleup-scaledown-tserver in yugabyte-operator
python3 sieve.py -p yugabyte-operator -d tylergu1998 -s learn -t scaleup-scaledown-tserver
The README only states that kind is required, but it would be good to explicitly state that:
- etcd as the backing store is assumed

The second bullet is important IMHO: as per my understanding, we rely on the monotonicity and total ordering guarantees of etcd (revisions/resourceVersions) for the correctness of the checker. K8s (and derivatives like k3s) allow for custom stores (backends) which might have different semantics or not provide linearizability guarantees.
Please confirm whether my understanding of the behavior required of the underlying K8s store for Sieve to reliably detect bugs is correct. If not, bullet two might be obsolete.
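To make the concern concrete, here is a minimal sketch (a hypothetical helper, not part of Sieve) of the kind of assumption the checker relies on; a store without etcd's total ordering could violate it:

```python
def assert_monotonic_resource_versions(events):
    """events: (object_key, resourceVersion) pairs in the order they were
    observed. With etcd as the backing store, resourceVersions for an object
    only move forward; other backends may not give this guarantee."""
    last_seen = {}
    for key, rv in events:
        if key in last_seen and int(rv) < int(last_seen[key]):
            raise AssertionError(
                f"resourceVersion went backwards for {key}: {last_seen[key]} -> {rv}")
        last_seen[key] = rv
```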
There is a manual porting process to onboard a project to use Sieve. We hope to automate this step by integrating Sieve with existing frameworks like operator-sdk so that users can build controllers that are Sieve-friendly from the very beginning.
Sieve currently places the instrumentation in controller-runtime to intercept the interactions between the controller and the Kubernetes API. That said, we also need to support controllers that are not built on top of controller-runtime but directly rely on client-go.
While working on fixes for K8SPXC-725 and K8SPXC-763, I ran into the following two issues:
Firstly, sieve saves a snapshot of all the CRD config files for a given operator, and those config files can easily diverge from upstream. For example, when I tried to run my fixed xtradb operator again for the same workload, my fix was checked out from the main branch (the latest version of the xtradb operator), but the CRD config files inside sieve were from an older version, so I had to spend some effort syncing the CRD files stored in sieve with upstream before the xtradb operator could be set up with matching CRD configs.
This issue will also occur if developers want to replay bugs against their own versions of the operators.
Secondly, also for xtradb, there is some tricky logic to specify an init image (the image for an init container) for the xtradbcluster. Basically, if the version specified for the xtradbcluster is the same as the operator's, the init image will be the same as the operator image, e.g. xxx/xtradb-operator:time-travel.
However, if the version of the xtradbcluster (e.g. 1.7) is different from the operator's, the init image will be xxx/xtradb-operator:1.7, and obviously we do not build an image such as xxx/xtradb-operator:1.7.
In that case, we also need to update all the workloads' config files (which specify the version of the xtradbcluster) to catch up with the version of the operator, so that the init image will be the same as the operator's; otherwise, we may get an image pulling 404 error.
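For reference, the init-image behavior described above boils down to something like this (a simplified sketch of the operator's logic as we understand it, not its actual code):

```python
def init_image(cluster_version, operator_version, operator_image):
    """Pick the init container image for an xtradbcluster."""
    if cluster_version == operator_version:
        # Matches the operator: reuse the (instrumented) operator image,
        # e.g. xxx/xtradb-operator:time-travel.
        return operator_image
    # Otherwise a version-tagged image such as xxx/xtradb-operator:1.7 is
    # expected, which Sieve never builds, so the pull fails with a 404.
    return f"xxx/xtradb-operator:{cluster_version}"
```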
I am hitting issues when trying to run Sieve with kapp-controller.
I am able to build the controller image successfully:
$ python3 build.py -c examples/kapp-controller -m all
...
Succeeded
kapp-controller-sha256-47c5a7b5df0fc9142e825b6ce5d767760db91b7d381bd0c2ce4b7fc05256c8ee
Untagged: kbld:kapp-controller-sha256-47c5a7b5df0fc9142e825b6ce5d767760db91b7d381bd0c2ce4b7fc05256c8ee
But running Sieve with kapp-controller in learn mode fails:
$ python3 sieve.py -c examples/kapp-controller -w create -m learn --build-oracle
...
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I0309 00:17:00.337861 217 initconfiguration.go:255] loading configuration from "/kind/kubeadm.conf"
...
[FAIL] kind create cluster --image ghcr.io/sieve-project/action/node:v1.24.10-learn --config kind_configs/kind-1a-2w.yaml
Traceback (most recent call last):
File "/Users/jshajigeorge/work/sieve/sieve.py", line 264, in setup_kind_cluster
os_system(
File "/Users/jshajigeorge/work/sieve/sieve_common/common.py", line 181, in os_system
raise Exception(
Exception: Failed to execute kind create cluster --image ghcr.io/sieve-project/action/node:v1.24.10-learn --config kind_configs/kind-1a-2w.yaml with return code 1
(full logs attached in kapp-learn.err.txt)
See kubelet-log.txt for the logs exported by kind (kind export logs).
I'm trying this on a Mac
$ sw_vers
ProductName: macOS
ProductVersion: 13.0.1
BuildVersion: 22A400