Comments (12)
/assign @skonto
from serving.
failed run here: https://prow.knative.dev/view/gs/knative-prow/logs/nightly_serving_main_periodic/1793933717098467328
(in case you need logs)
@dprotaso
The failed run prints the following before timing out:
ha.go:92: WaitForNewLeaders[controller] still see intersection: map[controller-d476b879b-hdtwr:{}]
It seems that there are enough new leaders (first check passes) but an old lease is not deleted, although we delete all the old pods one by one.
The initial leader set (old) is:
controller_test.go:53: Got initial leader set: map[controller-d476b879b-6vpsg:{} controller-d476b879b-8jbrb:{} controller-d476b879b-hdtwr:{} controller-d476b879b-pmlmc:{}]
With regard to the number of controllers: each time we enable a feature related to net-certmanager-reconciler we make sure we disable it after the corresponding tests are done, and we also restart the controller pod. An exception to that is the https tests, where we run all tests end to end with https enabled (to be precise, there we could increase the number of controllers).
Here we care about the regular run, and it does not seem to be related, as we have enough leaders (the lease count is sufficient). I think the question is whether we have old leases staying around (e.g. due to the new reconciler).
🤔
It seems that is the case: when I restart the pod, the certificate reconciler's lease is not garbage collected:
kubectl -n knative-serving get lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-01 -oyaml

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2024-05-27T09:47:49Z"
  name: controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-01
  namespace: knative-serving
  resourceVersion: "18567"
  uid: 0f7b4bd6-3a67-410d-bfb2-3497f18d1f35
spec:
  acquireTime: "2024-05-27T09:57:10.364190Z"
  holderIdentity: ""
  leaseDurationSeconds: 1
  leaseTransitions: 0
  renewTime: "2024-05-27T09:57:10.364190Z"
```
In our tests we have:

```go
if lease.Spec.HolderIdentity == nil {
	continue
}
```

We never check for the empty string, though (I verified that the empty string is what our tests see). So we either delete all related leases after we restart the controller pod, or we add that condition in the tests to skip reporting the old lease related to the certificate reconciler.
@skonto looks like this is still flaking - I think we also need to update the controller count in the test
I can try that (we don't always have 10, by the way; it depends on whether encryption is on), but the failure shows:
ha.go:93: WaitForNewLeaders[controller] still see intersection: map[controller-6fd686dfd4-glxh4:{}]
So that means not all reconcilers of the new pod are becoming leaders in time?
Could you elaborate on how the reconciler count is related to the failure we see?
The code is the following:
```go
err := wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true, func(ctx context.Context) (bool, error) {
	currLeaders, err := GetLeaders(ctx, t, client, deploymentName, namespace)
	if err != nil {
		return false, err
	}
	if len(currLeaders) < n {
		t.Logf("WaitForNewLeaders[%s] not enough leaders, got: %d, want: %d", deploymentName, len(currLeaders), n)
		return false, nil
	}
	l := sets.New[string](currLeaders...)
	if previousLeaders.HasAny(currLeaders...) {
		t.Logf("WaitForNewLeaders[%s] still see intersection: %v", deploymentName, previousLeaders.Intersection(l))
		return false, nil
	}
	leaders = l
	return true, nil
})
```
n = buckets * #reconcilers. Since we have 9*10 in tests, the len(currLeaders) < n check is going to pass anyway whether we have 9 or 10 reconcilers, because at some point we will have 90 or 100 leaders (it does not matter whether the certificate reconciler is on).
Now in the test we get the leader list and then based on that we delete each pod later on:
controller_test.go:53: Got initial leader set: map[controller-6fd686dfd4-6znfs:{} controller-6fd686dfd4-dfv4q:{} controller-6fd686dfd4-glxh4:{} controller-6fd686dfd4-pdksh:{}]
If controller-6fd686dfd4-glxh4 is deleted, then we should not see any lease related to that pod in the holderIdentity field, but maybe sometimes leases are not updated fast enough. Anyway, I will debug further.
For the sake of being precise I am working on setting the reconciler count to 10 (it is tricky, as we need to know whether we are running with encryption on), but it does not seem to be the root cause.
@dprotaso could it be that we are facing something similar to kubernetes/client-go#1155? Maybe, although we do a graceful shutdown, the lease is not released when we restart the pod (disabling the certificate reconciler), hence the relic of the certificate lease. I will dig into it.
Latest run here: https://storage.googleapis.com/knative-prow/pr-logs/pull/knative_serving/15312/istio-latest-no-mesh_serving_main/1799043483181256704/build-log.txt. One lease is left without an identity update.
I am wondering if the stop-leading callback was called for that bucket when the controller pod was killed.
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.01-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.02-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.03-of-10 as it has no holder
ha.go:70: GetLeaders[controller] adding lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.04-of-10 for pod controller-7f689f785c-jtsgf
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.05-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.06-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.07-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.08-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.09-of-10 as it has no holder
A quick thing to check is to update the client-go libs and see if there is any change - in case they've fixed any bugs.
Generally we should be releasing the leases on a normal shutdown - I recall the library did that in the past.
@dprotaso I ran some tests. It seems we are not waiting long enough for the lease cleanup. Check here. With 110 leases, a 30-second graceful shutdown is not enough. Also, as a side note, if you have too many leases you might run into rate-limit issues.
As @dprotaso pointed out it could be due to a bug in client-go such as kubernetes/client-go@f9eba8e. I am exploring why leases are not being updated.
Latest update here: #15321 (comment). Go client lib is not updating all leases.