
`TestControllerHA` is flakey (knative/serving, CLOSED)

dprotaso commented on July 20, 2024
`TestControllerHA` is flakey


Comments (12)

dprotaso commented on July 20, 2024

/assign @skonto


dprotaso commented on July 20, 2024

failed run here: https://prow.knative.dev/view/gs/knative-prow/logs/nightly_serving_main_periodic/1793933717098467328

(in case you need logs)


skonto commented on July 20, 2024

@dprotaso
The failed run prints the following before timing out:

    ha.go:92: WaitForNewLeaders[controller] still see intersection: map[controller-d476b879b-hdtwr:{}]

It seems that there are enough new leaders (the first check passes), but an old lease is not deleted, although we delete all the old pods one by one.
The initial (old) leader set is:

 controller_test.go:53: Got initial leader set: map[controller-d476b879b-6vpsg:{} controller-d476b879b-8jbrb:{} controller-d476b879b-hdtwr:{} controller-d476b879b-pmlmc:{}]

Regarding the number of controllers: each time we enable a feature related to net-certmanager-reconciler we make sure to disable it again after the corresponding tests are done, and we also restart the controller pod. An exception to that is the https tests, where we run all tests end to end with https enabled (to be precise, there we could increase the number of controllers).
Here we care about the regular run, and the count does not seem to be related, since we have enough leaders (there are enough leases either way). I think the question is whether old leases stay around (e.g. due to the new reconciler).

🤔


skonto commented on July 20, 2024

It seems that is the case: when I restart the pod, the certificate reconciler's lease is not garbage collected:

kubectl -n knative-serving get lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-01 -oyaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2024-05-27T09:47:49Z"
  name: controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-01
  namespace: knative-serving
  resourceVersion: "18567"
  uid: 0f7b4bd6-3a67-410d-bfb2-3497f18d1f35
spec:
  acquireTime: "2024-05-27T09:57:10.364190Z"
  holderIdentity: ""
  leaseDurationSeconds: 1
  leaseTransitions: 0
  renewTime: "2024-05-27T09:57:10.364190Z"

In our tests we have:

		if lease.Spec.HolderIdentity == nil {
			continue
		}

We never check for the empty string, though (I verified that the empty string is what our tests see). So we either delete all related leases after we restart the controller pod, or we add that condition in the tests to skip reporting the old lease that belongs to the certificate reconciler.
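
For reference, the second option is a one-line change; a minimal sketch of the adjusted check (same loop as above, the empty-string condition is the only addition):

	// Sketch: treat an empty holder identity the same as a nil one, so a
	// released lease (like the leftover certificate reconciler lease) is
	// skipped instead of being reported as a leader.
	if lease.Spec.HolderIdentity == nil || *lease.Spec.HolderIdentity == "" {
		continue
	}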


dprotaso commented on July 20, 2024

@skonto looks like this is still flaking - I think we also need to update the controller count in the test

https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_serving/15277/istio-latest-no-mesh_serving_main/1796264673356550144


skonto commented on July 20, 2024

@dprotaso

I can try that (we don't always have 10, by the way; it depends on whether encryption is on), but the failure shows:

    ha.go:93: WaitForNewLeaders[controller] still see intersection: map[controller-6fd686dfd4-glxh4:{}]

So does that mean that not all reconcilers of the new pod become leaders in time?

Could you elaborate on how the reconciler count is related to the failure we see?

The code is the following:

	err := wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true, func(ctx context.Context) (bool, error) {
		currLeaders, err := GetLeaders(ctx, t, client, deploymentName, namespace)
		if err != nil {
			return false, err
		}
		if len(currLeaders) < n {
			t.Logf("WaitForNewLeaders[%s] not enough leaders, got: %d, want: %d", deploymentName, len(currLeaders), n)
			return false, nil
		}
		l := sets.New[string](currLeaders...)
		if previousLeaders.HasAny(currLeaders...) {
			t.Logf("WaitForNewLeaders[%s] still see intersection: %v", deploymentName, previousLeaders.Intersection(l))
			return false, nil
		}
		leaders = l
		return true, nil
	})

n = buckets * number of reconcilers. Since we use 9*10 in the tests, the `if len(currLeaders) < n` check is going to pass eventually whether we have 9 or 10 reconcilers, because at some point we will have 90 or 100 leaders (it does not matter whether the certificate reconciler is on).

Now in the test we get the leader list and then based on that we delete each pod later on:

controller_test.go:53: Got initial leader set: map[controller-6fd686dfd4-6znfs:{} controller-6fd686dfd4-dfv4q:{} controller-6fd686dfd4-glxh4:{} controller-6fd686dfd4-pdksh:{}]

If controller-6fd686dfd4-glxh4 is deleted, then we should not see any lease naming that pod in the holderIdentity field, but maybe sometimes the leases are not updated fast enough. Anyway, I will debug further.
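
One way to rule that out would be to explicitly wait, after deleting a pod, until no lease still names it as holder. A minimal sketch (hypothetical helper, not in the test suite; it uses the same client-go/apimachinery packages as the snippet above and assumes the holder identity starts with the pod name, which is how GetLeaders maps leases to pods):

	// Hypothetical helper: after deleting a pod, poll until no lease in the
	// namespace still lists that pod as its holder, so the test only moves on
	// once the old pod's leases have been released or taken over.
	func waitForPodLeasesReleased(ctx context.Context, kube kubernetes.Interface, namespace, pod string) error {
		return wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true, func(ctx context.Context) (bool, error) {
			leases, err := kube.CoordinationV1().Leases(namespace).List(ctx, metav1.ListOptions{})
			if err != nil {
				return false, err
			}
			for _, l := range leases.Items {
				// Assumption: the holder identity is prefixed with the pod name.
				if l.Spec.HolderIdentity != nil && strings.HasPrefix(*l.Spec.HolderIdentity, pod) {
					return false, nil
				}
			}
			return true, nil
		})
	}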


skonto commented on July 20, 2024

For the sake of being precise, I am working on setting the reconciler number to 10 (it is tricky, as we need to know whether we are running with encryption on), but that does not seem to be the root cause.


skonto commented on July 20, 2024

@dprotaso it could be that we are facing something similar to kubernetes/client-go#1155: maybe, even though we do a graceful shutdown, the lease is not released when we restart the pod (to disable the certificate reconciler), and that is why the certificate lease relic is left behind. I will dig into it.

Latest run here: https://storage.googleapis.com/knative-prow/pr-logs/pull/knative_serving/15312/istio-latest-no-mesh_serving_main/1799043483181256704/build-log.txt. One lease is left without an identity update.
I am wondering whether the stop-leading callback was called for that bucket when the controller pod was killed.

ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.01-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.02-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.03-of-10 as it has no holder
ha.go:70: GetLeaders[controller] adding lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.04-of-10 for pod controller-7f689f785c-jtsgf
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.05-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.06-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.07-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.08-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.09-of-10 as it has no holder


dprotaso commented on July 20, 2024

A quick thing to check is to update the client-go libs and see if there is any change - in case they've fixed any bugs.

Generally we should be releasing the leases on a normal shutdown - I recall the library did that in the past.
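
For context, plain client-go only releases the lock on shutdown when `ReleaseOnCancel` is enabled; a minimal sketch of that knob (illustrative client-go usage with assumed lock and durations, not our knative/pkg wrapper):

	// Illustrative client-go sketch: with ReleaseOnCancel set, the elector
	// rewrites the Lease with an empty holder identity when ctx is cancelled,
	// so a normal shutdown releases the lease instead of leaving it to expire.
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock, // e.g. a *resourcelock.LeaseLock for one bucket (assumed)
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start reconciling this bucket */ },
			OnStoppedLeading: func() { /* stop reconciling */ },
		},
	})

That released state matches the lease dump above: an empty holderIdentity and leaseDurationSeconds of 1.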


skonto commented on July 20, 2024

@dprotaso I ran some tests. It seems we are not waiting long enough for the lease cleanup; check here. With 110 leases, a 30-second graceful shutdown is not enough. Also, as a side note, if you have too many leases you might run into rate-limit issues.


skonto commented on July 20, 2024

As @dprotaso pointed out, it could be due to a bug in client-go, such as kubernetes/client-go@f9eba8e. I am exploring why leases are not being updated.


skonto commented on July 20, 2024

Latest update here: #15321 (comment). Go client lib is not updating all leases.

