
`TestControllerHA` is flakey (knative/serving, CLOSED)

dprotaso commented on July 20, 2024
`TestControllerHA` is flakey


Comments (12)

dprotaso commented on July 20, 2024

/assign @skonto


dprotaso commented on July 20, 2024

failed run here: https://prow.knative.dev/view/gs/knative-prow/logs/nightly_serving_main_periodic/1793933717098467328

(in case you need logs)


skonto commented on July 20, 2024

@dprotaso
The failed run prints the following before timing out:

    ha.go:92: WaitForNewLeaders[controller] still see intersection: map[controller-d476b879b-hdtwr:{}]

It seems that there are enough new leaders (the first check passes), but an old lease is not deleted, although we delete all the old pods one by one.
The initial (old) leader set is:

 controller_test.go:53: Got initial leader set: map[controller-d476b879b-6vpsg:{} controller-d476b879b-8jbrb:{} controller-d476b879b-hdtwr:{} controller-d476b879b-pmlmc:{}]

Regarding the number of controllers: each time we enable a feature related to net-certmanager-reconciler we make sure to disable it again after the corresponding tests are done, and we also restart the controller pod. An exception to that is the https tests, where we run all tests end to end with https enabled (to be precise, there we could increase the number of controllers).
Here we care about the regular run, and the count does not seem to be related, since we have enough leaders (there are enough leases either way). I think the question is whether old leases stay around (e.g. due to the new reconciler).

🤔


skonto commented on July 20, 2024

It seems that is the case: when I restart the pod, the certificate reconciler's lease is not garbage collected:

kubectl -n knative-serving get lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-01 -oyaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2024-05-27T09:47:49Z"
  name: controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-01
  namespace: knative-serving
  resourceVersion: "18567"
  uid: 0f7b4bd6-3a67-410d-bfb2-3497f18d1f35
spec:
  acquireTime: "2024-05-27T09:57:10.364190Z"
  holderIdentity: ""
  leaseDurationSeconds: 1
  leaseTransitions: 0
  renewTime: "2024-05-27T09:57:10.364190Z"

In our tests we have:

		if lease.Spec.HolderIdentity == nil {
			continue
		}

We never check for the empty string, though (I verified that the empty string is what our tests see). So we either delete all related leases after we restart the controller pod, or we add that condition in the tests to skip reporting the old lease that belongs to the certificate reconciler.
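
For reference, the second option is a one-line change; a minimal sketch of the adjusted check (same loop as above, the empty-string condition is the only addition):

	// Sketch: treat an empty holder identity the same as a nil one, so a
	// released lease (like the leftover certificate reconciler lease) is
	// skipped instead of being reported as a leader.
	if lease.Spec.HolderIdentity == nil || *lease.Spec.HolderIdentity == "" {
		continue
	}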


dprotaso commented on July 20, 2024

@skonto looks like this is still flaking - I think we also need to update the controller count in the test

https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_serving/15277/istio-latest-no-mesh_serving_main/1796264673356550144


skonto commented on July 20, 2024

@dprotaso

I can try that (we don't always have 10, by the way; it depends on whether encryption is on), but the failure shows:

    ha.go:93: WaitForNewLeaders[controller] still see intersection: map[controller-6fd686dfd4-glxh4:{}]

So does that mean that not all reconcilers of the new pod become leaders in time?

Could you elaborate on how the reconciler count is related to the failure we see?

The code is the following:

	err := wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true, func(ctx context.Context) (bool, error) {
		currLeaders, err := GetLeaders(ctx, t, client, deploymentName, namespace)
		if err != nil {
			return false, err
		}
		if len(currLeaders) < n {
			t.Logf("WaitForNewLeaders[%s] not enough leaders, got: %d, want: %d", deploymentName, len(currLeaders), n)
			return false, nil
		}
		l := sets.New[string](currLeaders...)
		if previousLeaders.HasAny(currLeaders...) {
			t.Logf("WaitForNewLeaders[%s] still see intersection: %v", deploymentName, previousLeaders.Intersection(l))
			return false, nil
		}
		leaders = l
		return true, nil
	})

n = buckets * number of reconcilers. Since we use 9*10 in the tests, the `if len(currLeaders) < n` check is going to pass eventually whether we have 9 or 10 reconcilers, because at some point we will have 90 or 100 leaders (it does not matter whether the certificate reconciler is on).

Now in the test we get the leader list and then based on that we delete each pod later on:

controller_test.go:53: Got initial leader set: map[controller-6fd686dfd4-6znfs:{} controller-6fd686dfd4-dfv4q:{} controller-6fd686dfd4-glxh4:{} controller-6fd686dfd4-pdksh:{}]

If controller-6fd686dfd4-glxh4 is deleted, then we should not see any lease naming that pod in the holderIdentity field, but maybe sometimes the leases are not updated fast enough. Anyway, I will debug further.
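
One way to rule that out would be to explicitly wait, after deleting a pod, until no lease still names it as holder. A minimal sketch (hypothetical helper, not in the test suite; it uses the same client-go/apimachinery packages as the snippet above and assumes the holder identity starts with the pod name, which is how GetLeaders maps leases to pods):

	// Hypothetical helper: after deleting a pod, poll until no lease in the
	// namespace still lists that pod as its holder, so the test only moves on
	// once the old pod's leases have been released or taken over.
	func waitForPodLeasesReleased(ctx context.Context, kube kubernetes.Interface, namespace, pod string) error {
		return wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true, func(ctx context.Context) (bool, error) {
			leases, err := kube.CoordinationV1().Leases(namespace).List(ctx, metav1.ListOptions{})
			if err != nil {
				return false, err
			}
			for _, l := range leases.Items {
				// Assumption: the holder identity is prefixed with the pod name.
				if l.Spec.HolderIdentity != nil && strings.HasPrefix(*l.Spec.HolderIdentity, pod) {
					return false, nil
				}
			}
			return true, nil
		})
	}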


skonto commented on July 20, 2024

For the sake of being precise, I am working on setting the reconciler number to 10 (it is tricky, as we need to know whether we are running with encryption on), but that does not seem to be the root cause.


skonto commented on July 20, 2024

@dprotaso it could be that we are facing something similar to kubernetes/client-go#1155: maybe, even though we do a graceful shutdown, the lease is not released when we restart the pod (to disable the certificate reconciler), and that is why the certificate lease relic is left behind. I will dig into it.

Latest run here: https://storage.googleapis.com/knative-prow/pr-logs/pull/knative_serving/15312/istio-latest-no-mesh_serving_main/1799043483181256704/build-log.txt. One lease is left without an identity update.
I am wondering whether the stop-leading callback was called for that bucket when the controller pod was killed.

ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.01-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.02-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.03-of-10 as it has no holder
ha.go:70: GetLeaders[controller] adding lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.04-of-10 for pod controller-7f689f785c-jtsgf
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.05-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.06-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.07-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.08-of-10 as it has no holder
ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.09-of-10 as it has no holder


dprotaso commented on July 20, 2024

A quick thing to check is to update the client-go libs and see if there is any change - in case they've fixed any bugs.

Generally we should be releasing the leases on a normal shutdown - I recall the library did that in the past.
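
For context, plain client-go only releases the lock on shutdown when `ReleaseOnCancel` is enabled; a minimal sketch of that knob (illustrative client-go usage with assumed lock and durations, not our knative/pkg wrapper):

	// Illustrative client-go sketch: with ReleaseOnCancel set, the elector
	// rewrites the Lease with an empty holder identity when ctx is cancelled,
	// so a normal shutdown releases the lease instead of leaving it to expire.
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock, // e.g. a *resourcelock.LeaseLock for one bucket (assumed)
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start reconciling this bucket */ },
			OnStoppedLeading: func() { /* stop reconciling */ },
		},
	})

That released state matches the lease dump above: an empty holderIdentity and leaseDurationSeconds of 1.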


skonto commented on July 20, 2024

@dprotaso I ran some tests. It seems we are not waiting long enough for the lease cleanup; check here. With 110 leases, a 30-second graceful shutdown is not enough. Also, as a side note, if you have too many leases you might run into rate-limit issues.


skonto commented on July 20, 2024

As @dprotaso pointed out, it could be due to a bug in client-go, such as kubernetes/client-go@f9eba8e. I am exploring why leases are not being updated.


skonto commented on July 20, 2024

Latest update here: #15321 (comment). Go client lib is not updating all leases.

