
Comments (22)

larte commented on July 24, 2024

That would work for non-production workloads, which applies to working with the latest tag anyway.

The other option would be to set spec.strategy.type to 'Recreate' in the deployment, which results in some downtime as well, but wouldn't require changes in keel.

I'm currently trying out a very rough patch where no reset is performed, but an ENV variable is set on each container, resulting in a new rc each time. What is your opinion on this? I remember seeing some discussion earlier on the force-policy feature ticket.

from keel.

rusenask commented on July 24, 2024

No, I didn't see it. Yeah, totally forgot that perms are required for deletion. Only pod deletion permissions are required, thanks!

Regarding the quay:

After a simple unit test that pretty much does the same thing as the Zalando registry one, Quay returns an error (every registry wants to be unique). Will get it fixed.


rusenask commented on July 24, 2024

It seems that Kubernetes doesn't manage to destroy existing replicas in time. Could you try scaling down to 1 replica and then trying an update? If that solves the issue, Keel could do it for you. I imagine the workflow could be:

  1. Set replicas to 1 or 0
  2. Set tag to 0.0.0
  3. Set replica count and tag to whatever you had before


rusenask commented on July 24, 2024

Could work. Another option is to terminate pods, it could be done "slowly" so it's almost like a rolling update. Regarding non-production workloads - I guess it's reasonable to expect that production workloads would be versioned.

Regarding that patch - feel free to open a work in progress PR :)


taylorchu commented on July 24, 2024

I tested keel 0.4.7 with gke server version 1.8, and "force update" does not work for me.

Here is the sequence of events that happened:

1. The scheduler assigns the allocation to a node, and the replicaset-controller creates a new pod.
2. The image is set to 0.0.0 by keel. Since the image does not exist, it shows `Failed to pull image` and `Error syncing pod`.
3. The node backs off pulling the image.
4. The new pod is deleted.

The notification I got is that the image is updated successfully, and yet the pod is not updated at all. (There is only 1 pod.)

Instead of pulling tag 0.0.0, which creates unexpected "fail to pull" events in the cluster (unless that person knows keel very well), we only have to change the replica count from N to 0 and then from 0 to N.

@larte @rusenask


rusenask commented on July 24, 2024

Seems like the k8s scheduler behaviour changed. I think force update should be reimplemented following your suggestion. Seems like a clean approach.

@taylorchu do you have to wait a little bit when you set replicas to 0, or does it terminate pods immediately?


taylorchu commented on July 24, 2024

no, I do not set replica count to 0 myself.


rusenask commented on July 24, 2024

@taylorchu started looking at this issue. One problem with setting replicas to 0 is that the autoscaler would stop working (it has to be unset).

What about terminating all pods? That would result in k8s recreating them. If it were done with some breaks in the process, it could even mean no downtime.


The-Loeki commented on July 24, 2024

We're just starting to use Keel (on GCP K8s 1.8.7) and are hit with this problem on 0.6.1.
It occurred to me that the only reason Keel has to do this (and b0rks it up, apparently) is because Replication Controllers already explicitly refuse to (as per the K8s docs).

As my 5c, I think emulating the rolling update would be the cleanest way to go.

Also, we're quite happy running (a carefully selected set of) latest-tagged containers in production, and some apps have a gitflow where the master branch is always deemed stable; merges to the branch are only approved once they're production-ready, leading again to a valid use case for a stable/latest tag.


rusenask commented on July 24, 2024

Hi, thanks. Will get this sorted ASAP. Do you think my suggested strategy of terminating pods would do the job? A terminated pod will always pull the new version, as I understand it.


The-Loeki commented on July 24, 2024

Well, AFAIK you'd need to set imagePullPolicy: Always on the Deployment, but other than that we'd be perfectly happy with it; it's pretty much what we're doing manually now.
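For reference, a minimal (hypothetical) Deployment fragment with that setting; the container name and image are illustrative:

```yaml
spec:
  template:
    spec:
      containers:
        - name: webapp                 # illustrative name
          image: example/webapp:latest # floating tag
          imagePullPolicy: Always      # recreated pods re-pull the tag
```

Without Always, a node that already has a cached image for the tag would start the old version when the pod is recreated.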


rusenask commented on July 24, 2024

Awesome, I am a bit swamped by work these days but will try to add and test this strategy either this evening or on the weekend :)


The-Loeki commented on July 24, 2024

That would be awesome! We'll be more than happy to help you test the changes if you like.


rusenask commented on July 24, 2024

Hi @The-Loeki, just pushed an alpha tag built from #154. Did some testing and it seems to be a reliable way to force updates for the same image tag.

It would be nice if you did more testing, as it should also solve that other issue, #153 (I even added a unit test for that specific docker registry :)).

Migrated client-go (which is now split into multiple repos) to release-6.0, which should ensure that everything works for the foreseeable future. There were a bunch of other dependency updates that required more changes (how we parse images), so any additional testing would be really welcome :)


The-Loeki commented on July 24, 2024

Hi @rusenask thanks for your hard work :) Today we've done the first round of testing on the alpha tag.

The Good

  • Zalando works w00t
  • We've done two deployment upgrades with 'latest' tags to see how it works and it looks nice for now. We'll be doing a bunch more of those and be sure to let you know!

The Bad

  • RBAC permissions need to be fixed to allow deletion of pods.
    Question: do you need to delete replicasets and replicationcontrollers as well?
    I'll hack up a PR.
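As a guess at what that RBAC fix would look like, a hypothetical ClusterRole fragment granting the pod-deletion permission (names are illustrative; the real Role lives in Keel's manifests):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: keel
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "delete"]
```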

The Ugly

time="2018-03-09T10:23:29Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=coreos/dex tag=v2.9.0
2018/03/09 10:23:30 registry failed ping request, error: Get https://quay.io/v2/: http: non-successful response (status=401 body="{\"error\": \"Invalid bearer token format\"}")
time="2018-03-09T10:23:30Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/coreos/dex/manifests/v2.9.0 repository=coreos/dex reference=v2.9.0"
time="2018-03-09T10:23:30Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:c9ab4b2f064b8dd3cde614af50d5f1c49d6c45603ce377022c15bc9aa217e2db" image="quay.io/coreos/dex:v2.9.0" job_name=quay.io/coreos/dex schedule="@every 24h"
time="2018-03-09T10:23:37Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T10:23:37Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io
time="2018-03-09T10:23:37Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=jetstack/cert-manager-controller tag=v0.2.3
2018/03/09 10:23:37 registry failed ping request, error: Get https://quay.io/v2/: http: non-successful response (status=401 body="{\"error\": \"Invalid bearer token format\"}")
time="2018-03-09T10:23:37Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/jetstack/cert-manager-controller/manifests/v0.2.3 repository=jetstack/cert-manager-controller reference=v0.2.3"
time="2018-03-09T10:23:37Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:6bccc03f2e98e34f2b1782d29aed77763e93ea81de96f246ebeb81effd947085" image="quay.io/jetstack/cert-manager-controller:v0.2.3" job_name=quay.io/jetstack/cert-manager-controller schedule="@every 24h"
time="2018-03-09T10:24:35Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T10:24:35Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io
time="2018-03-09T10:25:30Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"

`curl -m 5 -Lv -H "Content-Type: application/json" https://quay.io/v2/jetstack/cert-manager-controller/manifests/v0.2.3` of course 'just works'

I'd venture from the logs that it tries to auth against Quay with an empty/nonexistent secret or something, but that's just a guess.


rusenask commented on July 24, 2024

Hi @The-Loeki thanks for trying it out :)

Great regarding the good part.

As for the bad, maybe it's angry about empty credentials (try sending empty basic auth). Not sure what changed, though. I will dig into it.


The-Loeki commented on July 24, 2024

Did you see my updated comments? I'm hacking up a PR with fixed RBAC perms, but I'm not sure if you need to be able to delete replicasets & controllers too?


The-Loeki commented on July 24, 2024

We'll be deploying Harbor as our own registry service soon, so you might want to get more coffee ;)


rusenask commented on July 24, 2024

at least it's open source :)


rusenask commented on July 24, 2024

Apparently that error was just a log of a failed ping; the manifest was retrieved successfully. I have removed the Ping function from the registry client, as I can see that the public index.docker.io doesn't have that endpoint anymore either. A new alpha image is available.

Merging into master branch.


The-Loeki commented on July 24, 2024

Looks much better indeed

[theloeki@murphy ~]$ kubectl -n kube-system logs -f keel-85f9fd6447-4gtt2 |grep quay
time="2018-03-09T12:19:13Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=coreos/dex tag=v2.9.0
time="2018-03-09T12:19:13Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/coreos/dex/manifests/v2.9.0 repository=coreos/dex reference=v2.9.0"
time="2018-03-09T12:19:14Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:c9ab4b2f064b8dd3cde614af50d5f1c49d6c45603ce377022c15bc9aa217e2db" image="quay.io/coreos/dex:v2.9.0" job_name=quay.io/coreos/dex schedule="@every 24h"
time="2018-03-09T12:19:18Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T12:19:18Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io
time="2018-03-09T12:19:18Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=jetstack/cert-manager-controller tag=v0.2.3
time="2018-03-09T12:19:18Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/jetstack/cert-manager-controller/manifests/v0.2.3 repository=jetstack/cert-manager-controller reference=v0.2.3"
time="2018-03-09T12:19:19Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:6bccc03f2e98e34f2b1782d29aed77763e93ea81de96f246ebeb81effd947085" image="quay.io/jetstack/cert-manager-controller:v0.2.3" job_name=quay.io/jetstack/cert-manager-controller schedule="@every 24h"
time="2018-03-09T12:20:15Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T12:20:15Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io


rusenask commented on July 24, 2024

Fixed, available from 0.7.x.

