
Upgrading helm release causes stuck at "PENDING_UPGRADE" due to backup job pods (which are completed, so not in READY state) (helm-charts issue, closed)

timescale commented on June 5, 2024
Upgrading helm release causes stuck at "PENDING_UPGRADE" due to backup job pods (which are completed, so not in READY state)

from helm-charts.

Comments (10)

schahal commented on June 5, 2024

@feikesteenbergen thanks for that tidbit... Today I was able to get more information (yesterday I was more focused on getting a workaround out to unblock our production).

I was remiss in not pasting/checking tiller logs when helm was stuck. Today, as I did the helm upgrade described in the ticket, I checked to see what tiller was waiting on, and sure enough, it complains that the "backup" pods are not in a "ready" state (which they aren't because the backup job terminates the pods upon completion):

$  kubectl logs tiller-deploy-xxx -n yyy
...
[kube] 2020/04/14 16:36:49 Pod is not ready: <namespace>/timescaledb-xyz-full-weekly-1586657520-bgspt
[tiller] 2020/04/14 16:36:49 warning: Upgrade "timescaledb-xyz" failed: timed out waiting for the condition

Deleting each of the pods it complained about (full-weekly and incremental-daily) resolved the issue. Before I deleted the pods, the jobs and their pods looked like they had run fine (but the pods were obviously not in a "ready" state, because they had "Completed"):

Info on Backup Jobs and Pods

$ kubectl get cronjob,job -l release=timescaledb-live -n xyz
...
NAME                                                      COMPLETIONS   DURATION   AGE
job.batch/timescaledb-xyz-full-weekly-1586657520         1/1           3s         2d17h
job.batch/timescaledb-xyz-incremental-daily-1586571120   1/1           5s         3d17h
job.batch/timescaledb-xyz-incremental-daily-1586743920   1/1           3s         41h
job.batch/timescaledb-xyz-incremental-daily-1586830320   1/1           5s         17h

$ kubectl get pods...
...
timescaledb-xyz-full-weekly-1586657520-bgspt              0/1     Completed   0          2d16h
<similar for a few incremental-daily>

How Helm Deployed Successfully

While helm was upgrading today, and before it timed out, I deleted the weekly and daily backup pods (kubectl delete po timescaledb-xyz-full-weekly-1586657520-bgspt, along with the incremental-daily pods). Helm then got the release into a "Deployed" state (rather than "Failed").
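The manual deletion described above could be scripted roughly as follows. This is only a sketch, not something taken from the chart: the helper name completed_pods is hypothetical, and it assumes the default `kubectl get pods` column layout (NAME READY STATUS RESTARTS AGE), as in the listing above.

```shell
# Hypothetical helper: print the names of pods whose STATUS column reads "Completed".
# Expects `kubectl get pods --no-headers` style output on stdin.
completed_pods() {
  awk '$3 == "Completed" { print $1 }'
}

# Possible usage (namespace xyz as in the thread; review the list before deleting):
#   kubectl get pods -n xyz --no-headers | completed_pods
#   kubectl get pods -n xyz --no-headers | completed_pods | xargs kubectl delete pod -n xyz
```

Printing the names first, and only then piping them to a delete, keeps the step reviewable. This matters because the filter matches every Completed pod in the namespace, not only the backup ones.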

Questions:

  • Any reason the helm upgrade has tiller complaining about those completed backup pods? It doesn't seem to complain in our test cluster (with only difference being the default namespace).
  • I'll wait until the next incremental backup job runs to see if I can indeed reproduce this by doing an upgrade again. In meantime, any ideas?


feikesteenbergen commented on June 5, 2024

Not sure if the fact this error occurs on clusters where it's deployed on a non-default namespace is related.

We deploy them in different namespaces pretty much always and don't run into this, so that's not the issue.

We'll try to investigate. Using a Job inside a Helm upgrade path might not be the best idea: it can fail, and I don't think failure of the Job is covered very well in these Helm Charts.

The whole purpose of the Job is to do exactly what you described: up the connections from say 100 to 120, without having to restart the pods.


schahal commented on June 5, 2024

OK, ☝️ reproducible in my environment... I disabled and then re-enabled backup:

  1. The backup job was started and completed:
$ kubectl get cronjob,job -l release=timescaledb-xyz -n xyz
NAME                                               SCHEDULE        SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/timescaledb-xyz-full-weekly         12 02 * * 0     False     0        <none>          7h
cronjob.batch/timescaledb-xyz-incremental-daily   12 02 * * 1-6   False     0        57m             7h

NAME                                                      COMPLETIONS   DURATION   AGE
job.batch/timescaledb-xyz-incremental-daily-1587003120   1/1           5s         57m

## And the Pod
$ kubectl get pods -n xyz
...
timescaledb-xyz-incremental-daily-1587003120-x2rwg             0/1     Completed   0          57m
...
  2. Did a helm upgrade ... by changing a parameter... It was stuck in "PENDING_UPGRADE", so we checked the tiller log:
$ kubectl logs tiller-deploy-6dbd4749b6-rxqx5 -n kube-system
[kube] 2020/04/16 03:00:39 beginning wait for 15 resources with timeout of 5m0s
[kube] 2020/04/16 03:00:41 Pod is not ready: xyz/timescaledb-xyz-incremental-daily-1587003120-x2rwg
[kube] 2020/04/16 03:00:43 Pod is not ready: xyz/timescaledb-xyz-incremental-daily-1587003120-x2rwg
[kube] 2020/04/16 03:00:45 Pod is not ready: xyz/timescaledb-xyz-incremental-daily-1587003120-x2rwg
  3. It will stay like this until I delete the pod timescaledb-xyz-incremental-daily-1587003120-x2rwg, at which point it upgrades successfully.
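To see at a glance which pods tiller is waiting on in the steps above, the repeated log lines can be collapsed into a unique list. This is a minimal sketch; the helper name not_ready_pods is hypothetical, and it assumes the log line format shown in the excerpt above ("Pod is not ready: <namespace>/<pod>").

```shell
# Hypothetical helper: extract the unique "<namespace>/<pod>" entries from
# tiller log lines of the form "... Pod is not ready: <namespace>/<pod>".
not_ready_pods() {
  sed -n 's/.*Pod is not ready: \(.*\)$/\1/p' | sort -u
}

# Possible usage, with the tiller pod name from the log command above:
#   kubectl logs tiller-deploy-6dbd4749b6-rxqx5 -n kube-system | not_ready_pods
```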

Any clues how we can upgrade parameters without it "waiting" on the already terminated backup job pods?


schahal commented on June 5, 2024

I've updated the title to reflect root reason for the "PENDING_UPGRADE". @feikesteenbergen or anyone else in the timescale community: thoughts on how we can get around that rather than manually deleting pods before the helm upgrade? Or if we're missing something in the above?


feikesteenbergen commented on June 5, 2024

I'll have a look; I think it may be an issue with labelling: The Backup jobs getting labels that are interpreted by Helm as saying these jobs belong to Helm (they don't).
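One way to probe the labelling hypothesis is to check which pods carry the release label. The sketch below is illustrative only: the helper name pods_with_label is hypothetical, the label release=timescaledb-xyz is taken from the selector used earlier in the thread, and the helper assumes `kubectl get pods --no-headers --show-labels` output, where the last column is a comma-separated label list.

```shell
# Hypothetical helper: print the names of pods whose label list contains
# exactly the given key=value pair ($1).
# Expects `kubectl get pods --no-headers --show-labels` style output on stdin.
pods_with_label() {
  awk -v want="$1" '{
    n = split($NF, labels, ",")
    for (i = 1; i <= n; i++) {
      if (labels[i] == want) { print $1; break }
    }
  }'
}

# Possible usage:
#   kubectl get pods -n xyz --no-headers --show-labels | pods_with_label release=timescaledb-xyz
```

If completed backup pods show up in that list, it would be consistent with the hypothesis, though it would not prove that the labels are the cause.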


feikesteenbergen commented on June 5, 2024

Somehow I cannot reproduce this; I tried with both helm 2 and helm 3.

Could you give more information:

  • what is your helm version
  • what is your charts version
  • what parameter(s) do you change during the upgrade

Regardless of this, I'm going to be removing some labels from the jobs and the pods that are scheduled, because I think they are not useful and may actually be the cause here.


feikesteenbergen commented on June 5, 2024

This is (probably) fixed with https://github.com/timescale/timescaledb-kubernetes/releases/tag/v0.6.1.

However, I could not reproduce it locally, so it would be great if you could verify whether or not it is fixed, @schahal.


schahal commented on June 5, 2024

We'll give it a shot in the coming weeks! We have to schedule an activity to do the 0.5 -> 0.6 upgrade because clusters we can currently reproduce this on are being used. I'll update this ticket when complete.


feikesteenbergen commented on June 5, 2024

Feel free to reopen if the issue persists, as it should have been fixed.


CVirus commented on June 5, 2024

Facing exactly the same issue on a big chart I'm building. Very hard to reproduce as well :-(
Workaround is to delete the Completed pod that helm is waiting for.
