
Comments (9)

manno commented on June 2, 2024

The job controller in gitjob should collect the job's output from a Failed job. If I remember correctly the error is propagated from the job to the gitjob status, to the gitrepo status. UI finally reads it from the gitrepo status.

Does the error from the "bundlereader" not result in a Failed job? Does the controller fail to pick up the state, or does propagation fail?

from fleet.

Martin-Weiss commented on June 2, 2024

+1

One situation where we saw this "problem" was when the Helm credentials used to fetch an OCI Helm chart were not valid: the Rancher UI for Continuous Delivery showed the gitrepo with a green / OK status even though the job failed to fetch the Helm chart. Only the logs of the fleet container showed the problem.


khushalchandak17 commented on June 2, 2024

+1

I have seen similar behavior when an invalid path is provided in the gitrepo.

I have created a few scenarios to illustrate this issue in detail:
Scenario 1: gitrepo (name: failbranch) with the wrong branch, which shows the expected failed status on the gitrepo.
Scenario 2: gitrepo (name: test) with the wrong path.
Scenario 3: gitrepo (name: logapp) with an invalid chart version.

In scenario 1, I do see that gitrepo ends up with a failed status with the error reported as “No commit for branch: fakebranch,” which is the expected result.

In scenario 2, the gitrepo remains active even though an invalid directory path has been provided. However, for a fraction of a second the UI shows the error “no resource found at the following path to deploy:[<Path>]” with the gitrepo status ‘Git Updating.’ From the terminal, we can see a similar error in the gitjob status, but it stays there for only a few seconds; then, I guess, it reconciles and puts the gitrepo back in the active state, clearing the error from the UI.

In scenario 3, even though an invalid chart version is provided in fleet.yaml, the gitrepo again remains in the active state. For a fraction of a second the UI shows the error “no chart version found for <chart-version>.” We can see a similar error in the gitjob and the gitrepo. The gitrepo status was ‘Git Updating’, but after reconciling it changes to active.

The expected result in scenarios 2 and 3 was for the gitrepo status to change to failed and show the error, rather than reconciling and becoming active.

I have attached screenshots for the error captured over the UI for a fraction of a second in the second and third scenarios.
[Screenshots: Gitjob status; Scenario 2; Scenario 3]


Martin-Weiss commented on June 2, 2024

For debugging this is really annoying, especially because the failing pods (the fleet container fails) are deleted really fast, so getting the logs is not easy. As a workaround, I use a bash for loop to get the logs of the fleet container as soon as the new pod is launched.


skanakal commented on June 2, 2024

It appears to be functioning as intended, but the process is exceptionally swift, making it challenging to capture the information effectively.

I think the job is continually being deleted and retried, likely because of the fatal error condition detected in the GitJob status. It seems the gitjob controller is designed to respond to such errors by deleting the job to trigger a retry:

https://github.com/rancher/gitjob/blob/release/fleet/v0.9/pkg/controller/gitjob/gitjobs.go#L125

{
  "commit": "4ff289ba5a9108502f83ee41fb17208d84bf2bb0",
  "conditions": [
    {
      "lastUpdateTime": "2024-01-23T07:21:19Z",
      "status": "False",
      "type": "Reconciling"
    },
    {
      "lastUpdateTime": "2024-01-23T07:21:47Z",
      "message": "time=\"2024-01-23T07:21:44Z\" level=fatal msg=\"no chart version found for rancher-logging-45.5.0\"\n",
      "reason": "Stalled",
      "status": "True",
      "type": "Stalled"
    },
    {
      "lastUpdateTime": "2024-01-23T07:21:27Z",
      "status": "True",
      "type": "Synced"
    }
  ],
  "jobStatus": "Failed",
  "lastSyncedTime": "2024-01-23T07:21:27Z",
  "observedGeneration": 5,
  "updateGeneration": 11
}

time="2024-01-23T07:21:19Z" level=info msg="Deleting failed job to trigger retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:21:16Z" level=fatal msg="no chart version found for rancher-logging-45.5.0"\n"

time="2024-01-23T07:22:20Z" level=info msg="Deleting failed job to trigger retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:22:17Z" level=fatal msg="no chart version found for rancher-logging-45.5.0"\n"

I was able to see them in gitjob pod logs...
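
The Stalled condition in the status above is where the job's error message ends up. As a minimal sketch (this is not Fleet or gitjob code), the following Python reads a status document of that shape and returns the message of a condition whose type is Stalled and whose status is "True". The function name `stalled_message` and the trimmed sample are assumptions made for illustration; only the field names are taken from the status shown above.

```python
import json
from typing import Optional

# Trimmed copy of the status shown above (only the fields this sketch reads).
SAMPLE_STATUS = r'''
{
  "conditions": [
    {"status": "False", "type": "Reconciling"},
    {
      "message": "time=\"2024-01-23T07:21:44Z\" level=fatal msg=\"no chart version found for rancher-logging-45.5.0\"\n",
      "reason": "Stalled",
      "status": "True",
      "type": "Stalled"
    },
    {"status": "True", "type": "Synced"}
  ],
  "jobStatus": "Failed"
}
'''


def stalled_message(status: dict) -> Optional[str]:
    """Return the message of the first active Stalled condition, or None."""
    for cond in status.get("conditions", []):
        if cond.get("type") == "Stalled" and cond.get("status") == "True":
            return cond.get("message", "").strip()
    return None


if __name__ == "__main__":
    status = json.loads(SAMPLE_STATUS)
    print(status.get("jobStatus"), "-", stalled_message(status))
```

A check like this only helps while the condition is present; as described above, the job is deleted and retried, so the status may need to be sampled repeatedly.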


manno commented on June 2, 2024

> For debugging this is really annoying, especially because the failing pods (the fleet container fails) are deleted really fast, so getting the logs is not easy. As a workaround, I use a bash for loop to get the logs of the fleet container as soon as the new pod is launched.

Yes, you can also try "stern". If you know how to match the pod, e.g. by label, you can run stern -n cattle-fleet-system -l "app=fleet-job" and it will tail any output from jobs like that.
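
Where stern is not available, the polling workaround described in this thread could look roughly like the sketch below. It is an illustration under stated assumptions, not a tested recipe: the namespace and label selector are copied from the stern command above, the container name `fleet` is inferred from the phrase "the fleet container", and the helper names (`kubectl`, `unseen_pods`, `capture_logs`) are hypothetical. The gitjob log lines in this thread mention `fleet-local`, so the namespace may need adjusting.

```python
import json
import subprocess
import time
from typing import Callable, List, Set

# Assumptions for illustration; adjust to your cluster.
NAMESPACE = "cattle-fleet-system"  # from the stern command in this thread
SELECTOR = "app=fleet-job"         # label selector from the stern command
CONTAINER = "fleet"                # assumed name of "the fleet container"


def kubectl(args: List[str]) -> str:
    """Run kubectl and return stdout, or an empty string if the command fails."""
    result = subprocess.run(["kubectl", *args], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else ""


def unseen_pods(pod_list_json: str, seen: Set[str]) -> List[str]:
    """Return pod names from `kubectl get pods -o json` output that are not in `seen`."""
    if not pod_list_json:
        return []
    items = json.loads(pod_list_json).get("items", [])
    names = [item["metadata"]["name"] for item in items]
    return [name for name in names if name not in seen]


def capture_logs(run: Callable[[List[str]], str] = kubectl,
                 rounds: int = 60, delay: float = 1.0) -> None:
    """Poll for job pods and print each pod's logs once it yields output."""
    seen: Set[str] = set()
    for _ in range(rounds):
        listing = run(["get", "pods", "-n", NAMESPACE, "-l", SELECTOR, "-o", "json"])
        for name in unseen_pods(listing, seen):
            logs = run(["logs", "-n", NAMESPACE, name, "-c", CONTAINER])
            if logs:  # the container may not have started yet; retry next round
                seen.add(name)
                print(f"--- {name} ---\n{logs}")
        time.sleep(delay)
```

Calling `capture_logs()` polls once per second for a minute. The `run` parameter exists only so the loop can be exercised without a cluster.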


Martin-Weiss commented on June 2, 2024

Could Fleet and the Rancher UI be extended so that in the UI one can see that a specific git repo is constantly failing?


manno commented on June 2, 2024

> Could Fleet and the Rancher UI be extended so that in the UI one can see that a specific git repo is constantly failing?

How would you define "constantly failing"? Like a retry counter, which we reset on a successful deployment?
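
To make the question concrete, one possible reading of "a retry counter, which we reset on a successful deployment" is sketched below. This is purely illustrative and does not describe how Fleet tracks failures; the class name `RetryTracker`, its fields, and the threshold of 3 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RetryTracker:
    """Counts consecutive failed jobs; a success resets the count."""
    threshold: int = 3             # hypothetical cut-off for "constantly failing"
    consecutive_failures: int = 0
    last_error: str = ""

    def record(self, succeeded: bool, error: str = "") -> None:
        """Record one job outcome."""
        if succeeded:
            self.consecutive_failures = 0
            self.last_error = ""
        else:
            self.consecutive_failures += 1
            self.last_error = error

    @property
    def constantly_failing(self) -> bool:
        """True once the number of consecutive failures reaches the threshold."""
        return self.consecutive_failures >= self.threshold
```

With a counter like this, a UI could keep showing the last error while retries continue, instead of dropping it when a new job starts.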


0xavi0 commented on June 2, 2024

This is working as expected in fleet v0.10.0-rc.13 (Rancher 2.9-head)

I've tested it with Rancher 2.7.9 and, although I can see all the job pods trying to get an invalid version of a Helm chart, the gitrepo still shows up as ACTIVE in Rancher.

I see this:

NAME                    READY   STATUS   RESTARTS   AGE
supertest-512fe-p4gms   0/2     Error    0          31s
supertest-512fe-jcqnz   0/2     Error    0          23s
supertest-512fe-xxsmw   0/2     Error    0          5s

But Rancher is still showing this:
Image

If we test the same scenario with Rancher 2.9-head we can see:

NAME                      READY   STATUS      RESTARTS   AGE
supertest29-0ea0d-htsmw   0/1     Completed   0          2m9s
supertest29-160cf-d62f4   0/1     Error       0          69s
supertest29-160cf-gmzdv   0/1     Error       0          63s
supertest29-160cf-k9zxd   0/1     Error       0          48s

And, after a few seconds we can see the error in Rancher: (and the error persists)

Image

I'm closing this because its Milestone is 2.9.0 and it works fine for it.

