Comments (9)
The job controller in gitjob should collect the job's output from a Failed job. If I remember correctly, the error is propagated from the job to the gitjob status and then to the gitrepo status. The UI finally reads it from the gitrepo status.
Does the error from the "bundlereader" not result in a Failed job? Does the controller fail to pick up the state, or does propagation fail?
from fleet.
+1
One situation where we saw this problem was when the Helm credentials used to fetch an OCI Helm chart were not valid: the Rancher UI for Continuous Delivery showed the gitrepo with a green / OK status even though the job failed to fetch the Helm chart. Only checking the logs of the fleet container showed the problem.
+1
I have seen similar behavior when we provide an invalid path in the gitrepo.
Well, I have created a few scenarios to illustrate this issue in detail:
Scenario 1: gitrepo (name: failbranch) with the wrong branch, which shows the expected failed status on the gitrepo.
Scenario 2: gitrepo (name: test) with the wrong path.
Scenario 3: gitrepo (name: logapp) with an invalid chart version.
In scenario 1, the gitrepo ends up with a failed status and the error “No commit for branch: fakebranch,” which is the expected result.
In scenario 2, the gitrepo remains active even though an invalid directory path has been provided. However, for a fraction of a second the UI shows the error “no resource found at the following path to deploy:[<Path>]” with the gitrepo status ‘Git Updating.’ From the terminal, we can see a similar error in the gitjob status, where it stays for a few seconds; then, I guess, it reconciles and puts the gitrepo back in the active state, clearing the error from the UI.
In scenario 3, again, the gitrepo remains in the active state even though an invalid chart version is provided in the fleet.yaml. But again, for a fraction of a second the UI shows the error “no chart version found for <chart-version>.” We can see a similar error in the gitjob and the gitrepo. The gitrepo status was ‘Git Updating,’ but after reconciling it changes to active.
The expected result in scenarios 2 and 3 was for the gitrepo status to be set to failed and show the error, rather than reconciling and becoming active.
I have attached screenshots for the error captured over the UI for a fraction of a second in the second and third scenarios.
For debugging this is really annoying, especially because the failing pods (the fleet container fails) are deleted really fast, so getting the logs is not easy. As a workaround, I use a bash for loop to get the logs of the fleet container as soon as the new pod is launched.
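For anyone who wants the same kind of polling workaround outside of bash, here is a rough sketch in Python. It is not the loop I actually use, just the same idea: list the job pods, and grab the logs of the fleet container from every pod not seen before. The namespace and the label selector are assumptions and will need adjusting to your setup; `unseen_pods`, `kubectl` and `watch` are made-up helper names.

```python
import json
import subprocess
import time

NAMESPACE = "fleet-local"    # assumption: namespace where the job pods run
SELECTOR = "app=fleet-job"   # assumption: label that matches the job pods
CONTAINER = "fleet"          # the container whose logs we want


def unseen_pods(pod_list, seen):
    """Return pod names from a `kubectl get pods -o json` result that are not in `seen`."""
    names = [item["metadata"]["name"] for item in pod_list.get("items", [])]
    return [name for name in names if name not in seen]


def kubectl(*args):
    """Run kubectl and return its stdout, or an empty string if the command fails."""
    result = subprocess.run(["kubectl", *args], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else ""


def watch(poll_seconds=1.0):
    """Poll for new job pods and print the fleet container logs of each one once."""
    seen = set()
    while True:
        out = kubectl("get", "pods", "-n", NAMESPACE, "-l", SELECTOR, "-o", "json")
        pods = json.loads(out) if out else {"items": []}
        for name in unseen_pods(pods, seen):
            logs = kubectl("logs", "-n", NAMESPACE, name, "-c", CONTAINER)
            if logs:
                # Only mark the pod as seen once we actually got output;
                # otherwise it is retried on the next poll.
                seen.add(name)
                print(f"--- {name} ---\n{logs}", end="")
        time.sleep(poll_seconds)
```

Calling `watch()` starts the loop; stop it with Ctrl-C. Since the pods are deleted quickly, a short poll interval is probably needed, and some pods may still be missed.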
It appears to be functioning as intended, but the process is so fast that it is hard to capture the information.
I think the job is continually being deleted and retried, likely because of the fatal error condition detected in the GitJob status. It seems the gitjob controller is designed to respond to such errors by deleting the job to trigger a retry:
https://github.com/rancher/gitjob/blob/release/fleet/v0.9/pkg/controller/gitjob/gitjobs.go#L125
{
  "commit": "4ff289ba5a9108502f83ee41fb17208d84bf2bb0",
  "conditions": [
    {
      "lastUpdateTime": "2024-01-23T07:21:19Z",
      "status": "False",
      "type": "Reconciling"
    },
    {
      "lastUpdateTime": "2024-01-23T07:21:47Z",
      "message": "time=\"2024-01-23T07:21:44Z\" level=fatal msg=\"no chart version found for rancher-logging-45.5.0\"\n",
      "reason": "Stalled",
      "status": "True",
      "type": "Stalled"
    },
    {
      "lastUpdateTime": "2024-01-23T07:21:27Z",
      "status": "True",
      "type": "Synced"
    }
  ],
  "jobStatus": "Failed",
  "lastSyncedTime": "2024-01-23T07:21:27Z",
  "observedGeneration": 5,
  "updateGeneration": 11
}
time="2024-01-23T07:21:19Z" level=info msg="Deleting failed job to trigger retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:21:16Z" level=fatal msg="no chart version found for rancher-logging-45.5.0"\n"
time="2024-01-23T07:22:20Z" level=info msg="Deleting failed job to trigger retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:22:17Z" level=fatal msg="no chart version found for rancher-logging-45.5.0"\n"
I was able to see them in gitjob pod logs...
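If you only want the error out of a status like the one above, a small script can pick out the Stalled condition. This is just a sketch, assuming you have already fetched the status as JSON by whatever means; `stalled_message` is a made-up helper name, and the JSON below is a trimmed copy of the status shown above.

```python
import json

# Trimmed copy of the GitJob status shown above (raw string keeps the escapes intact).
STATUS_JSON = r'''
{
  "conditions": [
    {"lastUpdateTime": "2024-01-23T07:21:19Z", "status": "False", "type": "Reconciling"},
    {
      "lastUpdateTime": "2024-01-23T07:21:47Z",
      "message": "time=\"2024-01-23T07:21:44Z\" level=fatal msg=\"no chart version found for rancher-logging-45.5.0\"\n",
      "reason": "Stalled",
      "status": "True",
      "type": "Stalled"
    },
    {"lastUpdateTime": "2024-01-23T07:21:27Z", "status": "True", "type": "Synced"}
  ],
  "jobStatus": "Failed"
}
'''


def stalled_message(status):
    """Return the message of a Stalled condition whose status is "True", else None."""
    for cond in status.get("conditions", []):
        if cond.get("type") == "Stalled" and cond.get("status") == "True":
            return cond.get("message", "").strip()
    return None


status = json.loads(STATUS_JSON)
message = stalled_message(status)
print(status["jobStatus"], "->", message)
```

Because the job is deleted and retried, the condition may already be gone by the time the status is read, so this only helps if the status is captured at the right moment.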
> For debugging this is really annoying, especially because the failing pods (the fleet container fails) are deleted really fast, so getting the logs is not easy. As a workaround, I use a bash for loop to get the logs of the fleet container as soon as the new pod is launched.
Yes, you can also try "stern". If you know how to match the pod, e.g. by label, you can run stern -n cattle-fleet-system -l "app=fleet-job" and it will tail any output from jobs like that.
Could Fleet and the Rancher UI be extended so that in the UI one can see that a specific git repo is constantly failing?
> Could Fleet and the Rancher UI be extended so that in the UI one can see that a specific git repo is constantly failing?
How would you define "constantly failing"? Like a retry counter, which we reset on a successful deployment?
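To make the question concrete, here is a minimal sketch of what such a counter could look like. Nothing like this exists in Fleet as far as this thread shows; `RetryTracker` and the threshold of 3 are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RetryTracker:
    """Counts consecutive job failures; a successful deployment resets the count."""
    threshold: int = 3             # hypothetical cutoff for "constantly failing"
    consecutive_failures: int = 0

    def record(self, succeeded: bool) -> None:
        """Record the outcome of one job run."""
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def constantly_failing(self) -> bool:
        """True once the number of failures in a row reaches the threshold."""
        return self.consecutive_failures >= self.threshold


tracker = RetryTracker(threshold=3)
for outcome in [False, False, False]:
    tracker.record(outcome)
print(tracker.constantly_failing)   # three failures in a row reach the threshold

tracker.record(True)
print(tracker.constantly_failing)   # the success reset the counter
```

Whether a plain count, a time window, or something else is the right definition is exactly the open question.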
This is working as expected in fleet v0.10.0-rc.13 (Rancher 2.9-head).
I've tested it with Rancher 2.7.9 and, although I can see all the job pods trying to get an invalid version for a Helm chart, it still shows up as ACTIVE in Rancher.
I see this:
NAME                    READY   STATUS   RESTARTS   AGE
supertest-512fe-p4gms   0/2     Error    0          31s
supertest-512fe-jcqnz   0/2     Error    0          23s
supertest-512fe-xxsmw   0/2     Error    0          5s
But Rancher is still showing this:
If we test the same scenario with Rancher 2.9-head we can see:
NAME                      READY   STATUS      RESTARTS   AGE
supertest29-0ea0d-htsmw   0/1     Completed   0          2m9s
supertest29-160cf-d62f4   0/1     Error       0          69s
supertest29-160cf-gmzdv   0/1     Error       0          63s
supertest29-160cf-k9zxd   0/1     Error       0          48s
And, after a few seconds we can see the error in Rancher: (and the error persists)
I'm closing this because its milestone is 2.9.0 and it works fine in that version.