Comments (29)

steuhs commented on August 21, 2024

Do we also want a dashboard that shows whether the HEAD run of each job is successful?

steuhs commented on August 21, 2024

Through what channel should we get the alert? Email, or maybe a Slack bot?

jessiezcc commented on August 21, 2024

Kubernetes already has a dashboard for job status, doesn't it? Slack or GitHub is good since we want community visibility. It would be nice to auto-create an issue and notify OWNERS.

steuhs commented on August 21, 2024

@jessiezcc I chatted with Sen; there is no dashboard that does what I asked above.

steuhs commented on August 21, 2024

I am not sure if GitHub would be a good channel. The only place I can think of for posting the status is the Issues section. Do we want to use the Issues section in this repo to record all the job failures in other repos?

adrcunha commented on August 21, 2024

I suggest using Stackdriver, at least as an initial solution; this way we can have some monitoring up and running ASAP.

steuhs commented on August 21, 2024

It looks like Prow has its own reporting mechanism: https://github.com/kubernetes/test-infra/tree/master/prow/report
I am trying to see if we can build on top of what they have.

steuhs commented on August 21, 2024

Yutong is working on Prow's reporting feature (https://github.com/kubernetes/test-infra/tree/master/prow/report). I am trying to see if, and to what extent, we can use that feature.

steuhs commented on August 21, 2024

Looks like this package has a template and a function to post prowjob issues on GitHub: https://github.com/kubernetes/test-infra/blob/master/prow/report/report.go

cjwagner commented on August 21, 2024

@jessiezcc I chatted with Sen; there is no dashboard that does what I asked above.

What exactly are you referring to? I think we have mechanisms to achieve everything listed on this issue except for reporting job failures to Slack.

steuhs commented on August 21, 2024

@cjwagner I think you are talking about the status context that shows up at the end of each PR page. Isn't that limited to pre-submit checks? There are also post-submit and periodic jobs we want to monitor, I believe.

cjwagner commented on August 21, 2024

That is just one of the mechanisms we have. We have configurable email alerting available through Testgrid, and we can display the status of the last run of a job with SVG badges.

What exactly are you trying to report on?

steuhs commented on August 21, 2024

@cjwagner Who is working on those reporting features you mentioned? I'd like to get more detail on what has been implemented and what is planned.

cjwagner commented on August 21, 2024

Those are Testgrid features, so @michelle192837 is the one who implemented them. Please refer to the documentation first, though; it describes the features and how to use them: https://github.com/kubernetes/test-infra/tree/master/testgrid#email-alerts
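
For reference, a rough sketch of what such a Testgrid alerting configuration might look like, based on the fields described in that README; the test group, dashboard, GCS prefix, and e-mail address below are made up for illustration:

```yaml
# Hypothetical entries in Testgrid's config.yaml; all names are illustrative.
test_groups:
- name: ci-knative-serving-continuous
  gcs_prefix: knative-prow/logs/ci-knative-serving-continuous
  num_failures_to_alert: 1          # alert on the first failed run
  alert_stale_results_hours: 3      # also alert if no fresh results appear

dashboards:
- name: knative-serving
  dashboard_tab:
  - name: continuous
    test_group_name: ci-knative-serving-continuous
    alert_options:
      alert_mail_to_addresses: "oncall@example.com"
```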

adrcunha commented on August 21, 2024

Testgrid e-mail alerting is already enabled by #261. We want lower-level job monitoring so we can act faster when something goes wrong.

cjwagner commented on August 21, 2024

According to that PR body, you configured Testgrid to report only after 3 consecutive failures. If you reported on the first failure, that would have the effect you want, right? Or are you saying that Testgrid's update period itself is too slow for some use case that you have?

adrcunha commented on August 21, 2024

Or are you saying that Testgrid's update period itself is too slow for some use case that you have?

That's correct. Example: suppose we push a bad Prow config and the cron or pull jobs don't run. Currently we have no way of knowing that unless someone stumbles upon it and reports it (e.g., pull test jobs never finish for your PR).

krzyzacy commented on August 21, 2024

Why would it never finish? A timeout on your prowjob should work, right?

If you change some presubmit jobs in your config, you probably always want to manually trigger them on a PR, right? For example, we have a little playground in k/k, like kubernetes/kubernetes#46662.

adrcunha commented on August 21, 2024

Maybe that was just a bad example. But the idea, as Cole put it clearly, is to have some sort of monitoring in place so we are aware of issues with our Prow jobs (presubmits, postsubmits, crons, etc.) way faster than the time it takes for Testgrid to update and for us to check it out and realize that something is not right.

That's the motivation. I'll leave the details to Stephen, who's working on this issue.

steuhs commented on August 21, 2024

@adrcunha I read this: https://github.com/kubernetes/test-infra/tree/master/testgrid#email-alerts. It seems to me that we can use TestGrid with num_failures_to_alert set to 1. With that change, I don't see any use case where TestGrid would be too slow: we don't need to rely on periodic jobs, since we can monitor post-submit jobs to get the failure report as soon as it happens. Please correct me if there is a use case where we cannot use TestGrid's alerting mechanism. @michelle192837 Please provide your opinion as well.

adrcunha commented on August 21, 2024

num_failures_to_alert will only report failed tests, not broken jobs (one that never reports test status, for example). Also, it has a delay of up to 2h due to Testgrid updates. This issue is about monitoring the jobs, not test failures.

michelle192837 commented on August 21, 2024

Adriano is correct that you'll have to deal with the TestGrid update delay either way (though it's a lot less than 2h in the worst case externally; more like a 30-minute delay with bad luck). That said, it does seem like broken jobs should time out and report at some point, producing a failed result, so that the only delay you have to deal with is the update delay.

So I guess if the problem is 'I want to know when my Prow jobs are failing', you can get that (subject to TestGrid's update cycles) with TestGrid alerting. If it's a potential misconfiguration thing for Prow, that seems like it should be added to or handled by Prow presubmit tests? And if Prow jobs are staying up forever, it seems like that should be fixed with a timeout.

ETA: That said, let me know if I'm missing something here. num_failures_to_alert = 1 on a dashboard might be a good first step either way.
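
As a concrete sketch of the misconfiguration point above, a presubmit on the repo holding the Prow config could run Prow's checkconfig tool so that an invalid config never merges; the repo name, image tag, and file paths below are placeholders:

```yaml
# Hypothetical presubmit that validates the Prow config before it can merge;
# repo name, image tag, and file paths are placeholders. The container relies
# on the checkconfig image's default entrypoint.
presubmits:
  knative/test-infra:
  - name: pull-test-infra-validate-prow-config
    always_run: true
    decorate: true
    spec:
      containers:
      - image: gcr.io/k8s-prow/checkconfig:latest
        args:
        - --config-path=prow/config.yaml
        - --plugin-config=prow/plugins.yaml
```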

adrcunha commented on August 21, 2024

Timeouts are already in place, and num_failures_to_alert is already set up (but won't work until we have our own Testgrid backend). On top of that, we want the quickest possible way to identify when Prow jobs are misbehaving; we don't want to wait 30 minutes, or 2h, or for a user to report issues on Slack. Scenarios include bad configs (secrets, ACLs), k8s pod failures, resource exhaustion, etc. Less frequent jobs (like the nightly releases or playground update) are more concerning, since we tend to realize they're broken too late in the game when we rely only on Testgrid (even if the report is automated).

michelle192837 commented on August 21, 2024

Mm, fair enough.

steuhs commented on August 21, 2024

@adrcunha I discussed with @cjwagner: bad configs, such as an invalid container address or wrong secrets, will result in a pending state, and there is no way to tell whether there is any real issue while the job is pending (see kubernetes/test-infra#9694). We cannot alter that unless we change the design and code of Kubernetes itself. To avoid long wait times, we can shorten the timeout for the pending state to 20 minutes or so; we can tune the number by looking into the historical duration of the pending state.
Cole also mentioned that Prow rarely fails before a job starts; it only happens a few times a year, for different reasons.
For the particular case of checking whether the Knative Docker images are missing or inaccessible, there is an internal task assigned to @tcnghia.
Considering the three factors mentioned above, I think the most cost-effective way is to shorten the timeout for the pending state and monitor Prow job status changes.
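
To make the proposed change concrete, here is a rough sketch of what shortening the pending timeout could look like in the Prow config, assuming plank's pod_pending_timeout setting is the right knob and using the 20 minutes suggested above:

```yaml
# Sketch only: cap how long a ProwJob may sit in the pending state before
# plank gives up and marks it as errored. The field name assumes plank's
# pod_pending_timeout option; the exact knob may differ by Prow version.
plank:
  pod_pending_timeout: 20m
```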

adrcunha commented on August 21, 2024
  1. I agree about shortening the timeout.
  2. What's your proposal for monitoring the Prow job status change?
  3. Please clarify and propose how the mentioned task about missing/non-accessible Knative images can be used for monitoring/alert.
  4. Please rule out Stackdriver as a good monitoring solution before we reduce the solution to a simple timeout reduction, which still heavily relies on human checks.

steuhs commented on August 21, 2024

@adrcunha
2. & 4. We don't necessarily need to rule out Stackdriver because the status change can be monitored through there. Indeed it seems to be a better alternative to monitor the status change there, comparing to using Crier. I am doing the investigation. One potential advantage of using Stackdriver would be the alerting system it provides
3. I mean Nghia is working on "Cloud Run GKE prober should fail when release images are missing" (b/120081643). So for those kinds of failures we will have a separate solution
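
A minimal sketch of what such a Stackdriver (Cloud Monitoring) alert might look like, assuming a log-based metric named "prowjob-failures" has already been created to count failure log entries; the metric name, filter, and notification channel are purely illustrative:

```yaml
# Hypothetical Cloud Monitoring AlertPolicy. It assumes a log-based metric
# named "prowjob-failures" already exists and counts failure log entries;
# the metric name, filter, and notification channel are illustrative only.
displayName: "Prow job failures"
combiner: OR
conditions:
- displayName: "Any failed Prow job in the last 5 minutes"
  conditionThreshold:
    filter: 'metric.type="logging.googleapis.com/user/prowjob-failures" resource.type="k8s_container"'
    comparison: COMPARISON_GT
    thresholdValue: 0
    duration: 0s
    aggregations:
    - alignmentPeriod: 300s
      perSeriesAligner: ALIGN_SUM
notificationChannels:
- projects/PROJECT_ID/notificationChannels/CHANNEL_ID
```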

adrcunha commented on August 21, 2024

3 doesn't apply.
For 4, it looks like I wasn't clear. I indeed meant "do NOT rule out Stackdriver unless it's proven that it doesn't help". I advocated using SD as an easy monitoring solution from day 1.

srinivashegde86 commented on August 21, 2024

This should be handled by the knative-monitoring proposal.
