Comments (29)

steuhs commented on August 21, 2024

Do we also want a dashboard that shows whether the HEAD run of each job is successful?

steuhs commented on August 21, 2024

Through what channel should we get the alert? Email, or maybe a Slack bot?

jessiezcc commented on August 21, 2024

Kubernetes already has a dashboard for job status, doesn't it? Slack or GitHub is good since we want community visibility. It would be nice to auto-create an issue and notify OWNERS.

steuhs commented on August 21, 2024

@jessiezcc I chatted with Sen; there is no dashboard that does what I asked above.

steuhs commented on August 21, 2024

I am not sure if GitHub would be a good channel. The only place I can think of for posting the status is the Issues section. Do we want to use the Issues section in this repo to record all the job failures in other repos?

adrcunha commented on August 21, 2024

I suggest using Stackdriver, at least as an initial solution; this way we can have some monitoring up and running ASAP.

steuhs commented on August 21, 2024

It looks like Prow has its own reporting mechanism: https://github.com/kubernetes/test-infra/tree/master/prow/report
I am trying to see if we can build on top of what they have.

steuhs commented on August 21, 2024

Yutong is working on Prow's reporting feature (https://github.com/kubernetes/test-infra/tree/master/prow/report). I am trying to see if, and to what extent, we can use that feature.

steuhs commented on August 21, 2024

Looks like this package has a template and a function to post prowjob issues on GitHub: https://github.com/kubernetes/test-infra/blob/master/prow/report/report.go

cjwagner commented on August 21, 2024

@jessiezcc I chatted with Sen; there is no dashboard that does what I asked above.

What exactly are you referring to? I think we have mechanisms to achieve everything listed on this issue except for reporting job failures to Slack.

steuhs commented on August 21, 2024

@cjwagner I think you are talking about the status context that shows up at the end of each PR page. Isn't that limited to pre-submit checks? There are also post-submit and periodic jobs we want to monitor, I believe.

cjwagner commented on August 21, 2024

That is just one of the mechanisms we have. We have configurable email alerting available through Testgrid, and we can display the status of the last run of a job with SVG badges.

What exactly are you trying to report on?

steuhs commented on August 21, 2024

@cjwagner Who is working on those reporting features you mentioned? I'd like to get more detail on what has been implemented and what is planned.

cjwagner commented on August 21, 2024

Those are Testgrid features, so @michelle192837 is the one who implemented them. Please refer to the documentation first, though; it describes the features and how to use them: https://github.com/kubernetes/test-infra/tree/master/testgrid#email-alerts
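
For reference, a rough sketch of what such a Testgrid alerting configuration might look like, based on the fields described in that README; the test group, dashboard, GCS prefix, and e-mail address below are made up for illustration:

```yaml
# Hypothetical entries in Testgrid's config.yaml; all names are illustrative.
test_groups:
- name: ci-knative-serving-continuous
  gcs_prefix: knative-prow/logs/ci-knative-serving-continuous
  num_failures_to_alert: 1          # alert on the first failed run
  alert_stale_results_hours: 3      # also alert if no fresh results appear

dashboards:
- name: knative-serving
  dashboard_tab:
  - name: continuous
    test_group_name: ci-knative-serving-continuous
    alert_options:
      alert_mail_to_addresses: "oncall@example.com"
```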

adrcunha commented on August 21, 2024

Testgrid e-mail alerting is already enabled by #261. We want lower-level job monitoring so we can act faster when something goes wrong.

cjwagner commented on August 21, 2024

According to that PR body, you configured Testgrid to report only after 3 consecutive failures. If you reported on the first failure, that would have the effect you want, right? Or are you saying that Testgrid's update period itself is too slow for some use case that you have?

adrcunha commented on August 21, 2024

Or are you saying that Testgrid's update period itself is too slow for some use case that you have?

That's correct. Example: suppose we push a bad Prow config and the cron or pull jobs don't run. Currently we have no way of knowing that unless someone stumbles upon it and reports it (e.g., pull test jobs never finish for your PR).

krzyzacy commented on August 21, 2024

Why would it never finish? A timeout on your prowjob should work, right?

If you change some presubmit jobs in your config, you probably always want to manually trigger them on a PR, right? For example, we have a little playground in k/k, like kubernetes/kubernetes#46662.

adrcunha commented on August 21, 2024

Maybe that was just a bad example. But the idea, as Cole put it clearly, is to have some sort of monitoring in place so we are aware of issues with our Prow jobs (presubmits, postsubmits, crons, etc.) way faster than the time it takes for Testgrid to update and for us to check it out and realize that something is not right.

That's the motivation. I'll leave the details to Stephen, who's working on this issue.

steuhs commented on August 21, 2024

@adrcunha I read this: https://github.com/kubernetes/test-infra/tree/master/testgrid#email-alerts. It seems to me that we can use TestGrid with num_failures_to_alert set to 1. With that change, I don't see any use case where TestGrid would be too slow: we don't need to rely on periodic jobs, since we can monitor post-submit jobs to get the failure report as soon as it happens. Please correct me if there is a use case where we cannot use TestGrid's alerting mechanism. @michelle192837 Please provide your opinion as well.

adrcunha commented on August 21, 2024

num_failures_to_alert will only report failed tests, not broken jobs (one that never reports test status, for example). Also, it has a delay of up to 2h due to Testgrid updates. This issue is about monitoring the jobs, not test failures.

michelle192837 commented on August 21, 2024

Adriano is correct that you'll have to deal with the TestGrid update delay either way (though it's a lot less than 2h in the worst case externally; more like a 30-minute delay with bad luck). That said, it does seem like broken jobs should time out and report at some point, producing a failed result, so that the only delay you have to deal with is the update delay.

So I guess if the problem is 'I want to know when my Prow jobs are failing', you can get that (subject to TestGrid's update cycles) with TestGrid alerting. If it's a potential misconfiguration thing for Prow, that seems like it should be added to or handled by Prow presubmit tests? And if Prow jobs are staying up forever, it seems like that should be fixed with a timeout.

ETA: That said, let me know if I'm missing something here. num_failures_to_alert = 1 on a dashboard might be a good first step either way.
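
As a concrete sketch of the misconfiguration point above, a presubmit on the repo holding the Prow config could run Prow's checkconfig tool so that an invalid config never merges; the repo name, image tag, and file paths below are placeholders:

```yaml
# Hypothetical presubmit that validates the Prow config before it can merge;
# repo name, image tag, and file paths are placeholders. The container relies
# on the checkconfig image's default entrypoint.
presubmits:
  knative/test-infra:
  - name: pull-test-infra-validate-prow-config
    always_run: true
    decorate: true
    spec:
      containers:
      - image: gcr.io/k8s-prow/checkconfig:latest
        args:
        - --config-path=prow/config.yaml
        - --plugin-config=prow/plugins.yaml
```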

adrcunha commented on August 21, 2024

Timeouts are already in place, and num_failures_to_alert is already set up (but won't work until we have our own Testgrid backend). On top of that, we want the quickest possible way to identify when Prow jobs are misbehaving; we don't want to wait 30 minutes, or 2h, or for a user to report issues on Slack. Scenarios include bad configs (secrets, ACLs), k8s pod failures, resource exhaustion, etc. Less frequent jobs (like the nightly releases or playground update) are more concerning, since we tend to realize they're broken too late in the game when we rely only on Testgrid (even if the report is automated).

michelle192837 commented on August 21, 2024

Mm, fair enough.

steuhs commented on August 21, 2024

@adrcunha I discussed with @cjwagner: bad configs, such as an invalid container address or wrong secrets, will result in a pending state, and there is no way to tell whether there is any real issue while the job is pending (see kubernetes/test-infra#9694). We cannot alter that unless we change the design and code of Kubernetes itself. To avoid long wait times, we can shorten the timeout for the pending state to 20 minutes or so; we can tune the number by looking into the historical duration of the pending state.
Cole also mentioned that Prow rarely fails before a job starts; it only happens a few times a year, for different reasons.
For the particular case of checking whether the Knative Docker images are missing or inaccessible, there is an internal task assigned to @tcnghia.
Considering the three factors mentioned above, I think the most cost-effective way is to shorten the timeout for the pending state and monitor Prow job status changes.
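
To make the proposed change concrete, here is a rough sketch of what shortening the pending timeout could look like in the Prow config, assuming plank's pod_pending_timeout setting is the right knob and using the 20 minutes suggested above:

```yaml
# Sketch only: cap how long a ProwJob may sit in the pending state before
# plank gives up and marks it as errored. The field name assumes plank's
# pod_pending_timeout option; the exact knob may differ by Prow version.
plank:
  pod_pending_timeout: 20m
```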

adrcunha commented on August 21, 2024
  1. I agree about shortening the timeout.
  2. What's your proposal for monitoring the Prow job status change?
  3. Please clarify and propose how the mentioned task about missing/non-accessible Knative images can be used for monitoring/alert.
  4. Please rule out Stackdriver as a good monitoring solution before we reduce the solution to a simple timeout reduction, which still heavily relies on human checks.

steuhs commented on August 21, 2024

@adrcunha
2. & 4. We don't necessarily need to rule out Stackdriver because the status change can be monitored through there. Indeed it seems to be a better alternative to monitor the status change there, comparing to using Crier. I am doing the investigation. One potential advantage of using Stackdriver would be the alerting system it provides
3. I mean Nghia is working on "Cloud Run GKE prober should fail when release images are missing" (b/120081643). So for those kinds of failures we will have a separate solution
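
A minimal sketch of what such a Stackdriver (Cloud Monitoring) alert might look like, assuming a log-based metric named "prowjob-failures" has already been created to count failure log entries; the metric name, filter, and notification channel are purely illustrative:

```yaml
# Hypothetical Cloud Monitoring AlertPolicy. It assumes a log-based metric
# named "prowjob-failures" already exists and counts failure log entries;
# the metric name, filter, and notification channel are illustrative only.
displayName: "Prow job failures"
combiner: OR
conditions:
- displayName: "Any failed Prow job in the last 5 minutes"
  conditionThreshold:
    filter: 'metric.type="logging.googleapis.com/user/prowjob-failures" resource.type="k8s_container"'
    comparison: COMPARISON_GT
    thresholdValue: 0
    duration: 0s
    aggregations:
    - alignmentPeriod: 300s
      perSeriesAligner: ALIGN_SUM
notificationChannels:
- projects/PROJECT_ID/notificationChannels/CHANNEL_ID
```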

adrcunha commented on August 21, 2024

3 doesn't apply.
For 4, it looks like I wasn't clear. I indeed meant "do NOT rule out Stackdriver unless it's proven that it doesn't help". I advocated using SD as an easy monitoring solution from day 1.

srinivashegde86 commented on August 21, 2024

This should be handled by the knative-monitoring proposal.
