Comments (15)

johnml1135 commented on June 10, 2024

Eli - look at the auto-restarting. I'll get the alerting to work.

from machine.

Enkidu93 commented on June 10, 2024

I haven't found a way to auto-restart. One thing I could do is catch the exception to keep the Hangfire server from crashing in this particular situation (ClearML apparently being down), but that wouldn't be a general solution. Thoughts? @johnml1135

ddaspit commented on June 10, 2024

We should definitely handle the exception more gracefully. @johnml1135 Is there some way to restart the server in Kubernetes?

johnml1135 commented on June 10, 2024

@Enkidu93 - this may help for auto-restarting the service: https://docs.hangfire.io/en/latest/background-processing/dealing-with-exceptions.html

johnml1135 commented on June 10, 2024

This is implementing the standard guidance.

Enkidu93 commented on June 10, 2024

@johnml1135 Doesn't that just allow jobs to be retried? I thought this was a question of the Hangfire server crashing. Retrying may be helpful (and I wonder if it would be good to configure it further, say by specifying the DelayInSeconds parameter, because I imagine a common cause would be an unavailable external service, and retrying so quickly wouldn't be reasonable), but I wonder whether it has really solved the issue. I know you've created a separate issue for handling the ClearML-related exception, but this seems like a more generic question: what do we do if the Hangfire job server crashes? Maybe I'm misunderstanding.

johnml1135 commented on June 10, 2024

@ddaspit - I believe you are right - it is about retrying jobs, not restarting the server. Are there any dependencies between Hangfire and the rest of the code? Could it be a completely separate executable running either in the same container or in a different one?

johnml1135 commented on June 10, 2024

The fix actually crashed the engine; I need to test these things out first. It is not needed. Let's instead do what we originally intended, which is to restart the server if it crashes.

ddaspit commented on June 10, 2024

When the job server starts, it retrieves the access token for the ClearML API. If this throws, then it is a fatal error and the server crashes. We should restart the server if it crashes, but we can also better handle this particular exception. We could configure the HttpClient to retry using Polly.
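
The retry idea in this comment can be illustrated with a minimal, language-neutral sketch. It is written in Python for brevity and is not the Polly API or the project's code: `retry_with_backoff` and `fetch_access_token` are hypothetical names, and the failing token request is simulated. It assumes the transient failure surfaces as a `ConnectionError` and that doubling the delay between attempts is acceptable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the failure
            # Wait 1x, 2x, 4x, ... base_delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Hypothetical stand-in for the token request: fails twice, then succeeds.
calls = {"count": 0}


def fetch_access_token() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service unavailable")
    return "token-abc"


# Record the delays instead of really sleeping so the example runs instantly.
delays = []
token = retry_with_backoff(fetch_access_token, sleep=delays.append)
```

With a wrapper like this around the startup call, a short outage of the external service delays startup rather than ending it; an outage longer than the retry budget still raises, which is why restarting the server remains a separate concern.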

Enkidu93 commented on June 10, 2024

What has yet to be done here, @johnml1135?

johnml1135 commented on June 10, 2024

So, you implemented Polly in #66, but I don't believe the Hangfire server restart has been resolved. Could you investigate further and see if there is a good resolution?

Enkidu93 commented on June 10, 2024

@johnml1135 Is it as simple as setting the Docker containers to restart on failure (which is very easy to do)? Or do you see a reason that won't work?
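
To make the restart-on-failure behaviour concrete, here is a rough sketch of what such a policy does, written as a small Python supervisor loop. It is only an illustration of the idea, not how Docker or Kubernetes implement it: `run_with_restart` and the crashing worker command are hypothetical, and the sketch assumes a crash shows up as a non-zero exit code.

```python
import subprocess
import sys
import time


def run_with_restart(command, max_restarts=3, delay_seconds=0.0):
    """Run command, restarting it while it exits non-zero, up to max_restarts times."""
    restarts = 0
    while True:
        exit_code = subprocess.run(command).returncode
        # Stop on a clean exit, or once the restart budget is used up.
        if exit_code == 0 or restarts >= max_restarts:
            return exit_code, restarts
        restarts += 1
        time.sleep(delay_seconds)


# Hypothetical worker process that always crashes with exit code 1.
crashing_worker = [sys.executable, "-c", "import sys; sys.exit(1)"]
exit_code, restarts = run_with_restart(crashing_worker, max_restarts=2)
```

Docker's `on-failure` restart policy follows the same outline, including an optional cap on the number of retries. Note that a restart only helps if the process actually exits when the job server dies; if the host process keeps running after the crash, the supervisor never sees a failure.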

johnml1135 commented on June 10, 2024

Enkidu93 commented on June 10, 2024

@johnml1135 The decision here was to ignore this until it pops up again, correct? Just wanted to document that if so.

johnml1135 commented on June 10, 2024

Yes.
