Comments (15)
Eli - look at the auto-restarting. I'll get the alerting to work.
from machine.
I haven't found a way to auto-restart. One thing I could do is catch the exception to keep the Hangfire server from crashing in this particular situation (apparently ClearML being down), but it wouldn't be a general solution. Thoughts? @johnml1135
We should definitely handle the exception more gracefully. @johnml1135 Is there some way to restart the server in Kubernetes?
@Enkidu93 - this may help for auto-restarting the service: https://docs.hangfire.io/en/latest/background-processing/dealing-with-exceptions.html
This is implementing the standard guidance.
@johnml1135 Doesn't that just allow jobs to be retried? Isn't this a question of the Hangfire server crashing? Retries may be helpful (and I wonder if it would be good to configure them further, say, by specifying the DelayInSeconds parameter, because I imagine a common cause of this would be unavailable external services, and it wouldn't be reasonable to retry so quickly), but I wonder whether they have really solved the issue. I know you've created a separate issue for handling the ClearML-related exception, but this seems like a more generic issue: what do we do if the Hangfire job server crashes? Maybe I'm misunderstanding.
@ddaspit - I believe you are right: it is about retrying jobs, not restarting the server. Are there any dependencies between Hangfire and the rest of the code? Could it be a completely separate executable running either in the same container or in a different one?
The fix actually crashed the engine. I need to test these things out first. It is not needed. Let's instead try the original approach, which is to restart the server if it crashes.
When the job server starts, it retrieves the access token for the ClearML API. If this throws, then it is a fatal error and the server crashes. We should restart the server if it crashes, but we can also better handle this particular exception. We could configure the HttpClient to retry using Polly.
What has yet to be done here, @johnml1135 ?
So, you implemented Polly in #66, but I don't believe that the Hangfire server restart problem has been resolved. Could you investigate further and see if there is a good resolution?
@johnml1135 Is it as simple as setting the docker containers to restart on failure (which is very easy to do)? Or do you see a reason that won't work?
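For reference, a restart-on-failure policy in Docker Compose could look roughly like the sketch below. The service name and image are hypothetical placeholders, not taken from this project's actual compose file.

```yaml
services:
  machine-job-server:                           # hypothetical service name
    image: example/machine-job-server:latest    # hypothetical image
    restart: on-failure                         # restart when the main process exits with a non-zero code
```

This only helps if the crash actually terminates the container's main process. In Kubernetes, pods managed by a Deployment run with `restartPolicy: Always`, so a container whose process exits is restarted by the kubelet under the same condition.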
@johnml1135 The decision here was to ignore this until it pops up again, correct? Just wanted to document that if so.
Yes.