Comments (7)
Need to work on this ASAP
from slurmr.
You would probably want the option to run them with a different seed when you restart them as well.
It would be good to be able to rerun these failed jobs with the same random seed and arguments, to regenerate the problem, or with a different seed and arguments, for debugging.
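A minimal sketch of the same-seed-rerun idea. The per-task seeding scheme below (base seed plus array index) is an assumption for illustration, not necessarily how sluRm seeds jobs internally:

```r
# Sketch only: derive each array task's seed from a base seed plus the
# task index, so a failed task can be rerun with exactly the same
# stream, or under a fresh base seed for debugging.
run_task <- function(task_id, base_seed = 123L) {
  set.seed(base_seed + task_id)  # reproducible per-task seed
  rnorm(3)                       # stand-in for the real computation
}

# Same base seed reproduces a failed task exactly...
identical(run_task(15L), run_task(15L))                   # TRUE
# ...while a different base seed gives a fresh draw for debugging
identical(run_task(15L), run_task(15L, base_seed = 999L)) # FALSE
```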
How about a function to return a list of argument sets corresponding to the failed jobs? That way they could be rerun as is or modified according to the user's need.
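Such a helper could be as simple as the sketch below; the name failed_args() and the list-of-argument-sets representation are hypothetical, not part of sluRm:

```r
# Hypothetical helper: given the full list of argument sets and the
# indices of the failed jobs, return only the argument sets that need
# a rerun, so they can be resubmitted as-is or modified first.
failed_args <- function(all_args, failed_idx) {
  all_args[failed_idx]
}

all_args <- list(
  list(n = 10, seed = 1),
  list(n = 20, seed = 2),
  list(n = 30, seed = 3)
)

# Jobs 1 and 3 failed: grab their argument sets, tweak if needed, rerun
redo <- failed_args(all_args, c(1, 3))
length(redo)  # 2
```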
Perhaps it should be an option of sbatch (the underlying function that submits the jobs). The function could gain a new argument, e.g. what, which indicates which parts of the job array should be resubmitted. In that case, there could be a function that returns the sequence of failed jobs. This implies somewhat reimagining the workflow. A few changes that I foresee:
- Submit the job. The last job should be kept in memory so that the user can grab it by typing something like last_job(). This is important b/c users can use the collect = TRUE option and still be able to access the last job.
- Have a function called read_job or something like that that allows recovering any job. For this, when saving the auxiliary files we should also save either a plain-text file or a binary rds file with the job call itself, so that users can recover a job setup by simply typing the path to the Slurm job folder.
- For debugging, one thing I find myself doing all the time is looking at the job directory, and in particular reading the log files generated by Slurm. parallel::mclapply has a tryCatch wrapper, so sometimes I cannot tell whether a job has failed or not; what I usually do then is load one of the datasets and look at the data directly. In such cases mclapply usually returns a warning.
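The last_job() and read_job() ideas above could be sketched as follows. Every name here (remember_job(), last_job(), save_job(), read_job()) is hypothetical, and the on-disk layout (a job.rds file inside the job folder) is an assumption:

```r
# Hypothetical helpers sketching the last_job()/read_job() ideas;
# none of these names are part of sluRm.
.jobs <- new.env(parent = emptyenv())

remember_job <- function(job) {
  assign("last", job, envir = .jobs)  # would be called at submit time
  invisible(job)
}

last_job <- function() get("last", envir = .jobs)

save_job <- function(job, path) {
  # Persist the job object (including its call) next to the aux files
  saveRDS(job, file.path(path, "job.rds"))
}

read_job <- function(path) readRDS(file.path(path, "job.rds"))

# Usage sketch: remember at submit time, recover later from the folder
job <- list(call = quote(Slurm_EvalQ(WhoAmI(), njobs = 20)), njobs = 20L)
remember_job(job)
d <- tempfile(); dir.create(d)
save_job(job, d)
identical(read_job(d)$call, last_job()$call)  # TRUE
```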
I could try to tag the errors automatically and have a function, as mentioned earlier, to resubmit only a subset of the job (in this case, the jobs that failed). I need to check how to modify the ARRAY variable in the batch file, and what the limit on that string is for Slurm. That should be in MaxArraySize in the slurm.conf file.
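Building that array string from the failed indices could look like the sketch below (a hypothetical helper, not part of sluRm; sbatch's --array option accepts comma-separated indices and dash ranges, subject to MaxArraySize):

```r
# Hypothetical helper: collapse a vector of failed job indices into the
# compact range syntax accepted by sbatch --array.
array_spec <- function(idx) {
  idx <- sort(unique(idx))
  # split into runs of consecutive indices
  runs <- split(idx, cumsum(c(1, diff(idx) != 1)))
  paste(vapply(runs, function(r) {
    if (length(r) == 1L) as.character(r) else paste0(r[1L], "-", r[length(r)])
  }, character(1)), collapse = ",")
}

array_spec(c(1, 2, 15:20))  # "1-2,15-20"
```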
Maybe your #2 covers this? I just ran a bunch of jobs, some of which failed for unknown reasons; the 02-output-* files were not generated for the failed jobs. When I tried to rerun some of those jobs individually, by job number, I was not able to recreate the error/failure; that is, the jobs completed successfully the second time around even though I used the same random seed. The problem is that now I cannot debug, because I cannot regenerate the error.
OK, so part of this was implemented in 1dccce5. Now jobs can be resubmitted very easily:
```r
library(sluRm)

# A simple expr evaluation; WhoAmI() gives some info about the node
x <- Slurm_EvalQ(sluRm::WhoAmI(), njobs = 20, plan = "wait")

# Suppose jobs 1, 2, and 15-20 failed; then we can do the following
sbatch(x, array = "1,2,15-20")

# And, if the status is OK, collect the entire array safely
ans <- Slurm_collect(x)
```
To keep it simple, users can check the status of a given submission with the state function, or by simply calling sacct. state returns an integer scalar telling whether the job has been submitted, is done, has failed, or is still running, along with a set of attributes that enumerate the state of each job in the array. So I think this is done :).
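A sketch of what that return shape could look like; the code value and the attribute names below are assumptions for illustration, not sluRm's actual API:

```r
# Hypothetical status value: an integer scalar code with per-job
# detail carried in attributes, as described above.
st <- structure(
  99L,                       # assumed code for "completed with failures"
  failed = c(1L, 2L, 15L),   # indices of failed array tasks
  done   = c(3:14, 16:20)    # indices of tasks that finished OK
)

# The failed indices could then feed straight back into resubmission,
# e.g. sbatch(x, array = "1,2,15")
attr(st, "failed")  # 1 2 15
```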