Comments (7)
Need to work on this ASAP
from slurmr.
You would probably want the option to run them with a different seed when you restart them as well.
It would be good to be able to rerun these failed jobs with the same random seed and arguments, to regenerate the problem, or with a different seed and arguments, for debugging.
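A minimal sketch of the same-seed-rerun idea. The per-task seeding scheme below (base seed plus array index) is an assumption for illustration, not necessarily how sluRm seeds jobs internally:

```r
# Sketch only: derive each array task's seed from a base seed plus the
# task index, so a failed task can be rerun with exactly the same
# stream, or under a fresh base seed for debugging.
run_task <- function(task_id, base_seed = 123L) {
  set.seed(base_seed + task_id)  # reproducible per-task seed
  rnorm(3)                       # stand-in for the real computation
}

# Same base seed reproduces a failed task exactly...
identical(run_task(15L), run_task(15L))                   # TRUE
# ...while a different base seed gives a fresh draw for debugging
identical(run_task(15L), run_task(15L, base_seed = 999L)) # FALSE
```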
How about a function to return a list of argument sets corresponding to the failed jobs? That way they could be rerun as is or modified according to the user's need.
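Such a helper could be as simple as the sketch below; the name failed_args() and the list-of-argument-sets representation are hypothetical, not part of sluRm:

```r
# Hypothetical helper: given the full list of argument sets and the
# indices of the failed jobs, return only the argument sets that need
# a rerun, so they can be resubmitted as-is or modified first.
failed_args <- function(all_args, failed_idx) {
  all_args[failed_idx]
}

all_args <- list(
  list(n = 10, seed = 1),
  list(n = 20, seed = 2),
  list(n = 30, seed = 3)
)

# Jobs 1 and 3 failed: grab their argument sets, tweak if needed, rerun
redo <- failed_args(all_args, c(1, 3))
length(redo)  # 2
```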
Perhaps it should be an option of sbatch (the underlying function that submits the jobs). The function could gain a new argument, e.g. what, which indicates which parts of the job array should be resubmitted. In that case, there could be a function that returns the sequence of failed jobs. This implies somewhat reimagining the workflow. A few changes that I foresee:
- Submit the job. The last job should be kept in memory so that the user can grab it by typing something like last_job(). This is important b/c users can use the collect = TRUE option and still be able to access the last job.
- Have a function called read_job or something like that that allows recovering any job. For this, when saving the auxiliary files we should also save either a plain-text file or a binary rds file with the job call itself, so that users can recover a job setup by simply typing the path to the Slurm job folder.
- For debugging, one thing I find myself doing all the time is looking at the job directory, and in particular reading the log files generated by Slurm. parallel::mclapply has a tryCatch wrapper, so sometimes I cannot tell whether a job has failed or not; what I usually do then is load one of the datasets and look at the data directly. In such cases mclapply usually returns a warning.
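The last_job() and read_job() ideas above could be sketched as follows. Every name here (remember_job(), last_job(), save_job(), read_job()) is hypothetical, and the on-disk layout (a job.rds file inside the job folder) is an assumption:

```r
# Hypothetical helpers sketching the last_job()/read_job() ideas;
# none of these names are part of sluRm.
.jobs <- new.env(parent = emptyenv())

remember_job <- function(job) {
  assign("last", job, envir = .jobs)  # would be called at submit time
  invisible(job)
}

last_job <- function() get("last", envir = .jobs)

save_job <- function(job, path) {
  # Persist the job object (including its call) next to the aux files
  saveRDS(job, file.path(path, "job.rds"))
}

read_job <- function(path) readRDS(file.path(path, "job.rds"))

# Usage sketch: remember at submit time, recover later from the folder
job <- list(call = quote(Slurm_EvalQ(WhoAmI(), njobs = 20)), njobs = 20L)
remember_job(job)
d <- tempfile(); dir.create(d)
save_job(job, d)
identical(read_job(d)$call, last_job()$call)  # TRUE
```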
I could try to tag the errors automatically and have a function, as mentioned earlier, to resubmit only a subset of the job (in this case, the jobs that failed). I need to check how to modify the ARRAY variable in the batch file, and what the limit on that string is for Slurm. That should be in MaxArraySize in the slurm.conf file.
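Building that array string from the failed indices could look like the sketch below (a hypothetical helper, not part of sluRm; sbatch's --array option accepts comma-separated indices and dash ranges, subject to MaxArraySize):

```r
# Hypothetical helper: collapse a vector of failed job indices into the
# compact range syntax accepted by sbatch --array.
array_spec <- function(idx) {
  idx <- sort(unique(idx))
  # split into runs of consecutive indices
  runs <- split(idx, cumsum(c(1, diff(idx) != 1)))
  paste(vapply(runs, function(r) {
    if (length(r) == 1L) as.character(r) else paste0(r[1L], "-", r[length(r)])
  }, character(1)), collapse = ",")
}

array_spec(c(1, 2, 15:20))  # "1-2,15-20"
```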
Maybe your #2 covers this? I just ran a bunch of jobs, some of which failed for unknown reasons; the 02-output-* files were not generated for the failed jobs. When I tried to rerun some of those jobs individually, by job number, I was not able to recreate the error/failure; that is, the jobs completed successfully the second time around even though I used the same random seed. The problem is that now I cannot debug, because I cannot regenerate the error.
OK, so part of this was implemented in 1dccce5. Now jobs can be resubmitted very easily:
```r
library(sluRm)

# A simple expr evaluation; WhoAmI() gives some info about the node
x <- Slurm_EvalQ(sluRm::WhoAmI(), njobs = 20, plan = "wait")

# Suppose jobs 1, 2, and 15-20 failed; then we can do the following
sbatch(x, array = "1,2,15-20")

# And, if the status is OK, collect the entire array safely
ans <- Slurm_collect(x)
```
To keep it simple, users can check the status of a given submission with the state function, or by simply calling sacct. state returns an integer scalar telling whether the job has been submitted, is done, has failed, or is still running, along with a set of attributes that enumerate the state of each job in the array. So I think this is done :).
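A sketch of what that return shape could look like; the code value and the attribute names below are assumptions for illustration, not sluRm's actual API:

```r
# Hypothetical status value: an integer scalar code with per-job
# detail carried in attributes, as described above.
st <- structure(
  99L,                       # assumed code for "completed with failures"
  failed = c(1L, 2L, 15L),   # indices of failed array tasks
  done   = c(3:14, 16:20)    # indices of tasks that finished OK
)

# The failed indices could then feed straight back into resubmission,
# e.g. sbatch(x, array = "1,2,15")
attr(st, "failed")  # 1 2 15
```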