Comments (7)

gvegayon commented on May 28, 2024

Need to work on this ASAP

from slurmr.

pmarjora commented on May 28, 2024

You would probably want the option to run them with a different seed when you restart them as well.

millstei commented on May 28, 2024

It would be good to be able to rerun these failed jobs either with the same random seed and arguments, to reproduce the problem, or with a different seed and arguments, for debugging.

millstei commented on May 28, 2024

How about a function to return a list of argument sets corresponding to the failed jobs? That way they could be rerun as is or modified according to the user's need.
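
A minimal sketch of that idea in plain R, assuming each array job has one argument set that includes its seed. The names arg_sets, failed, and failed_args are hypothetical and are not part of the package:

```r
# Hypothetical data: one argument set (including a seed) per array job
arg_sets <- lapply(1:5, function(i) list(n = 100, seed = 1000 + i))

# Hypothetical failure flags, one per job (TRUE = the job failed)
failed <- c(FALSE, TRUE, FALSE, FALSE, TRUE)

# Return the argument sets of the failed jobs, named by job index
failed_args <- function(arg_sets, failed) {
  stopifnot(length(arg_sets) == length(failed))
  out <- arg_sets[failed]
  names(out) <- which(failed)
  out
}

todo <- failed_args(arg_sets, failed)

# Option 1: rerun as is, with the same seed and arguments
rerun <- lapply(todo, function(a) {
  set.seed(a$seed)
  mean(rnorm(a$n))
})

# Option 2: modify the argument sets first, e.g. use a different seed
todo_new <- lapply(todo, function(a) modifyList(a, list(seed = a$seed + 1)))
```

Keeping the seed inside each argument set is a design choice for this sketch: it makes a rerun of a single failed job self-contained.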

gvegayon commented on May 28, 2024

Perhaps it should be an option of sbatch (the underlying function that submits the jobs). The function could gain a new argument, e.g. what, which specifies which parts of the job should be resubmitted. In that case, there could be a function that returns the sequence of failed jobs. This implies somewhat reimagining the workflow. A few changes that I foresee:

  1. Submit the job. The last job should be kept in memory so that the user can grab it by typing something like last_job(). This is important b/c users can use the collect = TRUE option and still be able to access the last job.

  2. Have a function called read_job, or something similar, that allows recovering any job. For this, when saving the auxiliary files we should also save either a plain text file or a binary rds file with the job call itself, so that users can recover a job setup by simply typing the path to the Slurm job folder.

  3. For debugging, one thing I find myself doing all the time is:

    • Looking at the job directory, and in particular, reading the log files generated by Slurm.
    • parallel::mclapply has a tryCatch wrapper, so sometimes I cannot tell whether a job has failed or not. What I usually do is load one of the datasets and look at the data directly. In such cases mclapply usually returns a warning.

    I could try to tag the errors automatically and have a function, as mentioned earlier, to try to resubmit the job, but only a subset (in this case, those that failed). Need to check how to modify the ARRAY variable in the batch file, and what is the limit on that string for Slurm. That should be in MaxArraySize in the slurm.conf file.
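
As a rough sketch of the array part: Slurm's array option accepts comma-separated indices and ranges, so a vector of failed job indices can be collapsed into a compact string. The helper compress_array_ids below is hypothetical (not part of the package), and it does not check the result against MaxArraySize:

```r
# Hypothetical helper: collapse job indices into a compact
# Slurm array specification (comma-separated indices and ranges)
compress_array_ids <- function(ids) {
  ids <- sort(unique(as.integer(ids)))
  if (length(ids) == 0L) return("")

  # Start a new run whenever the gap to the previous index is not 1
  run <- cumsum(c(1L, diff(ids) != 1L))

  # Write each run as a single index or as "first-last"
  ranges <- tapply(ids, run, function(r) {
    if (length(r) == 1L) as.character(r)
    else paste0(r[1L], "-", r[length(r)])
  })

  paste(ranges, collapse = ",")
}

# Example: jobs 1, 2, and 15 to 20 failed
failed <- c(1, 2, 15:20)
array_spec <- compress_array_ids(failed)
print(array_spec)
```

Collapsing consecutive indices into ranges keeps the string short, which matters if there turns out to be a limit on its length.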

millstei commented on May 28, 2024

Maybe your #2 covers this? I just ran a bunch of jobs, some of which failed for unknown reasons; the 02-output-* files were not generated for the failed jobs. When I tried to rerun some of those jobs individually, by job number, I could not recreate the failure: the jobs completed successfully the second time around even though I used the same random seed. The problem is that now I cannot debug because I cannot regenerate the error.

gvegayon commented on May 28, 2024

OK, so part of this was implemented in 1dccce5. Now jobs can be resubmitted very easily:

library(sluRm)

# A simple expr evaluation, WhoAmI() gives some info about the node
x <- Slurm_EvalQ(sluRm::WhoAmI(), njobs = 20, plan = "wait")

# Suppose jobs 1, 2, and 15-20 failed, then we can do the following
sbatch(x, array = "1,2,15-20")

# And, if the status is OK, safely collect the entire array
ans <- Slurm_collect(x)

To keep it simple, users can check the status of a given submission with the state function, or simply by calling sacct. state returns an integer scalar indicating whether the job has been submitted, is done, has failed, or is still running, along with a set of attributes that enumerate the state of each job in the array. So I think this is done :).
