slurmR: A Lightweight Wrapper for Slurm

Home Page: https://uscbiostats.github.io/slurmR/

License: Other

Languages: R 94.31%, TeX 2.07%, Shell 1.64%, Makefile 1.98%
Topics: hpc, slurm, rpackage, rstats, bioinformatics

slurmR's Introduction


Slurm Workload Manager is a popular HPC cluster job scheduler found in many of the top 500 supercomputers. The slurmR R package provides an R wrapper to it that matches the parallel package's syntax; that is, just as parallel provides parLapply, clusterMap, parSapply, etc., slurmR provides Slurm_lapply, Slurm_Map, Slurm_sapply, etc.

While there are other alternatives such as future.batchtools, batchtools, clustermq, and rslurm, this R package has the following goals:

  1. It is dependency-free, which means that it works out of the box.

  2. It emphasizes being similar to the workflow of the R package parallel.

  3. It provides a general framework for creating your own personalized wrappers without using template files.

  4. It is specialized in Slurm, meaning more flexibility (no need to modify template files) and debugging tools (e.g., job resubmission).

  5. It provides a backend for the parallel package, with an out-of-the-box method for creating socket cluster objects for multi-node operations. (See the examples below on how to use it with other R packages.)

Check out the VS section for a comparison of slurmR with other R packages. Wondering who is using Slurm? Check out the list at the end of this document.

Installation

From your HPC command line, you can install the development version from GitHub with:

$ git clone https://github.com/USCbiostats/slurmR.git
$ R CMD INSTALL slurmR/ 

The second line assumes you have R available on your system (usually loaded via module load R or some other command). Alternatively, you can install it with devtools from within R:

# install.packages("devtools")
devtools::install_github("USCbiostats/slurmR")

Citation

To cite slurmR in publications use:

  Vega Yon et al., (2019). slurmR: A lightweight wrapper for HPC with
  Slurm. Journal of Open Source Software, 4(39), 1493,
  https://doi.org/10.21105/joss.01493

And the actual R package:

  Vega Yon G, Marjoram P (2022). _slurmR: A Lightweight Wrapper for
  'Slurm'_. R package version 0.5-2,
  <https://github.com/USCbiostats/slurmR>.

To see these entries in BibTeX format, use 'print(<citation>,
bibtex=TRUE)', 'toBibtex(.)', or set
'options(citation.bibtex.max=999)'.

Running slurmR with Docker

For testing purposes, slurmR is available on Docker Hub. The rcmdcheck and interactive images are built on top of xenonmiddleware/slurm.

Once you download the files contained in the slurmR repository, you can go to the docker folder and use the Makefile included there to start a Unix session with slurmR and Slurm included.

To test slurmR using docker, check the README.md file located at https://github.com/USCbiostats/slurmR/tree/master/docker.

Examples

Example 1: Computing means (and looking under the hood)

library(slurmR)
#  Loading required package: parallel
#  slurmR default option for `tmp_path` (used to store auxiliar files) set to:
#    /home/george/Documents/development/slurmR
#  You can change this and checkout other slurmR options using: ?opts_slurmR, or you could just type "opts_slurmR" on the terminal.

# Suppose that we have 100 vectors of length 50 ~ Unif(0,1)
set.seed(881)
x <- replicate(100, runif(50), simplify = FALSE)

We can use the function Slurm_lapply to distribute computations:

ans <- Slurm_lapply(x, mean, plan = "none")
#  Warning in normalizePath(file.path(tmp_path, job_name)):
#  path[1]="/home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18":
#  No such file or directory
#  Warning: [submit = FALSE] The job hasn't been submitted yet. Use sbatch() to submit the job, or you can submit it via command line using the following:
#  sbatch --job-name=slurmr-job-113bd5bca5b18 /home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/01-bash.sh
Slurm_clean(ans) # Cleaning after you

Notice the plan = "none" option; this tells Slurm_lapply to only create the job object but do nothing with it, i.e., skip submission. To get more info, we can turn the verbose mode on:

opts_slurmR$verbose_on()
ans <- Slurm_lapply(x, mean, plan = "none")
#  Warning in normalizePath(file.path(tmp_path, job_name)):
#  path[1]="/home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18":
#  No such file or directory
#  --------------------------------------------------------------------------------
#  [VERBOSE MODE ON] The R script that will be used is located at: /home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/00-rscript.r and has the following contents:
#  --------------------------------------------------------------------------------
#  .libPaths(c("/home/george/R/x86_64-pc-linux-gnu-library/4.2", "/usr/local/lib/R/site-library", "/usr/lib/R/site-library", "/usr/lib/R/library"))
#  message("[slurmR info] Loading variables and functions... ", appendLF = FALSE)
#  Slurm_env <- function (x = "SLURM_ARRAY_TASK_ID") 
#  {
#      y <- Sys.getenv(x)
#      if ((x == "SLURM_ARRAY_TASK_ID") && y == "") {
#          return(1)
#      }
#      y
#  }
#  ARRAY_ID  <- as.integer(Slurm_env("SLURM_ARRAY_TASK_ID"))
#  
#  # The -snames- function creates the write names for I/O of files as a 
#  # function of the ARRAY_ID
#  snames    <- function (type, array_id = NULL, tmp_path = NULL, job_name = NULL) 
#  {
#      if (length(array_id) && length(array_id) > 1) 
#          return(sapply(array_id, snames, type = type, tmp_path = tmp_path, 
#              job_name = job_name))
#      type <- switch(type, r = "00-rscript.r", sh = "01-bash.sh", 
#          out = "02-output-%A-%a.out", rds = if (missing(array_id)) "03-answer-%03i.rds" else sprintf("03-answer-%03i.rds", 
#              array_id), job = "job.rds", stop("Invalid type, the only valid types are `r`, `sh`, `out`, and `rds`.", 
#              call. = FALSE))
#      sprintf("%s/%s/%s", tmp_path, job_name, type)
#  }
#  TMP_PATH  <- "/home/george/Documents/development/slurmR"
#  JOB_NAME  <- "slurmr-job-113bd5bca5b18"
#  
#  # The -tcq- function is a wrapper of tryCatch that on error tries to recover
#  # the message and saves the outcome so that slurmR can return OK.
#  tcq <- function (...) 
#  {
#      ans <- tryCatch(..., error = function(e) e)
#      if (inherits(ans, "error")) {
#          ARRAY_ID. <- get("ARRAY_ID", envir = .GlobalEnv)
#          msg <- paste0("[slurmR info] An error has ocurred while evualting the expression:\n[slurmR info]   ", 
#              paste(deparse(match.call()[[2]]), collapse = "\n[slurmR info]   "), 
#              "\n[slurmR info] in ", "ARRAY_ID # ", ARRAY_ID., 
#              "\n[slurmR info] The error will be saved and quit R.\n")
#          message(msg, immediate. = TRUE, call. = FALSE)
#          ans <- list(res = ans, array_id = ARRAY_ID., job_name = get("JOB_NAME", 
#              envir = .GlobalEnv), slurmr_msg = structure(msg, 
#              class = "slurm_info"))
#          saveRDS(list(ans), snames("rds", tmp_path = get("TMP_PATH", 
#              envir = .GlobalEnv), job_name = get("JOB_NAME", envir = .GlobalEnv), 
#              array_id = ARRAY_ID.))
#          message("[slurmR info] job-status: failed.\n")
#          q(save = "no")
#      }
#      invisible(ans)
#  }
#  message("done loading variables and functions.")
#  tcq({
#    INDICES <- readRDS("/home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/INDICES.rds")
#  })
#  tcq({
#    X <- readRDS(sprintf("/home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/X_%04d.rds", ARRAY_ID))
#  })
#  tcq({
#    FUN <- readRDS("/home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/FUN.rds")
#  })
#  tcq({
#    mc.cores <- readRDS("/home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/mc.cores.rds")
#  })
#  tcq({
#    seeds <- readRDS("/home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/seeds.rds")
#  })
#  set.seed(seeds[ARRAY_ID], kind = NULL, normal.kind = NULL)
#  tcq({
#    ans <- parallel::mclapply(
#      X                = X,
#      FUN              = FUN,
#      mc.cores         = mc.cores
#  )
#  })
#  saveRDS(ans, sprintf("/home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/03-answer-%03i.rds", ARRAY_ID), compress = TRUE)
#  message("[slurmR info] job-status: OK.\n")
#  --------------------------------------------------------------------------------
#  The bash file that will be used is located at: /home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/01-bash.sh and has the following contents:
#  --------------------------------------------------------------------------------
#  #!/bin/sh
#  #SBATCH --job-name=slurmr-job-113bd5bca5b18
#  #SBATCH --output=/home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/02-output-%A-%a.out
#  #SBATCH --array=1-2
#  #SBATCH --job-name=slurmr-job-113bd5bca5b18
#  #SBATCH --cpus-per-task=1
#  #SBATCH --ntasks=1
#  /usr/lib/R/bin/Rscript  /home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/00-rscript.r
#  --------------------------------------------------------------------------------
#  EOF
#  --------------------------------------------------------------------------------
#  Warning: [submit = FALSE] The job hasn't been submitted yet. Use sbatch() to submit the job, or you can submit it via command line using the following:
#  sbatch --job-name=slurmr-job-113bd5bca5b18 /home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/01-bash.sh
Slurm_clean(ans) # Cleaning after you
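
If, instead, we wanted to actually run the job, a minimal follow-up could look like the sketch below (assuming a working Slurm installation; njobs = 2 is an arbitrary choice):

# Build the job object without submitting it (as above), then submit and collect
ans <- Slurm_lapply(x, mean, njobs = 2, plan = "none")
sbatch(ans)                # submits the array job via Slurm's sbatch
res <- Slurm_collect(ans)  # gathers the 100 means once the array finishes
Slurm_clean(ans)           # removes the auxiliary files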

Example 2: Job resubmission

The following example was extracted from the package’s manual.

# Submitting a simple job
job <- Slurm_EvalQ(slurmR::WhoAmI(), njobs = 20, plan = "submit")

# Checking the status of the job (we can simply print)
job
status(job) # or use the state function
sacct(job) # or get more info with the sacct wrapper.

# Suppose some of the jobs are taking too long to complete (say 1, 2, and 15 through 20);
# we can stop them and resubmit the job as follows:
scancel(job)

# Resubmitting only the stopped jobs
sbatch(job, array = "1,2,15-20") # A new jobid will be assigned

# Once it's done, we can collect all the results at once
res <- Slurm_collect(job)

# And clean up if we don't need to use it again
Slurm_clean(res)

Take a look at the vignette here.

Example 3: Using slurmR and future/doParallel/boot/…

The function makeSlurmCluster creates a PSOCK cluster within a Slurm HPC network, meaning that users can go beyond a single-node cluster object and take advantage of Slurm to create a multi-node cluster object. This feature allows using slurmR with other R packages that support working with SOCKcluster class objects. Here are some examples:

With the future package

library(future)
library(slurmR)

cl <- makeSlurmCluster(50)

# It only takes using a cluster plan!
plan(cluster, cl)

...your fancy futuristic code...
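
# For instance, a minimal sketch of what could go here (any future-aware code
# works once the cluster plan is set): evaluate an expression asynchronously
# on the Slurm-backed workers.
y %<-% { mean(rnorm(1e6)) }
y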

# Slurm Clusters are stopped in the same way any cluster object is
stopCluster(cl)

With the doParallel package

library(doParallel)
library(slurmR)

cl <- makeSlurmCluster(50)

registerDoParallel(cl)
m <- matrix(rnorm(9), 3, 3)
foreach(i = 1:nrow(m), .combine = rbind) %dopar% (m[i, ] / mean(m[i, ]))

stopCluster(cl)

Example 4: Using slurmR directly from the command line

The slurmR package has a couple of convenient functions designed to save the user time. First, the function sourceSlurm() allows skipping the explicit creation of a bash script file to be used together with sbatch by putting all the required configuration lines at the top of an R script, for example:

#!/bin/sh
#SBATCH --account=lc_ggv
#SBATCH --partition=scavenge
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --job-name=Waiting
Sys.sleep(10)
message("done.")

This is an R script whose first line coincides with that of a bash script for Slurm (#!/bin/sh). The following lines start with #SBATCH, explicitly specifying options for sbatch, and the remaining lines are just R code.

The previous R script is included in the package (type system.file("example.R", package="slurmR")).

Imagine that the R script is named example.R; then you can use the sourceSlurm function to submit it to Slurm as follows:

slurmR::sourceSlurm("example.R")

This will create the corresponding bash file required to be used with sbatch, and submit it to Slurm.

Another nice tool is slurmr_cmd(). This function creates a simple bash script that we can use as a command-line tool to submit this type of R script. Moreover, it can add the command as an alias in your session, as follows:

library(slurmR)
slurmr_cmd("~", add_alias = TRUE)

Once that’s done, you can submit R scripts with “Slurm-like headers” (as shown previously) as follows:

$ slurmr example.R

Example 5: Using the preamble

Since version 0.4-3, slurmR includes the option preamble. This provides a way for the user to specify commands/modules that need to be executed before running the Rscript. Here is an example using module load:

# Turning the verbose mode off
opts_slurmR$verbose_off()

# Setting the preamble can be done globally
opts_slurmR$set_preamble("module load gcc/6.0")

# Or on the fly
ans <- Slurm_lapply(1:10, mean, plan = "none", preamble = "module load pandoc")

# Printing out the bashfile
cat(readLines(ans$bashfile), sep = "\n")
#  #!/bin/sh
#  #SBATCH --job-name=slurmr-job-113bd5bca5b18
#  #SBATCH --output=/home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/02-output-%A-%a.out
#  #SBATCH --array=1-2
#  #SBATCH --job-name=slurmr-job-113bd5bca5b18
#  #SBATCH --cpus-per-task=1
#  #SBATCH --ntasks=1
#  module load gcc/6.0
#  module load pandoc
#  /usr/lib/R/bin/Rscript  /home/george/Documents/development/slurmR/slurmr-job-113bd5bca5b18/00-rscript.r

Slurm_clean(ans) # Cleaning after you

VS

There are several ways to enhance R for HPC. Depending on your goals/restrictions/preferences, you can use any of the following from this manually curated list:

Package            Rerun (1)   *apply (2)   makeCluster (3)   Slurm options
slurmR             yes         yes          yes               on the fly
drake              yes         -            -                 by template
rslurm             -           yes          -                 on the fly
future.batchtools  -           yes          yes               by template
batchtools         yes         yes          -                 by template
clustermq          -           -            -                 by template
  1. After errors, a part or the entire job can be resubmitted.
  2. Functionality similar to the apply family in base R, e.g., lapply, sapply, mapply or similar.
  3. Creating a cluster object using either MPI or Socket connection.

The slurmR and rslurm packages work only with Slurm. The drake package is focused on workflows.

Contributing

We welcome contributions to slurmR. Whether it is reporting a bug, starting a discussion by asking a question, or proposing/requesting a new feature, please do so by creating a new issue here so that we can talk about it.

Please note that this project is released with a Contributor Code of Conduct (see the CODE_OF_CONDUCT.md file included in this project). By participating in this project, you agree to abide by its terms.

Who uses Slurm

Here is a manually curated list of institutions using Slurm:

Institution Country Link
University of Utah’s CHPC US link
USC Center for Advanced Research Computing US link
Princeton Research Computing US link
Harvard FAS US link
Harvard HMS research computing US link
UC San Diego W. M. Keck Lab for Integrated Biology US link
Stanford Sherlock US link
Stanford SCG Informatics Cluster US link
UC Berkeley Open Computing Facility US link
University of Utah CHPC US link
The University of Kansas Center for Research Computing US link
University of Cambridge UK link
Indiana University US link
Caltech HPC Center US link
Institute for Advanced Study US link
UT Southwestern Medical Center BioHPC US link
Vanderbilt University ACCRE US link
University of Virginia Research Computing US link
Center for Advanced Computing CA link
SciNet CA link
NLHPC CL link
Kultrun CL link
Matbio CL link
TIG MIT US link
MIT Supercloud US supercloud.mit.edu/
Oxford’s ARC UK link

Funding

This project is supported by the National Cancer Institute, Grant #1P01CA196596.

Computation for the work described in this paper was supported by the University of Southern California’s Center for High-Performance Computing (hpcc.usc.edu).

slurmR's People

Contributors

gvegayon, pmarjora, schuettl


slurmR's Issues

Smooth way to re-run failed jobs

In some cases, a subset of the submitted jobs may fail because of, e.g., time limits, memory limits, or other reasons. In such cases, it would be nice if there were a way to resubmit the jobs that failed.

This was also suggested by @millstei.

In particular, we would need to do the following:

  • Catch errors on the Slurm side, and return with a particular code in the 0#-finito%a.out file indicating that there was an error.
  • This could be done by creating a function to parse Slurm output files.
  • Also, it would be best if we could have a function to read/write analyses of the outputs.

mem-per-cpu as sbatch-opt

I am trying to specify the RAM to be used per CPU, but the option name ("mem-per-cpu") is not a valid R variable name, so the list statement (as in sbatch_opt(list(mem-per-cpu = "16G"))) results in an error. I specified the variable name with backticks, and no errors were generated, but I am not sure it has actually worked. Is that the proper way to solve this issue, or is there another way I should be doing it?
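
For reference, non-syntactic names can be written in an R list with backticks or quotes; a minimal sketch in plain R (independent of slurmR):

# Both forms build a list whose single element is named "mem-per-cpu":
opts <- list(`mem-per-cpu` = "16G")
opts <- list("mem-per-cpu" = "16G")
names(opts)  # "mem-per-cpu"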

Slurm_lapply unable to parse sbatch output when starting jobs on federated slurm cluster when cluster name specified

The Slurm_lapply function fails, reporting job ids of NA, when run on a federated SLURM cluster. In this case the parallel slurm jobs were successfully started, but the parent process failed to parse the output from the sbatch command.
On a federated SLURM cluster, when the cluster name is specified in the sbatch_opts (and passed to the sbatch command), the output from sbatch looks like:
Submitted batch job 8653762 on cluster name_of_cluster
The regular expression used to parse this output and capture the job id on lines 142 and 224 of sbatch.R is:
".+ (?=[[:digit:]]+$)"
The "$" in that expression prevents the pattern from matching the sbatch output since there are characters following the job id. I suspect just removing the $ will solve the problem. I tried recoding that line as:
jobid <- as.integer(regmatches(ans,regexpr("[[:digit:]]+",ans)))
and that worked as well.
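
A quick standalone illustration of the two patterns against the federated-cluster output (plain R, not slurmR internals):

ans <- "Submitted batch job 8653762 on cluster name_of_cluster"
# Original pattern: the trailing "$" requires the job id at the end of the line,
# so nothing matches when a cluster name follows it.
regmatches(ans, regexpr(".+ (?=[[:digit:]]+$)", ans, perl = TRUE))  # character(0)
# Proposed pattern: grab the first run of digits instead.
as.integer(regmatches(ans, regexpr("[[:digit:]]+", ans)))           # 8653762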

vignette 'sluRm' not found

vignette("sluRm")
#> Warning: vignette 'sluRm' not found

Somehow, the getting-started instructions are not shown on the GitHub page, and the vignette function cannot find the vignette. I am submitting this question on behalf of Joshua.

mclapply causes segfault

As reported before by @gmweaver, apparently mclapply is prone to segfaults depending on the version of BLAS used in R. A possible solution to this problem would be giving the user the option to choose either forking or a socket cluster, the latter being more expensive but safer.
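
A sketch of what that user-facing choice could look like (hypothetical wrapper, not part of slurmR):

run_chunk <- function(X, FUN, mc.cores = 2L, use_fork = TRUE) {
  if (use_fork) {
    # Forking: cheap, but can segfault with some BLAS builds
    parallel::mclapply(X, FUN, mc.cores = mc.cores)
  } else {
    # Socket cluster: more expensive to set up, but safer
    cl <- parallel::makePSOCKcluster(mc.cores)
    on.exit(parallel::stopCluster(cl))
    parallel::parLapply(cl, X, FUN)
  }
}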

“An error has occurred when calling `silent_system2`:”

https://stackoverflow.com/questions/65402764/slurmr-trying-to-run-an-example-job-an-error-has-occurred-when-calling-silent

I set up a Slurm cluster and I can issue srun -N4 hostname just fine.

I keep seeing "silent_system2" errors. I've installed slurmR using devtools::install_github("USCbiostats/slurmR").

I'm following the second part of Example 3 (doParallel): https://github.com/USCbiostats/slurmR

Here are my files:

cat slurmR.R

library(doParallel)
library(slurmR)

cl <- makeSlurmCluster(4)

registerDoParallel(cl)
m <- matrix(rnorm(9), 3, 3)
foreach(i = 1:nrow(m), .combine = rbind) %dopar% (m[i, ] / mean(m[i, ]))

stopCluster(cl)
print(m)

cat rscript.slurm

#!/bin/bash
#SBATCH --output=slurmR.out

cd /mnt/nfsshare/tankpve0/
Rscript --vanilla slurmR.R

cat slurmR.out

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
slurmR default option for `tmp_path` (used to store auxiliar files) set to:
  /mnt/nfsshare/tankpve0
You can change this and checkout other slurmR options using: ?opts_slurmR, or you could just type "opts_slurmR" on the terminal.
Submitting job... jobid:18.
Slurm accounting storage is disabled
Error: An error has occurred when calling `silent_system2`:
Warning: An error was detected before returning the cluster object. If submitted, we will try to cancel the job and stop the cluster object.
Execution halted

Error in loading shared libraries

Hi,
Using doParallel and the foreach construct we get the following error:

Warning: Permanently added 't077,10.12.0.77' (ECDSA) to the list of known hosts.
/storage/apps/R/4.2.1/GNU/lib64/R/bin/exec/R: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory

It is not clear to us if the problem lies in the HPC, in the R configuration on the machine or in the Slurm library.

Thanks in advance.

[JOSS review] Inconsistent authors

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

The authors on the paper and in DESCRIPTION are different. Is there any reason for that?

Looking at the commit history, @pmarjora contributed mostly to improving the documentation. This is fine, but can you please confirm your authorship and that you are willing to take responsibility for the paper and the software (insofar as this is usually the case for co-authors on publications)?

[JOSS review] [bug?] Submitting only one `lapply` job fails

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

The following fails:

library(sluRm)
Slurm_lapply(1, function(x) x*2)
# Error in splits[[j]] : subscript out of bounds
# In addition: Warning message:
# `X` is not a list. The function will coerce it into one using `as.list`

Related: Internal parallelism via mc.cores defaults to 2, which wastes a core if there are not more calls than jobs.

`opts_slurmR` not appearing in sbatch script

Maybe I'm doing this wrong, but if so then I can't figure out how to do it right.

I'm using opts_slurmR$set_opts() to set options that I want to appear in the #SBATCH lines at the beginning of the job script. For example:

opts_slurmR$set_opts(partition="debug", `get-user-env` = "",  `mail-type` = "BEGIN, FAIL, END")

The options appear to have been set correctly, based on the output of opts_slurmR and opts_slurmR$get_opts_job(). However, they do not appear in myjob.sh. Any guidance would be appreciated.

[JOSS review] Chunking of jobs

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

As far as I can tell, individual function calls are always submitted as array jobs.

What if you have millions of calls?
Does that mean sluRm will submit an array with millions of entries?

In other words, can sluRm chunk together multiple function calls in one job?

I assume it cannot. This is fine, but it should be documented.
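
For reference, a sketch of how calls can be chunked manually today with plain base/parallel R (not a sluRm feature):

X      <- as.list(runif(1e4))                    # pretend these are 10,000 calls
idx    <- parallel::splitIndices(length(X), 100) # 100 chunks of ~100 calls each
chunks <- lapply(idx, function(i) X[i])
# Passing `chunks` (with FUN applied to each chunk) keeps the array at 100 tasks
# instead of 10,000.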

Default partition, account, and cluster

slurmR should detect whenever it is being run from within a Slurm job. If that's the case, the following variables should have these defaults:

  • partition = $SLURM_JOB_PARTITION
  • account = $SLURM_JOB_ACCOUNT
  • cluster = $SLURM_CLUSTER_NAME

This way, users can skip writing this information twice when possible.
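
A sketch of how those defaults could be picked up (hypothetical helper, not current slurmR code):

slurm_default <- function(fallback, envvar) {
  val <- Sys.getenv(envvar, unset = "")
  if (nzchar(val)) val else fallback
}
partition <- slurm_default(NULL, "SLURM_JOB_PARTITION")
account   <- slurm_default(NULL, "SLURM_JOB_ACCOUNT")
cluster   <- slurm_default(NULL, "SLURM_CLUSTER_NAME")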

[JOSS review] Seeding

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

The seeds currently default to 1:njobs. I've followed a similar approach in batchtools, where I determine a start seed randomly and then increment the seed for each job. Even with a random initialization, I'm not sure if this approach is feasible (see mllg/batchtools#81). Defaulting to 1:njobs seems to be even worse; there is a reason RNGs are not initialized with a constant value.
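
A sketch of the random-start-then-increment scheme described above (an assumption about how it could be implemented, not current slurmR behaviour):

njobs <- 20L
start <- sample.int(.Machine$integer.max - njobs, 1)
seeds <- start + seq_len(njobs) - 1L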

[JOSS review] Community guidelines

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

The JOSS check list states:

Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

However, the only contributing guidelines are a linked CoC.

Can you add a sentence on how to report issues/PRs/seek support? (e.g. should users open an issue, email the author, or send a carrier pigeon)

[JOSS review] Add a more realistic example

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

It would be very important for the user to have a vignette with a more realistic example, especially on how to deal with failed jobs (retrieve logs, run job interactively to generate a traceback (is this possible?), re-submit subset of failed jobs).

Fix a few issues [re: JOSS submission]

I found a handful of typos in your package (not all may be valid). Please fix.

 WORD              FOUND IN
accesible         new_rscript.Rd:19
                  Slurm_EvalQ.Rd:21
                  Slurm_lapply.Rd:31
batchtools        README.md:403,455
circunscribed     getting-started.Rmd:76
Debuggin          opts_sluRm.Rd:23
debuging          README.md:30
                  README.Rmd:35
elipsis           getting-started.Rmd:38
estiates          getting-started.Rmd:93
EvalQ             the_plan.Rd:59
frac              getting-started.Rmd:72
mbox              getting-started.Rmd:72
nd                getting-started.Rmd:48
OverSubscribe     getting-started.Rmd:142
parallelisation   getting-started.Rmd:93
parallelization   getting-started.Rmd:48
parallelize       getting-started.Rmd:59
radious           getting-started.Rmd:76
sbatch            slurm_job.Rd:33,52
                  Slurm_lapply.Rd:100
                  the_plan.Rd:32,35
schedmd           Slurm_lapply.Rd:97
scontrol          getting-started.Rmd:139
specity           Slurm_EvalQ.Rd:16
                  Slurm_lapply.Rd:26
splited           Slurm_lapply.Rd:65
splitIndices      Slurm_lapply.Rd:62
submited          getting-started.Rmd:21
th                README.md:583
ThreadsPerCore    getting-started.Rmd:139,142
underlaying       sbatch.Rd:81
which's           getting-started.Rmd:76
wikipedia         getting-started.Rmd:76

Incompatible with `renv` projects

libPaths are not properly parsed into the rscript's library calls when using an renv project.

The function list_loaded_pkgs attempts to parse package directory information passed from sessionInfo(), which for renv projects using a cache is the base directory of the cache (which does not have a library tree structure), rather than the project specific renv libPath (which is a library tree with symlinks to the cached packages).

Is there a particular reason why the libPaths variable is not respected when loading packages from the parent environment?

I have forked the repo and made a change here to ensure that the library calls always respect the specified libPaths variable. Is there a situation where this is not desirable?
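
For reference, a sketch of the general idea (an assumption about the fix, not the actual patch): write the current .libPaths() verbatim into the generated R script, rather than paths recovered from sessionInfo():

lib_line <- sprintf(
  ".libPaths(c(%s))",
  paste(sprintf('"%s"', .libPaths()), collapse = ", ")
)
cat(lib_line, "\n")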

PSOCK cluster backend

I just figured out that creating a sluRm type cluster object is rather easy:

  1. From a node, we need to initialize a job with N workers (nodes).

  2. Once the job has started, we can list the nodes that were assigned by typing squeue(u = [userid]).

  3. From there, the node name is given, which actually matches its address.

  4. With that, we can create PSOCK clusters easily by typing:

    cl <- parallel::makePSOCKcluster([list of node names])
    

    Meaning, we can use sluRm as a backend for all!

cc @pmarjora @gmweaver
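
A rough end-to-end sketch of those steps (the node names below are hypothetical placeholders):

nodes <- c("node001", "node002", "node003")  # taken from squeue once the job starts
cl    <- parallel::makePSOCKcluster(nodes)
parallel::parSapply(cl, seq_along(nodes), function(i) Sys.info()[["nodename"]])
parallel::stopCluster(cl)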

Passing along an environment variable

I am trying to fit a Stan model using the cmdstanr package. Using cmdstan requires declaring the path where cmdstan sits. On my local computer, this does not need to be set once cmdstan has been installed. However, on the HPC, we need to tell the nodes where to find cmdstan by issuing a command like this (which is specific to my setup):

set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")

In the code below, I have commented out the two places where I know it does work. However, I'd rather not have to embed this command in the code, but would rather have it in the Slurm_lapply call or in some system setting. Basically, I don't really want to have to worry about this when coding.

I can comment out the second command (just before the cmdstan_model call) if I specify the expression in my sbatch code:

Rscript --vanilla cmdstanr_test.R -e 'set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")'

However, as you well know, this does not propagate to the nodes accessed as part of the job array. My attempt at using the rscript_opt argument did not do the trick. Any suggestions?

library(slurmR)
library(cmdstanr)
library(simstudy)
library(data.table)

s_estimate <- function(sigma, s_model, K) {
  
  fit <- s_model$sample(
    data = list(K=K, sigma = sigma),
    seed = 123,
    chains = 4,
    parallel_chains = 4,
    refresh = 500,
    iter_warmup = 250,
    iter_sampling = 500
  )
  
  return(fit)
  
}

s_extract <- function(fit) {
  
  x <- fit$draws()
  return(as.data.table(x))
  
}

iteration <- function(sigma, s_model, K) {
  
  # set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")
  
  fit <- s_estimate(sigma, s_model, K)
  dd <- s_extract(fit)
  
  return(dd)
  
}

#---

# set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")
mod <- cmdstan_model("/gpfs/data/troxellab/ksg/r/cmdstanr_test.stan")

job <- Slurm_lapply(
  1:10, 
  iteration, 
  s_model = mod,
  K = 4,
  njobs = 8, 
  mc.cores = 4,
  tmp_path = "/gpfs/data/troxellab/ksg/scratch",
  overwrite = TRUE,
  job_name = "i_cmd",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("s_estimate", "s_extract"),
  plan = "wait",
  rscript_opt = list('set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")'))

job
res <- Slurm_collect(job)

[JOSS review] Comparison table

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

The comparison table in the readme says that batchtools cannot re-run jobs, but batchtools is perfectly capable of it:

library(batchtools)
reg = makeRegistry(NA)
f = function(x) if (x == 3) stop(3) else x^2
btlapply(1:5, f, reg = reg)
# -> no result

# partial results are available in registry
reduceResultsList()

# re-submit failed jobs
# (job #3 will fail again of course, just for demonstration)
submitJobs(findErrors())

incorrect rslurm info in VS table

  1. The rslurm package has always been active and remains active to the present.
  2. rslurm::slurm_apply() provides functionality similar to the apply family in base R, so the table should read "yes".

Error with --exclusive option

Hello everyone,

When using sourceSlurm or the slurmr command-line tool, if the R file contains the following sbatch directives:

#!/bin/sh
#SBATCH --job-name=simRMD
#SBATCH --mem-per-cpu=10G
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --exclusive

it returns the following error:
sbatch: error: Invalid --exclusive specification
Return code (status): 255

Looking at the generated .sh file, I see that slurmR appends the following: #SBATCH --exclusive=--exclusive
Is that behaviour expected?

Best,

Mahmoud

Basic checks before submitting a job

In the case of the Slurm_lapply function, make sure of the following (a sketch of such checks follows this list):

  • The first argument can be coerced into a list
  • The second argument is a function
  • The ... is coherent with the arguments received by the function
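
A minimal sketch of such checks (hypothetical code, not what Slurm_lapply currently does):

check_lapply_args <- function(X, FUN, ...) {
  X <- as.list(X)                      # first argument must be list-coercible
  stopifnot(is.function(FUN))          # second argument must be a function
  dots <- names(list(...))
  if (!"..." %in% names(formals(FUN)) && !all(dots %in% names(formals(FUN))))
    stop("`...` has arguments that FUN does not accept.")
  invisible(TRUE)
}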

filesystem error: cannot rename: Directory not empty

My Slurm cluster runs a pair of prolog and epilog scripts to write performance data to the job submission folder. This seems to conflict with assumptions made by slurmR:

Success! nodenames collected (terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error',   what():  filesystem error: cannot rename: Directory not empty [sps-%JOBID%_-2] [sps-%JOBID%_-2.1],   what():  filesystem error: cannot rename: Directory not empty [sps-%JOBID%_-2] [sps-%JOBID%_-2.2], %NODENAME%). Creating the cluster object...
ssh: Could not resolve hostname terminate: Temporary failure in name resolution

Those folders named sps-* are automatically created by our scripts. Job ID and node name replaced with placeholders %JOBID% and %NODENAME%.

[JOSS review] Prefer file.path()?

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

This is nitpicking, but I'd suggest using file.path() instead of paste() or paste0(). According to the documentation, file.path() should be faster.
Feel free to just close this issue if you disagree or have other reasons to not use file.path().
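
For illustration, the two styles side by side (plain R; the path pieces below are made up):

tmp_path <- "/tmp"; job_name <- "slurmr-job-xyz"
paste0(tmp_path, "/", job_name, "/01-bash.sh")  # "/tmp/slurmr-job-xyz/01-bash.sh"
file.path(tmp_path, job_name, "01-bash.sh")     # same result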

Slurm_collect throws 'Error in x$njobs : $ operator is invalid for atomic vectors'

I am testing the slurmR package on my school's HPC. Everything works great using Slurm_lapply with plan = "none" and then an sbatch call to launch the job array. However, I get the following strange error when using Slurm_collect:

Warning: The call to -sacct- failed. This is probably due to not having slurm accounting up and running. For more information, checkout this discussion: https://github.com/USCbiostats/slurmR/issues/29
Error in x$njobs : $ operator is invalid for atomic vectors

The code I run is

library(slurmR)
ans <- Slurm_lapply(1:10, sqrt, plan="none")
sbatch(ans)
result <- Slurm_collect(ans)

I understand my cluster does not have Slurm accounting enabled, but it seems the error is unrelated to the warning. However, when I enter debug mode, the x object has an njobs attribute and does not throw an error when I retrieve it directly.

Error in save.image(name) when running R task in batch mode using slurm

Hi, when I run the R code interactively on Linux, it works well. But when I run R in batch mode via Slurm, I always get the same error. The following are my .sh file, .R file, and .out file (APSIM.out was created automatically). It shows "Error in save.image(name) :", even though I already told R not to save the workspace with --vanilla. Do you have any suggestions? Maybe it is unrelated to the apsimx package and is an R issue instead. Thanks!

APSIM.sh

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --job-name="myjob"
#SBATCH --partition=secondary
#SBATCH --output=myjob.o%j
#SBATCH --dependency=afterany:<JobID>

conda activate ly-envi

cd xx/xx/xx/apsim
srun R --vanilla CMD BATCH  APSIM.R APSIM.out

APSIM.R

library(soilDB)
library(sp)
library(sf)
library(spData)
library(apsimx)
library(raster)
library(lubridate)
estimated_sowing_list = list()


apsim_path = "/data/keeling/a/xx/apsim/apsim_2016.txt"
apsim_input <- read.delim(apsim_path,header = FALSE, sep = ",", dec = ",")
colnames(apsim_input) <- c('SD','B21','B25','Long','Lati')
head(apsim_input)

apsimx_options(exe.path = "/data/keeling/a/xxx/ApsimX/bin/Debug/netcoreapp3.1/Models")

for (i in 1:2) {
  
  options(digits=15)
  filed_latilongi = c(as.double(as.character(apsim_input$Long[i])) , as.double(as.character(apsim_input$Lati[i]) ) )
  
  # extd.dir <- system.file("extdata", package = "apsimx")
  tmp.dir ="/data/keeling/a/xxx/apsim/"

  dmet12 <- get_daymet_apsim_met(lonlat = filed_latilongi, years = c(2016))
  write_apsim_met(dmet12,wrt.dir = tmp.dir, filename = "test.met")

 	print(i)

}

APSIM.out

[Previously saved workspace restored]

> 
> library(soilDB)
> library(sp)
> library(sf)
Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1; sf_use_s2() is TRUE
> library(spData)
To access larger datasets in this package, install the spDataLarge
package with: `install.packages('spDataLarge',
repos='https://nowosad.github.io/drat/', type='source')`
> library(apsimx)
APSIM(X) not found.
                        If APSIM(X) is installed in an alternative location,
                        set paths manually using 'apsimx_options' or 'apsim_options'.
                        You can still try as the package will look into the registry (under Windows)
> library(raster)
> library(lubridate)

Attaching package: 'lubridate'

The following objects are masked from 'package:raster':

    intersect, union

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

> estimated_sowing_list = list()
> 
> 
> apsim_path = "/data/keeling/a/xxx/apsim/apsim_2016.txt"
> apsim_input <- read.delim(apsim_path,header = FALSE, sep = ",", dec = ",")
> colnames(apsim_input) <- c('SD','B21','B25','Long','Lati')
> head(apsim_input)
   SD B21 B25                 Long                 Lati
1 111 145 150  -89.48824446937473    39.29027168436681 
2 108 133 137   -89.5437797576881    39.21354130583063 
3 144 164 168   -89.5826152805141    39.08288777713276 
4 115 132 138  -89.83044110359363    39.12534519941483 
5 145 155 159  -89.51309913395067   39.370810476566184 
6 109 140 144  -89.37520541425403    39.49220820731174 
> 
> apsimx_options(exe.path = "/data/keeling/a/xxx/ApsimX/bin/Debug/netcoreapp3.1/Models")
> 
> for (i in 1:1) {
+   
+   options(digits=15)
+   filed_latilongi = c(as.double(as.character(apsim_input$Long[i])) , as.double(as.character(apsim_input$Lati[i]) ) )
+   
+   # extd.dir <- system.file("extdata", package = "apsimx")
+   tmp.dir ="/data/keeling/a/xxx/apsim/"
+ 
+   dmet12 <- get_daymet_apsim_met(lonlat = filed_latilongi, years = c(2016),silent=TRUE)
+   write_apsim_met(dmet12,wrt.dir = tmp.dir, filename = "test.met")
+   summary(dmet12)
+   ## Check for reasonable ranges 
+   check_apsim_met(dmet12)
+  	print(i)
+ 
+ }
[1] 1
Warning message:
In check_apsim_met(dmet12) :
  Last year in the met file is a leap year and it only has 365 days
> 
> proc.time()
   user  system elapsed 
  5.752   0.380  13.795 
Error in save.image(name) : 
  image could not be renamed and is left in .RDataTmp
Calls: sys.save.image -> save.image
In addition: Warning message:
In file.rename(outfile, file) :
  cannot rename file '.RDataTmp' to '.RData', reason 'No such file or directory'
Execution halted

When submitting a job with sbatch(..., array = ) Slurm_collect fails

Slurm_collect needs to be able to collect whatever is available. Also, I need to work on a better way to put everything together. Right now it seems that the job is not being submitted to the same folder: Slurm_collect is trying to get x$opts_job$tmp_path, but this is not reflected in the job object.

Reverse

Hello, I have two files. One is a .sh file containing a Slurm script with a job array of 1000. The other is an R script that uses SLURM_ARRAY_TASK_ID as a seed throughout the script.
Since I am a Mac user and do not have Slurm installed, I want to use only the slurmR package.
I am asking whether there is a way to get SLURM_ARRAY_TASK_ID using only slurmR in RStudio. If that is possible, I can run my 1000 jobs.

Thank you
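
For reference, a sketch of one way to read the variable with a fallback when Slurm is absent (this mirrors the Slurm_env() helper shown in the generated R script earlier on this page; plain R):

task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))
set.seed(task_id)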

Suppress cleanup of output files from makeSlurmCluster

Is it possible to include a toggle in the makeSlurmCluster function that would allow retaining the Slurm output file? I would like to be able to track the progress of a future call using a Slurm socket cluster, but it appears that the clean() function is called to remove the output file when makeSlurmCluster completes.
