uscbiostats / slurmR

slurmR: A Lightweight Wrapper for Slurm

Home Page: https://uscbiostats.github.io/slurmR/

License: Other

R 94.31% TeX 2.07% Shell 1.64% Makefile 1.98%
hpc slurm rpackage rstats bioinformatics

slurmR's Issues

[JOSS review] Prefer file.path()?

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

This is nitpicking, but I'd suggest using file.path() instead of paste() or paste0(). According to the documentation, file.path() should be faster.
Feel free to just close this issue if you disagree or have other reasons to not use file.path().
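
For illustration (tmp_path and job_name are placeholder values here, not slurmR's actual file names), the two styles compare as:

tmp_path <- tempdir()
job_name <- "my-job"
file.path(tmp_path, job_name, "script.R")       # builds the path with the platform separator
paste0(tmp_path, "/", job_name, "/script.R")    # same result, assembled by hand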

Fix a few issues [re: JOSS submission]

I found a handful of typos in your package (not all may be valid). Please fix.

 WORD              FOUND IN
accesible         new_rscript.Rd:19
                  Slurm_EvalQ.Rd:21
                  Slurm_lapply.Rd:31
batchtools        README.md:403,455
circunscribed     getting-started.Rmd:76
Debuggin          opts_sluRm.Rd:23
debuging          README.md:30
                  README.Rmd:35
elipsis           getting-started.Rmd:38
estiates          getting-started.Rmd:93
EvalQ             the_plan.Rd:59
frac              getting-started.Rmd:72
mbox              getting-started.Rmd:72
nd                getting-started.Rmd:48
OverSubscribe     getting-started.Rmd:142
parallelisation   getting-started.Rmd:93
parallelization   getting-started.Rmd:48
parallelize       getting-started.Rmd:59
radious           getting-started.Rmd:76
sbatch            slurm_job.Rd:33,52
                  Slurm_lapply.Rd:100
                  the_plan.Rd:32,35
schedmd           Slurm_lapply.Rd:97
scontrol          getting-started.Rmd:139
specity           Slurm_EvalQ.Rd:16
                  Slurm_lapply.Rd:26
splited           Slurm_lapply.Rd:65
splitIndices      Slurm_lapply.Rd:62
submited          getting-started.Rmd:21
th                README.md:583
ThreadsPerCore    getting-started.Rmd:139,142
underlaying       sbatch.Rd:81
which's           getting-started.Rmd:76
wikipedia         getting-started.Rmd:76

Slurm_lapply unable to parse sbatch output when starting jobs on federated slurm cluster when cluster name specified

The Slurm_lapply function fails, reporting job ids of NA, when run on a federated SLURM cluster. In this case the parallel slurm jobs were successfully started, but the parent process failed to parse the output from the sbatch command.
On a federated SLURM cluster, when the cluster name is specified in the sbatch_opts (and passed to the sbatch command), the output from sbatch looks like:
Submitted batch job 8653762 on cluster name_of_cluster
The regular expression used to parse this output and capture the job id on lines 142 and 224 of sbatch.R is:
".+ (?=[[:digit:]]+$)"
The "$" in that expression prevents the pattern from matching the sbatch output since there are characters following the job id. I suspect just removing the $ will solve the problem. I tried recoding that line as:
jobid <- as.integer(regmatches(ans, regexpr("[[:digit:]]+", ans)))
and that worked as well.
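
A quick sanity check of that replacement, using the federated sbatch output shown above as the input string:

ans   <- "Submitted batch job 8653762 on cluster name_of_cluster"
jobid <- as.integer(regmatches(ans, regexpr("[[:digit:]]+", ans)))
jobid
# [1] 8653762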

Reverse

Hello, I have two files. One is a .sh file containing a Slurm script with a job array of 1000; the other is an R script that uses SLURM_TASK_ARRAY_ID as a seed throughout.
Since I am a Mac user and do not have Slurm installed, I want to use only the slurmR package.
Is there a way to get SLURM_TASK_ARRAY_ID using only slurmR in RStudio? If so, I can run my 1000 jobs.

Thank you

Error in loading shared libraries

Hi,
Using doParallel and the foreach construct we get the following error:

Warning: Permanently added 't077,10.12.0.77' (ECDSA) to the list of known hosts.
/storage/apps/R/4.2.1/GNU/lib64/R/bin/exec/R: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory

It is not clear to us whether the problem lies in the HPC, the R configuration on the machine, or the Slurm library.

Thanks in advance.

[JOSS review] Community guidelines

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

The JOSS check list states:

Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

However, the only contributing guidance provided is a linked CoC.

Can you add a sentence on how to report issues/PRs/seek support? (e.g. should users open an issue, email the author, or send a carrier pigeon)

[JOSS review] Add a more realistic example

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

It would be very important for the user to have a vignette with a more realistic example, especially one showing how to deal with failed jobs (retrieve logs, run a job interactively to generate a traceback (is this possible?), and re-submit a subset of failed jobs).

Error with --exclusive option

Hello everyone,

When using sourceSlurm or slurmR from the command line, and the R file contains the following sbatch directives:

#!/bin/sh
#SBATCH --job-name=simRMD
#SBATCH --mem-per-cpu=10G
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --exclusive

it returns the following error:
sbatch: error: Invalid --exclusive specification
Return code (status): 255

Looking at the generated .sh file, I see that slurmR appends the following: #SBATCH --exclusive=--exclusive
Is that behaviour expected?

Best,

Mahmoud

Passing along an environment variable

I am trying to fit a Stan model using the cmdstanr package. Using cmdstan requires declaring the path where cmdstan sits. On my local computer, this does not need to be set once cmdstan has been installed. However, on the HPC, we need to tell the nodes where to find cmdstan by issuing a command like this (which is specific to my setup):

set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")

In the code below, I have commented out the two places where I know setting the path does work. However, I'd rather not embed this command in the code; I would rather have it in the Slurm_lapply call or in some system setting. Basically, I don't really want to have to worry about this when coding.

I can comment out the second command (just before the cmdstan_model call) if I specify the expression in my sbatch code:

Rscript --vanilla cmdstanr_test.R -e 'set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")'

However, as you well know, this does not propagate to the nodes accessed as part of the job array. My attempt at using the rscript_opt argument did not do the trick. Any suggestions?

library(slurmR)
library(cmdstanr)
library(simstudy)
library(data.table)

s_estimate <- function(sigma, s_model, K) {
  
  fit <- s_model$sample(
    data = list(K=K, sigma = sigma),
    seed = 123,
    chains = 4,
    parallel_chains = 4,
    refresh = 500,
    iter_warmup = 250,
    iter_sampling = 500
  )
  
  return(fit)
  
}

s_extract <- function(fit) {
  
  x <- fit$draws()
  return(as.data.table(x))
  
}

iteration <- function(sigma, s_model, K) {
  
  # set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")
  
  fit <- s_estimate(sigma, s_model, K)
  dd <- s_extract(fit)
  
  return(dd)
  
}

#---

# set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")
mod <- cmdstan_model("/gpfs/data/troxellab/ksg/r/cmdstanr_test.stan")

job <- Slurm_lapply(
  1:10, 
  iteration, 
  s_model = mod,
  K = 4,
  njobs = 8, 
  mc.cores = 4,
  tmp_path = "/gpfs/data/troxellab/ksg/scratch",
  overwrite = TRUE,
  job_name = "i_cmd",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("s_estimate", "s_extract"),
  plan = "wait",
  rscript_opt = list('set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")'))

job
res <- Slurm_collect(job)

Default partition, account, and cluster

slurmR should detect whenever it is being run from within a job. If that's the case, the following variables should have the following defaults:

  • partition = $SLURM_JOB_PARTITION
  • account = $SLURM_JOB_ACCOUNT
  • cluster = $SLURM_CLUSTER_NAME

This way, users can skip writing this information twice when possible; a sketch of how these defaults could be derived follows.
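
A minimal sketch of that logic (assumed behaviour, not slurmR's current code):

# Detect an enclosing job and reuse its settings as defaults.
inside_job <- nzchar(Sys.getenv("SLURM_JOB_ID"))
defaults <- if (inside_job) {
  list(
    partition = Sys.getenv("SLURM_JOB_PARTITION"),
    account   = Sys.getenv("SLURM_JOB_ACCOUNT"),
    cluster   = Sys.getenv("SLURM_CLUSTER_NAME")
  )
} else {
  list()
}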

Incompatible with `renv` projects

libPaths are not properly propagated into the rscript's library() calls when using an renv project.

The function list_loaded_pkgs attempts to parse package directory information passed from sessionInfo(), which for renv projects using a cache is the base directory of the cache (which does not have a library tree structure), rather than the project specific renv libPath (which is a library tree with symlinks to the cached packages).

Is there a particular reason why the libPaths variable is not respected when loading packages from the parent environment?

I have forked the repo and made a change here to ensure that the library calls always respect the specified libPaths variable. Is there a situation where this is not desirable?
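
A minimal sketch of the behaviour being proposed (not slurmR's internals): write the caller's full .libPaths() into the generated rscript so that every library() call searches the renv project library first.

lib_paths <- .libPaths()                              # includes the renv project library
pkgs      <- names(sessionInfo()$otherPkgs)           # attached, non-base packages
header    <- sprintf(".libPaths(c(%s))",
                     paste(sprintf('"%s"', lib_paths), collapse = ", "))
writeLines(c(header, sprintf("library(%s)", pkgs)))   # lines that would go in the rscript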

Suppress cleanup of output files from makeSlurmCluster

Is it possible to include a toggle in makeSlurmCluster that would retain the Slurm output file? I would like to be able to track the progress of a future call using a Slurm socket cluster, but it appears that the clean() function removes the output file once makeSlurmCluster completes.

[JOSS review] Seeding

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

The seeds currently default to 1:njobs. I've followed a similar approach in batchtools, where I determine a start seed randomly and then increment the seed for each job. Even with a random initialization, I'm not sure this approach is feasible (see mllg/batchtools#81). Defaulting to 1:njobs seems even worse; there is a reason RNGs are not initialized with a constant value.
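
For reference, a sketch of the batchtools-style scheme described above (a random start seed incremented once per job); njobs is just a placeholder here:

njobs      <- 4L
start_seed <- sample.int(.Machine$integer.max - njobs, 1L)
seeds      <- start_seed + seq_len(njobs) - 1L   # one distinct seed per job
seeds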

[JOSS review] [bug?] Submitting only one `lapply` job fails

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

The following fails:

library(sluRm)
Slurm_lapply(1, function(x) x*2)
# Error in splits[[j]] : subscript out of bounds
# In addition: Warning message:
# `X` is not a list. The function will coerce it into one using `as.list`

Related: internal parallelism via mc.cores defaults to 2, which wastes a core if there are no more calls than jobs.

`opts_slurmR` not appearing in sbatch script

Maybe I'm doing this wrong, but if so then I can't figure out how to do it right.

I'm using opts_slurmR$set_opts() to set options that I want to appear in the #SBATCH lines at the beginning of the job script. For example:

opts_slurmR$set_opts(partition="debug", `get-user-env` = "",  `mail-type` = "BEGIN, FAIL, END")

The options appear to have been set correctly, based on the output of opts_slurmR and opts_slurmR$get_opts_job(). However, they do not appear in myjob.sh. Any guidance would be appreciated.

[JOSS review] Chunking of jobs

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

As far as I can tell, individual function calls are always submitted as array jobs.

What if you have millions of calls?
Does that mean sluRm will submit an array with millions of entries?

In other words, can sluRm chunk together multiple function calls in one job?

I assume it cannot. This is fine, but it should be documented.
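
To illustrate what chunking would mean here (not a statement about what sluRm currently does): group the calls so that each array task loops over a chunk instead of handling a single call.

X      <- as.list(1:1e6)                           # a million function calls
njobs  <- 100L                                     # desired number of array tasks
chunks <- parallel::splitIndices(length(X), njobs)
length(chunks)       # 100 tasks ...
lengths(chunks)[1]   # ... each covering roughly 10,000 calls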

Incorrect rslurm info in comparison table

  1. The rslurm package has always been actively maintained, and it remains active to the present.
  2. rslurm::slurm_apply() provides functionality similar to the apply family in base R, so the table should read "yes".

Basic checks before submitting a job

In the case of the Slurm_lapply function, make sure of the following (a rough sketch follows the list):

  • The first argument can be coerced into a list
  • The second argument is a function
  • The ... is coherent with the arguments received by the function
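
A rough sketch of such pre-flight checks (illustrative only, not slurmR code):

check_lapply_args <- function(X, FUN, ...) {
  if (!is.list(X)) {
    warning("`X` is not a list; coercing it with `as.list()`.")
    X <- as.list(X)
  }
  stopifnot(is.function(FUN))
  # Any named argument in ... must be accepted by FUN (unless FUN itself has ...).
  extra <- setdiff(names(list(...)), names(formals(FUN)))
  if (length(extra) > 0L && !("..." %in% names(formals(FUN))))
    stop("Arguments not used by FUN: ", paste(extra, collapse = ", "))
  invisible(X)
}

check_lapply_args(1:10, function(x, k) x * k, k = 2)   # passes, after coercing X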

Error in save.image(name) when running R task in batch mode using slurm

Hi, when I run the R code interactively on Linux it works well, but when I run R in batch mode I always get the same error. The following are my .sh file, .R file, and .out file (APSIM.out was created automatically). It shows "Error in save.image(name) :", even though I already told R not to save the workspace with "--vanilla". Do you have any suggestions? Maybe it is unrelated to the apsimx package and is an R issue. Thanks!

APSIM.sh

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --job-name="myjob"
#SBATCH --partition=secondary
#SBATCH --output=myjob.o%j
#SBATCH --dependency=afterany:<JobID>

conda activate ly-envi

cd xx/xx/xx/apsim
srun R --vanilla CMD BATCH  APSIM.R APSIM.out

APSIM.R

library(soilDB)
library(sp)
library(sf)
library(spData)
library(apsimx)
library(raster)
library(lubridate)
estimated_sowing_list = list()


apsim_path = "/data/keeling/a/xx/apsim/apsim_2016.txt"
apsim_input <- read.delim(apsim_path,header = FALSE, sep = ",", dec = ",")
colnames(apsim_input) <- c('SD','B21','B25','Long','Lati')
head(apsim_input)

apsimx_options(exe.path = "/data/keeling/a/xxx/ApsimX/bin/Debug/netcoreapp3.1/Models")

for (i in 1:2) {
  
  options(digits=15)
  filed_latilongi = c(as.double(as.character(apsim_input$Long[i])) , as.double(as.character(apsim_input$Lati[i]) ) )
  
  # extd.dir <- system.file("extdata", package = "apsimx")
  tmp.dir ="/data/keeling/a/xxx/apsim/"

  dmet12 <- get_daymet_apsim_met(lonlat = filed_latilongi, years = c(2016))
  write_apsim_met(dmet12,wrt.dir = tmp.dir, filename = "test.met")

 	print(i)

}

APSIM.out

[Previously saved workspace restored]

> 
> library(soilDB)
> library(sp)
> library(sf)
Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1; sf_use_s2() is TRUE
> library(spData)
To access larger datasets in this package, install the spDataLarge
package with: `install.packages('spDataLarge',
repos='https://nowosad.github.io/drat/', type='source')`
> library(apsimx)
APSIM(X) not found.
                        If APSIM(X) is installed in an alternative location,
                        set paths manually using 'apsimx_options' or 'apsim_options'.
                        You can still try as the package will look into the registry (under Windows)
> library(raster)
> library(lubridate)

Attaching package: 'lubridate'

The following objects are masked from 'package:raster':

    intersect, union

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

> estimated_sowing_list = list()
> 
> 
> apsim_path = "/data/keeling/a/xxx/apsim/apsim_2016.txt"
> apsim_input <- read.delim(apsim_path,header = FALSE, sep = ",", dec = ",")
> colnames(apsim_input) <- c('SD','B21','B25','Long','Lati')
> head(apsim_input)
   SD B21 B25                 Long                 Lati
1 111 145 150  -89.48824446937473    39.29027168436681 
2 108 133 137   -89.5437797576881    39.21354130583063 
3 144 164 168   -89.5826152805141    39.08288777713276 
4 115 132 138  -89.83044110359363    39.12534519941483 
5 145 155 159  -89.51309913395067   39.370810476566184 
6 109 140 144  -89.37520541425403    39.49220820731174 
> 
> apsimx_options(exe.path = "/data/keeling/a/xxx/ApsimX/bin/Debug/netcoreapp3.1/Models")
> 
> for (i in 1:1) {
+   
+   options(digits=15)
+   filed_latilongi = c(as.double(as.character(apsim_input$Long[i])) , as.double(as.character(apsim_input$Lati[i]) ) )
+   
+   # extd.dir <- system.file("extdata", package = "apsimx")
+   tmp.dir ="/data/keeling/a/xxx/apsim/"
+ 
+   dmet12 <- get_daymet_apsim_met(lonlat = filed_latilongi, years = c(2016),silent=TRUE)
+   write_apsim_met(dmet12,wrt.dir = tmp.dir, filename = "test.met")
+   summary(dmet12)
+   ## Check for reasonable ranges 
+   check_apsim_met(dmet12)
+  	print(i)
+ 
+ }
[1] 1
Warning message:
In check_apsim_met(dmet12) :
  Last year in the met file is a leap year and it only has 365 days
> 
> proc.time()
   user  system elapsed 
  5.752   0.380  13.795 
Error in save.image(name) : 
  image could not be renamed and is left in .RDataTmp
Calls: sys.save.image -> save.image
In addition: Warning message:
In file.rename(outfile, file) :
  cannot rename file '.RDataTmp' to '.RData', reason 'No such file or directory'
Execution halted

mclapply causes segfault

As reported before by @gmweaver, mclapply is apparently prone to segfaults depending on the version of BLAS used in R. A possible solution would be to give the user the option to choose either forking or a sock cluster, the latter being more expensive but safer.
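
One way such an option could look (a sketch, not slurmR's actual API): fall back to a PSOCK cluster when forking is known to be unsafe.

run_chunk <- function(X, FUN, mc.cores = 2L, use_fork = TRUE) {
  if (use_fork) {
    parallel::mclapply(X, FUN, mc.cores = mc.cores)   # cheap, fork-based
  } else {
    cl <- parallel::makePSOCKcluster(mc.cores)        # pricier, but avoids fork/BLAS issues
    on.exit(parallel::stopCluster(cl), add = TRUE)
    parallel::parLapply(cl, X, FUN)
  }
}

run_chunk(1:4, sqrt, use_fork = FALSE)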

“An error has occurred when calling `silent_system2`:”

https://stackoverflow.com/questions/65402764/slurmr-trying-to-run-an-example-job-an-error-has-occurred-when-calling-silent

I set up a Slurm cluster and can issue srun -N4 hostname just fine.

I keep seeing "silent_system2" errors. I've installed slurmR using devtools::install_github("USCbiostats/slurmR")

I'm following the second example (example 3): https://github.com/USCbiostats/slurmR

Here are my files:

cat slurmR.R

library(doParallel)
library(slurmR)

cl <- makeSlurmCluster(4)

registerDoParallel(cl)
m <- matrix(rnorm(9), 3, 3)
foreach(i=1:nrow(m), .combine=rbind)

StopCluster(cl)
print(m)

cat rscript.slurm

#!/bin/bash
#SBATCH --output=slurmR.out

cd /mnt/nfsshare/tankpve0/
Rscript --vanilla slurmR.R

cat slurmR.out

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
slurmR default option for `tmp_path` (used to store auxiliar files) set to:
  /mnt/nfsshare/tankpve0
You can change this and checkout other slurmR options using: ?opts_slurmR, or you could just type "opts_slurmR" on the terminal.
Submitting job... jobid:18.
Slurm accounting storage is disabled
Error: An error has occurred when calling `silent_system2`:
Warning: An error was detected before returning the cluster object. If submitted, we will try to cancel the job and stop the cluster object.
Execution halted

vignette 'sluRm' not found

vignette("sluRm")
#> Warning: vignette 'sluRm' not found

Somehow, the getting-started instructions are not shown on the GitHub page, and the vignette() function could not find the vignette. I am submitting this question on behalf of Joshua.

Smooth way to re-run of failed jobs

In some cases, a subset of the submitted jobs may fail because of, e.g., a time limit, a memory limit, or other reasons. In such cases, it would be nice if there were a way to resubmit the jobs that failed.

This was also suggested by @millstei.

In particular, we would need to do the following (a sketch follows the list):

  • Catch errors on the Slurm side and write a particular code to the 0#-finito%a.out file indicating that there was an error.
  • This could be done by creating a function to parse Slurm output files.
  • Also, it would be best to have a function to read/write an analysis of the outputs.
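
A very rough sketch of the first bullet (assumed design, not existing slurmR code): each array task wraps its work in tryCatch() and records a status flag that a collector could later parse; do_task and status_file are placeholders.

run_task <- function(do_task, status_file) {
  ans    <- tryCatch(do_task(), error = identity)
  failed <- inherits(ans, "error")
  writeLines(if (failed) "1" else "0", status_file)   # non-zero means "re-run me"
  ans
}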

Slurm_collect throws 'Error in x$njobs : $ operator is invalid for atomic vectors'

I am testing the slurmR package on my school's HPC. Everything works great using Slurm_lapply with plan = "none" followed by an sbatch() call to launch the job array. However, I get the following strange error when using Slurm_collect.

Warning: The call to -sacct- failed. This is probably due to not having slurm accounting up and running. For more information, checkout this discussion: https://github.com/USCbiostats/slurmR/issues/29
Error in x$njobs : $ operator is invalid for atomic vectors

The code I run is

library(slurmR)
ans <- Slurm_lapply(1:10, sqrt, plan="none")
sbatch(ans)
result <- Slurm_collect(ans)

I understand my cluster does not have Slurm accounting enabled, but the error seems unrelated to the warning. When I enter debug mode, the x object has an njobs element and does not throw an error when I retrieve it directly.

[JOSS review] Inconsistent authors

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

The authors on the paper and in DESCRIPTION are different. Is there any reason for that?

Looking at the commit history, @pmarjora contributed mostly to improving the documentation. This is fine, but can you please confirm your authorship and that you are willing to take responsibility for the paper and the software (insofar as this is usually the case for co-authors on publications)?

mem-per-cpu as sbatch-opt

I am trying to specify the RAM to be used per CPU, but the option name ("mem-per-cpu") is not a valid R variable name, so the list statement (as in sbatch_opt = list(mem-per-cpu = "16G")) results in an error. I specified the name with backticks and no errors were generated, but I am not sure it actually worked. Is that the proper way to solve this, or is there another way I should be doing it?
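
Backticks around the name should do it. One hedged way to check (parameter names borrowed from the examples above) is to generate the scripts without submitting and inspect the #SBATCH lines in the resulting .sh file:

job <- Slurm_lapply(
  1:10, sqrt,
  njobs      = 2,
  sbatch_opt = list(`mem-per-cpu` = "16G"),
  plan       = "none"                 # write the scripts but do not submit
)
# Then open the generated .sh file under the job's tmp_path and look for a line like
# #SBATCH --mem-per-cpu=16G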

When submitting a job with sbatch(..., array = ) Slurm_collect fails

Slurm_collect needs to be able to collect whatever is available. I also need to work on a better way to put everything together. Right now it seems the job is not being submitted to the same folder: Slurm_collect tries to read x$opts_job$tmp_path, but this is not reflected in the job object.

PSOCK cluster backend

I just figured out that creating a sluRm type cluster object is rather easy:

  1. From a node, we need to initialize a job with N workers (nodes).

  2. Once the job has started, we can list the nodes that were assigned by typing squeue(u = [userid]).

  3. From there, the node names are given, and they actually match the nodes' addresses.

  4. With that, we can create PSOCK clusters easily by typing:

    cl <- parallel::makePSOCKcluster([list of node names])
    

    Meaning, we can use sluRm as a backend for all!

cc @pmarjora @gmweaver
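
Putting the steps together by hand (a sketch; it assumes squeue returns an uncompressed, comma-separated node list and that the nodes are reachable over ssh):

nodes <- system("squeue -u $USER -h -o %N", intern = TRUE)[1]   # e.g. "node01,node02"
nodes <- strsplit(nodes, ",", fixed = TRUE)[[1]]
cl    <- parallel::makePSOCKcluster(nodes)
parallel::parSapply(cl, seq_along(nodes), function(i) Sys.info()[["nodename"]])
parallel::stopCluster(cl)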

filesystem error: cannot rename: Directory not empty

My Slurm cluster runs a pair of prolog and epilog scripts to write performance data to the job submission folder. This seems to conflict with assumptions made by slurmR:

Success! nodenames collected (terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error',   what():  filesystem error: cannot rename: Directory not empty [sps-%JOBID%_-2] [sps-%JOBID%_-2.1],   what():  filesystem error: cannot rename: Directory not empty [sps-%JOBID%_-2] [sps-%JOBID%_-2.2], %NODENAME%). Creating the cluster object...
ssh: Could not resolve hostname terminate: Temporary failure in name resolution

The folders named sps-* are automatically created by our scripts; the job ID and node name have been replaced with the placeholders %JOBID% and %NODENAME%.

[JOSS review] Comparison table

This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.

The comparison table in the readme says that batchtools cannot re-run jobs, but batchtools is perfectly capable of it:

library(batchtools)
reg = makeRegistry(NA)
f = function(x) if (x == 3) stop(3) else x^2
btlapply(1:5, f, reg = reg)
# -> no result

# partial results are available in registry
reduceResultsList()

# re-submit failed jobs
# (job #3 will fail again of course, just for demonstration)
submitJobs(findErrors())
