slurmR: A Lightweight Wrapper for Slurm
Home Page: https://uscbiostats.github.io/slurmR/
License: Other
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
This is nitpicking, but I'd suggest using file.path() instead of paste() or paste0(). According to the documentation, file.path() should be faster.
Feel free to just close this issue if you disagree or have other reasons not to use file.path().
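For illustration, a minimal sketch of the difference (the directory and file names below are placeholders, not paths slurmR actually uses):
# Building a path with paste0() versus file.path(); both return the same
# string here, but file.path() is the idiomatic choice for file paths.
tmp_path <- "/staging/user"   # placeholder directory
job_name <- "myjob"           # placeholder job name
paste0(tmp_path, "/", job_name, "/script.R")
file.path(tmp_path, job_name, "script.R")
# Both return "/staging/user/myjob/script.R"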
I found a handful of typos in your package (not all may be valid). Please fix.
WORD             FOUND IN
accesible        new_rscript.Rd:19, Slurm_EvalQ.Rd:21, Slurm_lapply.Rd:31
batchtools       README.md:403,455
circunscribed    getting-started.Rmd:76
Debuggin         opts_sluRm.Rd:23
debuging         README.md:30, README.Rmd:35
elipsis          getting-started.Rmd:38
estiates         getting-started.Rmd:93
EvalQ            the_plan.Rd:59
frac             getting-started.Rmd:72
mbox             getting-started.Rmd:72
nd               getting-started.Rmd:48
OverSubscribe    getting-started.Rmd:142
parallelisation  getting-started.Rmd:93
parallelization  getting-started.Rmd:48
parallelize      getting-started.Rmd:59
radious          getting-started.Rmd:76
sbatch           slurm_job.Rd:33,52, Slurm_lapply.Rd:100, the_plan.Rd:32,35
schedmd          Slurm_lapply.Rd:97
scontrol         getting-started.Rmd:139
specity          Slurm_EvalQ.Rd:16, Slurm_lapply.Rd:26
splited          Slurm_lapply.Rd:65
splitIndices     Slurm_lapply.Rd:62
submited         getting-started.Rmd:21
th               README.md:583
ThreadsPerCore   getting-started.Rmd:139,142
underlaying      sbatch.Rd:81
which's          getting-started.Rmd:76
wikipedia        getting-started.Rmd:76
The Slurm_lapply function fails, reporting job ids of NA, when run on a federated SLURM cluster. In this case the parallel slurm jobs were successfully started, but the parent process failed to parse the output from the sbatch command.
On a federated SLURM cluster, when the cluster name is specified in the sbatch_opts (and passed to the sbatch command), the output from sbatch looks like:
Submitted batch job 8653762 on cluster name_of_cluster
The regular expression used to parse this output and capture the job id on lines 142 and 224 of sbatch.R is:
".+ (?=[[:digit:]]+$)"
The "$" in that expression prevents the pattern from matching the sbatch output since there are characters following the job id. I suspect just removing the $ will solve the problem. I tried recoding that line as:
jobid <- as.integer(regmatches(ans,regexpr("[[:digit:]]+",ans)))
and that worked as well.
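For reference, a small reproduction of the parsing difference (the federated output line is taken from this report; perl = TRUE is assumed for the lookahead pattern):
# sbatch output on a federated cluster: the job id is no longer at the end of the line.
ans <- "Submitted batch job 8653762 on cluster name_of_cluster"
# Current pattern: anchored with $, so it finds no match here.
regmatches(ans, regexpr(".+ (?=[[:digit:]]+$)", ans, perl = TRUE))
#> character(0)
# Proposed: grab the first run of digits, which is the job id.
as.integer(regmatches(ans, regexpr("[[:digit:]]+", ans)))
#> [1] 8653762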
If the error happens, say, when loading the package, the array job hangs. Need to add a tryCatch at that point (and at intermediate points, just in case).
Hello, I have two files. One is a .sh file containing a Slurm script with a job array of 1000. The other is an R script that uses SLURM_TASK_ARRAY_ID as a seed, which is used throughout the script.
Since I am a Mac user and don't have Slurm installed, I want to use only the slurmR package.
I am asking whether there is a way to get SLURM_TASK_ARRAY_ID using only slurmR in RStudio. If it is possible, I can run my 1000 jobs.
Thank you
Hi,
Using doParallel and the foreach construct, we get the following error:
Warning: Permanently added 't077,10.12.0.77' (ECDSA) to the list of known hosts.
/storage/apps/R/4.2.1/GNU/lib64/R/bin/exec/R: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory
It is not clear to us if the problem lies in the HPC, in the R configuration on the machine or in the Slurm library.
Thanks in advance.
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
Temporary files are created in the current working directory.
If these are not accessible from the network, jobs will fail.
This should be documented.
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
The JOSS check list states:
Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
However, the only contributing guidelines are a linked CoC.
Can you add a sentence on how to report issues/PRs/seek support? (e.g. should users open an issue, email the author, or send a carrier pigeon)
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
It would be very important for the user to have a vignette with a more realistic example, especially on how to deal with failed jobs (retrieve logs, run a job interactively to generate a traceback (is this possible?), re-submit a subset of failed jobs).
Hello everyone,
When using sourceSlurm or slurmr from the command line, and the R file contains the following sbatch directives:
#!/bin/sh
#SBATCH --job-name=simRMD
#SBATCH --mem-per-cpu=10G
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
it returns the following error:
sbatch: error: Invalid --exclusive specification
Return code (status): 255
Looking at the generated .sh file, I see that slurmr appends the following: #SBATCH --exclusive=--exclusive
Is that behaviour expected?
Best,
Mahmoud
I am trying to fit a Stan model using the cmdstanr package. Using cmdstan requires declaring the path where cmdstan sits. On my local computer, this does not need to be set once cmdstan has been installed. However, on the HPC, we need to tell the nodes where to find cmdstan by issuing a command like this (which is specific to my setup):
set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")
In the code below, I have commented out the two places where I know it does work. However, I'd rather not have to embed this command in the code, but would rather have it in the Slurm_lapply call or in some system setting. Basically, I don't really want to have to worry about this when coding.
I can comment out the second command (just before the cmdstan_model call) if I specify the expression in my sbatch code:
Rscript --vanilla cmdstanr_test.R -e 'set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")'
However, as you well know, this does not propagate to the nodes accessed as part of the job array. My attempt at using the rscript_opt argument did not do the trick. Any suggestions?
library(slurmR)
library(cmdstanr)
library(simstudy)
library(data.table)
s_estimate <- function(sigma, s_model, K) {
  fit <- s_model$sample(
    data = list(K = K, sigma = sigma),
    seed = 123,
    chains = 4,
    parallel_chains = 4,
    refresh = 500,
    iter_warmup = 250,
    iter_sampling = 500
  )
  return(fit)
}
s_extract <- function(fit) {
  x <- fit$draws()
  return(as.data.table(x))
}
iteration <- function(sigma, s_model, K) {
  # set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")
  fit <- s_estimate(sigma, s_model, K)
  dd <- s_extract(fit)
  return(dd)
}
#---
# set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")
mod <- cmdstan_model("/gpfs/data/troxellab/ksg/r/cmdstanr_test.stan")
job <- Slurm_lapply(
  1:10,
  iteration,
  s_model = mod,
  K = 4,
  njobs = 8,
  mc.cores = 4,
  tmp_path = "/gpfs/data/troxellab/ksg/scratch",
  overwrite = TRUE,
  job_name = "i_cmd",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("s_estimate", "s_extract"),
  plan = "wait",
  rscript_opt = list('set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")'))
job
res <- Slurm_collect(job)
slurmR should detect whenever it is being run from within a job. If that is the case, the following variables should provide the defaults:
$SLURM_JOB_PARTITION
$SLURM_JOB_ACCOUNT
$SLURM_CLUSTER_NAME
This way, users can skip writing this information twice when possible.
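A minimal sketch of the idea (an assumed approach, not current slurmR code): check whether the session is already inside a Slurm job and reuse the scheduler-provided values as defaults. The names partition/account/clusters mirror the corresponding sbatch flags.
# Inside a running Slurm job these variables are set by the scheduler; reusing
# them avoids asking the user to repeat the partition/account/cluster name.
inside_job <- nzchar(Sys.getenv("SLURM_JOB_ID"))
defaults <- list(
  partition = Sys.getenv("SLURM_JOB_PARTITION"),
  account   = Sys.getenv("SLURM_JOB_ACCOUNT"),
  clusters  = Sys.getenv("SLURM_CLUSTER_NAME")
)
if (inside_job) defaults[nzchar(unlist(defaults))] else list()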
libPaths are not properly parsed into the rscript's library calls when using an renv project.
The function list_loaded_pkgs attempts to parse package directory information from sessionInfo(), which for renv projects using a cache is the base directory of the cache (which does not have a library tree structure), rather than the project-specific renv libPath (which is a library tree with symlinks to the cached packages).
Is there a particular reason why the libPaths variable is not respected when loading packages from the parent environment?
I have forked the repo and made a change here to ensure that the library calls always respect the specified libPaths variable. Is there a situation where this is not desirable?
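For context, a minimal sketch of what respecting .libPaths() could look like when writing the generated R script (illustrative only; pkgs is a hypothetical package list, and this is not the patched list_loaded_pkgs):
# In an renv project, .libPaths()[1] points at the project library (a proper
# library tree of symlinks into the cache), unlike the cache base directory
# surfaced via sessionInfo(). Emitting library() calls against it keeps the
# generated script loading the same package versions as the parent session.
pkgs <- c("data.table", "Matrix")   # hypothetical packages
lib  <- .libPaths()[1L]             # project-specific library
cat(sprintf('library(%s, lib.loc = "%s")', pkgs, lib), sep = "\n")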
Is it possible to include a toggle in makeSlurmCluster that would allow retaining the Slurm output file? I would like to be able to track the progress of a future call using a Slurm socket cluster, but it appears that clean() is called to remove the output file once makeSlurmCluster completes.
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
The seeds currently default to 1:njobs. I've followed a similar approach in batchtools, where I determine a start seed randomly and then increment the seed for each job. Even with a random initialization, I'm not sure if this approach is feasible (see mllg/batchtools#81). Defaulting to 1:njobs seems even worse; there is a reason RNGs are not initialized with a constant value.
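A short sketch of the batchtools-style alternative described above (a random start seed incremented per job; illustrative only, not a slurmR interface):
# Draw one random start seed, then give each array task start_seed + (i - 1);
# each task would call set.seed() on its own value before doing any work.
njobs      <- 4L
start_seed <- sample.int(.Machine$integer.max - njobs, 1L)
job_seeds  <- start_seed + seq_len(njobs) - 1L
job_seeds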
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
You gave an updated view of the package's strengths in pre-review:
openjournals/joss-reviews#1428 (comment).
It would be nice if this were better reflected in the paper (i.e., just updated to include your points, e.g. resubmit on failure).
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
The following fails:
library(sluRm)
Slurm_lapply(1, function(x) x*2)
# Error in splits[[j]] : subscript out of bounds
# In addition: Warning message:
# `X` is not a list. The function will coerce it into one using `as.list`
Related: Internal parallelism via mc.cores defaults to 2, which wastes a core if there are not more calls than jobs.
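To make the point concrete (a hedged illustration, not slurmR's actual default logic):
# With 4 calls spread over 4 jobs, each job evaluates a single call, so forking
# with mc.cores = 2 leaves one core idle in every job.
X     <- as.list(1:4)
njobs <- 4L
calls_per_job <- ceiling(length(X) / njobs)   # 1
mc_cores      <- min(2L, calls_per_job)       # 1 would be the sensible value
mc_cores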
Maybe I'm doing this wrong, but if so then I can't figure out how to do it right.
I'm using opts_slurmR$set_opts() to set options that I want to appear in the #SBATCH lines at the beginning of the job script. For example:
opts_slurmR$set_opts(partition="debug", `get-user-env` = "", `mail-type` = "BEGIN, FAIL, END")
The options appear to have been set correctly, based on the output of opts_slurmR and opts_slurmR$get_opts_job(). However, they do not appear in myjob.sh. Any guidance would be appreciated.
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
The citation for batchtools is wrong: that one is for BatchJobs. Rather, use this:
If the output cannot be coerced to a list of the same length, the default hook fails.
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
As far as I can tell, individual function calls are always submitted as array jobs.
What if you have millions of calls?
Does that mean sluRm will submit an array with millions of entries?
In other words, can sluRm chunk together multiple function calls in one job?
I assume it cannot. This is fine, but should be documented.
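For illustration, chunking could be done along these lines (a hedged sketch using base R; this is not a documented sluRm feature):
# Split a long input into njobs chunks; each array task then loops over its own
# chunk instead of the scheduler seeing one array entry per function call.
X      <- as.list(seq_len(1e4))
njobs  <- 100L
chunks <- parallel::splitIndices(length(X), njobs)
# Inside array task i one would run:
#   res_i <- lapply(X[chunks[[i]]], FUN)
length(chunks)          # 100 tasks
lengths(chunks)[1:3]    # ~100 calls per task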
In the case of the Slurm_lapply function, make sure of the following: the arguments passed via ... are coherent with the arguments received by the function.
Hi, when I run the R code on Linux it works well. But when I run R in batch mode on Linux, I always get the same error. The following are my .sh file, .R file and .out file. APSIM.out was created automatically. It shows "Error in save.image(name) :", but I already told R not to save the workspace with "--vanilla". Do you have any suggestions? Maybe it is not related to the apsimx package, but to R. Thanks!
APSIM.sh
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --job-name="myjob"
#SBATCH --partition=secondary
#SBATCH --output=myjob.o%j
#SBATCH --dependency=afterany:<JobID>
conda activate ly-envi
cd xx/xx/xx/apsim
srun R --vanilla CMD BATCH APSIM.R APSIM.out
APSIM.R
library(soilDB)
library(sp)
library(sf)
library(spData)
library(apsimx)
library(raster)
library(lubridate)
estimated_sowing_list = list()
apsim_path = "/data/keeling/a/xx/apsim/apsim_2016.txt"
apsim_input <- read.delim(apsim_path,header = FALSE, sep = ",", dec = ",")
colnames(apsim_input) <- c('SD','B21','B25','Long','Lati')
head(apsim_input)
apsimx_options(exe.path = "/data/keeling/a/xxx/ApsimX/bin/Debug/netcoreapp3.1/Models")
for (i in 1:2) {
options(digits=15)
filed_latilongi = c(as.double(as.character(apsim_input$Long[i])) , as.double(as.character(apsim_input$Lati[i]) ) )
# extd.dir <- system.file("extdata", package = "apsimx")
tmp.dir ="/data/keeling/a/xxx/apsim/"
dmet12 <- get_daymet_apsim_met(lonlat = filed_latilongi, years = c(2016))
write_apsim_met(dmet12,wrt.dir = tmp.dir, filename = "test.met")
print(i)
}
APSIM.out
[Previously saved workspace restored]
>
> library(soilDB)
> library(sp)
> library(sf)
Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1; sf_use_s2() is TRUE
> library(spData)
To access larger datasets in this package, install the spDataLarge
package with: `install.packages('spDataLarge',
repos='https://nowosad.github.io/drat/', type='source')`
> library(apsimx)
APSIM(X) not found.
If APSIM(X) is installed in an alternative location,
set paths manually using 'apsimx_options' or 'apsim_options'.
You can still try as the package will look into the registry (under Windows)
> library(raster)
> library(lubridate)
Attaching package: 'lubridate'
The following objects are masked from 'package:raster':
intersect, union
The following objects are masked from 'package:base':
date, intersect, setdiff, union
> estimated_sowing_list = list()
>
>
> apsim_path = "/data/keeling/a/xxx/apsim/apsim_2016.txt"
> apsim_input <- read.delim(apsim_path,header = FALSE, sep = ",", dec = ",")
> colnames(apsim_input) <- c('SD','B21','B25','Long','Lati')
> head(apsim_input)
SD B21 B25 Long Lati
1 111 145 150 -89.48824446937473 39.29027168436681
2 108 133 137 -89.5437797576881 39.21354130583063
3 144 164 168 -89.5826152805141 39.08288777713276
4 115 132 138 -89.83044110359363 39.12534519941483
5 145 155 159 -89.51309913395067 39.370810476566184
6 109 140 144 -89.37520541425403 39.49220820731174
>
> apsimx_options(exe.path = "/data/keeling/a/xxx/ApsimX/bin/Debug/netcoreapp3.1/Models")
>
> for (i in 1:1) {
+
+ options(digits=15)
+ filed_latilongi = c(as.double(as.character(apsim_input$Long[i])) , as.double(as.character(apsim_input$Lati[i]) ) )
+
+ # extd.dir <- system.file("extdata", package = "apsimx")
+ tmp.dir ="/data/keeling/a/xxx/apsim/"
+
+ dmet12 <- get_daymet_apsim_met(lonlat = filed_latilongi, years = c(2016),silent=TRUE)
+ write_apsim_met(dmet12,wrt.dir = tmp.dir, filename = "test.met")
+ summary(dmet12)
+ ## Check for reasonable ranges
+ check_apsim_met(dmet12)
+ print(i)
+
+ }
[1] 1
Warning message:
In check_apsim_met(dmet12) :
Last year in the met file is a leap year and it only has 365 days
>
> proc.time()
user system elapsed
5.752 0.380 13.795
Error in save.image(name) :
image could not be renamed and is left in .RDataTmp
Calls: sys.save.image -> save.image
In addition: Warning message:
In file.rename(outfile, file) :
cannot rename file '.RDataTmp' to '.RData', reason 'No such file or directory'
Execution halted
When calling the command line utility, slurmr, the source file is replaced by a batch file.
As reported before by @gmweaver, apparently mclapply is prone to segfault errors depending on the version of BLAS used in R. A possible solution to this problem could be giving the user the option to choose either forking or a SOCK cluster, the latter more expensive but safer.
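A minimal sketch of the fork-versus-socket choice (plain parallel calls, not a proposed slurmR interface):
# Fork-based mclapply is cheap but can segfault with some BLAS builds; a PSOCK
# cluster launches fresh R processes and avoids that, at a higher startup cost.
use_fork <- FALSE   # the toggle the user would control
if (use_fork) {
  res <- parallel::mclapply(1:4, sqrt, mc.cores = 2L)
} else {
  cl  <- parallel::makePSOCKcluster(2L)
  res <- parallel::parLapply(cl, 1:4, sqrt)
  parallel::stopCluster(cl)
}
unlist(res)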
I set up a Slurm cluster and I can issue srun -N4 hostname just fine.
I keep seeing "silent_system2" errors. I've installed slurmR using devtools::install_github("USCbiostats/slurmR").
I'm following example 3 from https://github.com/USCbiostats/slurmR
Here are my files:
cat slurmR.R
library(doParallel)
library(slurmR)
cl <- makeSlurmCluster(4)
registerDoParallel(cl)
m <- matrix(rnorm(9), 3, 3)
foreach(i = 1:nrow(m), .combine = rbind) %dopar% (m[i, ]^2)
stopCluster(cl)
print(m)
cat rscript.slurm
#!/bin/bash
#SBATCH --output=slurmR.out
cd /mnt/nfsshare/tankpve0/
Rscript --vanilla slurmR.R
cat slurmR.out
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
slurmR default option for `tmp_path` (used to store auxiliar files) set to:
/mnt/nfsshare/tankpve0
You can change this and checkout other slurmR options using: ?opts_slurmR, or you could just type "opts_slurmR" on the terminal.
Submitting job... jobid:18.
Slurm accounting storage is disabled
Error: An error has occurred when calling `silent_system2`:
Warning: An error was detected before returning the cluster object. If submitted, we will try to cancel the job and stop the cluster object.
Execution halted
vignette("sluRm")
#> Warning: vignette 'sluRm' not found
Somehow, the instructions to get started are not shown on the GitHub page, and the vignette function could not find the vignette. I am submitting this question on behalf of Joshua.
In some cases, a subset of the submitted jobs may fail because of, e.g., the time limit, the memory limit, or other reasons. In such cases, it would be nice if there were a way to resubmit the jobs that failed.
This was also suggested by @millstei.
In particular, we would need to do the following:
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
Consider the following:
job1 <- Slurm_EvalQ(sluRm::WhoAmI(), njobs = 1, plan = "submit")
job2 <- Slurm_EvalQ(sluRm::WhoAmI(), njobs = 1, plan = "submit")
Slurm_clean(job1)
Slurm_collect(job2)
# Error: Nothing to retrieve. (see ?status).
I assume this is not intended behaviour.
The following code should be fixed here:
Line 259 in 17f7ad2
For something like:
State <- lapply(STATE_CODES, function(jsc) {
  m <- grepl(paste0(jsc, collapse = "|"), State)
  JobID[which(m)]
})
I am testing the slurmR package on my school HPC. Everything works great using Slurm_lapply with plan = "none" and then an sbatch call to launch the job array. However, I get the following strange error when using Slurm_collect.
Warning: The call to -sacct- failed. This is probably due to not having slurm accounting up and running. For more information, checkout this discussion: https://github.com/USCbiostats/slurmR/issues/29
Error in x$njobs : $ operator is invalid for atomic vectors
The code I run is
library(slurmR)
ans <- Slurm_lapply(1:10, sqrt, plan="none")
sbatch(ans)
result <- Slurm_collect(ans)
I understand my cluster does not have slurm accounting enabled - but it seems the error is unrelated to the warning? However, when I enter debug mode, the x object has an njobs attribute and does not throw an error when I retrieve it directly.
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
--chdir
is not available for all slurm installations.
$ sbatch --version
slurm 16.05.9
$ sbatch --chdir=~
sbatch: unrecognized option '--chdir=~'
Try "sbatch --help" for more information
-D seems to be a workaround.
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
The authors on the paper and in DESCRIPTION are different. Is there any reason for that?
Looking at the commit history, @pmarjora contributed mostly to improving documentation. This is fine, but can you please confirm your authorship and that you are willing to take responsibility for the paper and the software (insofar as this is usually the case for co-authors on publications)?
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
https://travis-ci.org/USCbiostats/sluRm/branches
Can you please fix your tests?
I am trying to specify the RAM to be used per CPU, but the option name ("mem-per-cpu") is not a valid R variable name. So, the list statement (as in sbatch_opt(list(mem-per-cpu = "16G"))) results in an error. I specified the variable name with backticks, and no errors were generated, but I am not sure it has actually worked. Is that the proper way to solve this issue, or is there another way I should be doing it?
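Backticks do work for carrying such a name; a small base-R check, independent of slurmR:
# Non-syntactic names like mem-per-cpu are fine in a list when backtick-ed or
# quoted; the name is stored verbatim.
opts <- list(`mem-per-cpu` = "16G", partition = "debug")
names(opts)
#> [1] "mem-per-cpu" "partition"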
Slurm_collect needs to be able to collect whatever is available. Also, I need to work on a better way to put everything together. Right now it seems that it is not submitting the job to the same folder. Slurm_collect is trying to get x$opts_job$tmp_path, but this is not reflected in the job object.
I just figured out that creating a sluRm-type cluster object is rather easy:
From a node, we need to initialize a job with N workers (nodes).
Once the job has started, we can list the nodes that were assigned by typing squeue(u = [userid]).
From there, the node name is given, which actually matches its address.
With that, we can create PSOCK clusters easily by typing:
cl <- parallel::makePSOCKcluster([list of node names])
Meaning, we can use sluRm as a backend for all!
My Slurm cluster runs a pair of prolog and epilog scripts to write performance data to the job submission folder. This seems to conflict with assumptions made by slurmR:
Success! nodenames collected (terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error', what(): filesystem error: cannot rename: Directory not empty [sps-%JOBID%_-2] [sps-%JOBID%_-2.1], what(): filesystem error: cannot rename: Directory not empty [sps-%JOBID%_-2] [sps-%JOBID%_-2.2], %NODENAME%). Creating the cluster object...
ssh: Could not resolve hostname terminate: Temporary failure in name resolution
Those folders named sps-* are automatically created by our scripts. Job ID and node name are replaced with the placeholders %JOBID% and %NODENAME%.
This is a part of the JOSS review outlined in openjournals/joss-reviews#1493.
The comparison table in the readme says that batchtools cannot re-run jobs, but batchtools is perfectly capable of it:
library(batchtools)
reg = makeRegistry(NA)
f = function(x) if (x == 3) stop(3) else x^2
btlapply(1:5, f, reg = reg)
# -> no result
# partial results are available in registry
reduceResultsList()
# re-submit failed jobs
# (job #3 will fail again of course, just for demonstration)
submitJobs(findErrors())