giovtorres / docker-centos7-slurm
Slurm Docker Container on CentOS 7
License: MIT License
I'm using the slurm container for various tests and would like to monitor the status of jobs using the sacct command. I fire up the container:
docker run -it -h ernie giovtorres/docker-centos7-slurm:latest
and submit a simple job:
[root@ernie /]# sbatch --wrap "sleep 60"
Submitted batch job 2
[root@ernie /]# squeue -l
Fri Dec 8 09:41:47 2017
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
2 normal wrap root RUNNING 0:08 5-00:00:00 1 c1
scontrol works fine:
[root@ernie /]# scontrol show job 2
JobId=2 JobName=wrap
UserId=root(0) GroupId=root(0) MCS_label=N/A
Priority=4294901759 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:12 TimeLimit=5-00:00:00 TimeMin=N/A
SubmitTime=2017-12-08T09:41:39 EligibleTime=2017-12-08T09:41:39
StartTime=2017-12-08T09:41:39 EndTime=2017-12-13T09:41:39 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2017-12-08T09:41:39
Partition=normal AllocNode:Sid=ernie:1
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c1
BatchHost=localhost
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=500M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/
StdErr=//slurm-2.out
StdIn=/dev/null
StdOut=//slurm-2.out
Power=
However, sacct fails since the table 'slurm_acct_db.linux_job_table' doesn't exist:
[root@ernie /]# sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[root@ernie /]# cat /var/log/slurm/slurmdbd.log |tail
[2017-12-08T09:42:11.526] debug4: This could happen often and is expected.
mysql_query failed: 1146 Table 'slurm_acct_db.linux_job_table' doesn't exist
insert into "linux_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (2, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 0, 1, 0, 0, 'c1', 0, 7200, 1512726099, 1512726099, 1512726099, 'wrap', 0, 1, 4294901759, 1, 1, 9223372036854776308, 'normal', '0', NULL, 0, '1=1,2=500,3=18446744073709551614,4=1,5=1', '1=1,2=500,4=1', '/') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=0, id_user=0, id_group=0, nodelist='c1', id_resv=0, timelimit=7200, time_submit=1512726099, time_eligible=1512726099, time_start=1512726099, mod_time=UNIX_TIMESTAMP(), job_name='wrap', track_steps=0, id_qos=1, state=greatest(state, 1), priority=4294901759, cpus_req=1, nodes_alloc=1, mem_req=9223372036854776308, id_array_job=0, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, `partition`='normal', node_inx='0', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=500,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=500,4=1', work_dir='/'
[2017-12-08T09:42:11.526] error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist
[2017-12-08T09:42:11.526] error: It looks like the storage has gone away trying to reconnect
[2017-12-08T09:42:11.526] debug4: This could happen often and is expected.
mysql_query failed: 1146 Table 'slurm_acct_db.linux_job_table' doesn't exist
insert into "linux_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (2, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 0, 1, 0, 0, 'c1', 0, 7200, 1512726099, 1512726099, 1512726099, 'wrap', 0, 1, 4294901759, 1, 1, 9223372036854776308, 'normal', '0', NULL, 0, '1=1,2=500,3=18446744073709551614,4=1,5=1', '1=1,2=500,4=1', '/') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=0, id_user=0, id_group=0, nodelist='c1', id_resv=0, timelimit=7200, time_submit=1512726099, time_eligible=1512726099, time_start=1512726099, mod_time=UNIX_TIMESTAMP(), job_name='wrap', track_steps=0, id_qos=1, state=greatest(state, 1), priority=4294901759, cpus_req=1, nodes_alloc=1, mem_req=9223372036854776308, id_array_job=0, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, `partition`='normal', node_inx='0', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=500,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=500,4=1', work_dir='/'
[2017-12-08T09:42:11.526] error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist
[2017-12-08T09:42:11.526] DBD_JOB_START: cluster not registered
I cloned the repo and modified some settings in slurm.conf, to no avail. I have little experience setting up slurm so I'm unsure what changes need to be applied.
The issue has been reported before (e.g. http://thread.gmane.org/gmane.comp.distributed.slurm.devel/6333 and https://bugs.schedmd.com/show_bug.cgi?id=1943), and one proposed solution is registering the cluster with sacctmgr, which creates the missing table:
sacctmgr add cluster linux
sacctmgr add account none,test Cluster=linux \
Description="none" Organization="none"
sacctmgr add user da DefaultAccount=test
However, the first command hangs in the container.
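For what it's worth, a variant worth trying (a hedged sketch, not verified in this container; it assumes the hang is sacctmgr's interactive confirmation prompt, and that ClusterName=linux in slurm.conf, which is where the linux_job_table prefix comes from):
sacctmgr -i add cluster linux               # -i/--immediate commits without prompting
supervisorctl restart slurmdbd slurmctld    # let slurmctld re-register with slurmdbd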
Do you have any idea for a solution?
Cheers,
Per
Everything is OK, and I try to place a job with the srun command: srun -n 32 slurmctl
But Docker displays srun: job 1 queued and waiting for resources forever. What's the reason?
Thank you for your work. I tried to convert the Dockerfile to Ubuntu format.
But I keep getting the following error from MySQL:
$ /usr/bin/mysqld_safe
220422 10:39:58 mysqld_safe Logging to syslog.
220422 10:39:58 mysqld_safe Starting mariadbd daemon with databases from /var/lib/mysql
$ mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (2)
Do you have any idea how this could be solved?
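One thing worth trying (a hedged sketch; it assumes either that the socket directory is missing on Ubuntu or that the mysql client races the backgrounded server):
mkdir -p /run/mysqld && chown mysql:mysql /run/mysqld   # directory mysqld_safe expects
/usr/bin/mysqld_safe &
for i in $(seq 1 30); do [ -S /run/mysqld/mysqld.sock ] && break; sleep 1; done
mysql -e 'SELECT 1'                                     # should connect once the socket exists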
Hello Giovanni!
I find this Docker container really useful and I am interested in using it for development with pyslurm as well. I am not sure how to connect the two, though. I cannot install pyslurm with "pip install pyslurm" inside this container. Do you have any suggestions on how to make it work with the Slurm headers and libraries here and pyslurm?
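One route that may work (a sketch; the build options and paths are assumptions based on pyslurm's source-install conventions, not verified in this image):
yum install -y python3-devel gcc git
git clone https://github.com/PySlurm/pyslurm.git && cd pyslurm
# check out the pyslurm branch matching the output of `sinfo --version` first
python3 setup.py build --slurm-lib=/usr/lib64 --slurm-inc=/usr/include
python3 setup.py install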
Have a nice day!
Jakob
I tried to run the container with the following docker-compose.yml:
version: '3'
services:
  slurm:
    image: giovtorres/docker-centos7-slurm:latest
    build: .
    hostname: ernie
    stdin_open: true
    tty: true
    volumes:
      - ./volumes/lib:/var/lib/slurmd
      - ./volumes/spool:/var/spool/slurmd
      - ./volumes/log:/var/log/slurm
      - ./volumes/db:/var/lib/mysql
but it fails on starting MariaDB:
$ docker-compose up
Creating docker-centos7-slurm_slurm_1 ... done
Attaching to docker-centos7-slurm_slurm_1
slurm_1 | - Initializing database
slurm_1 | - Database initialized
slurm_1 | - Updating MySQL directory permissions
slurm_1 | - Starting MariaDB to create Slurm account database
slurm_1 | 191017 08:41:46 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
slurm_1 | 191017 08:41:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
slurm_1 | - Starting MariaDB to create Slurm account database
slurm_1 |   [previous message repeated 30 times in total]
slurm_1 | MariaDB did not start
docker-centos7-slurm_slurm_1 exited with code 1
Did I miss something?
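My guess, for what it's worth (a hedged sketch, unverified): the bind-mounted ./volumes/db is owned by root, so MariaDB cannot initialize its datadir. Checking the MariaDB log and handing the directory to the image's mysql user might get further:
docker-compose run --rm slurm cat /var/log/mariadb/mariadb.log   # look for the real error
docker run --rm giovtorres/docker-centos7-slurm:latest id -u mysql
sudo chown -R <uid-from-above> volumes/db
docker-compose up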
ATM the tagged images are 2 years old although recent PRs were merged. Would be nice to get them updated. I see two possible ways to set up auto-updating (to avoid manual pains); one would be for giovtorres to become an organization, etc., which might be painful. WDYT @giovtorres?
Using the Docker Hub README, I couldn't get the cluster to start.
README (Works great, thanks!):
docker run -it -h slurmctl --cap-add sys_admin giovtorres/docker-centos7-slurm:latest
Docker Hub (failure output below):
docker run -it -h ernie giovtorres/docker-centos7-slurm:latest
This fails; I suppose the hostname must be slurmctl.
- Initializing database
- Database initialized
- Starting MariaDB to create Slurm account database
231031 13:57:20 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
231031 13:57:20 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
- Starting MariaDB to create Slurm account database
- Creating Slurm acct database
- Slurm acct database created. Stopping MariaDB
- Starting supervisord process manager
- Starting munged
munged: started
- munged is in the RUNNING state.
- Starting mysqld
mysqld: started
- mysqld is in the RUNNING state.
- Starting slurmdbd
slurmdbd: started
- slurmdbd is in the RUNNING state.
- Starting slurmctld
slurmctld: ERROR (spawn error)
- slurmctld is in the BACKOFF state.
- slurmctld is in the STARTING state.
- slurmctld is in the BACKOFF state.
- slurmctld is in the BACKOFF state.
- slurmctld is in the STARTING state.
- slurmctld is in the FATAL state.
  [previous message repeated 6 times in total]
- Starting slurmd
slurmd: ERROR (spawn error)
- slurmd is in the BACKOFF state.
- slurmd is in the STARTING state.
- slurmd is in the BACKOFF state.
- slurmd is in the BACKOFF state.
- slurmd is in the STARTING state.
- slurmd is in the FATAL state.
  [previous message repeated 6 times in total]
- Port 6817 is not listening
  [previous message repeated 11 times in total]
- Port 6818 is not listening
  [previous message repeated 11 times in total]
- Port 6819 is listening
- Waiting for the cluster to become available
sinfo: error: get_addr_info: getaddrinfo() failed: Name or service not known
sinfo: error: slurm_set_addr: Unable to resolve "slurmctl"
sinfo: error: Unable to establish control machine address
slurm_load_partitions: No such file or directory
  [the four lines above repeated 11 times in total]
Slurm partitions failed to start successfully.
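Consistent with the README command above working, a sketch of two ways to satisfy the resolver (SlurmctldHost=slurmctl is inferred from the 'Unable to resolve "slurmctl"' errors, not checked against the image's slurm.conf; the --add-host alias is untested):
docker run -it -h slurmctl giovtorres/docker-centos7-slurm:latest
docker run -it -h ernie --add-host slurmctl:127.0.0.1 giovtorres/docker-centos7-slurm:latest   # keep ernie, alias slurmctl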
Hi,
Using the command
docker run -it -h ernie giovtorres/docker-centos7-slurm:17.02.9
the program 'slurmctld' is not running:
[root@ernie /]# supervisorctl status
munged RUNNING pid 263, uptime 0:00:17
mysqld RUNNING pid 491, uptime 0:00:13
slurmctld EXITED Nov 13 12:50 PM
slurmd RUNNING pid 262, uptime 0:00:17
slurmdbd RUNNING pid 266, uptime 0:00:17
I am using docker 17.05.0-ce @ ubuntu 16.04.
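What I would check next (a sketch; the log path is an assumption based on the container's layout):
supervisorctl tail slurmctld        # last output captured by supervisord
cat /var/log/slurm/slurmctld.log    # slurmctld's own log, if present
supervisorctl restart slurmctld     # retry once the cause is fixed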
Best regards,
Bernd
In the current Dockerfile, pkgconfig is listed as pkconfig :)
I have submitted 2 jobs but they don't run at the same time:
[root@slurmctl /]# sbatch -n1 --wrap="sleep 10"
Submitted batch job 1
[root@slurmctl /]# sbatch -n1 --wrap="sleep 10"
Submitted batch job 2
[root@slurmctl /]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 normal wrap root PD 0:00 1 (Resources)
1 normal wrap root R 0:09 1 c1
Job-2 waits until Job-1 has completed. Is there any way to let them run at the same time? Since there are 4 free nodes, I thought we could run 4 processes at the same time.
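A workaround sketch (assuming the other nodes are actually up; the (Resources) reason above suggests only c1 may be usable, so this is a guess): pin each job to its own node, or allow sharing where the partition permits it:
sbatch -n1 -w c1 --wrap="sleep 10"    # -w/--nodelist pins the job to a node
sbatch -n1 -w c2 --wrap="sleep 10"
sbatch -n1 --oversubscribe --wrap="sleep 10"    # alternative, if OverSubscribe allows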
Hi,
this probably relates back to #38, which addressed issue #37. The partitions are up, but if you submit the example job to the debug partition, it runs indefinitely, e.g.:
docker exec dockercentos7slurm_slurm_1 sbatch --wrap="sleep 10" --partition debug
I added the following to tests/test_slurm.py:test_job_can_run:
time.sleep(2)
res = host.run(f'sacct -o State --parsable --noheader -j {jobid}')
assert "COMPLETE" in res.stdout
to verify that jobs complete; see the GitHub Actions output on my fork. Unfortunately I have no immediate solution, but I thought I'd let you know. I make use of the debug partition in a CI test, so I will see if I can find a fix.
Cheers,
Per
I just added Lmod support and an example hello modulefile that adds a hello-world script that does the obvious.
While this isn't exactly Slurm, I guess many Slurm setups use this, so would you be interested in taking it upstream?
It is findable here: https://github.com/AaltoSciComp/docker-centos7-slurm/ (on master currently, no direct link to the commit since I am likely to rebase)
I think it might cause a problem that the /bin/bash process is PID 1 inside the container. When the container is requested to stop, it takes 10 seconds to shut down because bash does not propagate signals as it should, i.e., Docker kills PID 1 inside the container.
Running supervisord directly as process 1 does not seem to mitigate the problem. The container still needs a proper init. See the citation below.
Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems.
It shares some of the same goals of programs like launchd, daemontools, and runit. Unlike some of these programs, it is not meant to be run as a substitute for init as “process id 1”. Instead it is meant to be used to control processes related to a project or a customer, and is meant to start like any other program at boot time.
Suggestion:
Use the docker-compose 2.4 format with init: true, or include tini explicitly in the Dockerfile, prepended to the ENTRYPOINT.
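A sketch of both options (the tini version and the entrypoint path are assumptions, not taken from this repo):
# docker-compose 2.4 variant:
version: '2.4'
services:
  slurm:
    image: giovtorres/docker-centos7-slurm:latest
    init: true
# Dockerfile variant, with tini prepended to the ENTRYPOINT:
ADD https://github.com/krallin/tini/releases/download/v0.19.0/tini /tini
RUN chmod +x /tini
ENTRYPOINT ["/tini", "--", "/usr/local/bin/docker-entrypoint.sh"]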
Hi,
Thank you for this container, it's very useful. I would like to compose it with my Python application image (which has lots of tool dependencies) that needs Slurm and MySQL.
So how can I do that? I probably need to modify the docker-compose.yml, adding my application as a service?
Any advice is welcome. Best,
Véronique
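A starting-point sketch (the myapp service and image names are placeholders; for the app container to run Slurm client commands it would also need the munge key and slurm.conf from the Slurm container, which this sketch does not cover):
version: '3'
services:
  slurm:
    image: giovtorres/docker-centos7-slurm:latest
    hostname: slurmctl
  myapp:
    image: my-python-app:latest    # placeholder for your application image
    depends_on:
      - slurm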
Hi,
I'm using docker-centos7-slurm to test a workflow manager. It has been a while since updating, but when trying out the most recent version, I notice that only one node (c1) is up in the container. I am currently testing this in my fork (see pr #1). Briefly, I parametrized test_job_can_run to pass the partition to the --partition option. The normal partition works as expected, but debug fails.
If one enters the latest image with
docker run -it -h slurmctl giovtorres/docker-centos7-slurm:latest
running sinfo yields
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
c1 1 normal* idle 1 1:1:1 1000 0 1 (null) none
c2 1 normal* unknown* 1 1:1:1 1000 0 1 (null) none
c3 1 debug unknown* 1 1:1:1 1000 0 1 (null) none
c4 1 debug unknown* 1 1:1:1 1000 0 1 (null) none
See the GitHub Actions results, where I added some print statements to see what was going on (never mind that the test actually passed; I was simply looking at the erroneous Slurm output file). I consistently get the feedback that the required nodes are not available; it would seem node c1 is the only node available to sbatch.
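A diagnostic sketch (whether the unknown* state of c2-c4 in the sinfo output above is the root cause is a guess):
scontrol show node c2 | grep -i reason            # why the node is not up
scontrol update NodeName=c2,c3,c4 State=RESUME    # try returning the nodes to service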
Are you able to reproduce this?
Cheers,
Per
We need a REST API via slurmrestd. Could you please add it? Thanks.
I am wondering if this setup could be used to simulate a Slurm configuration using the task/affinity or task/cgroup mode for TaskPlugin. How do you feel, @giovtorres?
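For reference, a sketch of the slurm.conf line in question (whether cgroup enforcement works inside an unprivileged container is doubtful; task/affinity alone seems more likely to function):
TaskPlugin=task/affinity,task/cgroup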
I would love to test some issues we see regarding the interplay of MPI and Slurm. I am a bit new to Docker, so I wonder how I would install (say) OpenMPI from the CentOS repos on the nodes which are spawned from the Dockerfile of this repo?
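A sketch (it relies on all the "nodes" living in one container, so a single install covers every node; package names are the stock CentOS 7 ones):
yum install -y openmpi openmpi-devel    # inside the container; binaries land in /usr/lib64/openmpi/bin
# or baked into a derived image:
# FROM giovtorres/docker-centos7-slurm:latest
# RUN yum install -y openmpi openmpi-devel && yum clean all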