giovtorres / docker-centos7-slurm
Slurm Docker Container on CentOS 7
License: MIT License
I'm using the slurm container for various tests and would like to monitor the status of jobs using the sacct command. I fire up the container:
docker run -it -h ernie giovtorres/docker-centos7-slurm:latest
and submit a simple job:
[root@ernie /]# sbatch --wrap "sleep 60"
Submitted batch job 2
[root@ernie /]# squeue -l
Fri Dec 8 09:41:47 2017
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
2 normal wrap root RUNNING 0:08 5-00:00:00 1 c1
scontrol works fine:
[root@ernie /]# scontrol show job 2
JobId=2 JobName=wrap
UserId=root(0) GroupId=root(0) MCS_label=N/A
Priority=4294901759 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:12 TimeLimit=5-00:00:00 TimeMin=N/A
SubmitTime=2017-12-08T09:41:39 EligibleTime=2017-12-08T09:41:39
StartTime=2017-12-08T09:41:39 EndTime=2017-12-13T09:41:39 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2017-12-08T09:41:39
Partition=normal AllocNode:Sid=ernie:1
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c1
BatchHost=localhost
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=500M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/
StdErr=//slurm-2.out
StdIn=/dev/null
StdOut=//slurm-2.out
Power=
However, sacct fails since the table 'slurm_acct_db.linux_job_table' doesn't exist:
[root@ernie /]# sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[root@ernie /]# cat /var/log/slurm/slurmdbd.log |tail
[2017-12-08T09:42:11.526] debug4: This could happen often and is expected.
mysql_query failed: 1146 Table 'slurm_acct_db.linux_job_table' doesn't exist
insert into "linux_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (2, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 0, 1, 0, 0, 'c1', 0, 7200, 1512726099, 1512726099, 1512726099, 'wrap', 0, 1, 4294901759, 1, 1, 9223372036854776308, 'normal', '0', NULL, 0, '1=1,2=500,3=18446744073709551614,4=1,5=1', '1=1,2=500,4=1', '/') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=0, id_user=0, id_group=0, nodelist='c1', id_resv=0, timelimit=7200, time_submit=1512726099, time_eligible=1512726099, time_start=1512726099, mod_time=UNIX_TIMESTAMP(), job_name='wrap', track_steps=0, id_qos=1, state=greatest(state, 1), priority=4294901759, cpus_req=1, nodes_alloc=1, mem_req=9223372036854776308, id_array_job=0, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, `partition`='normal', node_inx='0', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=500,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=500,4=1', work_dir='/'
[2017-12-08T09:42:11.526] error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist
[2017-12-08T09:42:11.526] error: It looks like the storage has gone away trying to reconnect
[2017-12-08T09:42:11.526] debug4: This could happen often and is expected.
mysql_query failed: 1146 Table 'slurm_acct_db.linux_job_table' doesn't exist
insert into "linux_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (2, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 0, 1, 0, 0, 'c1', 0, 7200, 1512726099, 1512726099, 1512726099, 'wrap', 0, 1, 4294901759, 1, 1, 9223372036854776308, 'normal', '0', NULL, 0, '1=1,2=500,3=18446744073709551614,4=1,5=1', '1=1,2=500,4=1', '/') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=0, id_user=0, id_group=0, nodelist='c1', id_resv=0, timelimit=7200, time_submit=1512726099, time_eligible=1512726099, time_start=1512726099, mod_time=UNIX_TIMESTAMP(), job_name='wrap', track_steps=0, id_qos=1, state=greatest(state, 1), priority=4294901759, cpus_req=1, nodes_alloc=1, mem_req=9223372036854776308, id_array_job=0, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, `partition`='normal', node_inx='0', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=500,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=500,4=1', work_dir='/'
[2017-12-08T09:42:11.526] error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist
[2017-12-08T09:42:11.526] DBD_JOB_START: cluster not registered
I cloned the repo and modified some settings in slurm.conf, to no avail. I have little experience setting up slurm so I'm unsure what changes need to be applied.
The issue has been reported before (e.g. http://thread.gmane.org/gmane.comp.distributed.slurm.devel/6333 and https://bugs.schedmd.com/show_bug.cgi?id=1943), and one proposed solution is registering the cluster with sacctmgr, which creates the missing table:
sacctmgr add cluster linux
sacctmgr add account none,test Cluster=linux \
Description="none" Organization="none"
sacctmgr add user da DefaultAccount=test
However, the first command hangs in the container.
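For what it's worth, a variant worth trying (a hedged sketch, not verified in this container; it assumes the hang is sacctmgr's interactive confirmation prompt, and that ClusterName=linux in slurm.conf, which is where the linux_job_table prefix comes from):
sacctmgr -i add cluster linux               # -i/--immediate commits without prompting
supervisorctl restart slurmdbd slurmctld    # let slurmctld re-register with slurmdbd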
Do you have any idea for a solution?
Cheers,
Per
Everything is OK, and I try to place a job with the srun command: srun -n 32 slurmctl
But Docker displays srun: job 1 queued and waiting for resources forever. What's the reason?
Thank you for your work. I tried to convert the Dockerfile to Ubuntu format.
But I keep getting the following error from MySQL:
$ /usr/bin/mysqld_safe
220422 10:39:58 mysqld_safe Logging to syslog.
220422 10:39:58 mysqld_safe Starting mariadbd daemon with databases from /var/lib/mysql
$ mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (2)
Do you have any idea how this could be solved?
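One thing worth trying (a hedged sketch; it assumes either that the socket directory is missing on Ubuntu or that the mysql client races the backgrounded server):
mkdir -p /run/mysqld && chown mysql:mysql /run/mysqld   # directory mysqld_safe expects
/usr/bin/mysqld_safe &
for i in $(seq 1 30); do [ -S /run/mysqld/mysqld.sock ] && break; sleep 1; done
mysql -e 'SELECT 1'                                     # should connect once the socket exists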
Hello Giovanni!
I find this Docker container really useful and I am interested in using it for development with pyslurm as well. I am not sure how to connect the two, though. I cannot install pyslurm with "pip install pyslurm" inside this container. Do you have any suggestions on how to make it work with the Slurm headers and libraries here and pyslurm?
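One route that may work (a sketch; the build options and paths are assumptions based on pyslurm's source-install conventions, not verified in this image):
yum install -y python3-devel gcc git
git clone https://github.com/PySlurm/pyslurm.git && cd pyslurm
# check out the pyslurm branch matching the output of `sinfo --version` first
python3 setup.py build --slurm-lib=/usr/lib64 --slurm-inc=/usr/include
python3 setup.py install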
Have a nice day!
Jakob
I tried to run the container with the following docker-compose.yml:
version: '3'
services:
  slurm:
    image: giovtorres/docker-centos7-slurm:latest
    build: .
    hostname: ernie
    stdin_open: true
    tty: true
    volumes:
      - ./volumes/lib:/var/lib/slurmd
      - ./volumes/spool:/var/spool/slurmd
      - ./volumes/log:/var/log/slurm
      - ./volumes/db:/var/lib/mysql
but it fails on starting MariaDB:
$ docker-compose up
Creating docker-centos7-slurm_slurm_1 ... done
Attaching to docker-centos7-slurm_slurm_1
slurm_1 | - Initializing database
slurm_1 | - Database initialized
slurm_1 | - Updating MySQL directory permissions
slurm_1 | - Starting MariaDB to create Slurm account database
slurm_1 | 191017 08:41:46 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
slurm_1 | 191017 08:41:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
slurm_1 | - Starting MariaDB to create Slurm account database
slurm_1 |   [previous message repeated 30 times in total]
slurm_1 | MariaDB did not start
docker-centos7-slurm_slurm_1 exited with code 1
Did I miss something?
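My guess, for what it's worth (a hedged sketch, unverified): the bind-mounted ./volumes/db is owned by root, so MariaDB cannot initialize its datadir. Checking the MariaDB log and handing the directory to the image's mysql user might get further:
docker-compose run --rm slurm cat /var/log/mariadb/mariadb.log   # look for the real error
docker run --rm giovtorres/docker-centos7-slurm:latest id -u mysql
sudo chown -R <uid-from-above> volumes/db
docker-compose up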
ATM the tagged images are 2 years old although recent PRs were merged. Would be nice to get them updated. I see two possible ways to set up auto-updating (to avoid manual pains); one would be for giovtorres to become an organization, etc., which might be painful. WDYT @giovtorres?
Using the Docker Hub README, I couldn't get the cluster to start.
README (Works great, thanks!):
docker run -it -h slurmctl --cap-add sys_admin giovtorres/docker-centos7-slurm:latest
Docker Hub (failure output below):
docker run -it -h ernie giovtorres/docker-centos7-slurm:latest
This fails; I suppose the hostname must be slurmctl.
- Initializing database
- Database initialized
- Starting MariaDB to create Slurm account database
231031 13:57:20 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
231031 13:57:20 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
- Starting MariaDB to create Slurm account database
- Creating Slurm acct database
- Slurm acct database created. Stopping MariaDB
- Starting supervisord process manager
- Starting munged
munged: started
- munged is in the RUNNING state.
- Starting mysqld
mysqld: started
- mysqld is in the RUNNING state.
- Starting slurmdbd
slurmdbd: started
- slurmdbd is in the RUNNING state.
- Starting slurmctld
slurmctld: ERROR (spawn error)
- slurmctld is in the BACKOFF state.
- slurmctld is in the STARTING state.
- slurmctld is in the BACKOFF state.
- slurmctld is in the BACKOFF state.
- slurmctld is in the STARTING state.
- slurmctld is in the FATAL state.
  [previous message repeated 6 times in total]
- Starting slurmd
slurmd: ERROR (spawn error)
- slurmd is in the BACKOFF state.
- slurmd is in the STARTING state.
- slurmd is in the BACKOFF state.
- slurmd is in the BACKOFF state.
- slurmd is in the STARTING state.
- slurmd is in the FATAL state.
  [previous message repeated 6 times in total]
- Port 6817 is not listening
  [previous message repeated 11 times in total]
- Port 6818 is not listening
  [previous message repeated 11 times in total]
- Port 6819 is listening
- Waiting for the cluster to become available
sinfo: error: get_addr_info: getaddrinfo() failed: Name or service not known
sinfo: error: slurm_set_addr: Unable to resolve "slurmctl"
sinfo: error: Unable to establish control machine address
slurm_load_partitions: No such file or directory
  [the four lines above repeated 11 times in total]
Slurm partitions failed to start successfully.
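Consistent with the README command above working, a sketch of two ways to satisfy the resolver (SlurmctldHost=slurmctl is inferred from the 'Unable to resolve "slurmctl"' errors, not checked against the image's slurm.conf; the --add-host alias is untested):
docker run -it -h slurmctl giovtorres/docker-centos7-slurm:latest
docker run -it -h ernie --add-host slurmctl:127.0.0.1 giovtorres/docker-centos7-slurm:latest   # keep ernie, alias slurmctl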
Hi,
Using the command
docker run -it -h ernie giovtorres/docker-centos7-slurm:17.02.9
the program 'slurmctld' is not running:
[root@ernie /]# supervisorctl status
munged RUNNING pid 263, uptime 0:00:17
mysqld RUNNING pid 491, uptime 0:00:13
slurmctld EXITED Nov 13 12:50 PM
slurmd RUNNING pid 262, uptime 0:00:17
slurmdbd RUNNING pid 266, uptime 0:00:17
I am using docker 17.05.0-ce @ ubuntu 16.04.
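What I would check next (a sketch; the log path is an assumption based on the container's layout):
supervisorctl tail slurmctld        # last output captured by supervisord
cat /var/log/slurm/slurmctld.log    # slurmctld's own log, if present
supervisorctl restart slurmctld     # retry once the cause is fixed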
Best regards,
Bernd
In the current Dockerfile, pkgconfig is listed as pkconfig :)
I have submitted 2 jobs but they don't run at the same time:
[root@slurmctl /]# sbatch -n1 --wrap="sleep 10"
Submitted batch job 1
[root@slurmctl /]# sbatch -n1 --wrap="sleep 10"
Submitted batch job 2
[root@slurmctl /]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 normal wrap root PD 0:00 1 (Resources)
1 normal wrap root R 0:09 1 c1
Job-2 waits until Job-1 has completed. Is there any way to let them run at the same time? Since there are 4 free nodes, I thought we could run 4 processes at the same time.
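A workaround sketch (assuming the other nodes are actually up; the (Resources) reason above suggests only c1 may be usable, so this is a guess): pin each job to its own node, or allow sharing where the partition permits it:
sbatch -n1 -w c1 --wrap="sleep 10"    # -w/--nodelist pins the job to a node
sbatch -n1 -w c2 --wrap="sleep 10"
sbatch -n1 --oversubscribe --wrap="sleep 10"    # alternative, if OverSubscribe allows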
Hi,
this probably relates back to #38, which addressed issue #37. The partitions are up, but if you submit the example job to the debug partition, it runs indefinitely, e.g.:
docker exec dockercentos7slurm_slurm_1 sbatch --wrap="sleep 10" --partition debug
I added the following to tests/test_slurm.py:test_job_can_run:
time.sleep(2)
res = host.run(f'sacct -o State --parsable --noheader -j {jobid}')
assert "COMPLETE" in res.stdout
to verify that jobs complete; see the GitHub Actions output on my fork. Unfortunately I have no immediate solution, but I thought I'd let you know. I make use of the debug partition in a CI test, so I will see if I can find a fix.
Cheers,
Per
I just added Lmod support and an example hello modulefile that adds a hello-world script that does the obvious.
While this isn't exactly Slurm, I guess many Slurm setups use this, so would you be interested in taking it upstream?
It is findable here: https://github.com/AaltoSciComp/docker-centos7-slurm/ (on master currently, no direct link to the commit since I am likely to rebase)
I think it might cause a problem that the /bin/bash process is PID 1 inside the container. When the container is requested to stop, it takes 10 seconds to shut down because bash does not propagate signals as it should, i.e., Docker kills PID 1 inside the container.
Running supervisord directly as process 1 does not seem to mitigate the problem. The container still needs a proper init. See the citation below.
Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems.
It shares some of the same goals of programs like launchd, daemontools, and runit. Unlike some of these programs, it is not meant to be run as a substitute for init as “process id 1”. Instead it is meant to be used to control processes related to a project or a customer, and is meant to start like any other program at boot time.
Suggestion:
Use the docker-compose 2.4 format with init: true, or include tini explicitly in the Dockerfile, prepended to the ENTRYPOINT.
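A sketch of both options (the tini version and the entrypoint path are assumptions, not taken from this repo):
# docker-compose 2.4 variant:
version: '2.4'
services:
  slurm:
    image: giovtorres/docker-centos7-slurm:latest
    init: true
# Dockerfile variant, with tini prepended to the ENTRYPOINT:
ADD https://github.com/krallin/tini/releases/download/v0.19.0/tini /tini
RUN chmod +x /tini
ENTRYPOINT ["/tini", "--", "/usr/local/bin/docker-entrypoint.sh"]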
Hi,
Thank you for this container, it's very useful. I would like to compose it with my Python application image (which has lots of tool dependencies) that needs Slurm and MySQL.
So how can I do that? I probably need to modify the docker-compose.yml, adding my application as a service?
Any advice is welcome. Best,
Véronique
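A starting-point sketch (the myapp service and image names are placeholders; for the app container to run Slurm client commands it would also need the munge key and slurm.conf from the Slurm container, which this sketch does not cover):
version: '3'
services:
  slurm:
    image: giovtorres/docker-centos7-slurm:latest
    hostname: slurmctl
  myapp:
    image: my-python-app:latest    # placeholder for your application image
    depends_on:
      - slurm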
Hi,
I'm using docker-centos7-slurm to test a workflow manager. It has been a while since updating, but when trying out the most recent version, I notice that only one node (c1) is up in the container. I am currently testing this in my fork (see pr #1). Briefly, I parametrized test_job_can_run to pass the partition to the --partition option. The normal partition works as expected, but debug fails.
If one enters the latest image with
docker run -it -h slurmctl giovtorres/docker-centos7-slurm:latest
running sinfo yields
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
c1 1 normal* idle 1 1:1:1 1000 0 1 (null) none
c2 1 normal* unknown* 1 1:1:1 1000 0 1 (null) none
c3 1 debug unknown* 1 1:1:1 1000 0 1 (null) none
c4 1 debug unknown* 1 1:1:1 1000 0 1 (null) none
See the GitHub Actions results, where I added some print statements to see what was going on (never mind that the test actually passed; I was simply looking at the erroneous Slurm output file). I consistently get the feedback that the required nodes are not available; it would seem node c1 is the only node available to sbatch.
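A diagnostic sketch (whether the unknown* state of c2-c4 in the sinfo output above is the root cause is a guess):
scontrol show node c2 | grep -i reason            # why the node is not up
scontrol update NodeName=c2,c3,c4 State=RESUME    # try returning the nodes to service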
Are you able to reproduce this?
Cheers,
Per
We need a REST API via slurmrestd. Could you please add it? Thanks.
I am wondering if this setup could be used to simulate a Slurm configuration using the task/affinity or task/cgroup mode for TaskPlugin. How do you feel, @giovtorres?
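For reference, a sketch of the slurm.conf line in question (whether cgroup enforcement works inside an unprivileged container is doubtful; task/affinity alone seems more likely to function):
TaskPlugin=task/affinity,task/cgroup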
I would love to test some issues we see regarding the interplay of MPI and Slurm. I am a bit new to Docker, so I wonder how I would install (say) OpenMPI from the CentOS repos on the nodes which are spawned from the Dockerfile of this repo?
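A sketch (it relies on all the "nodes" living in one container, so a single install covers every node; package names are the stock CentOS 7 ones):
yum install -y openmpi openmpi-devel    # inside the container; binaries land in /usr/lib64/openmpi/bin
# or baked into a derived image:
# FROM giovtorres/docker-centos7-slurm:latest
# RUN yum install -y openmpi openmpi-devel && yum clean all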