
ansible-slurm's Introduction

Slurm

Install and configure a Slurm cluster on RHEL/CentOS or Debian/Ubuntu servers

Role Variables

All variables are optional. If nothing is set, the role will install the Slurm client programs, munge, and create a slurm.conf with a single localhost node and debug partition. See the defaults and example playbooks for examples.

For the various roles a Slurm node can play, you can either set group names or add values to a list, slurm_roles.

  • group slurmservers or slurm_roles: ['controller']
  • group slurmexechosts or slurm_roles: ['exec']
  • group slurmdbdservers or slurm_roles: ['dbd']

General config options for slurm.conf go in slurm_config, a hash. Keys are Slurm config option names.

Partitions and nodes go in slurm_partitions and slurm_nodes, lists of hashes. The only required key in each hash is name, which becomes the PartitionName or NodeName for that line. All other keys/values are placed onto the line of that partition or node.
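For instance, a minimal sketch of a one-node, one-partition layout (the values here are illustrative; the extended example playbook below shows a fuller configuration):

slurm_nodes:
  - name: localhost
    CPUs: 2
slurm_partitions:
  - name: debug
    Default: YES
    Nodes: localhost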

Options for the additional configuration files acct_gather.conf, cgroup.conf and gres.conf may be specified in slurm_acct_gather_config, slurm_cgroup_config (both hashes) and slurm_gres_config (a list of hashes), respectively.

Set slurm_upgrade to true to upgrade the installed Slurm packages.

You can use slurm_user (a hash) and slurm_create_user (a bool) to pre-create a Slurm user so that uids match.

Note that this role requires root access, so enable become either globally in your playbook, on the command line, or just for the role, as shown in the sketch below and in the example playbooks.
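For example, a minimal sketch enabling become at the play level (the command-line equivalent is ansible-playbook's --become flag):

- name: Slurm
  hosts: all
  become: true
  roles:
    - galaxyproject.slurm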

Dependencies

None.

Example Playbooks

Minimal setup, all services on one node:

- name: Slurm all in One
  hosts: all
  vars:
    slurm_roles: ['controller', 'exec', 'dbd']
  roles:
    - role: galaxyproject.slurm
      become: True

More extensive example:

- name: Slurm execution hosts
  hosts: all
  roles:
    - role: galaxyproject.slurm
      become: True
  vars:
    slurm_cgroup_config:
      CgroupMountpoint: "/sys/fs/cgroup"
      CgroupAutomount: yes
      ConstrainCores: yes
      TaskAffinity: no
      ConstrainRAMSpace: yes
      ConstrainSwapSpace: no
      ConstrainDevices: no
      AllowedRamSpace: 100
      AllowedSwapSpace: 0
      MaxRAMPercent: 100
      MaxSwapPercent: 100
      MinRAMSpace: 30
    slurm_config:
      AccountingStorageType: "accounting_storage/none"
      ClusterName: cluster
      GresTypes: gpu
      JobAcctGatherType: "jobacct_gather/none"
      MpiDefault: none
      ProctrackType: "proctrack/cgroup"
      ReturnToService: 1
      SchedulerType: "sched/backfill"
      SelectType: "select/cons_res"
      SelectTypeParameters: "CR_Core"
      SlurmctldHost: "slurmctl"
      SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
      SlurmctldPidFile: "/var/run/slurmctld.pid"
      SlurmdLogFile: "/var/log/slurm/slurmd.log"
      SlurmdPidFile: "/var/run/slurmd.pid"
      SlurmdSpoolDir: "/var/spool/slurmd"
      StateSaveLocation: "/var/spool/slurmctld"
      SwitchType: "switch/none"
      TaskPlugin: "task/affinity,task/cgroup"
      TaskPluginParam: Sched
    slurm_create_user: yes
    slurm_gres_config:
      - File: /dev/nvidia[0-3]
        Name: gpu
        NodeName: gpu[01-10]
        Type: tesla
    slurm_munge_key: "../../../munge.key"
    slurm_nodes:
      - name: "gpu[01-10]"
        CoresPerSocket: 18
        Gres: "gpu:tesla:4"
        Sockets: 2
        ThreadsPerCore: 2
    slurm_partitions:
      - name: gpu
        Default: YES
        MaxTime: UNLIMITED
        Nodes: "gpu[01-10]"
    slurm_roles: ['exec']
    slurm_user:
      comment: "Slurm Workload Manager"
      gid: 888
      group: slurm
      home: "/var/lib/slurm"
      name: slurm
      shell: "/usr/sbin/nologin"
      uid: 888

License

MIT

Author Information

View contributors on GitHub

ansible-slurm's People

Contributors

hexylena, natefoo, nuwang, refual, slugger70


ansible-slurm's Issues

`sudo` privileges required to run slurm commands

Pardon my ignorance, I’m a noob at slurm. After installing slurm using this role, non-root users need sudo privileges to run ‘srun’, ‘salloc’ and other slurm commands.

I couldn't find any example online where privilege escalation was required to run Slurm commands; am I missing something?

Add Debian Buster support

Please see my pull request #12

My change to the systemd service might also apply to other operating systems, but since I'm not sure, I added this when condition:

  when: ansible_distribution == 'Debian'

Handler typo in tasks/slurmdbd_cluster.yml

When running the playbook we received an error:

TASK [galaxyproject.slurm : Create the slurmdbd cluster] ***********************
ERROR! The requested handler 'reload slurmdbd' was not found in either the main handlers list nor in the listening handlers list
FATAL: command execution failed

After looking at tasks/slurmdbd_cluster.yml, the issue is on line 19: it notifies the reload slurmdbd handler, but it should be Reload slurmdbd, since the handler in handlers/main.yml on line 7 uses a capital R in the name.
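A sketch of the fix (the task body is abbreviated here; only the handler name in notify needs to change):

# tasks/slurmdbd_cluster.yml
- name: Create the slurmdbd cluster
  # ... existing task body ...
  notify:
    - Reload slurmdbd  # handler names are case-sensitive and must match handlers/main.yml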

Service can't start due to wrong path on Ubuntu 20.04 LTS

On Ubuntu 20.04 LTS the Slurm service doesn't start, with the error:

systemd[1]: slurmctld.service: Can't open PID file /run/slurmctld.pid (yet?) after start: Operation not permitted

I found something similar on Stack Overflow. It seems that the PID path in slurm.conf differs from the one in the service definition, which leads to the crash. After manually changing the path in slurm.conf it starts without problems, but the playbook has a hardcoded path for the PID file.

/etc/slurm-llnl/slurm.conf

SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid

/lib/systemd/system/slurmctld.service

[Service]
PIDFile=/run/slurmctld.pid

Edit: The same problem might apply to the slurmd and slurmdbd services; I didn't test that yet, but the paths are the same.
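One possible workaround, sketched under the assumption that the role's slurm_config passthrough can override the hardcoded defaults: set the PID file paths in slurm.conf to match the packaged unit files. The slurmctld path below is taken from the unit file shown above; the slurmd path is an assumption based on the note that the other services follow the same pattern.

slurm_config:
  SlurmctldPidFile: "/run/slurmctld.pid"
  SlurmdPidFile: "/run/slurmd.pid"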

generating slurm.conf values

I'm a bit of an Ansible noob here, but:

When generating the slurm.conf file, instead of hard-coded values:
slurm_nodes:
  - name: "{{ headnode }}"
    CoresPerSocket: "6"
    CPUs: "12"
    Gres: "gpu:p620:1"
    NodeAddr: "{{ headnode }}"
    RealMemory: "31846"
    Sockets: "1"
    ThreadsPerCore: "2"
    Feature: "gpu,intel,ht"
    State: "UNKNOWN"

Is it possible to get the values from ansible_facts? Something along the lines of:

slurm_nodes:
  - name: "{{ headnode }}"
    CoresPerSocket: "{{ ansible_facts['ansible_processor_cores'] }}"
    CPUs: "{{ ansible_facts['ansible_processor_vcpu'] }}"
    Gres: "gpu:p620:1"
    NodeAddr: "{{ headnode }}"
    RealMemory: "{{ ansible_facts['ansible_memory_mb.real.total'] }}"
    Sockets: "1"
    ThreadsPerCore: "{{ ansible_facts['ansible_processor_threads_per_core'] }}"
    Feature: "gpu,intel,ht"
    State: "UNKNOWN"

passwords for databases are visible to anyone in slurm.conf

This is mainly a concern if the controller is available to users other than admins, but it would not hurt to change the ownership like so:

# tasks/common.yml
# ...

- name: Install slurm.conf
  ansible.builtin.template:
    src: "slurm.conf.j2"
    dest: "{{ slurm_config_dir }}/slurm.conf"
    owner: root
    group: slurm
    mode: "0640"
  notify:
    - Restart slurmd
    - Restart slurmctld

This way, members of the group "slurm" can view the file but only root can edit it. If scontrol or one of the other commands actually edits the file, the owner can be changed to slurm, or the mode to 0664 so members of the group can edit as well.

Slurmdbd does not recognize the parameter SlurmctldPidFile

Hello there,

My Slurmdbd does not recognize the parameter SlurmctldPidFile.
When I comment this out, it works:
# SlurmctldPidFile: "{{ __slurm_run_dir ~ '/slurmdbd.pid' if __slurm_debian else omit }}"

I also could not find the used parameter in https://slurm.schedmd.com/slurmdbd.conf.html#OPT_PidFile.

I could, however, find the parameter PidFile, but that threw the same error.

Aug 02 18:37:06 slurmhost slurmdbd[15616]: error: _parse_next_key: Parsing error at unrecognized key: SlurmctldPidFile
Aug 02 18:37:06 slurmhost slurmdbd[15616]: fatal: Could not open/read/parse slurmdbd.conf file /etc/slurm/slurmdbd.conf

I would prefer to set the location of my slurmdbd.pid file like I do with SlurmctldPidFile and SlurmdPidFile. Can this be accomplished?

I am running slurm-wlm 21.08.5.

/etc/slurm/slurmdbd.conf
  root@slurmhost:/var/log/slurm# cat /etc/slurm/slurmdbd.conf
    ##
    ## This file is maintained by Ansible - ALL MODIFICATIONS WILL BE REVERTED
    ##
    
    ArchiveJobs=yes
    ArchiveSteps=yes
    AuthType=auth/munge
    DbdHost=slurmhost
    DbdPort=6819
    DebugLevel=4
    LogFile=/var/log/slurm/slurmdbd.log
    SlurmUser=slurm
    StorageHost=localhost
    StorageLoc=slurmdb
    StoragePass=CBB...........2B8
    StoragePort=3306
    StorageType=accounting_storage/mysql
    StorageUser=slurm
/etc/slurm/slurm.conf
  root@slurmhost:/var/log/slurm# cat /etc/slurm/slurm.conf
  ##
  ## This file is maintained by Ansible - ALL MODIFICATIONS WILL BE REVERTED
  ##
  
  
  # Configuration options
  AccountingStorageEnforce=limits
  AccountingStorageHost=slurmhost
  AccountingStorageType=accounting_storage/slurmdbd
  AuthType=auth/munge
  ClusterName=ei-hpc-cluster
  CryptoType=crypto/munge
  GresTypes=gpu
  InactiveLimit=0
  JobAcctGatherType=jobacct_gather/linux
  JobCompType=jobcomp/none
  KillWait=30
  MinJobAge=300
  MpiDefault=none
  PriorityDecayHalfLife=7-0
  PriorityType=priority/multifactor
  PriorityWeightAge=1000
  PriorityWeightFairshare=100000
  PriorityWeightPartition=10000
  ProctrackType=proctrack/pgid
  ReturnToService=2
  SchedulerParameters=nohold_on_prolog_fail
  SchedulerType=sched/backfill
  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core
  SlurmctldDebug=5
  SlurmctldHost=slurmhost
  SlurmctldLogFile=/var/log/slurm/slurmctld.log
  SlurmctldPidFile=/run/slurm/slurmctld.pid
  SlurmctldPort=6817
  SlurmctldTimeout=300
  SlurmdDebug=3
  SlurmdLogFile=/var/log/slurm/slurmd.log
  SlurmdPidFile=/run/slurm/slurmd.pid
  SlurmdPort=6818
  SlurmdSpoolDir=/var/spool/slurm/d
  SlurmdTimeout=300
  SlurmUser=slurm
  StateSaveLocation=/var/spool/slurm/ctld
  SwitchType=switch/none
  Waittime=0
  
  # Nodes
  NodeName=ei-srv-018 Boards=1 CoresPerSocket=16 CPUs=32 RealMemory=240000 SocketsPerBoard=2 State=UNKNOWN ThreadsPerCore=2
  
  # Partitions
  PartitionName=normal Default=YES MaxTime=60 Nodes=ALL PriorityJobFactor=10000 State=UP
  PartitionName=day AllowAccounts=professor,mitarbeiter,student MaxTime=1440 Nodes=ALL PriorityJobFactor=6000 State=UP
  PartitionName=long AllowAccounts=professor,mitarbeiter,student MaxTime=10080 Nodes=ALL PriorityJobFactor=1000 State=UP
  PartitionName=priority AllowAccounts=admin MaxTime=UNLIMITED Nodes=ALL PriorityJobFactor=5000 State=UP

Debian 11 issue

Sadly, Debian changed the Slurm paths in the new Debian version.

On old Debian (Debian 9 & 10, with Slurm 18.08.5.2-1):

dpkg -L slurmd |egrep '/etc/slur|/var/log/'
/etc/slurm-llnl
/var/log/slurm-llnl

On new Debian 11 (with Slurm 20.11.7+really20.11.4-2):

dpkg -L slurmd |egrep '/etc/slur|/var/log/'
/etc/slurm
/var/log/slurm

Problems with slurm rpms

Not sure if this is the best place to report this, but let's try.
I was trying to use the slurm RPMs that are found here:
https://depot.galaxyproject.org/yum/package/slurm/18.08/7/x86_64/
I hit 2 problems:

  1. the repo definitions (https://depot.galaxyproject.org/yum/package/slurm/18.08/) seem broken.
    To get it to work (on CentOS 7) I needed to change the baseurl property to https://depot.galaxyproject.org/yum/package/slurm/18.08/$releasever/$basearch/

  2. The two drmaa packages seem to have been built against a different version of Slurm. When doing a yum install you get this error:

Error: Package: slurm-drmaa-1.1.0-1.el7.x86_64 (slurm-18.08)
Requires: libslurm.so.31()(64bit)
Error: Package: slurm-drmaa-1.1.0-1.el7.x86_64 (slurm-18.08)
Requires: libslurmdb.so.31()(64bit)
The version of Slurm that is installed provides libslurmdb.so.33.

I'm not a yum/rpm expert so I might be missing something subtle here, but it looks to me like these RPMs were not built against the same version of Slurm.

To get around this I downloaded the two drmaa RPMs and installed them manually with something like:
rpm -iv --nodeps slurm-drmaa-1.1.0-1.el7.x86_64.rpm

That seems to have installed, but I haven't got to the stage of checking that it works correctly.

Port ranges are converted into strings

When I try to set a port range, it gets converted into a string, which leads to an error in the Slurm daemon while loading the config.

slurm_nodes:
  - name: dockerworker[1-10]
    NodeAddr: 192.168.0.1
    Port: [6818-6828]
    State: UNKNOWN

slurm.conf

...
NodeName=dockerworker[1-10] NodeAddr=192.168.0.1 Port=[u'6818-6828'] State=UNKNOWN
...

Error in slurmctld

ERROR: [../../../src/common/hostlist.c:1735] Invalid range: `u'6818-6828'': Invalid argument
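A possible workaround, sketched rather than confirmed: quote the value so YAML does not parse [6818-6828] as a one-element list (which is what produces the Port=[u'6818-6828'] output above). Whether slurmctld accepts a bracketed port range at all is a separate question, though the hostlist error suggests it does try to parse one:

slurm_nodes:
  - name: dockerworker[1-10]
    NodeAddr: 192.168.0.1
    Port: "[6818-6828]"  # quoted so it stays a plain string in the rendered slurm.conf
    State: UNKNOWN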

SlurmDBD config does not work

The config template, generic.conf.j2, expects to be run in a loop over a list of dicts, but the task that templates it does not do so.
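For reference, a hypothetical sketch of what a loop-driven templating task could look like; slurm_generic_config_files and the item keys are made-up names for illustration, not the role's actual variables:

- name: Install additional configuration files
  ansible.builtin.template:
    src: generic.conf.j2
    dest: "{{ slurm_config_dir }}/{{ item.name }}"
    owner: root
    group: root
    mode: "0644"
  loop: "{{ slurm_generic_config_files }}"  # hypothetical: a list of dicts, one per config file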

Ability to specify Slurm version

When running this role, we had issues with another repo on the target which had version 21.08.8 of Slurm (compared to 20.11.8 from the Galaxy repos), so running the playbook installed the later version. However the only version of MUNGE we had was 0.5.14 (again from the Galaxy repo), which was not compatible with the later Slurm version that had been installed. Our solution was to disable the other repo in a pre-task before running the slurm role.

Would it be possible to add a variable which specifies the version of Slurm to install?

Issue running playbook

When running the playbook I get the error "No package matching 'slurm-slurmdbd' found available, installed or updated" for the task Install Slurm DB packages:

{
    "_ansible_parsed": true,
    "_ansible_no_log": false,
    "changed": false,
    "results": [
        "No package matching 'slurm-slurmdbd' found available, installed or updated"
    ],
    "rc": 126,
    "invocation": {
        "module_args": {
            "autoremove": false,
            "disable_plugin": [],
            "install_repoquery": true,
            "update_cache": false,
            "disable_excludes": null,
            "exclude": [],
            "update_only": false,
            "installroot": "/",
            "allow_downgrade": false,
            "name": [
                "munge",
                "slurm-slurmdbd"
            ],
            "download_only": false,
            "bugfix": false,
            "list": null,
            "disable_gpg_check": false,
            "conf_file": null,
            "use_backend": "auto",
            "validate_certs": true,
            "state": "present",
            "disablerepo": [],
            "releasever": null,
            "enablerepo": [],
            "skip_broken": false,
            "security": false,
            "enable_plugin": []
        }
    },
    "msg": "No package matching 'slurm-slurmdbd' found available, installed or updated"
}

Cgroup mode

Hello,

I'm trying to configure Slurm in cgroup mode to be able to manage memory limits on the server running Galaxy.
When I try to set it up with cgroup v2 using this option:
CgroupPlugin=cgroup/v2

I get a message that this option is not valid. Is the Slurm version installed with Ansible compatible with cgroup v2?

new release

Hi,

the last release is rather old. Particularly with the merges fixing issue #6, a new release would be merited, don't you think?

Best regards
Christian Meesters

basic playbook fails at `Create slurm user`

Hi there!

I have to create a high-performance computing server, and I'm trying to use this repo for the Slurm step. It fails at the user creation step. It looks like a basic permissions error where some Unix commands should be run as root.

The error I get is:

TASK [galaxyproject.slurm : Include user creation tasks] **************************************************************************************
included: /home/heroico/.ansible/roles/galaxyproject.slurm/tasks/user.yml for [MY HOST]

TASK [galaxyproject.slurm : Create slurm group] ***********************************************************************************************
skipping: [MY HOST]

TASK [galaxyproject.slurm : Create slurm user] ************************************************************************************************
fatal: [MY HOST]: FAILED! => {"changed": false, "msg": "useradd: Permission denied.\nuseradd: cannot lock /etc/passwd; try again later.\n", "name": "slurm", "rc": 1}

PLAY RECAP ************************************************************************************************************************************
[MY HOST]              : ok=2    changed=0    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0

My minimal reproducible playbook is like this:

---
- name: test
  hosts: [MY_HOST]

  tasks:
    - include_role:
        name: 'galaxyproject.slurm'
      vars:
        slurm_roles: ['controller', 'exec', 'dbd']

I'm not sure what to do here. I can use a become directive at the playbook level to have this work, but I wanted to check with you first if there is a better way to achieve this. (Bear in mind: I'm not a sysadmin/devops person, I'm merely a coder recently forced to maintain infrastructure). Thanks in advance!
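For what it's worth, a minimal sketch that escalates only for the included role, using include_role's apply keyword instead of play-wide become:

---
- name: test
  hosts: [MY_HOST]

  tasks:
    - name: Install Slurm with privilege escalation limited to the role
      ansible.builtin.include_role:
        name: galaxyproject.slurm
        apply:
          become: true
      vars:
        slurm_roles: ['controller', 'exec', 'dbd']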

Need help creating a minimal non-trivial playbook

Hi everyone, I'm struggling to create a playbook that installs both the control and execution nodes.
Whatever I do, I end up with multiple nodes where each of them is a singleton cluster with a single node in it, and there's no interconnectivity between them.

Minimal playbook:

- name: install SLURM cluster
  hosts: vm0
  roles:
    - role: galaxyproject.slurm
      become: True
  vars:
    slurm_roles: ['exec', 'dbd', 'controller']
    slurm_munge_key: munge.key

- name: SLURM execution hosts
  roles:
    - role: galaxyproject.slurm
      become: True
  hosts: vm1, vm2
  vars:
    slurm_munge_key: munge.key
    slurm_roles: ['exec']
    slurm_nodes:
      - name: "vm[1-2]"
        CoresPerSocket: 1
    slurm_partitions:
      - name: compute
        Default: YES
        MaxTime: UNLIMITED
        Nodes: "vm[1-2]"

and the output would be:

~/github/slurm_local
❯ ansible -i local.yml all -a 'sinfo'
vm1 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle localhost
vm2 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle localhost
vm0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle localhost

which is not what I intended.

Could anyone help draft a correct playbook for such a case?
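One likely cause: slurm_nodes, slurm_partitions, and the controller address are only defined for some plays, so hosts in the other plays fall back to the role's default single-localhost slurm.conf. A sketch of shared variables that every play would then render identically; SlurmctldHost: vm0 is an assumption about how the controller is reachable from the other VMs:

# group_vars/all.yml (sketch)
slurm_munge_key: munge.key
slurm_config:
  ClusterName: cluster
  SlurmctldHost: vm0
slurm_nodes:
  - name: "vm[1-2]"
    CoresPerSocket: 1
slurm_partitions:
  - name: compute
    Default: YES
    MaxTime: UNLIMITED
    Nodes: "vm[1-2]"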

munge.service on the master node fails to start

-- Unit munge.service has begun starting up.
Jun 15 20:34:10 controller systemd[1]: munge.service: control process exited, code=exited status=1
Jun 15 20:34:10 controller munged[5006]: munged: Error: Failed to check keyfile "/etc/munge/munge.key": No such file or directory
Jun 15 20:34:10 controller systemd[1]: Failed to start MUNGE authentication service.
-- Subject: Unit munge.service has failed
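A sketch of one way to avoid this, assuming slurm_munge_key (as used in the example playbooks above) points to a pre-generated key on the Ansible control machine that the role installs as /etc/munge/munge.key on each host; hosts: controller is a placeholder for your controller group:

- name: Slurm controller
  hosts: controller
  roles:
    - role: galaxyproject.slurm
      become: True
  vars:
    slurm_roles: ['controller']
    slurm_munge_key: files/munge.key  # the same key file must be distributed to every cluster host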

MIT License

Would you mind including the MIT license contents in your project in a file? Just want to remove any ambiguity.

Thanks

Service restart order

When you need to restart several services, follow this order:

  • slurmdbd
  • slurmctld
  • slurmd

sbatch not working if SlurmdSpoolDir is undefined

If I do not define SlurmdSpoolDir, I get the following error when running any type of job:

slurmstepd-node-1: error: execve(): /var/lib/slurm/slurmctld/job00218/slurm_script: Permission denied

But if I set SlurmdSpoolDir to just the default (SlurmdSpoolDir: "/var/spool/slurmd"), I can successfully dispatch jobs. This seems like a Slurm issue, but for the sake of usability I suggest simply adding the default value to slurm.conf if it is undefined.
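Until then, a minimal sketch of the workaround described above, setting the value explicitly through slurm_config:

slurm_config:
  SlurmdSpoolDir: "/var/spool/slurmd"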

[feature] Handle sacctmgr account and user association

I found myself needing to manage some users, accounts and partitions after enabling JobComp, and made a local role to handle associations. Is this within the scope of this module? If so, I will gladly make a PR that incorporates it.

You are not running a supported accounting_storage plugin

I'm trying out this Ansible role on an Ubuntu 22.04 VM. When running the bare default example, things run smoothly until the following task:

TASK [galaxyproject.slurm : Create the slurmdbd cluster] *************************************************************************************************************************************
fatal: [129.70.51.119]: FAILED! => {"changed": true, "cmd": ["sacctmgr", "-i", "-n", "add", "cluster", "cluster"], "delta": "0:00:00.006080", "end": "2023-09-09 20:14:06.899666", "msg": "non-zero return code", "rc": 1, "start": "2023-09-09 20:14:06.893586", "stderr": "You are not running a supported accounting_storage plugin\nOnly 'accounting_storage/slurmdbd' is supported.", "stderr_lines": ["You are not running a supported accounting_storage plugin", "Only 'accounting_storage/slurmdbd' is supported."], "stdout": "", "stdout_lines": []}

I'm not really sure where to start digging into this, as I was hoping the default would "just work" 😉.
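A hedged starting point: sacctmgr only talks to slurmdbd, so slurm.conf needs AccountingStorageType pointed at accounting_storage/slurmdbd (and at the host running slurmdbd) rather than the accounting_storage/none shown in the README example. For an all-in-one VM that might look like the sketch below; AccountingStorageHost: localhost is an assumption:

slurm_roles: ['controller', 'exec', 'dbd']
slurm_config:
  ClusterName: cluster
  AccountingStorageType: "accounting_storage/slurmdbd"
  AccountingStorageHost: localhost  # assumed: slurmdbd runs on the same VM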
