
ansible-role-openhpc's Introduction


stackhpc.openhpc

This Ansible role installs packages and performs configuration to provide an OpenHPC v2.x Slurm cluster.

As a role it must be used from a playbook, for which a simple example is given below. This approach means it is totally modular, with no assumptions about available networks or any cluster features except for some hostname conventions. Any desired cluster filesystem or other required functionality may be freely integrated using additional Ansible roles or other approaches.

The minimal image for nodes is a RockyLinux 8 GenericCloud image.

Role Variables

openhpc_extra_repos: Optional list. Extra Yum repository definitions to configure, following the format of the Ansible yum_repository module. An example is given after the key list. Respected keys for each list element:

  • name: Required
  • description: Optional
  • file: Required
  • baseurl: Optional
  • metalink: Optional
  • mirrorlist: Optional
  • gpgcheck: Optional
  • gpgkey: Optional
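
For example, a minimal sketch (the repository name, file and URL are illustrative):

openhpc_extra_repos:
  - name: epel
    description: Extra Packages for Enterprise Linux
    file: epel
    baseurl: https://download.fedoraproject.org/pub/epel/8/Everything/x86_64/
    gpgcheck: false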

openhpc_slurm_service_enabled: boolean, whether to enable the appropriate slurm service (slurmd/slurmctld).

openhpc_slurm_service_started: Optional boolean. Whether to start slurm services. If set to false, all services will be stopped. Defaults to openhpc_slurm_service_enabled.

openhpc_slurm_control_host: Required string. Ansible inventory hostname (and short hostname) of the controller e.g. "{{ groups['cluster_control'] | first }}".

openhpc_slurm_control_host_address: Optional string. IP address or name to use for the openhpc_slurm_control_host, e.g. to use a different interface than is resolved from openhpc_slurm_control_host.

openhpc_packages: Optional list. Additional OpenHPC packages to install.

openhpc_enable:

  • control: whether to enable control host
  • database: whether to enable slurmdbd
  • batch: whether to enable compute nodes
  • runtime: whether to enable OpenHPC runtime

openhpc_slurmdbd_host: Optional. Where to deploy slurmdbd if you are using this role to deploy slurmdbd, otherwise where an existing slurmdbd is running. This should be the name of a host in your inventory. Set this to none to prevent the role from managing slurmdbd. Defaults to openhpc_slurm_control_host.

openhpc_slurm_configless: Optional, default false. If true then slurm's "configless" mode is used.

openhpc_munge_key: Optional. Define a munge key to use. If not provided then one is generated but the openhpc_slurm_control_host must be in the play.

openhpc_login_only_nodes: Optional. If using "configless" mode specify the name of an ansible group containing nodes which are login-only nodes (i.e. not also control nodes), if required. These nodes will run slurmd to contact the control node for config.

openhpc_module_system_install: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, you can either supply your own module system or go without one.

slurm.conf

openhpc_slurm_partitions: Optional. List of one or more slurm partitions, default []. A worked example is given after the notes below. Each partition may contain the following values:

  • groups: If there are multiple node groups that make up the partition, a list of group objects can be defined here. Otherwise, groups can be omitted and the following attributes can be defined in the partition object:

    • name: The name of the nodes within this group.

    • cluster_name: Optional. An override for the top-level definition openhpc_cluster_name.

    • extra_nodes: Optional. A list of additional node definitions, e.g. for nodes in this group/partition not controlled by this role. Each item should be a dict, with keys/values as per the "NODE CONFIGURATION" docs for slurm.conf. Note the key NodeName must be first.

    • ram_mb: Optional. The physical RAM available in each node of this group (slurm.conf parameter RealMemory) in MiB. This is set using ansible facts if not defined, equivalent to free --mebi total * openhpc_ram_multiplier.

    • ram_multiplier: Optional. An override for the top-level definition openhpc_ram_multiplier. Has no effect if ram_mb is set.

    • gres: Optional. List of dicts defining generic resources. Each dict must define:

      • conf: A string with the resource specification but requiring the format <name>:<type>:<number>, e.g. gpu:A100:2. Note the type is an arbitrary string.
      • file: A string with the File (path to device(s)) for this resource, e.g. /dev/nvidia[0-1] for the above example.

      Note GresTypes must be set in openhpc_config if this is used.

  • default: Optional. A boolean flag for whether this partition is the default. Valid settings are YES and NO.

  • maxtime: Optional. A partition-specific time limit following the format of slurm.conf parameter MaxTime. The default value is given by openhpc_job_maxtime. The value should be quoted to avoid Ansible conversions.

  • partition_params: Optional. Mapping of additional parameters and values for partition configuration.

For each group (if used) or partition, any nodes in an ansible inventory group <cluster_name>_<group_name> will be added to the group/partition. Note that:

  • Nodes may have arbitrary hostnames but these should be lowercase to avoid a mismatch between inventory and actual hostname.
  • Nodes in a group are assumed to be homogeneous in terms of processor and memory.
  • An inventory group may be empty or missing, but if it is not then the play must contain at least one node from it (used to set processor information).
  • Nodes may not appear in more than one group.
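
For example, a sketch of a two-partition configuration in the style of the example playbook below (the gpu partition, its inventory group openhpc_gpu, and all values are illustrative):

openhpc_cluster_name: openhpc
openhpc_slurm_partitions:
  # nodes taken from inventory group openhpc_compute
  - name: compute
    default: 'YES'
    maxtime: '5-0'               # 5 days; quoted to avoid Ansible type conversion
  # nodes taken from inventory group openhpc_gpu
  - name: gpu
    default: 'NO'
    ram_mb: 120000               # optional RealMemory override, in MiB
    gres:
      - conf: gpu:A100:2         # <name>:<type>:<number>
        file: /dev/nvidia[0-1]
    partition_params:
      PriorityJobFactor: 10      # any extra partition parameters

Using gres as above also requires GresTypes to be set via openhpc_config (see below).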

openhpc_job_maxtime: Maximum job time limit, default '60-0' (60 days). See the slurm.conf parameter MaxTime for the format. The value should be quoted to avoid Ansible conversions.

openhpc_cluster_name: Name of the cluster.

openhpc_config: Optional. Mapping of additional parameters and values for slurm.conf. Note these will override any included in templates/slurm.conf.j2.
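
For example, a sketch (both are standard slurm.conf parameters; the values are illustrative):

openhpc_config:
  SlurmctldDebug: debug        # overrides the value templated into slurm.conf
  GresTypes: gpu               # required if gres is used in openhpc_slurm_partitions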

openhpc_ram_multiplier: Optional, default 0.95. Multiplier used in the calculation total_memory * openhpc_ram_multiplier when setting RealMemory for the partition in slurm.conf. Can be overridden on a per-partition basis using openhpc_slurm_partitions.ram_multiplier. Has no effect if openhpc_slurm_partitions.ram_mb is set.

openhpc_state_save_location: Optional. Absolute path for Slurm controller state (slurm.conf parameter StateSaveLocation).

Accounting

By default, no accounting storage is configured. OpenHPC v1.x and un-updated OpenHPC v2.0 clusters support file-based accounting storage, which can be selected by setting the role variable openhpc_slurm_accounting_storage_type to accounting_storage/filetxt [1]. Accounting for OpenHPC v2.1 and updated OpenHPC v2.0 clusters requires the Slurm database daemon, slurmdbd (although job completion may be a limited alternative; see below). To enable accounting:

  • Configure a mariadb or mysql server as described in the slurm accounting documentation on one of the nodes in your inventory and set openhpc_enable.database to true for this node.
  • Set openhpc_slurm_accounting_storage_type to accounting_storage/slurmdbd.
  • Configure the variables for slurmdbd.conf below.
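
For example, a sketch of the relevant variables, based on the example playbook below (the vaulted password variable name is illustrative):

openhpc_enable:
  control: "{{ inventory_hostname in groups['cluster_control'] }}"
  database: "{{ inventory_hostname in groups['cluster_control'] }}"
  batch: "{{ inventory_hostname in groups['cluster_batch'] }}"
  runtime: true
openhpc_slurm_accounting_storage_type: accounting_storage/slurmdbd
openhpc_slurmdbd_mysql_password: "{{ vault_slurmdbd_mysql_password }}"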

The role will take care of configuring the following variables for you:

openhpc_slurm_accounting_storage_host: Where the accounting storage service is running, i.e. where slurmdbd is running.

openhpc_slurm_accounting_storage_port: Which port to use to connect to the accounting storage.

openhpc_slurm_accounting_storage_user: Username for authenticating with the accounting storage.

openhpc_slurm_accounting_storage_pass: Munge key or database password to use for authenticating.

For more advanced customisation or to configure another storage type, you might want to modify these values manually.

Job accounting

This is largely redundant if you are using the accounting plugin above, but will give you basic accounting data such as start and end times. By default no job accounting is configured.

openhpc_slurm_job_comp_type: Logging mechanism for job accounting. Can be one of jobcomp/filetxt, jobcomp/none, jobcomp/elasticsearch.

openhpc_slurm_job_acct_gather_type: Mechanism for collecting job accounting data. Can be one of jobacct_gather/linux, jobacct_gather/cgroup and jobacct_gather/none.

openhpc_slurm_job_acct_gather_frequency: Sampling period for job accounting (seconds).

openhpc_slurm_job_comp_loc: Location to store the job accounting records. Depends on value of openhpc_slurm_job_comp_type, e.g for jobcomp/filetxt represents a path on disk.
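
For example, a sketch enabling file-based job completion records and Linux process data gathering (the values are illustrative):

openhpc_slurm_job_comp_type: jobcomp/filetxt
openhpc_slurm_job_comp_loc: /var/log/slurm_jobcomp.log
openhpc_slurm_job_acct_gather_type: jobacct_gather/linux
openhpc_slurm_job_acct_gather_frequency: 30    # sample every 30 seconds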

slurmdbd.conf

The following options affect slurmdbd.conf. Please see the slurm documentation for more details. You will need to configure these variables if you have set openhpc_enable.database to true.

openhpc_slurmdbd_port: Port for slurmdbd to listen on, defaults to 6819.

openhpc_slurmdbd_mysql_host: Hostname or IP where mariadb is running, defaults to openhpc_slurm_control_host.

openhpc_slurmdbd_mysql_database: Database to use for accounting, defaults to slurm_acct_db.

openhpc_slurmdbd_mysql_password: Password for authenticating with the database. You must set this variable.

openhpc_slurmdbd_mysql_username: Username for authenticating with the database, defaults to slurm.
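
For example, a sketch in which every value except the password shows its default (the vault variable name is illustrative):

openhpc_slurmdbd_port: 6819
openhpc_slurmdbd_mysql_host: "{{ openhpc_slurm_control_host }}"
openhpc_slurmdbd_mysql_database: slurm_acct_db
openhpc_slurmdbd_mysql_username: slurm
openhpc_slurmdbd_mysql_password: "{{ vault_slurmdbd_mysql_password }}"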

Example Inventory

Define an Ansible inventory such as this:

[openhpc_login]
openhpc-login-0 ansible_host=10.60.253.40 ansible_user=centos

[openhpc_compute]
openhpc-compute-0 ansible_host=10.60.253.31 ansible_user=centos
openhpc-compute-1 ansible_host=10.60.253.32 ansible_user=centos

[cluster_login:children]
openhpc_login

[cluster_control:children]
openhpc_login

[cluster_batch:children]
openhpc_compute

Example Playbooks

To deploy, create a playbook which looks like this:

---
- hosts:
  - cluster_login
  - cluster_control
  - cluster_batch
  become: yes
  roles:
    - role: openhpc
      openhpc_enable:
        control: "{{ inventory_hostname in groups['cluster_control'] }}"
        batch: "{{ inventory_hostname in groups['cluster_batch'] }}"
        runtime: true
      openhpc_slurm_service_enabled: true
      openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
      openhpc_slurm_partitions:
        - name: "compute"
      openhpc_cluster_name: openhpc
      openhpc_packages: []
...

[1] Slurm 20.11 removed accounting_storage/filetxt as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases.


ansible-role-openhpc's Issues

Update github CI OS when possible

#110 changed the CI OS from ubuntu-20.04 to ubuntu-18.04 to deal with an incompatibility between the (CentOS) systemd in the container and the underlying kernel. When we're past CentOS 8.3 consider revisiting this.

Consider adding "optimised" config for (subset) of installed compiler+MPI chains

Could pick/document/install "default" compiler + MPI combination, and then add slurm config (+ modules?) so that process management (pmi, pmi2, pmix, etc - e.g. SLURM_MPI_TYPE), launcher (e.g. srun etc) and fabric work properly "out of the box".

Difficulty will be any hardware-specific items required e.g. for fabrics. Could just document which hardware this is expected to be "optimised" for I guess.

Update package lists

Things to consider adding:

  • ohpc v2 includes an openmpi built against ucx, but whether intel mpi needs a different ucx isn't quite clear.
  • IMPI needs mellanox ucx v1.4+. Will checked package names, and ucx from the system and mellanox repos appears to be the same, so maybe install that at least on v1 systems?
  • For intel mpi we should install slurm-libpmi-ohpc, which allows `I_MPI_PMI_LIBRARY=/lib64/libpmi.so`. See the slurm mpi page.
  • Install the mpi performance metapackage by default so we get a testable cluster (i.e. compilers, mpi, imb).

Should also consider adding default MPI etc in slurm conf when we've done this. Or maybe even modifying modules??


openhpc-login-0: Unable to start service slurmd

Hey,

I've started a deployment as described in the README, but after deploying (having set hostnames and added the repos beforehand) I got an error while starting the slurm daemon on the openhpc-login-0 node.
I got the same error with other openhpc-compute-* hostnames, but solved that by changing the names to match those given in the README.

fatal: [openhpc-login-0]: FAILED! => {"changed": false, "msg": "Unable to start service slurmd: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.\n"}

When starting i got this message:
slurmd[21031]: fatal: Unable to determine this slurmd's NodeName

After some research I saw that the hostname of the login node is not written into the slurm.conf template file (only compute nodes are).

I think this is a configuration problem in the template file?

Thomas

Login-only node in configless mode fails to get config

  • Create separate control and login nodes in configless mode.
  • Login to login node:
[root@testohpc-login-0 /]# sinfo
sinfo: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
sinfo: error: fetch_config: DNS SRV lookup failed
sinfo: error: _establish_config_source: failed to fetch config
sinfo: fatal: Could not establish a configuration source

Note sinfo on control node works ok.

Role should add/document requirement for OpenHPC repo

Currently this is normally done manually, e.g.:

  vars:
    openhpc_repo_url: "https://github.com/openhpc/ohpc/releases/download/v1.3.GA/ohpc-release-1.3-1.el7.x86_64.rpm"
  tasks:
    - name: Install OpenHPC repository
      yum:
        name: "{{ openhpc_repo_url }}"

Connected to #6?

Slurm node config can be empty

Should check that the NodeName=DEFAULT section of templates/slurm.conf.j2 is output at least once per slurm partition. From the template logic it looks like there are various circumstances where there can be no hosts in the group for this, and therefore the cpu info is not set.

Symptom is sinfo -N --long showing 1:1:1 in the S:C:T (sockets:cores per socket:threads per core) column.

Support less-noisy reconfigure

With an image-based deploy the current workflow for adding a node looks like:

  1. Boot a new compute node. It will attempt to join the cluster, slurmctld will say it doesn't have a nodename entry, and slurmd will die.
  2. Run the role on the ENTIRE cluster, so that:
    • new slurm.conf generated including the new node
    • slurmctld and ALL slurmd restarted (inc. the new, failed one) in the correct order

Item 2 is really noisy as all the compute nodes run all the ansible. It would be good if we could just run the appropriate steps for these cases.

I think the cases covered are:

  • Adding nodes with an appropriate image
  • Deleting nodes

We probably could do something just using the configure tag, but this needs testing/documenting.

slurm.conf templating can't handle empty partitions

Example of faulty result:

NodeName=DEFAULT State=UNKNOWN \
    RealMemory=122185 \
    Sockets=2 \
    CoresPerSocket=10 \
    ThreadsPerCore=2
NodeName=kbendl-compute-0
NodeName=kbendl-compute-1
PartitionName=baremetal \
    Default=YES \
    MaxTime=86400 \
    State=UP \
    Nodes=\
kbendl-compute-0,\
kbendl-compute-1NodeName=-nonesuch
PartitionName=small \
    Default=YES \
    MaxTime=86400 \
    State=UP \
    Nodes=\
-nonesuch

At a glance I think there are two issues here: empty partitions should probably be ignored (that would be a change in behaviour, but would be more useful I think), and clearly there's a missing \n here too.

Proctracktype warning

Warning on slurmctld startup:

WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
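
A sketch of a possible fix via the role's openhpc_config override (ProctrackType is the standard slurm.conf parameter; proctrack/linuxproc is the alternative the warning itself suggests):

openhpc_config:
  ProctrackType: proctrack/linuxproc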

Adding nodes to running cluster fails in configless mode.

In this mode, slurmd's have to contact slurmctl for the config, which means they need to be defined in the running config.

Currently:

  1. The changed slurm.conf on disk triggers a handler to reload slurmctld - this task is pending the end of the play
  2. The "Configure Slurm service" task then runs which will ensure slurm[d/ctld] is enabled and running:
    i. On the control node it's a no-op
    ii. On the new compute node it tries to start slurmd, which fails with fatal: Unable to determine this slurmd's NodeName as it's only defined in the on-disk config, not the running one.
  3. The end of the play is reached, and the handler task then runs to reload slurmctld. This also fails because the running and file configs have different numbers of nodes. For some reason this doesn't show as a failure in ansible.

openmpi3 job distribution appears odd

Not clear this is really specific to this role. However, using openmpi3 the job distribution under slurm looks odd. The examples below use srun but similar behaviour is seen using sbatch and mpirun too.
Packages: "@ohpc-slurm-client", "@ohpc-slurm-server", "slurm-slurmctld-ohpc", "slurm-example-configs-ohpc", "gnu7-compilers-ohpc" and "openmpi3-gnu7-ohpc".

Using gnu7 and openmpi3 modules:

$ srun --mpi=list -N 4 -n 4 helloworld
srun: MPI types are...
srun: openmpi
srun: none
srun: pmi2
srun: pmix_v2
srun: pmix

$ srun -N 4 -n 4 helloworld
# fails with: "... OMPI not built with SLURM's PMI support and therefore cannot execute ..."

Ok, maybe not entirely surprising, but it would be good if we could build one with PMI support. Note that using --mpi=openmpi or --mpi=pmi2 gives the same error.

These work as expected:

$ srun --mpi=pmix_v2 -N 4 -n 4 helloworld
$ srun --mpi=pmix -N 4 -n 4 helloworld
$ srun --mpi=pmix -N 4 -n 8 helloworld

but then using more jobs:

$ srun --mpi=pmix -N 4 -n 16 helloworld

all 16 processes end up on 1 node. Using -m block or -m cyclic doesn't change this.

This works:

$ srun --mpi=pmix -N 4 -n 16 --ntasks-per-node=4 helloworld

with processes distributed as if -m block is used, which I expect to be the default behaviour.

Can get correct behaviour using:

$ srun --mpi=pmix -N 4 -n 16 --ntasks-per-node=4 -m cyclic helloworld
# correctly shows cyclic behaviour
$ srun --mpi=pmix -N 4 -n 16 --ntasks-per-node=4 helloworld
# OK: processes distributed as if "-m block" used

slurm daemons started too early

In runtime.yml the slurm services are started before munge and the OHPC packages are installed. Starting the daemons will potentially run any jobs stored in state files, so this is the wrong way round. It works ok on a fresh install as there won't be stored jobs, but trips up e.g. if autoscaling uses this playbook to add nodes.

More broadly there may also be other requirements not part of this role (e.g. distributed filesystems) which should be satisfied before starting/restarting the services.

Memory not set for compute nodes

Processor info is set in slurm.conf but memory info isn't and requires passing ram_mb. Could use ansible to fill this out automatically.
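
A sketch of filling this automatically from gathered facts, mirroring the free --mebi total * openhpc_ram_multiplier calculation the README now describes (the placement shown is illustrative):

openhpc_slurm_partitions:
  - name: compute
    ram_mb: "{{ (ansible_memory_mb.real.total * openhpc_ram_multiplier) | int }}"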

filetxt accounting storage not supported in Slurm 20.11

Currently role default (openhpc_slurm_accounting_storage_type) for AccountingStorageType is accounting_storage/filetxt.

openhpc v2.1 (which is in the same repos) provides slurm 20.11 which only supports /none and /slurmdbd.

ohpc-release install fails signature check

Under some circumstances installing the ohpc-release repo for openhpc v2 fails with

Failed to validate GPG signature for ohpc-release-2-1.el8.x86_64

Not clear why: it worked on alaska with a centos8 cloud image, failed using docker centos:8, then failed on alaska using a new deployment host with a centos8 cloud image. Tried going back to the same ansible version as the working run; still failed.

Configless mode failed to start slurmd

On 00d83d7, I saw slurmd startup after instance creation fail with:

[centos@testohpc-compute-2 ~]$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Tue 2020-11-24 09:56:20 UTC; 24s ago
  Process: 26914 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=killed, signal=TERM)

Nov 24 09:54:50 testohpc-compute-2.novalocal systemd[1]: Starting Slurm node daemon...
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: slurmd.service: Start operation timed out. Terminating.
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: slurmd.service: Failed with result 'timeout'.
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: Failed to start Slurm node daemon.
[centos@testohpc-compute-2 ~]$ journalctl -xe
Nov 24 09:54:52 testohpc-compute-2.novalocal systemd[1]: Started man-db-cache-update.service.
-- Subject: Unit man-db-cache-update.service has finished start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit man-db-cache-update.service has finished starting up.
-- 
-- The start-up result is done.
Nov 24 09:54:59 testohpc-compute-2.novalocal slurmd[27015]: error: _fetch_child: failed to fetch remote configs
Nov 24 09:56:16 testohpc-compute-2.novalocal systemd[4426]: Starting Mark boot as successful...
-- Subject: Unit UNIT has begun start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit UNIT has begun starting up.
Nov 24 09:56:16 testohpc-compute-2.novalocal systemd[4426]: Started Mark boot as successful.
-- Subject: Unit UNIT has finished start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit UNIT has finished starting up.
-- 
-- The start-up result is done.
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: slurmd.service: Start operation timed out. Terminating.
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: slurmd.service: Failed with result 'timeout'.
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: Failed to start Slurm node daemon.

However issuing sudo systemctl start slurmd immediately after this succeeded. Some issue with name resolution or something after node startup?

slurmctld start fails

As of 5aca744, slurmctld start fails.

There are various error messages in /var/log/messages; the ones about MailProg are a red herring. The culprit is:

error: open /var/log/slurm_jobacct.log: Permission denied
...
fatal: failed to initialize accounting_storage plugin

The file has the wrong ownership; running this fixes it:

[centos@slurm-control-0 ~]$ sudo chown slurm /var/log/slurm_jobacct.log
[centos@slurm-control-0 ~]$ sudo systemctl start slurmctld

but this can't be done before the openhpc role runs, as the user slurm won't exist yet.

Note that ansible shows the task as [changed] and using -vvvv shows that systemd returns a successful start to ansible; however service slurmctld status does show it as failed.

An alternative fix is to remove the line:

AccountingStorageType=accounting_storage/filetxt

in slurm.conf.j2:103. The same parameter=value is also given, but commented out, under the Accounting section at :91, so it's not clear if this is actually meant to be there at all.

Remove FastSchedule option from slurm.conf

slurm.conf.j2 currently has FastSchedule=0.

The FastSchedule parameter was deprecated in 19.05.3+ and prevents daemon start with the openhpc v2 slurm version (TBD) as noted by @oneswig.

Docs here. Note that:

  • default value is =1,
  • conf template currently has PreemptMode=suspend,gang so processor count matching with =0 will actually be active.

Brief discussion here: https://slurm.schedmd.com/SLUG19/Slurm_20.02_and_Beyond.pdf

SlurmdParameters=config_overrides is equivalent to FastSchedule=2 but there's no replacement for other values.

Suggest we just delete with no replacement. Intending to add functionality to allow parameter changes/additions later, which could be used to add in SlurmdParameters=config_overrides if appropriate during development.
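
If that facility is added, a sketch of restoring the FastSchedule=2 behaviour via such an override (assuming the openhpc_config mechanism documented in the Role Variables section):

openhpc_config:
  SlurmdParameters: config_overrides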

Role variables doc improvements

Some minor issues with the Role Variable docs:

  • default missing "this partition is the default".
  • Would be good to link variables to their slurm.conf parameters:
    • ram_mb (RealMemory)
    • default (Default)
    • maxtime (MaxTime)
  • The README example also shows some additional keys (flavor, image and user) which are not used by the role, which is a bit confusing.
  • The link between group/partition names and ansible is shown in the example, but the documentation could be more explicit and it's not stated that it actually links to instance names as well. For each group/partition definition:
    • there must be an ansible group named <cluster_name>_<name> which contains the relevant compute hosts: also note that this means ansible's group name restrictions apply to both bits
    • nodes in this group must have hostnames of the form <cluster_name>-<name>-<num> (where num starts at 0; NB if #20 is added then the suffix is only required for groups/partitions with >1 node)

Handler name misleading

The handler Restart SLURM service actually only RELOADS the appropriate service. The behaviour is appropriate, as a restart is only required for some config file changes, and a restart will break the scheduler loop, which might be undesirable. So the handler name should be changed to avoid confusion.

lmod only installed on compute

compute.yml includes the task - name: Install OpenHPC LMOD. However, given that openhpc_packages are installed on all nodes, I don't think this makes sense; lmod should also be installed on the login node.

Ability to install given versions of OpenHPC

The role should (optionally?) install an RPM for the OpenHPC repo file for a named version.

For example, it would be good to supply an OpenHPC release version as parameter and have the effect of installing a repo like this:

    - name: Ensure the OpenHPC package repo rpm is present
      yum:
        name: "https://github.com/openhpc/ohpc/releases/download/v1.3.GA/ohpc-release-1.3-1.el7.x86_64.rpm"
        state: present

The deep link into the OpenHPC distribution package archives would be added (and maintained) by this role. The user would name a version (e.g. openhpc_version=v1.3) and the role would maintain a dictionary of known versions and their repo URLs.
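
A sketch of such a dictionary (both variable names are hypothetical; the v1.3 URL is the one from the task above):

openhpc_version: v1.3    # hypothetical user-facing variable
openhpc_release_rpms:    # hypothetical role-maintained mapping of known versions
  v1.3: "https://github.com/openhpc/ohpc/releases/download/v1.3.GA/ohpc-release-1.3-1.el7.x86_64.rpm"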

Slurm partition configuration makes assumptions about hostnames

The partition configuration slurm_partitions is really using a format based on that used by the stackhpc.cluster-infra role, and makes assumptions about the format of the nodes' hostnames that may not be generally applicable. Here is how the partition config is currently generated:

{% for part in slurm_partitions %}
{% for group in part.get('groups', [part]) %}
NodeName={{group.cluster_name|default(cluster_name)}}-{{group.name}}-[0-{{group.num_nodes-1}}] \
{% if 'ram_mb' in group %}
    RealMemory={{group.ram_mb}} \
{% endif %}
{% set group_name = group.cluster_name|default(cluster_name) ~ '_' ~ group.name %}
{# If using --limit, the first host in each group may not have facts available. Find one that does. #}
{% set group_hosts = groups[group_name] | intersect(play_hosts) %}
{% if group_hosts | length > 0 %}
{% set first_host_hv = hostvars[group_hosts | first] %}
    Sockets={{first_host_hv['ansible_processor_count']}} \
    CoresPerSocket={{first_host_hv['ansible_processor_cores']}} \
    ThreadsPerCore={{first_host_hv['ansible_processor_threads_per_core']}} \
{% endif %}
    State=UNKNOWN
{% endfor %}
{% endfor %}
{% for part in slurm_partitions %}
PartitionName={{part.name}} Nodes={% for group in part.get('groups', [part]) %}{{group.cluster_name|default(cluster_name)}}-{{group.name}}-[0-{{group.num_nodes-1}}]{% if not loop.last %},{% endif %}{% endfor %} Default=YES MaxTime=24:00:00 State=UP
{% endfor %}

Node names are assumed to be in the following format:

{{group.cluster_name|default(cluster_name)}}-{{group.name}}-[0-{{group.num_nodes-1}}]

Really, it should be possible to take the nodes' hostnames from the inventory. The difficult part will be in determining the common prefix of the nodes' names.

Can't define a cluster with zero compute nodes

This would be useful for e.g. packer build pipelines if you want to stand up the login/control node first. But the templating fix for #51 means that an error gets thrown if there are no hosts in a group/partition, to avoid nodes getting defined with no cpu info. I guess the correct approach is to skip writing the Node/Partition info for that group entirely, possibly warning that that has been done?

CLI usage of molecule doesn't auto-skip incompatible tests

For github CI we define a list of tests vs images to skip, as some things don't work/aren't supported by the role on CentOS 7 (i.e. OpenHPC v1), e.g. configless operation. However, when running molecule manually from the CLI with molecule test --all those tests get run.
Would be nice if that could use the same skip list.

Provide control over slurm.conf

At the moment the user has no way to change parameters in this. This meant the enhanced-slurm work had to fork the role.

Consider providing facilities to override template values and/or to take a user-provided template.
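
The openhpc_config variable documented in the Role Variables section above provides the first of these; for example, a sketch (the parameter and value are illustrative):

openhpc_config:
  PreemptMode: suspend,gang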

drain nodes example in README has unsupported parameters

Currently:

openhpc_slurm_partitions:
  - name: "compute"
    flavor: "compute-A"
    image: "CentOS7.5-OpenHPC"
    num_nodes: 6
    user: "centos"

flavor, image, num_nodes, user are not supported parameters. num_nodes occurs further down the example too.

Provide error messages on failure to start slurm daemons

If this fails, then journalctl or systemctl status might well have useful info, e.g. if you specify two partitions which share nodes (which is legal to slurm, but isn't handled by our current templating) then:

  • slurmctld appears to start from the ansible but actually fails
  • slurmd shows Unable to start service slurmd: Job for slurmd.service failed because a timeout was exceeded. See "systemctl status slurmd.service" and "journalctl -xe" for details.

but actually the control node shows:

$ sudo journalctl -xe
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-2 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-3 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: fatal: Duplicated NodeHostName nrel-hpc-0 in config file

and

$ sudo systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2021-02-24 09:21:38 UTC; 5min ago
  Process: 26178 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 26180 (code=exited, status=1/FAILURE)

Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: layouts: no layout to initialize
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-2 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-3 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: fatal: Duplicated NodeHostName nrel-hpc-0 in config file
Feb 24 09:21:38 nrel-control.novalocal systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 24 09:21:38 nrel-control.novalocal systemd[1]: slurmctld.service: Failed with result 'exit-code'.

Add slurmdb

Consider adding slurmdb rather than using the text plugin for sacct (which means e.g. time ranges don't work properly).

Could have a slurm_db group to which you add e.g. the control node to enable this.

Still want to be able to not have it.
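
A sketch of the group-based toggle, using the openhpc_enable.database flag documented in the Role Variables section (the slurm_db group name is from the suggestion above):

openhpc_enable:
  database: "{{ inventory_hostname in groups.get('slurm_db', []) }}"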

filetxt storage locations not created if not default

In openhpc v2.0 / slurm 20.02.5 the filetxt storage type can be used for either/both accounting storage and job completion storage. Locations for these are set by AccountingStorageLoc and JobCompLoc respectively. In openhpc v2.1 /slurm v20.11.3 filetxt storage can only be used for job completion storage. Both versions support DefaultStorageLoc which can fill in for the other *StorageLoc parameters.

As of v0.7.0 the situation with this role is as follows:

  • DefaultStorageLoc is not specified in the slurm.conf template.
  • The default accounting storage type is filetxt
  • AccountingStorageLoc is not specified in the slurm.conf template
  • Accounting logs to /var/log/slurm_jobacct.log. This appears to be the internal Slurm default for DefaultStorageLoc (source). This file is not part of any package, and is not created by the role. So Slurm must create it itself.
  • The default job completion type is none.
  • There is a role default for JobCompLoc which is /var/log/slurm_jobacct.log. This appears to be inconsistent with Slurm's internal default of /var/log/slurm_jobcomp.log (source) although it is the internal default for DefaultStorageLoc.

If accounting storage type is set to none and job completion is set to filetxt, slurmctld dies on startup with a permissions error for /var/log/slurm_jobacct.log. Changing the JobCompLoc to be /var/log/slurm_jobcomp.log doesn't help, and the file must be "manually" created before slurmctld starts.

If AccountingStorageLoc is manually added to the slurm.conf template with a non-default location, then slurmctld similarly dies on startup. So it appears that there is some special-case creation of /var/log/slurm_jobacct.log, but only if it is used for accounting storage, not as job completion storage.


Enable task affinity in slurm.conf

Currently there is no task launch plugin configured, which means srun's --cpu-bind option does not work.

See guidance under TaskPlugin on the slurm.conf manpage:

NOTE: It is recommended to stack task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf. This setup uses the task/affinity plugin for setting the affinity of the tasks (which is better and different than task/cgroup) and uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best of both pieces.
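
A sketch of enabling this via the role's openhpc_config variable, following the manpage guidance quoted above (the matching TaskAffinity=no and ConstrainCores=yes settings in cgroup.conf would still need making separately):

openhpc_config:
  TaskPlugin: task/affinity,task/cgroup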

Set memory limits

As per S3.8.4.3 of the openhpc v2 install docs, we should increase locked memory limits on nodes.
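
A sketch of one way to do this alongside the role, using the community.general.pam_limits module (the unlimited memlock value follows common OpenHPC guidance and is an assumption, not quoted from the install docs):

    - name: Increase locked memory limits on cluster nodes
      community.general.pam_limits:
        domain: '*'
        limit_type: '-'          # set both hard and soft limits
        limit_item: memlock
        value: unlimited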
