agnostiqhq / covalent-slurm-plugin
Executor plugin interfacing Covalent with Slurm
Home Page: https://covalent.xyz
License: Apache License 2.0
Note: this one is motivated by some "getting started" snags by a colleague of mine, @rwexler.
covalent-slurm-plugin/covalent_slurm_plugin/slurm.py
Lines 337 to 346 in bcf2049
Because of this line, there is currently no way to use the plugin if you aren't using conda. There are many situations in which the user may not want to use an Anaconda environment. For instance, maybe the user is loading an HPC-hosted Python module into which they've installed packages via pip. They would have to load the module in their `prerun_commands`, but otherwise shouldn't be required to use conda.
Submit a job without conda in the PATH.
Covalent should support running with Python in the PATH without necessarily relying on conda.
Only add the conda-related lines if `conda_env` is not False.
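A minimal sketch of the suggested fix, assuming a hypothetical `build_preamble` helper (not the plugin's actual code): the conda activation lines are emitted only when `conda_env` is truthy, so module- or pip-based Python setups work unchanged.

```python
# Sketch (illustrative, not the plugin's actual code): build the job-script
# preamble, emitting the conda lines only when conda_env is truthy.
def build_preamble(conda_env):
    """Return the shell lines that set up the environment, if any."""
    lines = ["source $HOME/.bashrc"]
    if conda_env:  # skip conda entirely when conda_env is False/None/""
        lines += [
            f"conda activate {conda_env}",
            "retval=$?",
            "if [ $retval -ne 0 ] ; then",
            f'  >&2 echo "Conda environment {conda_env} is not present on the compute node."',
            "  exit 99",
            "fi",
        ]
    return "\n".join(lines)
```

With `conda_env=False`, the preamble contains only the `source $HOME/.bashrc` line, and any environment setup is left to `prerun_commands`.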
Need to apply the following patch in the next merge:
diff --git a/.github/workflows/pypi.yml b/.github/workflows/pypi.yml
index de94eb5..676b614 100644
--- a/.github/workflows/pypi.yml
+++ b/.github/workflows/pypi.yml
@@ -46,8 +46,8 @@ jobs:
VERSION="$(cat ./VERSION)"
cd dist
tar xzf covalent-slurm-plugin-${VERSION}.tar.gz
- test -e covalent-slurm-plugin-${VERSION}/slurm.py
- rm -rf cova-${VERSION}/
+ test -e covalent-slurm-plugin-${VERSION}/covalent_slurm_plugin/slurm.py
+ rm -rf covalent-slurm-plugin-${VERSION}/
- name: Upload Distribution
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
N/A
PyPI upload should work properly
N/A
When an exception is raised by the Slurm plugin signaling a task failure, that exception is not propagated to the dispatcher's `_run_task`. Thus Covalent always marks the electron as completed even if no results are returned.
The following workflow
@ct.electron
def task():
    assert False
    return 1

@ct.lattice
def workflow():
    return task()
should fail, but instead "succeeds".
The electron should fail. Instead it "succeeds" because we weren't propagating exceptions raised within `executor.execute()` to the dispatcher.
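A minimal sketch of the desired behavior, using stand-in names rather than Covalent's actual internals: the execute step re-raises task exceptions so the dispatch step can mark the electron FAILED instead of silently COMPLETED.

```python
# Sketch (stand-in names, not Covalent's internals): execute() re-raises
# task exceptions so run_task() can record a FAILED status.
def execute(fn):
    try:
        return fn()
    except Exception as err:
        # Propagate instead of swallowing the exception.
        raise RuntimeError(f"Task raised {type(err).__name__}") from err

def run_task(fn):
    try:
        return execute(fn), "COMPLETED"
    except RuntimeError:
        return None, "FAILED"
```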
No response
Specifying `prerun_commands` in the `SlurmExecutor` does not result in the commands appearing in the Slurm job script.
Run a simple toy example with the `prerun_commands` keyword argument.
I used the example below:
import covalent as ct

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="covalent",
    options={
        "nodes": f"{n_nodes}",
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "quacc",
        "time": "00:30:00",
    },
    remote_workdir="/pscratch/sd/r/rosen/quacc/",
    create_unique_workdir=True,
    use_srun=False,
    prerun_commands=[
        "module load vasp/6.4.1-cpu",
        f"export QUACC_VASP_PARALLEL_CMD='{vasp_parallel_cmd}'",
    ],
)
The `prerun_commands` should appear at the bottom of the job script, but they do not. The following was present for me:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --qos=debug
#SBATCH --constraint=cpu
#SBATCH --account=matgen
#SBATCH --job-name=quacc
#SBATCH --time=00:30:00
#SBATCH --parsable
#SBATCH --output=/pscratch/sd/r/rosen/quacc/94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0/node_0/stdout-94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0-0.log
#SBATCH --error=/pscratch/sd/r/rosen/quacc/94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0/node_0/stderr-94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0-0.log
#SBATCH --chdir=/pscratch/sd/r/rosen/quacc/94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0/node_0
source $HOME/.bashrc
conda activate covalent
retval=$?
if [ $retval -ne 0 ] ; then
>&2 echo "Conda environment covalent is not present on the compute node. " "Please create the environment and try again."
exit 99
fi
remote_py_version=$(python -c "print('.'.join(map(str, __import__('sys').version_info[:2])))")
if [[ "3.10" != $remote_py_version ]] ; then
>&2 echo "Python version mismatch. Please install Python 3.10 in the compute environment."
exit 199
fi
python /pscratch/sd/r/rosen/quacc/script-94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0-0.py
wait
Note how there are no prerun commands here.
I have not yet tried the `postrun_commands`.
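For reference, a hypothetical helper (not the plugin's actual code) showing where `prerun_commands` and `postrun_commands` should land relative to the `python` invocation in the generated script body:

```python
# Hypothetical helper: expected ordering of the generated script body.
def assemble_script_body(python_cmd, prerun_commands=(), postrun_commands=()):
    lines = list(prerun_commands)     # user setup, e.g. "module load ..."
    lines.append(python_cmd)          # the actual task invocation
    lines.extend(postrun_commands)    # user teardown
    lines.append("wait")
    return "\n".join(lines)
```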
No response
When dispatching with the Slurm executor, one of the SSH commands seems to hang, and after 10 seconds of no response from the server, a timeout occurs.
from covalent.executor import SlurmExecutor
import covalent as ct

executor = SlurmExecutor(
    ssh_key_file="~/.ssh/<key_name>",
    username="<user_name>",
    poll_freq="3",
    cache_dir="<covalent_local_cache_dir>",
    address="<address>",
    remote_workdir="<any_workdir_on_remote>",
    options={"partition": "debug", "cpus-per-task": 2, "nodelist": ["<hostname>"]},
)

@ct.electron(executor=executor)
def join_words(a, b):
    return ", ".join([a, b])

@ct.electron(executor=executor)
def excitement(a):
    return f"{a}!"

@ct.lattice
def simple_workflow(a, b):
    phrase = join_words(a, b)
    return excitement(phrase)

dispatch_id = ct.dispatch(simple_workflow)("Hello", "World")
A dispatch should be sent and run successfully. The output of the above should be `Hello, World!`.
No response
From @arosen93 :
There needs to be a way in which files can be written out to the current working directory without them potentially overwriting each other between calculations. The solution is to have unique subfolders for each Electron, as described below.
In many quantum chemistry codes, files are often written out to the filesystem at runtime. Typically, these files are hard-coded to be written to the current working directory and are both read from and written to throughout the course of the calculation. Therefore, if one launches a quantum chemistry code from within an Electron, these files will be overwriting each other if multiple calculations are going on simultaneously. It is also impossible to preserve the provenance of where these files originated from.
On a personal note, I would love to get my computational chemistry and materials science colleagues on board with Covalent, but this is currently a major dealbreaker in terms of adoption (hopefully not for long though!).
In the toy example below, the files are both written out to the same working directory, which will cause file loss and result in unexpected errors in more complex examples where the file I/O cannot be explicitly controlled by the user. For instance, with the SLURM plugin, everything is written out to the same `remote_workdir` folder (see #1619 for where this currently is for local executors; there is a bug).
import covalent as ct
import os

@ct.electron
def job(val1, val2):
    with open("job.txt", "w") as w:
        w.write(str(val1 + val2))
    return "Done!"

@ct.lattice
def workflow(val1, val2, val3, val4):
    job1 = job(val1, val2)
    job2 = job(val3, val4)
    return "Done!"

dispatch_id = ct.dispatch(workflow)(1, 2, 3, 4)
result = ct.get_result(dispatch_id)
print(result)
My recommendation is that there should be a keyword argument somewhere of the type `create_folders: bool`. If set to `True`, then wherever the default working directory is (which may depend on the executor), Covalent would automatically create subfolders of the form `dispatch_id/node_number` in the current working directory. Each calculation node would `cd` into its respective `dispatch_id/node_number` directory to avoid file-overwriting concerns from any external program writing to the current working directory during runtime.
Note that this is distinct from the `results_dir`, which is where the `.pkl` and `.log` files go and may not be on the same filesystem. For instance, in the SLURM executor, the current working directory is set by the `remote_workdir`. In the proposed feature addition, every new calculation would have `remote_workdir/dispatch_id/node_number` as its unique current working directory if `create_folders = True`. The same could be done for the local executors, except that it would be a local `workdir` as the base directory.
In addition to preventing file overwriting, this has the added benefit of ensuring that the files written out at runtime can be linked back to their corresponding Electrons for reproducibility purposes.
This issue should be addressed after #1619 is closed. This issue was originally discussed in #1592, which I split into two separate issues.
For full context, see [here](AgnostiqHQ/covalent#1592 (comment)). This route is not sufficient because there are many use cases where the user cannot control where the files are written out to at runtime. That is usually specified by the external code that is being run and is often the current working directory. Also, if Covalent ever adds a feature where the executor is dynamically selected, the user may not know in advance which filesystem the calculation will be run on.
For full context, see [here](AgnostiqHQ/covalent#1592 (comment)). Unfortunately, this route isn't sufficient for the problem. We want to make sure that if there are multiple ongoing calculations that write files out to the current working directory during runtime that they do not overwrite one another. Many computational chemistry codes are hard-coded to rely on being able to write and read input and output files in the current working directory throughout the calculation itself. Transferring files before or after the calculation finishes is a separate issue that should be left to the user.
For full context, see [here](AgnostiqHQ/covalent#1592 (comment)). The relevant code snippet from @santoshkumarradha is below:
import covalent as ct
from pathlib import Path
import os
from functools import wraps

def change_dir_and_execute(directory):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            current_dir = os.getcwd()
            try:
                os.chdir(directory)
                result = func(*args, **kwargs)
            finally:
                os.chdir(current_dir)
            return result
        return wrapper
    return decorator

path = Path(".").absolute()

@ct.electron
@change_dir_and_execute(path)
def job(val1, val2, file):
    with open(file, "w") as w:
        w.write(str(val1 + val2))
    return Path(".").absolute()

@ct.lattice
def workflow(file, val1, val2, val3, val4):
    job1 = job(val1, val2, file)
    return job1

file = "example.txt"
dispatch_id = ct.dispatch(workflow)(file, 1, 2, 3, 4)
result = ct.get_result(dispatch_id, wait=True)
print(result)
While this may be possible, it is not concise or clean. One of the major benefits of Covalent as a whole is that there is minimal friction to go from writing a function to running a complex workflow. Given the many foreseeable use cases where significant I/O is written out to the current working directory at runtime (without this being something that can be changed), always needing this verbose approach is less than ideal.
Tracking AgnostiqHQ/covalent#1628
Currently, the Python file is always run with `srun` (with potentially some srun options). An example script might look like the following:
#!/bin/bash
#SBATCH --partition=debug
#SBATCH --account=matgen
#SBATCH --constraint=cpu
#SBATCH --job-name=covalent
#SBATCH --time=00:10:00
#SBATCH --parsable
#SBATCH --output=/global/cfs/cdirs/matgen/arosen/covalent10/stdout-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.log
#SBATCH --error=/global/cfs/cdirs/matgen/arosen/covalent10/stderr-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.log
source $HOME/.bashrc
conda activate covalent
retval=$?
if [ $retval -ne 0 ] ; then
>&2 echo "Conda environment covalent is not present on the compute node. " "Please create the environment and try again."
exit 99
fi
remote_py_version=$(python -c "print('.'.join(map(str, __import__('sys').version_info[:2])))")
if [[ "3.9" != $remote_py_version ]] ; then
>&2 echo "Python version mismatch. Please install Python 3.9 in the compute environment."
exit 199
fi
srun \
python /global/cfs/cdirs/matgen/arosen/covalent10/script-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.py
wait
However, as originally noted by my colleague @rwexler and confirmed by me, this can be problematic depending on the use case. For instance, in computational materials science, a common usage pattern is to use SLURM to submit a Python job. That Python job then does several things, including making calls to and running external codes. This is how the Atomic Simulation Environment (ASE) works, which is a tool highlighted in one of the tutorials. You can see an example of how that's done in practice here, if you're curious.
The problem with this approach as a general rule is that if the pickled Python function at some point launches an external, parallel code with `srun` or `mpirun`, you now have nested parallelism, and you can get very unintended behavior when running on multiple cores/nodes. We want to provide the user a way to have Covalent simply run `python <NameOfJob.py>` without `srun`, while still launching the workflow as a SLURM job so that the proper resources are reserved. I propose introducing a new kwarg named `use_srun: bool`, which has a default value of `True` but can be set to `False`.
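The proposed kwarg amounts to a one-line branch when the job script is assembled; this is an illustrative sketch, not the plugin's actual code:

```python
# Sketch of the proposed use_srun kwarg: when use_srun is False, the job
# script calls python directly, avoiding nested parallelism if the task
# itself launches srun/mpirun.
def launch_line(script_path, use_srun=True):
    prefix = "srun " if use_srun else ""
    return f"{prefix}python {script_path}"
```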
I understand that this usage pattern might be difficult to visualize, so I can definitely explain further. It's a subtle but incredibly important usage pattern though!
There are no alternatives other than cloning the plugin and modifying the line, or not using entire software ecosystems like ASE with Covalent at all.
@scottwn commented on Wed Nov 24 2021
In GitLab by @nolanagnostiq on Aug 9, 2021, 13:35
Create a SLURM executor that connects to a generic SLURM cluster (but specifically beehive.agnostiq.ai) to run a job
???
SLURM cluster spins up an instance or uses an existing compute resource for use on an electron.
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Aug 16, 2021, 22:19
added to epic &58
@scottwn commented on Wed Nov 24 2021
In GitLab by @nolanagnostiq on Aug 24, 2021, 10:15
assigned to @wjcunningham7
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Aug 31, 2021, 23:18
created merge request !26 to address this issue
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Aug 31, 2021, 23:18
mentioned in merge request !26
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Aug 31, 2021, 23:31
design notes:
- `SlurmExecutor` will inherit from `BaseExecutor` and expose a member function `execute`
- `srun`
- a `slurm_params` dict will contain parameters that will be passed through to Slurm

implementation notes:
- `slurmrestd` on beehive-hive0, then test the REST API with a dummy job submission

@scottwn commented on Wed Nov 24 2021
In GitLab by @nolanagnostiq on Sep 1, 2021, 12:38
Overall, this looks good. Couple of comments:
The `slurm_params` field and the cluster-side script will probably see reuse across other executors. We don't need to optimize prematurely, but another thing to think about while implementing.

@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Sep 1, 2021, 22:00
Example of how to submit a job to Slurm using sbatch over SSH:
ssh beehive 'bash -l -c "cat - | sbatch"' <<'EOF'
#!/bin/bash
#SBATCH -J testjob
#SBATCH -o /federation/testjob.log
#SBATCH -w beehive-debug-st-t2medium-1
#SBATCH --parsable
# This is the body of the script where executables are invoked
hostname
EOF
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Sep 1, 2021, 22:01
Anything in `slurm_params` should be translated into an `#SBATCH` statement.
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Sep 1, 2021, 23:20
@nolanagnostiq can you advise on how to access the schedule assigned to the TransportableObject? In a minimal example I add a decorator `@electron(num_cpu=2)` to one of my electrons, but I can't figure out how to retrieve this information in order to format a Slurm batch script.
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Sep 13, 2021, 07:35
as we discussed this issue will be pushed to backlog while we work on a local slurm executor, #72
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Sep 23, 2021, 13:04
unassigned @wjcunningham7
When a Covalent-based error happens where the results pkl can't be found, the UI is very unclear about what the problem is. Ideally, I would like to see more details so I can know where to debug things.
import covalent as ct

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="covalent",
    options={
        "nodes": 1,
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "test",
        "time": "00:10:00",
    },
    remote_workdir="/pscratch/sd/r/rosen/test",
    create_unique_workdir=True,
    cleanup=False,
)

@ct.lattice(executor=executor)
@ct.electron
def workflow():
    import os
    os.chdir("../")
    return os.getcwd()

ct.dispatch(workflow)()
The UI should give me more information about the issue. The log was not very helpful either: covalent_ui.log.txt
Yes, two things should be done.
1. A `FileNotFoundError` was raised, but this is never shown in the UI. It just says "error". If I had known it was a `FileNotFoundError`, I would have identified the issue more quickly.
2. `cloudpickle` needs to be an installed package in the remote machine's Python environment, but this isn't specified anywhere in the docs.
Covalent is also required on the remote machine, which should be specified in the docs.
The SLURM script will not submit if the Python version on the remote machine does not match that used to submit the calculation to the server, as noted here. This should be mentioned in the installation/usage instructions.
The `"options": {"parsable": ""}` parameter is needed according to Will. This should be added to the docs. I think the SLURM directive is necessary for the job status to be properly logged.
These changes should be reflected in both the README and the main docs. Or maybe it makes sense to keep the README here minimal and link to the corresponding executor docs to avoid duplicating work? Up to you, obviously.
The PR template has "I have updated the documentation, VERSION, and CHANGELOG accordingly" as a checkbox. However, the VERSION does not need to be updated since it will be done so automatically.
It also says:
Both of these are not currently accurate since the user is not supposed to increment the version number manually or set the date.
The same should be done in the main covalent repo too I think...
I can get the `covalent-slurm-plugin` to work fine with 0.16.0 but not with 0.18.0. I imagine the refactoring effort introduced a bug. Running any minimal example with the `covalent-slurm-plugin` yields an error akin to
scp: /global/homes/r/rosen/quacc/7f078879-ee9e-45a4-960b-98839dfdb1b8/node_0/stdout-7f078879-ee9e-45a4-960b-98839dfdb1b8-0.log: No such file or directory
with the following log
Exception in ASGI application
Traceback (most recent call last):
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 404, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
await route.handle(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
await self.app(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
response = await func(request)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
return await future
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
result = context.run(func, *args)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/covalent_ui/api/v1/routes/end_points/electron_routes.py", line 216, in get_electron_file
response, python_object = handler.read_from_serialized(result["results_filename"])
TypeError: cannot unpack non-iterable NoneType object
Use `covalent-slurm-plugin==0.18.0` and submit any minimal example like the one in the README.
This was my setup for what it's worth:
n_nodes = 1
n_cores_per_node = 1

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="quacc",
    options={
        "nodes": f"{n_nodes}",
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "quacc",
        "time": "00:10:00",
    },
    remote_workdir="/pscratch/sd/r/rosen/quacc",
    create_unique_workdir=True,
    use_srun=False,
    cleanup=False,
)
The job should be submitted to the queue. In reality, it never is submitted.
No response
The function to be executed is sent to the remote machine as a Covalent `TransportableObject`, which needs to be deserialized on the remote machine in order to run; that deserialization requires a Covalent installation on the remote machine. If the deserialized function were sent instead, a "vanilla" Python installation would be able to execute the function.
Run a simple workflow where the remote machine does not have Covalent installed. It will not succeed.
The function execution should happen regardless of whether Covalent is installed on the remote machine.
Early in the `execute` method, the function should be deserialized before sending to the remote machine:
function = function.get_deserialized()
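To illustrate the end state (a sketch with stand-in names; in practice the payload would be produced with cloudpickle so arbitrary closures serialize by value): once the function is deserialized client-side, the remote entry point only needs a plain pickle load, not a Covalent import.

```python
import pickle

# Sketch: the remote side of the proposed scheme only needs the standard
# library (or cloudpickle) to load and call the function.
def remote_entrypoint(payload, args, kwargs):
    fn = pickle.loads(payload)  # no `import covalent` required here
    return fn(*args, **kwargs)
```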
Once the job is submitted to the SLURM queue, the job is marked as "running" and the runtime starts ticking in the UI. However, this does not reflect the actual state of the job, since it might still be queuing. It might be difficult to modify the "running" state in the UI since there's no "queued" state (that I'm aware of), but an alternative might be to keep it as "running" while holding the runtime at 0 s until it is detected that the job changed state on the cluster, at which point the timer can start.
No response
If the server connection is halted, in-progress workflows remain "running" indefinitely. A nice feature would be to add some sort of support for dropped connections or server restarts.
From Will:
stopping or restarting the server does drop connections and cause in-progress workflows to be dropped. they'll just appear as "running" forever. the recommended course as of the latest release would be to redispatch using previous results. if the connection drops, same deal. at least within the slurm/ssh plugins it may be easy to add reconnect logic with some retries.
Tagging @utf who had the question/suggestion in the first place.
You can redispatch if needed.
`asyncssh` seems to have trouble sending the command to create a directory on the compute node. I don't know exactly what's going on, but based on this article I'd conclude that some HPCs do not play well with a login shell due to a legacy `mesg n` command in /etc/profile.
import covalent as ct
import numpy as np

@ct.electron(executor='local')
def sum_(n):
    return np.sum(np.arange(n))

@ct.electron(executor='local')
def product_(n):
    return np.prod(np.arange(n)[1:])

def get_sum_product(n):
    return sum_(n) + product_(n)

if __name__ == '__main__':
    workflow = ct.lattice(get_sum_product, executor='slurm')
    dispatch_id = ct.dispatch(workflow)(10)
[2022-11-15 13:22:54,295] [ERROR] execution.py: Line 364 in _run_task: Exception occurred when running task 4: mesg: ttyname failed: Inappropriate ioctl for device
[2022-11-15 13:22:54,297] [ERROR] execution.py: Line 372 in _run_task: Run task exception
Traceback (most recent call last):
File "/home/sandervandenhaute/envs/covalent_env/pyenv/lib/python3.10/site-packages/covalent_dispatcher/_core/execution.py", line 345, in _run_task
output, stdout, stderr = await execute_callable()
File "/home/sandervandenhaute/envs/covalent_env/pyenv/lib/python3.10/site-packages/covalent/executor/base.py", line 572, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/home/sandervandenhaute/envs/covalent_env/pyenv/lib/python3.10/site-packages/covalent_slurm_plugin/slurm.py", line 399, in run
raise RuntimeError(client_err)
RuntimeError: mesg: ttyname failed: Inappropriate ioctl for device
Adding `request_pty='force'` to the `conn.run()` call seems to fix the issue, although the message is still displayed in the log. Replacing `mesg n` with `tty -s && mesg n`, as suggested elsewhere, is only possible with root access, which will not always be the case.
To reproduce this issue, run any tutorial with the `SlurmExecutor`. This issue started to appear after the `qa-fixes` PR was merged. It seems like a regression.
Once separation of workflow and electron statuses is done, the electron-level statuses need to be updated to accommodate executor-dependent statuses. In this case the following status definitions will be updated:
- `REGISTERING` - Connection established to the remote machine and files are being transferred; the Slurm script has also been submitted
- `PENDING_BACKEND` - Task is in the PENDING state in the Slurm database
- `STARTING` - Task is in the CONFIGURING state in the Slurm database
- `RUNNING` - Task is in the RUNNING state in the Slurm database
- `COMPLETING` - Task is in the COMPLETING state in the Slurm database; result files are being retrieved and temporary files are being deleted
The classes for these statuses will need to be created, similar to the status classes defined in `covalent/_shared_files/statuses.py`. Then, in order to save the status, one can do:
...
status_store = task_metadata["status_store"]
status_store.save(CustomStatus())
...
This will propagate the status update to the DB.
Acceptance Criteria:
When a `@ct.electron` changes the working directory, Covalent crashes because it can no longer find the results pkl file. See #94 for the somewhat cryptic error message.
import covalent as ct

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="covalent",
    options={
        "nodes": 1,
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "test",
        "time": "00:10:00",
    },
    remote_workdir="/pscratch/sd/r/rosen/test",
    create_unique_workdir=True,
    cleanup=False,
)

@ct.lattice(executor=executor)
@ct.electron
def workflow():
    import os
    os.chdir("../")
    return os.getcwd()

ct.dispatch(workflow)()
No crash.
Always, always, always have all files generated/parsed by Covalent be internally represented as absolute file paths, never relative file paths. You can basically copy what I've done in the `covalent-hpc-plugin` here and elsewhere in terms of file path handling.
For a number of reasons, some code you are running with srun
may exit with an exit code other than 0 (0 indicates "success", non-zero indicates certain categories of failure) although from the standpoint of the user, nothing actually went wrong, and usable output was produced. This can happen, for example, in quantum espresso (pw.x) if self-consistency isn't met. In some select situations, we may not care about this and wish to carry on anyways.
Presently, however, this non-zero exit code is caught by the slurm executor from STDERR and the calculation crashes.
I suggest we implement "allowed exit codes" other than just 0. This is intended for advanced users, and we should consider throwing an "are you sure you know what you're doing?" warning if this input variable is set, as it can lead to unexpected behaviour.
In bash, to ignore a certain exit code (say exit code 3), we can use:
bash -c '
exitcode=0
srun pw.x -npool %d -ndiag 1 -input PREFIX.pwi 2>stderr.log 1>PREFIX.pwo || exitcode=$?
if [[ $exitcode -eq 3 ]]; then
    >&2 cat stderr.log
    exitcode=0
fi
exit $exitcode'
where the above example is on Perlmutter running the GPU version of Quantum ESPRESSO (pw.x). Note that the `exitcode=$?` assignment must not happen in a subshell (parentheses), or the value is lost; the snippet also echoes stderr for the tolerated exit code before treating it as success.
We can implement something along these lines for srun options.
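On the executor side, the check could be as simple as the following sketch; `allowed_exit_codes` is a hypothetical parameter name, not an existing option:

```python
# Sketch of a hypothetical allowed_exit_codes option for the executor.
def check_exit_code(returncode, allowed_exit_codes=(0,)):
    """Raise unless the exit code is explicitly allowed (default: only 0)."""
    if returncode not in allowed_exit_codes:
        raise RuntimeError(f"Job failed with exit code {returncode}")
    return returncode
```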
One thing that isn't clear to me is whether or not this is a slurm executor addition or a core covalent addition.
You can set `use_srun=False` and manually define your `srun` behaviour as in the above script.
Related to #95, when one specifies the `remote_workdir` with a path like `~/my/dir`, an error is raised upon trying to parse the pkl file. The UI shows `[Errno 1] : PosixPath('~/test/ef5861af-7b92-4dbc-a52c-eb83bb137496/node_0/result-ef5861af-7b92-4dbc-a52c-eb83bb137496-0.pkl')`.
import covalent as ct

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="covalent",
    options={
        "nodes": 1,
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "test",
        "time": "00:10:00",
    },
    remote_workdir="~/test",
    create_unique_workdir=True,
    cleanup=False,
)

@ct.lattice(executor=executor)
@ct.electron
def workflow():
    return "<3"

ct.dispatch(workflow)()
Filepaths like `~/my/dir` should work.
Wrap the file path in `Path().expanduser().resolve()` client-side before the electron is launched. As a stretch goal, I would even suggest trying to support something like `$SCRATCH/my/dir` by wrapping the path call with `os.path.expandvars()` on the client side too.
You can basically copy what I have done in the covalent-hpc-plugin
in terms of file path handling, such as here.
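A sketch of the suggested client-side expansion (the helper name is mine; note that expanduser() and resolve() operate against the client's filesystem, which is what the issue proposes):

```python
import os
from pathlib import Path


def expand_workdir(raw: str) -> str:
    """Expand env vars (e.g. $SCRATCH) and ~ on the client before dispatch.

    os.path.expandvars handles $VAR references; expanduser handles ~;
    resolve makes the result absolute.
    """
    return str(Path(os.path.expandvars(raw)).expanduser().resolve())
```

A path like "~/test" then reaches the remote side fully expanded instead of being handed verbatim to PosixPath.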
I am on the develop branch here so that I could log in; the code in question should not be impacted by this branch. I tried submitting a SLURM job and got the following traceback.
Traceback (most recent call last):
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 452, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 474, in run
await self._poll_slurm(slurm_job_id, conn)
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 333, in _poll_slurm
raise RuntimeError("Job failed with status:\n", status)
RuntimeError: ('Job failed with status:\n', '')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_dispatcher/_core/runner.py", line 293, in _run_task
output, stdout, stderr, exception_raised = await executor._execute(
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 421, in _execute
return await self.execute(
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 459, in execute
await self.teardown(task_metadata=task_metadata)
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 505, in teardown
remote_func_filename=self._remote_func_filename,
AttributeError: 'SlurmExecutor' object has no attribute '_remote_func_filename'
My guess (?) is that self._remote_func_filename is not defined since the RuntimeError was raised.
import covalent as ct
import time
executor = ct.executor.SlurmExecutor(<redacted>)
@ct.lattice
@ct.electron(executor=executor)
def add(val1,val2):
time.sleep(10000) # make sure the walltime is less than this
return val1+val2
dispatch_id = ct.dispatch(add)(1,2)
result = ct.get_result(dispatch_id,wait=True)
print(result)
The covalent task should abort gracefully.
I think this error happens anytime the job dies unexpectedly (e.g. hits the walltime or otherwise). It doesn't seem to "terminate gracefully."
It seems that adding the parsable: "" option fixes the lack of a returned status, but otherwise the same issue arises.
When running a slurm electron within a base (Dask) sublattice and dispatching the sublattice within a base (Dask) lattice, the dispatch will run on the remote cluster, finish the job, then fail when retrieving the job. The traceback reported in the GUI is:
Traceback (most recent call last):
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent_dispatcher/_core/runner.py", line 251, in _run_task
output, stdout, stderr, status = await executor._execute(
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 628, in _execute
return await self.execute(
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 657, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 695, in run
result, stdout, stderr, exception = await self._query_result(
File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 577, in _query_result
async with aiofiles.open(stderr_file, "r") as f:
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/aiofiles/base.py", line 78, in __aenter__
self._obj = await self._coro
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/aiofiles/threadpool/__init__.py", line 80, in _open
f = yield from loop.run_in_executor(executor, cb)
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log'
I am using the sshproxy extra req and have prepared my covalent config file as suggested in the root README.md.
Here's a simple workflow to reproduce the above:
import covalent as ct
import numpy as np
executor = ct.executor.SlurmExecutor(
remote_workdir="<wdir>",
options={
"qos": "regular",
"t": "00:05:00",
"nodes": 1,
"C": "gpu",
"A": "<acc code>",
"J": "bug_test",
"ntasks-per-node": 4,
"gpus-per-task": 1,
"gpu-bind": "map_gpu:0,1,2,3"
},
prerun_commands=[
"export COVALENT_CONFIG_DIR="<somewhere in scratch>",
"export COVALENT_CACHE_DIR="<somewhere in scratch>",
"export SLURM_CPU_BIND=\"cores\"",
"export OMP_PROC_BIND=spread",
"export OMP_PLACES=threads",
"export OMP_NUM_THREADS=1",
],
username="<username>",
ssh_key_file="<key>",
cert_file="<cert>",
address="perlmutter-p1.nersc.gov",
conda_env="<conda env>",
use_srun=False
)
@ct.electron
def get_rand_sum_length(lo, hi):
np.random.seed(1984)
return np.random.randint(lo, hi)
# Slurm electron
@ct.electron(executor=executor)
def get_rand_num_slurm(lo, hi):
np.random.seed(1984)
return np.random.randint(lo, hi)
@ct.electron
@ct.lattice
def add_n_random_nums(n, lo, hi):
np.random.seed(1984)
sum = 0
for i in range(n):
sum += get_rand_num_slurm(lo, hi)
return sum
@ct.lattice
def random_num_workflow(lo, hi):
n = get_rand_sum_length(lo, hi)
sum = add_n_random_nums(n, lo, hi) # sublattice
return sum
id = ct.dispatch(random_num_workflow)(1, 3)
ct_result = ct.get_result(dispatch_id=id, wait=True)
sum = ct_result.result
print(sum)
The code should run to completion, throwing no error in the GUI, and print an integer.
It seems to me that the interaction between the Dask and Slurm executors is not quite right. Either way, the file Covalent is looking for exists in the remote directory at <wdir>/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log but does not exist in the local directory /Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log. Indeed, in /Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/, the stdout files are contained within the /node/ subdirs.
Currently we cannot always test whether the Slurm executor is working as expected functionally, due to its need for an actual Slurm-managed cluster. We can fix that by using a Docker-based alternative such as https://hub.docker.com/r/turuncu/slurm and running it as part of the functional test suite. This will allow us to do more robust testing, and we'll have a reliable way to reproduce any issues.
Calling a Slurm sublattice within a base (Dask) lattice fails with the traceback:
Traceback (most recent call last):
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent_dispatcher/_core/runner.py", line 251, in _run_task
output, stdout, stderr, status = await executor._execute(
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 628, in _execute
return await self.execute(
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 657, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 623, in run
conn = await self._client_connect()
File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 213, in _client_connect
raise ValueError("username is a required parameter in the Slurm plugin.")
ValueError: username is a required parameter in the Slurm plugin.
This happens despite the username being defined in the supplied ct.executor.SlurmExecutor object. The code below (very similar to #69) reproduces the issue:
import covalent as ct
import numpy as np
executor = ct.executor.SlurmExecutor(
remote_workdir="<wdir>",
options={
"qos": "regular",
"t": "00:05:00",
"nodes": 1,
"C": "gpu",
"A": "<acc code>",
"J": "bug_test",
"ntasks-per-node": 4,
"gpus-per-task": 1,
"gpu-bind": "map_gpu:0,1,2,3"
},
prerun_commands=[
"export COVALENT_CONFIG_DIR="<somewhere in scratch>",
"export COVALENT_CACHE_DIR="<somewhere in scratch>",
"export SLURM_CPU_BIND=\"cores\"",
"export OMP_PROC_BIND=spread",
"export OMP_PLACES=threads",
"export OMP_NUM_THREADS=1",
],
username="<username>",
ssh_key_file="<key>",
cert_file="<cert>",
address="perlmutter-p1.nersc.gov",
conda_env="<conda env>",
use_srun=False
)
@ct.electron
def get_rand_sum_length(lo, hi):
np.random.seed(1984)
return np.random.randint(lo, hi)
@ct.electron
def get_rand_num_slurm(lo, hi):
np.random.seed(1984)
return np.random.randint(lo, hi)
# Slurm sublattice
@ct.electron
@ct.lattice(executor=executor, workflow_executor=executor)
def add_n_random_nums(n, lo, hi):
np.random.seed(1984)
sum = 0
for i in range(n):
sum += get_rand_num_slurm(lo, hi)
return sum
@ct.lattice
def random_num_workflow(lo, hi):
n = get_rand_sum_length(lo, hi)
sum = add_n_random_nums(n, lo, hi) # sublattice
return sum
id = ct.dispatch(random_num_workflow)(1, 3)
ct_result = ct.get_result(dispatch_id=id, wait=True)
sum = ct_result.result
print(sum)
Should run to completion with no errors in the GUI and print an integer.
My gut says that Covalent isn't reading the username attribute from ct.executor.SlurmExecutor and is instead looking at my covalent.conf, where I have not defined username.
Currently, client_keys in asyncssh.connect() is set to client_keys=[self.ssh_key_file], and self.ssh_key_file is coded such that it can only be a single path, as shown below.
covalent-slurm-plugin/covalent_slurm_plugin/slurm.py
Lines 129 to 135 in f48c03d
However, some supercomputers require a list[tuple(str, str)] to be passed that has the SSH key and a certificate file, respectively. For an example, here is the necessary approach for asyncssh.connect to a U.S. DOE cluster at NERSC. Two filepaths in a tuple are needed, but it's not possible to set this right now. There is also no way in the current code to set client_keys=None explicitly, which disables key authentication, but that's less important since I imagine most people are (or should be) using SSH keys.
In general, I think some additional flexibility in what can be passed to asyncssh.connect() would help ensure that the plugin is useful for a wide range of HPC machines. As a side note, I don't think it's possible for _client_connect() to ever actually return False for ssh_success based on the code below, although that's not a major concern.
covalent-slurm-plugin/covalent_slurm_plugin/slurm.py
Lines 116 to 143 in f48c03d
There is no alternative. See PR in #47.
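A sketch of what more flexible key handling could look like. The helper name (build_client_keys) is mine, but asyncssh's client_keys kwarg does accept (key, cert) tuples, which covers the NERSC case described above:

```python
def build_client_keys(ssh_key_file=None, cert_file=None):
    """Build a value suitable for asyncssh.connect()'s client_keys kwarg.

    - key only          -> [key]
    - key + certificate -> [(key, cert)]  (e.g. NERSC sshproxy certs)
    - neither           -> None, letting asyncssh fall back to its defaults
    """
    if ssh_key_file is None:
        return None
    if cert_file is not None:
        return [(ssh_key_file, cert_file)]
    return [ssh_key_file]
```

The result would then be passed straight through, e.g. asyncssh.connect(address, username=username, client_keys=build_client_keys(key, cert)).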
I think it would be nice to provide support for just username and password credentials without needing an SSH key. I'd personally never use it, but I'm thinking that it might help for folks who are less programming-inclined and may not be familiar with SSH keys. The downside is that their username and password are then stored as plaintext in the local config, so this shouldn't be done until a system is in place to better handle credential storage.
No response
How do I set the port for a remote address such as login.cluster.org? I have attempted to set the address to login.cluster.org:65001 and login.cluster.org 65001, but neither worked.
The Slurm executor plugin needs unit tests and a functional test.
Notice that the repo doesn't have any tests.
There should be a suite of unit tests and a functional test for the executor. See the tests in the custom executor template for guidance (https://github.com/AgnostiqHQ/covalent-executor-template)
No response
Currently, at least based on the README, it is only possible to set the SLURM directives (i.e. executors.slurm.options
) and not any lines after the directives. However, it is often necessary for the user to include various lines, such as the loading of certain modules and setting of certain environment variables. If this is possible to do right now, it's not clear based on the README. If it's not possible to do, it would be a worthwhile addition (perhaps with a new kwarg).
No response
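For what it's worth, other examples in this thread pass a prerun_commands kwarg to the executor. Assuming that option is available in your version (the README may simply not document it), the requested "lines after the directives" might be expressed as follows; all placeholder values here are illustrative:

```python
import covalent as ct

# Assumption: prerun_commands is supported and its lines are emitted into
# the batch script after the #SBATCH directives, before the task runs.
executor = ct.executor.SlurmExecutor(
    username="<username>",
    address="<cluster address>",
    ssh_key_file="<key>",
    options={"nodes": 1, "time": "00:10:00"},
    prerun_commands=[
        "module load mymodule/1.2.3",
        "export OMP_NUM_THREADS=1",
    ],
)
```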
Just a couple of small edits to the README required.
The command pip install covalent-slurm-plugin[sshproxy] doesn't work in zsh (because of the square brackets). I would recommend changing this to pip install "covalent-slurm-plugin[sshproxy]" so that it works in both bash and zsh.
The instructions indicate that the command-line oathtool can be used to verify secrets with the command oathtool <secret>, but NERSC uses TOTP and alphanumeric strings. The correct syntax is oathtool --totp --base32 <secret>. These options aren't needed if using the Python wrapper, but we are talking about the command line here.
We should have an example in the README or elsewhere which works out-of-the-box for Perlmutter (Cori will be decommissioned soon). The existing example will not "just work": we need to specify an account to charge and paths to keys/certs. For example, the executor below will work with conda and sshproxy:
executor = ct.executor.SlurmExecutor(
remote_workdir="/global/homes/j/jsbaker/covalent_test", # scratch may be better practice
options={
"qos": "regular",
"t": "00:10:00",
"nodes": 1,
"C": "gpu",
"A": "<billing number>",
"J": "ExampleCovalent",
"ntasks-per-node": 4,
"gpus-per-task": 1,
"gpu-bind": "map_gpu:0,1,2,3"
},
prerun_commands=[
"module load espresso/7.0-libxc-5.2.2-gpu",
],
username="jsbaker",
ssh_key_file="<nersc.pub path>",
cert_file="<nersc.pub-cert.pub path>",
    address="perlmutter-p1.nersc.gov",
    conda_env="example_covalent",
)
Opinions @wjcunningham7 ?
Installed with conda
Recently (two days before the time of writing), asyncssh updated to v2.15.0. Something in this update seems to have broken the Covalent SLURM plugin. In particular, attempts at submitting jobs error out at around line 524 of slurm.py, right after using scp to copy the pickle files over to the remote server. Pickle files can be found on the remote server, but no other files after this point manage to be copied, nor are any SLURM jobs started. The errors are rather cryptic and seem to change, from "SSH connection closed" to NoneType errors from a failed asyncssh conn object. The error stack trace confirms the location of the error and that the source is the asyncssh library. After setting the log level to debug in Covalent's config and checking error.log for the failed SLURM executor node, this error trace appears:
Traceback (most recent call last):
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent_dispatcher/_core/runner.py", line 182, in _run_task
output, stdout, stderr, status = await executor._execute(
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent/executor/base.py", line 695, in _execute
return await self.execute(
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent/executor/base.py", line 724, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent_slurm_plugin/slurm.py", line 592, in run
remote_paths = await self._copy_files(
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent_slurm_plugin/slurm.py", line 537, in _copy_files
await asyncssh.scp(temp_g.name, (conn, remote_py_script_filename))
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/scp.py", line 1041, in scp
reader, writer = await _start_remote(
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/scp.py", line 190, in _start_remote
writer, reader, _ = await conn.open_session(command, encoding=None)
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 4198, in open_session
chan, session = await self.create_session(
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 4173, in create_session
session = await chan.create(session_factory, command, subsystem,
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 1207, in create
result = await self._make_request(b'exec', String(command))
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 740, in _make_request
return await waiter
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 1329, in data_received
while self._inpbuf and self._recv_handler():
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 1594, in _recv_packet
processed = handler.process_packet(pkttype, seq, packet)
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/packet.py", line 237, in process_packet
self._packet_handlers[pkttype](self, pkttype, pktid, packet)
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 656, in _process_request
self._service_next_request()
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 416, in _service_next_request
result = cast(Optional[bool], handler(packet))
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 1246, in _process_exit_status_request
self._session.exit_status_received(status)
AttributeError: 'NoneType' object has no attribute 'exit_status_received'
For a temporary fix: Revert to asyncssh v. 2.14.0 (restarting the covalent server and such, as needed)
For a more permanent fix: some updates are needed in the plugin's code to be compatible with the latest version of asyncssh.
Job should run correctly. Instead, it will error out with an SSH connection closed or mentions of "NoneType has no attribute 'exit_status_received'"
main
Passing a ct.executor.SlurmExecutor definition into a @ct.lattice decorator's executor kwarg does not work properly. I am getting the dreaded "username is a required parameter in the Slurm plugin" error message even though the username is clearly shown in the UI. Unlike in #70, this is not a sublattice but rather a very simple workflow. It's clear that the covalent.conf file is being read whenever the executor is passed to the Lattice object. Passing it to the Electron object works as expected.
Let's take the example from the docs:
executor = ct.executor.SlurmExecutor(
username="myname",
address="test",
ssh_key_file="/path/to/my/file",
remote_workdir="/scratch/user/experiment1",
options={
"qos": "regular",
"time": "01:30:00",
"nodes": 1,
"constraint": "gpu",
},
prerun_commands=[
"module load package/1.2.3",
"srun --ntasks-per-node 1 dcgmi profile --pause",
],
srun_options={"n": 4, "c": 8, "cpu-bind": "cores", "G": 4, "gpu-bind": "single:1"},
srun_append="nsys profile --stats=true -t cuda --gpu-metrics-device=all",
postrun_commands=[
"srun --ntasks-per-node 1 dcgmi profile --resume",
],
)
@ct.electron # (executor=executor) works fine here!
def my_custom_task(x, y):
return x + y
@ct.lattice(executor=executor)
def workflow(x, y):
return my_custom_task(x, y)
dispatch_id = ct.dispatch(workflow)(1, 2)
The UI will show "username is a required parameter in the Slurm plugin" even though the username is provided. In this case, I am starting from a default Covalent configuration file, which has no username by default.
Note: using @ct.electron(executor=executor) instead of @ct.lattice(executor=executor) works fine.
The username should be detected.
The SLURM executor relies on SSHing into the machine and submitting the SLURM jobs. This is excellent, but it would be ideal if the user also had the option to use the SLURM executor locally if they are dispatching from the HPC machine itself. One could imagine this might be particularly relevant, for instance, if there are intense security measures that restrict SSH access in a way that the plugin can't work with (e.g. requiring a physical Yubikey).
To be clear, this would be a large feature addition, so I am not suggesting the maintainers or myself work on it ASAP, but I wanted to log this prior discussion here anyway in case someone feels ambitious. In the meantime, it should be possible to use the Dask executor to write SLURM jobs locally (perhaps worth a tutorial if it's ultimately too ambitious to include here).
No response
covalent is a dependency when the wrapper_fn function is unpickled and executed. However, when covalent is initialized for the first time, it will try to create a new config file, which means acquiring a filelock inside ConfigManager.update_config(). Trying to acquire the filelock leads to the following error:
Traceback (most recent call last):
File "/global/homes/a/ara/slurm-tests/script-310be8a1-383d-4586-9dc1-821c8120e93f-0.py", line 5, in <module>
function, args, kwargs = pickle.load(f)
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/__init__.py", line 22, in <module>
from . import executor, leptons # nopycln: import
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/executor/__init__.py", line 32, in <module>
from .._shared_files import logger
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/_shared_files/logger.py", line 24, in <module>
from .config import get_config
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/_shared_files/config.py", line 199, in <module>
_config_manager = ConfigManager()
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/_shared_files/config.py", line 52, in __init__
self.update_config()
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/_shared_files/config.py", line 109, in update_config
with filelock.FileLock(f"{self.config_file}.lock", timeout=1):
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/filelock/_api.py", line 297, in __enter__
self.acquire()
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/filelock/_api.py", line 264, in acquire
raise Timeout(lock_filename) # noqa: TRY301
filelock._error.Timeout: The file lock '/global/homes/a/ara/.config/covalent/covalent.conf.lock' could not be acquired.
asyncssh.connect() takes many optional kwargs, most of which are not accessible to the user of covalent-slurm-plugin. While this is fine for most use cases, one can easily imagine some peculiar setups where additional kwargs are needed to establish the connection. In fact, I had such a scenario, which was reported in ronf/asyncssh#582. It would probably be worthwhile to make this accessible to the user, but I'm also wary of adding yet another parameter, since at some point we don't want to overload the user with options.
If a connection needs additional parameters for authorization to be established in asyncssh.connect(), the only option is to make a custom fork of the plugin. This isn't necessarily as terrible an idea as it sounds, seeing as the plugin is so lightweight and is meant to be customizable.
Error traceback when attempting to submit an electron via the Slurm executor plugin:
File "/home/vbala/work/covalent/covalent_dispatcher/_core/execution.py", line 237, in _run_task
output, stdout, stderr = executor.execute(
File "/home/vbala/.local/share/virtualenvs/covalent-AV-F4ayX/lib/python3.8/site-packages/covalent_slurm_plugin/slurm.py", line 204, in execute
ValueError: invalid literal for int() with base 10: 'Submitted batch job 67110791'
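The failing line evidently called int() on the whole sbatch banner. A defensive sketch (the helper name is mine) that handles both the human-readable banner and the bare id printed by sbatch --parsable:

```python
import re


def parse_slurm_job_id(sbatch_output: str) -> int:
    """Extract the job id from sbatch output.

    Handles both "Submitted batch job 67110791" and the bare id that
    sbatch --parsable prints.
    """
    match = re.search(r"\d+", sbatch_output)
    if match is None:
        raise ValueError(f"No job id found in sbatch output: {sbatch_output!r}")
    return int(match.group())
```

Passing --parsable to sbatch (the parsable: "" option mentioned above) sidesteps the banner entirely, but parsing defensively guards against either form.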