agnostiqhq / covalent-slurm-plugin
Executor plugin interfacing Covalent with Slurm
Home Page: https://covalent.xyz
License: Apache License 2.0
Note: this one is motivated by some "getting started" snags by a colleague of mine, @rwexler.
covalent-slurm-plugin/covalent_slurm_plugin/slurm.py
Lines 337 to 346 in bcf2049
Because of this line, there is currently no way to use the plugin if you aren't using conda. There are many situations in which the user may not want to use an Anaconda environment. For instance, maybe the user is loading an HPC-hosted Python module into which they've installed packages via pip. They would have to load the module in their `prerun_commands`, but otherwise shouldn't be required to use conda.
Submit a job without conda in the PATH.
Covalent should support running with Python in the PATH without necessarily relying on conda.
Only add the conda-related lines if `conda_env` is not False.
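A minimal sketch of the suggested fix, assuming a hypothetical `build_preamble` helper (not the plugin's actual code): the conda activation lines are emitted only when `conda_env` is truthy, so module- or pip-based Python setups work unchanged.

```python
# Sketch (illustrative, not the plugin's actual code): build the job-script
# preamble, emitting the conda lines only when conda_env is truthy.
def build_preamble(conda_env):
    """Return the shell lines that set up the environment, if any."""
    lines = ["source $HOME/.bashrc"]
    if conda_env:  # skip conda entirely when conda_env is False/None/""
        lines += [
            f"conda activate {conda_env}",
            "retval=$?",
            "if [ $retval -ne 0 ] ; then",
            f'  >&2 echo "Conda environment {conda_env} is not present on the compute node."',
            "  exit 99",
            "fi",
        ]
    return "\n".join(lines)
```

With `conda_env=False`, the preamble contains only the `source $HOME/.bashrc` line, and any environment setup is left to `prerun_commands`.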
Need to apply the following patch in the next merge:
diff --git a/.github/workflows/pypi.yml b/.github/workflows/pypi.yml
index de94eb5..676b614 100644
--- a/.github/workflows/pypi.yml
+++ b/.github/workflows/pypi.yml
@@ -46,8 +46,8 @@ jobs:
VERSION="$(cat ./VERSION)"
cd dist
tar xzf covalent-slurm-plugin-${VERSION}.tar.gz
- test -e covalent-slurm-plugin-${VERSION}/slurm.py
- rm -rf cova-${VERSION}/
+ test -e covalent-slurm-plugin-${VERSION}/covalent_slurm_plugin/slurm.py
+ rm -rf covalent-slurm-plugin-${VERSION}/
- name: Upload Distribution
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
N/A
PyPI upload should work properly
N/A
When an exception is raised by the Slurm plugin signaling a task failure, that exception is not propagated to the dispatcher's `_run_task`. Thus Covalent always marks the electron as completed even if no results are returned.
The following workflow
@ct.electron
def task():
    assert False
    return 1

@ct.lattice
def workflow():
    return task()
should fail, but instead "succeeds".
The electron should fail. Instead it "succeeds" because we weren't propagating exceptions raised within `executor.execute()` to the dispatcher.
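A minimal sketch of the desired behavior, using stand-in names rather than Covalent's actual internals: the execute step re-raises task exceptions so the dispatch step can mark the electron FAILED instead of silently COMPLETED.

```python
# Sketch (stand-in names, not Covalent's internals): execute() re-raises
# task exceptions so run_task() can record a FAILED status.
def execute(fn):
    try:
        return fn()
    except Exception as err:
        # Propagate instead of swallowing the exception.
        raise RuntimeError(f"Task raised {type(err).__name__}") from err

def run_task(fn):
    try:
        return execute(fn), "COMPLETED"
    except RuntimeError:
        return None, "FAILED"
```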
No response
Specifying `prerun_commands` in the `SlurmExecutor` does not result in the commands appearing in the Slurm job script.
Run a simple toy example with the `prerun_commands` keyword argument.
I used the example below:
import covalent as ct

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="covalent",
    options={
        "nodes": f"{n_nodes}",
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "quacc",
        "time": "00:30:00",
    },
    remote_workdir="/pscratch/sd/r/rosen/quacc/",
    create_unique_workdir=True,
    use_srun=False,
    prerun_commands=[
        "module load vasp/6.4.1-cpu",
        f"export QUACC_VASP_PARALLEL_CMD='{vasp_parallel_cmd}'",
    ],
)
The `prerun_commands` should appear at the bottom of the job script, but they do not. The following was present for me:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --qos=debug
#SBATCH --constraint=cpu
#SBATCH --account=matgen
#SBATCH --job-name=quacc
#SBATCH --time=00:30:00
#SBATCH --parsable
#SBATCH --output=/pscratch/sd/r/rosen/quacc/94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0/node_0/stdout-94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0-0.log
#SBATCH --error=/pscratch/sd/r/rosen/quacc/94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0/node_0/stderr-94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0-0.log
#SBATCH --chdir=/pscratch/sd/r/rosen/quacc/94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0/node_0
source $HOME/.bashrc
conda activate covalent
retval=$?
if [ $retval -ne 0 ] ; then
>&2 echo "Conda environment covalent is not present on the compute node. " "Please create the environment and try again."
exit 99
fi
remote_py_version=$(python -c "print('.'.join(map(str, __import__('sys').version_info[:2])))")
if [[ "3.10" != $remote_py_version ]] ; then
>&2 echo "Python version mismatch. Please install Python 3.10 in the compute environment."
exit 199
fi
python /pscratch/sd/r/rosen/quacc/script-94d1d3a5-8c42-4af3-b6f1-b6ee8126bea0-0.py
wait
Note how there are no prerun commands here.
I have not yet tried the `postrun_commands`.
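For reference, a hypothetical helper (not the plugin's actual code) showing where `prerun_commands` and `postrun_commands` should land relative to the `python` invocation in the generated script body:

```python
# Hypothetical helper: expected ordering of the generated script body.
def assemble_script_body(python_cmd, prerun_commands=(), postrun_commands=()):
    lines = list(prerun_commands)     # user setup, e.g. "module load ..."
    lines.append(python_cmd)          # the actual task invocation
    lines.extend(postrun_commands)    # user teardown
    lines.append("wait")
    return "\n".join(lines)
```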
No response
When dispatching with the Slurm executor, one of the SSH commands seems to hang, and after 10 seconds of no response from the server, a timeout occurs.
from covalent.executor import SlurmExecutor
import covalent as ct

executor = SlurmExecutor(
    ssh_key_file="~/.ssh/<key_name>",
    username="<user_name>",
    poll_freq="3",
    cache_dir="<covalent_local_cache_dir>",
    address="<address>",
    remote_workdir="<any_workdir_on_remote>",
    options={"partition": "debug", "cpus-per-task": 2, "nodelist": ["<hostname>"]},
)

@ct.electron(executor=executor)
def join_words(a, b):
    return ", ".join([a, b])

@ct.electron(executor=executor)
def excitement(a):
    return f"{a}!"

@ct.lattice
def simple_workflow(a, b):
    phrase = join_words(a, b)
    return excitement(phrase)

dispatch_id = ct.dispatch(simple_workflow)("Hello", "World")
A dispatch should be sent and run successfully. The output of the above should be `Hello, World!`.
No response
From @arosen93 :
There needs to be a way in which files can be written out to the current working directory without them potentially overwriting each other between calculations. The solution is to have unique subfolders for each Electron, as described below.
In many quantum chemistry codes, files are often written out to the filesystem at runtime. Typically, these files are hard-coded to be written to the current working directory and are both read from and written to throughout the course of the calculation. Therefore, if one launches a quantum chemistry code from within an Electron, these files will be overwriting each other if multiple calculations are going on simultaneously. It is also impossible to preserve the provenance of where these files originated from.
On a personal note, I would love to get my computational chemistry and materials science colleagues on board with Covalent, but this is currently a major dealbreaker in terms of adoption (hopefully not for long though!).
In the toy example below, the files are both written out to the same working directory, which will cause file loss and result in unexpected errors in more complex examples where the file I/O cannot be explicitly controlled by the user. For instance, with the SLURM plugin, everything is written out to the same `remote_workdir` folder (see #1619 for where this currently is for local executors; there is a bug).
import covalent as ct
import os

@ct.electron
def job(val1, val2):
    with open("job.txt", "w") as w:
        w.write(str(val1 + val2))
    return "Done!"

@ct.lattice
def workflow(val1, val2, val3, val4):
    job1 = job(val1, val2)
    job2 = job(val3, val4)
    return "Done!"

dispatch_id = ct.dispatch(workflow)(1, 2, 3, 4)
result = ct.get_result(dispatch_id)
print(result)
My recommendation is that there should be a keyword argument somewhere of the type `create_folders: bool`. If set to `True`, then wherever the default working directory is (which may depend on the executor), Covalent would automatically create subfolders of the form `dispatch_id/node_number` in the current working directory. Each calculation node would `cd` into its respective `dispatch_id/node_number` directory to avoid file-overwriting concerns from any external program writing to the current working directory during runtime.
Note that this is distinct from the `results_dir`, which is where the `.pkl` and `.log` files go and may not be on the same filesystem. For instance, in the SLURM executor, the current working directory is set by the `remote_workdir`. In the proposed feature addition, every new calculation would have `remote_workdir/dispatch_id/node_number` as its unique current working directory if `create_folders = True`. The same could be done for the local executors, except that it would be a local `workdir` as the base directory.
In addition to preventing file overwriting, this has the added benefit of ensuring that the files written out at runtime can be linked back to their corresponding Electrons for reproducibility purposes.
This issue should be addressed after #1619 is closed. This issue was originally discussed in #1592, which I split into two separate issues.
For full context, see [here](AgnostiqHQ/covalent#1592 (comment)). This route is not sufficient because there are many use cases where the user cannot control where the files are written out to at runtime. That is usually specified by the external code that is being run and is often the current working directory. Also, if Covalent ever adds a feature where the executor is dynamically selected, the user may not know in advance which filesystem the calculation will be run on.
For full context, see [here](AgnostiqHQ/covalent#1592 (comment)). Unfortunately, this route isn't sufficient for the problem. We want to make sure that if there are multiple ongoing calculations that write files out to the current working directory during runtime that they do not overwrite one another. Many computational chemistry codes are hard-coded to rely on being able to write and read input and output files in the current working directory throughout the calculation itself. Transferring files before or after the calculation finishes is a separate issue that should be left to the user.
For full context, see [here](AgnostiqHQ/covalent#1592 (comment)). The relevant code snippet from @santoshkumarradha is below:
import covalent as ct
from pathlib import Path
import os
from functools import wraps

def change_dir_and_execute(directory):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            current_dir = os.getcwd()
            try:
                os.chdir(directory)
                result = func(*args, **kwargs)
            finally:
                os.chdir(current_dir)
            return result
        return wrapper
    return decorator

path = Path(".").absolute()

@ct.electron
@change_dir_and_execute(path)
def job(val1, val2, file):
    with open(file, "w") as w:
        w.write(str(val1 + val2))
    return Path(".").absolute()

@ct.lattice
def workflow(file, val1, val2, val3, val4):
    job1 = job(val1, val2, file)
    return job1

file = "example.txt"
dispatch_id = ct.dispatch(workflow)(file, 1, 2, 3, 4)
result = ct.get_result(dispatch_id, wait=True)
print(result)
While this may be possible, it is not concise or clean. One of the major benefits of Covalent as a whole is that there is minimal friction to go from writing a function to running a complex workflow. Given the many foreseeable use cases where significant I/O is written out to the current working directory at runtime (without this being something that can be changed), always needing this verbose approach is less than ideal.
Tracking AgnostiqHQ/covalent#1628
Currently, the Python file is always run with `srun` (with potentially some srun options). An example script might look like the following:
#!/bin/bash
#SBATCH --partition=debug
#SBATCH --account=matgen
#SBATCH --constraint=cpu
#SBATCH --job-name=covalent
#SBATCH --time=00:10:00
#SBATCH --parsable
#SBATCH --output=/global/cfs/cdirs/matgen/arosen/covalent10/stdout-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.log
#SBATCH --error=/global/cfs/cdirs/matgen/arosen/covalent10/stderr-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.log
source $HOME/.bashrc
conda activate covalent
retval=$?
if [ $retval -ne 0 ] ; then
>&2 echo "Conda environment covalent is not present on the compute node. " "Please create the environment and try again."
exit 99
fi
remote_py_version=$(python -c "print('.'.join(map(str, __import__('sys').version_info[:2])))")
if [[ "3.9" != $remote_py_version ]] ; then
>&2 echo "Python version mismatch. Please install Python 3.9 in the compute environment."
exit 199
fi
srun \
python /global/cfs/cdirs/matgen/arosen/covalent10/script-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.py
wait
However, as originally noted by my colleague @rwexler and confirmed by me, this can be problematic depending on the use case. For instance, in computational materials science, a common usage pattern is to use SLURM to submit a Python job. That Python job then does several things, including making calls to and running external codes. This is how the Atomic Simulation Environment (ASE) works, which is a tool highlighted in one of the tutorials. You can see an example of how that's done in practice here, if you're curious.
The problem with this approach as a general rule is that if the pickled Python function at some point launches an external, parallel code with `srun` or `mpirun`, you now have nested parallelism, and you can get very unintended behavior when running on multiple cores/nodes. We want to provide the user a way to have Covalent simply run `python <NameOfJob.py>` without `srun`, while still launching the workflow as a SLURM job so that the proper resources are reserved. I propose introducing a new kwarg named `use_srun: bool`, which has a default value of `True` but can be set to `False`.
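The proposed kwarg amounts to a one-line branch when the job script is assembled; this is an illustrative sketch, not the plugin's actual code:

```python
# Sketch of the proposed use_srun kwarg: when use_srun is False, the job
# script calls python directly, avoiding nested parallelism if the task
# itself launches srun/mpirun.
def launch_line(script_path, use_srun=True):
    prefix = "srun " if use_srun else ""
    return f"{prefix}python {script_path}"
```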
I understand that this usage pattern might be difficult to visualize, so I can definitely explain further. It's a subtle but incredibly important usage pattern though!
There are no alternatives other than cloning the plugin and modifying the line, or not using entire software ecosystems like ASE with Covalent at all.
@scottwn commented on Wed Nov 24 2021
In GitLab by @nolanagnostiq on Aug 9, 2021, 13:35
Create a SLURM executor that connects to a generic SLURM cluster (but specifically beehive.agnostiq.ai) to run a job
???
SLURM cluster spins up an instance or uses an existing compute resource for use on an electron.
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Aug 16, 2021, 22:19
added to epic &58
@scottwn commented on Wed Nov 24 2021
In GitLab by @nolanagnostiq on Aug 24, 2021, 10:15
assigned to @wjcunningham7
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Aug 31, 2021, 23:18
created merge request !26 to address this issue
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Aug 31, 2021, 23:18
mentioned in merge request !26
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Aug 31, 2021, 23:31
design notes:
- `SlurmExecutor` will inherit from `BaseExecutor` and expose a member function `execute`
- `srun`
- a `slurm_params` dict will contain parameters that will be passed through to Slurm

implementation notes:
- `slurmrestd` on beehive-hive0, then test the REST API with a dummy job submission

@scottwn commented on Wed Nov 24 2021
In GitLab by @nolanagnostiq on Sep 1, 2021, 12:38
Overall, this looks good. Couple of comments:
The `slurm_params` field and the cluster-side script will probably see reuse across other executors. We don't need to optimize prematurely, but another thing to think about while implementing.

@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Sep 1, 2021, 22:00
Example of how to submit a job to Slurm using sbatch over SSH:
ssh beehive 'bash -l -c "cat - | sbatch"' <<'EOF'
#!/bin/bash
#SBATCH -J testjob
#SBATCH -o /federation/testjob.log
#SBATCH -w beehive-debug-st-t2medium-1
#SBATCH --parsable
# This is the body of the script where executables are invoked
hostname
EOF
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Sep 1, 2021, 22:01
Anything in `slurm_params` should be translated into an `#SBATCH` statement.
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Sep 1, 2021, 23:20
@nolanagnostiq can you advise on how to access the schedule assigned to the TransportableObject? In a minimal example I add a decorator `@electron(num_cpu=2)` to one of my electrons, but I can't figure out how to retrieve this information in order to format a Slurm batch script.
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Sep 13, 2021, 07:35
as we discussed this issue will be pushed to backlog while we work on a local slurm executor, #72
@scottwn commented on Wed Nov 24 2021
In GitLab by @wjcunningham7 on Sep 23, 2021, 13:04
unassigned @wjcunningham7
When a Covalent-based error happens where the results pkl can't be found, the UI is very unclear about what the problem is. Ideally, I would like to see more details so I can know where to debug things.
import covalent as ct

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="covalent",
    options={
        "nodes": 1,
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "test",
        "time": "00:10:00",
    },
    remote_workdir="/pscratch/sd/r/rosen/test",
    create_unique_workdir=True,
    cleanup=False,
)

@ct.lattice(executor=executor)
@ct.electron
def workflow():
    import os
    os.chdir("../")
    return os.getcwd()

ct.dispatch(workflow)()
The UI should give me more information about the issue. The log was not very helpful either: covalent_ui.log.txt
Yes, two things should be done.
1. A `FileNotFoundError` was raised, but this is never shown in the UI. It just says "error". If I had known it was a `FileNotFoundError`, I would have identified the issue more quickly.
2. `cloudpickle` needs to be an installed package in the remote machine's Python environment, but this isn't specified anywhere in the docs.
Covalent is also required on the remote machine, which should be specified in the docs.
The SLURM script will not submit if the Python version on the remote machine does not match that used to submit the calculation to the server, as noted here. This should be mentioned in the installation/usage instructions.
The `"options": {"parsable": ""}` parameter is needed according to Will. This should be added to the docs. I think the SLURM directive is necessary for the job status to be properly logged.
These changes should be reflected in both the README and the main docs. Or maybe it makes sense to keep the README here minimal and link to the corresponding executor docs to avoid duplicating work? Up to you, obviously.
The PR template has "I have updated the documentation, VERSION, and CHANGELOG accordingly" as a checkbox. However, the VERSION does not need to be updated since it will be done so automatically.
It also says:
Both of these are not currently accurate since the user is not supposed to increment the version number manually or set the date.
The same should be done in the main covalent repo too I think...
I can get the `covalent-slurm-plugin` to work fine with 0.16.0 but not with 0.18.0. I imagine the refactoring effort introduced a bug. Running any minimal example with the `covalent-slurm-plugin` yields an error akin to
scp: /global/homes/r/rosen/quacc/7f078879-ee9e-45a4-960b-98839dfdb1b8/node_0/stdout-7f078879-ee9e-45a4-960b-98839dfdb1b8-0.log: No such file or directory
with the following log
Exception in ASGI application
Traceback (most recent call last):
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 404, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
await route.handle(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
await self.app(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
response = await func(request)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
return await future
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
result = context.run(func, *args)
File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/covalent_ui/api/v1/routes/end_points/electron_routes.py", line 216, in get_electron_file
response, python_object = handler.read_from_serialized(result["results_filename"])
TypeError: cannot unpack non-iterable NoneType object
Use `covalent-slurm-plugin==0.18.0` and submit any minimal example like the one in the README.
This was my setup for what it's worth:
n_nodes = 1
n_cores_per_node = 1

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="quacc",
    options={
        "nodes": f"{n_nodes}",
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "quacc",
        "time": "00:10:00",
    },
    remote_workdir="/pscratch/sd/r/rosen/quacc",
    create_unique_workdir=True,
    use_srun=False,
    cleanup=False,
)
The job should be submitted to the queue. In reality, it never is submitted.
No response
The function to be executed is sent to the remote machine as a Covalent `TransportableObject`, which needs to be deserialized on the remote machine in order to run; that deserialization requires a Covalent installation on the remote machine. If the deserialized function were sent instead, a "vanilla" Python installation would be able to execute the function.
Run a simple workflow where the remote machine does not have Covalent installed. It will not succeed.
The function execution should happen regardless of whether Covalent is installed on the remote machine.
Early in the `execute` method, the function should be deserialized before sending to the remote machine:
function = function.get_deserialized()
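To illustrate the end state (a sketch with stand-in names; in practice the payload would be produced with cloudpickle so arbitrary closures serialize by value): once the function is deserialized client-side, the remote entry point only needs a plain pickle load, not a Covalent import.

```python
import pickle

# Sketch: the remote side of the proposed scheme only needs the standard
# library (or cloudpickle) to load and call the function.
def remote_entrypoint(payload, args, kwargs):
    fn = pickle.loads(payload)  # no `import covalent` required here
    return fn(*args, **kwargs)
```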
Once the job is submitted to the SLURM queue, the job is marked as "running" and the runtime starts ticking in the UI. However, this does not reflect the actual state of the job, since it might still be queuing. It might be difficult to modify the "running" state in the UI since there's no "queued" state (that I'm aware of), but an alternative might be to keep it as "running" while holding the runtime at 0 s until it is detected that the job changed state on the cluster, at which point the timer can start.
No response
If the server connection is halted, in-progress workflows remain "running" indefinitely. A nice feature would be to add some sort of support for dropped connections or server restarts.
From Will:
stopping or restarting the server does drop connections and cause in-progress workflows to be dropped. they'll just appear as "running" forever. the recommended course as of the latest release would be to redispatch using previous results. if the connection drops, same deal. at least within the slurm/ssh plugins it may be easy to add reconnect logic with some retries.
Tagging @utf who had the question/suggestion in the first place.
You can redispatch if needed.
`asyncssh` seems to have trouble sending the command to create a directory on the compute node. I don't know exactly what's going on, but based on this article I'd conclude that some HPCs do not play well with a login shell due to a legacy `mesg n` command in /etc/profile.
import covalent as ct
import numpy as np

@ct.electron(executor='local')
def sum_(n):
    return np.sum(np.arange(n))

@ct.electron(executor='local')
def product_(n):
    return np.prod(np.arange(n)[1:])

def get_sum_product(n):
    return sum_(n) + product_(n)

if __name__ == '__main__':
    workflow = ct.lattice(get_sum_product, executor='slurm')
    dispatch_id = ct.dispatch(workflow)(10)
[2022-11-15 13:22:54,295] [ERROR] execution.py: Line 364 in _run_task: Exception occurred when running task 4: mesg: ttyname failed: Inappropriate ioctl for device
[2022-11-15 13:22:54,297] [ERROR] execution.py: Line 372 in _run_task: Run task exception
Traceback (most recent call last):
File "/home/sandervandenhaute/envs/covalent_env/pyenv/lib/python3.10/site-packages/covalent_dispatcher/_core/execution.py", line 345, in _run_task
output, stdout, stderr = await execute_callable()
File "/home/sandervandenhaute/envs/covalent_env/pyenv/lib/python3.10/site-packages/covalent/executor/base.py", line 572, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/home/sandervandenhaute/envs/covalent_env/pyenv/lib/python3.10/site-packages/covalent_slurm_plugin/slurm.py", line 399, in run
raise RuntimeError(client_err)
RuntimeError: mesg: ttyname failed: Inappropriate ioctl for device
Adding `request_pty='force'` to the `conn.run()` call seems to fix the issue, although the message is still displayed in the log. Replacing `mesg n` with `tty -s && mesg n`, as suggested elsewhere, is only possible with root access, which will not always be the case.
To reproduce this issue, run any tutorial with the `SlurmExecutor`. This issue started to appear after the `qa-fixes` PR was merged. It seems like a regression.
Once separation of workflow and electron statuses is done, the electron-level statuses need to be updated to accommodate executor-dependent statuses. In this case the following status definitions will be updated:
- `REGISTERING` - Connection established to the remote machine and files are being transferred; the Slurm script has also been submitted
- `PENDING_BACKEND` - Task is in the PENDING state in the Slurm database
- `STARTING` - Task is in the CONFIGURING state in the Slurm database
- `RUNNING` - Task is in the RUNNING state in the Slurm database
- `COMPLETING` - Task is in the COMPLETING state in the Slurm database; result files are being retrieved and temporary files are being deleted
The classes for these statuses will need to be created, similar to the status classes defined in `covalent/_shared_files/statuses.py`. Then, in order to save the status, one can do:
...
status_store = task_metadata["status_store"]
status_store.save(CustomStatus())
...
This will propagate the status update to the DB.
Acceptance Criteria:
When a `@ct.electron` changes the working directory, Covalent crashes because it can no longer find the results pkl file. See #94 for the somewhat cryptic error message.
import covalent as ct

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="covalent",
    options={
        "nodes": 1,
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "test",
        "time": "00:10:00",
    },
    remote_workdir="/pscratch/sd/r/rosen/test",
    create_unique_workdir=True,
    cleanup=False,
)

@ct.lattice(executor=executor)
@ct.electron
def workflow():
    import os
    os.chdir("../")
    return os.getcwd()

ct.dispatch(workflow)()
No crash.
Always, always, always have all files generated/parsed by Covalent be internally represented as absolute file paths, never relative file paths. You can basically copy what I've done in the `covalent-hpc-plugin` here and elsewhere in terms of file path handling.
For a number of reasons, some code you are running with srun
may exit with an exit code other than 0 (0 indicates "success", non-zero indicates certain categories of failure) although from the standpoint of the user, nothing actually went wrong, and usable output was produced. This can happen, for example, in quantum espresso (pw.x) if self-consistency isn't met. In some select situations, we may not care about this and wish to carry on anyways.
Presently, however, this non-zero exit code is caught by the slurm executor from STDERR and the calculation crashes.
I suggest we implement "allowed exit codes" other than just 0. This is intended for advanced users, and we should consider throwing an "are you sure you know what you're doing?" warning if this input variable is set, as it can lead to unexpected behaviour.
In bash, to ignore a certain exit code (say exit code 3), we can use:
bash -c '
exitcode=0
srun pw.x -npool %d -ndiag 1 -input PREFIX.pwi 2>stderr.log 1>PREFIX.pwo || exitcode=$?
if [[ $exitcode -eq 3 ]]; then
    >&2 cat stderr.log
    exitcode=0
fi
exit $exitcode'
where the above example is on Perlmutter running the GPU version of Quantum ESPRESSO (pw.x). Note that the `exitcode=$?` assignment must not happen in a subshell (parentheses), or the value is lost; the snippet also echoes stderr for the tolerated exit code before treating it as success.
We can implement something along these lines for srun options.
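On the executor side, the check could be as simple as the following sketch; `allowed_exit_codes` is a hypothetical parameter name, not an existing option:

```python
# Sketch of a hypothetical allowed_exit_codes option for the executor.
def check_exit_code(returncode, allowed_exit_codes=(0,)):
    """Raise unless the exit code is explicitly allowed (default: only 0)."""
    if returncode not in allowed_exit_codes:
        raise RuntimeError(f"Job failed with exit code {returncode}")
    return returncode
```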
One thing that isn't clear to me is whether or not this is a slurm executor addition or a core covalent addition.
You can set `use_srun=False` and manually define your `srun` behaviour as in the above script.
Related to #95, when one specifies the `remote_workdir` with a path like `~/my/dir`, an error is raised upon trying to parse the pkl file. The UI shows `[Errno 1] : PosixPath('~/test/ef5861af-7b92-4dbc-a52c-eb83bb137496/node_0/result-ef5861af-7b92-4dbc-a52c-eb83bb137496-0.pkl')`.
import covalent as ct

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="covalent",
    options={
        "nodes": 1,
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "test",
        "time": "00:10:00",
    },
    remote_workdir="~/test",
    create_unique_workdir=True,
    cleanup=False,
)

@ct.lattice(executor=executor)
@ct.electron
def workflow():
    return "<3"

ct.dispatch(workflow)()
Filepaths like `~/my/dir` should work.
Wrap the file path in `Path().expanduser().resolve()` client-side before the electron is launched. As a stretch goal, I would even suggest trying to support something like `$SCRATCH/my/dir` by wrapping the path call with `os.path.expandvars()` on the client side too.
You can basically copy what I have done in the covalent-hpc-plugin
in terms of file path handling, such as here.
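A sketch of the suggested client-side expansion (the helper name is mine; note that expanduser() and resolve() operate against the client's filesystem, which is what the issue proposes):

```python
import os
from pathlib import Path


def expand_workdir(raw: str) -> str:
    """Expand env vars (e.g. $SCRATCH) and ~ on the client before dispatch.

    os.path.expandvars handles $VAR references; expanduser handles ~;
    resolve makes the result absolute.
    """
    return str(Path(os.path.expandvars(raw)).expanduser().resolve())
```

A path like "~/test" then reaches the remote side fully expanded instead of being handed verbatim to PosixPath.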
I am on the develop branch here so that I could log in; the code in question should not be impacted by this branch. I tried submitting a SLURM job and got the following traceback.
Traceback (most recent call last):
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 452, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 474, in run
await self._poll_slurm(slurm_job_id, conn)
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 333, in _poll_slurm
raise RuntimeError("Job failed with status:\n", status)
RuntimeError: ('Job failed with status:\n', '')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_dispatcher/_core/runner.py", line 293, in _run_task
output, stdout, stderr, exception_raised = await executor._execute(
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 421, in _execute
return await self.execute(
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 459, in execute
await self.teardown(task_metadata=task_metadata)
File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 505, in teardown
remote_func_filename=self._remote_func_filename,
AttributeError: 'SlurmExecutor' object has no attribute '_remote_func_filename'
My guess (?) is that self._remote_func_filename is not defined since the RuntimeError was raised.
import covalent as ct
import time
executor = ct.executor.SlurmExecutor(<redacted>)
@ct.lattice
@ct.electron(executor=executor)
def add(val1,val2):
time.sleep(10000) # make sure the walltime is less than this
return val1+val2
dispatch_id = ct.dispatch(add)(1,2)
result = ct.get_result(dispatch_id,wait=True)
print(result)
The covalent task should abort gracefully.
I think this error happens anytime the job dies unexpectedly (e.g. hits the walltime or otherwise). It doesn't seem to "terminate gracefully."
It seems that adding the parsable: "" option fixes the lack of a returned status, but otherwise the same issue arises.
When running a slurm electron within a base (Dask) sublattice and dispatching the sublattice within a base (Dask) lattice, the dispatch will run on the remote cluster, finish the job, then fail when retrieving the job. The traceback reported in the GUI is:
Traceback (most recent call last):
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent_dispatcher/_core/runner.py", line 251, in _run_task
output, stdout, stderr, status = await executor._execute(
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 628, in _execute
return await self.execute(
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 657, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 695, in run
result, stdout, stderr, exception = await self._query_result(
File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 577, in _query_result
async with aiofiles.open(stderr_file, "r") as f:
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/aiofiles/base.py", line 78, in __aenter__
self._obj = await self._coro
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/aiofiles/threadpool/__init__.py", line 80, in _open
f = yield from loop.run_in_executor(executor, cb)
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log'
I am using the sshproxy extra req and have prepared my covalent config file as suggested in the root README.md.
Here's a simple workflow to reproduce the above:
import covalent as ct
import numpy as np
executor = ct.executor.SlurmExecutor(
remote_workdir="<wdir>",
options={
"qos": "regular",
"t": "00:05:00",
"nodes": 1,
"C": "gpu",
"A": "<acc code>",
"J": "bug_test",
"ntasks-per-node": 4,
"gpus-per-task": 1,
"gpu-bind": "map_gpu:0,1,2,3"
},
prerun_commands=[
"export COVALENT_CONFIG_DIR="<somewhere in scratch>",
"export COVALENT_CACHE_DIR="<somewhere in scratch>",
"export SLURM_CPU_BIND=\"cores\"",
"export OMP_PROC_BIND=spread",
"export OMP_PLACES=threads",
"export OMP_NUM_THREADS=1",
],
username="<username>",
ssh_key_file="<key>",
cert_file="<cert>",
address="perlmutter-p1.nersc.gov",
conda_env="<conda env>",
use_srun=False
)
@ct.electron
def get_rand_sum_length(lo, hi):
np.random.seed(1984)
return np.random.randint(lo, hi)
# Slurm electron
@ct.electron(executor=executor)
def get_rand_num_slurm(lo, hi):
np.random.seed(1984)
return np.random.randint(lo, hi)
@ct.electron
@ct.lattice
def add_n_random_nums(n, lo, hi):
np.random.seed(1984)
sum = 0
for i in range(n):
sum += get_rand_num_slurm(lo, hi)
return sum
@ct.lattice
def random_num_workflow(lo, hi):
n = get_rand_sum_length(lo, hi)
sum = add_n_random_nums(n, lo, hi) # sublattice
return sum
id = ct.dispatch(random_num_workflow)(1, 3)
ct_result = ct.get_result(dispatch_id=id, wait=True)
sum = ct_result.result
print(sum)
The code should run to completion, throwing no error in the GUI, and print an integer.
It seems to me that the interaction between the Dask and Slurm executors is not quite right. Either way, the file Covalent is looking for exists in the remote directory at <wdir>/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log but does not exist in the local directory /Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log. Indeed, in /Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/, the stdout files are contained within the /node/ subdirs.
Currently we cannot always test whether the Slurm executor is working as expected functionally, due to its need for an actual Slurm-managed cluster. We can fix that by using a Docker-based alternative such as https://hub.docker.com/r/turuncu/slurm and running it as part of the functional test suite. This will allow us to do more robust testing, and we'll have a reliable way to reproduce any issues.
Calling a Slurm sublattice within a base (Dask) lattice fails with the traceback:
Traceback (most recent call last):
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent_dispatcher/_core/runner.py", line 251, in _run_task
output, stdout, stderr, status = await executor._execute(
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 628, in _execute
return await self.execute(
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 657, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 623, in run
conn = await self._client_connect()
File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 213, in _client_connect
raise ValueError("username is a required parameter in the Slurm plugin.")
ValueError: username is a required parameter in the Slurm plugin.
This happens despite the username being defined in the supplied ct.executor.SlurmExecutor object. The code below (very similar to #69) reproduces the issue:
import covalent as ct
import numpy as np
executor = ct.executor.SlurmExecutor(
remote_workdir="<wdir>",
options={
"qos": "regular",
"t": "00:05:00",
"nodes": 1,
"C": "gpu",
"A": "<acc code>",
"J": "bug_test",
"ntasks-per-node": 4,
"gpus-per-task": 1,
"gpu-bind": "map_gpu:0,1,2,3"
},
prerun_commands=[
"export COVALENT_CONFIG_DIR="<somewhere in scratch>",
"export COVALENT_CACHE_DIR="<somewhere in scratch>",
"export SLURM_CPU_BIND=\"cores\"",
"export OMP_PROC_BIND=spread",
"export OMP_PLACES=threads",
"export OMP_NUM_THREADS=1",
],
username="<username>",
ssh_key_file="<key>",
cert_file="<cert>",
address="perlmutter-p1.nersc.gov",
conda_env="<conda env>",
use_srun=False
)
@ct.electron
def get_rand_sum_length(lo, hi):
np.random.seed(1984)
return np.random.randint(lo, hi)
@ct.electron
def get_rand_num_slurm(lo, hi):
np.random.seed(1984)
return np.random.randint(lo, hi)
# Slurm sublattice
@ct.electron
@ct.lattice(executor=executor, workflow_executor=executor)
def add_n_random_nums(n, lo, hi):
np.random.seed(1984)
sum = 0
for i in range(n):
sum += get_rand_num_slurm(lo, hi)
return sum
@ct.lattice
def random_num_workflow(lo, hi):
n = get_rand_sum_length(lo, hi)
sum = add_n_random_nums(n, lo, hi) # sublattice
return sum
id = ct.dispatch(random_num_workflow)(1, 3)
ct_result = ct.get_result(dispatch_id=id, wait=True)
sum = ct_result.result
print(sum)
Should run to completion with no errors in the GUI and print an integer.
My gut says that Covalent isn't reading the username attribute from ct.executor.SlurmExecutor and is instead looking at my covalent.conf, where I have not defined username.
Currently, client_keys in asyncssh.connect() is set to client_keys=[self.ssh_key_file], and self.ssh_key_file is coded such that it can only be a single path, as shown below.
covalent-slurm-plugin/covalent_slurm_plugin/slurm.py
Lines 129 to 135 in f48c03d
However, some supercomputers require a list[tuple(str, str)] to be passed that has the SSH key and a certificate file, respectively. For an example, here is the necessary approach for asyncssh.connect to a U.S. DOE cluster at NERSC. Two filepaths in a tuple are needed, but it's not possible to set this right now. There is also no way in the current code to set client_keys=None explicitly, which disables key authentication, but that's less important since I imagine most people are (or should be) using SSH keys.
In general, I think some additional flexibility in what can be passed to asyncssh.connect() would help ensure that the plugin is useful for a wide range of HPC machines. As a side note, I don't think it's possible for _client_connect() to ever actually return False for ssh_success based on the code below, although that's not a major concern.
covalent-slurm-plugin/covalent_slurm_plugin/slurm.py
Lines 116 to 143 in f48c03d
There is no alternative. See PR in #47.
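A sketch of what more flexible key handling could look like. The helper name (build_client_keys) is mine, but asyncssh's client_keys kwarg does accept (key, cert) tuples, which covers the NERSC case described above:

```python
def build_client_keys(ssh_key_file=None, cert_file=None):
    """Build a value suitable for asyncssh.connect()'s client_keys kwarg.

    - key only          -> [key]
    - key + certificate -> [(key, cert)]  (e.g. NERSC sshproxy certs)
    - neither           -> None, letting asyncssh fall back to its defaults
    """
    if ssh_key_file is None:
        return None
    if cert_file is not None:
        return [(ssh_key_file, cert_file)]
    return [ssh_key_file]
```

The result would then be passed straight through, e.g. asyncssh.connect(address, username=username, client_keys=build_client_keys(key, cert)).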
I think it would be nice to provide support for just username and password credentials without needing an SSH key. I'd personally never use it, but I'm thinking that it might help for folks who are less programming-inclined and may not be familiar with SSH keys. The downside is that their username and password are then stored as plaintext in the local config, so this shouldn't be done until a system is in place to better handle credential storage.
No response
How do I set the port for a remote address such as login.cluster.org? I have attempted to set the address to login.cluster.org:65001 and login.cluster.org 65001, but neither worked.
The Slurm executor plugin needs unit tests and a functional test.
Notice that the repo doesn't have any tests.
There should be a suite of unit tests and a functional test for the executor. See the tests in the custom executor template for guidance (https://github.com/AgnostiqHQ/covalent-executor-template)
No response
Currently, at least based on the README, it is only possible to set the SLURM directives (i.e. executors.slurm.options
) and not any lines after the directives. However, it is often necessary for the user to include various lines, such as the loading of certain modules and setting of certain environment variables. If this is possible to do right now, it's not clear based on the README. If it's not possible to do, it would be a worthwhile addition (perhaps with a new kwarg).
No response
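For what it's worth, other examples in this thread pass a prerun_commands kwarg to the executor. Assuming that option is available in your version (the README may simply not document it), the requested "lines after the directives" might be expressed as follows; all placeholder values here are illustrative:

```python
import covalent as ct

# Assumption: prerun_commands is supported and its lines are emitted into
# the batch script after the #SBATCH directives, before the task runs.
executor = ct.executor.SlurmExecutor(
    username="<username>",
    address="<cluster address>",
    ssh_key_file="<key>",
    options={"nodes": 1, "time": "00:10:00"},
    prerun_commands=[
        "module load mymodule/1.2.3",
        "export OMP_NUM_THREADS=1",
    ],
)
```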
Just a couple of small edits to the README required.
The command pip install covalent-slurm-plugin[sshproxy] doesn't work in zsh (because of the square brackets). I would recommend changing this to pip install "covalent-slurm-plugin[sshproxy]" so that it works in both bash and zsh.
The instructions indicate that the command-line oathtool can be used to verify secrets with the command oathtool <secret>, but NERSC uses TOTP and alphanumeric strings. The correct syntax is oathtool --totp --base32 <secret>. These options aren't needed if using the Python wrapper, but we are talking about the command line here.
We should have an example in the README or elsewhere which works out-of-the-box for Perlmutter (Cori will be decommissioned soon). The existing example will not "just work": we need to specify an account to charge and paths to keys/certs. For example, the executor below will work with conda and sshproxy:
executor = ct.executor.SlurmExecutor(
remote_workdir="/global/homes/j/jsbaker/covalent_test", # scratch may be better practice
options={
"qos": "regular",
"t": "00:10:00",
"nodes": 1,
"C": "gpu",
"A": "<billing number>",
"J": "ExampleCovalent",
"ntasks-per-node": 4,
"gpus-per-task": 1,
"gpu-bind": "map_gpu:0,1,2,3"
},
prerun_commands=[
"module load espresso/7.0-libxc-5.2.2-gpu",
],
username="jsbaker",
ssh_key_file="<nersc.pub path>",
cert_file="<nersc.pub-cert.pub path>",
    address="perlmutter-p1.nersc.gov",
    conda_env="example_covalent",
)
Opinions @wjcunningham7 ?
Installed with conda
Recently (two days before the time of writing), asyncssh updated to v2.15.0. Something in this update seems to have broken the Covalent SLURM plugin. In particular, attempts at submitting jobs error out at around line 524 of slurm.py, right after using scp to copy the pickle files over to the remote server. Pickle files can be found on the remote server, but no other files after this point manage to be copied, nor are any SLURM jobs started. The errors are rather cryptic and seem to change, from "SSH connection closed" to NoneType errors from a failed asyncssh conn object. The error stack trace confirms the location of the error and that the source is the asyncssh library. After setting the log level to debug in Covalent's config and checking error.log for the failed SLURM executor node, this error trace appears:
Traceback (most recent call last):
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent_dispatcher/_core/runner.py", line 182, in _run_task
output, stdout, stderr, status = await executor._execute(
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent/executor/base.py", line 695, in _execute
return await self.execute(
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent/executor/base.py", line 724, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent_slurm_plugin/slurm.py", line 592, in run
remote_paths = await self._copy_files(
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent_slurm_plugin/slurm.py", line 537, in _copy_files
await asyncssh.scp(temp_g.name, (conn, remote_py_script_filename))
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/scp.py", line 1041, in scp
reader, writer = await _start_remote(
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/scp.py", line 190, in _start_remote
writer, reader, _ = await conn.open_session(command, encoding=None)
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 4198, in open_session
chan, session = await self.create_session(
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 4173, in create_session
session = await chan.create(session_factory, command, subsystem,
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 1207, in create
result = await self._make_request(b'exec', String(command))
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 740, in _make_request
return await waiter
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 1329, in data_received
while self._inpbuf and self._recv_handler():
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 1594, in _recv_packet
processed = handler.process_packet(pkttype, seq, packet)
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/packet.py", line 237, in process_packet
self._packet_handlers[pkttype](self, pkttype, pktid, packet)
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 656, in _process_request
self._service_next_request()
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 416, in _service_next_request
result = cast(Optional[bool], handler(packet))
File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 1246, in _process_exit_status_request
self._session.exit_status_received(status)
AttributeError: 'NoneType' object has no attribute 'exit_status_received'
For a temporary fix: Revert to asyncssh v. 2.14.0 (restarting the covalent server and such, as needed)
For a more permanent fix: some updates are needed in the plugin's code to be compatible with the latest version of asyncssh.
Job should run correctly. Instead, it will error out with an SSH connection closed or mentions of "NoneType has no attribute 'exit_status_received'"
main
Passing a ct.executor.SlurmExecutor definition into a @ct.lattice decorator's executor kwarg does not work properly. I am getting the dreaded "username is a required parameter in the Slurm plugin" error message even though the username is clearly shown in the UI. Unlike in #70, this is not a sublattice but rather a very simple workflow. It's clear that the covalent.conf file is being read whenever the executor is passed to the Lattice object. Passing it to the Electron object works as expected.
Let's take the example from the docs:
executor = ct.executor.SlurmExecutor(
username="myname",
address="test",
ssh_key_file="/path/to/my/file",
remote_workdir="/scratch/user/experiment1",
options={
"qos": "regular",
"time": "01:30:00",
"nodes": 1,
"constraint": "gpu",
},
prerun_commands=[
"module load package/1.2.3",
"srun --ntasks-per-node 1 dcgmi profile --pause",
],
srun_options={"n": 4, "c": 8, "cpu-bind": "cores", "G": 4, "gpu-bind": "single:1"},
srun_append="nsys profile --stats=true -t cuda --gpu-metrics-device=all",
postrun_commands=[
"srun --ntasks-per-node 1 dcgmi profile --resume",
],
)
@ct.electron # (executor=executor) works fine here!
def my_custom_task(x, y):
return x + y
@ct.lattice(executor=executor)
def workflow(x, y):
return my_custom_task(x, y)
dispatch_id = ct.dispatch(workflow)(1, 2)
The UI will show "username is a required parameter in the Slurm plugin" even though the username is provided. In this case, I am starting from a default Covalent configuration file, which has no username by default.
Note: using @ct.electron(executor=executor) instead of @ct.lattice(executor=executor) works fine.
The username should be detected.
The SLURM executor relies on SSHing into the machine and submitting the SLURM jobs. This is excellent, but it would be ideal if the user also had the option to use the SLURM executor locally if they are dispatching from the HPC machine itself. One could imagine this might be particularly relevant, for instance, if there are intense security measures that restrict SSH access in a way that the plugin can't work with (e.g. requiring a physical Yubikey).
To be clear, this would be a large feature addition, so I am not suggesting the maintainers or myself work on it ASAP, but I wanted to log this prior discussion here anyway in case someone feels ambitious. In the meantime, it should be possible to use the Dask executor to write SLURM jobs locally (perhaps worth a tutorial if it's ultimately too ambitious to include here).
No response
covalent is a dependency when the wrapper_fn function is unpickled and executed. However, when covalent is initialized for the first time, it will try to create a new config file, which means acquiring a filelock inside ConfigManager.update_config(). Trying to acquire the filelock leads to the following error:
Traceback (most recent call last):
File "/global/homes/a/ara/slurm-tests/script-310be8a1-383d-4586-9dc1-821c8120e93f-0.py", line 5, in <module>
function, args, kwargs = pickle.load(f)
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/__init__.py", line 22, in <module>
from . import executor, leptons # nopycln: import
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/executor/__init__.py", line 32, in <module>
from .._shared_files import logger
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/_shared_files/logger.py", line 24, in <module>
from .config import get_config
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/_shared_files/config.py", line 199, in <module>
_config_manager = ConfigManager()
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/_shared_files/config.py", line 52, in __init__
self.update_config()
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/covalent/_shared_files/config.py", line 109, in update_config
with filelock.FileLock(f"{self.config_file}.lock", timeout=1):
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/filelock/_api.py", line 297, in __enter__
self.acquire()
File "/global/homes/a/ara/miniconda3/envs/slurm-test/lib/python3.9/site-packages/filelock/_api.py", line 264, in acquire
raise Timeout(lock_filename) # noqa: TRY301
filelock._error.Timeout: The file lock '/global/homes/a/ara/.config/covalent/covalent.conf.lock' could not be acquired.
asyncssh.connect() takes many optional kwargs, most of which are not accessible to the user of covalent-slurm-plugin. While this is fine for most use cases, one can easily imagine some peculiar setups where additional kwargs are needed to establish the connection. In fact, I had such a scenario, which was reported in ronf/asyncssh#582. It would probably be worthwhile to make this accessible to the user, but I'm also wary of adding yet another parameter, since at some point we don't want to overload the user with options.
If a connection needs additional parameters for authorization to be established in asyncssh.connect(), the only option is to make a custom fork of the plugin. This isn't necessarily as terrible an idea as it sounds, seeing as the plugin is so lightweight and is meant to be customizable.
Error traceback when attempting to submit an electron via the Slurm executor plugin:
File "/home/vbala/work/covalent/covalent_dispatcher/_core/execution.py", line 237, in _run_task
output, stdout, stderr = executor.execute(
File "/home/vbala/.local/share/virtualenvs/covalent-AV-F4ayX/lib/python3.8/site-packages/covalent_slurm_plugin/slurm.py", line 204, in execute
ValueError: invalid literal for int() with base 10: 'Submitted batch job 67110791'
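The failing line evidently called int() on the whole sbatch banner. A defensive sketch (the helper name is mine) that handles both the human-readable banner and the bare id printed by sbatch --parsable:

```python
import re


def parse_slurm_job_id(sbatch_output: str) -> int:
    """Extract the job id from sbatch output.

    Handles both "Submitted batch job 67110791" and the bare id that
    sbatch --parsable prints.
    """
    match = re.search(r"\d+", sbatch_output)
    if match is None:
        raise ValueError(f"No job id found in sbatch output: {sbatch_output!r}")
    return int(match.group())
```

Passing --parsable to sbatch (the parsable: "" option mentioned above) sidesteps the banner entirely, but parsing defensively guards against either form.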