exaworks / psij-python Goto Github PK

License: MIT License

Makefile 0.52% Python 80.94% Shell 8.27% HTML 5.92% CSS 1.91% Ruby 0.34% JavaScript 2.09%

psij-python's Introduction

Portable Submission Interface for Jobs

PSI/J is an abstraction layer over cluster schedulers to write scheduler agnostic HPC applications. PSI/J automatically translates abstract job specifications into concrete scripts and commands to send to the scheduler. PSI/J is tested on a wide variety of clusters. For further information about PSI/J or the SDK visit the ExaWorks project page and read the PSI/J documentation.

Introduction

psij-python's People

Contributors

Stargazers

Watchers

Forkers

stevwonder jameshcorbett wilke j-woz jlnav kinow yarikoptic cyndy-llnl ramonara209 exaworks adi611 frobnitzem hjjvandam

psij-python's Issues

CI: add testing for Radical connector

Blocked by: ExaWorks/SDK#47

Add coverage to tests

We don’t quite have anything that tells us how much of the code we test (such as, in the above subtask, how many of the scheduler options we exercise). We should, in both github tests and the userland ones, with proper reporting to the server.

Inconsistent variable names in launchers

There are a few instances of internal variables that are meant to refer to the same thing having slightly different identifiers, such as ending with an underscore when being defined and referenced without the underscore.

Find and use tool to automatically check docstrings

It may be useful to have a tool that ensures that:

all parameters of a function/method are documented
changes to parameters are matched by updates to doctstrings
docstring links are valid

Implement pre/post-launch

Currently, launchers transform a spec into a list of arguments that are meant to be executed by executors, either directly, or placed into a submit script.

But this does not leave proper room for pre-launch and post-launch scripts, since they need to be "sourced". This implies that:

They must be "sourced" in a script
That script must be a script that also contains the launch command.

This is not currently straightforward with the local executor and the single launcher since no script is generated by either.

Things should change such that:

All launchers use a script.
Pre and post launch scripts are sourced by the launcher scripts.

Mac support

Local connector tests are failing on Mac. @hategan is investigating debuggin on AWS

Provide a command line tool to run jobs

Swift/T, and other non-python or shell based projects, in order to use PSI/J, would need a command-line interface to running jobs. This issue is here to discuss the requirements, come up with a design, and track the implementation.

an initial task would be to attempt to gather requirements from teams that are potential users of such a feature (@j-woz mentioned one such team).
an initial rough design is based on the idea that one needs to communicate a JobSpec (and properties) and executor + config from one process to the other; this would be done through some serialization mechanism, which could take the form of:
- command line arguments (e.g., --spec.attributes.duration=...)
- json
we talked about a sync vs async interface and concluded that the biggest benefit/cost ratio would be achieved with a sync implementation, since more complex projects would use higher level tooling; for reference:
- sync here means run-job ... which exits when the job completes
- async implies a set of tools, such as submit-job, cancel-job, job-status; there is considerably more complexity in implementing these, at least in layer 0 (basically these would roughly be a local version of layer 1)

Discuss delegated testing strategies

This is a very general discussion. Its intent is to possibly hash out a solution for wider testing environment than what would normally be possible.

Our problem is as follows:

PSI/J is meant to be used to talk to a wide variety of cluster scheduler setups. Many of these are unique in various ways, despite using the same core software. In order to increase the likelihood that PSI/J functions correctly on such a wide variety of systems, proper testing is necessary on each potential target machine. However, we, as developers, have access to a rather limited subset of all potential setups. This difference between the machines we should test on and the machines we can test on forms a gap that we should address.

So we must delegate the testing to either a system that has access to many machines or to users that have such access. We'd like to explore the possibility of doing the latter.

In principle, potential users, should they decide to use PSI/J, would be motivated to ensure that releases are compatible with their environment. If we provide a sufficiently simple mechanism for users to run tests and contribute the results back to us, we may be able to bridge the gap above in the sense that a system would emerge that would effectively act as if automated tests were run on a much wider variety of systems than what we have access to. The important part to note is that we do not need access to all said systems; we only need the results from tests that are run by users who do.

Replace gcc in the launcher file generation

Currently launcher files are generating using the (g)c(c) pre-processor. This is heavy-handed since the only actual operation done during this generation is the concatenation of two files. This should be changed to a simple python-based solution that gets called automatically by setup.py.

Flux executor broken

I am not sure if this is a problem with my local setup, or if the state of the executor deteriorated for some other reason, but I observe this behavior on main:

$ ./tests/run_job.py local
Job submitted
Job done: JobStatus[COMPLETED, time=1629979965.2759883, exit_code=0]

$ ./tests/run_job.py flux
Job submitted
Job done: JobStatus[FAILED, time=1629979974.6525922, meta={'type': 'alloc', 'severity': 0, 'userid': -1, 'note': 'alloc denied due to type="unsatisfiable"', 'exit_code': None, 'message': 'alloc denied due to type="unsatisfiable"'}]

I did not yet dig into it, just recording the ticket for now.

PS.: the ticket referred to running the test w/i a flux instance - but that mode is not supported in main yet, there a flux instance is started internally - I edited to reflect that.

MPI launcher doesn't properly deal with output

Currently, the launcher redirects all output from mpirun to the spec stdout/stderr. This may or may not do the right thing when mpirun succeeds, but it does the wrong thing when it fails, since it ends up redirecting the error from mpirun to the job output files.

What should happen is that only the job stdout/stderr should be redirected to the files specified by job.spec and any errors from mpirun should be part of the launcher output. This can be done by passing "--output-filename" to mpirun. This produces pairs of output/error files, one such pair for each rank. These files should be concatenated by the launcher after mpirun returns.

Code Style

As we discussed today on the call, we should try and converge on a coding style.

There are a few popular options that I found:

black
pycodestyle (formerly pep8)
yapf

My personal preference is for black. We use it on Flux, Spack, and a few other LLNL projects. It is actually the formatter/style recommended internally at Livermore Computing. Everyone has a different qualm or nitpick with the style that it enforces, but most people I talk to are fine with it overall. The major benefit with it is that it has no configuration, so you never spend any time fighting with people over how things should be; they are just what they are. Another minor benefit is that black makes code review easier my optimizing for smaller diffs. Relevant excerpt from their README:

Black is the uncompromising Python code formatter. By using it, you agree to cede control over minutiae of hand-formatting. In return, Black gives you speed, determinism, and freedom from pycodestyle nagging about formatting. You will save time and mental energy for more important matters. Blackened code looks the same regardless of the project you're reading. Formatting becomes transparent after a while and you can focus on the content instead. Black makes code review faster by producing the smallest diffs possible.

Also relevant, Ian Lee is an LC employee that used to be a major contributor to pycodestyle (# 2 in LOC), but AFAIK he now recommends people use black.

All that said, I'd be happy to use the other formatting tools, but I have no interest in configuring them or iterating on the format style. So if we go with pycodestyle or yapf, I will just use whatever config is decided upon by others.

``_pytest._io`` ImportError

I ran into an exception while running the tests:

(37B) [corbett8@lassen709:psi-j-python]$ make tests -- --upload-results
PYTHONPATH=/usr/WS2/corbett8/ExaWorks/psi-j-python/src: python -m pytest -v --upload-results
ImportError while loading conftest '/usr/WS2/corbett8/ExaWorks/psi-j-python/tests/conftest.py'.
tests/conftest.py:20: in <module>
    from _pytest._io import TerminalWriter
E   ImportError: cannot import name 'TerminalWriter' from '_pytest._io' (/usr/tce/packages/python/python-3.7.2/lib/python3.7/site-packages/_pytest/_io/__init__.py)
make: *** [Makefile:14: tests] Error 4

The problem went away when I upgraded pytest from 4.3 to 6.2. So the easiest solution would be to require pytest >= X or similar in our requirements file. But I was wondering if there are other options available to us that wouldn't require _pytest imports. We could replace the line with from py.io import TerminalWriter but py is in maintenance mode. Any thoughts?

Write 0.1.0 developer docs

Templating engine woes

Our templating engine, pystache, is causing us some trouble. It breaks our readthedocs builds and other things (more here). The way it fails is configuration-dependent but it manages to fail in interesting ways no matter what.

@hategan mentioned a few options we have:

There is a promising fork of pystache whose maintainer is attempting to take over the pypi package (see pypi/support#1422). We could point to that github repo until the issue gets sorted.

There is also a newer implementation of mustache, https://github.com/noahmorrison/chevron, but it seems less used/tested. However, if the situation above does not resolve in a timely manner, this may end up being the better solution.

I wonder though if we might consider using jinja instead?

Complete 0.1.0 API guide

Complete documentation for 0.1.0

This is a parent issue for a number of documentation subtasks:

Add plugin contributor guide

We need to add a guide on how to write and publish executors and/or launchers.

'external' imports should not stop from loading `psi.j` module and doc build

CI: Local Connector Tests Failures

In the CI for #27, the CI is failing due to the following errors:

============================= test session starts ==============================
platform linux -- Python 3.6.13, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/runner/work/psi-j-python/psi-j-python
collected 8 items

tests/test_doc_examples.py ..                                            [ 25%]
tests/test_local_jobs.py .F.F..                                          [100%]

=================================== FAILURES ===================================
___________________________ test_simple_job_redirect ___________________________

    def test_simple_job_redirect() -> None:
        with TemporaryDirectory() as td:
            outp = Path(td, 'stdout.txt')
            job = Job(JobSpec(executable='/bin/echo', arguments=['-n', '_x_'], stdout_path=outp))
            exec = JobExecutor.get_instance('local')
            exec.submit(job)
            job.wait()
            f = outp.open("r")
            contents = f.read()
>           assert contents == '_x_'
E           AssertionError: assert '' == '_x_'
E             - _x_

Not sure if #28 or any other open PRs fix this issue.

Expand tests to exercise complete range of scheduler options

We expect that between all potential projects using PSI/J, all scheduler options will be exercised. Tests need to be written for all supported options.

docs: RTD not generating the complete autodocs

Remove NativeId class

The NativeId class is an ABC that's meant to be subclassed such that executors return it as on opaque object from list() and can be used in attach(). However, it can't be fully opaque since the purpose of attach() is to allow different processes to submit/monitor a job. The object cannot be communicated directly between processes without some serialization, which should not be implementation specific.

Long story short, a string may be a better class to represent native ids.

Type Annotations

Since we're talking about code style and docstring format, I was wondering if it would be worth discussing type annotations. It sounds like everyone else is intending to annotate everything, which is fine by me, although I wonder if they might be optional until a given project (assuming that the policies put in place in this repo would carry over to Python code in other ExaWorks repos) finds that they would really be beneficial.

RCT integration: data staging

For integrating PSI-J with RCT, we would need the ability to stage data into the resource before a job is submitted, and to stage data out after the job completed.

Creating executors/launchers for CLI-based schedulers: Slurm, PBS, Torque, etc

On the ☎️ call today, we discuss how best to proceed with including executors for CLI-based interfaces provided by most schedulers. The three main interfaces of these schedulers appears to be (with example CLIs from Slurm):

The batch script and the associated batch submission command (e.g., sbatch)
The job status command (e.g., squeue)
The parallel launcher (e.g., srun)

The first two interfaces being encapsulated as an executor and the latter being encapsulated as a launcher.

Random various notes/observations from the discussion today:

Previous efforts at this sort of thing suffered from huge performance cost to polling of the job status (e.g., qstat/squeue), if we implement it generically, we can do it “right”/performantly/etc once and then it will apply to all CLI-based adaptors
One lesson learned is that submit scripts should be based on templates rather than buried/generated deep inside the python code. This makes it easier for external collaborators to contribute and maintain executors. Templates cannot handle every corner case, but their ease-of-development/maintenance is probably worth it.
We need some way to handle multiple versions of the same backend for an executor. For example, the output from squeue changes between Slurm versions, and trying to handle every version discrepancy in a single file/class results in a rats nest of ifs.
It might be beneficial to develop two executors in parallel so that we can see what is common between the two and abstract as much as possible to higher levels/classes.

JobExecutor: add static method for listing available executors

On the ☎️ call today, there seemed to be agreement on including a static method in JobExecutor that lists the available/installed executors.

We could name it list_instances(). To avoid confusion with the list method, maybe we call it something like get_available_instances() or get_installed_instances()? Any thoughts on naming?

Create a website

Move _set_status out of Job

There is a re-usable _set_status method in Job which sets the status and notifies relevant callbacks. This method is not called from Job and is meant to be called from executors. This effectively makes it a public method and reliance on it prevents another possible Job class that correctly implements the spec from working in its place (not that we're encouraging that).

Anyway, _set_status appears to belong more in the JobExecutor abc than in Job, so it should be move there.

Change module name to `psij` to avoid clashing with `psi`

This includes both the imports in the code and the eventual PyPI package.

RCT integration: remote submission

TO integrate PSI-J with RCT, PSI-J would need the ability to submit jobs remotely, usually over an ssh connection.

Write 0.1.0 user guide

Handling executors with dependencies

Currently, if you try to run on a system without flux or radical installed, you will get a runtime error, even if you are just attempting to use the local executor. The radical one is a relatively easy fix since you can pip install -r requirements-connector-radical.txt, but no such PyPI package or requirements file exists for Flux. Both Flux and radical (I think) are limited to Linux only, so they preclude local mac development. As we expand the set of executors that we support, I imagine the list of overall requirements will continue to grow.

In #43 , I initially attempted to solve this issue by A) removing the explicit imports of the flux and radical executors in the `` and B) wrapping with try/except the implicit import of those executors by the plugin discovery mechanism. For part B, if the flux or radical import failed, it was silently ignored.

@hategan rightfully pointed out that:

I think removing the requirement for Flux to be installed and importable is fine. But I worry about the error being silently ignored. Both legitimate errors in importing flux classes as well as flux not being installed are presented to the user as the absence of the flux executor. This could potentially send users on very wasteful wild goose chases.

and then suggested:

either moving executors with heavy dependencies to their own, separately installable packages, as I think we might have discussed as a possibility in the past. Alternatively, save the import exception and throw it when the user attempts to instantiate the flux executor. I think the latter can be made as a generic solution in the mechanism used to discover/instantiate executors.

log consistency across executors

from @SteVwonder on #26:

do we want to have some logging consistency across executors for some of the common things like state changes and job submission? If so, we can spin out a separate issue on how to handle this and what logs to do this for.

This is that separate issue.

Docstring Format

As discussed on the call today, we agreed on using Sphinx to generate our documentation from docstrings, but we need to decide/agree on a docstring format. The three most popular formats are RestructuredText, Google-style, and Numpy-style.

The latter two are supported in Sphinx via the Napoleon extension (which actually now ships as a part of Sphinx proper). The Napoleon documentation has a great overview of the situation: https://sphinxcontrib-napoleon.readthedocs.io/en/latest/index.html. The tl;dr (too long didn't read) is: Napoleon pre-processes the google/numpy docstrings into the RestructuredText format before handing things over to sphinx for parsing and doc generation. This means that all three formats can live together in the same codebase and even within the same docstring. I'm not sure if google/numpy covers literally every feature of sphinx's docstrings, but if not, you can always fallback to ReST syntax for that particular feature/line.

IMO, google and numpy style are way more legible (to me, the RestructuredText docstrings border on line noise sometimes with all the colons, poor grouping of type with description and poor separation of args, returns, exceptions, descriptions). They are also much easier to remember and write than ReST. IMO, the google format is easier to write than the numpy format since it uses tabs rather than ------; so my vote is for the google style. Ultimately, I would also be totally fine with the numpy format too. Out of ease-of-use and legibility concerns, I vote against the ReST format/syntax.

Flux executor: re-add support for `attach`

73bc66b temporarily dropped support for attach in the Flux executor due to errors detected by mypy and the need for the FluxExecutor class in Flux (not to be confused with jpsi's Flux executor) to support attaching to normal Flux jobs submitted outside of the FluxExecutor interface. Support was added with flux-framework/flux-core#3790 but no tag has occurred since then. Once a tagged release occurs, maybe then would be a good time to revisit this.

Remove static access to executors through psij.executors.init.py

The issue here is executors with dependencies when we would like to guard against import failures due to missing dependencies, such as is the case for the FluxJobExecutor.

We could wrap the import into a try block. However, I don't see how this could be made to work with static type checking. The fundamental problem is that mypy imports modules and checks against the resulting symbols. If the import protection mechanism replaces the FluxJobExecutor attribute in psij.executors.__init__.py with a mock object, mypy cannot effectively check against the specifics of FluxJobExecutor.

We could do something like 6e04d21, but we run into the same problem within the Flux executor, although with a different manifestation: we can't make mypy happy without hiding the type of the guarded variables.

Even if the executors are not accessible as psij.executors.<name>, they remain accessible statically through their canonical names, such as psij.executors.flux.FluxJobExecutor, so the consequences are minor.

This is not a problem with the dynamic executor loading scheme since objects returned by the executor factory have the base JobExecutor type.

@andre-merzky, @jameshcorbett: thoughts?

Rethink `cores_per_process` and `processes_per_node`

Some of the fine-grained placement options in the spec don't nicely (if at all) translate into corresponding options for batch schedulers (LSF and Slurm are known to have some difficulties there).

This might need to go back to the spec, but we do need to figure out what is reasonable in that area and how to better specify such options.

`setup.py` doesn't build scripts when using a development install

When I ran pip install -e . to get a development install of PSI-J, I found that the scripts in src/psij/launchers/scripts weren't built like they should be, and so all of my Jobs failed. The scripts aren't built because the build command is not executed (develop is instead, I think) and so our CustomBuildCommand defined in setup.py is never run.

As a fix, we could add a CustomDevelopCommand and refer to it in the cmdclass argument to setup(). But my preferred fix would be to remove the cmdclass stuff entirely and to instead break out the script-creation logic into some free functions that run before setup(). I prefer that approach because I find the setuptools/distutils stuff a little opaque.

Missing dev requirements

pytest, flake8, mypy are missing from the requirements-dev.txt. I think they should be added.

Is there a particular set of versions we should pin them to? Or just always go with the latest? If we tie their passing to the CI, we've had issues in Flux where CI will start failing on a random PR due to a version bump by one of the static analysis tools.

CI: add testing for Flux connector

Blocked by: ExaWorks/SDK#47 and #14

Use ECP CI

Use CI testing platforms provided by ECP or ECP participating labs.

Move SAGA connector from jpsi

mypy checks fail on case-insensitive filesystem

The case insensitivity means that an unintentional alias is created for the Job class since psi/j/Job.py and psi/j/job.py seemingly both exist.

(.venv-vscode-focal)  fluxuser@1f7fe27de5f4:/workspaces/psi-j-python$ make typecheck
mypy --config-file=.mypy --strict src tests
src/psi/j/Job.py:117: error: Argument 1 to "cancel" of "JobExecutor" has incompatible type "psi.j.Job.Job"; expected "psi.j.job.Job"
(.venv-vscode-focal)  fluxuser@1f7fe27de5f4:/workspaces/psi-j-python$ mypy --version
mypy 0.910

The impact here is that mypy checks will fail on any mac that has a case insensitive filesystem and windows machines. I also suspect that runtime can be impacted by this (e.g., import psi.j.Job - should that import the "Job.py" file or the Job class exposed by psi.j).

Default time limit does not make it into batch script

When I create a JobSpec with attributes=None, the defaults (time limit = 10 minutes, node count = 1) don't make it into the generated batch script. I end up with something like this:

#!/bin/bash



#BSUB -env all



#BSUB -e /dev/null
#BSUB -o /dev/null

exec &>> "/g/g12/corbett8/.psij/work/lsf/$LSB_JOBID.out"

/bin/bash /usr/WS2/corbett8/37B/lib/python3.7/site-packages/psij/launchers/scripts/single_launch.sh a7c0891a-0818-4e0a-b99d-055f38dea296 '' '' '' /dev/null /dev/null /dev/null /bin/false 

echo "$?" > "/g/g12/corbett8/.psij/work/lsf/$LSB_JOBID.ec"

Installation with pip

If I do:

git clone [email protected]:ExaWorks/psi-j-python.git
virtualenv -p python3 ./.virtualenv
source .virtualenv/bin/activate
pip install -r requirements.txt
pip install -e .

And then try to run one of the example files in test, I get:

(126) herbein1 ~/Repositories/exaworks/psi-j-python/tests [.virtualenv] (main ?)
❯ python ./test_doc_examples.py                                                                                                                                                              17:08:01 ()
Traceback (most recent call last):
  File "/Users/herbein1/Repositories/exaworks/psi-j-python/tests/./test_doc_examples.py", line 1, in <module>
    import psi.j
ModuleNotFoundError: No module named 'psi'

Looking at the virtualenv, it seems the package was installed as jpsi-python:

❯ find ./.virtualenv -name "*psi*"                                                                                                                                                           17:08:37 ()
./.virtualenv/lib/python3.9/site-packages/jpsi-python.egg-link

I don't totally know my way around the packaging of python, so I'll defer to others, but I think the issue here is in setup.py which has the name of the package as jpsi-python. Not sure if we want to tweak the setup.py to have a different package name or tweak the imports.

Expand range of sites used for testing

The tentative goal is that each of us bootstrap tests on a reasonable set of machines we have access to.

Executors missing `VERSION`

❯ python3 -c 'from psi.j.job_executor import JobExecutor; exec = JobExecutor.get_instance("local"); print(exec.version)'                                                                                   19:35:22 ()
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/herbein1/Repositories/exaworks/psi-j-python/.virtualenv/lib/python3.9/site-packages/psi/j/job_executor.py", line 69, in version
    return cast(Version, getattr(self.__class__, '__VERSION__'))
AttributeError: type object 'LocalJobExecutor' has no attribute '__VERSION__'

Flux executor should work with existing Flux instance

LSF and Cobalt executors need launchers

In order to launch applications across multiple nodes with various resource requirements.