llnl / ats

ATS - Automated Testing System - is an open-source, Python-based tool for automating the running of tests of an application across a broad range of high performance computers.

License: BSD 3-Clause "New" or "Revised" License

ats's Introduction

ATS

Description

ATS is an Automated Test System. It is used to implement regression testing across a variety of HPC platforms.

Getting Started

ATS usage and expectations vary among its user base, and so does how ATS is installed. Below are a few variations that users may find helpful.

For more information, please check our documentation.

Sample install; modify for your project or personal usage.

An install is simply a Python executable with the ATS modules discoverable on its Python path. This is useful when several projects share an environment.

Example installation:

# Load a python 3.8 module, or otherwise put python 3.8 in your path
module load python/3.8.2

# Create a fresh Python 3.8 (or higher) executable to be shared.
python3 -m virtualenv --system-site-packages --python=python3.8 /location/of/your/new/install

# Clone ATS
git clone git@github.com:LLNL/ATS.git <CLONE_PATH>

# pip install cloned ATS into fresh shared Python 3.8 (or higher) executable.
/location/of/your/new/install/bin/python -m pip install <CLONE_PATH>/

Getting Involved

Contact the ATS project lead [email protected]

Contributing

Refer to the Contributing file.

Release

ATS is licensed under the BSD 3-Clause license (BSD-3-Clause, https://opensource.org/licenses/BSD-3-Clause).

Refer to LICENSE

LLNL-CODE-820679

ats's People

Contributors

davidbloss, dawson6, jwhite242, kennyweiss, liu15, mdavis36, mishazakharchanka, tomstitt, white238, wihobbs
ats's Issues

Change in salloc/srun submittal behavior with new slurm

This issue has been resolved, but we want to document it:

Reported by Ben Liu:

I am taking a closer look at the rzalastor results and I don’t think it is doing the right thing.

I ran with two nodes and when I look on the first node, the only processes I see are sruns. When I look on the second node, all the ale3d processes are there.

Followup by Ben Liu:

We have some Gyllenhaal magic that we were using in our salloc command:

salloc -N 2 --exclusive srun -n 1 -N 1 --mem-per-cpu=0 --pty --preserve-env --mpi=none --mpibind=off aleats -N 1

This was supposed to give the same behavior as if you launched the command from the prompt within the salloc, but apparently this works with the old slurm but not with the new.

When I remove this incantation, it looks like I’m getting the expected behavior. Yay!

Thanks,

Comments by Shawn Dawson:

OK, great, glad you figured that out.

Yes, in general, we do not run 'ats' itself within an srun. We run it directly on the allocated node. ATS then submits sruns.

Now I'm curious; I'm going to try that myself and see if I see the same behavior.

OK, reproduced the same behavior.

I allocated 2 nodes and submitted this line:

srun -n 1 -N 1 --mem-per-cpu=0 --pty --preserve-env --mpi=none --mpibind=off -N 1 atsmercury --exec=./mercury --level=10 --verbose

And it put all the tests on the 2nd node. The only thing on the first was the ats.

I then allocated 3 nodes, and similarly -- node 1 was reserved for just the ats script, but nodes 2 and 3 were running the tests.

So that method of running, with slurm on alastor, generates that behavior: the -N1 -n1 is interpreted by slurm to reserve the first node for the submitted job, but the other nodes are available for tests.

Undesirable behavior for testing, but good to know.

When ATS creates the srun line, I have a lot of options I use to give slurm the info it needs to share and load level the running jobs. You can see them if you run with the --verbose flag.

Adjust Flux commands

flux mini is being deprecated, so our command line strings need to be adjusted to avoid this problem/warning in the future (e.g., flux mini run presumably becomes flux run).

Multiple dependencies?

Is it possible to have a test that executes only when multiple predecessors have completed (and passed)?

One solution is:
t1 = test(...)
t2 = testif(t1, ...)
t3 = testif(t2, ...)

But this serializes t1 and t2, which may be able to run concurrently.
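
A hypothetical API for this (not in ATS today; the list-of-parents form is purely illustrative) might let testif take several predecessors:

    t1 = test(...)                # t1 and t2 can run concurrently
    t2 = test(...)
    t3 = testif([t1, t2], ...)    # runs only after both t1 and t2 pass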

try/except sleep unlink out/err files in tests.py

Liu, Ben
Another feature request. On Windows systems, the deletion of the out/err files can sometimes fail because the file has not been finalized yet. Previously we added a try/except around the unlink to get around this (at the end of tests.py).

Would it be possible to sleep(1) and try again, then only give the log message if that also failed?
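
A minimal sketch of the requested behavior (the function name and logging hook are illustrative; the actual change would live at the end of tests.py):

    import os
    import time

    def unlink_with_retry(path, delay=1.0, log=print):
        # Remove a file; on failure, sleep and retry once before logging.
        try:
            os.unlink(path)
        except OSError:
            time.sleep(delay)        # give the OS time to finalize the file
            try:
                os.unlink(path)
            except OSError:
                log("could not remove %s" % path)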

Resolve circular dependency in "executables.py"

Problem

In "ats/src/ats/executables.py" the import statement

from configuration import machine

is called from within the Executable class's __init__ method.

This is a circular dependency workaround, since:

  • "configuration.py" imports "machines.py"
  • "machines.py" imports "configuration.py" (the contents of machines.py are not imported yet)
  • "configuration.py" imports "executables.py"
  • "executables.py" imports "configuration.machine" (more accurately: from configuration import machine)

For now the code works, since Executable (in executables.py) does not try to reference configuration.machine until after it has been imported elsewhere.

Goal

Resolve the dependencies so "from configuration import machine" can live at module level, outside the Executable.__init__() call.
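
One common resolution, sketched here as an assumption (module and attribute names are illustrative, not the actual ATS fix): move the shared handle into a leaf module that imports nothing else, so every other module can import it at top level.

    # ats/_context.py -- hypothetical leaf module with no imports of its own
    machine = None    # assigned once by configuration during startup

    # ats/executables.py
    from ats import _context    # safe: _context never re-enters this module

    class Executable:
        def __init__(self, path):
            self.path = path
            self.machine = _context.machine    # read the late-bound handle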

Update lsf module to avoid using -N with -g option

If a test case has nn=2 or some other nn option specified, and uses the ATS 'lrun' option, ats gives lrun a line like:

lrun -N2 -n4 -g1 -c1 --pack

But lrun states that the -N2, -g, and --pack options conflict; it generates a line anyway.

In practice the generated line causes gpu issues and the code crashes.

Update the lrun module so that, when running lrun with the --pack option, the nn option is ignored. Perhaps print a notice that we are ignoring it.
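
A sketch of the requested guard (names are illustrative, not the actual lrun module code):

    def lrun_nodes_option(nn, pack, notice=print):
        # lrun treats -N alongside -g/--pack as conflicting, so drop nn.
        if pack and nn:
            notice("lrun --pack given; ignoring per-test nn=%d" % nn)
            return []                  # emit no -N flag on the generated line
        return ["-N%d" % nn] if nn else []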

str object not callable in print line

On a Toss system, I am receiving this issue with an updated print line.

It is happening here:

    for v in brothers:
        print("testlist[%d].set(%s, 'Previously ran.') # %s"
            (v.serialNumber - 1, v.status.name, v.name), file=fc)

Here is the python error:

Traceback (most recent call last):
File "/usr/WS2/dawson/git-mercury-ats-new/atsmercury_back_end", line 483, in
result = manager.core()
File "/usr/apps/ats/7.0.4/lib/python2.7/site-packages/ats/management.py", line 918, in core
self.continuationFile(interactiveTests)
File "/usr/apps/ats/7.0.4/lib/python2.7/site-packages/ats/management.py", line 985, in continuationFile
(v.serialNumber - 1, v.status.name, v.name), file=fc)
TypeError: 'str' object is not callable
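
The traceback points at a missing % operator: writing "format string"(args) calls the string object itself, hence 'str' object is not callable. The fix is presumably:

    print("testlist[%d].set(%s, 'Previously ran.') # %s"
          % (v.serialNumber - 1, v.status.name, v.name), file=fc)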

MACHINE_DIR / MACHINE_OVERRIDE_DIR not being honored

Ben Liu reports in their Darwin module that they use MACHINE_DIR / MACHINE_OVERRIDE_DIR.

This is not being honored when setting up the machine module for submitting jobs.
[2:20 PM] Liu, Ben

We ran into a problem with the machine directory search. We have a darwin configuration that we rolled ourselves, but the specification in standard.py is found first and does not work!

We specify MACHINE_OVERRIDE_DIR, but it is placed after MACHINE_DIR.

That seems like incorrect behavior.
(Also, it appears that MACHINE_DIR can no longer be set as an environment variable; at least it is not respected. Is that the intended behavior?)

This is an issue that has existed for a while. We have a hand-rolled darwin config that we put in the MACHINE_OVERRIDE_DIR. But we point MACHINE_DIR at the installation atsMachines directory (which differs from the site-config installation directory in that standard.py is present in the site-config installation and not in the other).

The change in behavior was that the environment variable MACHINE_DIR is no longer respected.

I am ok with that if we can make the following behavioral change: search MACHINE_OVERRIDE_DIR before MACHINE_DIR.
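
A sketch of the requested search order (variable handling is illustrative; the real logic lives in ATS's machine-module discovery):

    import os

    def machine_search_dirs(installed_dir):
        # Build the machine-module search path, override directory first.
        dirs = []
        for var in ("MACHINE_OVERRIDE_DIR", "MACHINE_DIR"):   # override wins
            path = os.environ.get(var)
            if path:
                dirs.append(path)
        dirs.append(installed_dir)    # installed atsMachines dir searched last
        return dirs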

Tox installer does not work for me

I could not get the new tox installer added by @davidbloss to work. This could be user error.

Steps:

# clone repo and cd into repo
$ cd ats
$ python2 -m pip install --user .
$ cd HelloGPU
$ mpicxx hello_gpu.cc
$ PATH=$PATH:~/.local/bin srun -N1 /g/g20/white238/.local/bin/atslite1
Traceback (most recent call last):
  File "/g/g20/white238/.local/bin/atslite1", line 6, in <module>
    from atslite.bin._atslite1 import main
ImportError: No module named atslite.bin._atslite1
srun: error: rzgenie6: task 0: Exited with exit code 1

--smpi_off causes python codec errors

Running ats with --smpi_off throws the error below during a run. Running the same ats test suite without --smpi_off completes successfully, but makes the cuda tests fail (as expected, because we need smpi off for those tests on blueos machines).

  File "/usr/WS2/sphapp/.jacamar-ci/spheral-builds/1175168/build_gitlab/install/.venv/bin/ats", line 10, in <module>
    sys.exit(main())
  File "/usr/WS2/sphapp/.jacamar-ci/spheral-builds/1175168/build_gitlab/install/.venv/lib/python3.9/site-packages/ats/__main__.py", l
ine 9, in main
    result = ats.manager.main()
  File "/usr/WS2/sphapp/.jacamar-ci/spheral-builds/1175168/build_gitlab/install/.venv/lib/python3.9/site-packages/ats/management.py",
 line 686, in main
    core_result = self.core()
  File "/usr/WS2/sphapp/.jacamar-ci/spheral-builds/1175168/build_gitlab/install/.venv/lib/python3.9/site-packages/ats/management.py",
 line 890, in core
    self.run(interactiveTests)
  File "/usr/WS2/sphapp/.jacamar-ci/spheral-builds/1175168/build_gitlab/install/.venv/lib/python3.9/site-packages/ats/management.py",
 line 1016, in run
    unfinished = machine.scheduler.step()
  File "/usr/WS2/sphapp/.jacamar-ci/spheral-builds/1175168/build_gitlab/install/.venv/lib/python3.9/site-packages/ats/schedulers.py",
 line 69, in step
    machine.checkRunning()
  File "/usr/WS2/sphapp/.jacamar-ci/spheral-builds/1175168/build_gitlab/install/.venv/lib/python3.9/site-packages/ats/machines.py", l
ine 93, in checkRunning
    done = self.getStatus(test)
  File "/usr/WS2/sphapp/.jacamar-ci/spheral-builds/1175168/build_gitlab/install/.venv/lib/python3.9/site-packages/ats/machines.py", l
ine 162, in getStatus
    lines = f.readlines()
  File "/usr/WS2/wciuser/Spheral/spheral-spack-tpls/spack/opt/spack/__spack_path_placeholder__/__spack_path_placeholder__/__spack_pat
h_p/linux-rhel7-ppc64le/gcc-8.3.1/python-3.9.10-vm762d2wzoh2cvjmq2descxfxo7craxt/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
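
The decode failure suggests the test's captured output contains non-UTF-8 bytes when smpi is off. A hedged sketch of a tolerant read, for a spot like getStatus in machines.py (not the actual ATS code):

    def read_test_output(path):
        # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
        with open(path, encoding="utf-8", errors="replace") as f:
            return f.readlines()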

lrun, old_default, and mpi_bind not happy

Ben Liu reported an issue (and provided a fix) for these types of runs:

atslite1 --lrun --level=10 --old_defaults
atslite1 --lrun --level=10 --old_defaults --mpibind --mpibind_executable="/usr/tcetmp/bin/mpibind"

The Python used to build the submission string was failing, and ATS died with a Python syntax error.
The solution is to treat str_mpibind correctly.

An MR is up, tested with the above commands, which did replicate Ben's issue.

atss.log is written by default and gets large

We use the floor version of ATS on LC, and every run leaves behind an atss.log file (note the second 's'). I think this file is for debugging scheduling issues. Can we get an option to only write this file when we need to diagnose scheduling issues?
The reason is that with the longer runs on LC systems this file can get really large, and it takes up valuable space when moving results to the filesystem archive and to the html space where they are searched. We have to add a removal for this file in our automation scripts or we hit quotas quicker than expected.

Ensure ATS can run CPU and GPU jobs concurrently on the same node with Flux

Verify (and fix if not verified) that we can

  1. Run CPU only tests and CPU+GPU tests on the same node concurrently using flux.

This may mean creating a new test setup (or tweaking an existing one) such that two codes are specified in the ATS test files.

That is

Code A) Built for the CPU only. Does not need access to the GPU at all (in particular for memory access).
Code B) Built for the CPU+GPU. That is, it will require access to either hipMalloc or hipMallocManaged memory at run time.

Verify we can saturate the nodes (for throughput) with a combination of the above codes.

Project install option broken?

Not sure what the root of the issue is here, given I can import the module in question manually in the interpreter, but the project install option is currently not very happy. I used the python 3.8.2 module on the LLNL toss3 machines, so you can hopefully reproduce it. Anyway, cloning and then attempting to install into an alternate directory gives the enormous stack trace below, pointing at something that's definitely built in to the Python install. Following the steps in the README leads to:

$ git clone [email protected]:LLNL/ATS.git ats3_source

$ python3 -m pip install ats3_source/ --target=ats3/

Processing ./ats3_source
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
ERROR: Exception:
Traceback (most recent call last):
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
    status = self.run(options, args)
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_internal/commands/install.py", line 381, in run
    resolver.resolve(requirement_set)
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_internal/legacy_resolve.py", line 201, in resolve
    self._resolve_one(requirement_set, req)
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_internal/legacy_resolve.py", line 365, in _resolve_one
    abstract_dist = self._get_abstract_dist_for(req_to_install)
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_internal/legacy_resolve.py", line 312, in _get_abstract_dist_for
    abstract_dist = self.preparer.prepare_linked_requirement(
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_internal/operations/prepare.py", line 223, in prepare_linked_requirement
    abstract_dist = _get_prepared_distribution(
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_internal/operations/prepare.py", line 49, in _get_prepared_distribution
    abstract_dist.prepare_distribution_metadata(finder, build_isolation)
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_internal/distributions/source/legacy.py", line 37, in prepare_distribution_metadata
    self._setup_isolation(finder)
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_internal/distributions/source/legacy.py", line 90, in _setup_isolation
    reqs = backend.get_requires_for_build_wheel()
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_vendor/pep517/wrappers.py", line 151, in get_requires_for_build_wheel
    return self._call_hook('get_requires_for_build_wheel', {
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_vendor/pep517/wrappers.py", line 255, in _call_hook
    raise BackendUnavailable(data.get('traceback', ''))
pip._vendor.pep517.wrappers.BackendUnavailable: Traceback (most recent call last):
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/subprocess.py", line 64, in <module>
    import msvcrt
ModuleNotFoundError: No module named 'msvcrt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 63, in _build_backend
    obj = import_module(mod_path)
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/tmp/white242/pip-build-env-uf448wdc/overlay/lib/python3.8/site-packages/setuptools/__init__.py", line 8, in <module>
    import _distutils_hack.override  # noqa: F401
  File "/tmp/white242/pip-build-env-uf448wdc/overlay/lib/python3.8/site-packages/_distutils_hack/override.py", line 1, in <module>
    __import__('_distutils_hack').do_override()
  File "/tmp/white242/pip-build-env-uf448wdc/overlay/lib/python3.8/site-packages/_distutils_hack/__init__.py", line 77, in do_override
    ensure_local_distutils()
  File "/tmp/white242/pip-build-env-uf448wdc/overlay/lib/python3.8/site-packages/_distutils_hack/__init__.py", line 63, in ensure_local_distutils
    core = importlib.import_module('distutils.core')
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/tmp/white242/pip-build-env-uf448wdc/overlay/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 17, in <module>
    from distutils.dist import Distribution
  File "/tmp/white242/pip-build-env-uf448wdc/overlay/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 19, in <module>
    from distutils.util import check_environ, strtobool, rfc822_escape
  File "/tmp/white242/pip-build-env-uf448wdc/overlay/lib/python3.8/site-packages/setuptools/_distutils/util.py", line 11, in <module>
    import subprocess
  File "/usr/tce/packages/python/python-3.8.2/lib/python3.8/subprocess.py", line 69, in <module>
    import _posixsubprocess
ModuleNotFoundError: No module named '_posixsubprocess'

pip installing into a virtualenv seems to work ok still, but wanted to try out this option to compare the workflow differences for everyone (unless that's not supported anymore, which is fine too).

Much appreciate any help sorting this (and/or me) out!

add jsrun_np_max and lrun_np_max command line options

We can currently over-ride the 'np' on the command line (jsrun_np=4, for instance), which over-rides the np on the ATS test line.

We need an option which sets this as the MAX, i.e., do not over-ride when the np on the ATS test line is less than 4 in the above example.
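
A sketch of the requested clamp (the function name is illustrative):

    def effective_np(test_np, np_max=None):
        # jsrun_np replaces np outright; jsrun_np_max would only cap it.
        if np_max is None:
            return test_np
        return min(test_np, np_max)

    # With --jsrun_np_max=4: a test asking for np=2 keeps 2; np=16 becomes 4.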

same node flux usage (--requires=rank:1) not working

This was working,

But in pre-release testing for 7.0.116 it is not now.

That is, I allocate 3 nodes, use the tests in the HelloSameNode test directory in the ATS repository, and the writes and reads which share the same --requires=rank:0 are not being run on the same node.

Need to follow up on this with the flux team.

Will not advertise this as ready in the 7.0.116 release of ATS.

Flux timeout marked as failure by ATS

When a job in Flux times out after being started with the "-t" option, ATS marks it as a failure. Flux throws a job.exception that has a type of timeout; this information could be used in ATS to properly report timeouts.

Here is the output of a test that timed out and currently reports FAIL:
5.035s: job.exception type=timeout severity=0 resource allocation expired
flux-job: task(s) exited with exit code 142
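
A sketch of how ATS could distinguish this case, keyed only off the output quoted above (the string match is an assumption, not flux's Python API):

    def flux_timed_out(output_text):
        # flux writes 'job.exception type=timeout ...' to the job's output
        return "job.exception type=timeout" in output_text

    # e.g.: status = TIMEDOUT if flux_timed_out(log_text) else FAILED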

per-test option cpusPerTask should work with the slurm module

The command line option --cpusPerTask works with the slurm module.
This option is added to the srun line, so it looks like the following when run with ats --cpusPerTask=2:

srun --nodes=2-3 --cpus-per-task=2 --ntasks=40

However, if cpusPerTask=2 is specified on the ATS test line for an individual test case, it is ignored. That is, this line is generated:

srun --nodes=2-3 --cpus-per-task=1 --ntasks=40

The slurm module should honor that option on a per-test basis, so that non-threaded codes can reserve 2 CPUs for a test in order to undersubscribe nodes, or for any other reason.
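
For reference, the per-test usage that should be honored might look like this in a test file (illustrative, following the #ATS: introspection convention shown elsewhere in these issues):

    #ATS:t = test(executable=SELF, np=20, cpusPerTask=2, label='undersubscribed')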

Flux machine issues

I am currently seeing some issues with the flux scheduler that don't crop up with the others. The machine appears to struggle with lots of fast-running jobs, possibly losing track of resource availability, leading to throughput grinding to a halt at some point. This looks to be related to the mix of ats and the flux handler both doing resource tracking.

Additionally, the time argument to flux mini run is causing some issues, requiring large over-estimates of job allocation times; otherwise there is a race condition where ats thinks jobs are still remaining, but flux won't schedule any of them because the time requested exceeds the remaining allocation time.

This was tested with a project-specific ats wrapper, on a flux-scheduled cluster (not bootstrapped within slurm/etc.).

Jeremy

Update --cutoff to over-ride per test job limits

[Yesterday 4:56 PM] Burmark, Jason
ats question, in the Ares tests we sometimes set a specific time for a problem that overrides the default time. I would like to be able to set a default time that would override the specific times if the default time was larger.

[Yesterday 4:59 PM] Burmark, Jason
I'm running into an issue where I run tests with special debugging stuff enabled that slows them down. It's easy to increase the default time, but if a test has a specific time it's not easy to override that.

[7:25 AM] Dawson, Shawn A.
Alrighty, this may be a good project for Zakharchanka, Mikhail .

I ran some tests and confirmed the order of operations for the --timelimit option.

  1. --timelimit on the test line, which over-rides
  2. --timelimit on the command line, which over-rides
  3. the default time limits.

​[7:26 AM] Dawson, Shawn A.
My understanding is that you want a command line option which will over-ride the 'per test' time limit. So if the per-test --timelimit=30 (time seems to just take a single digit at this time, which is minutes), you could over-ride it on the command line to be 1 minute or whatever.
​[7:27 AM] Dawson, Shawn A.
I think that used to be the purpose of the (perhaps not being used at this time) --cutoff option. The help states this for --cutoff and --time
​[7:27 AM] Dawson, Shawn A.
-t TIMELIMIT, --timelimit TIMELIMIT
Set the TIMEOUT default time limit on each test. This
may be over-ridden for specific tests. Jobs will
TIMEOUT at this time. The value may be given as a
digit followed by an s, m, or h to give the time in
seconds, minutes (the default), or hours.

​[7:27 AM] Dawson, Shawn A.
--cutoff CUTOFF Set the HALTED halt time limit on each test. Over-
rides job timelimit. All jobs will be HALTED at this
time. The value may be given as a digit followed by an
s, m, or h to give the time in seconds, minutes (the
default), or hours. This value if given causes jobs to
fail with status HALTED if they run this long and have
not already timed out or finished.

​[7:29 AM] Dawson, Shawn A.
Seems like the --cutoff option (we could rename that perhaps; not sure if any projects are actually using it) could be repurposed to over-ride the time limits specified for each specific job.
​[7:29 AM] Dawson, Shawn A.
Does that sound about right, Burmark, Jason and Zakharchanka, Mikhail?

Caliper Support for test cases

Adding caliper support to the test suites that use ATS requires updates to each existing test case, which can be onerous. Since all codes wanting caliper output need to do this we'd like it to be an option in ATS.

Current implementation using ATS introspection:
#ATS:if checkGlue("caliper"):
#ATS: name = "TestCase2D" + str( uuid4() )
#ATS: outputdirectory=log.directory+'/caliperoutput/'+name+'/'+name+'.cali'
#ATS: myExe = manager.options.executable + ''' --caliper "spot(output=%s)" '''%outputdirectory
#ATS: t = test(executable=myExe, clas="%s radGroups=16 steps=5 useBC=False meshtype=polygonalrz runDirBaseName=%s" % (SELF,name), nn=1, np=ndomains, nt=nthreads, ngpu=ngpus, suite="threads", label='caliper 80a 16g rz FP regression')

What the option would do is:

  • If caliper, append --caliper spot(output=) to the test executable (see the sketch after this list). Note that this didn't work as expected when passed as a clas, only when inserted into the executable var (myExe above).
  • The test's output directory needs to be unique to the test object, so that after introspection each individual test has its own directory. For example, a caliper dir could be saved with the same name as the test log directory, but with extension cal or something like that. Perhaps at the same level, so that caliper data is preserved while run logs can still be cleaned up. We don't want to have to keep everything, just the caliper data.
  • That naming scheme would also be ideal for archiving caliper data for visualization, making it easier to point SPOT at the cumulative runs for an individual test case.
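
A sketch of what the option could do per test, derived from the introspection snippet above (the option name and path layout are assumptions):

    import os

    def apply_caliper(test, log):
        # Give each test its own .cali file beside its run-log directory.
        outdir = os.path.join(log.directory + '.cali', test.name)
        cali_file = os.path.join(outdir, test.name + '.cali')
        # Per the issue, this must go on the executable, not the clas.
        test.executable = '%s --caliper "spot(output=%s)"' % (test.executable, cali_file)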

Move Project Y to Current version of ATS and run using Flux on rzwhippet

Background: Our next machines will use Flux as the scheduler. We need to move projects to the current version of ATS with Flux support. We can test this on rzwhippet today (modulo GPU tests).

Task: Identify a project and change the title of this issue.
Task: Identify all road blocks for the project, which may include transitioning from Python 2 to Python 3 as part of the process.
Task: Get them over the barriers and onto the latest ATS.
Task: Verify they can run using Flux on rzwhippet.

smpiargs breaks runs on blueos

Our code is running into problems running gpu tests on blueos systems. This seems to be a result of ATS adding --smpiargs=-gpu to the jsrun command. Our tests pass when using --smpiargs=off or --smpiargs=disable_gpu_hooks, but there doesn't seem to be a way to change this through ATS as far as I can see.

atslite1 not passing through arguments correctly

Reported by Jerome Verbeke

This is the part that no longer works. I tried several variations, but none works:

$ /usr/apps/ats/7.0.114/bin/atslite1 --exclusive --salloc -t 1h --nosub --filter=level==20 --okInvalid --verbose

INFO: atslite1 --nosub option will be used

rztopaz194

Removing toss_4_x86_64_ib.230825090723.logs

nosub option -- running ATS directly on login or pre-allocated node on toss_4_x86_64_ib

Executing: /usr/gapps/ats/toss_4_x86_64_ib/7.0.114/bin/ats --verbose --exclusive --salloc -t 1h --filter="level==20" --okInvalid --verbose ...

machineDir /usr/gapps/ats/toss_4_x86_64_ib/7.0.114/lib/python3.9/site-packages/ats/atsMachines

from ats.atsMachines.slurmProcessorScheduled import SlurmProcessorScheduled as Machine

SLURM VERSION STRING 22.05.8
SLURM VERSION NUMBER 22508

Found specification for slurm36 in /usr/gapps/ats/toss_4_x86_64_ib/7.0.114/lib/python3.9/site-packages/ats/atsMachines/slurmProcessorScheduled.py

Batch specification for toss_4_x86_64_ib in /usr/gapps/ats/toss_4_x86_64_ib/7.0.114/lib/python3.9/site-packages/ats/atsMachines/standard.py

Added filter: '"level==20"'

RZNevada -- concurrent job runs fail

The 7.0.5 version of ats uses slurm options to run concurrent jobs. This works on alastor, genie, etc.

On rznevada this fails. While ATS can run jobs one after another (using the --sequential command line option), when two or more jobs are started concurrently, the jobs fail with:

srun --exclusive --mpibind=off --distribution=block --nodes=1-2 --cpus-per-task=1 --ntasks=2

0: Fri Jul 23 10:59:54 2021: [PE_0]:inet_listen_socket_setup:inet_setup_listen_socket: bind failed port 1371 listen_sock = 3 Address already in use
0: Fri Jul 23 10:59:54 2021: [PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
0: Fri Jul 23 10:59:54 2021: [PE_0]:_pmi_init:_pmi_inet_listen_socket_setup (full) returned -1
1: Fri Jul 23 10:59:54 2021: [PE_1]:inet_listen_socket_setup:inet_setup_listen_socket: bind failed port 1371 listen_sock = 3 Address already in use
1: Fri Jul 23 10:59:54 2021: [PE_1]:_pmi_inet_listen_socket_setup:socket setup failed
1: Fri Jul 23 10:59:54 2021: [PE_1]:_pmi_init:_pmi_inet_listen_socket_setup (full) returned -1

v 6.0+ import of six failing on toss

Not sure why this is just now showing up, but in the UltraCheckers, users are getting this at run time:

Traceback (most recent call last):
File "/usr/apps/ats/6.4.0/lib/python2.7/site-packages/atsASC/checkers/UltraCheck107.py", line 87, in
from atsASC.modules.UltraCheckBase2 import UltraCheckBase2
File "/usr/apps/ats/6.4.0/lib/python2.7/site-packages/atsASC/modules/UltraCheckBase2.py", line 3, in
from six.moves import zip_longest
ImportError: No module named six.moves

This had been working.

I thought it had something to do with version 7.0+, so I reverted to version 6.4.0, and it is still happening.

Possibly something to do with the user environment?

ATS does not return error codes on failed tests

This function's docstring says that it returns false on interrupted or failed tests, but there isn't a return statement:

ATS/ats/management.py

Lines 661 to 683 in 7ed78d6

    def main(self, clas = '', adder=None, examiner=None):
        """
        This is the main driver code.
        Returns true if all interactive tests found passed, false if interrupted or
        an error occurs.
        ``clas`` is a string containing command-line options, such as::
            --debug --level 18
        If ``clas`` is blank, sys.argv[1:] is used as the arguments.
        See ``configuration.init``.
        Routines ``adder`` and ``examiner``, if given, are called in ``configuration``
        to allow user a chance to add options and examine results of option parsing.
        """
        self.init(clas, adder, examiner)
        self.firstBanner()
        self.core()
        self.postprocess()
        self.finalReport()
        self.saveResults()
        self.finalBanner()

This causes problems in CI because unless you parse the reports for failed tests, jobs do not mark themselves as failures.
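
A minimal sketch of what CI needs once main() actually returns the boolean its docstring promises (the trailing return and the badlist attribute are assumptions; ats.manager.main() itself appears in the tracebacks above):

    import sys
    import ats

    # Hypothetical: main() would end with something like
    #     return not self.badlist    # empty bad list == all tests passed
    # so a CI entry point can turn that into a process exit code:
    if __name__ == "__main__":
        ok = ats.manager.main()
        sys.exit(0 if ok else 1)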

Add ability to run a test on the same node as a previous test

It would be very useful to add the ability to run a follow-up command, on the same node, if the test passed. For example, I have a test that outputs files whose answers I want to check. Unfortunately, if I do the following:

foo_test = test(executable="foo")               # outputs some file filled with answers
testif(foo_test, executable="foo_checker.py")   # checks file output by foo for correctness

the foo_checker.py test runs on a completely different node from foo, and the file is not there yet. So my team (and others I have spoken to) have to add sleep(60) to the start of our checkers.

This could be avoided if they ran on the same node. Something like:

test(executable="foo", checker="foo_checker.py")   # runs executable, then runs checker only if foo passes

or:

foo_test = test(executable="foo")
testif(foo_test, executable="foo_checker.py", same_node=True)

Edit: another idea after talking to @dawson6:

# command_group may be a better name than wrapper
my_wrapper = wrapper(executable="foo")
my_wrapper.add(executable="checker")
test(my_wrapper)

CPU and GPU jobs on same node

Codes are packing CPU-only and GPU jobs onto the same nodes on sierra-like systems, and because of how that creates job launch commands, CUDA_VISIBLE_DEVICES isn't correct.

slurm: deprecation of --shared

From Ben Liu:

slurmProcessorScheduled.py:292

    if self.exclusive == True:
        ex_or_sh = "--exclusive"
    else:
        ex_or_sh = "--share"

uses a deprecated (and no longer existing) flag. The correct flag is "--exclusive=user".
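
The corrected branch would presumably read (per the report; --exclusive=user stands in for the removed --share):

    if self.exclusive:
        ex_or_sh = "--exclusive"
    else:
        ex_or_sh = "--exclusive=user"    # --share no longer exists in slurm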

Add 'mpirun' module

We have users of ATS who use laptops.
The laptops have MPI installed and look like Linux.
They need a module which uses 'mpirun' rather than 'srun' to run parallel tests.
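
A sketch of the core of such a module (function and attribute names are illustrative, modeled on how the srun-based modules build their lines):

    def mpirun_line(test):
        # Build an mpirun command string for one test.
        np = getattr(test, "np", 1) or 1
        return "mpirun -np %d %s %s" % (np, test.executable, test.clas)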

Install and publicize 7.0, 7.2, etc. links

A great @brunner6 suggestion was to set up symlinks such as

7.0 -> 7.0.6

etc.

And advertise the use of these links ('7.0', '7.1', etc.) to projects. They would not have to update their version number as frequently, AND we would not have to re-install a patch over the same number, such as 7.0.5.

The goal is to keep the minor number stable: only updates which do not break the interface would be applied to the 7.0 installs.

line break in atsr.py is breaking project processing

[10:08 AM] Dempsey, Stephanie M.
I'm seeing an error when using the atsr.py output file in 7.0.5, which breaks the reporting for KULL using the floor install of ats. The atsr.py file has the 'state' of testing, and we execfile it to get an object we can create a junit report out of (or do whatever else you'd want to do). As it is now, there's a newline which breaks the execfile. It looks like the first few lines of the file need to be joined before they can be exec'd (at least that fixes it for me).

That has to be done when the file is written, though, rather than on the execfile end.  Any chance this could be fixed?

It is a difference between 6 and 7.

​[11:01 AM] Dawson, Shawn A.
Hi Stephanie. Testing with Kripke to see the diffs. Confirming that it used to look like so
state = AttributeDict(
badlist = [] ,
batchmachine = None ,
etc.

And now looks like so:
state =
AttributeDict(
badlist = [],
batchmachine = None,
etc.

And that line break is breaking your processing. I'll see if I can get that put back the way it was

Salloc + Slurm on Toss 4 (rzwhippet) Fails

On Toss 3, one could run like so:

salloc -N 3 -p pdebug --exclusive srun -n 1

And that will run the atswrapper on 1 of the allocated nodes, which would then run 'srun -n 1' commands on that node to submit all the jobs.

The benefit of this is that, while 'atswrapper' is not an MPI application, it prevents the followup srun jobs, submitted by atswrapper, from running on the login node.

This works on toss3.

But on toss4 (rzwhippet) the followup srun jobs all fail with:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet40: Unable to satisfy cpu bind request
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet41: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request

Now, one can run like so:

salloc -N 3 -p pdebug --exclusive

and while that runs, it does the 'srun's on the login node, which looks bad.

OR one can run by splitting that into two steps:

  1. salloc the nodes somehow
  2. run atswrapper

But combining the salloc ... srun into one line has issues now; it did not with toss3.

Add --unbuffered srun option for slurm module

Requested by Mike Lambert.

"It is difficult to see what is going on in a code when some layers of printf, cout, and/or cerr are buffering output. We can send –unbuffered to srun. Can we request –unbuffered through the ATS interface?"

Ben Liu - pip install from 7.0.5 issue

We ran into some issues using ATS 7.0.5 (using the pip-install local method) with ALEATS on the Mac. We are using the .pth method to find modules. We get the following error:

ats module cannot be imported; check Python path.

The ats installation directory is the 4th from the last file. However, when we added "__init__.py" (an empty file) into that directory, it seemed to work. Any ideas why?

See ATS Developers Teams Chat

slurm module breaks when the version string contains "8-2"

SlurmProcessorScheduled.slurm_version_int = (int(tarray[0]) * 1000) + (int(tarray[1]) * 100) + (int(tarray[2]))

ValueError: invalid literal for int() with base 10: '8-2'
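
A sketch of a more tolerant parse (an assumption, not the actual ATS fix): keep only the leading digits of each component, so a patch level like "8-2" contributes 8.

    import re

    def slurm_version_int(version_string):
        # "22.05.8-2" -> 22*1000 + 5*100 + 8 == 22508
        parts = version_string.split(".")
        nums = [int(re.match(r"\d+", p).group()) for p in parts[:3]]
        return nums[0] * 1000 + nums[1] * 100 + nums[2]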
