Comments (9)
I'm not familiar enough with the libfabric API to know how to do it programmatically off the top of my head. You could maybe look at how Mercury selects providers in na_ofi.c.
I'm not sure how well it will work with the code as structured, but you can also set a runtime environment variable (e.g. in the job scripts) that restricts the set of providers that libfabric will allow. This would be the FI_PROVIDER environment variable, which takes a comma-separated list of providers, kind of like an allow list. See https://ofiwg.github.io/libfabric/main/man/fabric.7.html.
On Polaris we want to test "verbs,rxm" (two providers are required to use verbs in reliable datagram mode), on Crusher we want "cxi", and on Theta we want "gni", for example. I guess you could try setting that (or whatever is appropriate for your test platform) and see if the tests execute.
Based on this discussion it sounds like we really need the test output to report what provider was used (independent of what was attempted) for validation. I don't know what provider is being selected by default in the tests thus far, but if it is the tcp provider that's not really the transport we want to be testing.
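As a rough illustration of the FI_PROVIDER approach described above, a job script could pick the provider list per platform before launching the tests. This is only a sketch, not part of fabtsuite: the function name `provider_for_system` and the lowercase system labels are hypothetical, and the provider strings are simply the ones suggested in this comment.

```shell
#!/bin/sh
# Hypothetical helper: map a system label to the FI_PROVIDER value
# suggested in this thread. The labels are assumptions for illustration.
provider_for_system() {
    case "$1" in
        polaris) echo "verbs,rxm" ;;  # verbs in reliable datagram mode needs rxm
        crusher) echo "cxi" ;;
        theta)   echo "gni" ;;
        *)       return 1 ;;          # unknown system: leave the choice to libfabric
    esac
}

# Example: restrict libfabric to the Crusher provider for this job.
if provider=$(provider_for_system crusher); then
    export FI_PROVIDER="$provider"
    echo "FI_PROVIDER=$FI_PROVIDER"
fi
```

Echoing the value into the job output at least records what was attempted; it does not prove which provider libfabric actually selected.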
from fabtsuite.
@hyoklee can you confirm that the rest of the test suite passes on cxi?
from fabtsuite.
I don't know. @gnuoyd, do you know? @derobins, please answer this question if you know.
@carns, how do other projects specify a provider?
from fabtsuite.
When FI_PROVIDER is set to gni, I get the following error:
hyoklee@thetalogin6:~/fabtsuite-m/build/transfer> ./fabtget
0.000000083 capabilities not available?
main.4785: fi_getinfo: No data available
When FI_PROVIDER is set to cxi, the test hangs:
[[email protected] build]$ export FI_PROVIDER=cxi
[[email protected] build]$ ctest -I 1,1
Test project /ccs/home/hyoklee/fabtsuite/build
Start 1: single-node
C-c C-c
When FI_PROVIDER is set to verbs,rxm on Polaris, the test hangs with libfabric 1.15.0:
hyoklee@polaris-login-02:~/fabtsuite/build> export FI_PROVIDER=verbs
hyoklee@polaris-login-02:~/fabtsuite/build> ctest -I 1,1
Test project /home/hyoklee/fabtsuite/build
Start 1: single-node
C-c C-c
from fabtsuite.
How are you building libfabric? (You can share your environment configuration if you are using Spack.) It might be easiest to debug these kinds of initialization problems by launching the server in an interactive session. You can set the FI_LOG_LEVEL=debug environment variable to get more detailed information out of libfabric.
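One way to follow this suggestion in an interactive session is to wrap the command so that both variables are set only for that run and the log output is kept in a file. This is a sketch under assumptions: `run_with_fi_debug` and the log file name are hypothetical, and it assumes libfabric writes its log to stderr. `FI_PROVIDER` and `FI_LOG_LEVEL` are the libfabric variables named in this thread.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with a given provider and debug
# logging enabled, saving stderr (assumed to carry the libfabric log)
# to a file. The variables are set for this one command only.
run_with_fi_debug() {
    provider=$1
    logfile=$2
    shift 2
    FI_PROVIDER=$provider FI_LOG_LEVEL=debug "$@" 2>"$logfile"
}

# Example (inside an interactive allocation):
#   run_with_fi_debug cxi fabtget.log ./fabtget
```

Scoping the variables to a single command keeps the rest of the session on its default provider, which makes it easier to compare runs.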
from fabtsuite.
On Crusher, the system provides libfabric 1.15.0. On Theta, I use spack install fabtsuite ^libfabric fabrics=gni,tcp,udp,rxd,rxm.
from fabtsuite.
I ran the test again after adding the following lines to the test/wait.slurm script:
export FI_PROVIDER=cxi
export FI_LOG_LEVEL=debug
I used the system libfabric.
The test failed with a timeout on Crusher, and libfabric reported a detailed error message:
libfabric:107344:1667248504:cxi:cq:cxip_cq_verify_attr():840<warn> crusher008: \
CQ wait objects not supported
get_state_open.4216: fi_cq_open: Function not implemented
real 0.09
user 0.00
sys 0.01
1
srun: error: crusher008: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=208067.1
0
I could also verify that the address returned by fabtget differs from the one returned with the tcp provider.
So I think fabtsuite is able to test a different provider.
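To make that check less manual, the provider name can be pulled out of the debug log itself. The sketch below assumes every log line has the same colon-separated layout as the warning above (`libfabric:<pid>:<timestamp>:<provider>:<subsystem>:...`); other libfabric versions may format lines differently, and `providers_in_log` is a hypothetical helper, not part of fabtsuite.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct provider names that appear in
# libfabric log lines. Assumes the provider is the 4th colon-separated
# field, as in the Crusher warning quoted in this thread.
# Reads the named files, or stdin when no file is given.
providers_in_log() {
    awk -F: '/^libfabric:/ { print $4 }' "$@" | sort -u
}

# Example using the warning line from the Crusher run:
printf '%s\n' \
  'libfabric:107344:1667248504:cxi:cq:cxip_cq_verify_attr():840<warn> crusher008: CQ wait objects not supported' \
  | providers_in_log
# prints: cxi
```

This only shows which providers logged something, so it is a hint about what was exercised rather than a definitive report of the provider selected for the transfer.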
from fabtsuite.
@carns, I tested the rest of the suite today and the tests worked fine. Do you want me to update the Slurm job scripts (e.g., cross.slurm) to use CXI? Or just update the documentation, such as the FAQ?
from fabtsuite.
Thanks @hyoklee.
Both if you don't mind. The script can be hardcoded to use cxi; that's likely to be the only thing we test on Crusher. The doc can describe more generically how to set the test to exercise a particular provider (cxi or otherwise).
As a side note since we have mentioned platform-specific test scripts: the .slurm etc. files would be a little clearer if the names of the files included the machine name. There are a lot of slurm, qsub, etc. systems out there but what actually needs to be executed within the script is likely platform-specific. If the current naming is important to the overall test flow then maybe just a comment at the top of each one that says something like "# test script for the Polaris system @alcf".
from fabtsuite.