
Comments (9)

carns commented on July 29, 2024

I'm not familiar enough with the libfabric API to know how to do it programmatically off the top of my head. You could maybe look at how Mercury selects providers in na_ofi.c.

I'm not sure how well it will work with the code as structured, but you can also set a runtime environment variable (e.g. in the job scripts) that restricts the set of providers libfabric is allowed to use. This is the FI_PROVIDER environment variable, which takes a comma-separated list of providers and acts like an allow list. See https://ofiwg.github.io/libfabric/main/man/fabric.7.html.

On Polaris we want to test "verbs,rxm" (two providers are required to use verbs in reliable datagram mode), on Crusher we want "cxi", and on Theta we want "gni", for example. I guess you could try setting that (or whatever is appropriate for your test platform) and see if the tests execute.
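The per-platform choices above could be captured in a job script along the following lines. This is only a sketch: the function name provider_for_host is hypothetical, and the host-name patterns are assumptions based on the login prompts that appear elsewhere in this thread.

```shell
# Map a host name to the FI_PROVIDER value suggested in this thread.
# The host-name patterns are assumptions, not something fabtsuite defines.
provider_for_host() {
    case "$1" in
        polaris*) echo "verbs,rxm" ;;  # verbs needs rxm for reliable datagram mode
        crusher*) echo "cxi" ;;
        theta*)   echo "gni" ;;
        *)        return 1 ;;          # unknown platform: leave the choice to libfabric
    esac
}

# Only restrict libfabric when the platform is recognized.
if provider=$(provider_for_host "$(hostname)"); then
    FI_PROVIDER=$provider
    export FI_PROVIDER
fi
```

On an unrecognized host the sketch leaves FI_PROVIDER untouched, so libfabric falls back to its own default selection.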

Based on this discussion it sounds like we really need the test output to report what provider was used (independent of what was attempted) for validation. I don't know what provider is being selected by default in the tests thus far, but if it is the tcp provider that's not really the transport we want to be testing.

from fabtsuite.

carns commented on July 29, 2024

@hyoklee can you confirm that the rest of the test suite passes on cxi?

from fabtsuite.

hyoklee commented on July 29, 2024

I don't know. @gnuoyd , do you know? @derobins , please answer this question if you know.

@carns , how do other projects specify a provider?

from fabtsuite.

hyoklee commented on July 29, 2024

When FI_PROVIDER is set to gni, I get the following error:

hyoklee@thetalogin6:~/fabtsuite-m/build/transfer> ./fabtget
0.000000083 capabilities not available?
main.4785: fi_getinfo: No data available

When FI_PROVIDER is set to cxi, test hangs:

[[email protected] build]$ export FI_PROVIDER=cxi
[[email protected] build]$ ctest -I 1,1
Test project /ccs/home/hyoklee/fabtsuite/build
    Start 1: single-node
  C-c C-c

When FI_PROVIDER is set to verbs,rxm on Polaris, the test hangs with libfabric 1.15.0:

hyoklee@polaris-login-02:~/fabtsuite/build> export FI_PROVIDER=verbs
hyoklee@polaris-login-02:~/fabtsuite/build> ctest -I 1,1
Test project /home/hyoklee/fabtsuite/build
    Start 1: single-node
  C-c C-c

from fabtsuite.

carns commented on July 29, 2024

When FI_PROVIDER is set to gni, I get the following error:

hyoklee@thetalogin6:~/fabtsuite-m/build/transfer> ./fabtget
0.000000083 capabilities not available?
main.4785: fi_getinfo: No data available

When FI_PROVIDER is set to cxi, test hangs:

[[email protected] build]$ export FI_PROVIDER=cxi
[[email protected] build]$ ctest -I 1,1
Test project /ccs/home/hyoklee/fabtsuite/build
    Start 1: single-node
  C-c C-c

How are you building libfabric? (You can share your environment configuration if you are using Spack.) It might be easiest to debug these kinds of initialization problems by launching the server in an interactive session. You can set the FI_LOG_LEVEL=debug environment variable to get more detailed information out of libfabric.
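For an interactive session, a small wrapper along these lines keeps both settings scoped to a single command. It is a minimal sketch: the function name run_with_fi_debug is hypothetical, while FI_PROVIDER and FI_LOG_LEVEL are the libfabric variables mentioned in this thread.

```shell
# Run one command with a chosen provider and libfabric debug logging,
# without changing the environment of the calling shell.
run_with_fi_debug() {
    provider=$1
    shift
    env FI_PROVIDER="$provider" FI_LOG_LEVEL=debug "$@"
}
```

For example, run_with_fi_debug cxi ./fabtget 2> fabtget.log would keep the debug output in a file for later inspection, assuming libfabric writes its log to stderr.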

from fabtsuite.

hyoklee commented on July 29, 2024

On Crusher, I use the libfabric 1.15.0 that the system provides. On Theta, I build with spack install fabtsuite ^libfabric fabrics=gni,tcp,udp,rxd,rxm.

from fabtsuite.

hyoklee commented on July 29, 2024

I ran the test again after adding the following lines to the test/wait.slurm script:

export FI_PROVIDER=cxi
export FI_LOG_LEVEL=debug

I used the system libfabric.
The test failed with a timeout on Crusher.
With debug logging enabled, Crusher reported a detailed error message:

libfabric:107344:1667248504:cxi:cq:cxip_cq_verify_attr():840<warn> crusher008: \
CQ wait objects not supported
get_state_open.4216: fi_cq_open: Function not implemented
real 0.09
user 0.00
sys 0.01
1
srun: error: crusher008: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=208067.1
0

I could also verify that the address returned by fabtget is different from the one returned with the tcp provider.

So fabtsuite does appear to be able to test a provider other than tcp.
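One rough way to check which provider actually produced log output is to pull the provider field out of the libfabric log lines. The sketch below assumes the colon-separated layout seen in the Crusher log above (libfabric, pid, time, provider, subsystem); other libfabric versions may format lines differently, and the function name providers_in_log is hypothetical.

```shell
# Print the distinct provider names found in libfabric log lines on stdin.
# Assumes the layout libfabric:<pid>:<time>:<provider>:<subsystem>:...
providers_in_log() {
    awk -F: '$1 == "libfabric" && NF >= 5 { print $4 }' | sort -u
}
```

Piping a captured job log through providers_in_log lists each provider that logged something. That shows the provider was loaded and exercised, which is weaker than proof that the endpoint ended up using it.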

from fabtsuite.

hyoklee commented on July 29, 2024

@carns , I tested the rest of the suite today and it worked fine. Do you want me to update the Slurm job scripts (e.g., cross.slurm) to use CXI? Or just update the documentation, such as the FAQ?

from fabtsuite.

carns commented on July 29, 2024

Thanks @hyoklee .

Both if you don't mind. The script can be hardcoded to use cxi; that's likely to be the only thing we test on Crusher. The doc can describe more generically how to set the test to exercise a particular provider (cxi or otherwise).

As a side note since we have mentioned platform-specific test scripts: the .slurm etc. files would be a little clearer if the names of the files included the machine name. There are a lot of slurm, qsub, etc. systems out there but what actually needs to be executed within the script is likely platform-specific. If the current naming is important to the overall test flow then maybe just a comment at the top of each one that says something like "# test script for the Polaris system @alcf".
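As an illustration of both suggestions, the top of a Crusher job script might look roughly like this. It is a sketch only: the header wording is hypothetical, and the scheduler directives and the actual test commands are left out.

```shell
# test script for the Crusher system

# Hardcode the provider for this platform, as discussed above.
FI_PROVIDER=cxi
export FI_PROVIDER

# Echo the setting so that the job output records what was requested.
echo "FI_PROVIDER=$FI_PROVIDER"
```

The echoed value records only what was requested; it does not confirm which provider libfabric ended up selecting.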

from fabtsuite.
