Comments (12)

ymjiang avatar ymjiang commented on July 20, 2024

RDMA does not necessarily need cuda10. For now, the only docker files we provide with RDMA support are based on cuda10, but you can try to build your own cuda9+RDMA image.

from byteps.

yangwenhuan avatar yangwenhuan commented on July 20, 2024

@ymjiang
image


ymjiang avatar ymjiang commented on July 20, 2024

The images we provide in the tutorials need cuda10. But that does not mean RDMA needs cuda10. Like I said, you can build your own images with cuda9, and make sure to install RDMA drivers properly.


yangwenhuan avatar yangwenhuan commented on July 20, 2024

@ymjiang, I used RoCE but got an 'unreachable' error. Could it be related to network bonding?

image


ymjiang avatar ymjiang commented on July 20, 2024

Did you set DMLC_PS_ROOT_URI and DMLC_INTERFACE correctly according to your RDMA device?

Can you run through this benchmark with RDMA? Remember to use make -j USE_RDMA=1 for building and export DMLC_ENABLE_RDMA=1 for running.
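A minimal sketch of the runtime settings mentioned above. Only the variable names and DMLC_ENABLE_RDMA=1 come from this thread; the interface name eth2 and the address 10.0.0.1 are hypothetical placeholders that you would replace with the network interface backed by your RDMA device and the scheduler's IP address on that network.

```shell
# Build step from the comment above (run it in the source tree you are building):
#   make -j USE_RDMA=1

# Runtime settings. The two values below are hypothetical placeholders.
export DMLC_ENABLE_RDMA=1          # enable the RDMA transport at run time
export DMLC_INTERFACE=eth2         # assumption: interface backed by the RDMA NIC
export DMLC_PS_ROOT_URI=10.0.0.1   # assumption: scheduler IP reachable over that interface

# Print the settings so they can be compared across machines.
printf 'DMLC_ENABLE_RDMA=%s DMLC_INTERFACE=%s DMLC_PS_ROOT_URI=%s\n' \
  "$DMLC_ENABLE_RDMA" "$DMLC_INTERFACE" "$DMLC_PS_ROOT_URI"
```

Printing the values on every node makes it easier to spot a machine where the interface or root URI differs from the rest.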


yangwenhuan avatar yangwenhuan commented on July 20, 2024

@ymjiang, I am not sure about DMLC_INTERFACE. How do I check it?

image

image
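One possible way to find candidate values for DMLC_INTERFACE, sketched under the assumption that the host exposes RDMA devices through the standard Linux sysfs tree at /sys/class/infiniband. The function name list_rdma_netdevs is made up for this example. With bonding, sysfs may list the member interfaces rather than the bond itself, so the interface that actually carries the IP address may be the bond.

```shell
# Print "<rdma_device> <net_interface>" pairs found under sysfs.
# The optional argument overrides the sysfs root (handy for testing).
list_rdma_netdevs() {
  sysfs_root="${1:-/sys/class/infiniband}"
  for dev in "$sysfs_root"/*; do
    # Skip devices without an associated network interface directory.
    [ -d "$dev/device/net" ] || continue
    for netdev in "$dev"/device/net/*; do
      [ -e "$netdev" ] || continue
      echo "$(basename "$dev") $(basename "$netdev")"
    done
  done
}

# Prints nothing if no RDMA device is visible on this machine.
list_rdma_netdevs
```

If this prints nothing inside a container but prints pairs on the host, the container probably does not see the RDMA devices.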


ymjiang avatar ymjiang commented on July 20, 2024

Looks like you are using docker. When you launched the docker container, did you add the --device arguments as shown in the tutorial? It should carry your RDMA device information on the host machine.
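A quick way to confirm that the device nodes made it into the container, as a sketch only. The helper name check_device_nodes is hypothetical, and the device paths are the ones from the docker run command later in this thread; the numbering (uverbs0, umad0, and so on) depends on the host.

```shell
# Report whether each given device node exists; return non-zero if any is missing.
check_device_nodes() {
  missing=0
  for node in "$@"; do
    if [ -e "$node" ]; then
      echo "ok       $node"
    else
      echo "missing  $node"
      missing=1
    fi
  done
  return "$missing"
}

# Run inside the container. Adjust the list to match the host's devices.
check_device_nodes \
  /dev/infiniband/rdma_cm \
  /dev/infiniband/uverbs0 \
  /dev/infiniband/umad0 \
  || echo "Some RDMA device nodes are not visible; check the --device arguments."
```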


yangwenhuan avatar yangwenhuan commented on July 20, 2024

image

image

@ymjiang, these screenshots are from my host machine, not from inside docker.


yangwenhuan avatar yangwenhuan commented on July 20, 2024

image

This screenshot was taken inside docker, and I did specify the --device options:

docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK bytepsimage/byteps_server_rdma bash


bobzhuyb avatar bobzhuyb commented on July 20, 2024

This may be due to bonding. I am not sure. Can you try ib_write_bw with rdma_cm enabled? It should be ib_write_bw -z, or ib_write_bw --com_rdma_cm

Can you also check the configuration details of bonding here https://community.mellanox.com/s/article/howto-create-linux-bond--lag--interface-over-infiniband-network

Quote:

Since rdma_cm does not support MAC enslavement, fail_over_mac=1 (or active) should be added to the bond interface configuration in case of Ethernet interface.
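To see what the bond is currently configured with, a sketch like the following may help. It assumes the Linux bonding driver's sysfs interface, the bond name bond0 is a hypothetical placeholder, and the helper name bond_fail_over_mac is made up for this example. The file typically reports the mode name followed by its numeric value.

```shell
# Print the fail_over_mac setting of a bond, or "unknown" if it cannot be read.
# The optional second argument overrides the sysfs root (handy for testing).
bond_fail_over_mac() {
  bond="$1"
  net_root="${2:-/sys/class/net}"
  setting_file="$net_root/$bond/bonding/fail_over_mac"
  if [ -r "$setting_file" ]; then
    cat "$setting_file"
  else
    echo "unknown"
    return 1
  fi
}

# bond0 is an assumption; use the name of your own bond interface.
bond_fail_over_mac bond0 || echo "Could not read fail_over_mac for bond0."
```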


yangwenhuan avatar yangwenhuan commented on July 20, 2024

image

I tried this before, and it seemed OK.


bobzhuyb avatar bobzhuyb commented on July 20, 2024

@yangwenhuan As I said, please use the rdma_cm option when running ib_send_bw.
