
bytedance / ps-lite

This project is forked from dmlc/ps-lite.


A lightweight parameter server interface

Home Page: http://ps-lite.readthedocs.org

License: Apache License 2.0

Languages: CMake 1.47%, Makefile 0.97%, C++ 88.83%, C 0.31%, Python 6.49%, Shell 1.93%
Topics: deep-learning, distributed-training, rdma, mxnet

ps-lite's Introduction

This is the communication library for BytePS. It is designed for high-performance RDMA, but it also supports TCP.

Build

git clone -b byteps https://github.com/bytedance/ps-lite
cd ps-lite 
make -j USE_RDMA=1
  • Remove USE_RDMA=1 if you don't want to build with RDMA ibverbs support.
  • Add USE_FABRIC=1 if you want to build with RDMA libfabric support for the AWS Elastic Fabric Adapter.

To build ps-lite with UCX:

# dependencies
sudo apt install -y build-essential libtool autoconf automake libnuma-dev unzip pkg-config

# build ucx
wget https://github.com/openucx/ucx/archive/refs/tags/v1.11.1.tar.gz
tar -xf v1.11.1.tar.gz
cd ucx-1.11.1
# run autogen.sh, retrying once if the first attempt fails
(./autogen.sh || ./autogen.sh) && ./configure --enable-logging --enable-mt --with-verbs --with-rdmacm --with-cuda=/usr/local/cuda
make clean && make -j && sudo make install -j

# build ps-lite
cd ..
make clean; USE_UCX=1 CUDA_HOME=/usr/local/cuda USE_CUDA=1 make -j

BytePS relies on UCXVan for GPU-related communication, such as intra-node CUDA IPC and inter-node GPU-to-GPU / GPU-to-CPU communication with GPUDirect RDMA. For the list of transports UCX supports, refer to the UCX documentation.

Concepts

In ps-lite there are three roles: worker, server, and scheduler. Each role is an independent process.

The scheduler is responsible for setting up the connections between workers and servers at initialization. There must be exactly one scheduler process.

A worker process communicates only with server processes, and vice versa: there is no worker-to-worker or server-to-server traffic.
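
A minimal sketch of how the three roles map onto code, based on the standard ps-lite API (exact signatures may differ slightly in this fork):

#include "ps/ps.h"

int main(int argc, char* argv[]) {
  ps::Start(0);                  // role is read from DMLC_ROLE in the environment
  if (ps::IsScheduler()) {
    // the scheduler only coordinates connection setup; nothing else to do here
  } else if (ps::IsServer()) {
    // create a ps::KVServer and register a request handler here
  } else if (ps::IsWorker()) {
    // create a ps::KVWorker and issue Push/Pull requests here
  }
  ps::Finalize(0, true);         // barrier across all roles, then shut down
  return 0;
}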

Tutorial

After building, you will find two test applications under the tests/ directory, namely test_benchmark and test_ipc_benchmark. Below we explain how to run them.

To debug, set PS_VERBOSE=1 to see important logs during connection setup, or PS_VERBOSE=2 to additionally log every message.

1. Basic benchmark

Suppose you want to run with 1 worker and 1 server on different machines. You then need to launch 3 processes in total (including the scheduler). The scheduler process can be launched on any machine, as it does not affect performance.

For the scheduler:

# common setup
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1 
export DMLC_PS_ROOT_URI=10.0.0.2  # scheduler's RDMA interface IP 
export DMLC_PS_ROOT_PORT=8123     # scheduler's port (can be chosen arbitrarily)
export DMLC_INTERFACE=eth5        # my RDMA interface 

# launch scheduler
DMLC_ROLE=scheduler ./tests/test_benchmark

For the server:

# common setup
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1 
export DMLC_PS_ROOT_URI=10.0.0.2  # scheduler's RDMA interface IP 
export DMLC_PS_ROOT_PORT=8123     # scheduler's port (can be chosen arbitrarily)
export DMLC_INTERFACE=eth5        # my RDMA interface 

# launch server
DMLC_ROLE=server ./tests/test_benchmark

For the worker:

# common setup
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1 
export DMLC_PS_ROOT_URI=10.0.0.2  # scheduler's RDMA interface IP 
export DMLC_PS_ROOT_PORT=8123     # scheduler's port (can be chosen arbitrarily)
export DMLC_INTERFACE=eth5        # my RDMA interface 

# launch worker
DMLC_ROLE=worker ./tests/test_benchmark

If you want to use libfabric with the Amazon Elastic Fabric Adapter, make sure to set DMLC_ENABLE_RDMA=fabric for all processes. If you are using libfabric < 1.10, please also set FI_EFA_ENABLE_SHM_TRANSFER=0 to avoid a bug in the EFA shm provider.

If you just want to use TCP, make sure to unset DMLC_ENABLE_RDMA for all processes.

2. Benchmark with IPC support

The test_ipc_benchmark demonstrates how inter-process communication (IPC) helps improve RDMA performance when the server is co-located with the worker.

Suppose you have two machines. Each machine should launch a worker and a server process.

For the scheduler: (you can launch it on either machine-0 or machine-1)

# common setup
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2 
export DMLC_PS_ROOT_URI=10.0.0.2  # scheduler's RDMA interface IP 
export DMLC_PS_ROOT_PORT=8123     # scheduler's port (can be chosen arbitrarily)
export DMLC_INTERFACE=eth5        # my RDMA interface 

# launch scheduler
DMLC_ROLE=scheduler ./tests/test_ipc_benchmark

For machine-0 and machine-1:

# common setup
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2 
export DMLC_PS_ROOT_URI=10.0.0.2  # scheduler's RDMA interface IP 
export DMLC_PS_ROOT_PORT=8123     # scheduler's port (can be chosen arbitrarily)
export DMLC_INTERFACE=eth5        # my RDMA interface 

# launch server and worker
DMLC_ROLE=server ./tests/test_ipc_benchmark &
DMLC_ROLE=worker ./tests/test_ipc_benchmark 

Note: This benchmark is only valid for RDMA.

3. Other GPU-related benchmarks

cd tests
NODE_ONE_IP=xxx NODE_TWO_IP=yyy bash test.sh (local|remote|joint) <bytes_per_msg> <msg_count> (push_only|pull_only|push_pull) (cpu2cpu|cpu2gpu|gpu2gpu|gpu2cpu)

ps-lite's People

Contributors

bobzhuyb, brminich, changlan, codingcat, crazyboycjr, cykustcc, dmitrygx, eric-haibin-lin, hitzzc, jasperzhong, madjam, mli, nhynes, pleasantrabbit, qiaohaijun, rahul003, reyoung, shilad, solin319, subenle, szha, tanguofu, tqchen, travisbarrydick, willzhang4a58, yajiedesign, ymjiang, yzhliu, zhouhaiy, ziyuehuang


ps-lite's Issues

error with RDMA

Hi, developers of ps-lite. I am using the RDMA version of ps-lite. When I run test_benchmark, an error occurs. The error log and launch script are pasted below. What is the problem, and how can I solve it? Thank you!

(recenv) [bob@need08 tests]$ ./local_multi_workers_RDMA.sh 1 1 test_benchmark
./local_multi_workers_RDMA.sh: line 28: test_benchmark: command not found
./local_multi_workers_RDMA.sh: line 36: test_benchmark: command not found
./local_multi_workers_RDMA.sh: line 45: test_benchmark: command not found
(recenv) [guowei@need08 tests]$ ./local_multi_workers_RDMA.sh 1 1 ./test_benchmark
[12:58:33] tests/test_benchmark.cc:494: 1 ports per node
[12:58:33] tests/test_benchmark.cc:499: recv buffer registration is NOT enabled
[12:58:33] tests/test_benchmark.cc:504: TEST_NUM_GPU_WORKER = 0
[12:58:33] tests/test_benchmark.cc:507: TEST_NUM_GPU_SERVER = 0
[12:58:33] src/postoffice.cc:60: Creating Van: 1. group_size=1
[12:58:33] src/van.cc:88: Creating RDMAVan.
[12:58:33] src/van.cc:89: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[12:58:33] src/./rdma_van.h:46: Shared memory IPC has been disabled
[12:58:33] tests/test_benchmark.cc:494: 1 ports per node
[12:58:33] tests/test_benchmark.cc:499: recv buffer registration is NOT enabled
[12:58:33] tests/test_benchmark.cc:504: TEST_NUM_GPU_WORKER = 0
[12:58:33] tests/test_benchmark.cc:507: TEST_NUM_GPU_SERVER = 0
[12:58:33] src/postoffice.cc:60: Creating Van: 1. group_size=1
[12:58:33] src/van.cc:88: Creating RDMAVan.
[12:58:33] src/van.cc:89: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[12:58:33] src/./rdma_van.h:46: Shared memory IPC has been disabled
[12:58:33] tests/test_benchmark.cc:494: 1 ports per node
[12:58:33] tests/test_benchmark.cc:499: recv buffer registration is NOT enabled
[12:58:33] tests/test_benchmark.cc:504: TEST_NUM_GPU_WORKER = 0
[12:58:33] tests/test_benchmark.cc:507: TEST_NUM_GPU_SERVER = 0
[12:58:33] src/postoffice.cc:60: Creating Van: 1. group_size=1
[12:58:33] src/van.cc:88: Creating RDMAVan.
[12:58:33] src/van.cc:89: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[12:58:33] src/./rdma_van.h:46: Shared memory IPC has been disabled
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 32767 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 32767 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 8 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:898: OnDisconnected from Node 32767
[12:58:35] src/./rdma_van.h:898: OnDisconnected from Node 32767
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:35] ./include/dmlc/logging.h:276: [12:58:35] src/./rdma_transport.h:145: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 8 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x555a232143e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x555a232147fb]
[bt] (2) ./test_benchmark(+0x598eb) [0x555a2325f8eb]
[bt] (3) ./test_benchmark(+0x6451d) [0x555a2326a51d]
[bt] (4) ./test_benchmark(+0x64feb) [0x555a2326afeb]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f7cc9e2e6df]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f7cc99416db]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f7cc966aa3f]

terminate called after throwing an instance of 'dmlc::Error'
what(): [12:58:35] src/./rdma_transport.h:145: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 8 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x555a232143e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x555a232147fb]
[bt] (2) ./test_benchmark(+0x598eb) [0x555a2325f8eb]
[bt] (3) ./test_benchmark(+0x6451d) [0x555a2326a51d]
[bt] (4) ./test_benchmark(+0x64feb) [0x555a2326afeb]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f7cc9e2e6df]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f7cc99416db]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f7cc966aa3f]

[12:58:36] src/./rdma_van.h:898: OnDisconnected from Node 1
[12:58:36] src/./rdma_van.h:898: OnDisconnected from Node 1
[12:58:36] src/./rdma_van.h:898: OnDisconnected from Node 1
./local_multi_workers_RDMA.sh: line 48: 42385 Aborted (core dumped) ${bin} ${arg}
[12:58:36] ./include/dmlc/logging.h:276: [12:58:36] src/./rdma_van.h:616: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
Work Request Flushed Error 5 140478784480192 245 postoffice ptr: 0x55886602cf60

Stack trace returned 6 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x55886420f3e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x55886420f7fb]
[bt] (2) ./test_benchmark(+0x630fd) [0x5588642640fd]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7fc3dcfca6df]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fc3dcadd6db]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fc3dc806a3f]

terminate called after throwing an instance of 'dmlc::Error'
what(): [12:58:36] src/./rdma_van.h:616: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
Work Request Flushed Error 5 140478784480192 245 postoffice ptr: 0x55886602cf60

Stack trace returned 6 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x55886420f3e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x55886420f7fb]
[bt] (2) ./test_benchmark(+0x630fd) [0x5588642640fd]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7fc3dcfca6df]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fc3dcadd6db]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fc3dc806a3f]

[12:58:37] src/./rdma_van.h:898: OnDisconnected from Node 9
[12:58:37] src/./rdma_van.h:898: OnDisconnected from Node 9


UCXVan A100 multi-GPU support

Background

Capability of UCX

  • With the current version of openucx, a UCX context cannot choose the optimal route for multiple GPUs. However, if a UCX context and its data are bound to a particular GPU, the data will be sent/received over the optimal route.
  • Memory pinning: each time a memory region is pinned/unpinned for GPUDirect RDMA, there is some overhead. Ideally we want to avoid changing the memory buffer address when sending/receiving GPU data with GDR.

Use cases

Testing code #46

  • 1 PSWorker per node. The PSWorker handles the data from all 8 GPUs and the CPU on that node within a single process.
  • Multiple PSWorkers per node. There are multiple processes, and each process has a PSWorker that handles the data of a particular GPU (and the CPU).

Proposed APIs

Ps-lite user experience

// assume we have previously done cudaMalloc for `data` (char*) on GPU 0.
DeviceType curr_device_type = ps::kGPU;
int curr_device_id = 0;
// specify the target device
DeviceType tgt_device_type = ps::kGPU;
int tgt_device_id = 2;
SArray<char> vals(data, len, false,
                  curr_device_type, curr_device_id,
                  tgt_device_type, tgt_device_id);

PSKV pskv;
pskv.keys.push_back(0);
pskv.lens.push_back(1024000);
pskv.size = 1024000;
kv_worker->ZPush(pskv.keys, vals, pskv.lens);

// server side: the request handler can inspect the device info in req_meta
void RequestHandler(const KVMeta& req_meta, const KVPairs<Val>& req_data, KVServer<Val>* server) {
  if (req_meta.dst_device_type == kGPU) {
    // do something with the received GPU data
  } else {
    // do something with the received CPU data
  }
};

Ps-lite data structure changes

Message.meta stores the device information

enum DeviceType {
  kCPU,
  kGPU
};

struct Meta {
 DeviceType src_dev_type;
 uint8_t src_dev_id;
 DeviceType dst_dev_type;
 uint8_t dst_dev_id;
 ...
}

// now each ucp_context is bound to a port for a ucp_listener and a ucp_worker
// we replace the monolithic `int port` with an array of ports and devices
struct Node {
  uint8_t num_ports;
  int ports[32];
  DeviceType device_types[32];
  uint8_t device_ids[32];
}

Device extension for SArray

template<typename V>
class SArray {
 public:
  // Zero-copy constructor for device data
  SArray(V* data, size_t size, bool deletable,
         DeviceType curr_device_type, int curr_device_id,
         DeviceType tgt_device_type, int tgt_device_id);

  // copy-assignment operator
  template <typename W> void operator=(const SArray<W>& arr);

 private:
   DeviceType curr_device_type_;
   int curr_device_id_;
   DeviceType tgt_device_type_;
   int tgt_device_id_;
};

KVServer buffer registration

// To send data from a local GPU to a remote GPU, the remote GPU needs to
// designate a memory buffer that can hold the result, so that we avoid
// an extra memory copy.
class KVServer {
 public:
  void RegisterRecvBuffer(SArray<Key>& keys, const SArray<Val>& vals,
                          const SArray<int>& lens = {}, int cmd = 0);
};
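
A hypothetical usage sketch of the proposed API above, assuming `gpu_buf` points to a previously cudaMalloc'ed region of `len` bytes:

// server side: pre-register a GPU buffer so that incoming pushes for
// key 0 land directly in gpu_buf without an extra copy
SArray<Key> keys;
keys.push_back(0);
SArray<char> vals(gpu_buf, len, false);  // zero-copy view, not owned
SArray<int> lens;
lens.push_back(len);
server->RegisterRecvBuffer(keys, vals, lens);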

UCXVan

class UCXVan : public Van {
 public:
  // creates multiple ucp_listeners and ports based on `CUDA_VISIBLE_DEVICES`;
  // updates `my_node.ports` directly inside `Bind`
  // (the return type is changed from int to void)
  void Bind(const Node &my_node, int max_retry);

  // connects to the target node based on the port information
  // stored inside `node`
  void Connect(const Node &node);

  // picks the corresponding ucp_context based on src_device_id
  // and dst_device_id
  int SendMsg(Message& msg);

  // sets the curr_device_id upon receiving the message
  int RecvMsg(Message& msg);

  // registers the receiving buffer for the provided keys
  void RegisterRecvBuffer(SArray<Key>& keys, const SArray<Val>& vals,
                          const SArray<int>& lens = {}, int cmd = 0);

 private:
  // multiple ucx contexts;
  // also need to store the mapping for context <-> dev_type / dev_id
  std::vector<ucp_context_h> contexts_;
};

BytePS changes

This part should be handled by the BytePS dev team.

Memory management

Since using GDR with a frequently changing GPU data address can incur pinning/unpinning overhead, reusing the same GPU buffer is preferred. There are two options:

  1. BytePS manages the buffers registered for communication. Upon each push, we copy the original tensor into the communication buffer before sending it out. Similarly, we use the communication buffer to hold the result of a pull, and then copy the data to the destination tensor. (A sketch of this option appears below.)
  2. BytePS directly passes whatever GPU memory needs to be sent/received, and relies on UCX's lazy pinning optimizations to mitigate the overhead.

The choice mainly affects BytePS's memory management implementation. From UCXVan's point of view, it always tries to pin the GPU data passed to it.
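
A minimal sketch of option 1, assuming `comm_buf` is a hypothetical persistent GPU staging buffer whose address never changes, and PSKV is the same helper struct used in the user-experience example above:

#include <cuda_runtime.h>

// copy the freshly produced tensor into the fixed communication buffer,
// then push from that buffer; the address seen by ps-lite stays stable,
// so GDR pinning happens only once
void PushViaCommBuffer(ps::KVWorker<char>* kv, const PSKV& pskv,
                       const char* tensor, char* comm_buf, size_t len) {
  cudaMemcpy(comm_buf, tensor, len, cudaMemcpyDeviceToDevice);
  ps::SArray<char> vals(comm_buf, len, false);  // zero-copy view
  kv->Wait(kv->ZPush(pskv.keys, vals, pskv.lens));
}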

PushPull and key encoding on A100

On A100 nodes, each PCIe switch is connected to 2 GPUs and a NIC. A pushpull call in BytePS becomes the following series of operations on the local node:

  • reduce-scatter, which results in 8 GPU tensors
  • ZPush/ZPull with the (same) key and corresponding device IDs
  • all-gather

Note that the above ps-lite change does not affect how the ps-lite scheduler assigns ranks to each node. BytePS partitions a tensor as usual (1 ps key for every 4 MB data chunk). The key encoding logic also stays the same: the first x bits are reserved for the target node id. When sending keys whose data resides on different GPUs, we just need to annotate the source SArray with the corresponding device id.
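
For illustration only (the bit width x and the exact layout are assumptions here, not the actual BytePS encoding, and we assume the default 64-bit ps::Key), reserving the top x bits of a key for the target node id could look like:

const int kNodeBits = 4;  // hypothetical x = 4
ps::Key EncodeKey(int node_id, ps::Key local_key) {
  // pack the target node id into the top kNodeBits bits of the key
  return (static_cast<ps::Key>(node_id) << (64 - kNodeBits)) | local_key;
}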

error when sending multiple keys in one message

I want to apply the RDMA version of ps-lite to a sparse case in which one message contains multiple keys and multiple values. However, an error occurs: the worker sends 100 keys, but the server only receives 1 key. I wonder whether this implementation only works in the dense case, such as test_benchmark.cc and the byteps package, where one message contains only one key and many values. Thank you.
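
For reference, a minimal sketch of the multi-key push pattern described above, using the standard ps-lite KVWorker API (the key count and per-key lengths are hypothetical):

std::vector<ps::Key> keys;   // 100 distinct, sorted keys in one message
std::vector<float> vals;
std::vector<int> lens;
for (int k = 0; k < 100; ++k) {
  keys.push_back(k);
  lens.push_back(16);                    // 16 values per key
  vals.insert(vals.end(), 16, 1.0f);
}
ps::KVWorker<float> kv(0, 0);            // (app_id, customer_id)
int ts = kv.ZPush(ps::SArray<ps::Key>(keys), ps::SArray<float>(vals),
                  ps::SArray<int>(lens));
kv.Wait(ts);                             // block until the push completes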

ucx van with GDR

If I want to register a GPU buffer, then use that buffer with ucp_tag_send_nb(), what are the steps to do that? Is there code samples I can use as a reference?

These are some memory-registration-related settings in UCX:

    {"REG_METHODS", "rcache,odp,direct",
     "List of registration methods in order of preference. Supported methods are:\n"
     "  odp         - implicit on-demand paging\n"
     "  rcache      - userspace registration cache\n"
     "  direct      - direct registration\n",
     ucs_offsetof(uct_ib_md_config_t, reg_methods), UCS_CONFIG_TYPE_STRING_ARRAY},

Is export UCX_IB_REG_METHODS=rcache what I need to manually register memory regions?
