bytedance / ps-lite

This project is forked from dmlc/ps-lite.


A lightweight parameter server interface

Home Page: http://ps-lite.readthedocs.org

License: Apache License 2.0

CMake 1.47% Makefile 0.97% C++ 88.83% C 0.31% Python 6.49% Shell 1.93%
deep-learning distributed-training rdma mxnet

ps-lite's Issues

error with RDMA

Hi, developers of ps-lite. I am using the RDMA version of ps-lite. When I run test_benchmark, an error occurs. The error log and launch-script output are attached below. What is the problem, and how can I solve it? Thank you!

(recenv) [bob@need08 tests]$ ./local_multi_workers_RDMA.sh 1 1 test_benchmark
./local_multi_workers_RDMA.sh: line 28: test_benchmark: command not found
./local_multi_workers_RDMA.sh: line 36: test_benchmark: command not found
./local_multi_workers_RDMA.sh: line 45: test_benchmark: command not found
(recenv) [guowei@need08 tests]$ ./local_multi_workers_RDMA.sh 1 1 ./test_benchmark
[12:58:33] tests/test_benchmark.cc:494: 1 ports per node
[12:58:33] tests/test_benchmark.cc:499: recv buffer registration is NOT enabled
[12:58:33] tests/test_benchmark.cc:504: TEST_NUM_GPU_WORKER = 0
[12:58:33] tests/test_benchmark.cc:507: TEST_NUM_GPU_SERVER = 0
[12:58:33] src/postoffice.cc:60: Creating Van: 1. group_size=1
[12:58:33] src/van.cc:88: Creating RDMAVan.
[12:58:33] src/van.cc:89: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[12:58:33] src/./rdma_van.h:46: Shared memory IPC has been disabled
[12:58:33] tests/test_benchmark.cc:494: 1 ports per node
[12:58:33] tests/test_benchmark.cc:499: recv buffer registration is NOT enabled
[12:58:33] tests/test_benchmark.cc:504: TEST_NUM_GPU_WORKER = 0
[12:58:33] tests/test_benchmark.cc:507: TEST_NUM_GPU_SERVER = 0
[12:58:33] src/postoffice.cc:60: Creating Van: 1. group_size=1
[12:58:33] src/van.cc:88: Creating RDMAVan.
[12:58:33] src/van.cc:89: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[12:58:33] src/./rdma_van.h:46: Shared memory IPC has been disabled
[12:58:33] tests/test_benchmark.cc:494: 1 ports per node
[12:58:33] tests/test_benchmark.cc:499: recv buffer registration is NOT enabled
[12:58:33] tests/test_benchmark.cc:504: TEST_NUM_GPU_WORKER = 0
[12:58:33] tests/test_benchmark.cc:507: TEST_NUM_GPU_SERVER = 0
[12:58:33] src/postoffice.cc:60: Creating Van: 1. group_size=1
[12:58:33] src/van.cc:88: Creating RDMAVan.
[12:58:33] src/van.cc:89: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[12:58:33] src/./rdma_van.h:46: Shared memory IPC has been disabled
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 32767 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 32767 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 8 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:898: OnDisconnected from Node 32767
[12:58:35] src/./rdma_van.h:898: OnDisconnected from Node 32767
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:35] ./include/dmlc/logging.h:276: [12:58:35] src/./rdma_transport.h:145: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 8 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x555a232143e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x555a232147fb]
[bt] (2) ./test_benchmark(+0x598eb) [0x555a2325f8eb]
[bt] (3) ./test_benchmark(+0x6451d) [0x555a2326a51d]
[bt] (4) ./test_benchmark(+0x64feb) [0x555a2326afeb]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f7cc9e2e6df]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f7cc99416db]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f7cc966aa3f]

terminate called after throwing an instance of 'dmlc::Error'
what(): [12:58:35] src/./rdma_transport.h:145: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 8 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x555a232143e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x555a232147fb]
[bt] (2) ./test_benchmark(+0x598eb) [0x555a2325f8eb]
[bt] (3) ./test_benchmark(+0x6451d) [0x555a2326a51d]
[bt] (4) ./test_benchmark(+0x64feb) [0x555a2326afeb]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f7cc9e2e6df]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f7cc99416db]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f7cc966aa3f]

[12:58:36] src/./rdma_van.h:898: OnDisconnected from Node 1
[12:58:36] src/./rdma_van.h:898: OnDisconnected from Node 1
[12:58:36] src/./rdma_van.h:898: OnDisconnected from Node 1
./local_multi_workers_RDMA.sh: line 48: 42385 Aborted (core dumped) ${bin} ${arg}
[12:58:36] ./include/dmlc/logging.h:276: [12:58:36] src/./rdma_van.h:616: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
Work Request Flushed Error 5 140478784480192 245 postoffice ptr: 0x55886602cf60

Stack trace returned 6 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x55886420f3e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x55886420f7fb]
[bt] (2) ./test_benchmark(+0x630fd) [0x5588642640fd]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7fc3dcfca6df]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fc3dcadd6db]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fc3dc806a3f]

terminate called after throwing an instance of 'dmlc::Error'
what(): [12:58:36] src/./rdma_van.h:616: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
Work Request Flushed Error 5 140478784480192 245 postoffice ptr: 0x55886602cf60

Stack trace returned 6 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x55886420f3e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x55886420f7fb]
[bt] (2) ./test_benchmark(+0x630fd) [0x5588642640fd]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7fc3dcfca6df]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fc3dcadd6db]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fc3dc806a3f]

[12:58:37] src/./rdma_van.h:898: OnDisconnected from Node 9
[12:58:37] src/./rdma_van.h:898: OnDisconnected from Node 9
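
As a first step, the knobs named in the error message can be lowered before launching the test. A sketch, assuming the usual shell launch flow; the specific values 32 and 256 are guesses to experiment with, not recommended settings:

```shell
# Reduce the RDMA queue depths so ps-lite posts fewer work requests up
# front, shrinking the amount of memory ibv_reg_mr must register at startup.
export BYTEPS_RDMA_START_DEPTH=32   # default 128
export BYTEPS_RDMA_RX_DEPTH=256    # default 2048

# Also worth checking: the locked-memory limit ("unlimited" is typical
# on RDMA hosts; a small value here can also make ibv_reg_mr fail).
ulimit -l

echo "START_DEPTH=$BYTEPS_RDMA_START_DEPTH RX_DEPTH=$BYTEPS_RDMA_RX_DEPTH"
```

With the variables exported, re-run `./local_multi_workers_RDMA.sh 1 1 ./test_benchmark` in the same shell.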


UCXVan A100 multi-GPU support

Background

Capability of UCX

  • With the current version of OpenUCX, a UCX context cannot choose the optimal route for multiple GPUs. However, if a UCX context and its data are bound to a particular GPU, the data will be sent/received over the optimal route.
  • Memory pinning: each time memory is pinned/unpinned for GPUDirect RDMA, there is some overhead. Ideally we want to avoid changing the memory buffer address when sending/receiving GPU data with GDR.

Use cases

Testing code #46

  • 1 PSWorker per node. The PSWorker handles the data from all 8 GPUs and the CPU on that node in the same process.
  • Multiple PSWorkers per node. There are multiple processes, and each process has a PSWorker that handles the data of a particular GPU (and the CPU).

Proposed APIs

Ps-lite user experience

// assume we have previously done cudaMalloc for `data` (char*) on GPU 0
DeviceType curr_device_type = ps::kGPU;
int curr_device_id = 0;
// specify the target device
DeviceType tgt_device_type = ps::kGPU;
int tgt_device_id = 2;
SArray<char> vals(data, len, false,
                  curr_device_type, curr_device_id,
                  tgt_device_type, tgt_device_id);

PSKV pskv;
pskv.keys.push_back(0);
pskv.lens.push_back(1024000);
pskv.size = 1024000;
kv_worker->ZPush(pskv.keys, vals, pskv.lens);
void RequestHandler(const KVMeta& req_meta, const KVPairs<Val>& req_data, KVServer<Val>* server) {
  if (req_meta.dst_device_type == kGPU) {
    // do something with the received GPU data
  } else {
    // do something with the received CPU data
  }
};

Ps-lite data structures changes

Message.meta stores the device information

enum DeviceType {
  kCPU,
  kGPU
};

struct Meta {
  DeviceType src_dev_type;
  uint8_t src_dev_id;
  DeviceType dst_dev_type;
  uint8_t dst_dev_id;
  ...
};

// now each ucp_context is bound to a port for a ucp_listener and a ucp_worker,
// so we replace the monolithic `int port` with arrays of ports and devices
struct Node {
  uint8_t num_ports;
  int ports[32];
  DeviceType device_types[32];
  uint8_t device_ids[32];
};

Device extension for SArray

template<typename V>
class SArray {
 public:
  // Zero-copy constructor for device data; the device arguments default
  // to CPU so the existing CPU-only call sites keep compiling.
  SArray(V* data, size_t size, bool deletable = false,
         DeviceType curr_device_type = kCPU, int curr_device_id = 0,
         DeviceType tgt_device_type = kCPU, int tgt_device_id = 0);

  // copy-assignment operator
  template <typename W> void operator=(const SArray<W>& arr);

 private:
  DeviceType curr_device_type_;
  int curr_device_id_;
  DeviceType tgt_device_type_;
  int tgt_device_id_;
};

KVServer buffer registration

// to send data from local GPU to a remote GPU, we need the remote GPU
// to designate a memory buffer that can hold the result, so that we avoid
// a memory copy.
class KVServer {
 public:
  void RegisterRecvBuffer(SArray<Key>& keys, const SArray<Val>& vals,
                          const SArray<int>& lens = {}, int cmd = 0);
};

UCXVan

class UCXVan : public Van {
 public:
  // creates multiple ucp_listeners and ports based on `CUDA_VISIBLE_DEVICES`;
  // updates `my_node.ports` directly inside `Bind`.
  // The return type is changed from int to void.
  void Bind(const Node &my_node, int max_retry);

  // connects to the target node based on the port information
  // stored inside `node`
  void Connect(const Node &node);

  // picks the corresponding ucp_context based on src_device_id
  // and dst_device_id
  int SendMsg(Message& msg);

  // sets the curr_device_id upon receiving the message
  int RecvMsg(Message& msg);

  // registers the receiving buffer for the provided keys
  void RegisterRecvBuffer(SArray<Key>& keys, const SArray<Val>& vals,
                          const SArray<int>& lens = {}, int cmd = 0);

 private:
  // multiple UCX contexts;
  // we also need to store the mapping context <-> dev_type / dev_id
  std::vector<ucp_context_h> contexts_;
};

BytePS changes

This part should be handled by the BytePS dev team.

Memory management

Since using GDR with a frequently changing GPU data address can incur pinning/unpinning overhead, reusing the same GPU buffer is preferred. There are two options:

  1. BytePS manages the buffers registered for communication. Upon each push, we copy the original tensor to the communication buffer before sending it out. Similarly, we use the communication buffer to hold the result of a pull, and then copy the data to the destination tensor.
  2. BytePS directly passes whatever GPU memory needs to be sent/received, and relies on UCX's lazy pinning optimizations to mitigate the overhead.

The choice mainly affects BytePS's memory management implementation. From UCXVan's point of view, it always tries to pin the GPU data passed to it.

PushPull and key encoding on A100

On A100 nodes, each PCIe switch is connected to 2 GPUs and a NIC. A pushpull call in BytePS then becomes the following series of operations on the local node:

  • reduce-scatter, which results in 8 GPU tensors
  • ZPush/ZPull with the (same) key and the corresponding device IDs
  • all-gather

Note that the above ps-lite change does not affect how the ps-lite scheduler assigns ranks to each node. BytePS partitions a tensor as usual (one PS key for every 4 MB data chunk). The key encoding logic also stays the same: the first x bits are reserved for the target node ID. When sending a key whose data resides on different GPUs, we just need to annotate the source SArray with the corresponding device ID.

UCX van with GDR

If I want to register a GPU buffer and then use that buffer with ucp_tag_send_nb(), what are the steps? Are there code samples I can use as a reference?

These are some memory-registration-related settings in UCX:

    {"REG_METHODS", "rcache,odp,direct",
     "List of registration methods in order of preference. Supported methods are:\n"
     "  odp         - implicit on-demand paging\n"
     "  rcache      - userspace registration cache\n"
     "  direct      - direct registration\n",
     ucs_offsetof(uct_ib_md_config_t, reg_methods), UCS_CONFIG_TYPE_STRING_ARRAY},

Is `export UCX_IB_REG_METHODS=rcache` what I need in order to manually register memory regions?

error when sending multiple keys in one message

I want to apply the RDMA version of ps-lite to a sparse case in which one message contains multiple keys and multiple values. However, an error occurs: the worker sends 100 keys, but the server receives only 1 key. I wonder whether this implementation only works in the dense case, such as test_benchmark.cc and the byteps package, in which one message contains only one key and many values. Thank you.
