bytedance / ps-lite
This project forked from dmlc/ps-lite
A lightweight parameter server interface
Home Page: http://ps-lite.readthedocs.org
License: Apache License 2.0
Hi, developers of ps-lite. I am using the RDMA version of ps-lite. When I run test_benchmark, an error occurs. The error log and boot script are included below. What is the problem, and how can I solve it? Thank you!
(recenv) [bob@need08 tests]$ ./local_multi_workers_RDMA.sh 1 1 test_benchmark
./local_multi_workers_RDMA.sh: line 28: test_benchmark: command not found
./local_multi_workers_RDMA.sh: line 36: test_benchmark: command not found
./local_multi_workers_RDMA.sh: line 45: test_benchmark: command not found
(recenv) [guowei@need08 tests]$ ./local_multi_workers_RDMA.sh 1 1 ./test_benchmark
[12:58:33] tests/test_benchmark.cc:494: 1 ports per node
[12:58:33] tests/test_benchmark.cc:499: recv buffer registration is NOT enabled
[12:58:33] tests/test_benchmark.cc:504: TEST_NUM_GPU_WORKER = 0
[12:58:33] tests/test_benchmark.cc:507: TEST_NUM_GPU_SERVER = 0
[12:58:33] src/postoffice.cc:60: Creating Van: 1. group_size=1
[12:58:33] src/van.cc:88: Creating RDMAVan.
[12:58:33] src/van.cc:89: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[12:58:33] src/./rdma_van.h:46: Shared memory IPC has been disabled
[12:58:33] tests/test_benchmark.cc:494: 1 ports per node
[12:58:33] tests/test_benchmark.cc:499: recv buffer registration is NOT enabled
[12:58:33] tests/test_benchmark.cc:504: TEST_NUM_GPU_WORKER = 0
[12:58:33] tests/test_benchmark.cc:507: TEST_NUM_GPU_SERVER = 0
[12:58:33] src/postoffice.cc:60: Creating Van: 1. group_size=1
[12:58:33] src/van.cc:88: Creating RDMAVan.
[12:58:33] src/van.cc:89: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[12:58:33] src/./rdma_van.h:46: Shared memory IPC has been disabled
[12:58:33] tests/test_benchmark.cc:494: 1 ports per node
[12:58:33] tests/test_benchmark.cc:499: recv buffer registration is NOT enabled
[12:58:33] tests/test_benchmark.cc:504: TEST_NUM_GPU_WORKER = 0
[12:58:33] tests/test_benchmark.cc:507: TEST_NUM_GPU_SERVER = 0
[12:58:33] src/postoffice.cc:60: Creating Van: 1. group_size=1
[12:58:33] src/van.cc:88: Creating RDMAVan.
[12:58:33] src/van.cc:89: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[12:58:33] src/./rdma_van.h:46: Shared memory IPC has been disabled
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 32767 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 32767 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 8 with Transport=RDMA
[12:58:34] src/./rdma_van.h:811: OnConnect to Node 1 with Transport=RDMA
[12:58:34] src/./rdma_van.h:238: Connect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 9 with Transport=RDMA
[12:58:35] src/./rdma_van.h:898: OnDisconnected from Node 32767
[12:58:35] src/./rdma_van.h:898: OnDisconnected from Node 32767
[12:58:35] src/./rdma_van.h:811: OnConnect to Node 8 with Transport=RDMA
[12:58:35] src/./rdma_van.h:238: Connect to Node 1 with Transport=RDMA
[12:58:35] ./include/dmlc/logging.h:276: [12:58:35] src/./rdma_transport.h:145: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 8 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x555a232143e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x555a232147fb]
[bt] (2) ./test_benchmark(+0x598eb) [0x555a2325f8eb]
[bt] (3) ./test_benchmark(+0x6451d) [0x555a2326a51d]
[bt] (4) ./test_benchmark(+0x64feb) [0x555a2326afeb]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f7cc9e2e6df]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f7cc99416db]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f7cc966aa3f]
terminate called after throwing an instance of 'dmlc::Error'
what(): [12:58:35] src/./rdma_transport.h:145: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 8 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x555a232143e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x555a232147fb]
[bt] (2) ./test_benchmark(+0x598eb) [0x555a2325f8eb]
[bt] (3) ./test_benchmark(+0x6451d) [0x555a2326a51d]
[bt] (4) ./test_benchmark(+0x64feb) [0x555a2326afeb]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f7cc9e2e6df]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f7cc99416db]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f7cc966aa3f]
[12:58:36] src/./rdma_van.h:898: OnDisconnected from Node 1
[12:58:36] src/./rdma_van.h:898: OnDisconnected from Node 1
[12:58:36] src/./rdma_van.h:898: OnDisconnected from Node 1
./local_multi_workers_RDMA.sh: line 48: 42385 Aborted (core dumped) ${bin} ${arg}
[12:58:36] ./include/dmlc/logging.h:276: [12:58:36] src/./rdma_van.h:616: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
Work Request Flushed Error 5 140478784480192 245 postoffice ptr: 0x55886602cf60
Stack trace returned 6 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x55886420f3e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x55886420f7fb]
[bt] (2) ./test_benchmark(+0x630fd) [0x5588642640fd]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7fc3dcfca6df]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fc3dcadd6db]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fc3dc806a3f]
terminate called after throwing an instance of 'dmlc::Error'
what(): [12:58:36] src/./rdma_van.h:616: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
Work Request Flushed Error 5 140478784480192 245 postoffice ptr: 0x55886602cf60
Stack trace returned 6 entries:
[bt] (0) ./test_benchmark(+0xe3e0) [0x55886420f3e0]
[bt] (1) ./test_benchmark(+0xe7fb) [0x55886420f7fb]
[bt] (2) ./test_benchmark(+0x630fd) [0x5588642640fd]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7fc3dcfca6df]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fc3dcadd6db]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fc3dc806a3f]
[12:58:37] src/./rdma_van.h:898: OnDisconnected from Node 9
[12:58:37] src/./rdma_van.h:898: OnDisconnected from Node 9
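The error message itself points at the first two knobs to try. A hedged workaround sketch follows; the depth values are illustrative, not tuned recommendations, and the `ulimit` step covers the common case where `ibv_reg_mr` returns ENOMEM because the locked-memory limit is too low, which is an assumption about this particular machine:

```shell
# Workaround sketch for "ibv_reg_mr failed: Cannot allocate memory".

# 1) ibv_reg_mr pins pages; a low RLIMIT_MEMLOCK is a common cause of ENOMEM.
#    Raising it may require root or an /etc/security/limits.conf entry.
ulimit -l unlimited 2>/dev/null || echo "could not raise memlock limit (need root or limits.conf)"

# 2) Register fewer receive buffers up front, as the log suggests.
export BYTEPS_RDMA_START_DEPTH=32   # default 128
export BYTEPS_RDMA_RX_DEPTH=512     # default 2048

echo "START_DEPTH=$BYTEPS_RDMA_START_DEPTH RX_DEPTH=$BYTEPS_RDMA_RX_DEPTH"
```

With these set, rerun `./local_multi_workers_RDMA.sh 1 1 ./test_benchmark` in the same shell.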
Testing code #46
// assume we have done cudaMalloc with `data` (char*) on GPU 0 previously.
DeviceType curr_device_type = ps::kGPU;
int curr_device_id = 0;
// specify the target device
DeviceType tgt_device_type = ps::kGPU;
int tgt_device_id = 2;
SArray<char> vals(data, len, false,
curr_device_type, curr_device_id,
tgt_device_type, tgt_device_id);
PSKV pskv;
pskv.keys.push_back(0);
pskv.lens.push_back(1024000);
pskv.size = 1024000;
kv_worker->ZPush(pskv.keys, vals, pskv.lens);
template <typename Val>
void RequestHandler(const KVMeta& req_meta, const KVPairs<Val>& req_data, KVServer<Val>* server) {
if (req_meta.dst_device_type == kGPU) {
// do something with the received GPU data
} else {
// do something with the received CPU data
}
};
Message.meta stores the device information
enum DeviceType {
  kCPU,
  kGPU
};
struct Meta {
DeviceType src_dev_type;
uint8_t src_dev_id;
DeviceType dst_dev_type;
uint8_t dst_dev_id;
...
};
// now each ucp_context is bound to a port for a ucp_listener and a ucp_worker
// we replace the monolithic `int port` with an array of ports and devices
struct Node {
uint8_t num_ports;
int ports[32];
DeviceType device_types[32];
uint8_t device_ids[32];
};
Device extension for SArray
template<typename V>
class SArray {
public:
// Zero-copy constructor for device data
SArray(V* data, size_t size, bool deletable,
       DeviceType curr_device_type, int curr_device_id,
       DeviceType tgt_device_type, int tgt_device_id);
// copy-assignment operator
template <typename W> void operator=(const SArray<W>& arr);
private:
DeviceType curr_device_type_;
int curr_device_id_;
DeviceType tgt_device_type_;
int tgt_device_id_;
};
KVServer buffer registration
// to send data from local GPU to a remote GPU, we need the remote GPU
// to designate a memory buffer that can hold the result, so that we avoid
// a memory copy.
template <typename Val>
class KVServer {
 public:
  void RegisterRecvBuffer(SArray<Key>& keys, const SArray<Val>& vals,
                          const SArray<int>& lens = {}, int cmd = 0);
};
UCXVan
class UCXVan : public Van {
 public:
// creates multiple ucp_listeners and ports based on `CUDA_VISIBLE_DEVICES`
// Update `my_node.ports` directly inside `Bind`
// the return type is changed from int to void
void Bind(const Node &my_node, int max_retry);
// Connect to the target node based on the ports information
// stored inside `node`.
void Connect(const Node &node);
// picks the corresponding ucp_context based on src_device_id
// and dst_device_id
int SendMsg(Message& msg);
// set the curr_device_id upon receiving the message.
int RecvMsg(Message& msg);
// register the receiving buffer for provided keys
void RegisterRecvBuffer(SArray<Key>& keys, const SArray<Val>& vals,
const SArray<int>& lens = {}, int cmd = 0);
private:
// multiple ucx contexts
// also need to store the mapping for context <-> dev_type / dev_id
  std::vector<ucp_context_h> contexts_;
};
This part should be handled by the BytePS dev team.
Since using GDR with frequently changing GPU data addresses can incur pinning/unpinning overheads, reusing the same GPU buffer is preferred. There are two options:
On A100 nodes, each PCIe switch connects two GPUs and one NIC. A pushpull call in BytePS would then become the following series of operations on the local node:
Note that the above ps-lite change does not affect how the ps-lite scheduler assigns ranks to each node. BytePS partitions a tensor as usual (one ps key per 4 MB data chunk). The key encoding logic also stays the same: the first x bits are reserved for the target node id. When sending a key whose data resides on different GPUs, we just need to annotate the source SArray with the corresponding device id.
If I want to register a GPU buffer and then use that buffer with ucp_tag_send_nb(), what are the steps to do that? Are there code samples I can use as a reference?
These are some memory-registration-related settings in UCX:
{"REG_METHODS", "rcache,odp,direct",
"List of registration methods in order of preference. Supported methods are:\n"
" odp - implicit on-demand paging\n"
" rcache - userspace registration cache\n"
" direct - direct registration\n",
ucs_offsetof(uct_ib_md_config_t, reg_methods), UCS_CONFIG_TYPE_STRING_ARRAY},
Is `export UCX_IB_REG_METHODS=rcache` all I need, or do I have to manually register memory regions?
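For reference, a hedged sketch of the environment side. The assumption (worth verifying against the UCX docs) is that with `rcache` UCX registers a buffer on first use inside `ucp_tag_send_nb()` and caches the registration, so the application does not call `ibv_reg_mr` itself; GPU buffers additionally require UCX built with CUDA support:

```shell
# Assumption: rcache = lazy on-first-use registration with a userspace
# cache, so no manual ibv_reg_mr call is needed from the application.
export UCX_IB_REG_METHODS=rcache

# ucx_info (ships with UCX) can confirm the effective value, if installed.
command -v ucx_info >/dev/null && ucx_info -c | grep -i reg_methods || true

echo "UCX_IB_REG_METHODS=$UCX_IB_REG_METHODS"
```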
I want to apply the RDMA version of ps-lite to the sparse case, in which one message contains multiple keys and multiple values. However, an error occurs: the worker sends 100 keys, but the server only receives 1 key. I wonder whether this implementation only works in the dense case, such as test_benchmark.cc and the byteps package, in which one message contains only one key and many values. Thank you.