dmlc / ps-lite

A lightweight parameter server interface

Home Page: http://ps-lite.readthedocs.org

License: Apache License 2.0

Shell 1.23% C++ 80.25% Makefile 1.97% CMake 1.87% C 0.68% Python 13.99%

ps-lite's Introduction

A light and efficient implementation of the parameter server framework. It provides clean yet powerful APIs. For example, a worker node can communicate with the server nodes by

  • Push(keys, values): push a list of (key, value) pairs to the server nodes
  • Pull(keys): pull the values from servers for a list of keys
  • Wait: wait until a push or pull finishes.

A simple example:

  std::vector<uint64_t> key = {1, 3, 5};
  std::vector<float> val = {1, 1, 1};
  std::vector<float> recv_val;
  ps::KVWorker<float> w;
  w.Wait(w.Push(key, val));
  w.Wait(w.Pull(key, &recv_val));

More features:

  • Flexible and high-performance communication: zero-copy push/pull, supporting dynamic length values, user-defined filters for communication compression
  • Server-side programming: supporting user-defined handles on server nodes

Build

ps-lite requires a C++11 compiler such as g++ >= 4.8. On Ubuntu >= 13.10, we can install it by

sudo apt-get update && sudo apt-get install -y build-essential git

Instructions for gcc 4.8 installation on other platforms:

Then clone and build

git clone https://github.com/dmlc/ps-lite
cd ps-lite && make -j4

How to use

ps-lite provides asynchronous communication for other projects:

Research papers

  1. Mu Li, Dave Andersen, Alex Smola, Junwoo Park, Amr Ahmed, Vanja Josifovski, James Long, Eugene Shekita, Bor-Yiing Su. Scaling Distributed Machine Learning with the Parameter Server. In Operating Systems Design and Implementation (OSDI), 2014
  2. Mu Li, Dave Andersen, Alex Smola, and Kai Yu. Communication Efficient Distributed Machine Learning with the Parameter Server. In Neural Information Processing Systems (NIPS), 2014

ps-lite's People

Contributors

anancds, anandj91, changlan, chungongyu, codingcat, crazyboycjr, cykustcc, eric-haibin-lin, hitzzc, leezu, madjam, mli, nhynes, qiaohaijun, rahul003, redwrasse, reyoung, shilad, snowzjx, solin319, subenle, szha, tqchen, travisbarrydick, willzhang4a58, xlvector, yajiedesign, ymjiang, yzhliu, zhouhaiy

ps-lite's Issues

[dev branch] Cmake error

Both Windows and Linux builds fail with cmake. (Of course you can use make on Linux, I just tried cmake on Linux to verify the problem.)

CMake Error at cmake/ProtoBuf.cmake:28 (message):
Error: pslite_protobuf_generate_cpp_py() called without any proto files
Call Stack (most recent call first):
CMakeLists.txt:23 (pslite_protobuf_generate_cpp_py)

I printed out the parameters:

proto_gen_folder=/home/hct/test/ps-lite/include prot_srcs= proto_hdrs=
proto_python= project_src_dir=/home/hct/test/ps-lite proto_files=

Zmq occasionally hangs when ps app exits.

I've been intensively using ps-lite, specifically difacto in wormhole based on ps-lite. It occasionally happens that one difacto.dmlc process (which should be the scheduler) hangs after the workers/servers have finished all the iterations. I reproduced this issue both in local and YARN modes.

And the thread stack dump shows:

(gdb) bt
#0  0x00000031da4df343 in poll () from /lib64/libc.so.6
#1  0x000000000051cfca in zmq::signaler_t::wait (this=0x136eaf8, timeout_=-1) at src/signaler.cpp:218
#2  0x0000000000515be0 in zmq::mailbox_t::recv (this=0x136ea98, cmd_=0x7fffbbeb4670, timeout_=-1) at src/mailbox.cpp:80
#3  0x000000000050f03c in zmq::ctx_t::terminate (this=0x136ea00) at src/ctx.cpp:167
#4  0x000000000046ba14 in ps::Van::~Van (this=0x7b7e28, __in_chrg=<value optimized out>) at src/system/van.cc:24kj
#5  0x0000000000461aef in ps::Manager::~Manager (this=0x7b7cf8, __in_chrg=<value optimized out>) at src/system/manager.cc:16
#6  0x0000000000466b00 in ps::Postoffice::~Postoffice (this=0x7b7c40, __in_chrg=<value optimized out>) at src/system/postoffice.cc:8
#7  0x00000031da435e22 in exit () from /lib64/libc.so.6
#8  0x00000031da41ed24 in __libc_start_main () from /lib64/libc.so.6
#9  0x0000000000409b51 in _start ()
(gdb) info threads
  3 Thread 0x7ff3e4ee3700 (LWP 14957)  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
  2 Thread 0x7ff3e44e2700 (LWP 14958)  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
* 1 Thread 0x7ff3e4f66780 (LWP 14948)  0x00000031da4df343 in poll () from /lib64/libc.so.6
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ff3e44e2700 (LWP 14958))]#0  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
#1  0x000000000051518b in zmq::epoll_t::loop (this=0x1373200) at src/epoll.cpp:156
#2  0x00000000005277ee in thread_routine (arg_=0x1373280) at src/thread.cpp:96
#3  0x00000031dac079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000031da4e8b6d in clone () from /lib64/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7ff3e4ee3700 (LWP 14957))]#0  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
#1  0x000000000051518b in zmq::epoll_t::loop (this=0x136ee30) at src/epoll.cpp:156
#2  0x00000000005277ee in thread_routine (arg_=0x136eeb0) at src/thread.cpp:96
#3  0x00000031dac079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000031da4e8b6d in clone () from /lib64/libc.so.6

Can you guys look into this? @mli @tqchen

support for string Key

Hi there,

Do you have any plan to support "string" Key? Right now Key is uint64_t or uint32_t.

Thanks.
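
One possible workaround, sketched under the assumption that the application can tolerate rare hash collisions, is to hash each string into a uint64_t before calling Push or Pull. The helper name StringToKey is hypothetical and is not part of ps-lite; the hash is standard 64-bit FNV-1a.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of ps-lite): maps a string to a 64-bit key
// with the FNV-1a hash. Distinct strings can collide, so this only suits
// applications that tolerate rare collisions.
inline uint64_t StringToKey(const std::string& s) {
  uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV-1a 64-bit prime
  }
  return h;
}
```

The hashed keys no longer follow the order of the original strings, so any ordering the caller relies on has to be re-established after hashing.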

more standard implementation of SArray::operator=

It might be better to implement SArray::operator= in a more standard way.

template <typename W>
SArray& operator=(const SArray<W>& arr) {
  // Compare addresses as void* so the check also compiles when W differs from V.
  if (static_cast<const void*>(this) == static_cast<const void*>(&arr)) return *this;
  size_ = arr.size() * sizeof(W) / sizeof(V);
  CHECK_EQ(size_ * sizeof(V), arr.size() * sizeof(W));
  capacity_ = arr.capacity() * sizeof(W) / sizeof(V);
  ptr_ = std::shared_ptr<V>(arr.ptr(), reinterpret_cast<V*>(arr.data()));
  return *this;
}

First KVworker always throws an exception

Hi

I am new to ps-lite and I am trying to use ps-lite as part of a university group project. I am using the KVServer and KVWorker. Whenever I deploy the app using:

./ps-lite/tracker/dmlc_local.py -s 1 -n 3 ./logreg ,

an exception gets thrown:

[11:39:51] /home/rosko/Lab2016-Group4/ps-lite/include/dmlc/logging.h:208: [11:39:51] /home/rosko/Lab2016-Group4/ps-lite/src/postoffice.cc:77: Check failed: (customers_.count(id)) == ((size_t)0) id 0 already exists terminate called after throwing an instance of 'dmlc::Error' what(): [11:39:51] /home/rosko/Lab2016-Group4/ps-lite/src/postoffice.cc:77: Check failed: (customers_.count(id)) == ((size_t)0) id 0 already exists

This exception is thrown only for the first worker (rank is 0), when the post office object is retrieved:

ps::Postoffice *office = ps::Postoffice::Get();

All the other workers are working fine. Can anyone help with resolving the exception?

Any form of help is highly appreciated! Thank you!

Hot join and quit of workers.

The number of workers is decided by DMLC_NUM_WORKER? Am I right?

What if workers dynamically join and quit the network?

OSX build for glog fails

A recent commit switched to downloading and building specific versions of dependencies. The supplied version of glog fails to build on OSX. It looks like more recent builds have fixed the problem.

What's the appropriate solution? Is there a clean way to use installed libs instead of downloading them?

Here's the glog build error on OSX, for reference:

g++ -DHAVE_CONFIG_H -I. -I./src  -I./src  -D_THREAD_SAFE    -I/Users/a558989/Projects/ps-lite/deps/include -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare  -DNO_FRAME_POINTER  -g -O2 -MT utilities_unittest-utilities_unittest.o -MD -MP -MF .deps/utilities_unittest-utilities_unittest.Tpo -c -o utilities_unittest-utilities_unittest.o `test -f 'src/utilities_unittest.cc' || echo './'`src/utilities_unittest.cc
In file included from src/stl_logging_unittest.cc:34:
In file included from ./src/glog/stl_logging.h:54:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ext/hash_set:205:5: warning: 
      Use of the header <ext/hash_set> is deprecated. Migrate to <unordered_set> [-W#warnings]
#   warning Use of the header <ext/hash_set> is deprecated.  Migrate to <unordered_set>
    ^
In file included from src/stl_logging_unittest.cc:34:
In file included from ./src/glog/stl_logging.h:55:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ext/hash_map:212:5: warning: 
      Use of the header <ext/hash_map> is deprecated. Migrate to <unordered_map> [-W#warnings]
#   warning Use of the header <ext/hash_map> is deprecated.  Migrate to <unordered_map>
    ^
In file included from src/stl_logging_unittest.cc:34:
./src/glog/stl_logging.h:56:11: fatal error: 'ext/slist' file not found
# include <ext/slist>
          ^

Van::GetTimestamp(), a bug or not?

include/ps/internal/van.h:

/**
 * \brief get next available timestamp. thread safe
 */
int GetTimestamp() { return timestamp_++; }

std::atomic<int> timestamp_{0};

The implementation of GetTimestamp and its comment don't match. GetTimestamp doesn't return the "next" available timestamp but the current logical timestamp.
Meanwhile timestamp_ is initialized to zero.
So the first call to GetTimestamp will return zero. Is that okay?
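
The behaviour in question can be reproduced outside ps-lite. The sketch below uses a hypothetical stand-in class (TimestampSource is not a ps-lite name) to show that post-increment on std::atomic<int> returns the value held before the increment, so the first call yields 0 and each call still gets a distinct value.

```cpp
#include <atomic>

// Stand-in for the counter in Van; the class name is hypothetical.
class TimestampSource {
 public:
  // Post-increment on std::atomic<int> is a single atomic read-modify-write
  // that returns the value held before the increment.
  int GetTimestamp() { return timestamp_++; }

 private:
  std::atomic<int> timestamp_{0};
};
```

Whether 0 is acceptable as a first timestamp depends on how the callers interpret it; the counter itself hands out each value exactly once.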

[WIP] Simple Example of Async SGD

Based on what we have discussed, I think it is of great interest to show the following snippet of code.

  • Assume features are mapped consecutively (i.e. no Localizer is needed)
  • Assume each node takes its own partition of the data (i.e. no scheduler is needed)

Write a simple async SGD that defines ServerHandle and Worker Logic and that is all you need to do. This will demonstrate the core functionality of ps-lite without getting distracted in the other optimizations. And I think the efficiency should be fine for the case we described.

In the long run, we will want to move some of the specific optimization modules (Localizer, WorkLoadPool) to dmlc-core, and design a user-friendly API so that more users can use them.

Build error on Mac OS X El Capitan

I have hit a build problem on Mac OS X El Capitan.

As described here, I installed the HPC tools, including gcc-5.2, which can also be used on El Capitan. But I got the build error below. It may be caused by building the dependencies.

configure: error: in `/Users/sasakikai/dev/ps-lite/cityhash-1.1.1':
configure: error: C compiler cannot create executables
See `config.log' for more details

https://gist.github.com/Lewuathe/d64aa3822e8a9551eeef

What's the reason for the downgrade from the original parameter-server?

The original parameter-server implemented the important features from the OSDI 2014 paper "Scaling Distributed Machine Learning with the Parameter Server", such as replication and bounded-delay consistency. The current ps-lite seems to be a much simpler implementation, a plain parameter server with BSP and ASP. What's the reason for such a downgrade?

potential memory error caused by unsuitable zmq_msg_* functions

zmq_van.h, RecvMsg function;

190 } else if (i == 1) {
191 // task
192 UnpackMeta(buf, size, &(msg->meta));
193 zmq_msg_close(zmsg);
194 bool more = zmq_msg_more(zmsg);
195 delete zmsg;
196 if (!more) break;
197 } else {

It seems we should call line 194 (zmq_msg_more) first and then line 193 (zmq_msg_close), because the current order reads the message after it has been closed.

KVServer is single threaded?

As far as I can see, there is only one worker thread in ps, so request handles, including KVServerDefaultHandle, need not be thread safe.
The result is that ps-lite can't use all CPU cores.

May I have a multi-thread server?

How to fix 'kEmptyString' is not a member of 'google::protobuf::internal' ?

...

./ps-lite/src/proto/task.pb.h: In member function ‘void ps::Task::set_allocated_msg(std::string*)’:
./ps-lite/src/proto/task.pb.h:540:16: error: ‘kEmptyString’ is not a member of ‘google::protobuf::internal’
   if (msg_ != &::google::protobuf::internal::kEmptyString) {
                ^
./ps-lite/src/proto/task.pb.h:548:41: error: ‘kEmptyString’ is not a member of ‘google::protobuf::internal’
     msg_ = const_cast< ::std::string*>(&::google::protobuf::internal::kEmptyString);

...

Windows build

Now that mxnet depends on ps-lite, we should also have a Windows build for ps-lite, or mxnet won't build.

About the lock in ZMQVan::SendMsg

Might the granularity of the lock be smaller? I'm not sure.

I feel that reasons why the lock guards the whole function body are:

  1. protect senders_ from racing
  2. keep the data transfer to the socket continuously and avoid being interrupted

As for reason 1, smaller granularity is acceptable.
As for reason 2, perhaps other operations than locking could be tried.
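
A minimal sketch of the finer-grained locking discussed above, using hypothetical stand-in types (Sender and SenderTable are not ps-lite classes, and frames are recorded in a vector instead of being written to a ZeroMQ socket). A table lock covers only the lookup in the sender map, and a per-sender lock keeps the frames of one message contiguous.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one outgoing socket. A real ZeroMQ socket is not
// thread safe, so each one still needs its own lock.
struct Sender {
  std::mutex mu;
  std::vector<std::string> sent;  // records frames instead of sending them
};

class SenderTable {
 public:
  void Add(int id) {
    std::lock_guard<std::mutex> lk(table_mu_);
    auto& slot = senders_[id];
    if (!slot) slot = std::make_shared<Sender>();
  }

  // Returns the number of frames sent, or -1 if the id is unknown.
  int Send(int id, const std::vector<std::string>& frames) {
    std::shared_ptr<Sender> s;
    {
      // Reason 1: the table lock only covers the lookup.
      std::lock_guard<std::mutex> lk(table_mu_);
      auto it = senders_.find(id);
      if (it == senders_.end()) return -1;
      s = it->second;
    }
    // Reason 2: the per-sender lock keeps the frames of one message together
    // without blocking sends to other nodes.
    std::lock_guard<std::mutex> lk(s->mu);
    for (const auto& f : frames) s->sent.push_back(f);
    return static_cast<int>(frames.size());
  }

 private:
  std::mutex table_mu_;
  std::unordered_map<int, std::shared_ptr<Sender>> senders_;
};
```

Whether this pays off depends on how often different threads send to different nodes at the same time; with a single sending thread the coarse lock costs nothing extra.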

Build issue of ps-lite for gflags error

When building ps-lite on Ubuntu 14.04 with
make -j4
this error appears:

/usr/local/lib/libglog.a(libglog_la-utilities.o): In function `google::GetStackTrace(void**, int, int)':
/home/zixie1991/mygit/ps-lite/glog-0.3.3/src/stacktrace_libunwind-inl.h:65: undefined reference to `_Ux86_64_getcontext'
/home/zixie1991/mygit/ps-lite/glog-0.3.3/src/stacktrace_libunwind-inl.h:66: undefined reference to `_ULx86_64_init_local'
/home/zixie1991/mygit/ps-lite/glog-0.3.3/src/stacktrace_libunwind-inl.h:78: undefined reference to `_ULx86_64_step'
/home/zixie1991/mygit/ps-lite/glog-0.3.3/src/stacktrace_libunwind-inl.h:70: undefined reference to `_ULx86_64_get_reg'

Memory leak van.cc

==39376== 297,840 (257,472 direct, 40,368 indirect) bytes in 5,364 blocks are definitely lost in loss record 494 of 495
==39376== at 0x4A05FD5: operator new(unsigned long) (vg_replace_malloc.c:324)
==39376== by 0x50BB61: ps::Van::Recv(ps::Message*, unsigned long*) (van.cc:209)
==39376== by 0x5096D9: ps::Postoffice::Recv() (postoffice.cc:53)
==39376== by 0x332B2B45DF: execute_native_thread_routine (thread.cc:84)
==39376== by 0x331F2079D0: start_thread (in /lib64/libpthread-2.12.so)
==39376== by 0x331EEE88FC: clone (in /lib64/libc-2.12.so)

Handling of array values

If I push a ps::Key of { 0 } with a ps::Val of { 1, 3 }, the documentation made me believe the server should be storing 0 => [ 1, 3 ]. However, it looks to be storing 0 => (1 + 3) = 4. Is this intentional?

I'm going to extend KVWorker class to handle float arrays, but I was under the impression it would work out of the box. Thanks for your advice!
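
For illustration, a server-side store that keeps the array per key could look like the sketch below. It is a standalone stand-in, not the ps-lite default handle: PushFixedLength is a hypothetical name, and the layout assumption is that every key owns an equally sized slice of the flat value array.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Key = uint64_t;

// Hypothetical server-side store (not the ps-lite default handle): each key
// owns a fixed-length slice of the flat value array, and pushes accumulate
// element-wise instead of collapsing the slice into one sum.
// Assumes keys is non-empty and vals.size() is a multiple of keys.size().
void PushFixedLength(const std::vector<Key>& keys,
                     const std::vector<float>& vals,
                     std::unordered_map<Key, std::vector<float>>* store) {
  const size_t k = vals.size() / keys.size();  // values per key
  for (size_t i = 0; i < keys.size(); ++i) {
    std::vector<float>& slot = (*store)[keys[i]];
    slot.resize(k, 0.0f);
    for (size_t j = 0; j < k; ++j) slot[j] += vals[i * k + j];
  }
}
```

With keys { 0 } and values { 1, 3 }, this store holds 0 => [ 1, 3 ] after the first push.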

Build issue of ps-lite need #include <algorithm>

src/ps/shared_array.h: In member function 'ps::SArray ps::SArray::SetUnion(const ps::SArray&) const':
src/ps/shared_array.h:208:15: error: 'set_union' is not a member of 'std'
V* last = std::set_union(
^

Need to add #include <algorithm> at src/ps/shared_array.h

Possible to run two KVWorkers in one process?

Hi Mu,

I'm trying to integrate MXNet with Spark (apache/mxnet#1637), while I found that it was hard to test the application in spark-local mode. When I try to start two workers in one process (local[*] mode), the postoffice complains

16/04/25 20:55:44 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 6, localhost): ml.dmlc.mxnet.MXNetError: [20:55:44] src/postoffice.cc:77: Check failed: (customers_.count(id)) == ((size_t)0) id 0 already exists
    at ml.dmlc.mxnet.Base$.checkCall(Base.scala:108)
    at ml.dmlc.mxnet.KVStore$.create(KVStore.scala:26)
    at ml.dmlc.mxnet.spark.MXNet$$anonfun$main$2.apply(MXNet.scala:69)
    at ml.dmlc.mxnet.spark.MXNet$$anonfun$main$2.apply(MXNet.scala:55)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

So I wonder whether it is possible to have two customers (with same app_id) in one process?

@mli @tqchen

GetAvailablePort function

src/network_utils.h:

There are perhaps two issues with the function GetAvailablePort.

First, although it doesn't affect the expected behavior of GetAvailablePort, it might be better to return -1 instead of 0 on error. After all, port 0 is reserved in the TCP/IP suite.
If this function returns -1 instead of 0, CHECK() should also be replaced by CHECK_LT().

Second, the necessary calls to close(sock) are missing on the error paths, so the socket descriptor might leak in some corner cases.
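
Both points can be sketched together as follows. This is not the ps-lite implementation: the function name GetAvailablePortOrFail is hypothetical, and the code assumes a POSIX system. It returns -1 on any failure and closes the socket on every path.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Sketch of a port probe that returns -1 on any failure and closes the
// socket on every path. Assumes a POSIX system; the name is hypothetical
// and this is not the ps-lite implementation.
inline int GetAvailablePortOrFail() {
  int sock = socket(AF_INET, SOCK_STREAM, 0);
  if (sock < 0) return -1;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = 0;  // ask the kernel to choose a free port

  int port = -1;
  socklen_t len = sizeof(addr);
  if (bind(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0 &&
      getsockname(sock, reinterpret_cast<sockaddr*>(&addr), &len) == 0) {
    port = ntohs(addr.sin_port);
  }
  close(sock);  // runs on success and on every error path
  return port;
}
```

A caller would then test the result against -1 instead of 0. The port is released before the function returns, so another process can still claim it before the caller binds to it.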

When using dmlc_ssh.py to start a cluster, it can't use network interface ib0

When I start a cluster with dmlc_ssh.py, I find that it does not use ib0.
In my hosts file:
10.10.10.4 Server_10_10_10_4
10.10.101.4 Server_10_10_10_4_IB
The eth0 IP is 10.10.10.4 and the ib0 IP is 10.10.101.4.
In https://github.com/dmlc/ps-lite/blob/master/tracker/tracker.py line 391, hostIP = socket.gethostbyname(socket.getfqdn()) or hostIP = socket.gethostbyname(socket.gethostname())
does not use DMLC_INTERFACE = ib0, so it gets the IP 10.10.10.4.
As a result the scheduler does not use ib0, and all roles use eth0.

dist-train on windows

Is there any dist-train example on Windows, and how do I use it? I tried to run dist-train in .\multi-node, but every attempt failed.

Build issue on Mac OS X 10.9.5

Building with:

[zjb238:~/code/ps-lite]$ g++ -v                                                                (master)
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/5.2.0/libexec/gcc/x86_64-apple-darwin13.4.0/5.2.0/lto-wrapper
Target: x86_64-apple-darwin13.4.0
Configured with: ../configure --build=x86_64-apple-darwin13.4.0 --prefix=/usr/local/Cellar/gcc/5.2.0 --libdir=/usr/local/Cellar/gcc/5.2.0/lib/gcc/5 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-5 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-libstdcxx-time=yes --enable-stage1-checking --enable-checking=release --enable-lto --with-build-config=bootstrap-debug --disable-werror --with-pkgversion='Homebrew gcc 5.2.0' --with-bugurl=https://github.com/Homebrew/homebrew/issues --enable-plugin --disable-nls --enable-multilib
Thread model: posix
gcc version 5.2.0 (Homebrew gcc 5.2.0)

I get the following error:

asic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libzmq.a(libzmq_la-address.o)
      construction vtable for std::__1::basic_iostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libzmq.a(libzmq_la-tcp_address.o)
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
make: *** [guide/example_a] Error 1

What happens if the message adding a recovery node is ignored?

Suppose YARN launches a recovery node, and this node sends a message of type control::ADD_NODE to the scheduler node. If heartbeat_timeout is large enough, the scheduler has not yet noticed that a node is dead, so it ignores the recovery message and the recovery node is not added to the scheduler's management. What happens in this case? Thanks.
@mli

Download problem for ZMQ

While compiling mxnet I ran into a certificate issue with GitHub. Maybe the right thing to do is not to download binaries, but to recompile zmq?

hive:/data/mxnet (master *)> make
make CXX=g++ DEPS_PATH=/data/mxnet/deps -C ./ps-lite ps
make[1]: Entering directory `/data/mxnet/ps-lite'
rm -rf zeromq-4.1.2.tar.gz zeromq-4.1.2
wget https://raw.githubusercontent.com/mli/deps/master/build/zeromq-4.1.2.tar.gz && tar -zxf zeromq-4.1.2.tar.gz
--2015-11-16 15:02:00--  https://raw.githubusercontent.com/mli/deps/master/build/zeromq-4.1.2.tar.gz
Resolving raw.githubusercontent.com... 103.245.222.133
Connecting to raw.githubusercontent.com|103.245.222.133|:443... connected.
ERROR: certificate common name “www.github.com” doesn’t match requested host name “raw.githubusercontent.com”.
To connect to raw.githubusercontent.com insecurely, use ‘--no-check-certificate’.
make[1]: *** [/data/mxnet/deps/include/zmq.h] Error 5
make[1]: Leaving directory `/data/mxnet/ps-lite'
make: *** [ps-lite/build/libps.a] Error 2
hive:/data/mxnet (master)> 

`fabs` function maybe undeclared...

When building test routines, this appeared:

tests/test_kv_app.cc: In function ‘void RunWorker()’:
tests/test_kv_app.cc:44:43: error: ‘fabs’ was not declared in this scope
     res += fabs(rets[i] - vals[i] * repeat);

my compiler: gcc version 5.4.0 20160603 (Ubuntu 5.4.0-3ubuntu1~12.04)

I added the cmath header on line 2 of tests/test_kv_app.cc and changed line 44 to use std::fabs(...); then the build passed.

Message data is converted from other data types to SArray<char>, so it can't support complicated data structures

I want to push a string as a value to the server, but it doesn't work.
I notice that all data is converted to SArray, so it is difficult to support complicated data structures.
As the documentation says:

A smart array that retains shared ownership. It provides similar functionalities comparing to std::vector, including data(), size(), operator[], resize(), clear(). SArray can be easily constructed from std::vector

but I can't do this:
std::vector vec;
SArray s(vec);

Does server replication work?

I noticed that in class Executor the member variable num_replicas_ is set to 0 by default, and nowhere in the implementation file is it assigned a new value. But at executor.cc line 358 this variable is checked to decide whether to enable server replication, which means server replication never takes effect in this context.

Is the server replication feature provided by ps-lite? Thanks.

sync_timeout

I am fairly new to Makefiles and the like, so this may be a stupid question, but how do I set the value of sync_timeout in src/systems/manager.cc when making without having to hardcode a new value for it? ps-lite is being built as a dep in my build for wormhole, and I'm wondering what the most straightforward way to set the value of sync_timeout would be from a command line make call?

Build cxxnet-ps error in ps-lite

Running build_ps.sh, when compiling at -o bin/cxxnet.ps:

ps-lite/build/libps.a(ps_main.o): In function `ps::App::Create(int, char**)':
/media/dl/Coding/AI/ML/cxxnet/ps-lite/src/ps_main.cc:7: undefined reference to `CreateServerNode(int, char**)'
collect2: error: ld returned 1 exit status
make: *** [bin/cxxnet.ps] Error 1

I didn't modify anything. Thanks.

tiny demo project using ps-lite

Is there any project demonstrating the abilities and best practices of ps-lite?
mxnet is great, but too heavy for understanding ps-lite.
I am hoping for about a thousand lines of code that use ps-lite to enable multi-card training.

Memory leak ps_dist-inl.h

==39376== 13,413 (8 direct, 13,405 indirect) bytes in 1 blocks are definitely lost in loss record 491 of 495
==39376== at 0x4A05FD5: operator new(unsigned long) (vg_replace_malloc.c:324)
==39376== by 0x4C24F3: MShadowServerNode (ps_dist-inl.h:111)
==39376== by 0x4C24F3: CreateServerNode(int, char**) (nnet_ps_server.cpp:168)
==39376== by 0x4F8308: ps::App::Create(int, char**) (ps_main.cc:7)
==39376== by 0x5059DA: ps::Manager::Init(int, char**) (manager.cc:29)
==39376== by 0x508B15: ps::Postoffice::Run(int*, char***) (postoffice.cc:20)
==39376== by 0x4F8E44: StartSystem (ps.h:16)
==39376== by 0x4F8E44: main (ps_main.cc:14)

Does PS-lite provide node failure tolerance for server nodes ?

I tried playing with parameter_server linear example and killing a server process/node hangs the running process. Shouldn't the replicated node take over for the killed server as described in the paper ?

Any help in this regard will be highly appreciated.

Thanks,
Danish

SArray is buggy

#include <iostream>
#include "ps/sarray.h"

int main(int argc, char *argv[]) {
  ps::SArray<int> sa1{1, 2, 3, 4, 5, 6, 7, 8, 9};
  ps::SArray<int> sa2(sa1);  // sa2 shares sa1's buffer

  sa1.pop_back();  // only changes sa1's own size

  std::cout << sa1.size() << std::endl;
  std::cout << sa2.size() << std::endl;

  return 0;
}

outputs:
8
9

"sa1" and "sa2" should be the same, i think, by design.
But "size_" and "capacity_" are managed separately by each copy, which looks like a bug.
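
The reported output can be reproduced with a heavily simplified stand-in. MiniSArray below is hypothetical and is not the ps-lite class; it only assumes the layout described in the report, a shared buffer plus a size stored in each copy.

```cpp
#include <cstddef>
#include <initializer_list>
#include <memory>

// Minimal stand-in for SArray (hypothetical, heavily simplified): the buffer
// is shared between copies, but each copy keeps its own size_.
template <typename V>
class MiniSArray {
 public:
  MiniSArray(std::initializer_list<V> init)
      : size_(init.size()),
        ptr_(new V[init.size()], std::default_delete<V[]>()) {
    size_t i = 0;
    for (const V& v : init) ptr_.get()[i++] = v;
  }
  // The implicit copy constructor copies ptr_ (shared) and size_ (by value).
  void pop_back() {
    if (size_ > 0) --size_;
  }
  size_t size() const { return size_; }
  V* data() const { return ptr_.get(); }

 private:
  size_t size_;
  std::shared_ptr<V> ptr_;
};
```

Under this layout a copy behaves like a view of the same memory with its own length, similar to a slice, so a pop_back on one copy cannot be seen through the other.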

Slow memory leak?

Is it possible there's a slow memory leak in ps-lite? If I run the following simple example, the memory usage of the worker rises indefinitely at a rate of approximately 2MB / sec.

I've traced the origin of the leaked memory to van.cc's Recv(), but am unsure of who should be responsible for freeing it. Any advice or suggestions?

#include "ps.h"
typedef float Val;

int CreateServerNode(int argc, char *argv[]) {
  ps::OnlineServer<Val> server;
  return 0;
}

int WorkerNodeMain(int argc, char *argv[]) {
  using namespace ps;
  std::vector<Key> key = {1};
  std::vector<Val> recv_val;
  KVWorker<Val> wk;
  while (1) {
    wk.Wait(wk.Pull(key, &recv_val));
  } 
  return 0;
}

Build error in ps-lite when build from cxxnet

I posted same issue on cxxnet as well. I am not sure if this is issue related to ps-lite or cxxnet but the error definitely is when compiling ps-lite.

In file included from src/ps.h:13:0,
from src/ps_main.cc:1:
src/kv/kv_store_sparse.h:3:21: fatal error: dmlc/io.h: No such file or directory
compilation terminated.
make: *** [build/ps_main.o] Error 1

Why using template function for GetEnv?

include/ps/internal/utils.h:

template <typename V>
inline V GetEnv(const char *key, V default_val) {
  const char *val = Environment::Get()->find(key);
  if (val == nullptr) {
    return default_val;
  } else {
    return atoi(val);
  }
}

Both the comments of this function and where this function is called indicate the return value is of type int.

So I wonder why you use template.

Thanks.
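
For comparison, a standalone sketch of the same pattern is shown below. It swaps ps-lite's Environment class for std::getenv to stay self-contained, and GetEnvOr is a hypothetical name. What the template adds is that the caller picks the integral return type through the type of the default value; the parsing itself still goes through atoi.

```cpp
#include <cstdlib>

// Sketch of an environment lookup modelled on the snippet above, using
// std::getenv instead of ps-lite's Environment class (an assumption made to
// keep the example standalone). The template parameter lets the caller pick
// the integral return type through the type of the default value.
template <typename V>
inline V GetEnvOr(const char* key, V default_val) {
  const char* val = std::getenv(key);
  if (val == nullptr) return default_val;
  return static_cast<V>(std::atoi(val));
}
```

Because the value is parsed with atoi, anything outside the range of int is not handled, whatever V is.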

[install error] Run test_kv_app meet bind error

Hi, super Mu,
Running test_kv_app hits a bind error: when I build and run the ps-lite test program on my machine, it throws this exception.

system info:

ps-lite: newest
CentOS release 6.3 (Final)
[16:35:01] src/van.cc:69: Bind to role=scheduler, id=1, ip=127.0.0.1, port=-1, is_recovery=0
[16:35:01] ./include/dmlc/logging.h:208: [16:35:01] src/van.cc:70: Check failed: (my_node_.port) != (-1) bind failed
terminate called after throwing an instance of 'dmlc::Error'
  what():  [16:35:01] src/van.cc:70: Check failed: (my_node_.port) != (-1) bind failed
[16:35:01] src/van.cc:69: Bind to role=server, ip=10.89.23.50, port=33775, is_recovery=0
[16:35:01] src/van.cc:130: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.89.23.50, port=33775, is_recovery=0 } }
[16:35:01] src/van.cc:69: Bind to role=worker, ip=10.89.23.50, port=17815, is_recovery=0
[16:35:01] src/van.cc:130: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.89.23.50, port=17815, is_recovery=0 } }
./local.sh: line 37: 15894 Aborted                 (core dumped) ${bin} ${arg}

Then I moved to a Mac, and it throws another exception when I compile the project.

system info:

ps-lite: newest
Mac 10.10.5
protobuf: stable 2.6.1 (bottled), devel 3.0.0-beta-4, HEAD
Protocol buffers (Google's data interchange format)
https://github.com/google/protobuf/
/usr/local/Cellar/protobuf/3.0.0-beta-4 (336 files, 17M) *
  Built from source on 2016-08-29 at 11:12:20
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/protobuf.rb
ps-lite/deps/lib -L/Users/dzh/github/ps-lite/deps/lib -lprotobuf-lite -lzmq -pthread
Undefined symbols for architecture x86_64:
  "google::protobuf::internal::WireFormatLite::ReadString(google::protobuf::io::CodedInputStream*, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*)", referenced from:
      ps::PBNode::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)     in libps.a(meta.pb.o)
  "google::protobuf::internal::WireFormatLite::WriteBytes(int, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::io::CodedOutputStream*)", referenced from:
      ps::PBMeta::SerializeWithCachedSizes(google::protobuf::io::CodedOutputStream*) const in libps.a(meta.pb.o)
  "google::protobuf::internal::WireFormatLite::WriteString(int, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::io::CodedOutputStream*)", referenced from:
      ps::PBNode::SerializeWithCachedSizes(google::protobuf::io::CodedOutputStream*) const in libps.a(meta.pb.o)
  "google::protobuf::internal::WireFormatLite::ReadBytes(google::protobuf::io::CodedInputStream*, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*)", referenced from:
      ps::PBMeta::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)     in libps.a(meta.pb.o)
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
make: *** [tests/test_connection] Error 1

Best approach to adding new commands?

In the v.1 version of ps-lite, the command inside a message was an int (hidden within protobuf). This made it easy to add new commands. The new version uses an enum for a command, so it doesn't seem to be extensible.

I'm trying to implement a distributed tracker (similar to the direction difacto is headed). However, it's not clear to me what the best way is to do this on top of a Message. Is the best approach handling the command exclusively within the Message payload (e.g. the vector of safe arrays)? Are there other ways to do this? Or should ps-lite not be used at all for this purpose?

Thanks for your advice!

question about the global model

Hi @mli
I have a question about this project. According to some of your docs, such as the asynchronous SGD one, the worker nodes get the global model from all servers. Do you think this will cause an issue, or can we just pull a partial model from the servers? Thank you very much!
