Eyescale / Collage
Cross-platform C++ library for building heterogeneous, distributed applications
Home Page: http://libcollage.net/
License: Other
NMAKE : fatal error U1073: don't know how to make 'preinstall'
Stop.
CMake Error: Error processing file: cmake_install.cmake
CMake Error: Error processing file: cmake_install.cmake
CMake Error: Error processing file: cmake_install.cmake
CMake Error: Error processing file: cmake_install.cmake
At least the initial handshake commands to exchange information are not byte-swapped. Need test setup, current Cadmos setup does not allow us to log onto Intel admin servers.
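As a rough illustration of what byte-swapping a handshake could look like, the sketch below uses a known magic constant to detect the sender's byte order and swaps the remaining fields when needed. `HandshakeHeader`, `kMagic` and `normalize` are hypothetical names chosen for this example; they are not Collage's wire format or API.

```cpp
#include <cassert>
#include <cstdint>

// Reverse the byte order of a 32-bit value.
inline std::uint32_t byteswap32(std::uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Hypothetical handshake header, NOT the actual Collage wire format.
struct HandshakeHeader
{
    std::uint32_t magic;       // known constant, reveals the sender's byte order
    std::uint32_t payloadSize; // example field that needs swapping
};

// Hypothetical magic value; it is not symmetric under byte reversal.
constexpr std::uint32_t kMagic = 0xC0FFEE01u;

// Converts a received header to host byte order in place.
// Returns false if the magic matches in neither byte order.
inline bool normalize(HandshakeHeader& header)
{
    if (header.magic == kMagic)
        return true; // sender has the same endianness
    if (byteswap32(header.magic) != kMagic)
        return false; // not a valid handshake
    header.magic = kMagic;
    header.payloadSize = byteswap32(header.payloadSize);
    return true;
}
```

A receiver would call `normalize()` on the raw header before interpreting any other field.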
I have come to wake up this sleeping project.
Which Boost version fits Collage and its subprojects best?
I used boost_1_75_0 to compile and got errors: some Boost symbols cannot be found.
Collage's Programming and User Guide lives in the Eyescale EqDocs repository:
https://github.com/Eyescale/EqDocs/raw/master/Developer/ProgrammingGuide/paper.pdf
It is hard to find the guide when using Collage on its own.
Suggestion: add a link to it in Collage's README.md.
Requested by @delyas: For larger client pools with a lot of requests (mapping), the appNode command queue blows up, blocking the receiver thread and eventually the sending clients.
This has the potential for deadlocks, but I can't see an obvious pattern atm.
The BufferCache deletes free Buffers during compact. After the refCount of a buffer reaches 0, i.e., it is free, notifyFree() is still being processed, which leaves a small window in which the buffer might get deleted while it is still inside notifyFree().
The rtt branch contains a broken attempt to fix this.
I found the ping() and pingIdleNodes() commands, but could not find any online-state management for the nodes.
Does Collage maintain the nodes' state via heartbeat and provide a state request API, or an API to register online/offline callbacks?
If not, do you have any suggestions for building it myself?
Thanks very much!
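One possible application-side approach, sketched under the assumption that the application can observe incoming traffic (for example ping replies) per node: remember when each node was last seen and treat it as offline after a timeout. `LivenessTracker`, `recordActivity` and `isOnline` are hypothetical names; none of them are Collage API, and the string node IDs are only a stand-in for whatever identifier the application uses.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>

// Hypothetical application-side helper; none of these names are Collage API.
class LivenessTracker
{
public:
    using Clock = std::chrono::steady_clock;

    explicit LivenessTracker(Clock::duration timeout)
        : _timeout(timeout)
    {
    }

    // Call whenever traffic (for example a ping reply) arrives from a node.
    void recordActivity(const std::string& nodeID, Clock::time_point now)
    {
        _lastSeen[nodeID] = now;
    }

    // A node counts as online if it was seen within the timeout window.
    bool isOnline(const std::string& nodeID, Clock::time_point now) const
    {
        const auto it = _lastSeen.find(nodeID);
        return it != _lastSeen.end() && now - it->second <= _timeout;
    }

private:
    Clock::duration _timeout;
    std::unordered_map<std::string, Clock::time_point> _lastSeen;
};
```

Passing the current time in explicitly keeps the helper easy to test. If it is used from several threads, the map would additionally need a mutex.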
In relation to @julitopower's work:
The endianness conversion code rightly introduced using buffers for async receives; the send interface should follow.
Fundamentally the same bug as Eyescale/Equalizer#100
Sub-issue to Eyescale/Equalizer#146
The 'this' pointer in the dispatcher unit test FooBar::cmd is wrong:
ctor 0xfffff9334a8
cmd 0xfffff9334b8
See pull request Eyescale/Equalizer#178
Sending custom node commands is poorly documented after packet refactoring. Document and add unit test?
Have a fixed-size array in OStream, into which the data is written. If it overflows, use a heap-allocated Buffer as overflow storage.
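A minimal sketch of this idea, assuming a byte-oriented write interface: data goes into a fixed-size inline array first, and the first write that does not fit moves everything to a heap-allocated vector. `SmallWriteBuffer` is a hypothetical stand-in and does not reflect the actual co::OStream interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch, not the co::OStream interface.
template <std::size_t N>
class SmallWriteBuffer
{
public:
    void write(const void* data, std::size_t bytes)
    {
        if (!_overflowed && _size + bytes <= N)
        {
            // Fast path: the data still fits into the inline array.
            std::memcpy(_inline + _size, data, bytes);
        }
        else
        {
            if (!_overflowed)
            {
                // First overflow: move the inline content to the heap.
                _heap.assign(_inline, _inline + _size);
                _overflowed = true;
            }
            const auto* first = static_cast<const std::uint8_t*>(data);
            _heap.insert(_heap.end(), first, first + bytes);
        }
        _size += bytes;
    }

    const std::uint8_t* data() const
    {
        return _overflowed ? _heap.data() : _inline;
    }
    std::size_t size() const { return _size; }
    bool usesHeap() const { return _overflowed; }

private:
    std::uint8_t _inline[N];         // fixed-size storage, no allocation
    std::vector<std::uint8_t> _heap; // overflow storage
    std::size_t _size = 0;
    bool _overflowed = false;
};
```

With this layout, small packets never touch the allocator, while large ones pay for one copy of at most `N` bytes when they overflow.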
I've got a distributed object that inherits from co::Serializable. This object serializes a few kilobytes of data all at once, which is then pushed to slave instances. No subsequent commits happen to that object.
Everything was working fine and dandy with previous versions of EQ (this is actually code that's barely been touched in two years or so). Recently, I pulled from the Eyescale repos to the latest versions of all the libraries (Collage commit: 378ab5c).
Now I'm noticing the following behavior. If I commit less than 600-700 bytes of data, everything works fine. However, if I serialize more than that, I get a bunch of errors within co::fdConnection ("Invalid argument (22)", "Error during write after 0 bytes, closing connection", etc).
On the support forums (http://software.1713.n2.nabble.com/Issue-with-co-serializable-in-recent-Collage-version-td7585328.html#a7585332) Daniel noted that this is related to compression. Indeed, deactivating compression on the object seems to restore functionality.
Related to Eyescale/Equalizer#146
Found during #27: At least for map and sync, the CMs access _object which dangles if the object has been deregistered/unmapped in the meantime.
Setup: multiple threads A share the same barrier. A different thread B commits multiple versions ahead, while the threads A sync and enter the barrier.
The sync after leaving the barrier can be executed before another thread has left the barrier: as specified for pthread_cond_broadcast(), the order of execution is determined by the scheduling policy. As unpack() during sync() resets the monitor/barrier, this causes a deadlock.
Reported by @hernando: when launching a node:
localNode.cpp:296 26 Invalid global variables string: ##100#100#100#10#1#5#20#3#64#5#65000#524288#1#1#8#512#5000#-1#300000#1023##, using default global variables.
Document typical use case, features and conveniences.
syncObject( object, id, version, node ) is an optimized version of map + unmap which just copies the given instance data onto the object without mapping it. It will mostly be used by static objects such as init data in Equalizer.
reproduced in DG and eqPly
open two windows on different GPUs
both windows are not swapsynced
config: https://gist.github.com/3524955
start eqPly with lucy and this config, navigate and observe :)
The test tests/connection.cpp fails on a machine with an Intel i7-3630QM @ 2.40 GHz because the protocol performs very poorly and the watchdog goes off.
If I reduce NPACKETS until the test finishes in time, the reported bandwidth is less than 1 MB/s. On another machine with an Intel i7 (non-mobile, but I can't recall the model and can't check it right now) I get 11 MB/s. On an Intel Xeon E5645 2.4 GHz the performance is 96 MB/s.
I've also noticed that it affects other network applications (Spotify streaming freezes on the machine with the mobile CPU).
Is it thread safe to return a reference in:
const InstanceCache::Data& InstanceCache::operator[]( const UUID& id )
?
I don't see the lock being returned as part of InstanceCache::Data.
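To illustrate the concern in general terms: a reference returned from a locked accessor outlives the lock, so another thread may modify or erase the entry while the caller still reads it. One common alternative is to return a copy made while the lock is held, as in the sketch below. `LockedCache` is a hypothetical stand-in, not co::InstanceCache, and whether copying is affordable for the real data is an open question.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a cache keyed by ID; not co::InstanceCache.
class LockedCache
{
public:
    struct Data
    {
        std::vector<int> versions;
    };

    void add(const std::string& id, int version)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _items[id].versions.push_back(version);
    }

    // Returns a copy made while the lock is held. The caller's snapshot
    // stays valid even if another thread changes the entry afterwards.
    Data get(const std::string& id) const
    {
        std::lock_guard<std::mutex> lock(_mutex);
        const auto it = _items.find(id);
        return it == _items.end() ? Data() : it->second;
    }

private:
    mutable std::mutex _mutex;
    std::map<std::string, Data> _items;
};
```

Returning `const Data&` from `get()` instead would release the lock before the caller touches the data, which is the situation the question is about.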
I've tested this on two different machines with IB interfaces, with the same result: the test hangs at line 132 when calling writer->connect().
I've tried moving acceptSync to the reader thread, just in case connect was blocking because of that, but then it fails in _checkCQ at rmdaConnection.cpp:539 in one machine (EPFL vizcluster) and rmdaConnection.cpp:595 in the other one (Lugano cluster).
On the other hand, coNetPerf works.
Some parameters (like "${CMAKE_SOURCE_DIR}" and "${DEFINES_FILE_IN}") are passed to CMake commands in your build scripts without being enclosed in quotation marks. These places will cause build difficulties if the variables contain special characters such as spaces.
I recommend applying the advice from a Wiki article.
OS: Ubuntu 13.04
g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3
I got the following compilation error after typing
make Equalizer
[ 22%] Performing build step for 'Collage'
[ 1%] Building CXX object co/CMakeFiles/Collage.dir/udtConnection.cpp.o
/home/lguyot/develop/Buildyard/src/Collage/co/udtConnection.cpp:158:10: error: 'void co::{anonymous}::CTCP::onACK(int32_t)' marked override, but does not override
See http://www.kernel.org/doc/man-pages/online/pages/man2/eventfd.2.html and RDMAConnection.
While compiling a Release version of Collage with gcc 4.4 (for instance on RHEL6), you'll get a warning (as error) regarding breaking strict aliasing rules:
Building CXX object tests/CMakeFiles/dataStream.dir/dataStream.cpp.o
cc1plus: warnings being treated as errors
/home/nachbaur/dev/bytest/Build/install/include/lunchbox/bitOperation.h: In function ‘int testMain(int, char**)’:
/home/nachbaur/dev/bytest/Build/install/include/lunchbox/bitOperation.h:111: error: dereferencing pointer ‘value.133’ does break strict-aliasing rules
/home/nachbaur/dev/bytest/Build/install/include/lunchbox/bitOperation.h:119: note: initialized from here
/home/nachbaur/dev/bytest/Build/install/include/lunchbox/bitOperation.h:142: error: dereferencing pointer ‘value.136’ does break strict-aliasing rules
/home/nachbaur/dev/bytest/Build/install/include/lunchbox/bitOperation.h:150: note: initialized from here
Google tells you to disable strict-aliasing entirely on gcc as it is broken by design, especially if you follow Linus' argumentation (http://www.mail-archive.com/[email protected]/msg01647.html). Later versions of gcc do not have this issue, though. Also Clang 3.x works fine with the current code.
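An alternative to disabling strict aliasing is to avoid the pointer cast altogether. The sketch below assumes the warning comes from reinterpreting a floating-point value through an integer pointer, which has not been verified against bitOperation.h; it copies the bits with `std::memcpy` instead, which is well-defined. The helper names `floatBits`, `floatFromBits` and `swap32` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reverse the byte order of a 32-bit integer.
inline std::uint32_t swap32(std::uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Read the bit pattern of a float without casting its address to an
// integer pointer, so no strict-aliasing rule is violated.
inline std::uint32_t floatBits(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Inverse of floatBits(): build a float from a 32-bit pattern.
inline float floatFromBits(std::uint32_t bits)
{
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

A byte-swapped float for the wire would then be `swap32(floatBits(x))`, and the receiver would apply `floatFromBits(swap32(wireBits))`. Compilers typically optimize the `memcpy` calls away.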
Use lunchbox::Buffer* _data in co::Buffer to point to master buffer, and keep the pointers in the lunchbox::Buffer unchanged.
Make lunchbox::Buffer member private again
Use cloning in ObjectStore (for CommandFunc 'this')
Related to Eyescale/Equalizer#145
at least when mapping with VERSION_NONE. Reported by @delyas.
The connection is added to the connection set before the recv or accept is posted, which can cause it to fire in the recv thread and cause the recvSync to be called before the recvNB from the thread calling connect().
More info from logfile:
...
[ 209s] /home/abuild/rpmbuild/BUILD/Collage-1.0.0/tests/queue:1:1: error: stray '\177' in program
[ 209s] (binary ELF data)
...
https://pmbs.links2linux.de/build/home:awissu/openSUSE_13.1/x86_64/Collage/_log
Project:
https://pmbs.links2linux.de/package/show/home:awissu/Collage
This test fails in 20 percent of the builds on the Lugano cluster build machine: https://bbpcode.epfl.ch/ci/job/viz.latest/854/platform=cscsviz/consoleFull
The 'simultaneous connect' code path in co::LocalNode::_cmdConnectReply() will invoke _closeNode() on one of the connections, which results in a notifyDisconnect callback.
An application will receive two LocalNode::notifyConnect callbacks, followed by one notifyDisconnect, all referring to the same remote node.
An application that uses notifyDisconnect to detect the loss of connectivity to a remote node needs to implement additional logic to distinguish these spurious disconnect notifications from those actually indicating a loss of connectivity.
It might still be desirable to have matching numbers of connect and disconnect events, but in that case the documentation should clearly state the possibility of spurious disconnect notifications.
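The additional logic mentioned above could be as simple as counting connections per remote node, assuming the callbacks really arrive in matching numbers as described (two connects, one spurious disconnect, and later one real disconnect). `ConnectionCounter` and its methods are hypothetical application-side names; the application would call them from its notifyConnect and notifyDisconnect overrides.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical application-side bookkeeping, not Collage code.
class ConnectionCounter
{
public:
    // Call from the application's notifyConnect handling.
    void onConnect(const std::string& nodeID) { ++_counts[nodeID]; }

    // Call from the application's notifyDisconnect handling.
    // Returns true only when the last known connection to the node is gone.
    bool onDisconnect(const std::string& nodeID)
    {
        const auto it = _counts.find(nodeID);
        if (it == _counts.end())
            return false; // disconnect for an unknown node: ignore
        if (--it->second > 0)
            return false; // spurious: another connection is still open
        _counts.erase(it);
        return true;
    }

private:
    std::map<std::string, unsigned> _counts;
};
```

The string key is only a placeholder for whatever node identifier the application already uses.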
During an RSP session, one of the clients gets disconnected and the whole session stalls. It can be reproduced using the eqPly example by loading a model such as dragon_vrip. The issue occurs on a Windows-based cluster.
Steps to reproduce:
pre-launch clients with RSP option
rttScaleClient.exe --eq-client --eq-listen 10.0.0.2:12345 --eq-listen RSP#102400#239.255.42.42#10.0.0.2#11147##
run master
eqPly.exe -m \user\nlu\Models\dragon_vrip.ply --eq-config \user\nlu\rttScaleConfig_clean.eqc
config: http://pastebin.com/DjWXj87Q
Originally reported by @delyas as Eyescale/Equalizer#265:
When multiple concurrent input streams are active, the recv thread blocks 200-2000 ms on a single receive. The current working theory is that a partial packet is received, and the kernel does not receive more data on this socket since others go unprocessed, until some internal timeout is reached.
Implement threadpool-based receiving to see if that fixes the issue.
From the barrier unit test (https://s3.amazonaws.com/archive.travis-ci.org/jobs/20283377/log.txt) and user report (see BBPRTN-307):
1: 22250 R PN2co9Loc src/Collage/co/connection.cpp:256 14 Assert: _impl->buffer [No pending receive on TCPIP#102400#testing-worker-linux-4-1-25432-linux-5-20283377##3109#default#] in:
1: 9: lunchbox::abort()
1: 8: co::Connection::recvSync(lunchbox::RefPtr<co::Buffer>&, bool)
1: 7: co::LocalNode::_readHead(lunchbox::RefPtr<co::Connection>)
1: 6: co::LocalNode::_handleData()
1: 5: co::LocalNode::_runReceiverThread()
1: 4: co::detail::ReceiverThread::run()
1: 3: lunchbox::Thread::_runChild()
1: 2: lunchbox::Thread::runChild(void*)
1: 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x2b8873267e9a]
1: 0: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x2b8873d4a4bd]
1: 22250 R PN2co9Loc c/Lunchbox/lunchbox/debug.cpp:44 16
1/21 Test #1: barrier ..........................***Exception: Other 0.03 sec
Reported by @delyas: The current ConnectionSet::select processing favors the first connections, which causes timeouts when a set of clients pushes a lot of requests fast (mapping) onto a single node, which will then first serve the 'first' clients.
Processing this in a round-robin fashion will ensure that all clients make progress and should speed up the rcv thread since commands are fully read already.
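The round-robin idea can be sketched independently of the actual ConnectionSet code: remember where the previous selection stopped and start the next scan right after it. In the sketch below, `ready[i]` stands in for "connection i has data pending", and `RoundRobinSelector` is a hypothetical name, not part of Collage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical selector; ready[i] means "connection i has data pending".
class RoundRobinSelector
{
public:
    // Returns the index of the next ready connection, scanning from the
    // position after the one served last, or -1 if none is ready.
    int select(const std::vector<bool>& ready)
    {
        const std::size_t n = ready.size();
        for (std::size_t i = 0; i < n; ++i)
        {
            const std::size_t index = (_next + i) % n;
            if (ready[index])
            {
                _next = (index + 1) % n; // next scan starts after this one
                return static_cast<int>(index);
            }
        }
        return -1;
    }

private:
    std::size_t _next = 0;
};
```

With a fixed scan order, a busy connection at index 0 would be served on every call. Here it is served again only after every other ready connection has had a turn.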
The old implementation sent large data blobs at the end of the packet directly on locked connections, since the send call implicitly finished the packet. The new << operators do not know when the packet ends and cannot send before the final size is known. (Eyescale/Equalizer#145)