uwsampa / grappa
Grappa: scaling irregular applications on commodity clusters
Home Page: grappa.io
License: BSD 3-Clause "New" or "Revised" License
Mostly just consists of reductions, useful for implementing custom statistics' merge_all().
VampirTrace support was left out of the initial CMake port.
The plan is to enable Vampir as a third mode option in configure:
./configure --mode=Vampir --gen=Make
# creates build/Make+Vampir
(Some code to do this is already there; it just needs to be re-integrated and tested.)
Currently only one allreduce<T, ReduceOp> (for a given T and ReduceOp**) can be outstanding at a time, since messages are not tagged. Since this is an SPMD operation, but we can support multiple SPMD contexts naturally, we should be able to support concurrent allreduces of the same type.
One simple solution is to add an additional template argument that distinguishes the specific use of allreduce in the code (basically a static tag).
** (More accurately, all reduces with type T conflict, since we template the static Reduction storage on just T.)
(new-aggregator-merge)
Used block_until_sent() to wait until it is safe to memset the queue to 0, but this is no longer safe inside of handlers. Currently commented out. Extend mark_sent to do the memset.
We've talked about this for a while, but to put it into concrete terms:
Tasks are limited to a bounded number of arguments by the fact that we store unstarted tasks in queues of fixed-size elements for performance.
In order to provide extra arguments, a task can be given a global address referring to a struct of additional arguments which it can cache on startup and use.
However, we often spawn many tasks that share a number of arguments, such as base addresses to arrays. It is highly inefficient to acquire identical argument structs multiple times by different tasks on the same node.
Instead, we should cache these shared arguments once and share the copy among tasks on that node. So, ideas so far for implementing this:
localize(), which just finds the nearest section of global memory that is local.

Currently you create a generic delegate operation by passing a functor, which holds any sent and returned values. This is a bit messy, as well as potentially inefficient, since the whole functor must be copied both ways. Replace it with a templated delegate: <SendDataType, ReturnDataType, ReturnDataType (*Func)(SendDataType)>.
In file included from tasks/Task.hpp:12:0,
from tasks/Task.cpp:7:
tasks/StealQueue.hpp: In instantiation of ‘int64_t StealQueue::workShare(int16_t, uint64_t) [with T = Task; int64_t = long int; int16_t = short int; uint64_t = long unsigned int]’:
tasks/Task.cpp:139:61: required from here
tasks/StealQueue.hpp:567:79: warning: narrowing conversion of ‘amount’ from ‘uint64_t {aka long unsigned int}’ to ‘int’ inside { } [-Wnarrowing]
might want to make this more specific sometime.
I neglected to clear out the wait queue pointer after waking a thread.
Currently, tasks are forced to be a fixed size of 64 bytes: a 64-bit function pointer and three 64-bit arguments. This restriction keeps the task queue simple. We could support tasks of different sizes with perhaps little additional complexity; the gain would be the flexibility to have larger tasks, plus space and copying efficiency for small tasks.
In the designs so far, stealing work in chunks efficiently tends to be the main challenge. We would like to do something that looks like a memcpy; on the other hand, stealing is still linear, and traversing the queue might not be that much more expensive.
Designs:
All the function pointers passed to spawn() functions are known at compile time, and certainly at the start of the program. These can be registered with the task queue to create appropriately sized queues.
Should be pretty straightforward to implement a version of some parallel loops that does a reduction and spits out the reduced value at the end.
Something like:
long total = 0;
forall_localized(array, N, &total, [](int64_t i, long& elem, long& total){
total += elem;
});
Could also roll it into a generic reduction:
long total = reduce(array, N, [](long& e, long& total) {
total += e;
});
I imagine implementation would look something like:
T reduce(GlobalAddress<T> array, size_t N, F func) {
  T total = T();
  on_all_cores([array, func, &total]{
    T local_total = T();
    T* base = array.localize();   // this core's local piece of the array
    forall_here([base, func, &local_total](int64_t i) {
      func(base[i], local_total);
    });
    local_total = reduce(local_total, func);  // combine totals across cores
    if (mycore() == 0) total = local_total;
  });
  return total;
}
Bad stuff happens!
Support buffered reads of line-based text files through Grappa file IO.
If we put the FileIO library in charge of parallelization (as in, for example, the Grappa_read_array API), then the user will provide a callback on_line(char* line).
An ordered line-based interface also has to provide a line index (on_line(int64_t i, char* line)), but that requires an extra pass through the file to do a prefix sum. A reasonable simplification might be to accept only normalized files where all lines are equal length.
BrandonH saw the aggregator complain about too-big payload messages in intsort, in the allreduce.
Ensure everything has the right license header, and update what's in our LICENSE file.
system/, bin/ – our AGPL license
applications/* – whatever the original app was
compiler/ – Chapel's BSD license (even just in its branch)

@nelsonje and I discussed the idea of doing "flat combining" on GCE complete messages. The basic idea is to have a message going to each destination core stored in the GCE, as well as a count of outgoing completes to each core.
It's unsafe to modify a message after it's been enqueued, which is why we have the counter.
mark_sent() is overridden so that when the message is sent, it checks the outstanding-completes count for that core; if the count is > 0, the message takes that count and re-enqueues itself. This should strictly improve the number of messages we send for completions, and in the limit reduce the traffic for feed-forward delegates by half.
Use the @deprecated tag in Doxygen comments. But we should also use GCC/Clang-compatible function attributes to get compiler warnings.
Specifically, make sure we have working:
void finish([]{ });
void forall(int64_t,size_t,[](int64_t){ });
void forall(GlobalAddress<T>,size_t,[](T&){ });
I think we wanted to make it simpler to include most of the Grappa functionality with a single include. This should require just a bit of refactoring to avoid include cycles.
I think what we want is for Grappa.hpp to include:
collective operations (on_all_cores, reduce, etc.)
delegates (Delegate.hpp, AsyncDelegate.hpp)
parallel loops (ParallelLoop.hpp)
GlobalAllocator.hpp
Addressing.hpp
Tasking.hpp
Array.hpp (for things like memset, array_str)
They can presumably include things like GlobalHashSet separately.
I know we've rewritten "reduce" so many times now, but this idea is cool.
Proposing to extend the symmetric address reduce (1181e16) with something that's integrated into parallel loops.
auto sum = symmetric_global_alloc<long>();
auto total = on_all_cores( reduce<add>(sum), [](long* sum) {
*sum += foo();
});
// or same could be done with `forall`:
auto array = global_alloc<long>(N);
auto total = forall(array, N, reduce<add>(sum), [](long& v, long* sum) {
*sum += foo(v);
});
Another thing to consider is to make the reduction part of the loop's sync object. So if we supported arbitrary GCE's for loops, you could just make a Reduce sync object and then have the 'return' from the loop lambda be the thing to reduce:
auto sinc = GlobalCompletionEvent<Reduce<add>>::create();
auto total = forall(array, N, sinc, [](long& v){
return foo(v);
});
LOG(INFO) << "total = " << sinc.get();
This could be done as part of #119.
This issue applies only to existing tests. We'll have another issue for each new test that is missing.
This should be simple---we just need to walk the suspended coroutines and count them.
We've been looking for a better way to do statistics gathering for a while. I thought I'd start a new Issue related to this so we can gather our various ideas and brainstorm a solution.
Grappa::barrier deadlocks when Grappa::cores()==1
Example inputs:
make mpi_run TARGET=pagerank.exe NNODE=2 PPN=2 GARGS='--num_starting_workers=64 --logN=4 --nnz_factor=4'
make mpi_run TARGET=graph.exe NNODE=2 PPN=2 GARGS='--num_starting_workers=64 -- -s 4 -e 4'
[0]: 1 2 4 5 12 15
0: [1]: -1 0 1 2 3 4 5 6 8 11 12 14
0: [2]: -1 0 1 2 3 4 5 6 8 11 12 14 15
0: [3]: 1 2 4 5 9 11 12 15
0: [4]: 0 1 2 3 5 6 7 8 9 10 12 14 15
0: [5]: 0 1 2 3 4 5 7 10 12 14 15
0: [6]: 0 1 2 3 4 5
0: [7]: 4 5 15
0: [8]: 1 4 10
0: [9]: 1 3 4 15
0: [10]: 4 5 8 15
0: [11]: 1 3 14
0: [12]: 0 1 2 3 4 5 14 15
0: [13]: 1 15
0: [14]: 1 4 5 6 11 12 15
0: [15]: 0 1 2 3 4 5 7 9 10 12 13 14
We would like to be able to issue multiple delegate operations and then block on them completing later. We've discussed some ideas for how to do this in Issue #23, including things like returning the FullEmpty<> to block on, or passing in an object where results will be filled in.
This is to allow for:
On new-messages-old-aggregator, the readyQ is currently checked after checking for new task work. Since the task loop expects a stealing thread to suspend the scheduler, when there is just 1 core the stealing thread keeps getting to run.
possible choices:
choice 2 seems best
In BFS, occasionally, early. Graph generation.
@bmyerz: Looks like there's some FullEmpty thing in steal_locally that gets stuck, causing the application not to complete. It seems we could solve this either by fixing the FullEmpty suspend, or by just not "counting" it when we try to determine whether the app can finish.
Here's the backtrace of the culprit task:
#0 Grappa::FullEmpty<long>::block_until (this=0x22ead00, desired_state=128) at FullEmpty.hpp:42
#1 0x00004008077a5000 in ?? ()
#2 0x0000000000481d3b in Grappa::impl::StealQueue<Grappa::impl::Task>::steal_locally(short, long) ()
#3 0x0000000000481eab in Grappa::impl::TaskManager::checkPull() ()
#4 0x000000000047d4d2 in Grappa::impl::TaskManager::waitConsumeAny (this=0x7c17a0, result=Unhandled dwarf expression
opcode 0xf3
) at tasks/Task.cpp:291
#5 0x000000000047dc20 in Grappa::impl::TaskManager::getWork (this=0x7c17a0, result=0x4008078259c0)
at tasks/Task.cpp:153
#6 0x000000000047ad58 in Grappa::impl::workerLoop (me=Unhandled dwarf expression opcode 0xf3
) at tasks/TaskingScheduler.cpp:144
#7 0x00000000004871c7 in tramp (me=0x22d3000, arg=0x170d000) at tasks/Thread.cpp:69
#8 0x0000000000487288 in _makestack () at stack.S:192
#9 0x0000000000000000 in ?? ()
BrandonH may have seen this in BFS. Haven't seen it in intsort. Graph generation maybe?
In Barrier.hpp we have Grappa::barrier().
It blocks the calling task until one task on each node has entered.
The implementation properly allows other threads (notably the network progress threads) to continue.
The use of barrier() is very low-level and limited right now because of its global-ness.
Multiple barriers in an application must not be concurrently callable or else the barriers will be mismatched.
One example of the behavior we may want is UPC's lexical barrier, upc_barrier.
We'll need to mark coros that are suspended but on the unassigned queues, so that GDB can do the filtering.
This will be easier once we integrate Thread and coro into a single Worker struct.
Grappa::allreduce deadlocks when Grappa::cores()==1.
Recommended to reproduce:
make mpi_test TARGET=Collective_tests.test NNODE=1 PPN=1 VERBOSE_TESTS=1 GARGS=' --num_starting_workers=64'
(long term improvement)
Find a better way to do comprehensive regression testing of Grappa infrastructure. Possible implementation approaches:
Desired features:
master, for example

One idea for having shared MessagePools is to associate a pool with a GCE, so that all async delegates spawned within a GCE scope (i.e. in a forall_localized) can, in addition to sharing the functor and synchronizing with the GCE, also take advantage of the shared message pool to avoid blocking on issuing messages.
Recently changed the way shared_pool works, and had to change call_async() to call send_heap_message() directly (df452fc), so it now ignores the passed pool argument. We should change the API (and its uses) to no longer take an explicit pool.
Implement a shared heap-allocated message pool to replace send_heap_message.
Here's an idea for the compiler: can we add a qualifier to functions that indicates they may block or context switch? And can we add a qualifier to blocks of code that requires no such functions be called within?
Perhaps this could reuse some of the exception/noexcept mechanisms in the compiler.
Obviously this only works for some cases, but it could be helpful.
Thoughts?
In spmv_mult benchmark, during pack_vtx_edges, there is apparent deadlock, where all nodes are polling. It is not due to too few stacks.
make mpi_run TARGET=spmv_mult.exe PPN=4 NNODE=12 \
GARGS='--num_starting_workers=1024 --steal=1 --chunk_size=100 --aggregator_autoflush_ticks=2000000 --async_par_for_threshold=1 --periodic_poll_ticks=20000 --nnz_factor=16 --logN=22 --row_distribute=true'
Turning down PPN makes this deadlock not occur.
Not sure exactly when it stopped working, but sometime after introducing the new stats and now.
a more accurate name for these
For refactoring, I'm also concerned about tooling that renaming might break.
The only thing I can think of is the json parser though.
This is in there from a looooong time ago, and it should really go, because it's led to a few really hard to track down bugs.
Addressing.hpp:
/// generic cast operator
/// TODO: do we really need this? leads to unneccessary type errors...
template< typename U >
operator GlobalAddress< U >( ) {
GlobalAddress< U > u = GlobalAddress< U >::Raw( storage_ );
return u;
}
Mainly on ruby scripts that are symlinked.
Let me know if you cannot replicate it
Migrate uses of the old statistics.
Be sure to run tests to ensure the new stats are reasonable.