uwsampa / grappa
Grappa: scaling irregular applications on commodity clusters
Home Page: grappa.io
License: BSD 3-Clause "New" or "Revised" License
Mostly just consists of reductions, useful for implementing custom statistics' merge_all().
VampirTrace support was left out of the initial CMake port.
The plan is to enable Vampir as a third mode option in configure:
./configure --mode=Vampir --gen=Make
# creates build/Make+Vampir
(Some code to do this is already there; it just needs to be re-integrated and tested.)
Currently only one allreduce<T, ReduceOp> (for a given T and ReduceOp**) can be outstanding at a time, since messages are not tagged. Since this is an SPMD operation, but we can support multiple SPMD contexts naturally, we should be able to support concurrent allreduces of the same type.
One simple solution is to add an additional template argument that distinguishes the specific use of allreduce in the code (basically a static tag).
** (More accurately, all reduces with type T conflict, since we template the static Reduction storage on just T.)
(new-aggregator-merge)
Used block_until_sent() to wait until it is safe to memset the queue to 0, but this is no longer safe inside of handlers. Currently commented out. Extend mark_sent to do the memset.
We've talked about this for a while, but to put it into concrete terms:
Tasks are limited to a bounded number of arguments by the fact that we store unstarted tasks in queues of fixed-size elements for performance.
In order to provide extra arguments, a task can be given a global address referring to a struct of additional arguments which it can cache on startup and use.
However, we often spawn many tasks that share a number of arguments, such as base addresses to arrays. It is highly inefficient to acquire identical argument structs multiple times by different tasks on the same node.
Instead, we should cache these shared arguments once and share the copy among tasks on that node. So, ideas so far for implementing this:
localize(), which just finds the nearest section of global memory that is local.

Currently you create a generic delegate operation by passing a functor, which holds any sent and returned values. This is a bit messy, as well as potentially inefficient, since the whole functor must be copied both ways. Replace it with a templated delegate: <SendDataType, ReturnDataType, ReturnDataType (*Func)(SendDataType)>.
In file included from tasks/Task.hpp:12:0,
from tasks/Task.cpp:7:
tasks/StealQueue.hpp: In instantiation of ‘int64_t StealQueue::workShare(int16_t, uint64_t) [with T = Task; int64_t = long int; int16_t = short int; uint64_t = long unsigned int]’:
tasks/Task.cpp:139:61: required from here
tasks/StealQueue.hpp:567:79: warning: narrowing conversion of ‘amount’ from ‘uint64_t {aka long unsigned int}’ to ‘int’ inside { } [-Wnarrowing]
might want to make this more specific sometime.
I neglected to clear out the wait queue pointer after waking a thread.
Currently, tasks are forced to be a fixed size of 64 bytes: a 64-bit function pointer and three 64-bit arguments. This restriction keeps the task queue simple. We could support tasks of different sizes with perhaps little additional complexity; the gain would be the flexibility to have larger tasks, plus space and copying efficiency for small tasks.
In the designs so far, stealing work in chunks efficiently tends to be the main challenge. We would like to do something that looks like a memcpy; on the other hand, stealing is still linear, and traversing the queue might not be that much more expensive.
Designs:
All the function pointers passed to spawn() functions are known at compile time, and certainly at the start of the program. These can be registered with the task queue to create appropriately sized queues.
Should be pretty straightforward to implement a version of some parallel loops that does a reduction and spits out the reduced value at the end.
Something like:
long total = 0;
forall_localized(array, N, &total, [](int64_t i, long& elem, long& total){
total += elem;
});
Could also roll it into a generic reduction:
long total = reduce(array, N, [](long& e, long& total) {
total += e;
});
I imagine implementation would look something like:
T reduce(GlobalAddress<T> array, size_t N, F func) {
  T total = T();
  on_all_cores([array, func, &total]{
    T local_total = T();
    T* base = array.localize();   // this core's local piece of the array
    forall_here([base, func, &local_total](int64_t i) {
      func(base[i], local_total);
    });
    local_total = reduce(local_total, func);  // combine totals across cores
    if (mycore() == 0) total = local_total;
  });
  return total;
}
Bad stuff happens!
Support buffered reads of line-based text files through Grappa file IO.
If we put the FileIO library in charge of parallelization (as in, for example, the Grappa_read_array API), then the user will provide a callback on_line(char* line).
An ordered line-based interface also has to provide a line index (on_line(int64_t i, char* line)), but that requires an extra pass through the file to do a prefix sum. A reasonable simplification might be to accept only normalized files where all lines are equal length.
BrandonH saw the aggregator complain about too-big payload messages in intsort, in the allreduce.
Ensure everything has the right license header, and update what's in our LICENSE file.
system/, bin/ – our AGPL license
applications/* – whatever the original app was
compiler/ – Chapel's BSD license (even just in its branch)

@nelsonje and I discussed the idea of doing "flat combining" on GCE complete messages. The basic idea is to have a message going to each destination core stored in the GCE, as well as a count of outgoing completes to each core.
It's unsafe to modify a message after it's been enqueued, which is why we have the counter.
mark_sent() is overridden so that when the message is sent, it checks the outstanding-completes count for that core; if the count is > 0, the message takes that count and re-enqueues itself. This should strictly improve the number of messages we send for completions, and in the limit reduce the traffic for feed-forward delegates by half.
Use the @deprecated tag in Doxygen comments. But we should also use GCC/Clang-compatible function attributes to get compiler warnings.
Specifically, make sure we have working:
void finish([]{ });
void forall(int64_t,size_t,[](int64_t){ });
void forall(GlobalAddress<T>,size_t,[](T&){ });
I think we wanted to make it simpler to include most of the Grappa functionality with a single include. This should require just a bit of refactoring to avoid include cycles.
I think what we want is for Grappa.hpp to include:
collective operations (on_all_cores, reduce, etc.)
delegates (Delegate.hpp, AsyncDelegate.hpp)
parallel loops (ParallelLoop.hpp)
GlobalAllocator.hpp
Addressing.hpp
Tasking.hpp
Array.hpp (for things like memset, array_str)
They can presumably include things like GlobalHashSet separately.
I know we've rewritten "reduce" so many times now, but this idea is cool.
Proposing to extend the symmetric address reduce (1181e16) with something that's integrated into parallel loops.
auto sum = symmetric_global_alloc<long>();
auto total = on_all_cores( reduce<add>(sum), [](long* sum) {
*sum += foo();
});
// or same could be done with `forall`:
auto array = global_alloc<long>(N);
auto total = forall(array, N, reduce<add>(sum), [](long& v, long* sum) {
*sum += foo(v);
});
Another thing to consider is to make the reduction part of the loop's sync object. So if we supported arbitrary GCE's for loops, you could just make a Reduce sync object and then have the 'return' from the loop lambda be the thing to reduce:
auto sinc = GlobalCompletionEvent<Reduce<add>>::create();
auto total = forall(array, N, sinc, [](long& v){
return foo(v);
});
LOG(INFO) << "total = " << sinc.get();
This could be done as part of #119.
This issue applies only to existing tests. We'll have another issue for each new test that is missing.
This should be simple---we just need to walk the suspended coroutines and count them.
We've been looking for a better way to do statistics gathering for a while. I thought I'd start a new Issue related to this so we can gather our various ideas and brainstorm a solution.
Grappa::barrier deadlocks when Grappa::cores()==1
Example inputs:
make mpi_run TARGET=pagerank.exe NNODE=2 PPN=2 GARGS='--num_starting_workers=64 --logN=4 --nnz_factor=4'
make mpi_run TARGET=graph.exe NNODE=2 PPN=2 GARGS='--num_starting_workers=64 -- -s 4 -e 4'
[0]: 1 2 4 5 12 15
0: [1]: -1 0 1 2 3 4 5 6 8 11 12 14
0: [2]: -1 0 1 2 3 4 5 6 8 11 12 14 15
0: [3]: 1 2 4 5 9 11 12 15
0: [4]: 0 1 2 3 5 6 7 8 9 10 12 14 15
0: [5]: 0 1 2 3 4 5 7 10 12 14 15
0: [6]: 0 1 2 3 4 5
0: [7]: 4 5 15
0: [8]: 1 4 10
0: [9]: 1 3 4 15
0: [10]: 4 5 8 15
0: [11]: 1 3 14
0: [12]: 0 1 2 3 4 5 14 15
0: [13]: 1 15
0: [14]: 1 4 5 6 11 12 15
0: [15]: 0 1 2 3 4 5 7 9 10 12 13 14
We would like to be able to issue multiple delegate operations and then block on them completing later. We've discussed some ideas for how to do this in Issue #23, including things like returning the FullEmpty<> to block on, or passing in an object where results will be filled in.
This is to allow for:
On new-messages-old-aggregator, the readyQ is currently checked after checking for new task work. Since the task loop expects a stealing thread to suspend the scheduler, when there is just 1 core the stealing thread keeps getting to run.
possible choices:
choice 2 seems best
In BFS, occasionally, early. Graph generation.
@bmyerz: Looks like there's some FullEmpty thing in steal_locally that gets stuck, causing the application not to complete. It seems we could solve this either by fixing the FullEmpty suspend, or by just not "counting" it when we try to determine whether the app can finish.
Here's the backtrace of the culprit task:
#0 Grappa::FullEmpty<long>::block_until (this=0x22ead00, desired_state=128) at FullEmpty.hpp:42
#1 0x00004008077a5000 in ?? ()
#2 0x0000000000481d3b in Grappa::impl::StealQueue<Grappa::impl::Task>::steal_locally(short, long) ()
#3 0x0000000000481eab in Grappa::impl::TaskManager::checkPull() ()
#4 0x000000000047d4d2 in Grappa::impl::TaskManager::waitConsumeAny (this=0x7c17a0, result=Unhandled dwarf expression
opcode 0xf3
) at tasks/Task.cpp:291
#5 0x000000000047dc20 in Grappa::impl::TaskManager::getWork (this=0x7c17a0, result=0x4008078259c0)
at tasks/Task.cpp:153
#6 0x000000000047ad58 in Grappa::impl::workerLoop (me=Unhandled dwarf expression opcode 0xf3
) at tasks/TaskingScheduler.cpp:144
#7 0x00000000004871c7 in tramp (me=0x22d3000, arg=0x170d000) at tasks/Thread.cpp:69
#8 0x0000000000487288 in _makestack () at stack.S:192
#9 0x0000000000000000 in ?? ()
BrandonH may have seen this in BFS. Haven't seen it in intsort. Graph generation maybe?
In Barrier.hpp we have Grappa::barrier().
It blocks the calling task until one task on each node has entered.
The implementation properly allows other threads (notably the network progress threads) to continue.
The use of barrier() is very low-level and limited right now because of its global-ness.
Multiple barriers in an application must not be concurrently callable or else the barriers will be mismatched.
One example of the behavior we may want is UPC's lexical barrier, upc_barrier.
We'll need to mark coros that are suspended but on the unassigned queues, so that GDB can do the filtering.
This will be easier once we integrate Thread and coro into a single Worker struct.
Grappa::allreduce deadlocks when Grappa::cores()==1.
Recommended to reproduce:
make mpi_test TARGET=Collective_tests.test NNODE=1 PPN=1 VERBOSE_TESTS=1 GARGS=' --num_starting_workers=64'
(long term improvement)
Find a better way to do comprehensive regression testing of Grappa infrastructure. Possible implementation approaches:
Desired features:
master, for example

One idea for having shared MessagePools is to associate a pool with a GCE, so that all async delegates spawned within a GCE scope (i.e. in a forall_localized) can, in addition to sharing the functor and synchronizing with the GCE, also take advantage of the shared message pool to avoid blocking on issuing messages.
Recently changed the way shared_pool works, and had to change call_async() to call send_heap_message() directly (df452fc), so it now ignores the passed pool argument. We should change the API (and its uses) to no longer take an explicit pool.
Implement a shared heap-allocated message pool to replace send_heap_message.
Here's an idea for the compiler: can we add a qualifier to functions that indicates they may block or context switch? And can we add a qualifier to blocks of code that requires no such functions be called within?
Perhaps this could reuse some of the exception/noexcept mechanisms in the compiler.
Obviously this only works for some cases, but it could be helpful.
Thoughts?
In spmv_mult benchmark, during pack_vtx_edges, there is apparent deadlock, where all nodes are polling. It is not due to too few stacks.
make mpi_run TARGET=spmv_mult.exe PPN=4 NNODE=12 \
GARGS='--num_starting_workers=1024 --steal=1 --chunk_size=100 --aggregator_autoflush_ticks=2000000 --async_par_for_threshold=1 --periodic_poll_ticks=20000 --nnz_factor=16 --logN=22 --row_distribute=true'
Turning down PPN makes this deadlock not occur.
Not sure exactly when it stopped working, but sometime after introducing the new stats and now.
a more accurate name for these
For refactoring, I'm also concerned about tooling that renaming might break.
The only thing I can think of is the json parser though.
This is in there from a looooong time ago, and it should really go, because it's led to a few really hard to track down bugs.
Addressing.hpp:
/// generic cast operator
/// TODO: do we really need this? leads to unneccessary type errors...
template< typename U >
operator GlobalAddress< U >( ) {
GlobalAddress< U > u = GlobalAddress< U >::Raw( storage_ );
return u;
}
Mainly on ruby scripts that are symlinked.
Let me know if you cannot replicate it
Migrate uses of the old statistics.
Be sure to run tests to ensure the new stats are reasonable.