
uwsampa / grappa

157.0 157.0 51.0 15.2 MB

Grappa: scaling irregular applications on commodity clusters

Home Page: grappa.io

License: BSD 3-Clause "New" or "Revised" License

CMake 1.46% Ruby 2.35% C++ 61.71% C 33.16% Makefile 0.32% Shell 0.62% HTML 0.02% R 0.10% Perl 0.02% Assembly 0.24%

grappa's People

Contributors

ahh · alexfrolov · baldas · bholt · bmyerz · luisceze · markoskin · nelsonje · simonkahan · theyorkwei · tkonolige


grappa's Issues

Port Collectives

Mostly just consists of reductions, useful for implementing custom statistics' merge_all().

CMake: VampirTrace mode

VampirTrace support was left out of the initial CMake port.

The plan is to enable Vampir as a third mode option in configure:

./configure --mode=Vampir --gen=Make
# creates build/Make+Vampir

(some code to do this is already there, it just needs to be re-integrated and tested)

Concurrent Grappa::allreduce of same type (to support multiple SPMD contexts)

Currently only one allreduce<T, ReduceOp> (for a given T and ReduceOp **) can be outstanding at a time, since messages are not tagged. Since this is an SPMD operation and we naturally support multiple SPMD contexts, we should be able to support concurrent allreduces of the same type.

One simple solution is to add an additional template argument that distinguishes the specific use of allreduce in the code (Basically a static tag).

** (More accurately, all reduces with type T conflict since we template the static Reduction storage on just T).
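A minimal sketch of the static-tag idea (all names here are hypothetical, not the actual Grappa API): an extra tag type parameter gives each call site its own static reduction storage, so two allreduces over the same T and ReduceOp no longer collide.

```cpp
#include <cassert>

// Hypothetical: each (T, ReduceOp, Tag) instantiation gets distinct static
// storage, so concurrent allreduces of the same T no longer conflict.
template< typename T, T (*ReduceOp)(const T&, const T&), typename Tag >
struct TaggedAllreduce {
  static T storage;   // per-instantiation reduction storage
};
template< typename T, T (*ReduceOp)(const T&, const T&), typename Tag >
T TaggedAllreduce<T, ReduceOp, Tag>::storage{};

inline long add(const long& a, const long& b) { return a + b; }

// Two distinct use sites in the code become two distinct tag types:
struct SiteA {};
struct SiteB {};

inline bool distinct_storage() {
  // distinct tags -> distinct statics, even with identical T and ReduceOp
  return &TaggedAllreduce<long, add, SiteA>::storage
      != &TaggedAllreduce<long, add, SiteB>::storage;
}
```

The downside, as noted, is that the tag must be written manually at each call site (or generated with a macro), since templates cannot see call-site identity on their own.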

Fix steal_locally DEBUG reply message

(new-aggregator-merge)
It used block_until_sent() to wait until it is safe to memset the queue to 0, but this is no longer safe inside of handlers.

Currently commented out. Extend mark_sent to do the memset.

Delegate instrumentation

  • Add instrumentation to async delegates.
  • Be able to count short-circuited ops.
  • Make sure we can count Caches in the same way (number of objects cached).

ArgCache

We've talked about this for a while; here it is in concrete terms.

Tasks are limited to a bounded number of arguments by the fact that we store unstarted tasks in queues of fixed-size elements for performance.

In order to provide extra arguments, a task can be given a global address referring to a struct of additional arguments which it can cache on startup and use.

However, we often spawn many tasks that share a number of arguments, such as base addresses to arrays. It is highly inefficient to acquire identical argument structs multiple times by different tasks on the same node.

Instead, we should cache these shared arguments once and share the copy among tasks on that node. So, ideas so far for implementing this:

Lookup:

  • Maintain a global map on each node of ArgCaches that have been acquired. When a task goes to acquire an ArgCache, it first looks in the table.
  • Remove elements when no one is using them to prevent unbounded growth of map
  • Index using GlobalAddress.
  • Hash or Tree or something more like page table?

Ownership:

  • Assuming read-only arguments
  • Should be able to safely destroy the original (or allow it to go out of scope if stack allocated)
  • Heap-allocate the cached copy
  • Reference counting:
    • Treat like Joiners (pre-register tasks that get spawned with an address to an ArgCache)
    • When task's cache "releases", just decrement count, free if count is 0
    • Children: pass a reference to the local copy of the ArgCache, so if they get stolen they still report back to the correct copy (like GlobalJoiner)

Cache or GlobalAddress?

  • GlobalAddress responsible for being able to "localize" itself? This is very different than the current localize() which just finds the nearest section of global memory that is local.
  • GlobalAddress doing reference counting makes it more like Smart Pointers, so copy constructor can increment, etc.
  • Alternative would be to give the ArgCache object a method to get a GlobalAddress out to be passed to task spawns as their "shared" argument, but this obviously has its own flaws.
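To make the lookup + reference-counting scheme concrete, here is a minimal single-node sketch (all names hypothetical: GlobalAddress is stood in by a plain integer key, and acquire just heap-allocates instead of copying remote bytes in):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Cached copy of an args struct plus the count of tasks still holding it.
struct ArgCacheEntry {
  void* local_copy;   // heap-allocated cached copy
  int   refcount;     // tasks pre-registered / currently using this entry
};

class ArgCacheTable {
  // per-node map of ArgCaches that have been acquired, indexed by (mock)
  // GlobalAddress; a real version might use a tree or page-table-like index
  std::unordered_map<uint64_t, ArgCacheEntry> table_;
public:
  // Acquire: look in the table first; only fetch (here: allocate) on a miss.
  void* acquire(uint64_t gaddr, size_t size) {
    auto it = table_.find(gaddr);
    if (it == table_.end()) {
      it = table_.emplace(gaddr, ArgCacheEntry{ ::operator new(size), 0 }).first;
    }
    it->second.refcount++;
    return it->second.local_copy;
  }
  // Release: decrement; free and erase when the last task finishes, which
  // keeps the map from growing without bound.
  void release(uint64_t gaddr) {
    auto it = table_.find(gaddr);
    if (--it->second.refcount == 0) {
      ::operator delete(it->second.local_copy);
      table_.erase(it);
    }
  }
  size_t size() const { return table_.size(); }
};
```

Two tasks acquiring the same address get the same local copy, and the entry disappears once both release it — matching the read-only, reference-counted ownership sketched above.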

Generic delegates with template

Currently you create a generic delegate operation by passing a functor, which holds any sent and returned values. This is a bit messy, as well as potentially inefficient, since the whole functor must be copied both ways. Replace it with a templated delegate: <SendDataType, ReturnDataType, ReturnDataType (*Func)(SendDataType)>.
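A sketch of what the proposed signature could look like (illustrative names only; this local stand-in just invokes the function instead of shipping the argument to another core):

```cpp
#include <cassert>

// Hypothetical: the function is a compile-time template parameter, so only
// SendDataType travels to the target core and only ReturnDataType travels
// back, instead of copying a whole functor in both directions.
template< typename S, typename R, R (*Func)(S) >
R delegate_call(int target_core, S arg) {
  // A real version would ship `arg` to target_core, run Func there, and
  // ship the R result back; here we just run it locally.
  (void)target_core;
  return Func(arg);
}

inline long twice(long x) { return 2 * x; }
```

Since Func is a template parameter, both sides agree on the code to run without it ever being part of the message payload.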

low priority: losing precision in steal queue

In file included from tasks/Task.hpp:12:0,
from tasks/Task.cpp:7:
tasks/StealQueue.hpp: In instantiation of ‘int64_t StealQueue::workShare(int16_t, uint64_t) [with T = Task; int64_t = long int; int16_t = short int; uint64_t = long unsigned int]’:
tasks/Task.cpp:139:61: required from here
tasks/StealQueue.hpp:567:79: warning: narrowing conversion of ‘amount’ from ‘uint64_t {aka long unsigned int}’ to ‘int’ inside { } [-Wnarrowing]

might want to make this more specific sometime.
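For reference, the warning boils down to brace-initializing a narrower `int` field from the `uint64_t` `amount`. A minimal illustration of the two obvious fixes (widen the field, or cast explicitly at the initialization site); names here are illustrative, not the actual StealQueue code:

```cpp
#include <cassert>
#include <cstdint>

struct WorkShareRequest {
  int64_t amount;   // widened from int, so { amount } no longer narrows
};

inline WorkShareRequest make_request(uint64_t amount) {
  // the explicit cast documents (and silences) the intentional conversion
  return WorkShareRequest{ static_cast<int64_t>(amount) };
}
```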

Multi-sized task queue

Currently, tasks are forced to be a fixed size of 64 bytes: 64-bit function pointer, 3 64-bit arguments. This restriction is to make the task queue simple. We could support tasks of different sizes with perhaps little additional complexity. The gain would be flexibility to have larger tasks, space and copying efficiency for small tasks.

In the designs so far, work stealing in chunks efficiently tends to be the main challenge. We would like to do something that looks like a memcpy; on the other hand it's still linear and traversing the queue might not be that much more expensive.

designs:

  • separate queue for every task size. Each of these queues has an index. Ordering of tasks could be maintained by having a central queue that stores the index of the next queue to pull from. Stealing a chunk from the back of the queue involves grabbing elements from queues specified by the ordering queue. There needs to be enough information transmitted to the thief to deserialize the fragmented queues properly.

All the function pointers passed to spawn() functions are known at compile time, and certainly at the start of the program. These can be registered with the task queue to create appropriately sized queues.

  • tasks have size information, stored in the queue. Stealing chunks would require traversing the headers to grab the right amount. There is certainly a knob to turn here on how length information can be kept or summarized.
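The size-header design (second bullet) can be sketched as a byte buffer where each task is preceded by its length, and a thief walks the headers to figure out how many whole tasks fit in its byte budget. This is a hypothetical sketch, not the actual StealQueue:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

struct VarQueue {
  std::vector<uint8_t> buf;   // [len][payload][len][payload]...

  void push(const void* task, uint32_t size) {
    size_t off = buf.size();
    buf.resize(off + sizeof(uint32_t) + size);
    std::memcpy(buf.data() + off, &size, sizeof(uint32_t));        // header
    std::memcpy(buf.data() + off + sizeof(uint32_t), task, size);  // payload
  }

  // Count how many whole tasks fit in the first `budget` bytes; a thief
  // would memcpy that prefix in one shot and deserialize by the same walk.
  size_t tasks_within(size_t budget) const {
    size_t off = 0, n = 0;
    while (off < buf.size()) {
      uint32_t size;
      std::memcpy(&size, buf.data() + off, sizeof(uint32_t));
      if (off + sizeof(uint32_t) + size > budget) break;
      off += sizeof(uint32_t) + size;
      n++;
    }
    return n;
  }
};
```

This keeps the steal itself memcpy-like at the cost of a linear header walk — the "knob" mentioned above would be summarizing lengths (e.g. per-chunk byte counts) to skip that walk.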

Parallel loop with builtin reduction

Should be pretty straightforward to implement a version of some parallel loops that do a reduction and spit out the reduced value at the end.

Something like:

long total = 0;
forall_localized(array, N, &total, [](int64_t i, long& elem, long& total){
  total += elem;
});

Could also roll it into a generic reduction:

long total = reduce(array, N, [](long& e, long& total) {
  total += e;
});

I imagine implementation would look something like:

T reduce(GlobalAddress<T> array, size_t N, F func) {
  T total;
  on_all_cores([array, N, func, &total]{
    T local_total{};
    forall_here(0, N, [array, func, &local_total](int64_t i) {
      func(array[i], local_total);
    });
    local_total = allreduce(local_total, func);
    if (mycore() == 0) total = local_total;
  });
  return total;
}

Buffered Parallel file IO by line

In order to support buffered reads through Grappa file IO of line-based text files.

If we put the FileIO library in charge of parallelization--as in, for example, the Grappa_read_array API--then the user will provide a callback on_line( char * line).

An ordered line-based interface must also provide a line index (on_line( int64_t i, char * line)), but that requires an extra pass through the file to do a prefix sum. A reasonable simplification might be to accept only normalized files where all lines are equal length.
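Under the equal-length-line simplification, the partitioning needs no prefix sum at all: each core computes its line range and seek offset directly. A sketch of the index arithmetic (helper names are hypothetical; line_len includes the newline):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Balanced split of nlines across ncores: returns [start_line, end_line)
// for this core, with the remainder spread over the first cores.
inline std::pair<int64_t, int64_t>
lines_for_core(int64_t nlines, int64_t ncores, int64_t core) {
  int64_t per = nlines / ncores, rem = nlines % ncores;
  int64_t start = core * per + (core < rem ? core : rem);
  int64_t end   = start + per + (core < rem ? 1 : 0);
  return { start, end };
}

// With fixed-length lines, the byte offset of line i is trivial,
// so each core can seek straight to its buffered-read window.
inline int64_t byte_offset(int64_t line, int64_t line_len) {
  return line * line_len;
}
```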

Payload messages too big?

BrandonH saw the aggregator complain about too-big payload messages in intsort's allreduce.

Update license headers

Ensure everything has the right license header, and update what's in our LICENSE file.

  • system/, bin/ – our A-GPL license
  • applications/* – whatever the original app was
  • compiler/ – Chapel's BSD license (even just in its branch)

GCE Completion Flat Combining

@nelsonje and I discussed the idea of doing the "flat combining" on GCE complete messages. The basic idea is to have a message going to each destination core stored in the GCE, as well as a count of outgoing completes to each core.

It's unsafe to modify a message after it's been enqueued, which is why we have the counter.

  • If you go to complete and the message for that destination has not been enqueued, you enqueue the message yourself.
  • If the message is already enqueued, increment the complete counter.
  • The message's mark_sent() is overridden so that when it is sent, it checks the outstanding-completes count for that core; if it is > 0, the message takes that count and re-enqueues itself.

This should strictly improve the number of messages we send for completions, and in the limit reduce the traffic for feed-forward delegates by half.
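A single-destination sketch of the enqueue/counter dance described above (field and method names are illustrative, not actual GCE code):

```cpp
#include <cassert>
#include <cstdint>

struct CompleteMsg {
  bool    enqueued  = false;  // message currently in the send queue?
  int64_t carrying  = 0;      // count the enqueued message will deliver
  int64_t extra     = 0;      // completes that arrived while enqueued
  int64_t delivered = 0;      // total completes delivered (for checking)

  void complete() {
    if (!enqueued) {
      enqueued = true;        // message not in flight: enqueue it ourselves
      carrying = 1;
    } else {
      extra++;                // unsafe to modify an enqueued message's
    }                         // payload, so just bump the counter
  }

  // mark_sent() override: the network just delivered `carrying` completes;
  // if more accumulated while in flight, take that count and re-enqueue.
  void mark_sent() {
    delivered += carrying;
    if (extra > 0) {
      carrying = extra;       // re-enqueue with the combined count
      extra = 0;
    } else {
      enqueued = false;
      carrying = 0;
    }
  }
};
```

Three completes issued while the first message is in flight collapse into just two sends, which is the flat-combining win.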

full @deprecate pass on all APIs

Use the @deprecated tag in Doxygen comments, but also add GCC/Clang-compatible function attributes to get compiler warnings.
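For the attribute half, the standard C++14 [[deprecated]] attribute (or __attribute__((deprecated)) on older GCC/Clang) pairs naturally with the Doxygen tag. A minimal example (function name hypothetical):

```cpp
#include <cassert>

/// @deprecated Use Grappa::forall() instead.
[[deprecated("use Grappa::forall() instead")]]
inline void forall_localized_old() {}

// Any call to forall_localized_old() now emits a compiler warning carrying
// the message above, while Doxygen renders the deprecation notice in docs.
inline int replacement_api() { return 42; }  // placeholder stand-in
```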

Make API match what's in our SC Ads

Specifically, make sure we have working:

void finish([]{ });
void forall(int64_t,size_t,[](int64_t){ });
void forall(GlobalAddress<T>,size_t,[](T&){ });

Common includes (Grappa.hpp)

I think we wanted to make it simpler to include most of the Grappa functionality with a single include. This should require just a bit of refactoring to avoid include cycles.

I think what we want is for Grappa.hpp to include:

  • Collectives (on_all_cores, reduce, etc)
  • Delegates (Delegate.hpp, AsyncDelegate.hpp)
  • Parallel loops (ParallelLoop.hpp)
  • GlobalAllocator.hpp
  • Addressing.hpp
  • Tasking.hpp
  • Array.hpp (for things like memset, array_str)

They can presumably include things like GlobalHashSet separately.

New Reduce

I know we've rewritten "reduce" so many times now, but this idea is cool.

Proposing to extend the symmetric address reduce (1181e16) with something that's integrated into parallel loops.

auto sum = symmetric_global_alloc<long>();

auto total = on_all_cores( reduce<add>(sum), [](long* sum) {
  *sum += foo();
});

// or same could be done with `forall`:
auto array = global_alloc<long>(N);
auto total = forall(array, N, reduce<add>(sum), [](long& v, long* sum) {
  *sum += foo(v);
});

Another thing to consider is to make the reduction part of the loop's sync object. So if we supported arbitrary GCE's for loops, you could just make a Reduce sync object and then have the 'return' from the loop lambda be the thing to reduce:

auto sinc = GlobalCompletionEvent<Reduce<add>>::create();
auto total = forall(array, N, sinc, [](long& v){
  return foo(v);
});
LOG(INFO) << "total = " << sinc.get();

This could be done as part of #119.

Transfer or eliminate tests on old APIs

  • Eliminate irrelevant tests
  • Migrate relevant tests to new APIs

This issue applies only to existing tests. We'll have another issue for each new test that is missing.

Statistics

We've been looking for a better way to do statistics gathering for a while. I thought I'd start a new Issue related to this so we can gather our various ideas and brainstorm a solution.

GDB Macros Broken

Since the changes to the coro/Worker struct, the GDB macros appear to be broken. Could we fix these? @bmyerz?

Or @nelsonje did you want to rewrite them in Python anyway?

Tuple -> csr fails on scale <= 4

example inputs

pagerank

make mpi_run TARGET=pagerank.exe NNODE=2 PPN=2 GARGS='--num_starting_workers=64 --logN=4 --nnz_factor=4'

bfs

make mpi_run TARGET=graph.exe NNODE=2 PPN=2 GARGS='--num_starting_workers=64 -- -s 4 -e 4'

[0]: 1 2 4 5 12 15
0: [1]: -1 0 1 2 3 4 5 6 8 11 12 14
0: [2]: -1 0 1 2 3 4 5 6 8 11 12 14 15
0: [3]: 1 2 4 5 9 11 12 15
0: [4]: 0 1 2 3 5 6 7 8 9 10 12 14 15
0: [5]: 0 1 2 3 4 5 7 10 12 14 15
0: [6]: 0 1 2 3 4 5
0: [7]: 4 5 15
0: [8]: 1 4 10
0: [9]: 1 3 4 15
0: [10]: 4 5 8 15
0: [11]: 1 3 14
0: [12]: 0 1 2 3 4 5 14 15
0: [13]: 1 15
0: [14]: 1 4 5 6 11 12 15
0: [15]: 0 1 2 3 4 5 7 9 10 12 13 14

Asynchronous delegates

We would like to be able to issue multiple delegate operations and then block on them completing later. We've discussed some ideas for how to do this in Issue #23, including things like returning the FullEmpty<> to block on, or passing in an object where results will be filled in.

This is to allow for:

  • more straightforward way of overlapping of communication in a task than Caches
  • ???

Stealing with Grappa::cores==1 can livelock

On new-messages-old-aggregator, the readyQ is currently checked after checking for new task work. Since the task loop expects a stealing thread to suspend the scheduler, when there is just 1 core the stealing thread keeps getting to run.

possible choices:

  1. assertion failure when load_balance!=none and cores=1
    (might be annoying)
  2. just turn doSteal off when cores=1
  3. switch priority of readyQ (would be better for this bug not to hinge on this ordering)

choice 2 seems best

Failure to terminate with stealing enabled.

@bmyerz: Looks like there's some FullEmpty thing in steal_locally that gets stuck, causing the application not to complete. Seems like we could solve this either by fixing the FullEmpty suspend, or by just not "counting" it when we try to determine if the app can finish.

Here's the backtrace of the culprit task:

#0  Grappa::FullEmpty<long>::block_until (this=0x22ead00, desired_state=128) at FullEmpty.hpp:42
#1  0x00004008077a5000 in ?? ()
#2  0x0000000000481d3b in Grappa::impl::StealQueue<Grappa::impl::Task>::steal_locally(short, long) ()
#3  0x0000000000481eab in Grappa::impl::TaskManager::checkPull() ()
#4  0x000000000047d4d2 in Grappa::impl::TaskManager::waitConsumeAny (this=0x7c17a0, result=Unhandled dwarf expression
opcode 0xf3
) at tasks/Task.cpp:291
#5  0x000000000047dc20 in Grappa::impl::TaskManager::getWork (this=0x7c17a0, result=0x4008078259c0)
    at tasks/Task.cpp:153
#6  0x000000000047ad58 in Grappa::impl::workerLoop (me=Unhandled dwarf expression opcode 0xf3
) at tasks/TaskingScheduler.cpp:144
#7  0x00000000004871c7 in tramp (me=0x22d3000, arg=0x170d000) at tasks/Thread.cpp:69
#8  0x0000000000487288 in _makestack () at stack.S:192
#9  0x0000000000000000 in ?? ()

Safer barrier()

In Barrier.hpp we have Grappa::barrier().
It blocks the calling task until one task on each node has entered.
The implementation properly allows other threads (notably the network progress threads) to continue.

The use of barrier() is very low-level and limited right now because of its global-ness.
Multiple barriers in an application must not be concurrently callable or else the barriers will be mismatched.

One example of the behavior we may want is UPC's lexical barrier upc_barrier.

Grappa::allreduce on 1 core

Grappa::allreduce with Grappa::cores()==1 deadlocks.

Recommended to reproduce:
make mpi_test TARGET=Collective_tests.test NNODE=1 PPN=1 VERBOSE_TESTS=1 GARGS=' --num_starting_workers=64'

Better Testing Infrastructure

(long term improvement)

Find a better way to do comprehensive regression testing of Grappa infrastructure. Possible implementation approaches:

  • CMake/CTest with existing Boost Unit Test stuff
  • CTest with Google Test

Desired features:

  • Can be run before committing to master for example
  • Verifies that all build modes (vtrace on/off, etc) work and run
  • Able to parse output to verify stats output

Add MessagePool to GlobalCompletionEvent

One idea for having shared MessagePools is to associate a pool with a GCE, so that all async delegates spawned within a GCE scope (i.e. in a forall_localized) can, in addition to sharing the functor and synchronizing with the GCE, also take advantage of the shared message pool to not block on issuing messages.

Fix async API to not allow specifying the pool

Recently changed the way shared_pool works, and had to change call_async() to call send_heap_message() directly (df452fc), so it now ignores the passed pool argument. Should change the API (and uses) to not take an explicit pool anymore.

Shared MessagePool

Implement shared heap-allocated message pool to replace send_heap_message.

__mayinterleave__ type qualifier and __nointerleave__{} regions?

Here's an idea for the compiler: can we add a qualifier to functions indicating that they may block or context switch? And can we add a qualifier to blocks of code requiring that no such functions are called within them?

Perhaps this could reuse some of the exception/noexcept mechanisms in the compiler.

Obviously this only works for some cases, but it could be helpful.

Thoughts?

Apparent deadlock in AMPoll

In spmv_mult benchmark, during pack_vtx_edges, there is apparent deadlock, where all nodes are polling. It is not due to too few stacks.

make mpi_run TARGET=spmv_mult.exe PPN=4 NNODE=12 \
GARGS='--num_starting_workers=1024 --steal=1 --chunk_size=100 --aggregator_autoflush_ticks=2000000 --async_par_for_threshold=1 --periodic_poll_ticks=20000 --nnz_factor=16 --logN=22 --row_distribute=true'

Turning down PPN makes this deadlock not occur.

Vampir Trace Broken

Not sure exactly when it stopped working, but sometime after introducing the new stats and now.

Refactor 'statistics'->'metrics'

A more accurate name for these.

For the refactoring, I'm also concerned about tooling that the renaming might break; the only thing I can think of is the JSON parser, though.

Don't allow implicit casts on GlobalAddresses

This is in there from a looooong time ago, and it should really go, because it's led to a few really hard to track down bugs.

Addressing.hpp:

  /// generic cast operator
  /// TODO: do we really need this? leads to unnecessary type errors...
  template< typename U >
  operator GlobalAddress< U >( ) {
    GlobalAddress< U > u = GlobalAddress< U >::Raw( storage_ );
    return u;
  }
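One fix short of deleting the operator outright is to mark it explicit, so cross-type conversions must be spelled with static_cast at the call site. A sketch of the mechanics with a minimal GlobalAddress stand-in (not the real class):

```cpp
#include <cassert>
#include <cstdint>

template< typename T >
class GlobalAddress {
  intptr_t storage_;
public:
  explicit GlobalAddress(intptr_t raw = 0) : storage_(raw) {}
  static GlobalAddress Raw(intptr_t raw) { return GlobalAddress(raw); }
  intptr_t raw() const { return storage_; }

  /// explicit: static_cast<GlobalAddress<U>>(addr) still works, but the
  /// silent implicit conversions that caused the hard-to-track bugs
  /// no longer compile
  template< typename U >
  explicit operator GlobalAddress<U>() const {
    return GlobalAddress<U>::Raw(storage_);
  }
};
```

Making it explicit first (rather than removing it) gives a migration path: every remaining conversion site becomes a compile error or a visible static_cast that can be audited.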
