Creating and asynchronously processing a directed acyclic computation graph.
Tools for creating and executing computational Directed Acyclic Graphs (DAGs). Currently this only operates on types that implement the `Operable` trait, which mainly requires the `Product` and `Sum` traits to operate over iterators. The provided examples all use simple arithmetic types, but this could be extended to tensorial types.
This application is meant to:
- Generate random computational DAGs
- Execute DAGs in parallel on a multi-threaded CPU
- Print the order in which the nodes were executed, specifying which nodes were executed in parallel.
Part 1 can be considered a separate problem: creating a "valid" DAG, where the graph must be connected and acyclic.
Parts 2 and 3 are achieved together with the DAG executor.
This program was written using Rust 1.44.1, the latest stable release at the time of writing. Please follow the instructions at Install Rust for ways to install on your system.
Additionally, the `dot` program from GraphViz is required for pretty-printing, which can be found at GraphViz. As an optional requirement, `pandoc` is used for generating the PDF version of this document.
Everything is organized as a standard Rust repo:
- `/src` for the code
- `main.rs` for the entrypoint
- `lib.rs` for module declarations
Overview of files:
- `operation.rs`: defines simple operations over arithmetic types, to be used inside the DAG executor
- `dag.rs`: defines the DAG and required operations
- `random.rs`: specifies how to generate a random `Operation` and `Dag`, along with a custom distribution for the `Dag`, which can be customized at the command line
- `computation.rs`: multi-threaded execution model given a `Dag`
Additionally, a `Makefile` is provided to avoid remembering commands. Try running `make help` to see all of the possibilities.
There is no particular standard for how to represent a DAG in code, but there do exist formats for graphically representing them. For pretty-printing, this program outputs DAGs in the DOT format using the `print` command.
For example, to generate a PNG version of a random graph:

```shell
cargo run print | dot -Tpng -o out.png
```

To stay with the DOT format, you can run:

```shell
cargo run print | dot
```
The `Makefile` variants are:

```shell
make print_png
make print_dot
make print
```
An example graph is provided at `example.png`.

NOTE: `dot` takes a long time for anything over 50 nodes, so be sure to stay underneath that threshold.
For simplicity in random DAG generation using `rand::random()`, the `Dag` type implements the `Distribution` trait on `Standard`, which uses `const` variables for the node bounds. These variables, named `MIN_NODES`, `MAX_NODES`, and `EDGE_PERCENTAGE`, can be modified by hand in `random.rs`.
Additionally, a custom distribution is provided, `DagDistribution`, for deeper customization of the random graph. By providing command-line arguments, you can customize the minimum nodes, maximum nodes, and the percentage chance of an edge between two nodes. Try `make cargo_help` for all running options.
Since there isn't a particular set of topological rules that the computation DAG needs to follow, we will consider all DAGs to be valid, even unconnected graphs. In a perfect solution, we would do more graph analysis and split unconnected graphs out for separate processing.
DAGs can be represented with adjacency matrices, but it was easiest to directly generate the DAG in code, rather than creating an alternative representation.
In our simple representation, the operation performed at a node is defined as a simple function that takes inputs of one type and outputs one result of that type, which in simplified form would be:

```rust
pub type Operation<T> = dyn Fn(&Vec<T>) -> T;
```
In a real computation environment, such as in a neural network, operations would likely be performed on tensors or matrices whose dimensions are not always the same. For example, one layer in a neural network may perform operations on a 40 x 20 matrix, while the next layer requires a 10 x 100 matrix. The `Operable` type would need to be extended for tensorial types.
In order to perform all computations in a multi-threaded environment, the first and simplest possibility is to use a raw `thread::spawn` on every single computation. This approach, unfortunately, can quickly overrun a system when processing a large computation graph with thousands of nodes.
The next option is to use a threadpool to avoid over-using system resources, but this creates a scheduling problem. It's unwise to schedule everything from the start and let everything resolve: for example, if all threadpool resources are occupied waiting for an answer from an unscheduled computation, the program will deadlock.
Good topology analysis is required to address this limitation. For example, a simple solution would be to perform all computations at a certain "rank" all at once. In the earlier provided example, that would mean first executing nodes `1`, `2`, `3`, and `5`, since they are all parent nodes. This optimization may not maximize system resources, however: certain computations may finish much faster, but the entire network needs to wait for the full rank to finish before moving on to the next one.
The best option is an asynchronous solution, allowing the computation of each node to be handled concurrently by some runtime. The `tokio` runtime is the current standout option in the Rust ecosystem, but other options exist, such as `async-std`. By using asynchronous tasks with a threadpool under the hood, you can schedule many more operations than would be possible with a standard threadpool. You can even schedule everything right from the start and allow the runtime to execute the entire DAG, which addresses our previous problems. For example, if 1000 tasks are all waiting for one computation, the tasks can stay `await`ed without taking up (too many) system resources.
To begin with, I attempted to do this simply with `async` and `await`, which would have worked, except that we sometimes need to "split" the result of a computation. For example, if a node has more than one child, then each child needs to `await` some `Future`, but the same `Future` can't be awaited more than once.
The solution is to use channels: each node `await`s the receiver end of its parents' `oneshot` channels, runs its computation, and then sends the result to all of its children. Again, everything can be scheduled from the start, without hogging system resources. Once all tasks are set up, you simply need to join them and allow the runtime to resolve everything.
To create and execute a random DAG, run:

```shell
make execute
```

This will print out the DAG and when each node is processed, including which thread eventually executed the node.

To process a huge graph, run:

```shell
make execute_huge
```

This uses the `Default` function everywhere to avoid overflows.
Also, the number of threads in the `tokio` runtime can be modified through the `core_threads` parameter in `#[tokio::main(core_threads = 8)]` in `main.rs`.
Since the approach is totally asynchronous, it's impossible to show the order of processing of the nodes without doing additional topological analysis. As an alternative, a `Delay` operation is provided, which simply stalls for 2 seconds before returning `Default::default()`, so the result will always be one or more `0`s.
By using delay operations, though, you can see how each rank is optimally executed without requiring additional analysis of the DAG. You can run the DAG with `Delay` operations with:

```shell
make execute_delay
```

The graph is intentionally made small for this command, to easily trace which operations are happening at the same time.
Sample output:

```
digraph {
1 -> 6;
3 -> 4;
5;
6;
2 -> 3;
2 -> 5;
4;
}
--- snip ---
Starting everything!
ThreadId(7): processing node 2
ThreadId(8): processing node 1
--- 2 second delay ---
ThreadId(8): processing node 5
ThreadId(6): processing node 6
ThreadId(4): processing node 3
--- 2 second delay ---
ThreadId(4): processing node 4
--- 2 second delay ---
Collecting results
Results: [0, 0, 0]
```
In this example, we have three processing steps:
- Nodes 1 and 2 have no parents, so they are processed at the same time
- Nodes 3, 5, and 6 are processed because 3 and 5 depend on 2, and 6 depends on 1
- Node 4 is processed because it depends on 3
Looking at the printed `digraph` earlier, this corresponds perfectly to our expectations.
Unit tests are provided in each module, which can be run using:

```shell
make test
```

Note that this may take up to 30 seconds, as some tests involve 100,000 nodes.
I tried maxing this out with huge numbers of nodes. Unfortunately, at around 300,000 nodes there is a significant performance slowdown in the `join_all` portion of the tasks. I need to do more research into the limits on the number of tasks, or any limitations on the `join` operation. Otherwise the performance is pretty good!
Feel free to reach out about any questions.