I. The NEW FILTER
should semantically implement a partitioning (discrimination) of an array's elements into a number of arbitrary classes (where the number of classes is bounded by 15). The equivalence classes are specified in terms of (input) predicates. For example, in the external-language Futhark program below
fun {[int],[int]} main([int] A) =
let {x0,x1,x2,x3} =
filter( fn bool (int a) => a % 4 == 0 // pred0
, fn bool (int a) => a % 4 == 1 // pred1
, fn bool (int a) => a % 4 == 2 // pred2
, A)
in {x0,x3}
the array A is partitioned into four arrays x0, x1, x2, x3 such that their summed size equals the size of the original array A, x0 contains the integers in 4Z + 0, ..., and x3 contains the integers in 4Z + 3.
Note that the last predicate is implicit, i.e., not ((a % 4 == 0) || (a % 4 == 1) || (a % 4 == 2)),
and, in general, the semantics of the filter construct is that, while the input predicates might not be mutually exclusive, they are transformed to be so, intuitively via an if-elif-...-else type of construct. For example, this means that the equivalence classes correspond to:
1. pred0
2. pred1 && (not pred0)
3. pred2 && (not (pred0 || pred1))
4. not (pred0 || pred1 || pred2)
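A minimal Python sketch of this first-predicate-wins semantics (the helper name `equivalence_class` is ours, for illustration only):

```python
def equivalence_class(preds, a):
    """Return the index of the first predicate that holds for `a`,
    or len(preds) if none does (the implicit last class)."""
    for i, p in enumerate(preds):
        if p(a):
            return i
    return len(preds)

# The three (non-exclusive) predicates of the example above.
preds = [lambda a: a % 4 == 0,
         lambda a: a % 4 == 1,
         lambda a: a % 4 == 2]

keys = [equivalence_class(preds, a) for a in [0, 1, 2, 3, 4, 5]]
# 0 and 4 fall in class 0, 1 and 5 in class 1, 2 in class 2,
# and 3 in the implicit class 3.
```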
In the internal language, filter is represented as the composition between a map that computes the equivalence-class key for each value and a partition that "permutes" the elements of the original array based on these keys, such that elements corresponding to the same key belong to the same output array AND preserve their relative ordering from the original array. (For example, the latter allows one to reduce each output array with a merely associative binary operator, instead of one that is both associative and commutative.)
// external language
let {x0,x1,x2,x3} =
filter( fn bool (int a) => a % 4 == 0 // pred0
, fn bool (int a) => a % 4 == 1 // pred1
, fn bool (int a) => a % 4 == 2 // pred2
, A)
// internal language
let keyarr = map( fn int (int a) => // result is in [0..n],
// where n = #predicates.
if (a % 4 == 0) then 0
else if (a % 4 == 1) then 1
else if (a % 4 == 2) then 2
else 3
, A )
let { s0,s3, x0,x3 } = partition(4, {0,3}, keyarr, A)
in {x0, x3}
In the code above, the arguments of partition are:
-- 4 denotes the range of the keys, i.e., keys take values \in [0...3],
-- {0,3} denotes the equivalence classes that are actually of interest,
i.e., it provides an optimization hook. For example, listing all equivalence
classes, i.e., {0,1,2,3}, is safe, BUT one can observe that x1 and x2
are dead; hence, as an optimization, we could partition the
array into only three equivalence classes: 0, 3, and "all the rest"
(as opposed to four classes),
-- keyarr denotes the array of keys (one key for each value),
-- A is the array to be partitioned,
-- s0 and s3 are the existential sizes of the result partitions x0 and x3.
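As a reference, the sequential semantics of partition can be sketched in Python (a stable, order-preserving bucket collection; the helper below is an illustration, not the actual internal representation):

```python
def partition(num_classes, classes_of_interest, keyarr, xs):
    """Stable partition: for each requested class, collect the elements
    of `xs` whose key equals that class, preserving their original
    order.  Returns the sizes of the requested partitions followed by
    the partitions themselves."""
    assert len(keyarr) == len(xs)
    assert all(0 <= k < num_classes for k in keyarr)
    parts = {c: [x for k, x in zip(keyarr, xs) if k == c]
             for c in classes_of_interest}
    sizes = [len(parts[c]) for c in classes_of_interest]
    return sizes + [parts[c] for c in classes_of_interest]

A = [0, 1, 2, 3, 4, 5, 6, 7]
keyarr = [a % 4 for a in A]            # the map-computed keys
s0, s3, x0, x3 = partition(4, [0, 3], keyarr, A)
# x0 collects the 4Z+0 elements, x3 the 4Z+3 elements, in order.
```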
II. The NEW REDOMAP
is extended to use array concatenation implicitly, so that it can return both the mapped array and its reduction result:
redomap :: { ( {b,b} -> b ) , ( {b,a} -> {b, c} ) , b , [a] } -> { b, [c] }
should be able to fuse
fun {real,[real]} main([int] X) =
let Y = map(f, X) in
let s = reduce(op +, 0.0, Y) in
{s,Y}
into
fun {real,[real]} main([int] X) =
redomap( op +
, fn {real,real} (real s, int x) =>
let y = f(x) in { s + y, y }
, 0.0, X
)
This is still a "reduce o map" composition if we make the array concatenation explicit:
fun {real,[real]} main([int] X) =
redomap( fn {real, [real]} ( {real,[real]} t1, {real,[real]} t2 ) =>
let {sum, Y} = t1 in let {s, y} = t2 in {sum + s, Y ++ y}
, fn {real,[real]} (real s, int x) =>
let y = f(x) in { s + y, [y] }
, 0.0, X
)
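A sequential Python sketch of the new redomap semantics (assuming a left-to-right fold; the first, reduce-operator argument is only needed to combine per-chunk results in a parallel execution and is ignored here):

```python
def redomap(reduce_op, fold_fn, neutral, xs):
    """Sequentially: fold `fold_fn` over `xs`, collecting the mapped
    elements on the side.  `reduce_op` would combine per-chunk partial
    results in a parallel execution; it is unused sequentially."""
    acc, ys = neutral, []
    for x in xs:
        acc, y = fold_fn(acc, x)
        ys.append(y)
    return acc, ys

f = lambda x: float(x * x)            # some hypothetical map function
X = [1, 2, 3]
s, Y = redomap(lambda a, b: a + b,
               lambda s, x: (s + f(x), f(x)),
               0.0, X)
# Equivalent to: Y = map(f, X); s = reduce(+, 0.0, Y)
```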
III. Finally, with the new semantics of redomap and filter, FUSION should become significantly more aggressive. Below is a demonstration of how Fusion2.0 should work:
fun {real,real,real,[real],[real],[real]} main([real] A) =
let {X0, X1, X2, X3} =
filter( op <(1.0)
, op <(10.0)
, op <=(100.0)
, A) in
let Y1 = map(f1, X1) in
let {Z1,Z2} =
filter(op <(50.0), Y1) in
let Y3 = map(f3, X3) in
let s1 = reduce(op +, 0.0, Z2) in
let s2 = reduce(op *, 1.0, X2) in
let s3 = reduce(min, +INF, Y3) in
{ s1, s2, s3, X0, Z2, Y3 }
We replace the filter with the partition o map composition and rearrange the code in terms of the dependency graph to make the fusion steps easier to follow (fusion is implemented as a T2-reduction of the dependency graph -- see the FHPC'12 paper):
fun {real,real,real,[real],[real],[real]} main([real] A) =
let keys_X =
map( fn int (real a) =>
if 1.0 < a then 0
else if 10.0 < a then 1
else if 100.0 <= a then 2
else 3
, A ) in
// DEPENDENCY ^
// |
let {sx0,sx1,sx2,sx3, X0,X1,X2,X3} =
partition( 4, {0,1,2,3}, keys_X, A ) in
// DEPENDENCIES ^
// _________________________|______________________________________________
// | | |
// |
let Y1 = map(f1, X1) in
// ^ | |
// | | |
let Z_keys = map( fn int (real y1) =>
if 50.0 < y1 then 0
else 1
, Y1 )
// ^ | |
// | | |
let {sz2, Z2} =
partition( 2, {1}, Z_keys, Y1) in
// ^ | |
// | | |
let Y3 = map(f3, X3) in
let s3 = reduce(min, +INF, Y3) in
// | |
let s1 = reduce(op +, 0.0, Z2) in
// |
let s2 = reduce(op *,1.0,X2) in
{ s1, s2, s3, X0, Z2, Y3 }
Note that the partition on Y1 in the code above is "optimized" in that, since Z1 is dead, it does not express it, i.e., it mentions only the Z2 partition.
We proceed by fusing bottom-up on the dependency graph:
-- the map producing Y3 with the reduce consuming Y3 and producing s3,
-- the partition producing Z2 with the reduce consuming Z2 and producing s1.
Fusing a partition with a reduce corresponds to moving the partition after the reduce and transforming the reduce into a redomap that accumulates according to the key array (see the code below).
Fusing a partition with a map can be done similarly to reduce, but IF AND ONLY IF the result of the map function is of size smaller than or equal to the input, because otherwise the resulting partition would be more expensive than the original one, since it would need to interchange "bigger" elements.
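The partition-with-reduce rule can be illustrated with a small Python sketch (the input values are made up for the example): the keyed accumulation computes the same s1 as partitioning first and reducing the Z2 class afterwards.

```python
from functools import reduce

Y1     = [60.0, 10.0, 75.0, 20.0]
Z_keys = [0 if 50.0 < y1 else 1 for y1 in Y1]   # class 1 feeds Z2

# Unfused: partition first, then reduce over the Z2 class.
Z2 = [y for k, y in zip(Z_keys, Y1) if k == 1]
s1_unfused = reduce(lambda a, b: a + b, Z2, 0.0)

# Fused: one keyed pass over (Y1, Z_keys); the partition that still
# produces Z2 is moved after the reduce.
s1_fused = 0.0
for y1, z_key in zip(Y1, Z_keys):
    if z_key == 1:
        s1_fused = s1_fused + y1
```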
fun {real,real,real,[real],[real],[real]} main([real] A) =
let keys_X =
map( fn int (real a) =>
if 1.0 < a then 0
else if 10.0 < a then 1
else if 100.0 <= a then 2
else 3
, A ) in
// DEPENDENCY ^
// |
let {sx0,sx1,sx2,sx3, X0,X1,X2,X3} =
partition( 4, {0,1,2,3}, keys_X, A ) in
// DEPENDENCIES ^
// _________________________|______________________________________________
// | | |
// |
let Y1 = map(f1, X1) in
// ^ | |
// | | |
let Z_keys = map( fn int (real y1) =>
if 50.0 < y1 then 0
else 1
, Y1 )
// ^ | |
// | | |
let s1 =
redomap( op +
, fn real (real acc, real y1, int z_key)
=> let acc1 =
if z_key == 1
then acc + y1
else acc
in acc1
, 0.0, Y1, Z_keys ) in
// | |
let {sz2, Z2} =
partition(2,{1},Z_keys,Y1) in
// | |
let s2 = reduce(op *, 1.0, X2) in
let {s3,Y3} =
redomap( min
, fn {real,[real]} (real acc,real x3) =>
let y3 = f3(x3) in {min(acc,y3), y3}
, +INF, X3 ) in
{ s1, s2, s3, X0, Z2, Y3 }
Then we fuse the two maps producing Y1 and Z_keys with the corresponding redomap kernel.
fun {real,real,real,[real],[real],[real]} main([real] A) =
let keys_X =
map( fn int (real a) =>
if 1.0 < a then 0
else if 10.0 < a then 1
else if 100.0 <= a then 2
else 3
, A ) in
// DEPENDENCY ^
// |
let {sx0,sx1,sx2,sx3, X0,X1,X2,X3} =
partition( 4, {0,1,2,3}, keys_X, A ) in
// DEPENDENCIES ^
// _________________________|______________________________________________
// | | |
// |
// | | |
let {s1,Z_keys,Y1} =
redomap( op +
, fn {real,int,real}
(real acc,real x1) =>
let y1 = f1(x1) in
let z_key = if 50.0 < y1
then 0
else 1 in
let acc1 =
if z_key == 1
then acc + y1
else acc
in {acc1, z_key, y1}
, 0.0, X1) in
// | |
let {sz2, Z2} =
partition(2,{1},Z_keys,Y1) in
// | |
let s2 = reduce(op *, 1.0, X2) in
let {s3,Y3} =
redomap( min
, fn {real,[real]} (real acc,real x3) =>
let y3 = f3(x3) in {min(acc,y3), y3}
, +INF, X3 ) in
{ s1, s2, s3, X0, Z2, Y3 }
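This vertical map-into-redomap step can be checked with a Python sketch (f1 and the input values are hypothetical stand-ins): the fused single pass over X1 produces s1, Z_keys, and Y1 identical to the separate map/map/redomap version.

```python
f1 = lambda x: x + 40.0        # hypothetical stand-in for f1

X1 = [5.0, 20.0, 1.0, 15.0]

# Unfused: two maps plus a keyed accumulation.
Y1_ref = [f1(x) for x in X1]
Z_keys_ref = [0 if 50.0 < y else 1 for y in Y1_ref]
s1_ref = sum(y for k, y in zip(Z_keys_ref, Y1_ref) if k == 1)

# Fused: one redomap over X1 producing all three results in one pass.
acc, Z_keys, Y1 = 0.0, [], []
for x1 in X1:
    y1 = f1(x1)
    z_key = 0 if 50.0 < y1 else 1
    if z_key == 1:
        acc += y1
    Z_keys.append(z_key)
    Y1.append(y1)
```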
!!!FOLLOWS THE TRICKY AND IMPORTANT STEP!!!
It would seem that we got stuck here, i.e., this is the best it can get: we have three "independent" redomaps that consume the array results of the previous partition.
The next intuitive step would be a "horizontal" fusion of the three redomaps, but this cannot be done as an independent step because the three inputs X1, X2, and X3 have different sizes, hence cannot be fused horizontally into a redomap.
However, it is possible to do a MEGA horizontal+vertical fusion in one step: the partition is fused with the three redomaps. We know the following facts:
-- the first redomap consumes X1 AND produces Y1, which is subsequently subject to a partition operation,
-- the second reduce consumes X2,
-- the third redomap consumes X3 AND produces Y3.
Performing the MEGA-fusion step requires two mini steps:
1.) Merge the partition with the three redomaps, by discriminating the inputs from X0, X1, X2, X3, based on the values of the keys_X array, AND
2.) Merge the partition of Y1 with the partition of A into one partition operation. This is because Y1 is produced from X1, which is, in its turn, partitioned from A.
In essence, the combined redomap should return:
-- the accumulated results: s1, s2, s3,
-- one combined array containing the elements of X0, Y1, and Y3, which is possible only when X0, Y1, and Y3 have the same type and identical inner shapes,
-- one key array, which, in our case, combines the keys of Z (Z_keys) with the keys of X (keys_X).
fun {real,real,real,[real],[real],[real]} main([real] A) =
let keys_X =
map( fn int (real a) =>
if 1.0 < a then 0
else if 10.0 < a then 1
else if 100.0 <= a then 2
else 3
, A ) in
// DEPENDENCY ^
// |
let {s1,s2,s3,ZX_keys,X0Y1Y3} =
redomap( fn {real,real,real} ( {real,real,real} t1
, {real,real,real} t2 ) =>
let {acc1,prd1,mn1} = t1 in
let {acc2,prd2,mn2} = t2 in
{ acc1+acc2, prd1*prd2, min(mn1,mn2) }
, fn {real,real,real,int,real}
( real acc, real prd, real mn1, int key_x, real x )
=>
if key_x == 0
then {acc,prd,mn1,0+2,x}
else if key_x == 1
then
let y1 = f1(x) in
let z_key = if 50.0 < y1
then 0
else 1 in
let acc1 =
if z_key == 1
then acc + y1
else acc in
{acc1,prd,mn1,z_key,y1}
else if key_x == 2
then
let prd1 = prd * x in
{acc,prd1,mn1,2+2,x}
else // key_x == 3
let y3 = f3(x) in
let mn2 = min(mn1, y3) in
{acc,prd,mn2,3+2,y3}
, {0.0,1.0,+INF}, keys_X, A ) in
let {sz2,sx0,sy3, Z2,X0,Y3} =
partition( 6, {1,2,5}, ZX_keys, X0Y1Y3 )
in { s1, s2, s3, X0, Z2, Y3 }
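A Python sketch of this MEGA-fused traversal, checked against the unfused pipeline (f1, f3, and the mutually-exclusive key function are hypothetical stand-ins, since the running example's predicates are only schematic; the combined-key encoding offsets the X-classes by 2, as above):

```python
import math
from functools import reduce

f1 = lambda x: x * 10.0        # hypothetical f1
f3 = lambda x: x * 2.0         # hypothetical f3

def key_x(a):
    # Hypothetical mutually-exclusive key function, four classes.
    if a < 1.0:     return 0   # -> X0
    elif a < 10.0:  return 1   # -> X1 (mapped by f1, re-partitioned)
    elif a < 100.0: return 2   # -> X2 (reduced with *)
    else:           return 3   # -> X3 (mapped by f3, reduced with min)

A = [0.5, 5.0, 6.0, 20.0, 0.2, 200.0]

# Unfused reference pipeline: partition, maps, reduces, re-partition.
X = {k: [a for a in A if key_x(a) == k] for k in range(4)}
Y1_ref = [f1(x) for x in X[1]]
Z2_ref = [y for y in Y1_ref if not 50.0 < y]          # z_key == 1
s1_ref = sum(Z2_ref)
s2_ref = reduce(lambda a, b: a * b, X[2], 1.0)
s3_ref = reduce(min, [f3(x) for x in X[3]], math.inf)

# MEGA-fused: one pass producing the three accumulators, one combined
# value array, and one combined key array.
acc, prd, mn1 = 0.0, 1.0, math.inf
ZX_keys, X0Y1Y3 = [], []
for a in A:
    k = key_x(a)
    if k == 0:
        ZX_keys.append(0 + 2); X0Y1Y3.append(a)
    elif k == 1:
        y1 = f1(a)
        z_key = 0 if 50.0 < y1 else 1
        if z_key == 1:
            acc += y1
        ZX_keys.append(z_key); X0Y1Y3.append(y1)
    elif k == 2:
        prd *= a
        ZX_keys.append(2 + 2); X0Y1Y3.append(a)
    else:
        y3 = f3(a)
        mn1 = min(mn1, y3)
        ZX_keys.append(3 + 2); X0Y1Y3.append(y3)

# One final partition on the combined keys recovers Z2, X0 and Y3.
Z2 = [v for zk, v in zip(ZX_keys, X0Y1Y3) if zk == 1]
X0 = [v for zk, v in zip(ZX_keys, X0Y1Y3) if zk == 2]
Y3 = [v for zk, v in zip(ZX_keys, X0Y1Y3) if zk == 5]
```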
Finally, the last step is to fuse the map with the redomap: this is trivial and is not shown.
In CONCLUSION: the original code traversed the arrays several times, i.e., more accesses to global memory, and performed two partition operations. The fused code traverses the original array exactly once to compute the values, and then it requires only one partition operation!
The downside of the MEGA step is that it may introduce significant DIVERGENCE overhead on hardware such as GPGPUs.