I. The NEW FILTER
should semantically implement a partitioning (discrimination) of an array's elements into a number of arbitrary classes (where the number of classes is bounded by 15). The equivalence classes are specified in terms of (input) predicates. For example, in the external-language Futhark program below
fun {[int],[int]} main([int] A) =
let {x0,x1,x2,x3} =
filter( fn bool (int a) => a % 4 == 0 // pred0
, fn bool (int a) => a % 4 == 1 // pred1
, fn bool (int a) => a % 4 == 2 // pred2
, A)
in {x0,x3}
the array A is partitioned into four arrays x0, x1, x2, x3 such that their summed size equals the size of the original array A, x0 contains the integers in 4Z + 0, ..., and x3 contains the integers in 4Z + 3.
Note that the last predicate is implicit, i.e., not ((a % 4 == 0) || (a % 4 == 1) || (a % 4 == 2)),
and, in general, the semantics of the filter construct is that, while the input predicates might not be mutually exclusive, they are transformed to be so, intuitively via an if-elif-...-else type of construct. For example, this means that the equivalence classes correspond to:
1. pred0
2. pred1 && (not pred0)
3. pred2 && (not (pred0 || pred1))
4. not (pred0 || pred1 || pred2)
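A minimal Python sketch of this first-predicate-wins semantics (the helper name `equivalence_class` is ours, for illustration only):

```python
def equivalence_class(preds, a):
    """Return the index of the first predicate that holds for `a`,
    or len(preds) if none does (the implicit last class)."""
    for i, p in enumerate(preds):
        if p(a):
            return i
    return len(preds)

# The three (non-exclusive) predicates of the example above.
preds = [lambda a: a % 4 == 0,
         lambda a: a % 4 == 1,
         lambda a: a % 4 == 2]

keys = [equivalence_class(preds, a) for a in [0, 1, 2, 3, 4, 5]]
# 0 and 4 fall in class 0, 1 and 5 in class 1, 2 in class 2,
# and 3 in the implicit class 3.
```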
In the internal language, filter is represented as the composition between a map that computes the equivalence-class key for each value and a partition that "permutes" the elements of the original array based on these keys, such that elements corresponding to the same key belong to the same output array AND preserve their relative ordering from the original array. (For example, the latter allows one to reduce each output array with a merely associative binary operator, instead of one that is both associative and commutative.)
// external language
let {x0,x1,x2,x3} =
filter( fn bool (int a) => a % 4 == 0 // pred0
, fn bool (int a) => a % 4 == 1 // pred1
, fn bool (int a) => a % 4 == 2 // pred2
, A)
// internal language
let keyarr = map( fn int (int a) => // result is in [0..n],
// where n = #predicates.
if (a % 4 == 0) then 0
else if (a % 4 == 1) then 1
else if (a % 4 == 2) then 2
else 3
, A )
let { s0,s3, x0,x3 } = partition(4, {0,3}, keyarr, A)
in {x0, x3}
In the code above, the arguments of partition are:
-- 4 denotes the range of the keys, i.e., keys take values \in [0...3],
-- {0,3} denotes the equivalence classes that are actually of interest,
i.e., it provides an optimization hook. For example, listing all equivalence
classes, i.e., {0,1,2,3}, is safe, BUT one can observe that x1 and x2
are dead; hence, as an optimization, we could partition the
array into only three equivalence classes: 0, 3, and "all the rest"
(as opposed to four classes),
-- keyarr denotes the array of keys (one key for each value),
-- A is the array to be partitioned,
-- s0 and s3 are the existential sizes of the result partitions x0 and x3.
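As a reference, the sequential semantics of partition can be sketched in Python (a stable, order-preserving bucket collection; the helper below is an illustration, not the actual internal representation):

```python
def partition(num_classes, classes_of_interest, keyarr, xs):
    """Stable partition: for each requested class, collect the elements
    of `xs` whose key equals that class, preserving their original
    order.  Returns the sizes of the requested partitions followed by
    the partitions themselves."""
    assert len(keyarr) == len(xs)
    assert all(0 <= k < num_classes for k in keyarr)
    parts = {c: [x for k, x in zip(keyarr, xs) if k == c]
             for c in classes_of_interest}
    sizes = [len(parts[c]) for c in classes_of_interest]
    return sizes + [parts[c] for c in classes_of_interest]

A = [0, 1, 2, 3, 4, 5, 6, 7]
keyarr = [a % 4 for a in A]            # the map-computed keys
s0, s3, x0, x3 = partition(4, [0, 3], keyarr, A)
# x0 collects the 4Z+0 elements, x3 the 4Z+3 elements, in order.
```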
II. The NEW REDOMAP
is extended to use array concatenation implicitly, so that it can return both the mapped array and its reduction result:
redomap :: { ( {b,b} -> b ) , ( {b,a} -> {b, c} ) , b , [a] } -> { b, [c] }
should be able to fuse
fun {real,[real]} main([int] X) =
let Y = map(f, X) in
let s = reduce(op +, 0.0, Y) in
{s,Y}
into
fun {real,[real]} main([int] X) =
redomap( op +
, fn {real,real} (real s, int x) =>
let y = f(x) in { s + y, y }
, 0.0, X
)
This is still a "reduce o map" composition if we make the array concatenation explicit:
fun {real,[real]} main([int] X) =
redomap( fn {real, [real]} ( {real,[real]} t1, {real,[real]} t2 ) =>
let {sum, Y} = t1 in let {s, y} = t2 in {sum + s, Y ++ y}
, fn {real,[real]} (real s, int x) =>
let y = f(x) in { s + y, [y] }
, 0.0, X
)
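A sequential Python sketch of the new redomap semantics (assuming a left-to-right fold; the first, reduce-operator argument is only needed to combine per-chunk results in a parallel execution and is ignored here):

```python
def redomap(reduce_op, fold_fn, neutral, xs):
    """Sequentially: fold `fold_fn` over `xs`, collecting the mapped
    elements on the side.  `reduce_op` would combine per-chunk partial
    results in a parallel execution; it is unused sequentially."""
    acc, ys = neutral, []
    for x in xs:
        acc, y = fold_fn(acc, x)
        ys.append(y)
    return acc, ys

f = lambda x: float(x * x)            # some hypothetical map function
X = [1, 2, 3]
s, Y = redomap(lambda a, b: a + b,
               lambda s, x: (s + f(x), f(x)),
               0.0, X)
# Equivalent to: Y = map(f, X); s = reduce(+, 0.0, Y)
```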
III. Finally, with the new semantics of redomap and filter, FUSION should become significantly more aggressive. Below is a demonstration of how Fusion2.0 should work:
fun {real,real,real,[real],[real],[real]} main([real] A) =
let {X0, X1, X2, X3} =
filter( op <(1.0)
, op <(10.0)
, op <=(100.0)
, A) in
let Y1 = map(f1, X1) in
let {Z1,Z2} =
filter(op <(50.0), Y1) in
let Y3 = map(f3, X3) in
let s1 = reduce(op +, 0.0, Z2) in
let s2 = reduce(op *, 1.0, X2) in
let s3 = reduce(min, +INF, Y3) in
{ s1, s2, s3, X0, Z2, Y3 }
We replace the filter with the partition o map composition and rearrange the code in terms of the dependency graph to make the fusion steps easier to follow (fusion is implemented as a T2-reduction of the dependency graph -- see the FHPC'12 paper):
fun {real,real,real,[real],[real],[real]} main([real] A) =
let keys_X =
map( fn int (real a) =>
if 1.0 < a then 0
else if 10.0 < a then 1
else if 100.0 <= a then 2
else 3
, A ) in
// DEPENDENCY ^
// |
let {sx0,sx1,sx2,sx3, X0,X1,X2,X3} =
partition( 4, {0,1,2,3}, keys_X, A ) in
// DEPENDENCIES ^
// _________________________|______________________________________________
// | | |
// |
let Y1 = map(f1, X1) in
// ^ | |
// | | |
let Z_keys = map( fn int (real y1) =>
if 50.0 < y1 then 0
else 1
, Y1 )
// ^ | |
// | | |
let {sz2, Z2} =
partition( 2, {1}, Z_keys, Y1) in
// ^ | |
// | | |
let Y3 = map(f3, X3) in
let s3 = reduce(min, +INF, Y3) in
// | |
let s1 = reduce(op +, 0.0, Z2) in
// |
let s2 = reduce(op *,1.0,X2) in
{ s1, s2, s3, X0, Z2, Y3 }
Note that the partition on Y1 in the code above is "optimized" in that, since Z1 is dead, it does not express it, i.e., it mentions only the Z2 partition.
We proceed by fusing bottom-up on the dependency graph:
-- the map producing Y3 with the reduce consuming Y3 and producing s3,
-- the partition producing Z2 with the reduce consuming Z2 and producing s1.
Fusing a partition with a reduce corresponds to moving the partition after the reduce and transforming the reduce into a redomap that accumulates according to the key array (see the code below).
Fusing a partition with a map can be done similarly to reduce, but IF AND ONLY IF the result of the map function is of size smaller than or equal to the input, because otherwise the resulting partition would be more expensive than the original one, since it would need to interchange "bigger" elements.
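The partition-with-reduce rule can be illustrated with a small Python sketch (the input values are made up for the example): the keyed accumulation computes the same s1 as partitioning first and reducing the Z2 class afterwards.

```python
from functools import reduce

Y1     = [60.0, 10.0, 75.0, 20.0]
Z_keys = [0 if 50.0 < y1 else 1 for y1 in Y1]   # class 1 feeds Z2

# Unfused: partition first, then reduce over the Z2 class.
Z2 = [y for k, y in zip(Z_keys, Y1) if k == 1]
s1_unfused = reduce(lambda a, b: a + b, Z2, 0.0)

# Fused: one keyed pass over (Y1, Z_keys); the partition that still
# produces Z2 is moved after the reduce.
s1_fused = 0.0
for y1, z_key in zip(Y1, Z_keys):
    if z_key == 1:
        s1_fused = s1_fused + y1
```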
fun {real,real,real,[real],[real],[real]} main([real] A) =
let keys_X =
map( fn int (real a) =>
if 1.0 < a then 0
else if 10.0 < a then 1
else if 100.0 <= a then 2
else 3
, A ) in
// DEPENDENCY ^
// |
let {sx0,sx1,sx2,sx3, X0,X1,X2,X3} =
partition( 4, {0,1,2,3}, keys_X, A ) in
// DEPENDENCIES ^
// _________________________|______________________________________________
// | | |
// |
let Y1 = map(f1, X1) in
// ^ | |
// | | |
let Z_keys = map( fn int (real y1) =>
if 50.0 < y1 then 0
else 1
, Y1 )
// ^ | |
// | | |
let s1 =
redomap( op +
, fn real (real acc, real y1, int z_key)
=> let acc1 =
if z_key == 1
then acc + y1
else acc
in acc1
, 0.0, Y1, Z_keys ) in
// | |
let {sz2, Z2} =
partition(2,{1},Z_keys,Y1) in
// | |
let s2 = reduce(op *, 1.0, X2) in
let {s3,Y3} =
redomap( min
, fn {real,[real]} (real acc,real x3) =>
let y3 = f3(x3) in {min(acc,y3), y3}
, +INF, X3 ) in
{ s1, s2, s3, X0, Z2, Y3 }
Then we fuse the two maps producing Y1 and Z_keys with the corresponding redomap kernel.
fun {real,real,real,[real],[real],[real]} main([real] A) =
let keys_X =
map( fn int (real a) =>
if 1.0 < a then 0
else if 10.0 < a then 1
else if 100.0 <= a then 2
else 3
, A ) in
// DEPENDENCY ^
// |
let {sx0,sx1,sx2,sx3, X0,X1,X2,X3} =
partition( 4, {0,1,2,3}, keys_X, A ) in
// DEPENDENCIES ^
// _________________________|______________________________________________
// | | |
// |
// | | |
let {s1,Z_keys,Y1} =
redomap( op +
, fn {real,int,real}
(real acc,real x1) =>
let y1 = f1(x1) in
let z_key = if 50.0 < y1
then 0
else 1 in
let acc1 =
if z_key == 1
then acc + y1
else acc
in {acc1, z_key, y1}
, 0.0, X1) in
// | |
let {sz2, Z2} =
partition(2,{1},Z_keys,Y1) in
// | |
let s2 = reduce(op *, 1.0, X2) in
let {s3,Y3} =
redomap( min
, fn {real,[real]} (real acc,real x3) =>
let y3 = f3(x3) in {min(acc,y3), y3}
, +INF, X3 ) in
{ s1, s2, s3, X0, Z2, Y3 }
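This vertical map-into-redomap step can be checked with a Python sketch (f1 and the input values are hypothetical stand-ins): the fused single pass over X1 produces s1, Z_keys, and Y1 identical to the separate map/map/redomap version.

```python
f1 = lambda x: x + 40.0        # hypothetical stand-in for f1

X1 = [5.0, 20.0, 1.0, 15.0]

# Unfused: two maps plus a keyed accumulation.
Y1_ref = [f1(x) for x in X1]
Z_keys_ref = [0 if 50.0 < y else 1 for y in Y1_ref]
s1_ref = sum(y for k, y in zip(Z_keys_ref, Y1_ref) if k == 1)

# Fused: one redomap over X1 producing all three results in one pass.
acc, Z_keys, Y1 = 0.0, [], []
for x1 in X1:
    y1 = f1(x1)
    z_key = 0 if 50.0 < y1 else 1
    if z_key == 1:
        acc += y1
    Z_keys.append(z_key)
    Y1.append(y1)
```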
!!!FOLLOWS THE TRICKY AND IMPORTANT STEP!!!
It would seem that we got stuck here, i.e., this is the best it can get: we have three "independent" redomaps that consume the array results of the previous partition.
The next intuitive step would be a "horizontal" fusion of the three redomaps, but this cannot be done as an independent step because the three inputs X1, X2, and X3 have different sizes, hence cannot be fused horizontally into a redomap.
However, it is possible to do a MEGA horizontal+vertical fusion in one step: the partition is fused with the three redomaps. We know the following facts:
-- the first redomap consumes X1 AND produces Y1, which is subsequently subject to a partition operation,
-- the second reduce consumes X2,
-- the third redomap consumes X3 AND produces Y3.
Performing the MEGA-fusion step requires two mini steps:
1.) Merge the partition with the three redomaps, by discriminating the inputs from X0, X1, X2, X3, based on the values of the keys_X array, AND
2.) Merge the partition of Y1 with the partition of A into one partition operation. This is because Y1 is produced from X1, which is, in its turn, partitioned from A.
In essence, the combined redomap should return:
-- the accumulated results: s1, s2, s3,
-- one combined array containing the elements of X0, Y1, and Y3, which is possible only when X0, Y1, and Y3 have the same type and identical inner shapes,
-- one key array, which, in our case, combines the keys of Z (Z_keys) with the keys of X (keys_X).
fun {real,real,real,[real],[real],[real]} main([real] A) =
let keys_X =
map( fn int (real a) =>
if 1.0 < a then 0
else if 10.0 < a then 1
else if 100.0 <= a then 2
else 3
, A ) in
// DEPENDENCY ^
// |
let {s1,s2,s3,ZX_keys,X0Y1Y3} =
redomap( fn {real,real,real} ( {real,real,real} t1
, {real,real,real} t2 ) =>
let {acc1,prd1,mn1} = t1 in
let {acc2,prd2,mn2} = t2 in
{ acc1+acc2, prd1*prd2, min(mn1,mn2) }
, fn {real,real,real,int,real}
( real acc, real prd, real mn1, int key_x, real x )
=>
if key_x == 0
then {acc,prd,mn1,0+2,x}
else if key_x == 1
then
let y1 = f1(x) in
let z_key = if 50.0 < y1
then 0
else 1 in
let acc1 =
if z_key == 1
then acc + y1
else acc in
{acc1,prd,mn1,z_key,y1}
else if key_x == 2
then
let prd1 = prd * x in
{acc,prd1,mn1,2+2,x}
else // key_x == 3
let y3 = f3(x) in
let mn2 = min(mn1, y3) in
{acc,prd,mn2,3+2,y3}
, {0.0,1.0,+INF}, keys_X, A ) in
let {sz2,sx0,sy3, Z2,X0,Y3} =
partition( 6, {1,2,5}, ZX_keys, X0Y1Y3 )
in { s1, s2, s3, X0, Z2, Y3 }
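A Python sketch of this MEGA-fused traversal, checked against the unfused pipeline (f1, f3, and the mutually-exclusive key function are hypothetical stand-ins, since the running example's predicates are only schematic; the combined-key encoding offsets the X-classes by 2, as above):

```python
import math
from functools import reduce

f1 = lambda x: x * 10.0        # hypothetical f1
f3 = lambda x: x * 2.0         # hypothetical f3

def key_x(a):
    # Hypothetical mutually-exclusive key function, four classes.
    if a < 1.0:     return 0   # -> X0
    elif a < 10.0:  return 1   # -> X1 (mapped by f1, re-partitioned)
    elif a < 100.0: return 2   # -> X2 (reduced with *)
    else:           return 3   # -> X3 (mapped by f3, reduced with min)

A = [0.5, 5.0, 6.0, 20.0, 0.2, 200.0]

# Unfused reference pipeline: partition, maps, reduces, re-partition.
X = {k: [a for a in A if key_x(a) == k] for k in range(4)}
Y1_ref = [f1(x) for x in X[1]]
Z2_ref = [y for y in Y1_ref if not 50.0 < y]          # z_key == 1
s1_ref = sum(Z2_ref)
s2_ref = reduce(lambda a, b: a * b, X[2], 1.0)
s3_ref = reduce(min, [f3(x) for x in X[3]], math.inf)

# MEGA-fused: one pass producing the three accumulators, one combined
# value array, and one combined key array.
acc, prd, mn1 = 0.0, 1.0, math.inf
ZX_keys, X0Y1Y3 = [], []
for a in A:
    k = key_x(a)
    if k == 0:
        ZX_keys.append(0 + 2); X0Y1Y3.append(a)
    elif k == 1:
        y1 = f1(a)
        z_key = 0 if 50.0 < y1 else 1
        if z_key == 1:
            acc += y1
        ZX_keys.append(z_key); X0Y1Y3.append(y1)
    elif k == 2:
        prd *= a
        ZX_keys.append(2 + 2); X0Y1Y3.append(a)
    else:
        y3 = f3(a)
        mn1 = min(mn1, y3)
        ZX_keys.append(3 + 2); X0Y1Y3.append(y3)

# One final partition on the combined keys recovers Z2, X0 and Y3.
Z2 = [v for zk, v in zip(ZX_keys, X0Y1Y3) if zk == 1]
X0 = [v for zk, v in zip(ZX_keys, X0Y1Y3) if zk == 2]
Y3 = [v for zk, v in zip(ZX_keys, X0Y1Y3) if zk == 5]
```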
Finally, the last step is to fuse the map with the redomap: this is trivial and is not shown.
In CONCLUSION: the original code traversed the arrays several times, i.e., more accesses to global memory, and performed two partition operations. The fused code traverses the original array exactly once to compute the values, and then it requires only one partition operation!
The downside of the MEGA step is that it may introduce significant DIVERGENCE overhead on hardware such as GPGPUs.