For the next few weeks, we will be working through a project to build a basic compiler. In doing this, we will gain experience with several useful tools for practical compiler development (namely, the Python AST library for embedded front-end development, and LLVM for back-end code generation). Along the way we will also encounter common data structures, patterns, and idioms in building compilers and DSLs.
We will build a simple compiler from a subset of Python to optimized machine code. If you are familiar with Numba, you can think of this assignment as building our own toy Numba JIT. With that, we will be able to write code like the following:
@Compile
def square(x : int) -> int:
    return x*x
Using the @Compile decorator will invoke our compiler on the square function, parsing, analyzing, JIT compiling, and ultimately replacing it with native machine code.
If you are not very familiar with Python, the @Compile syntax may look magic or unclear. Don't worry: like most things in Python, it's actually very simple. @Compile is just a "decorator" we will define to package up our functionality. Decorators are syntactic sugar for applying a higher-order function (a function that takes and returns other functions, in this case Compile) to another function (in this case, square) immediately after its definition, and replacing that definition with the result. In other words, this is just syntactic sugar for:
# define square just like any other Python function:
def square(x : int) -> int:
    return x*x
# replace the definition of square with a compiled version of itself:
square = Compile(square)
If you are familiar with Python and that code still looks strange to you, it's probably because of the type annotations. Yes, this is real, vanilla Python; it's just using a fairly recent addition to the language called "type hints". The annotation syntax itself has been part of Python since 3.0, and PEP 484 standardized its use for type hints in Python 3.5. We will be using Python 3.6.
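To make both ideas concrete, here is a minimal sketch of a decorator that can see a function's type hints. The Compile below is only a stand-in (an assumption for illustration, not the compiler we will build): it records the annotations on a hypothetical hints attribute and hands the original function back unchanged.

```python
def Compile(fn):
    # Stand-in for the real compiler (hypothetical): it only copies the
    # type hints onto the function and returns it unchanged.
    fn.hints = dict(fn.__annotations__)
    return fn


@Compile
def square(x: int) -> int:
    return x * x


print(square.hints)  # {'x': <class 'int'>, 'return': <class 'int'>}
print(square(4))     # 16
```

The real decorator will use the same hook, but return a compiled replacement instead of the original function.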
To begin, we need a way to turn actual Python code into something we can work with. This will be our job for the first assignment.
Generally, the front-end of a compiler is responsible for mapping from raw input into an Abstract Syntax Tree (AST), an Intermediate Representation (IR) which corresponds more cleanly to the logical level at which we want to think about user code. (The "abstract" name distinguishes an AST from a literal parse tree. An AST is generally simplified and normalized beyond the raw output of a parser in external languages, or from operator traces or other information in embedded languages.)
Python is a nice platform for building language extensions because the standard library includes rich tools for parsing, representing, and manipulating the complete Python syntax within the language itself. Using this, Python programs can represent and manipulate their own ASTs relatively easily.
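As a quick sketch of what the standard library gives us, the following parses the source of square with ast.parse and pokes at the resulting tree. The source string and variable names are just for illustration; the exact text printed by ast.dump varies between Python versions, so no output is shown.

```python
import ast

source = """
def square(x: int) -> int:
    return x * x
"""

tree = ast.parse(source)   # an ast.Module node
func = tree.body[0]        # the ast.FunctionDef node for square
ret = func.body[0]         # the ast.Return statement inside it

print(type(func).__name__, func.name)
print([arg.arg for arg in func.args.args])   # parameter names
print(ast.dump(ret.value))                   # the x * x expression as a tree
```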
For our compiler, we are only going to worry about a simplified subset of Python. Specifically, we're only going to compile constructs which trivially map into the following simple IR:
Expr = BinOp(Bop op, Expr left, Expr right)
     | CmpOp(Cop op, Expr left, Expr right)
     | UnOp(Uop op, Expr e)
     | Ref(Str name, Expr? index)
     | FloatConst(float val)
     | IntConst(int val)
Uop = Neg | Not
Bop = Add | Sub | Mul | Div | Mod | And | Or
Cop = EQ | NE | LT | GT | LE | GE
Stmt = Assign(Ref ref, Expr val)
     | Block(Stmt* body)
     | If(Expr cond, Stmt body, Stmt? elseBody)
     | For(Str var, Expr min, Expr max, Stmt body)
     | Return(Expr val)
     | FuncDef(Str name, Str* args, Stmt body)
This is a description of our intermediate representation as an algebraic data type. You should read this as "An Expr[ession] can be either a Bin[ary]Op, containing an op and left and right child Exprs, or a CmpOp, containing..., or an Int[eger]Const[ant] containing an int value." The ? suffix means that a field is optional (so a Ref[erence] has an optional index expression for when it is referencing an element of an array rather than a scalar variable), while a * suffix means that it is a list of 0 or more items of that type. (You will see similar notation, written in an actual data structure generation DSL called ASDL, in the official documentation of the Python AST.) ADTs are an especially useful notation when describing tree-structured data, where each tree node can be one of many types, as is usually the case in compiler IRs.
This IR is similar in structure to many imperative languages like C or Python: it has separate notions of "expressions" (trees of basic math operations, reads from variables, and constants) and "statements" (the top-level, sequentially-ordered operations which are delimited by semicolons in C or separate lines in Python). A statement can Assign an expression to a variable reference, Return a value, perform an If/else branch or a For loop over more statements, or encapsulate a list of potentially many statements in a Block (like the contents of a pair of braces { } in C).
The corresponding data structures are implemented in the code as a set of classes based on the Python base ast.AST node type. We use this not because we're going to intermingle our IR nodes with the original Python AST nodes, but because ast.AST embodies a nice, simple design pattern for representing these kinds of ADTs: we simply declare the list of fields we want each class to have. It also comes with utilities for recursively traversing the AST using the visitor pattern, namely ast.NodeVisitor. These classes also integrate naturally with the astor AST utility library, which includes useful utilities for pretty-printing Python ASTs (astor.dump) and re-generating Python code from an AST.
Here's a quick overview of what you can do with IR nodes based on ast.AST:
>>> import ast, astor
>>> class MyIRNode(ast.AST):
...     _fields = ('foo', 'bar')
...
# construct with no fields assigned yet
# fields become attributes of the node object
>>> a = MyIRNode()
>>> a.foo = 'hi'
>>> a.bar = 'bye'
# use astor to pretty-print the IR
>>> astor.dump(a)
"MyIRNode(foo='hi', bar='bye')"
>>> b = MyIRNode('left', 'right') # construct fields by positional order
>>> c = MyIRNode(foo=b, bar=a) # construct fields by name
# c has a and b as child nodes:
>>> astor.dump(c)
"MyIRNode(foo=MyIRNode(foo='left', bar='right'), bar=MyIRNode(foo='hi', bar='bye'))"
# Build a custom IR visitor.
# For each node they visit, NodeVisitors try to dispatch to a method named
# visit_[NodeClassName]. If it doesn't exist, they fall back to generic_visit.
>>> class MyIRVisitor(ast.NodeVisitor):
...     def visit_MyIRNode(self, node):
...         return str.format("MyIRNode(foo={}, bar={})",
...                           self.visit(node.foo),
...                           self.visit(node.bar))
...     def generic_visit(self, node):
...         return str(node)
...
>>> MyIRVisitor().visit(c)
'MyIRNode(foo=MyIRNode(foo=left, bar=right), bar=MyIRNode(foo=hi, bar=bye))'
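The same pattern carries over to the IR described by the ADT above. As a rough sketch, here are a few of its node types written as ast.AST subclasses. These definitions are my own illustration and the classes in the starter code may be declared differently; the field names simply follow the ADT.

```python
import ast

# A handful of IR node types from the ADT, declared by listing their fields.
class IntConst(ast.AST):
    _fields = ('val',)

class Ref(ast.AST):
    _fields = ('name', 'index')   # index is None for a scalar variable

class BinOp(ast.AST):
    _fields = ('op', 'left', 'right')

class Add(ast.AST):
    _fields = ()                  # operators carry no data of their own

class Mul(ast.AST):
    _fields = ()

class Return(ast.AST):
    _fields = ('val',)

# The IR for "return x*x + 1", built by positional field order.
x_squared = BinOp(Mul(), Ref('x', None), Ref('x', None))
stmt = Return(BinOp(Add(), x_squared, IntConst(1)))

# Generic helpers from the ast module work on our nodes too.
print([name for name, _ in ast.iter_fields(stmt.val)])   # ['op', 'left', 'right']
print(sum(1 for _ in ast.walk(stmt)))                    # 8 nodes in total
```

Representing operators such as Add and Mul as field-less node classes mirrors how the Python AST itself handles them.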
- Install Python 3.6. I highly recommend Anaconda, on which I will build a distribution for later parts of this project, but for part 1 most versions of Python 3 should work fine.
- Install astor: pip install astor.
- Fork and clone your own copy of this repository.
- If you're new to Python and haven't used IPython before, I highly recommend it: run conda install ipython or pip install ipython to be sure you have it. It includes both the now-popular Jupyter Notebook interactive web interface and a much better REPL for interactively writing and testing Python on the command line. Run ipython in your shell. Experience the wonders of tab completion, syntax highlighting, shell history, etc. Run ipython notebook to get the web notebook interface.
Our first task is to translate Python code into our simplified IR. We will do this by building an ast.NodeVisitor to recursively walk over a piece of Python AST and construct the corresponding simplified IR. The skeleton for this is defined in the PythonToSimple class in the starter code.
Running python compiler.py executes this on a trivial function using the simple test code at the bottom. The starter code implements just the bare minimum necessary to translate the function:
def trivial():
    return 5
Your first job is to complete the PythonToSimple visitor so that it translates any Python code that can reasonably be expressed in the IR above.
I haven't yet provided any additional tests, but you can extend this arbitrarily to write and test your own richer examples.
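To give a feel for the shape of such a translator, here is a self-contained sketch that handles only the trivial function. It is not the starter code: the class name TinyTranslator and the IR class definitions are assumptions made for this example, and it handles both ast.Constant (Python 3.8+) and ast.Num (Python 3.6/3.7) because the node type for literals changed between versions.

```python
import ast

# Hypothetical IR classes, following the field names in the ADT.
class IntConst(ast.AST):
    _fields = ('val',)

class Return(ast.AST):
    _fields = ('val',)

class Block(ast.AST):
    _fields = ('body',)

class FuncDef(ast.AST):
    _fields = ('name', 'args', 'body')


class TinyTranslator(ast.NodeVisitor):
    """Translate just enough Python AST to handle `def f(): return <int>`."""

    def visit_Module(self, node):
        # Assumption: the module contains exactly one function definition.
        return self.visit(node.body[0])

    def visit_FunctionDef(self, node):
        args = [a.arg for a in node.args.args]
        body = Block([self.visit(stmt) for stmt in node.body])
        return FuncDef(node.name, args, body)

    def visit_Return(self, node):
        return Return(self.visit(node.value))

    def visit_Constant(self, node):   # literals on Python 3.8+
        return IntConst(node.value)

    def visit_Num(self, node):        # literals on Python 3.6 / 3.7
        return IntConst(node.n)


ir = TinyTranslator().visit(ast.parse("def trivial():\n    return 5\n"))
print(ir.name, ir.args, ir.body.body[0].val.val)
```

Each visit_ method returns the IR node for the Python node it was given, so the recursion builds the IR tree bottom-up.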
To understand the Python AST, and how to build ast.NodeVisitors, I highly recommend not only the official module documentation, but also the GreenTreeSnakes unofficial documentation.
Bonus: raise explicit NotImplementedError exceptions when attempting to translate Python AST nodes that have no equivalent in our IR.
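One possible way to approach the bonus, sketched under the assumption that every supported node type gets its own visit_ method: override generic_visit so that anything without a handler fails loudly. The StrictVisitor class and its two handlers are illustrative only.

```python
import ast


class StrictVisitor(ast.NodeVisitor):
    """Visitor that rejects any node type it has no explicit handler for."""

    def generic_visit(self, node):
        # NodeVisitor falls back to generic_visit when no visit_<ClassName>
        # method exists, so this catches every unsupported construct.
        raise NotImplementedError(
            "cannot translate Python construct: " + type(node).__name__)

    def visit_Module(self, node):
        return [self.visit(stmt) for stmt in node.body]

    def visit_Pass(self, node):
        return 'pass'


ok = StrictVisitor().visit(ast.parse("pass"))

try:
    StrictVisitor().visit(ast.parse("while True:\n    pass\n"))
    message = None
except NotImplementedError as err:
    message = str(err)

print(ok)
print(message)
```

Because the check lives in generic_visit, supporting a new construct only requires adding its visit_ method.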
To test this, and to exercise our skills working with tree-structured IRs, we will also build a simple interpreter for our IR, which we can execute and compare with the original.
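As a sketch of what such an interpreter might look like for the expression half of the IR, the visitor below evaluates IntConst, Ref, and two BinOp operators against an environment dictionary. The class definitions and the ExprInterpreter name are assumptions for this example, not part of the starter code.

```python
import ast

# Hypothetical IR classes, following the field names in the ADT.
class IntConst(ast.AST):
    _fields = ('val',)

class Ref(ast.AST):
    _fields = ('name', 'index')

class BinOp(ast.AST):
    _fields = ('op', 'left', 'right')

class Add(ast.AST):
    _fields = ()

class Mul(ast.AST):
    _fields = ()


class ExprInterpreter(ast.NodeVisitor):
    """Evaluate a small subset of IR expressions in a variable environment."""

    def __init__(self, env):
        self.env = env   # maps variable names to Python values

    def visit_IntConst(self, node):
        return node.val

    def visit_Ref(self, node):
        value = self.env[node.name]
        if node.index is not None:   # array element rather than scalar
            value = value[self.visit(node.index)]
        return value

    def visit_BinOp(self, node):
        left, right = self.visit(node.left), self.visit(node.right)
        if isinstance(node.op, Add):
            return left + right
        if isinstance(node.op, Mul):
            return left * right
        raise NotImplementedError(type(node.op).__name__)

    def generic_visit(self, node):
        raise NotImplementedError(type(node).__name__)


# x*x + 1 with x = 4
expr = BinOp(Add(), BinOp(Mul(), Ref('x', None), Ref('x', None)), IntConst(1))
result = ExprInterpreter({'x': 4}).visit(expr)
print(result)   # 17
```

Comparing results like this against the original Python function called with the same arguments gives a simple check that the translation preserved meaning.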
You should submit your code by forking this repository on GitHub. Precise instructions will follow.