thinker227 / nanopasssharp

A language-agnostic framework intended to reduce the amount of boilerplate when writing compilers.

License: MIT License

C# 100.00%

nanopasssharp's Introduction

I'm a hobbyist programmer and (very) occasionally game developer :3

My main area of interest ranges from compilers and programming languages to small hobby game projects. Also interested in functional programming. Ask me what a monad is, I might have a reasonable answer.

Currently working on a programming language called Noa!

Main languages are , , and sometimes if I'm feeling adventurous.

nanopasssharp's Issues

Basic CLI

Implement a basic CLI which allows the user to enter an input language, output language, a pass definitions file, and the output location.

A possible CLI:

nanopass-sharp <input-language> <output-language> <pass-file> [--output <location>]
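
As a rough illustration of the shape above, the following is a minimal hand-rolled parser sketch. The names CliArguments and CliParser are hypothetical and not part of the project; a real implementation would more likely use an existing command-line parsing library.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical container for the parsed command line.
public sealed record CliArguments(
    string InputLanguage,
    string OutputLanguage,
    string PassFile,
    string? Output);

public static class CliParser
{
    // Parses: <input-language> <output-language> <pass-file> [--output <location>]
    public static CliArguments Parse(string[] args)
    {
        var positional = new List<string>();
        string? output = null;

        for (var i = 0; i < args.Length; i++)
        {
            if (args[i] == "--output")
            {
                // The option consumes the argument that follows it.
                if (++i >= args.Length)
                    throw new ArgumentException("--output requires a location.");
                output = args[i];
            }
            else
            {
                positional.Add(args[i]);
            }
        }

        if (positional.Count != 3)
            throw new ArgumentException(
                "Expected <input-language> <output-language> <pass-file>.");

        return new CliArguments(positional[0], positional[1], positional[2], output);
    }
}
```

The sketch accepts --output at any position and leaves Output as null when it is omitted, so the caller can decide on a default location.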

Implement C# output language

Implement AST emission to C# as a minimal example of an output language.

Visitor generation

One of the key features of the Nanopass framework is automating the boring parts of transformation between passes. I think our equivalent could be a generated visitor base class for each pass. By default, it would do a catamorphic transformation on unchanged nodes (which you can still override if needed) and would require the user to write the ones that were affected.

For example, let's say we have this AST:

// Pass 1
record Expr
{
    record Unit : Expr;
    record IfElse(Expr Cond, Expr Then, Expr Else) : Expr;
    record If(Expr Cond, Expr Then) : Expr;
    record Name(string Id) : Expr;
}

And then we do the following modifications:

Remove the 'If' node
Add the member 'ISymbol Symbol' to 'Name'
Remove the member 'Id' from 'Name'

Then the next generated AST would look like:

// Pass 2
record Expr
{
    record Unit : Expr;
    record IfElse(Expr Cond, Expr Then, Expr Else) : Expr;
    record Name(ISymbol Symbol) : Expr;
}

And the generated visitor base could be:

abstract class Pass1ToPass2Base
{
    // Generate dispatch functionality
    public Pass2.Expr Apply(Pass1.Expr expr) => expr switch
    {
        Pass1.Expr.Unit e => Apply(e),
        Pass1.Expr.IfElse e => Apply(e),
        Pass1.Expr.If e => Apply(e),
        Pass1.Expr.Name e => Apply(e),
        _ => throw new InvalidOperationException(),
    };

    // Unchanged nodes are just translated trivially
    // The user can still override them, if needed

    public virtual Pass2.Expr Apply(Pass1.Expr.Unit e) => new Pass2.Expr.Unit();

    public virtual Pass2.Expr Apply(Pass1.Expr.IfElse e) => new Pass2.Expr.IfElse(
        Apply(e.Cond),
        Apply(e.Then),
        Apply(e.Else));

    // Changed or removed nodes are required to be handled by the user

    public abstract Pass2.Expr Apply(Pass1.Expr.If e);

    public abstract Pass2.Expr Apply(Pass1.Expr.Name e);
}

Then all the user needs to implement is:

class Pass1ToPass2 : Pass1ToPass2Base
{
    public override Pass2.Expr Apply(Pass1.Expr.If e) => new Pass2.Expr.IfElse(
        Apply(e.Cond),
        Apply(e.Then),
        new Pass2.Expr.Unit());

    public override Pass2.Expr Apply(Pass1.Expr.Name e) => /* ... */;
}

Implement YAML input language

Implement YAML as an input language as a minimal example of an input language.

Output added/modified source code

Source code that has been added/modified should be output to the root project. This would most easily be done using Workspace.ApplyChangesAsync.

Design question: should output files be written to their own folder (e.g. NanopassSharp), be written to the root folder, or be written to the same folder as the file the modified type is located in?

Format generated source code

Format generated source code according to the options of the project, either using Microsoft.CodeAnalysis.Formatting.Formatter.FormatAsync or some equivalent alternative. This would be easiest to do directly after the source has been generated but outside the PassSourceGenerator class, since performing formatting inside it would require passing additional information about the project into it.

Add BuildException

Add a unique exception type for exceptions thrown during a build operation by a builder. This would assist in writing tests (not having to test for just InvalidOperationException) and allow for more specific exceptions.
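
A minimal sketch of what such a type could look like. Deriving from InvalidOperationException and the Builder property are assumptions made here for illustration, not something the issue specifies; the derivation would simply keep any existing tests that expect InvalidOperationException passing.

```csharp
#nullable enable
using System;

// Hypothetical exception type for failures raised while a builder is building.
public class BuildException : InvalidOperationException
{
    // The builder that failed, if the throwing code chose to supply it (assumption).
    public object? Builder { get; }

    public BuildException(string message, object? builder = null, Exception? inner = null)
        : base(message, inner)
    {
        Builder = builder;
    }
}
```

Tests could then assert on BuildException specifically, and more specific failures could later be modelled as subclasses.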

Architecture proposal

I'd like to propose an architecture for this tool to be as versatile, easily extensible and testable as possible. This would involve decoupling the input language from the class hierarchy and transformations, and decoupling the output format from those as well.

The components would be:

Core library: responsible for modeling the inheritance tree and providing model transformations
Input language library/libraries: responsible for building up the inheritance tree and transformations from some input language (YAML, JSON, ...)
Templating library: wraps up the tree model (sort of like in red-green trees) in the core library to provide a more extensive viewmodel for a templating engine, like Scriban
Output language library/libraries: Using the viewmodel provided by the templating component, it generates output for a specific language
CLI: A command-line interface to drive all of this as a simple to invoke .NET tool

In the following sections I'd like to detail these components slightly more.

Core library

Class hierarchy

The core could provide the model for the inheritance tree. It could look something like so (just a sketch):

// Describes a compiler pass
record CompilerPass(
	string Name,
	string Documentation,
	// Transformations to apply to get the tree based on the previous pass
	IList<Transformation> Transformations,
	TreeHierarchy Tree,
	CompilerPass? PreviousPass,
	CompilerPass? NextPass
);

// Wraps up an entire hierarchy
// Not necessary, but makes the API nicer
record TreeHierarchy(
	IDictionary<string, TreeNodeClass> Nodes
);

// Describes a single class in a hierarchy
record TreeNodeClass(
	string Name,
	string Documentation,
	// Null for the root of the hierarchy
	TreeNodeClass? Parent,
	IDictionary<string, TreeNodeMember> Members,
	// Language-specific things could be here
	// Sealed? Abstract? Some applied attribute for Python?
	ISet<object> Attributes
);

// Describes a single member/property in a class
record TreeNodeMember(
	string Name,
	string Documentation,
	// Dynamic languages might not have a type
	string? Type,
	// Language-specific things could be here
	// Public? Apply some attribute? Leave out from pretty-printing?
	ISet<object> Attributes
);

Something like this wouldn't be too language-specific, yet isn't so general as to be practically useless. Things like the type specification could be elaborated further if needed. Also, read-write properties would be nicer for such an API; I only used records for the simple syntax.

Tree transformation

The key operation the core would provide is tree transformation. It would take a tree hierarchy as input and apply a transformation, resulting in a new tree hierarchy. This is how the passes would build up their trees. Transformations could optionally be applied only to nodes matching a certain pattern. A possible API:

interface ITreeNodePattern
{
	public bool IsRecursive { get; }
	public bool Matches(TreeNodeClass c);
}

interface ITreeTransformer
{
	public ITreeNodePattern? Pattern { get; }
	public TreeHierarchy Apply(TreeHierarchy h);
}

Built-in transformations we could provide (and we could extend later):

Add a node
Remove node
Add a member to node
Remove a member from node

Built-in patterns we could provide (and we could extend later):

Node with given name
Node with name matching a regex
Node with given member(s)
All nodes
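
To make the built-ins above more concrete, here is a hedged sketch of one pattern and two transformations. The model records are trimmed-down stand-ins for the ones proposed earlier (documentation, parents and attributes are left out), IsRecursive is omitted, and treating a null Pattern as "all nodes" is an assumption made for this example. The class names NamePattern, RemoveNode and AddMember are hypothetical.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down stand-ins for the proposed model types.
public record TreeNodeMember(string Name, string? Type);
public record TreeNodeClass(string Name, IDictionary<string, TreeNodeMember> Members);
public record TreeHierarchy(IDictionary<string, TreeNodeClass> Nodes);

public interface ITreeNodePattern
{
    bool Matches(TreeNodeClass c);
}

public interface ITreeTransformer
{
    ITreeNodePattern? Pattern { get; }
    TreeHierarchy Apply(TreeHierarchy h);
}

// Pattern: node with a given name.
public sealed class NamePattern : ITreeNodePattern
{
    private readonly string name;
    public NamePattern(string name) => this.name = name;
    public bool Matches(TreeNodeClass c) => c.Name == name;
}

// Transformation: remove every matching node (a null pattern matches all nodes).
public sealed class RemoveNode : ITreeTransformer
{
    public ITreeNodePattern? Pattern { get; }
    public RemoveNode(ITreeNodePattern? pattern) => Pattern = pattern;

    public TreeHierarchy Apply(TreeHierarchy h) => new TreeHierarchy(
        h.Nodes.Values
            .Where(n => !(Pattern is null || Pattern.Matches(n)))
            .ToDictionary(n => n.Name));
}

// Transformation: add a member to every matching node.
public sealed class AddMember : ITreeTransformer
{
    private readonly TreeNodeMember member;
    public ITreeNodePattern? Pattern { get; }

    public AddMember(ITreeNodePattern? pattern, TreeNodeMember member)
    {
        Pattern = pattern;
        this.member = member;
    }

    public TreeHierarchy Apply(TreeHierarchy h) => new TreeHierarchy(
        h.Nodes.Values
            .Select(n => Pattern is null || Pattern.Matches(n) ? WithMember(n) : n)
            .ToDictionary(n => n.Name));

    // Copies the member table so the input hierarchy is left untouched.
    private TreeNodeClass WithMember(TreeNodeClass n)
    {
        var members = new Dictionary<string, TreeNodeMember>(n.Members)
        {
            [member.Name] = member
        };
        return n with { Members = members };
    }
}
```

Each transformation returns a new TreeHierarchy instead of mutating its input, so a pass can keep the previous pass's tree around for comparison, which is what the generated visitor would need.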

Rationale for the scope

I believe this is a well-testable and easily extensible component. The rest deal with input and output, which likely means mostly integration and end-to-end tests will apply to them. This component can be unit-tested to oblivion with all the patterns and transformations.

Input language libraries

These would be less interesting libraries, taking an input language and then transforming it to the core library representations, describing passes. Most likely they would invoke some existing language parser, like YAML or JSON, but the input could also be some custom notation. I wouldn't focus on developing many of these "front-ends" until the core has a stable enough API. Note that the input languages don't have to expose 100% of the core features. It's perfectly fine to only support the necessities.

Templating library

The templating library would wrap up the tree into a more redundant data structure that is more easily consumed by template engines. For example, these node wrappers would provide navigation to both the parent and children, or they could list all members, including the inherited ones. To stay language-agnostic, these should be generic wrappers, that the language-specific wrappers could re-use. For example, this library could ship a node wrapper something like this:

abstract class TreeNodeClassView<TSelf>
	where TSelf : TreeNodeClassView<TSelf>
{
	private readonly TreeNodeClass underlying;

	protected virtual bool HasAttribute(object attr) => underlying.Attributes.Contains(attr);

	public TSelf Parent => /* wrap up the parent in this type */;
	public IEnumerable<TSelf> Derived => /* wrap up the derived classes in this type */;

	// ...
}

Output language libraries

The output language libraries would adapt the wrappers in the templating library to the destination language (this is why the wrappers are abstract and generic). For example, adapting it to C#:

class CSharpTreeNodeClassView : TreeNodeClassView<CSharpTreeNodeClassView>
{
	// Specialize things like attributes to be specific to C#
	public bool IsSealed => HasAttribute(CSharpAttribs.Sealed);
	public bool IsAbstract => HasAttribute(CSharpAttribs.Abstract);

	// ...
}

The libraries would ship the required templates:

A template for generating a class hierarchy
A template for generating a visitor base class

Optionally, the library would ship a language formatter, or have the knowledge to invoke a pre-installed language formatter.

Make pass sequence builder able to construct passes out of sequence order

Currently (in #6 and #7), PassSequenceBuilder is only able to construct a pass sequence linearly from its first to its last pass. This is usable and works as intended (though tests still need to be written), although the fewer restrictions the builder API has, the better. A better API would be for each CompilerPassBuilder to have settable string Next and string Previous properties which are evaluated against the registered nodes in the PassSequenceBuilder on build. Any inconsistencies between the Next and Previous properties of consecutive passes would cause an exception (or some other, possibly configurable, behavior) at build time.

AstNodeBuilder.AddMember(AstNodeMember) causes a stack overflow

AstNodeBuilder.AddMember(AstNodeMember) calls itself unconditionally and causes a stack overflow.

Add ability for pass sequence builder to locate next relationship based on only previous

Add the ability for CompilerPassSequenceBuilder to find the next pass in sequence based on the Previous property of another pass.

CompilerPassSequenceBuilder builder = new();
builder.Root = "a";
builder.AddPass("a");
builder.AddPass("b")
    .WithPrevious("a");

The above would succeed despite pass a not having specified a Next, because b specifies a as its Previous. This would obviously require lookup based on the Previous property of the currently processed pass, although this could be somewhat mitigated by constructing a dictionary with the Previous as the key. There would also have to be some mechanism to ensure that if multiple passes specify the same Previous then an exception would be thrown.
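
The lookup described above could be sketched roughly as follows. PassInfo and PassLinker are hypothetical stand-ins, not the builder types from the repository, and the sketch throws InvalidOperationException only for simplicity.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a pass under construction; only the fields needed here.
public sealed class PassInfo
{
    public string Name { get; }
    public string? Previous { get; set; }
    public string? Next { get; set; }
    public PassInfo(string name) => Name = name;
}

public static class PassLinker
{
    // Returns the pass names in sequence order, starting from root.
    public static List<string> Order(string root, IEnumerable<PassInfo> passes)
    {
        var byName = new Dictionary<string, PassInfo>();
        var byPrevious = new Dictionary<string, PassInfo>();
        foreach (var pass in passes)
        {
            byName.Add(pass.Name, pass);
            if (pass.Previous is null) continue;
            // Two passes claiming the same Previous would make the sequence ambiguous.
            if (!byPrevious.TryAdd(pass.Previous, pass))
                throw new InvalidOperationException(
                    $"Passes '{byPrevious[pass.Previous].Name}' and '{pass.Name}' " +
                    $"both specify '{pass.Previous}' as Previous.");
        }

        var ordered = new List<string>();
        var seen = new HashSet<string>();
        string? current = root;
        while (current is not null)
        {
            if (!byName.TryGetValue(current, out var pass))
                throw new InvalidOperationException($"Unknown pass '{current}'.");
            if (!seen.Add(current))
                throw new InvalidOperationException($"Cycle detected at pass '{current}'.");
            ordered.Add(current);

            // The pass (if any) that names the current pass as its Previous.
            byPrevious.TryGetValue(current, out var follower);
            if (pass.Next is not null && follower is not null && follower.Name != pass.Next)
                throw new InvalidOperationException(
                    $"Pass '{current}' specifies Next '{pass.Next}', " +
                    $"but '{follower.Name}' specifies it as Previous.");

            // Prefer the explicit Next, otherwise fall back to the Previous-based lookup.
            current = pass.Next ?? follower?.Name;
        }
        return ordered;
    }
}
```

Building the byPrevious dictionary once makes each step a constant-time lookup, and the duplicate check falls out of TryAdd failing on an existing key.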

Language-specific CLI options

It would be useful to be able to pass options to (at least) the output language from the CLI. This would essentially act like the .NET CLI which allows for passing additional named options to the template in dotnet new <template>.

One possible API for this is to simply expose a Dictionary<string, string> to the output language which contains the additional options as key-value pairs. The upside of this approach is that it only requires some method of parsing additional CLI options; the output language handles everything else by itself. Another, slightly more involved option would be to expose an object which internally contains the options as key-value pairs but which provides a simple parse API using something like Get<T>(string longName, string? shortName). The obvious upside of this is that the output language doesn't need to parse anything by itself.
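
A rough sketch of the second approach, assuming a hypothetical LanguageOptions wrapper. The fallback parameter is an addition for this example, and conversion via Convert.ChangeType only covers simple IConvertible types such as int, bool and string, not enums or nullable value types.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical wrapper around the extra CLI options, stored as raw key-value pairs.
public sealed class LanguageOptions
{
    private readonly IReadOnlyDictionary<string, string> raw;

    public LanguageOptions(IReadOnlyDictionary<string, string> raw) => this.raw = raw;

    // Looks the option up by long name, then by short name, and converts it to T.
    // Returns fallback when the option was not supplied at all.
    public T Get<T>(string longName, string? shortName = null, T fallback = default!)
    {
        if (!raw.TryGetValue(longName, out var text) &&
            (shortName is null || !raw.TryGetValue(shortName, out text)))
        {
            return fallback;
        }
        return (T)Convert.ChangeType(text, typeof(T), CultureInfo.InvariantCulture)!;
    }
}
```

An output language could then call something like options.Get<int>("indent", "i", 4) without touching the raw strings. As with the plain dictionary, misspelled options would still go unnoticed.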

A third, much more involved option would be to require output languages to specify whether they need any options and what those options are. This would likely be done through some kind of reflection-based API using a generic variant of the IOutputLanguage interface, for instance IOutputLanguage<TOptions>. The downside is that it would require a much more involved API on the framework side; the upside is that it would massively simplify things for the output languages, which could simply implement IOutputLanguage<TOptions> and receive the options in the EmitAsync method. This could also allow for better CLI errors: since the options would be statically known, the CLI could report an error if an invalid or misspelled option was specified. The other two approaches don't have this benefit, as they would silently swallow invalid or misspelled options.
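
The reflection-based binding could look something like the sketch below. OptionsBinder and CSharpOptions are hypothetical names, the IOutputLanguage<TOptions> interface itself is deliberately not shown since its shape is not specified here, and matching option names to property names case-insensitively is an assumption.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Reflection;

public static class OptionsBinder
{
    // Creates a TOptions and fills its public settable properties from the raw options.
    // Throws if an option does not correspond to any property.
    public static TOptions Bind<TOptions>(IReadOnlyDictionary<string, string> raw)
        where TOptions : class, new()
    {
        var properties = typeof(TOptions)
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => p.CanWrite)
            .ToDictionary(p => p.Name, StringComparer.OrdinalIgnoreCase);

        var options = new TOptions();
        foreach (var (key, value) in raw)
        {
            // Unknown or misspelled options are reported instead of being swallowed.
            if (!properties.TryGetValue(key, out var property))
                throw new ArgumentException(
                    $"Unknown option '{key}'. Known options: {string.Join(", ", properties.Keys)}.");

            property.SetValue(
                options,
                Convert.ChangeType(value, property.PropertyType, CultureInfo.InvariantCulture));
        }
        return options;
    }
}

// Hypothetical options type an output language might declare.
public sealed class CSharpOptions
{
    public string Namespace { get; set; } = "Generated";
    public int Indent { get; set; } = 4;
}
```

Because the set of valid names comes from the options type, the CLI can list the known options in its error message, which is the benefit the other two approaches lack.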

All of these options would require some degree of a custom CLI parser, or at least a CLI parser framework which can parse additional options and give them back in a meaningful way.