
vezel-dev / celerity

An expressive programming language for writing concurrent and maintainable software.

Home Page: https://docs.vezel.dev/celerity

License: BSD Zero Clause License

Languages: C# 98.56%, Smalltalk 0.07%, C 0.10%, Shell 0.04%, TypeScript 1.23%
Topics: celerity, compiler, csharp, dotnet, interpreter, jit, language, gc, runtime

celerity's People

Contributors: alexrp, dependabot[bot]

Forkers: wilsonk

celerity's Issues

Implement a basic set of code quality lints

(The warning from #31 should be turned into a lint.)

What's missing:

  • A pass that warns on undocumented public declarations.
  • A pass that warns on unused items (private declarations, parameters, bindings).
    • This will require that we track references on Symbols (should be trivial to do).
  • A pass that warns on obviously dead code.
  • A pass that warns on tests that lack assert statements.

Depends on:

Block expression parsing has a few bugs

  • If attributes are parsed but there are no skipped tokens, we simply forget to do anything with the attributes.
  • We should require at least one statement in a block, per the grammar.

Reorient the syntax layer around text spans and defer source location resolution

This will be an essential step on the way to supporting incremental parsing and code refactoring in the future. Most text/syntax APIs will operate in terms of text spans, which will only be resolved to source locations (path, line, character) when needed - e.g. when printing diagnostics.

The text/syntax APIs will change to something like the following.

Vezel.Celerity.Text

 public readonly struct SourceLocation :
     IEquatable<SourceLocation>, IEqualityOperators<SourceLocation, SourceLocation, bool>
 {
+    public SourceTextSpan Span { get; }
 }
 public abstract class SourceText
 {
+    public SourceTextLineList Lines { get; }
-    public IEnumerable<SourceTextLine> EnumerateLines();
+    public override string ToString();
+    public string ToString(SourceTextSpan span);
 }
+// Internally caches line position information after it has been computed.
+public sealed class SourceTextLineList : IReadOnlyList<SourceTextLine>
+{
+    public SourceText Text { get; }
+    public int Count { get; }
+    public SourceTextLine this[int index] { get; }
+    public IEnumerator<SourceTextLine> GetEnumerator();
+}
 public readonly struct SourceTextLine :
     IEquatable<SourceTextLine>, IEqualityOperators<SourceTextLine, SourceTextLine, bool>
 {
-    public SourceLocation Location { get; }
-    public string Text { get; }
+    public SourceText Text { get; }
+    public SourceTextSpan Span { get; }
+    public int Line { get; }
+    // Calls Text.ToString(Span).
+    public override string ToString();
 }
+public readonly struct SourceTextSpan :
+    IEquatable<SourceTextSpan>, IEqualityOperators<SourceTextSpan, SourceTextSpan, bool>
+{
+    // Mostly the same stuff as on System.Range.
+}

Vezel.Celerity.Syntax

 public sealed class SyntaxAnalysis
 {
+    public SourceText Text { get; }
 }

Vezel.Celerity.Syntax.Tree

 public abstract class SyntaxItem
 {
+    // Internally stored as the parent on the root node.
+    public SyntaxAnalysis Analysis { get; }
+    public abstract SourceTextSpan Span { get; }
+    public abstract SourceTextSpan FullSpan { get; }
+    // Resolves path, line, and character location for Span by querying Analysis.Text.
+    public SourceLocation GetLocation();
 }
 public sealed class SyntaxTrivia : SyntaxItem
 {
-    public SourceLocation Location { get; }
+    // Computed from an internal position + Text.Length. Span and FullSpan are equivalent on trivia.
+    public override SourceTextSpan Span { get; }
+    public override SourceTextSpan FullSpan { get; }
 }
 public sealed class SyntaxToken : SyntaxItem
 {
-    public SourceLocation Location { get; }
+    // Computed from an internal position + Text.Length.
+    public override SourceTextSpan Span { get; }
+    // Computed from Span, and SyntaxTrivia.Span from LeadingTrivia/TrailingTrivia.
+    public override SourceTextSpan FullSpan { get; }
 }
 public abstract class SyntaxNode : SyntaxItem
 {
+    // Computed from SyntaxToken.Span from any descendant tokens.
+    public override SourceTextSpan Span { get; }
+    // Computed from SyntaxToken.FullSpan from any descendant tokens.
+    public override SourceTextSpan FullSpan { get; }
 }

Language idea: Consider removing the mandatory `mod` directive

Is there actually a good reason to have this in the language? Module lookup is probably not going to need it at all. Right now, its only function is to provide a place to attach certain well-known attributes. Perhaps we could find a different way to do that.

Basic LSP implementation

The syntax/semantic analysis APIs should now be sufficient for a basic LSP implementation. We don't need to be super ambitious here - just performing semantic highlighting would be a great first step.

Depends on:

Overhaul lint API to be less constraining

Not all lint passes will fit into the current model. Lints should just be passed a SemanticTree rather than being called on different node kinds.

This means that we will have to fundamentally rethink how we suppress lint diagnostics, since we currently update the lint configuration as we descend into the tree.
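
A rough sketch of the direction (LintPass, LintContext, and Run are assumed names here, not the actual API):

```csharp
// Hypothetical shape for the less constraining lint API; all names are assumptions.
public abstract class LintPass
{
    // A pass receives the entire SemanticTree and walks it however it likes, instead of
    // being invoked by the framework for specific node kinds.
    public abstract void Run(SemanticTree tree, LintContext context);
}
```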

Support diagnostic IDs/names

All diagnostics issued by the compiler need to have a well-known ID (e.g. W0001, E0001, etc). Diagnostics from third-party analyses should use names (e.g. undocumented-declaration).

Implement parsing of union types and adjust syntax

This is currently completely missing. Same for the AST representation.

```ebnf
type ::= primary-type ('or' primary-type)*

return-type ::= none-type |
                type

primary-type ::= any-type |
                 literal-type |
                 boolean-type |
                 integer-type |
                 real-type |
                 atom-type |
                 string-type |
                 reference-type |
                 handle-type |
                 module-type |
                 record-type |
                 error-type |
                 tuple-type |
                 array-type |
                 set-type |
                 map-type |
                 function-type |
                 agent-type |
                 nominal-type
```

Implement semantic analysis of well-known attributes

  • @deprecated "reason"
    • Must have a reason string.
    • Allowed on modules, constant declarations, and function declarations.
  • @doc "text", @doc false
    • Must have a documentation string, or false to explicitly mark as undocumented.
    • Allowed on modules, constant declarations, and function declarations.
      • Allowed on private declarations, but has no effect.
  • @flaky "reason"
    • Must have a reason string.
    • Allowed on test declarations.
    • Indicates that a test might fail and should not be counted as a failure if it does.
  • @ignore "reason"
    • Must have a reason string.
    • Allowed on test declarations.
    • Indicates that a test should not be run.
  • @lint "name:severity"
    • Must have a string literal containing the lint name and severity.
      • Severity is one of: none, warning, error
    • Allowed anywhere.

Consider creating a larger set of standard diagnostic codes for missing tokens

// TODO: Create more specific diagnostics for certain kinds of missing tokens.
public static DiagnosticCode ExpectedToken { get; } = CreateCode();
public static DiagnosticCode MissingDeclaration { get; } = CreateCode();
public static DiagnosticCode MissingStatement { get; } = CreateCode();
public static DiagnosticCode MissingType { get; } = CreateCode();
public static DiagnosticCode MissingExpression { get; } = CreateCode();
public static DiagnosticCode MissingBinding { get; } = CreateCode();
public static DiagnosticCode MissingPattern { get; } = CreateCode();

For comparison, Roslyn has a vast sea of error codes of this form, starting with CS1001. I am not actually sure whether doing this adds meaningful value for end users, though. How many users are realistically going to look up the error code for a missing semicolon, equals sign, or the like? Presumably, the error message itself saying exactly which token is missing should be enough?

Need to think on this more.

Language idea: Friend modules

A module A can declare that module B is a friend, which allows B to access private members of A. The keyword is already reserved.

Something like this:

a.cel:

mod {
    friend B;

    fn foo() {
        42;
    }
}

b.cel:

mod {
    fn bar() {
        A.foo(); // OK; no panic.
    }
}

For this to work, the semantics of a field expression (. operator) would be changed to pass along the accessing module when looking up the member. The runtime would then check if the resolved module declares the accessing module as a friend.

This sounds inefficient, but I think object shapes and basic block versioning based on types would allow us to fully specialize most such cases. This feature can only realistically be prototyped and considered once we have a runtime capable of such optimizations.
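
A minimal sketch of the runtime-side check; every name here is an assumption rather than the actual Celerity runtime API:

```csharp
// Hypothetical member lookup for field expressions; ModuleValue, CelerityValue, and
// PanicException are assumed names.
public static CelerityValue AccessField(ModuleValue target, string name, ModuleValue accessor)
{
    // Private members are visible to the declaring module itself and to any module
    // that the declaring module lists as a friend.
    if (!target.IsPublicMember(name) && target != accessor && !target.Friends.Contains(accessor))
        throw new PanicException($"'{name}' is not accessible from the accessing module.");

    return target.GetMember(name);
}
```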

This is tentatively approved for 2.0, pending prototyping.

Language idea: `try`/`catch` expressions

Right now, if you need to repeat a bunch of catch arms and/or error handling logic for a set of different calls, there isn't really a great solution. A try expression (keyword reserved) would address this:

let result = try {
    foo()?;
    bar()?;
    baz()?;
} catch {
    err AError { ... } -> ...,
    err BError { ... } -> ...,
    err CError { ... } -> ...,
};

The idea is fairly straightforward: Any error raised within a try block, whether from a ? call or a raise expression, will transfer control to the catch block, instead of propagating the error up the call stack.

Notably, to keep the complexity and performance of the feature under control, there is no unwinding. An error raised within a try block must be handled by the corresponding catch block, or a panic occurs. If the catch block wants to propagate the error, it must explicitly raise it again. try blocks can still be nested, but the control transfer between them is explicit, and the runtime never needs to search for handlers when an error is raised.

Approved for 1.0.

Implement semantic analysis

Depends on:

Semantic analysis means things like use resolution, variable name binding, lambda captures, local mutability checks, loop break/next binding, etc...

Type analysis is out of scope here.

Initial interpreter and standard library essentials

In the interest of being able to run Celerity code ASAP and getting a suite of behavior tests done, the initial interpreter implementation will lean heavily on the .NET runtime for garbage collection and data structures (BigInteger, List<T>, Dictionary<TKey, TValue>, HashSet<T>, etc.). Eventually, these components will be swapped out for native ones shared between the interpreter and the JIT compiler.

Some essentials of the standard library will need to be implemented - mostly just stuff for manipulating the various data types of the language and interacting with agents.

Partially depends on:

Optimize syntax tree traversal methods

// TODO: Optimize some of these (e.g. avoid descending into trivia and tokens when possible).
public IEnumerable<SyntaxNode> DescendantNodes()
{
    return Descendants().OfType<SyntaxNode>();
}

public IEnumerable<SyntaxToken> DescendantTokens()
{
    return Descendants().OfType<SyntaxToken>();
}

public IEnumerable<SyntaxTrivia> DescendantTrivia()
{
    return Descendants().OfType<SyntaxTrivia>();
}
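
One possible shape for that optimization, assuming a hypothetical ChildNodes() helper that yields only the node children:

```csharp
// Sketch: only nodes can contain other nodes, so there is no need to descend into
// tokens and trivia and filter them out afterwards with OfType<SyntaxNode>().
public IEnumerable<SyntaxNode> DescendantNodes()
{
    foreach (var child in ChildNodes())
    {
        yield return child;

        foreach (var descendant in child.DescendantNodes())
            yield return descendant;
    }
}
```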

Language idea: Function contracts

It would be interesting to see if there's something we can do in this space. We have the assert statement currently, but I think there's room for a more principled feature here.

The feature would need to support preconditions and postconditions (with access to the return value). Preconditions would run before any code in the function, while postconditions would need to run after any defer and use statements in the function. Postconditions would only run when an error is not raised from the function.

Compile-time contract checking would be out of scope initially, but could always be done on a best-effort basis later down the line.

I have no idea what the syntax would look like yet.

Language idea: Generators (`yield fn`, `yield ret`, `yield break`)

The keyword is already reserved.

Something like:

yield fn range(x, y) {
    if y <= x {
        yield break;
    };
    let mut i = x;
    while i < y {
        yield ret i;
        i = i + 1;
    };
}
  • yield fns may use yield ret and yield break; normal fns may not.
  • yield fns must have at least one yield ret or yield break expression.
  • yield fns may not use raise expressions, normal ret expressions, or error-propagating calls.
  • yield fns do not have an implicit return value like normal fns.
  • yield fn is mutually exclusive with ext fn and err fn.
  • yield fn lambdas are supported.

The transformation into a state machine will happen when the module is loaded by the runtime. If a yield fn passes semantic analysis, it must be transformable.
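
For intuition only, the range generator above corresponds roughly to the following state machine, written here in C# rather than Celerity; the real transformation would operate on the runtime's IR, and none of these names are real:

```csharp
// Illustrative only: a hand-written state machine equivalent of the 'range' yield fn.
public sealed class RangeGenerator
{
    private readonly long _x;
    private readonly long _y;
    private long _i;
    private int _state;

    public long Current { get; private set; }

    public RangeGenerator(long x, long y) => (_x, _y) = (x, y);

    public bool MoveNext()
    {
        switch (_state)
        {
            case 0:
                if (_y <= _x)
                    goto default; // yield break

                _i = _x;
                _state = 1;
                goto case 1;
            case 1:
                if (_i >= _y)
                    goto default;

                Current = _i; // yield ret i
                _i = _i + 1;

                return true;
            default:
                _state = -1;

                return false;
        }
    }
}
```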

This is tentatively approved for 2.0.

Language idea: `rec with`/`err` ... `with` expressions

Fairly straightforward feature:

let r1 = rec {
    x = 1,
    y = 2,
};
let r2 = rec with r1 {
    x = 4,
    mut y = nil,
    z = 3,
};
assert r2 == rec {
    x = 4,
    y = nil,
    z = 3,
};
r2.y = 5; // OK; no panic.

(Variation on syntax suggested by @Roukanken42.)

A with expression (keyword reserved) basically just clones a record or error value and adds/replaces the specified fields.

Approved for 1.0.

Language idea: Allow `mut` on parameters

fn foo(mut bar) {
    bar = 42;
    bar;
}
assert foo("hi") == 42;

Basically, this just means using a pattern binding in the parameter grammar rules instead of a bespoke binding rule.

Language idea: `const` expressions

A const expression would be to a const declaration what an fn (lambda) expression is to an fn declaration. Basically, it is just an anonymous constant. It has all the same semantics that a regular constant does, but is anonymous and embedded directly in code.

Design and implement the shared linear IRs (HIR, MIR, LIR)

There will be 3 IRs in the runtime core. They will all be shared between the interpreter and the JIT compiler. Initially, we will implement HIR and MIR only, with the interpreter prototype (#58) consuming MIR. Later, as we reduce dependence on .NET types, we will implement and consume LIR in the interpreter. Finally, the JIT compiler will be implemented, which will transform LIR to AIR (#81), and then compile AIR to machine code.

The runtime will be based on lazy basic block versioning. This is important to keep in mind in order to understand the IR design and behaviors described below.

HIR

High-Level IR (HIR) is the first intermediate representation. It mainly focuses on linearizing the code, turning it into SSA form, and desugaring some high-level language concepts. HIR is constructed from the semantic tree upfront when a module is loaded, and never changes after that.

HIR features basic blocks (with parameters), upvalues, constants, and operations as building blocks. Operation value operands can be upvalues, basic block parameters, constants, and (non-void) operations; there are no explicit variables or temporaries. Code is in SSA form, with basic block parameters serving as Φ nodes. There is no propagation of explicit or inferred type information at this stage, but all type tests are made explicit.

Lowering to HIR gets rid of some high-level language concepts like pattern matching, agent send/receive syntax, defer statements, for expressions, try expressions, etc.

The HIR data structures will be very minimalistic and will not be amenable to analysis. For example, there will be no use/definition chains. HIR is only really meant to be walked during lowering to MIR. In other words, HIR serves as a template for specialization in MIR.
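
Purely as a mental model (none of these types exist yet; every name is an assumption), the HIR building blocks might be shaped roughly like this:

```csharp
// Hypothetical HIR shapes; the actual data structures are not designed yet.
public sealed class HirBlock
{
    // Block parameters act as the phi nodes of the SSA form.
    public List<HirValue> Parameters { get; } = new();

    public List<HirOperation> Operations { get; } = new();
}

public abstract class HirValue
{
    // A value is an upvalue, a block parameter, a constant, or a non-void operation.
}

public sealed class HirOperation : HirValue
{
    public required string Kind { get; init; }

    public List<HirValue> Operands { get; } = new();
}
```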

MIR

Mid-Level IR (MIR) is where type specialization and most optimizations happen. It is similar to HIR in the building blocks it has, but unlike HIR, everything now carries type information. Types are gathered from the running program through basic block versioning, entry point versioning, value shapes, etc.

Lowering from HIR to MIR happens on demand as the program executes code, and is done with basic block granularity. Due to type specialization, there can be many different versions of MIR code for a given HIR (extended) basic block. Lowering proceeds until a type test is encountered that cannot be resolved with the available type information, or until the end of the function is reached.

MIR will maintain use/definition chains and various other data structures that simplify transformation of the code. This will facilitate a classic set of optimizations (#61) that can be performed now that type information is available.

LIR

Low-Level IR (LIR) decomposes managed values into their constituent raw value and shape words. LIR mostly has the same building blocks as HIR and MIR, but at this stage, the only types that exist are 64-bit integers, 64-bit floats, and untyped pointers. All high-level operations will have been decomposed to primitive CPU-like operations. LIR is essentially a simple register transfer language.

LIR allows certain optimizations that would be harder to express at the MIR level. For example, in a series of small integer operations, it's obvious that copying the shape word for every intermediate operation is unnecessary. Yet, because MIR only operates on managed values, this notion cannot be expressed. At the LIR level, it is trivial to detect and remove such copies.

Note that, while LIR is very close to the machine, it is not architecture-specific.

Expand test suite

  • Create a harness for testing the command line driver.
  • Create a test for setting lint severity with @lint attributes.
  • Create more tests for the undocumented-public-declaration pass.

Language idea: More advanced string literals

We need to come up with a design for more advanced string literals. In particular, string literals that can span multiple lines are frequently useful.

String interpolation is out of scope for this.

Consider expanding `SyntaxItemList<T>` and `SeparatedSyntaxItemList<TElement, TSeparator>` API surface

We can at least expose Span and FullSpan properties, as well as ToString() and ToFullString() methods.

This raises the question, though: Should we also expose GetText() and GetFullText()? That would require a Parent property. But then that might signal that the list is a node in the tree, which is not actually the case. We'd then also have to consider tree traversal methods, at which point these lists start to look an awful lot like SyntaxItems in their own right...

Need to think on this one.

Switch to CommandLineParser and Cathode

We will eventually want the standard library's console API to be oriented around a terminal. The Spectre.Console API sits at too high a level for this to be practical. Further, using Spectre.Console.Cli locks us into using Spectre.Console.

We should switch to CommandLineParser as it allows us to supply a TextWriter, effectively decoupling it from any particular console API. Then, we can use Cathode for all the low-level console interaction.

Avoid keeping the `SourceText` instance around in `SyntaxTree`

public sealed class SyntaxAnalysis
{
    // TODO: We should eventually get rid of this. When we need the source text, we can reconstruct it from the tree.
    public SourceText Text { get; }
}

This is neat because, for the happy case, we don't need to access the SourceText at all. Only when there are diagnostics do we need to access line information from the SourceText, and it's reasonable to just reconstruct it for those cases.

Even then, we still have to keep the SourceText around during parsing in order to construct locations for diagnostics. To remedy that, we should probably also consider changing SourceDiagnostic to not carry a SourceLocation, but rather a SourceTextSpan and a SyntaxItem reference. A SourceDiagnostic.GetLocation() method could then be exposed which would resolve the SourceLocation.
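
Something along these lines, mirroring the declaration-only style of the API sketches above (the exact shape is undecided):

```csharp
// Sketch of the proposed change; names and shape are assumptions.
public sealed class SourceDiagnostic
{
    public SyntaxItem Item { get; }

    public SourceTextSpan Span { get; }

    // Resolves the location through the item's analysis and its source text, so the
    // happy path never has to touch line information.
    public SourceLocation GetLocation();
}
```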

Cache repeated string instances in the lexer

When lexing a typical source file, there will be a lot of repeated strings - identifiers, literals, white space, and so on. We can't intern these, but it would make good sense to cache token text up to a certain length and return the same string instance instead of building it up repeatedly.

To implement this, instead of building up the token string in a StringBuilder, we would keep track of where the token starts and ends. When creating the token, if the length is below our caching threshold, we first look it up in the token cache. For larger tokens, we shouldn't bother as the lookup will take too long to be worth it.
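
A minimal sketch of the idea; the names and threshold are assumptions, and a real implementation would use a span-keyed lookup so the cache-hit path does not allocate at all:

```csharp
// Sketch only: cache short token strings so repeated identifiers, literals, and white
// space runs share one string instance.
private const int MaxCachedTokenLength = 32;

private readonly Dictionary<string, string> _tokenCache = new(StringComparer.Ordinal);

private string GetTokenText(ReadOnlySpan<char> source, int start, int end)
{
    var span = source[start..end];

    // Long tokens are unlikely to repeat, and hashing them costs more than it saves.
    if (span.Length > MaxCachedTokenLength)
        return new string(span);

    var text = new string(span);

    if (_tokenCache.TryGetValue(text, out var cached))
        return cached;

    _tokenCache.Add(text, text);

    return text;
}
```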

Implement lexing and parsing

Nothing particularly difficult here. Just (very) boring implementation work.

Notably, though, we should try to do something to prevent stack overflows in the parser. Recursiont looks interesting here.

Consider merging language analysis assemblies into a single assembly

Splitting the language analysis layers into 5 separate assemblies might have been a bit overkill. Consider a new Vezel.Celerity.Analysis project with the following namespaces consolidated:

  • Vezel.Celerity.Quality
  • Vezel.Celerity.Quality.Passes
  • Vezel.Celerity.Semantics
  • Vezel.Celerity.Semantics.Binding
  • Vezel.Celerity.Semantics.Tree
  • Vezel.Celerity.Syntax
  • Vezel.Celerity.Syntax.Tree
  • Vezel.Celerity.Text
  • Vezel.Celerity.Typing

Support a `celerity.json` project configuration file

Something like:

{
    "name": "my-app", // Unique project identifier.
    "path": "src", // Optional path containing the project's own source files. Defaults to src.
    "kind": "executable", // Optional project kind (executable, library). Defaults to executable.
    "license": "0BSD", // Optional SPDX license expression.
    "version": "1.0.0", // Optional Semantic Versioning 2.0.0 version. Defaults to 0.0.0.

    // List of module search paths. The runtime will match the module path against the
    // prefixes listed here and then look up the remainder of the module path in the
    // specified directory.
    //
    // So e.g. LibA::Foo would find LibA here and then locate dep/lib-a/src/foo.cel,
    // whereas Company::LibB::Bar::Baz would locate dep/lib-b/src/bar/baz.cel.
    "paths": {
        "MyApp": "src", // Only necessary if the app itself uses e.g. MyApp::Main.
        "LibA": "dep/lib-a/src",
        "Company::LibB": "dep/lib-b/src",
    },

    // Overrides default lint severities.
    "lints": {
        "unused-local-symbol": null, // Don't run this pass at all.
        "test-without-assert": "none", // Hide diagnostics from this pass.
        "unreachable-code": "error" // Promote diagnostics from this pass to errors.
    }
}

The tooling APIs will pick this up and use it appropriately for the various celerity CLI commands.

Note that nothing about this file will flow transitively; we're intentionally keeping things super simple. A top-level executable project will have to declare module search paths for all dependencies, direct or transitive, that it needs. Also, the file is completely optional.
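
As a rough illustration of the prefix matching described in the comments above (the method name and the exact lower-casing rule are assumptions):

```csharp
// Hypothetical resolution of a module path like "Company::LibB::Bar::Baz" against the
// "paths" table; returns null if no prefix matches.
public static string? ResolveModuleFile(IReadOnlyDictionary<string, string> paths, string module)
{
    foreach (var (prefix, directory) in paths)
    {
        if (!module.StartsWith(prefix + "::", StringComparison.Ordinal))
            continue;

        // Map the remaining components to a relative file path, e.g. "Bar::Baz" -> "bar/baz.cel".
        var components = module[(prefix.Length + 2)..].Split("::");

        return Path.Combine(directory, Path.Combine(components).ToLowerInvariant() + ".cel");
    }

    return null;
}
```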

Language idea: Macros and AST quotation

The macro, quote, and unquote keywords are reserved for future exploration in this space.

Some obvious issues to tackle here:

  • How will macros work with a module system that loads modules lazily?
  • Should macros be able to generate declarations? Types? Statements?
    • I lean towards only allowing expression macros.
  • Do we want to formally specify the AST produced by quote expressions?
  • Should macros be hygienic? If so, how much?
  • Do we give up on type analysis when encountering macros?

Implement better error recovery for parsing of separated syntax lists

private (ImmutableArray<T>.Builder Elements, ImmutableArray<SyntaxToken>.Builder Separators) ParseSeparatedList<T>(
    Func<LanguageParser, T> parser,
    SyntaxTokenKind separator,
    SyntaxTokenKind closer,
    bool allowEmpty,
    bool allowTrailing)
    where T : SyntaxNode
{
    // TODO: The way we parse a parameter list (and other similar syntax nodes) causes the parser to misinterpret
    // the entire function body for some invalid inputs. We need to do better here.
    var result = SeparatedBuilder<T>();
    var (elems, seps) = result;

    bool NextIsRelevant()
    {
        return Peek1() is { IsEndOfInput: false } next && next.Kind != closer;
    }

    if (!allowTrailing)
    {
        if (allowEmpty && !NextIsRelevant())
            return result;

        elems.Add(parser(this));

        while (Optional(separator) is { } sep)
        {
            seps.Add(sep);
            elems.Add(parser(this));
        }

        return result;
    }

    if (!allowEmpty)
    {
        elems.Add(parser(this));

        if (Optional(separator) is not { } sep)
            return result;

        seps.Add(sep);
    }

    while (NextIsRelevant())
    {
        elems.Add(parser(this));

        if (Optional(separator) is not { } sep2)
            break;

        seps.Add(sep2);
    }

    return result;
}

Provide a way to specify existing bindings when analyzing an interactive document

Also, we need to process let statements in a similar fashion to what we do in block expressions:

public override InteractiveDocumentSemantics VisitInteractiveDocument(InteractiveDocumentSyntax node)
{
    var subs = ConvertList(node.Submissions, static (@this, sub) => @this.VisitSubmission(sub));

    return new(node, subs);
}

public override BlockExpressionSemantics VisitBlockExpression(BlockExpressionSyntax node)
{
    using var ctx = PushScope<BlockScope>();

    var stmts = Builder<StatementSemantics>(node.Statements.Count);

    // Let statements are somewhat special in that they introduce a 'horizontal' scope in the tree; that is,
    // bindings in a let statement become available to siblings to the right of the let statement.
    var lets = new List<ScopeContext<Scope>>();
    var defers = ctx.Scope.DeferStatements;

    foreach (var stmt in node.Statements)
    {
        if (stmt is LetStatementSyntax)
            lets.Add(PushScope<Scope>());

        var sema = VisitStatement(stmt);

        if (sema is DeferStatementSemantics defer)
            defers.Add(defer);

        stmts.Add(sema);
    }

    for (var i = lets.Count - 1; i >= 0; i--)
        lets[i].Dispose();

    defers.Reverse();

    return new(node, List(node.Statements, stmts), defers.DrainToImmutable());
}
