joelburget / lvca

language verification, construction, and analysis

Home Page: https://lvca.dev

License: MIT License

HTML 0.13% OCaml 63.60% JavaScript 34.11% Makefile 0.03% CSS 2.07% Python 0.07%
language programming-language parsing pretty-printing language-workbench denotational-semantics typechecking

lvca's Introduction

Language Verification, Construction, and Analysis


Introduction

LVCA is a toolkit for building programming languages. There are two main activities in LVCA -- (1) defining the abstract syntax for a language and (2) mapping between languages.

Abstract syntax

We can, for example, define the abstract syntax for the lambda calculus.

term := Lam(term. term) | App(term; term)

This definition introduces a new sort, term. It says that a term is either a Lam or an App.

A Lam (lambda abstraction) binds a variable of sort term within an expression of sort term. For example, the classic identity function (\x -> x) looks like Lam(x. x). An App (function application) holds two subterms of sort term. We can apply the identity function to itself: App(Lam(x. x); Lam(x. x)).

Aside: If you're familiar with a language with algebraic datatypes (like Haskell, OCaml, Rust, etc), then this ought to look familiar. We've just defined a sums-of-products-style datatype. We can work with these as you'd work with algebraic datatypes: by constructing them and pattern-matching against them. However, sort declarations generalize algebraic datatypes because they have a notion of binding structure.
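To make the analogy concrete, here is a rough OCaml sketch of the lambda-calculus sort above, using a plain named-variable encoding. This is only an illustration; it is not the representation LVCA actually generates (which also tracks binding structure and provenance).

(* A toy OCaml analogue of the `term` sort above. *)
type term =
  | Var of string          (* variables are explicit in this encoding *)
  | Lam of string * term   (* Lam binds a name in its body *)
  | App of term * term

(* The identity function, Lam(x. x): *)
let identity = Lam ("x", Var "x")

(* The identity function applied to itself, App(Lam(x. x); Lam(x. x)): *)
let self_application = App (identity, identity)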

Let's try a different example.

string : *
primitive : *
list : * -> *

term :=
  | Operator(string; list term)
  | Primitive(primitive)

This definition says that a term is either an Operator (which holds both a string and a list of terms) or a Primitive (which holds a single primitive).

We've also declared three external sorts: string, primitive, and list. These are sorts that are assumed to exist but will not be defined in our language.

Note that in each case we've just defined the abstract syntax of the language (not the concrete syntax). We can also define the concrete syntax via a parser and pretty-printer, but for now, we'll work with just the abstract syntax.

With a language definition like either of the above, LVCA can provide some nice tools:

  • We can view the binding structure of a term.
  • We can write a query for a given pattern over a codebase. For example, we could search for all lambda abstractions with the pattern Lam(_), or for all identity functions with Lam(x. x). Important note: this pattern matches up to renaming, so it also matches Lam(y. y) or the same term with any other variable name (see the sketch after this list).
  • Similarly, we can even rewrite parts of our codebase.
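To make the query idea concrete, here is a hypothetical sketch over the toy OCaml term type from the aside above. It is not LVCA's actual query API, just an illustration of what the two example queries compute.

(* Collect every subterm matching the pattern Lam(_): *)
let rec find_lambdas (t : term) : term list =
  match t with
  | Var _ -> []
  | Lam (_, body) as lam -> lam :: find_lambdas body
  | App (f, arg) -> find_lambdas f @ find_lambdas arg

(* Identity functions, i.e. the pattern Lam(x. x): a lambda whose body is
   exactly its own bound variable, whatever that variable is called. *)
let is_identity = function
  | Lam (x, Var y) -> String.equal x y
  | _ -> false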

Mapping between languages

Once we've defined syntax, the real fun is mapping between languages. For example, say we have a language which combines the lambda calculus with real-valued expressions.

real : *

term :=
  | Lam(term. term)
  | App(term; term)
  | Real_expr(real)

Now we can define a mapping to reals:

\(term: term) -> match term with {
  | Lam(_) -> term
  | App(f; arg) -> match reduce f with {
    | Lam(x. body) -> body[f := reduce arg]
    | f' -> {App(f'; {reduce arg})}
  }
  | Real_expr(expr) -> expr
}

This is a function of type term -> real, meaning it interprets terms as reals. This function defines the semantics of terms by translation to another language.

Now, if we can evaluate real expressions (and we can evaluate the translation from term to real), then we can evaluate terms.

One final thing we might want to do is lift reals back to term:

\(real: real) -> Real_expr(real)

Now, since we have a term -> real and a real -> term, we can compose them (with the real evaluator real -> real in the middle) to get a term evaluator of type term -> term.
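Schematically, this is ordinary function composition. Here is a generic OCaml sketch, where the three arguments stand for the hypothetical mappings above (the names are illustrative, not LVCA functions):

(* Given a denotation into reals, a real evaluator, and a lift back into
   terms, build a term evaluator by composing the three. *)
let compose_evaluator ~to_real ~eval_real ~of_real term =
  of_real (eval_real (to_real term))

For example, compose_evaluator ~to_real ~eval_real ~of_real some_term evaluates a term by round-tripping through reals.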

About the name

  1. LVCA is an acronym for Language Verification, Construction, and Analysis

  2. In biology, LUCA stands for Last Universal Common Ancestor -- the most recent common ancestor of all life on earth. LVCA occupies a somewhat analogous position (maybe that's a stretch) as it can be used to implement any programming language.

  3. I pronounce it "Luca".

Subpackages

LVCA is composed of several subpackages. Topologically sorted by dependencies:

  • util: A few utilities used in the rest of the packages. Mostly extensions to Jane Street's base.
  • provenance: Types to represent provenance, i.e. where a term came from.
  • parsing: Extensions to Angstrom, used in the rest of the packages for parsing.
  • syntax: The most important package -- contains representations for the core LVCA data types.
  • core: Definition of a "core" language.
  • bidirectional: Experimental library for defining bidirectional typechecking schemes.
  • constructive-real: Constructive / computable real numbers.
  • crowbar: Defines a binary for property checking.
  • syntax_quoter: Utilities used by both ppx_lvca and ppx_lvca_core.
  • ppx_lvca: An OCaml ppx for easily defining languages and terms.
  • ppx_lvca_core: An OCaml ppx for easily defining core language terms.
  • languages: Example languages built with LVCA.
  • pages: Web pages (many available at lvca.dev).

Build

LVCA is written in OCaml and built with dune. So first you need opam and dune installed, then:

make install-deps
dune build

The make commands are available only from the project root. dune build can be run from the project root or any of the subpackages.

pages-specific:

To produce JS files small enough to put online, run in release mode. Optionally, also compress with terser:

dune build --profile=release
terser -c toplevel,sequences=false,drop_console=true --mangle -- _build/default/pages/0x-huttons-razor/main.bc.js > out.js

Test

From the top level or any subpackage (syntax, core, etc):

dune runtest

From the top level:

make lint


lvca's Issues

Generated files should be reproducible

Files generated by both Menhir and ocamllex have embedded paths specific to the computer they were generated on. For example:

# 1 "/Users/joel/code/lvca-bucklescript/src/Term_Lexer.mll"

This is annoying because a bunch of lines change when files are regenerated on my laptop vs my desktop, and it's obviously more of a problem with outside contributors. The best fix would be to not check in generated files at all, but that might make building LVCA slightly more painful.

Core language

I think it's important to have a core language that serves as a good default target.

My initial attempt looks like this:

type core =
  (* first four constructors correspond to regular term constructors *)
  | Operator of string * core_scope list
  | Var of string
  | Sequence of core list
  | Primitive of primitive
  (* plus, core-specific ctors *)
  | Lambda of sort list * core_scope
  | CoreApp of core * core list
  | Case of core * core_scope list
  (** A metavariable refers to a term captured directly from the left-hand-side
  *)
  | Metavar of string
  (** Meaning is very similar to a metavar in that it refers to a capture from
      the left-hand-side. However, a meaning variable is interpreted. *)
  | Meaning of string

This issue is here to braindump some thoughts and references.

The first question to answer is: what exactly is this core language for? It's for unambiguously defining the semantics of a language (via translation to core). It's nice if we can do other things like step it with a debugger, but that's secondary.

Two important concerns, fairly unique to this project, are inclusion of terms from other languages and computational primitives.

By "terms from other languages" I mean that denotational semantics (in general / in LVCA) is about translating from language A to B. When using core, this is specialized to a translation from A to Core(B), where Core(B) is core terms with terms from B embedded. As an example, a case expression in Core(bool):

case(
  true();
  [ true(). false(), false(). true() ]
)

Some of the syntax is up for debate, but the point is that this is the equivalent of (OCaml) match true with true -> false | false -> true, except that booleans are not built into core at all; they come from the language embedded in core.

The other concern I mentioned above is computational primitives, by which I mean primitives that are expected to actually do something. For example, you might have a primitive #not, in which case you could write something like the above example as #not(true()). Here #not is not built into the specification of core; it's provided by the runtime environment. (I'm using a hash to denote primitives, but it's just a convention I think is nice.)

With primitives we're now dealing with "core plus": core extended with a set of primitives. So the example #not(true()) is a term in Core(bool()){#not}. The syntax is completely undecided, but the idea is that this term can be evaluated in any environment that provides the #not primitive. I think this is really cool. You could easily find the set of primitives your language relies on. It would even be possible to do a translation to a different set of primitives, e.g. Core(bool()){#not} -> Core(bool()){#nand}.
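One way to picture "core plus" in OCaml: evaluation takes an environment mapping primitive names to host functions, and a term is evaluable exactly when every primitive it mentions is bound. All the names below are illustrative assumptions, not LVCA's core implementation.

(* A toy value type and an environment supplying primitives such as #not. *)
type value = Bool of bool

type prim_env = (string * (value list -> value)) list

let bool_env : prim_env =
  [ ( "#not"
    , (function
        | [ Bool b ] -> Bool (not b)
        | _ -> failwith "#not: expected a single boolean") )
  ]

(* Evaluating a primitive application like #not(true()) only requires that the
   environment bind "#not"; swapping in a different environment (say, one built
   around #nand) changes the available primitive set without touching core. *)
let eval_prim (env : prim_env) name args =
  match List.assoc_opt name env with
  | Some f -> f args
  | None -> failwith ("unbound primitive: " ^ name)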

Rename modules according to Jane Street style guide?

Identifiers (constructors, variables, structures, type names, …) are written using underscores to separate words, not in CamelCase. So write num_apples not numApples and Foo_bar, not FooBar

E.g. AbstractSyntax -> Abstract_syntax.

Undecided because the OCaml standard library uses CamelCase.

Evaluate switching from Jison

Problems:

  • Not a fan of its lexical format
  • Conversion to / from Jison format is really gross
  • Not sure how to handle conflicts or parse errors

Parsing `int32`s

The primitive parser doesn't currently parse int32s. We should either introduce a syntax for them (?) or document this.

Revamped lexing

I believe it's useful to generalize lexing slightly, from a stateless producer of tokens that each implicitly contain a string. I'll start with a couple of examples for context on why.

Example 1: React

Here's an example I took from the React homepage:

class TodoList extends React.Component {
  render() {
    return (
      <ul>
        {this.props.items.map(item => (
          <li key={item.id}>{item.text}</li>
        ))}
      </ul>
    );
  }
}

The example shows JavaScript (class TodoList ...) which contains HTML(-ish) (<ul>...</ul>), which contains JS (this.props.items...), which contains HTML (<li>...</li>), which contains JS (item.id, item.text).

JavaScript embeds HTML via tags (<x>...</x>) and HTML embeds JavaScript via braces ({...}).

Example 2: Oil / OSH

Andy Chu, working on Oil Shell, reached similar conclusions:

What do the characters :- mean in this code?

$ echo "foo:-bar  ${foo:-bar}  $(( foo > 1 ? 5:- 5 ))"
foo:-bar  bar  -5

Three different things, depending on the context:

  1. Literal characters to be printed to stdout.
  2. The "if empty or unset" operator within ${}.
  3. The : in the C-style ternary operator, then the unary minus operator for negation.

Andy explains this in more detail on his blog, but the crux is this:

OSH uses a simple lexing technique to recognize the shell's many sublanguages in a single pass. I now call it modal lexing.

This is how we address the language composition problem.

Andy uses slightly different terminology, but I think he arrives at essentially the same conclusion as I do. I also really like his post When Are Lexer Modes Useful?.

Example 3: LVCA sublanguages

Here's an example denotational semantics rule: [[ x + y ]] = nat-case(x; y; x'. succ([[ x' + y' ]])). This rule is translating addition to a natural number recursor. On the right-hand side, [[...]] signals a transition from the outer (target) language to the inner (source).

Something similar comes up when dealing with typing rules.

ctx >> x => nat, y => nat
-------------------------
   ctx >> x + y => nat

This example is similar in that we're mixing tokens from different languages -- >>, =>, ,, and ------------------------- are tokens from the typechecking metalanguage, while x, y, and + are tokens from the object language. I want to have a principled story for how this all works.

Proposal

Laurence Tratt points out that composition of grammars is hard:

For those using old parsing algorithms such as LR (and LL etc.), there is a more fundamental problem. If one takes two LR-compatible grammars and combines them, the resulting grammar is not guaranteed to be LR-compatible (i.e. an LR parser may not be able to parse using it). Therefore such algorithms are of little use for grammar composition.

He points to Earley as a partial solution to the problem. However, for the purposes I have in mind, it seems a lower-tech solution ought to work. There's usually an obvious sentinel denoting language embedding, something like {...} or "..." or [[...]]. We just need the outer language's lexer to lex it as one big chunk, then hand it to the inner language's lexer to lex the inside. For example:

/\w[\w0-9'_-]+/ -> ID
"(" -> LPAREN
")" -> RPAREN
";" -> SEMI
"." -> DOT
/\[\[(.*)\]\]/ -> MEANING($1)

Here our example from earlier, nat-case(x; y; x'. succ([[ x' + y' ]])), will produce a sequence of tokens

ID("nat-case")
LPAREN
ID("x")
SEMI
ID("y")
SEMI
ID("x'")
DOT
ID("succ")
LPAREN
MEANING("x' + y'")
RPAREN
RPAREN

Where the meaning token is further lexed by the inner language lexer, which will produce tokens

ID("x'")
ADD
ID("y'")

This means we never have to confront the general problem of composing parsers, just the much easier problem of composing lexers (which is not a technical problem at all, more a question of how to tie everything together). The only drawback is that the transition between the outer and inner language must be well enough defined for a lexer to recognize it. In any case, that seems like a practice language designers should strive for anyway.
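Here is a minimal OCaml sketch of the two-pass idea, with made-up token types. It is illustrative only, not LVCA's lexer.

type outer_token =
  | Id of string
  | Lparen
  | Rparen
  | Semi
  | Dot
  | Meaning of string  (* the raw text captured between [[ and ]] *)

type inner_token =
  | Inner_id of string
  | Add

(* The outer lexer captures [[ ... ]] as a single MEANING chunk; a second
   pass hands each chunk's contents to the inner language's lexer. *)
let relex_meanings (lex_inner : string -> inner_token list) (outer : outer_token list) =
  List.map
    (fun tok ->
      match tok with
      | Meaning src -> `Inner (lex_inner src)
      | tok -> `Outer tok)
    outer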

Check_term demo language stack overflow

The error occurs only in the browser, and the message is completely useless:

[Screenshot: the stack overflow error as shown in the browser]

It's possible to get locations by deleting the source map. The error goes away when you comment out the | lambda(value. term) line.

Allow specifying whether a language sort should generate variable definitions or not.

This applies only to the lvca.abstract_syntax_module ppx. There, term := Operator(list term) generates:

type 'info term =
  | Operator of 'info * 'info List.t 
  | Term_var of 'info * string 

Do we want the Term_var constructor? Probably, usually. If the language ever binds a term, then yes:

term := Operator(list term)
scope := Scope(term. term)

But not if there's no scope. This can easily be determined statically when all definitions live in the same language, but not if term is imported as an external into another language. However, I guess at that point it would be represented as Nominal (currently).

On the `Language_object` definition

Classical parts of a language:

  • abstract syntax (not all types will have this, those that do... add Representable type?)
  • concrete syntax ☑
  • statics -- add checking?
  • dynamics -- 🅇 this is its own separate thing

Pretty-printing

The three classic pretty-printer papers:

There's an adaptation of Wadler's algorithm to OCaml -- Strictly Pretty -- Christian Lindig, 2000.

Wadler's algorithm is presented as a set of combinators

(<>) :: Doc -> Doc -> Doc
nil :: Doc
text :: String -> Doc
line :: Doc
nest :: Int -> Doc -> Doc
group :: Doc -> Doc
(<|>) :: Doc -> Doc -> Doc
flatten :: Doc -> Doc

However, we want to specify layout as part of the concrete syntax declaration, probably with boxes. E.g.:

com :=
  | "skip" { skip() }
  | [<hv 1,3,0> [<h 1> name ":="] iexp] { assign($1; $2) }
  | [<hov 1, 0, 0>
      [<h 1> "if" bexp]
      [<h 1> "then" com]
      [<h 1> "else" com]
    ] { if($2; $4; $6) }
  ...

This example is taken with only minor modifications from Syn: a single language for specifying abstract syntax trees, lexical analysis, parsing and pretty-printing -- Richard J Boulton, 1996.

This is also quite similar to OCaml's Format module (they even have almost exactly the same types of boxes).
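For comparison, here is a tiny example of Format's boxes and break hints. This is plain OCaml stdlib usage, not something LVCA generates.

(* Pretty-print an if-expression inside an "hov" box with break hints. *)
let pp_if ppf (cond, then_, else_) =
  Format.fprintf ppf "@[<hov 2>if %s@ then %s@ else %s@]" cond then_ else_

let () = Format.printf "%a@." pp_if ("b", "c1", "c2")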

The Syn declaration seems a little heavyweight. OCaml also has break hints. This seems like roughly the right direction -- I'm still weighing the tradeoffs.

Other:

Allow empty sorts?

Haskell supports empty data declarations with EmptyDataDecls, and Agda and Coq allow the same. I'm strongly leaning towards allowing this. AFAIK this basically just entails updating the term parser and core parser to allow empty declarations and empty pattern matches.
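For what it's worth, OCaml (which LVCA is implemented in) has an analogue too: empty variant types (since 4.07) and refutation cases for the empty match. A small sketch:

(* An empty variant type: no constructors at all. *)
type empty = |

(* Matching on it needs no real cases; the refutation case `.` asks the
   compiler to check that the branch is unreachable. *)
let absurd (x : empty) = match x with _ -> .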

Generated signatures are not easy to work with

  1. The worst part is that they require Wrapper. We should be able to ignore this.
  2. val language : Lvca_syntax.Abstract_syntax.t is always required.
  3. There's this awkward wrapping with the inner Ty module:
module Type : sig
  include
    [%lvca.abstract_syntax_module_sig
  {|
    sort : *

    ty := Sort(sort) | Arrow(ty; ty)
    |}
  , { sort = "Sort_model.Sort" }]

  include Nominal.Convertible.Extended_s with type t := Ty.t
end

ld: warning: directory not found for option '-L/opt/local/lib'

โฏ dune runtest -w
    ocamlopt del/.lvca_del.inline-tests/inline_test_runner_lvca_del.exe
ld: warning: directory not found for option '-L/opt/local/lib'
    ocamlopt bidirectional/.lvca_bidirectional.inline-tests/inline_test_runner_lvca_bidirectional.exe
ld: warning: directory not found for option '-L/opt/local/lib'
    ocamlopt languages/.lvca_languages.inline-tests/inline_test_runner_lvca_languages.exe
ld: warning: directory not found for option '-L/opt/local/lib'
    ocamlopt syntax/.lvca_syntax.inline-tests/inline_test_runner_lvca_syntax.exe
ld: warning: directory not found for option '-L/opt/local/lib'
    ocamlopt ppx_lvca/.ppx_lvca.inline-tests/inline_test_runner_ppx_lvca.exe
ld: warning: directory not found for option '-L/opt/local/lib'
    ocamlopt ppx_lvca_del/.ppx_lvca_del.inline-tests/inline_test_runner_ppx_lvca_del.exe
ld: warning: directory not found for option '-L/opt/local/lib'
    ocamlopt syntax_quoter/.lvca_syntax_quoter.inline-tests/inline_test_runner_lvca_syntax_quoter.exe
ld: warning: directory not found for option '-L/opt/local/lib'

Kind annotations

I'm working on adding kind checking. One thing I don't think I'll get to right now, but would be nice, is optional kind annotations for sorts in abstract syntax declarations. I'd like to enable them both at the top level and when declaring an argument to a newly-defined sort. Example:

primitive : *
list : * -> *

foo (a : *) := foo(list a)
bar b := bar(b primitive) // note that annotations are still optional, `b : * -> *` is inferred.
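As a point of reference, the kinds involved here are tiny. A toy OCaml representation (purely illustrative, not the checker's actual types) might be:

(* `*` is Star; `* -> *` is Arrow (Star, Star). *)
type kind =
  | Star
  | Arrow of kind * kind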

Parsing comments / attachment

Currently, most places will parse a single trailing comment (search for option' comment). But this will fail in the following case:

term()  // comment 1
// comment 2

This currently fails to parse, which is not good.

Tools like Ocamldoc have rules for attaching comments to terms and will only attach one comment.

We should certainly succeed in parsing the above example. I'm not sure yet how we should handle it.

  1. comment 1 is attached to the term, comment 2 is not
  2. multiple comments can be attached to terms
  3. normalized to "comment 1\n comment 2"
  4. some larger rethinking of how comment attachment works
