harc / ohm

A library and language for building parsers, interpreters, compilers, etc.

License: MIT License

ohm's Introduction

Ohm

Ohm is a parsing toolkit consisting of a library and a domain-specific language. You can use it to parse custom file formats or quickly build parsers, interpreters, and compilers for programming languages.

The Ohm language is based on parsing expression grammars (PEGs), which are a formal way of describing syntax, similar to regular expressions and context-free grammars. The Ohm library provides a JavaScript interface for creating parsers, interpreters, and more from the grammars you write.

Full support for left-recursive rules means that you can define left-associative operators in a natural way.
Object-oriented grammar extension makes it easy to extend an existing language with new syntax.
Modular semantic actions. Unlike many similar tools, Ohm completely separates grammars from semantic actions. This separation improves modularity and extensibility, and makes both grammars and semantic actions easier to read and understand.
Online editor and visualizer. The Ohm Editor provides instant feedback and an interactive visualization that makes the entire execution of the parser visible and tangible. It'll make you feel like you have superpowers. 💪

Some awesome things people have built using Ohm:

Seymour, a live programming environment for the classroom.
Shadama, a particle simulation language designed for high-school science.
turtle.audio, an audio environment where simple text commands generate lines that can play music.
A browser-based tool that turns written Konnakkol (a South Indian vocal percussion art) into audio.
Wildcard, a browser extension that empowers anyone to modify websites to meet their own specific needs, uses Ohm for its spreadsheet formulas.

Getting Started

The easiest way to get started with Ohm is to use the interactive editor. Alternatively, you can play with one of the examples on JSFiddle.

Resources

Tutorial: Ohm: Parsing Made Easy
The math example is extensively commented and is a good way to dive deeper.
Examples
Documentation
For community support and discussion, join us on Discord, GitHub Discussions, or the ohm-discuss mailing list.
For updates, follow @_ohmjs on Twitter.

Installation

On a web page

To use Ohm in the browser, just add a single <script> tag to your page:

<!-- Development version of Ohm from unpkg.com -->
<script src="https://unpkg.com/ohm-js@17/dist/ohm.js"></script>

<!-- Minified version, for faster page loads -->
<script src="https://unpkg.com/ohm-js@17/dist/ohm.min.js"></script>

This creates a global variable named ohm.

Node.js

First, install the ohm-js package with your package manager:

npm: npm install ohm-js
Yarn: yarn add ohm-js
pnpm: pnpm add ohm-js

Then, you can use require to use Ohm in a script:

const ohm = require('ohm-js');

Ohm can also be imported as an ES module:

import * as ohm from 'ohm-js';

Deno

To use Ohm from Deno:

import * as ohm from 'https://unpkg.com/ohm-js@17';

Basics

Defining Grammars

To use Ohm, you need a grammar that is written in the Ohm language. The grammar provides a formal definition of the language or data format that you want to parse. There are a few different ways you can define an Ohm grammar:

The simplest option is to define the grammar directly in a JavaScript string and instantiate it using ohm.grammar(). In most cases, you should use a template literal with String.raw:
```
const myGrammar = ohm.grammar(String.raw`
  MyGrammar {
    greeting = "Hello" | "Hola"
  }
`);
```

In Node.js, you can define the grammar in a separate file, and read the file's contents and instantiate it using ohm.grammar(contents):

In myGrammar.ohm:

  MyGrammar {
    greeting = "Hello" | "Hola"
  }

In JavaScript:

const fs = require('fs');
const ohm = require('ohm-js');
const contents = fs.readFileSync('myGrammar.ohm', 'utf-8');
const myGrammar = ohm.grammar(contents);

For more information, see Instantiating Grammars in the API reference.

Using Grammars

Once you've instantiated a grammar object, use the grammar's match() method to recognize input:

const userInput = 'Hello';
const m = myGrammar.match(userInput);
if (m.succeeded()) {
  console.log('Greetings, human.');
} else {
  console.log("That's not a greeting!");
}

The result is a MatchResult object. You can use the succeeded() and failed() methods to see whether the input was recognized or not.

For more information, see the main documentation.

Debugging

Ohm has two tools to help you debug grammars: a text trace, and a graphical visualizer.

You can try the visualizer online.

To see the text trace for a grammar g, just use the g.trace() method instead of g.match. It takes the same arguments, but instead of returning a MatchResult object, it returns a Trace object — calling its toString method returns a string describing all of the decisions the parser made when trying to match the input. For example, here is the result of g.trace('ab').toString() for the grammar G { start = letter+ }:

ab         ✓ start ⇒  "ab"
ab           ✓ letter+ ⇒  "ab"
ab             ✓ letter ⇒  "a"
ab                 ✓ lower ⇒  "a"
ab                   ✓ Unicode [Ll] character ⇒  "a"
b              ✓ letter ⇒  "b"
b                  ✓ lower ⇒  "b"
b                    ✓ Unicode [Ll] character ⇒  "b"
               ✗ letter
                   ✗ lower
                     ✗ Unicode [Ll] character
                   ✗ upper
                     ✗ Unicode [Lu] character
                   ✗ unicodeLtmo
                     ✗ Unicode [Ltmo] character
           ✓ end ⇒  ""

Publishing Grammars

If you've written an Ohm grammar that you'd like to share with others, see our suggestions for publishing grammars.

Contributing to Ohm

Interested in contributing to Ohm? Please read CONTRIBUTING.md and the Ohm Contributor Guide.

ohm's Issues

Visualizer: Parse Tree Visualization does not (always) resolve arguments

For a grammar like:
G {
  Start = WithArg<"123">
  WithArg<Arg> = Arg ForwardArg<Arg>
  ForwardArg<Arg> = "=" Arg
}
the visualizer shows the following result when matching 123=123:

The WithArg bar shows the supplied argument whereas the ForwardArg does not.

In a larger context, the argument sometimes is resolved and sometimes shows up as $0, etc.:

"basic parsing example" from readme is busted

produces console error:

Uncaught Error: Line 8, col 26:
  7 |   col = colChar*
> 8 |   colChar = ~(eol | ",") _
                               ^
  9 |   eol = "\r"? "\n"
Rule _ is not declared in grammar CSV

Proposal: Add a built-in `asArray` operation for ListOf nodes.

The built-in ListOf rule is helpful, but when you write a semantics for your grammar, you almost always end up writing some boilerplate actions like this:

NonemptyListOf(first, _, rest) {
  return [first.myAttr].concat(rest.myAttr);
},
EmptyListOf() {
  return [];
}

It would be nice if we had a built-in operation/attribute that would help people avoid this boilerplate. Possible names: asArray, getElements. It would just return an array of wrappers, and then you could call map to call whatever operation you want on them. E.g.: params.asArray().map((n) => n.myAttr).

null vs. undefined keyword question

null: Matches a null value (or the equivalent in the host language).

What about undefined? In JavaScript, at least, undefined exists. Would the null keyword match both null and undefined, or how is undefined handled?

CLI tool available?

I want to be able to run files and stdin through particular grammars and semantic operations that will generate files or print results to stdout. I've started building my own tooling for doing this with my own work, but wanted to check to see if something already existed and I didn't find anything in a cursory search through the repositories posted here.

Example:

cat model | run_operation RM_PGSQL toPostgreSQL > schema.ddl

Where RM_PGSQL refers to a grammar and toPostgreSQL refers to a semantic action (or derivative).

Does something like this exist, or is a common solution planned?

My apologies if I have missed something.

Parameterizing semantic operations

I'll start with an example grammar:

DateGrammar {
  Date = "now"   -- now
       | "later" -- later
}

And a semantic operation:

semantics.addOperation('extractDate', {
  'Date_now': function() {
    let now = new Date()
    return now
  },

  'Date_later': function() {
    let now = new Date()
    let later = new Date(now.getTime() + 3 * 3600 * 1000) // three hours later (assumed intent)
    return later
  }
})

Now, when writing tests for this semantic operation, I'd like to set a fixed date instead of relying on the current time. I realize this probably isn't possible short of wrapping everything in another scope or (gulp) using dynamic scope.

So I imagined something like this:

What if semantic operations could be parameterized?

let match = grammar.match('now')
let now = new Date(1440000000000)
let date = semantics(match).extractDate(now)
assert(+date === +now)

And the parameters could be accessed like this?

semantics.addOperation('extractDate', {
  'Date_now': function() {
    let now = this.parameters[0]
    return now
  },

  'Date_later': function() {
    let now = this.parameters[0]
    let later = new Date(now.getTime() + 3 * 3600 * 1000) // three hours later (assumed intent)
    return later
  }
})

This could be accessed via this.parameters, or this.context, or this.contextParameters, and would be the same value in all nodes.
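For comparison, the "wrap everything in another scope" workaround mentioned above can be sketched in plain JavaScript. This is only an illustration: makeExtractDateActions and getNow are hypothetical names, the three-hour offset is an assumed reading of `3 * 3600`, and the actions are called directly here rather than through an Ohm semantics.

```javascript
// Hypothetical helper (not part of Ohm): builds the action dictionary
// around an injected clock so that tests can pin the current time.
function makeExtractDateActions(getNow) {
  const THREE_HOURS_MS = 3 * 3600 * 1000; // assumed meaning of "later"
  return {
    Date_now() {
      return getNow();
    },
    Date_later() {
      // Date arithmetic goes through getTime(); adding a number to a
      // Date object directly would produce a string.
      return new Date(getNow().getTime() + THREE_HOURS_MS);
    },
  };
}

// In a test, the clock is fixed:
const fixedNow = new Date(1440000000000);
const actions = makeExtractDateActions(() => fixedNow);
const laterMs = actions.Date_later().getTime() - fixedNow.getTime();
console.log(laterMs); // 10800000
```

In real use, the returned dictionary would presumably be passed to semantics.addOperation('extractDate', ...), with `() => new Date()` as the clock outside of tests.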

built in rule recommendation: any

_: Matches a single item from the input stream. For a string, it will match any one character.

I would recommend an alias for this: any (or only using any). In such a token heavy, semantically rich language if you can take advantage of short keywords it seems like a good thing.

Visualizer error: Cannot read property 'length' of undefined

For the grammar

Arithmetic {
  Exp     = Exp "+" number      -- plus
          | number
  number  = digit+ ("." digit+)?
}

and input

2+2

the visualizer script produces an error, as shown in the following screenshot

The cause of the problem seems to be that sometimes the value of traceNode becomes a plain Object and not a Trace object. I logged the value of traceNode like so:

  console.log(traceNode); // Added this line  
  var text = (traceNode.displayString.length > 20 && traceNode.displayString.indexOf(' ') !== -1) ?
      (traceNode.displayString.slice(0, 20) + '\u2026') : traceNode.displayString;
  var label = wrapper.appendChild(createElement('.label', text));
  label.setAttribute('title', traceNode.displayString);

and after processing the input 1+1 I saw

Somehow trace nodes are not trace nodes....

Visualizer: Handle error due to missing start rule

When writing a grammar from scratch, you will almost always end up with an uncaught exception for "Missing start rule argument -- the grammar has no default start rule." We should catch this and ignore it.

MatchResult.getDiscardedSpaces() needs documentation

It should be added to the API doc.

"Ambiguous" left- and right- recursive rules should be an error

(See #55 for the backstory.)

Right now it's possible for Ohm programmers to write "ambiguous" left- and right-recursive rules in a grammar, e.g.,

AddExp
  = AddExp "+" AddExp  -- plus
  | AddExp "-" AddExp  -- minus
  | MulExp

This should be a compile-time error.

Why? Because the associativity of the + and - operators (see above) is not obvious, and it easily could have been. For instance, if you want them to associate to the left, then a much better way to write the above rule is:

AddExp
  = AddExp "+" MulExp  -- plus
  | AddExp "-" MulExp  -- minus
  | MulExp

And now the associativity is obvious. (Similarly, if right-associativity is desired, then AddExp should be rewritten to be right-recursive.)
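A rough sketch of what such a check could look like, assuming a hypothetical representation in which each alternative of a rule is an array of symbol names. It only detects direct recursion and is not based on Ohm's internal data structures.

```javascript
// Hypothetical representation: each alternative is an array of symbol names.
// Flags alternatives that are directly both left- and right-recursive.
function findAmbiguousAlternatives(ruleName, alternatives) {
  return alternatives.filter(
    (symbols) =>
      symbols.length > 1 &&
      symbols[0] === ruleName &&
      symbols[symbols.length - 1] === ruleName
  );
}

const ambiguous = findAmbiguousAlternatives('AddExp', [
  ['AddExp', '"+"', 'AddExp'],
  ['AddExp', '"-"', 'AddExp'],
  ['MulExp'],
]);
const leftRecursive = findAmbiguousAlternatives('AddExp', [
  ['AddExp', '"+"', 'MulExp'],
  ['AddExp', '"-"', 'MulExp'],
  ['MulExp'],
]);
console.log(ambiguous.length, leftRecursive.length); // 2 0
```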

No release since November

Digging a little into Ohm's commit history, I found this commit. So I tried to use a leading pipe in my rules but Ohm is not accepting it. I thought I was doing something wrong. But after some time I found out that the version I'm using is 0.9.0 which was released last November and doesn't contain the "leading pipe" commit.

There has been a lot of activity since then, but there are no releases. Would it make sense to make a minor release every few changes/commits?

Add Note to readme describing supported host languages.

Currently the readme implies other host languages are possible, but does not seem to mention which actually exist.

Improve error messages resulting from bad "operation prototype strings", etc.

Right now if you specify an invalid operation prototype string when you add an operation to a semantics, e.g.,

MyGrammar.semantics().addOperation('myOperation(x, y', {...})

you'll get a generic Ohm error message like this:

> 1 | myOperation(x, y
                      ^
Expected ")"

which is not very nice.

A couple of things that would make it better:

(1) Give a little context before showing the snippet of the input and the failures / expectations
(2) Don't include a line number

E.g.,

Invalid operation prototype string:

  myOperation(x, y
                  ^
Expected ")"

(Thing (2) suggests that we should have a better API for getting messages out of a MatchFailure. At the moment you can either get a one-size-fits-all message via the message property, or build your own using the getRightmostFailures() method. It would be nice to add a getMessage method that lets you specify some options, e.g., mf.getMessage({includeLineNumbers: false, includeFluffyFailures: true, ...}).)

Showing input errors in visualizer

In addition to showing grammar errors, I thought it would be interesting to show the failures in the input field.
You can hover over the expected input errors, and it should highlight the related part of the grammar:

Each > character represents a layer in the failure stack.
Related branch : https://github.com/djdeath/ohm/tree/input-errors

Should unknown escape characters be an error?

In JavaScript strings, any character can come after a backslash -- the backslash is ignored if it's not one of the actual escape codes (\n, \r, etc.) So "\W" === "W". Right now, Ohm behaves the same way.

Other languages are different. I quickly tested a few on repl.it:

"\W" → "W": JavaScript, Ruby, Lua, Ohm
"\W" → "\W": Python
Error: JSON, Java, Go
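The JavaScript and JSON rows of that comparison can be checked directly in Node.js. The snippet below is a small illustration only; it does not involve Ohm itself.

```javascript
// JavaScript drops the backslash of an unrecognized escape.
const jsResult = "\W";
console.log(jsResult === "W"); // true

// JSON rejects the same escape. String.raw keeps the backslash, so
// JSON.parse receives the four characters: quote, backslash, W, quote.
let jsonError = null;
try {
  JSON.parse(String.raw`"\W"`);
} catch (e) {
  jsonError = e;
}
console.log(jsonError instanceof SyntaxError); // true
```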

Should we keep it the way it is in Ohm? To me, one of the other options seems to make more sense.

Visualizer: Don't do block highlighting if \n isn't consumed

In the visualizer, when highlighting the text in the top left pane, we always attempt to expand the selection into a block highlight. So if the last non-whitespace character on the line is consumed by a rule, the highlight will go to the end of the line, which implies that the newline was also consumed.

We should only do a block highlight if the rule interval begins at 0 and ends after a newline character.

Crash when default start rule has parameters

> ohm.grammar('G { Start<x> = x }').match('x')
TypeError: Cannot read property 'getExprType' of undefined
    at Apply.pexprs.Apply._calculateExprType (/Users/dubroy/dev/cdg/ohm/src/pexprs-getExprType.js:112:28)
    at Apply.pexprs.PExpr.getExprType (/Users/dubroy/dev/cdg/ohm/src/pexprs-getExprType.js:33:19)
    at Apply.PExpr.newInputStreamFor (/Users/dubroy/dev/cdg/ohm/src/pexprs.js:39:23)
    at Object.Grammar._match (/Users/dubroy/dev/cdg/ohm/src/Grammar.js:102:38)
    at Object.Grammar.match (/Users/dubroy/dev/cdg/ohm/src/Grammar.js:93:22)
    at repl:1:35
    at REPLServer.defaultEval (repl.js:252:27)
    at bound (domain.js:287:14)
    at REPLServer.runBound [as eval] (domain.js:300:12)
    at REPLServer.<anonymous> (repl.js:417:12)

Proposal: Eliminate special syntax for Str

The syntax for a Str pexpr, which enters "string matching mode" for the next value in the current input stream, is currently a bit awkward:

G {
  start = [``ident'']
}

Since it's relatively uncommon to need this, and it's hard to guess what it means when you see it, what about using a parameterized rule for this instead? E.g.:

G {
  start = [stringMatching<ident>]
}

Rule descriptions in override and extend

With the new rule description syntax, would it be possible to also add rule descriptions to overriding and extending rules?

Something along these lines:

  Rule
    = ident Formals? ruleDescr? "="  Alt  -- define
    | ident Formals?  ruleDescr? ":=" Alt  -- override
    | ident Formals?  ruleDescr? "+=" Alt  -- extend

Add a static check that disallows nullable expressions to be used as the operands of Kleene-+s and Kleene-*s

... because this can lead to infinite loops while matching inputs. Right now we avoid the infinite loops by detecting this condition dynamically and throwing an InfiniteLoop error, but it would be better and less confusing to catch this problem at compile-time.

(See #13 for a related discussion.)
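As a rough illustration of the static check, here is a minimal sketch over a hypothetical mini-AST. The node shapes and function names are made up for this example and are not Ohm's pexpr classes. Rule applications are left out; handling them would likely require a fixed-point computation over the grammar.

```javascript
// Hypothetical mini-AST for parsing expressions (illustrative only).
const term = (str) => ({type: 'term', str});
const seq = (...factors) => ({type: 'seq', factors});
const alt = (...terms) => ({type: 'alt', terms});
const star = (expr) => ({type: 'star', expr});
const plus = (expr) => ({type: 'plus', expr});
const opt = (expr) => ({type: 'opt', expr});
const not = (expr) => ({type: 'not', expr});

// An expression is nullable if it can succeed without consuming input.
function isNullable(e) {
  switch (e.type) {
    case 'term': return e.str === '';
    case 'seq': return e.factors.every(isNullable);
    case 'alt': return e.terms.some(isNullable);
    case 'star':
    case 'opt':
    case 'not': return true;
    case 'plus': return isNullable(e.expr);
    default: throw new Error(`unknown node type: ${e.type}`);
  }
}

// Collects every repetition (* or +) whose operand is nullable.
function findNullableRepetitions(e, found = []) {
  if ((e.type === 'star' || e.type === 'plus') && isNullable(e.expr)) {
    found.push(e);
  }
  const children = e.factors || e.terms || (e.expr ? [e.expr] : []);
  children.forEach((child) => findNullableRepetitions(child, found));
  return found;
}

const ok = star(seq(term('a'), opt(term('b'))));       // ("a" "b"?)*
const bad = star(seq(opt(term('a')), not(term('b')))); // ("a"? ~"b")*
console.log(findNullableRepetitions(ok).length);  // 0
console.log(findNullableRepetitions(bad).length); // 1
```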

Overriding spaces_ shouldn't be possible

Right now it's possible to override spaces_, even though there's a comment in the code saying that it shouldn't be possible to override it.

See http://jsfiddle.net/pdubroy/seuwpew3/.

Requiring ohm-js in package.json installs a pre-commit hook in my project

I have a devDependencies line for ohm-js pointing to this github repo from my package's package.json. When I run npm install . for my package, it retrieves and installs ohm-js as expected - but then the ohm-js postinstall script runs, calling in turn dev-setup.sh, which uses git-rev-parse to find the .git directory and installs a pre-commit hook therein. Unfortunately, this leads to a pre-commit hook being added to the .git directory of my package, not ohm-js! Subsequent commits of my package then fail because npm run prepublish is being called.

postinstall script fails on windows

On Windows, simply running:

npm install ohm-js

fails with this error:

> [email protected] postinstall e:\code\test\node_modules\ohm-js
> bin/dev-setup.sh; touch .install-timestamp

'bin' is not recognized as an internal or external command,
operable program or batch file.

I believe the postinstall script in package.json isn't valid. If I clone it locally and do npm link ohm-js instead, I can then change the path to be bin\\dev-setup.sh and then it works. I think it's trying to use cmd on windows instead of bash to run the script commands.

Proposal: Generic list/sequence interface

One of the nice things about the way we handle _iter nodes is that they do map for free. That is, iterNode.interpret() produces an Array which is the result of invoking the interpret operation on each of the children of the _iter node. With arrays, you'd have to write anArray.map(function(x) { return x.interpret(); }).

What if we take this idea, and make it more generally available in Ohm as a "List" or "Sequence" interface?

In #62, I proposed adding a built-in asArray operation for ListOf nodes. It would return an Array of wrappers -- which is very similar to what _iter nodes do. Instead of asArray, maybe we should implement asSequence, so that you would have the same convenience of mapping that you get with _iter nodes.

Other places we could use this: the children property of nodes, and as the argument to _default and _nonterminal actions.

Support parameterized rules as arguments

From @mroeder:

G {
  Start = call<double, "x">
  call<rule, param> = rule<param>
  double<x> = x x
}

throws an Error: FIXME: should catch this earlier at Apply.pexprs.Apply.introduceParams (./src/pexprs-introduceParams.js:63:13)

In principle, I don't see any reason why we can't support this. One thing that would need fixing is pexprs.Apply.prototype.introduceParams (where the error is thrown). When we go to install the call rule into the grammar, we replace the Apply node in the body (the application of rule) with a Param node, and in doing so, discard the arguments.

One thing I'm not sure about is how we'd want to check that the rule application has the correct number of arguments. Some options:

Do it dynamically, à la Python
Do it at grammar creation time, through type inference.
Put type annotations on the parameters, e.g. call<rule/1, param> = rule<param>.

Visualizer does not display characters that match "any"

Infinite loop

Since the nullable checks have been introduced I started to see more infinite loops in Ohm (completely freezing the visualizer for example).

Here is my example with which I'm trying to parse an empty input (i.e. '') :

Empty {
  Grammar
    = srcElems

  srcElems
    = (stmt ~_)* stmt?

  stmt
    = sc

  spacesNoNl
    = (~"\n" space)*

  sc
    = spacesNoNl newLine -- newLine
    | spacesNoNl ";"     -- semicolon
  newLine
    = "\n" -- lineEnd
    | &"}" -- blockEnd
    | ~_   -- streamEnd

}

Any idea?

Overloading parameterized rules?

Would there be any objection to allowing parameterized rules to be 'overloaded' in Ohm?

I am talking about multiple rules with the same base name, but a differing number of parameters. I'm not sure if there is any 'type' associated with a parameter that could be further disambiguated on, but I imagine that would also be useful, if possible.

Example:

RM {
    Contained<element> = Contained<"{", element, "}">
    Contained<open, element, close> = open element close
}

Show action call stack when a semantic action is missing

When there is a missing semantic action, we get an error message like: Error: Missing semantic action for name in ast operation. But sometimes the real error is that an operation/attribute was invoked on a node that it shouldn't have been. To make it easier to debug these kinds of errors, we should somehow indicate where the operation/attribute was called from.

Undeclared rule is not flagged when passed as a rule parameter

With the following grammar:

G {
    start = ListOf<asdlfk, ",">
}

I expect to get an ohm.error.UndeclaredRule exception. Instead, I get the following exception:

Uncaught TypeError: Cannot read property 'description' of undefined
pexprs.Apply._eval @ pexprs-eval.js:360
pexprs.PExpr.eval @ pexprs-eval.js:50
pexprs.Param._eval @ pexprs-eval.js:132
pexprs.PExpr.eval @ pexprs-eval.js:50

Not operator syntax recommendation

Negative Lookahead
~ expr

It would be a lot more intuitive, in my opinion at least, to use ! instead of ~. It seems like most modern languages these days use ! to mean "not". Just a recommendation.

Idea: Syntax for specifying binding rules

It might be nice to have a syntax for specifying lexical binding rules for languages. Currently, the best way to do name resolution is inside an attribute/operation, but this makes it more difficult to reuse the results. A declarative syntax could be useful for (say) an IDE that wants to support renaming variables, without having to rely on language- and implementation-specific attributes/operations.

One idea I had is a syntax to embed the binding rules in the grammar itself: https://gist.github.com/pdubroy/984ad6e3ecf181b1c15c#file-gistfile1-txt. However, this would complicate the grammars somewhat, and would likely only be able to handle simple cases.

Another option would be some sort of supplemental syntax -- maybe separate rules in the grammar that define how names are resolved. E.g.:

ML {
  Expr = ...
  LetExpr = ...

  @LetExpr = /* binding syntax goes here */
}

Some relevant papers:

Consider removing grammarFromFile

Hey,

the node.js detection for providing grammarFromFile fails in some popular non-node.js environments:

browserify
webpack, if process is defined
react-native packager

I've tried changing the detection by also looking for process.browser (browserify only), or looking for the global window object, but there's almost always an environment where one or all of those are defined, too.

Therefore, I'd consider removing the load-from-file function from the API and providing an example for loading grammars in the documentation instead. Also, with template strings I'm usually embedding my grammars next to the semantic actions or other code.

Should undefined be a keyword?

Currently undefined is a keyword in Ohm. Since it's a JavaScript-specific concept, perhaps it shouldn't be. It's not part of the JSON spec: http://json.org/

API Reference Docs missing arity info on _terminal, _nonterminal, and _iter semantic actions

From my intern Saketh:

"While the document on semantic actions (api-reference.md) describes the existence of the _terminal, _nonterminal and _iter semantic actions, it does not describe the arity requirements of each."

Proposal: Optional naming of terms

In OMeta, you had to name the rules you wanted to work with, as grammar and semantic actions were tightly coupled:

ometa Calc {
  digit    = ^digit:d                 -> d.digitValue(),
  number   = number:n digit:d         -> (n * 10 + d)
           | digit,
  addExpr  = addExpr:x '+' mulExpr:y  -> (x + y)
           | addExpr:x '-' mulExpr:y  -> (x - y)
           | mulExpr,
  mulExpr  = mulExpr:x '*' primExpr:y -> (x * y)
           | mulExpr:x '/' primExpr:y -> (x / y)
           | primExpr,
  primExpr = '(' expr:x ')'           -> x
           | number,
  expr     = addExpr
}

Ohm does not offer any way of doing that but so far I have seen some cases where it would have been helpful:

automatic naming of sub-terms in operations and attributes (especially helpful for an Ohm IDE)
CST-to-AST conversions (would reduce the necessary mapping even further)
generic (semi-)structured editor (powered by an Ohm grammar)

Therefore my proposal is to allow (optional) naming of terms in Ohm grammars:

Calc {
  expr = addExpr
  addExpr = addExpr:x "+" mulExpr:y
          | addExpr:x "-" mulExpr:y
          | mulExpr
  mulExpr = mulExpr:x "*" primExpr:y
          | mulExpr:x "/" primExpr:y
          | primExpr
  primExpr = "(" expr:x ")"
           | number
  number = number:n digit:d
         | digit
}

Note that some optional kind of rule naming is already provided by rule descriptions.

New built-in `asIteration` attribute needs documentation

Dealing with InfiniteLoop exceptions

I've spent a great deal of time figuring out where a grammar enters an infinite loop.
Neither the visualizer nor g.trace(input).toString() really helps you figure out where the problem is, because the current behavior is to throw an exception.

Would it be possible to implement something that would show something like this :

"Your grammar enters an infinite loop here :
Grammar
-> rule1
-> rule2
-> rule3"

where rule3 is the rule that detected the infinite loop.

Eliminate multiple underscore arguments in tests and examples

From the mdn page on strict mode

strict mode requires that function parameter names be unique

The house style so far uses underscores for every argument that is not actually used in a semantic action, but this throws errors when run in strict mode. It might be better style to use function (_1, _2, _3) {} (or some other alternative?) instead.
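The difference is easy to reproduce in Node.js. The sketch below compiles function source text at run time so that the SyntaxError can be caught; `compiles` is a helper written for this illustration.

```javascript
// Compiles a function body from source text so that a SyntaxError,
// if any, is raised at run time where it can be caught.
function compiles(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const sloppyDuplicates = compiles(
  'return function (_, _, x) { return x; };'
);
const strictDuplicates = compiles(
  '"use strict"; return function (_, _, x) { return x; };'
);
const strictNumbered = compiles(
  '"use strict"; return function (_1, _2, x) { return x; };'
);

console.log(sloppyDuplicates); // true
console.log(strictDuplicates); // false
console.log(strictNumbered);   // true
```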

Performance comparison

It would be great to have a performance comparison between Ohm, OMeta/JS, and custom parsers (such as Esprima).

Arity of semantic operations

Looking at the default grammar in the visualizer, on this particular rule :

ArrayLiteral = "[" ValueExpr ("," ValueExpr)* "]"

How many arguments should the associated operation have?
I would expect 4 :

function (open, valueexpr, valueexprs, close) {
}

Ohm complains that my function doesn't have enough arguments :

Error: Found errors in the action dictionary of the 'eval' operation:
- Semantic action 'ArrayLiteral' has the wrong arity: expected 5, got 4
- Semantic action 'AssignmentPattern_rec' has the wrong arity: expected 5, got 4
- Semantic action 'ArrayLiteral' has the wrong arity: expected 5, got 4
- Semantic action 'AssignmentPattern_rec' has the wrong arity: expected 5, got 4

Thanks!

Matching with parameterized start rule does not work

grammar.match(input) can be passed a second, optional rule name to start parsing the input from that rule.
Currently, parameterized rules like Foo<arg> cannot be used as the start rule. This is a particular problem for grammars such as the ES5 grammar, which now uses <guardIn>.

For the ES5 grammar, a call with a reference to a defined rule (instead of a specific value) would be necessary/sufficient:
g.match('1+2', 'Expression<withIn>'); for Expression<guardIn>

Improve error messages for invalid escape sequences

Rather than a generic parse error, we should have a message like "Invalid escape sequence: \p"

Exceptions in actions

When I throw an exception inside a semantic action it turns into a TypeError exception. Is there a supported way to do this?

Handling rules that are both left- and right-recursive

I'm reading the literature on PEGs. This paper talks about a bug in @alexwarth's approach to left recursion which is present in ohm:

This grammar:

MyLang {
  Expr = Expr "-" Num --a
       | Num --b  
  Num = digit+
}

... Correctly parses "1-2-3" to (((1)-2)-3). But when I change the grammar to this:

MyLang {
  Expr = Expr "-" Expr --a
       | Num --b
  Num = digit+
}

The tree is flipped to (1-(2-(3))).

Maybe this isn't an issue in practice? It's hard to say. But I can certainly see it tripping some people up.
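To see why the shape of the tree matters, here is the arithmetic for both groupings of 1-2-3, written in plain JavaScript and independent of Ohm:

```javascript
const operands = [1, 2, 3];

// Left-associative grouping: ((1 - 2) - 3)
const leftAssoc = operands.reduce((acc, n) => acc - n);

// Right-associative grouping: (1 - (2 - 3))
const rightAssoc = operands.reduceRight((acc, n) => n - acc);

console.log(leftAssoc, rightAssoc); // -4 2
```

An evaluator that walks the flipped tree would therefore compute 2 instead of -4.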

Error message method

I'd like to report syntax errors with line and column info. But ohm does not export common.getLineAndColumnMessage. I'd like to write something like the following function. Or even better it should be a method on Interval.

export function syntaxError(interval: ohm.Interval, message: string): void {
    throw new Error(
        ohm.getLineAndColumnMessage(
            interval.inputStream.source,
            interval.startIdx) +
        message);
}
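Until such a helper is exported, something similar can be written by hand. The following is a minimal sketch in plain JavaScript; getLineAndColumn and syntaxErrorMessage are hypothetical names, the message format is only modelled on Ohm's "Line N, col M:" style, and it works on a source string and offset rather than an ohm.Interval.

```javascript
// Hypothetical stand-in for the unexported helper: converts a character
// offset into 1-based line and column numbers.
function getLineAndColumn(source, offset) {
  let line = 1;
  let lineStart = 0;
  for (let i = 0; i < offset && i < source.length; i++) {
    if (source[i] === '\n') {
      line++;
      lineStart = i + 1;
    }
  }
  return {line, column: offset - lineStart + 1};
}

// Builds an error message prefixed with the position of the offset.
function syntaxErrorMessage(source, offset, message) {
  const {line, column} = getLineAndColumn(source, offset);
  return `Line ${line}, col ${column}: ${message}`;
}

const source = 'let x = 1\nlet y = ;\n';
const msg = syntaxErrorMessage(source, source.indexOf(';'), 'expected an expression');
console.log(msg); // Line 2, col 9: expected an expression
```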

Visualizer: Support choice of starting rule

Right now, the test input is parsed with the grammar's default start rule. We should support choosing the start rule (e.g. from a combo box). The placement of the control should be chosen in a way that will still make sense when we support multiple test cases in the UI.

The ecmascript package example can't work

Due to this line:
https://github.com/cdglabs/ohm/blob/master/examples/ecmascript/es5.js#L12

When it's installed as a package, it is no longer a subdirectory of ohm. I think if you just require('ohm-js'), it should resolve up the directory chain.

But given it's a peer dependency, it should probably be changed to allow ohm to be passed in as well, something like:

var ohm = require('ohm-js');
var es = require('ohm-example-ecmascript');
var es5 = es.es5({ ohm: ohm });
...

`instanceof` should work as expected for Node constructors

Currently, the Node constructors just return an instance of Node, rather than an instance of a subclass of Node. This means that e.g. g.constructors.Foo() instanceof g.constructors.Foo returns false. It would be nice if this would return true.

I implemented that in a branch here: https://github.com/cdglabs/ohm/tree/node-constructors. Unfortunately, it seems to cause a pretty significant performance degradation -- 5-7% when compiling ohm.js with the example ES5 grammar. So I'm not going to land this for now, but wanted to create the bug to keep track of the issue.
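The instanceof behavior described above can be reproduced with plain JavaScript stand-ins. Node, FooFactory, and Foo below are illustrative only and are not Ohm's actual classes.

```javascript
// Stand-in base class; not Ohm's actual Node implementation.
class Node {
  constructor(ctorName) {
    this.ctorName = ctorName;
  }
}

// Current behavior: the "constructor" is a factory returning a plain Node.
function FooFactory() {
  return new Node('Foo');
}

// Desired behavior: a real subclass, so instanceof follows the prototype chain.
class Foo extends Node {
  constructor() {
    super('Foo');
  }
}

console.log(FooFactory() instanceof FooFactory); // false
console.log(new Foo() instanceof Foo);           // true
console.log(new Foo() instanceof Node);          // true
```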