kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.

Home Page: https://nearley.js.org

License: MIT License


nearley's Introduction

nearley


nearley is a simple, fast and powerful parsing toolkit. It consists of:

  1. A powerful, modular DSL for describing languages
  2. An efficient, lightweight Earley parser
  3. Loads of tools, editor plug-ins, and other goodies!

nearley is a streaming parser with support for catching errors gracefully and providing all parsings for ambiguous grammars. It is compatible with a variety of lexers (we recommend moo). It comes with tools for creating tests, railroad diagrams and fuzzers from your grammars, and has support for a variety of editors and platforms. It works in both node and the browser.
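
For orientation, a minimal driver looks roughly like this (a sketch; Grammar.fromCompiled is the entry point in current releases, and ./grammar.js stands for whatever nearleyc produced):

const nearley = require("nearley");
const grammar = require("./grammar.js"); // output of nearleyc

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));

// feed() may be called repeatedly with chunks, since the parser streams.
parser.feed("some inp");
parser.feed("ut");

// parser.results holds one entry per valid parse;
// more than one entry means the grammar is ambiguous.
console.log(parser.results);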

Unlike most other parser generators, nearley can handle any grammar you can define in BNF (and more!). In particular, while most existing JS parsers such as PEGjs and Jison choke on certain grammars (e.g. left recursive ones), nearley handles them easily and efficiently by using the Earley parsing algorithm.
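
For instance, a directly left-recursive rule like the following is handled without trouble (a minimal sketch):

# Left recursion: fine for an Earley parser, fatal for most PEG parsers.
sum -> sum "+" number | number
number -> [0-9]:+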

nearley is used by a wide variety of projects, and is an npm staff pick.

Documentation

Please visit our website https://nearley.js.org to get started! You will find a tutorial, detailed reference documents, and links to several real-world examples to get inspired.

Contributing

Please read this document before working on nearley. If you are interested in contributing but unsure where to start, take a look at the issues labeled "up for grabs" on the issue tracker, or message a maintainer (@kach or @tjvr on GitHub).

nearley is MIT licensed.

A big thanks to Nathan Dinsmore for teaching me how to Earley, Aria Stewart for helping structure nearley into a mature module, and Robin Windels for bootstrapping the grammar. Additionally, Jacob Edelman wrote an experimental JavaScript parser with nearley and contributed ideas for EBNF support. Joshua T. Corbin refactored the compiler to be much, much prettier. Bojidar Marinov implemented postprocessors-in-other-languages. Shachar Itzhaky fixed a subtle bug with nullables.

Citing nearley

If you are citing nearley in academic work, please use the following BibTeX entry.

@misc{nearley,
    author = {Kartik Chandra and Tim Radvan},
    title  = {{nearley}: a parsing toolkit for {JavaScript}},
    year   = {2014},
    doi    = {10.5281/zenodo.3897993},
    url    = {https://github.com/kach/nearley}
}

nearley's People

Contributors

airportyh avatar alexandertrefz avatar aliclark avatar alyssarosenzweig avatar aredridel avatar bandaloo avatar bates64 avatar bojidar-bg avatar cameronhunter avatar coolreader18 avatar corwin-of-amber avatar deltaidea avatar hardmath123 avatar heatherleaf avatar henry-alakazhang avatar jakesidsmith avatar jaukia avatar jcorbin avatar kach avatar kanef avatar kasbah avatar neunato avatar robroseknows avatar rwindelz avatar sandiegoscott avatar seiyria avatar simonhildebrandt avatar tjvr avatar vietlq avatar yafahedelman avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

nearley's Issues

Terminal comments

Comments at the end of the file don't work, because the comment rule expects a trailing \n.
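
A possible fix (an untested sketch, not necessarily the eventual patch) is to make the trailing newline optional, so a comment may also end at EOF:

# ":?" makes the newline optional.
slcomment -> "//" [^\n]:* "\n":?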

2 parses when there should be 1

Thanks for your lightning-quick response on my last bug report, Hardmath123. Another suspected bug now:

The grammar below produces 1 parse for $a=1; but 2 parses for $a =1;

Unless I'm overlooking something in my grammar, I can't see why this would be the case.

program -> _ block {% function(d) { return d[1]; } %}
block -> (statement _):* {% function(d) { return ["block", d[0].map(function(s){return s[0];})]; } %}
statement -> expression _ ";" {% id %}
expression -> expression _ ("="|"=="|"!="|">"|"<"|"<="|">=") _ sum {% function(d) { return ["operation", d[0], d[2], d[4]]; } %} | sum {% id %}
sum -> sum ("*"|"/") product {% function(d) { return ["sum",d[0],d[1],d[2]]; } %} | product {% id %}
product -> product ("*"|"/") exp {% function(d) { return ["product",d[0],d[1],d[2]]; } %} | exp {% id %}
exp -> unaryoperation "^" exp {% function(d) { return ["exp",d[0],d[1],d[2]]; } %} | unaryoperation {% id %} # this is right associative!
unaryoperation -> unaryoperation _ ("++"|"--") {% function(d) { return ["unaryoperation",d[0],d[2]]; } %} | mapoperation {% id %}
mapoperation -> mapoperation _ "[" _ expression _ "]" {% function(d) { return ["map",d[0],d[4]]; } %} | element {% id %}
element -> variable {% id %} | number {% id %} | "(" _ expression _ ")" {% function(d) { return d[2]; } %} | "{" _ block "}" {% function(d) { return d[2]; } %} | "if" _ expression _ expression {% function(d) { return ["if",d[2],d[4]]; } %} | "while" _ expression _ expression {% function(d) { return ["for",d[2],d[4]]; } %}
variable -> "$":? [a-z]:+ {% function(d) { return ["variable", d[0], d[1].join("")]; } %}
number -> [0-9]:+ {% function(d) { return ["number", d[0].join("")]; } %}
_ -> ___:* {% id %}
__ -> ___:+ {% id %}
___ -> [\t \n] {% empty %} | mlcomment | slcomment
mlcomment -> "/*" mlcommentchars:+ .:? "*/" {% function(d) { return [];/*["comment", d[1].join("")+d[2]];*/ } %}
slcomment -> "//" [^\n]:* "\n" {% function(d) { return [];/*["comment", d[1].join("")];*/ } %}
mlcommentchars -> "*" [^/] {% function(d) { return d[0] + d[1]; } %} | [^*] . {% function(d) { return d[0] + d[1]; } %}

Precedence for ambiguous parsings?

Can't figure out how to change precedence for ambiguous parsings. My grammar is ambiguous because it includes emoji (a subset of Unicode) as well as all Unicode characters.

Because of this, an emoji can be parsed either as a single emoji or as its two constituent Unicode characters, but I can't figure out how to prefer parsings where emoji are parsed as emoji (and not as their constituent characters).
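
One escape hatch: nearley postprocessors take a third argument, reject, and returning it discards that particular parse. A sketch, where isEmoji is a hypothetical helper you would supply:

# Refuse to parse an emoji character as a plain character,
# so only the emoji rule can match it.
plainchar -> [\u0000-\uFFFF] {%
    function(d, location, reject) {
        return isEmoji(d[0]) ? reject : d[0];
    }
%}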

Missing /lib

When I downloaded v0.0.4 from npm, there was no /lib directory.

I recommend fixing it with a new version number.

Stuck at step 0

Alright, I've done the following:

  1. Installed nearley via npm.
  2. Created a parser.ne file.
  3. Copied and pasted an example from the examples folder here.
  4. Compiled that example to a grammar JS file successfully. (Although the javascript.ne file does not compile for me.)
  5. Ran valid input through the parser using the code example in the documentation.

Result:
Error at character 0

Since I'm using all the code provided here, I'm perplexed as to what the problem could be. Even a simple, one-line grammar file fails to actually parse anything. Sooo..... o_O ?

JS Grammar Returns Ambiguous Parsings

Some grammars will have ambiguous parsings due to constructs like
return [2];
which can be parsed both as returning the list [2] and as reading the element at index 2 of an array named return.
Nearley may need new features to circumvent this problem easily.

Lazily evaluate postprocessors

Profiling shows that we're running postprocessors too often. Also, we're creating too many this.data = []s, which bloats memory.

Limiting array depth (and other questions)

Next question (sorry for all the questions; I'm trying to evaluate this for a large project, and would love to talk to you about it).

When using the example javascript.js, or even when writing short grammars of my own, the parser seems to generate unnecessarily deep array structures. For instance, when parsing something like this:

(function() {
  var blah = 'blah';
});

I get a structure like this:

[screenshot: parse output showing deeply nested arrays]

Most of the arrays contain nothing but another array. Even from the text output, you can see it's creating arrays with no value.
[screenshot: text output full of empty and single-element arrays]

Since each array increases the number of JS objects in memory, is there a way to keep this structure a little flatter and eliminate empty arrays (or arrays with one value which is another array)? Seems like matching entities like variable names or values could be pushed as a single value. Does nearley support that?
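
One remedy is to return flat nodes from postprocessors instead of keeping nearley's default nested arrays. A small sketch (the rule names here are illustrative, not from javascript.ne):

# Collapse the match into a single flat object.
assignment -> "var" __ name _ "=" _ name _ ";" {%
    function(d) { return { type: "assign", name: d[2], value: d[6] }; }
%}
name -> [a-z]:+ {% function(d) { return d[0].join(""); } %}
_ -> [\s]:*
__ -> [\s]:+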

Limiting Parsings for Ambiguous Grammars (Bocages)

I know it's a long shot and not really something Earley was meant for, but are there any methods or optimizations we can implement to deal with ambiguous grammars that have exponentially many parsings? For instance:

num -> num "+" num | [0-9]:+

This seems simple, but nearley can't deal well with large statements of this type, because the number of possible parsings grows exponentially with the number of plus signs. Could we implement an option to limit the number of parsings?
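
One workaround is to restate the rule so that only one parse exists; keeping it left-recursive keeps it efficient (a sketch):

# Unambiguous, left-associative restatement of the same language.
num -> num "+" int | int
int -> [0-9]:+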

disclude `undefined` from data array, provide notreegen

Related to postprocessing and eliminating nodes: is there a value I can return from a matched token that will eliminate it from the output automatically?

For example, it would be nice if I could just drop optional whitespace, such that--

selector _ combinator _ element

-- would return an array of 3 elements instead of 5. I thought I could write:

_ -> null

-- but then that just returned an actual null value. I can see why I might want to post-process based on null, so I'm not sure what to suggest, maybe a special nearley token? Like: %null% or something?
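
In the meantime, the dropping can be done by hand in the parent rule's postprocessor (a sketch):

rule -> selector _ combinator _ element {%
    function(d) { return [d[0], d[2], d[4]]; } // drop the whitespace slots
%}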

On the GPU

Libraries do exist for interfacing with the GPU from JavaScript (such as https://github.com/timoxley/saltmine, among other, more mature alternatives). Parsing can sometimes benefit from running on the GPU; the biggest problem is that we'd have to port parts of the parser to the languages GPUs support (such as https://en.wikipedia.org/wiki/OpenGL_Shading_Language). If we lazily compute postprocessing, this could be quite useful when you have a specific expression that is large, complicated to check, or would otherwise benefit from the GPU.

Bug regarding percent symbol '%' in JS postproc code

When a % symbol appears inside a postprocessor block, it causes an error. The compiler doesn't ignore % symbols found in arbitrary JS code; it consistently complains that there are "no possible parsings" while pointing at whatever symbol (any symbol) follows the culprit %.

Example:

main -> number {% function(d) { return d[0] + '%' } %}

number -> [\d]:+ {%
    function(d) { return d[0].join(''); }
%}

Error:

Error: nearley: No possible parsings (@48: ''').
    at Parser.feed (C:\Users\Raymond\AppData\Roaming\npm\node_modules\nearley\lib\nearley.js:219:23)
    at StreamWrapper.write [as _write] (C:\Users\Raymond\AppData\Roaming\npm\node_modules\nearley\lib\stream.js:12:18)
    at doWrite (_stream_writable.js:301:12)
    at writeOrBuffer (_stream_writable.js:288:5)
    at StreamWrapper.Writable.write (_stream_writable.js:217:11)
    at ReadStream.ondata (_stream_readable.js:540:20)
    at ReadStream.emit (events.js:107:17)
    at readableAddChunk (_stream_readable.js:163:16)
    at ReadStream.Readable.push (_stream_readable.js:126:10)
    at onread (fs.js:1679:12)

Tool to test if a grammar with a sample input file is ambiguous

It has been shown that automatically testing whether a grammar is ambiguous is impossible, or close to it. However, many ambiguities will show up in a sufficiently complex test file. I'd like a simple program that can be attached to a nearleyc build script to run the grammar against a given test file: if there are no parsings or a parse error, the script raises an error; if there are ambiguities, it also raises an error. If there is a single, unique parsing, the script returns 0 with no output.

Would make developing grammars a bit less painful, maybe.
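
Something like this sketch is what I have in mind (assuming a nearleyc-compiled ./grammar.js and the Grammar.fromCompiled API):

// ambiguity-check.js: exit 0 on exactly one parse, 1 otherwise.
const fs = require("fs");
const nearley = require("nearley");
const grammar = require("./grammar.js");

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
try {
    parser.feed(fs.readFileSync(process.argv[2], "utf8"));
} catch (e) {
    console.error("parse error: " + e.message);
    process.exit(1);
}
if (parser.results.length !== 1) {
    console.error(parser.results.length + " parses (want exactly 1)");
    process.exit(1);
}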

Prevent duplicate states from being added to tables

With the simple grammar from Aycock and Horspool:

S -> A A A A
A -> "a"
A -> E
E -> null

When run against the trivial input:

aa

This yields the following parse tables; note the duplicate states.

table 0
     { _start → ● S },0
     { S → ● A A A A },0
     { A → ● "a" },0
     { A → ● E },0
     { A → E ● },0
     { S → A ● A A A },0
     { S → A A ● A A },0
     { S → A A A ● A },0
     { S → A A A A ● },0
     { _start → S ● },0
table 1
     { A → "a" ● },0
     { S → A ● A A A },0
     { S → A A ● A A },0
     { S → A A A ● A },0
     { S → A A A A ● },0
     { A → ● "a" },1
     { A → ● E },1
     { _start → S ● },0
     { A → E ● },1
     { S → A A ● A A },0
     { S → A A A ● A },0
     { S → A A A A ● },0
     { S → A A A ● A },0
     { S → A A A A ● },0
     { S → A A A A ● },0
     { _start → S ● },0
     { _start → S ● },0
     { _start → S ● },0
table 2
     { A → "a" ● },1
     { S → A A ● A A },0
     { S → A A A ● A },0
     { S → A A A A ● },0
     { S → A A A ● A },0
     { S → A A A A ● },0
     { S → A A A A ● },0
     { A → ● "a" },2
     { A → ● E },2
     { _start → S ● },0
     { _start → S ● },0
     { _start → S ● },0
     { A → E ● },2
     { S → A A A ● A },0
     { S → A A A A ● },0
     { S → A A A A ● },0
     { S → A A A A ● },0
     { _start → S ● },0
     { _start → S ● },0
     { _start → S ● },0

It does parse correctly, however, yielding

[ [ [ [] ], [ [] ], [ 'a' ], [ 'a' ] ],
  [ [ [] ], [ 'a' ], [ [] ], [ 'a' ] ],
  [ [ 'a' ], [ [] ], [ [] ], [ 'a' ] ],
  [ [ [] ], [ 'a' ], [ 'a' ], [ [] ] ],
  [ [ 'a' ], [ [] ], [ 'a' ], [ [] ] ],
  [ [ 'a' ], [ 'a' ], [ [] ], [ [] ] ] ]

So this is just an efficiency concern.

Right now this is due to the lack of duplicate checking (or a Set-like data structure) in State.prototype.process, specifically table[location].push(x);

I'm actively working on this but haven't figured out a tidy way to solve it yet.
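
One direction (a sketch only; the field names rule.id, dot, and reference are assumptions about the internals, not the real API) is to key each state and consult a Set before pushing:

// Hypothetical dedup guard around table[location].push(x).
function pushUnique(column, seen, state) {
    // Assumes these three fields uniquely identify an Earley item.
    const key = state.rule.id + "|" + state.dot + "|" + state.reference;
    if (!seen.has(key)) {
        seen.add(key);
        column.push(state);
    }
}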

`@include` for file in same directory fails

Given grammar a.ne:

a -> "a"

and b.ne:

@import "a.ne"
# or @import "./a.ne"

b -> a:*

I expect the include to work but it throws an exception. The path it ends up trying to include is ./b.ne/a.ne.
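
Presumably the fix is to resolve the include against the including file's directory rather than against the file path itself; roughly (a sketch):

// Sketch: resolve relative to the parent grammar's directory.
const path = require("path");
function resolveInclude(parentFile, includePath) {
    return path.resolve(path.dirname(parentFile), includePath);
}
// resolveInclude("b.ne", "a.ne") resolves to "./a.ne", not "./b.ne/a.ne"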

Browser tests

nearley should work in the browser. Ideally, there'd be a demo page which compiles samples (like PEGjs/Jison).

Request: Add on-demand compilation like PEG.js' `buildParser` API

My workflow with PEG.js is to only ever use .pegjs files and to include them in node via PEG.js' buildParser API. I find this less error-prone than using generated JavaScript files (no wasted time debugging things and then realizing you forgot to regenerate after changing the grammar).
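
Until such an API exists, one workaround is to shell out to nearleyc at load time (a sketch, assuming nearleyc is on the PATH and supports the -o output flag):

// Sketch: compile a .ne file on demand, then require the result.
const { execSync } = require("child_process");

function buildParser(neFile, jsFile) {
    execSync("nearleyc " + neFile + " -o " + jsFile);
    return require(jsFile);
}

const grammar = buildParser("./parser.ne", "./parser.js");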

Failing to parse something: "No possible parsings"

The following grammar fails to parse $a; with No possible parsings (@2: ';').

A change which shouldn't make any difference lets it parse correctly: changing the definition of statement to reference product rather than sum works, even though sum -> product.

program -> _ block {% function(d) { return d[1]; } %}
block -> (statement _):* {% function(d) { return ["block", d[0].map(function(s){return s[0];})]; } %}
#statement -> expression _ ";" {% id %}
statement -> sum _ ";" {% id %}
#statement -> product _ ";" {% id %}
expression -> expression _ ("="|"=="|"!="|">"|"<"|"<="|">=") _ sum {% function(d) { return ["operation", d[0], d[2], d[4]]; } %} | sum
#sum -> sum ("*"|"/") product | product
sum -> product
product -> product ("*"|"/") exp | exp
exp -> unaryoperation "^" exp | unaryoperation # this is right associative!
unaryoperation -> unaryoperation _ ("++"|"--") {% function(d) { return ["unaryoperation",d[0],d[2]]; } %} | mapoperation
mapoperation -> mapoperation _ "[" _ expression _ "]" {% function(d) { return ["map",d[0],d[4]]; } %} | element
element -> variable {% id %} | number {% id %} | "(" _ expression _ ")" {% function(d) { return d[2]; } %} | "{" _ block "}" {% function(d) { return d[2]; } %} | "if" _ expression _ expression {% function(d) { return ["if",d[2],d[4]]; } %} | "while" _ expression _ expression {% function(d) { return ["for",d[2],d[4]]; } %}
variable -> "$":? [a-z]:+ {% function(d) { return ["variable", d[0], d[1].join("")]; } %}
number -> [0-9]:+ {% function(d) { return ["number", d[0].join("")]; } %}
_ -> ___:* {% id %}
__ -> ___:+ {% id %}
___ -> [\t \n] {% empty %} | mlcomment | slcomment
mlcomment -> "/*" mlcommentchars:+ .:? "*/" {% function(d) { return [];/*["comment", d[1].join("")+d[2]];*/ } %}
slcomment -> "//" [^\n]:* "\n" {% function(d) { return [];/*["comment", d[1].join("")];*/ } %}
mlcommentchars -> "*" [^/] {% function(d) { return d[0] + d[1]; } %} | [^*] . {% function(d) { return d[0] + d[1]; } %}

Zero-length assertions?

I'm loving nearley, thank you for building it! This is vastly better than LR/LL parsing and I'm astonished LR/LL still garners so much attention given the limitations.

A question: Is there a way to encode zero-length assertions or otherwise control ambiguity when one nonterminal is an abbreviation of another?

For instance, consider a language of 'a', 'b', and 'ab' tokens where 'ab' should be matched instead of 'a' followed by 'b'. Given the rule
tokens -> ("ab" | "a" | "b"):*

"aabab" would ideally parse to ["ab" | "a" | "b"]

Instead, you get 5 matches breaking up the "ab" tokens differently:
[[["a"],["ab"],["ab"]]]
[[["a"],["a"],["b"],["ab"]]]
[[["a"],["a"],["ba"],["b"]]]
[[["a"],["ab"],["a"],["b"]]]
[[["a"],["a"],["b"],["a"],["b"]]]

I can get the ideal output if I write a grammar where a standalone "b" is not allowed. But what if you need that? (e.g., if you're trying to parse "elseif", "else", and "if" distinctly)

I'd like to be able either to enforce priorities on some OR operators, or to explicitly rule out look-behind/look-ahead matches without capturing the ruled-out characters, à la:

tokens ->  ("ab" || "a" || "b")          or    tokens -> ("ab" | "a" !"b" | !"a" "b") 

The best hacks I can come up with so far are:

  1. Apply a regular expression in advance that inserts a boundary token at all word boundaries, then modify the grammar to match boundaries. But this makes the grammar much more complex.
  2. Detect and track word boundaries in advance (without changing the input string), then use a postprocessor to reject matches that don't start at a word boundary based on the l parameter. But this only works for leading-edge boundaries.

So far I'm thinking of building option 2, and I may be able to live without trailing-edge constraints.

Thanks for nearley, and for any ideas!
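
For what it's worth, a lexer such as moo can resolve this kind of token-level ambiguity before nearley ever sees the input, since moo tries its rules in the order they are defined (a sketch):

// moo picks "ab" over "a" + "b" because it is listed first.
const moo = require("moo");
const lexer = moo.compile({
    ab: "ab",
    a:  "a",
    b:  "b",
});

lexer.reset("aabab");
// token stream: a, ab, ab (a single, unambiguous segmentation)

The grammar can then declare @lexer lexer and match %ab, %a, and %b tokens instead of raw strings.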

Show locations of parse errors

Right now the "it didn't work" is hard to debug in a large grammar. Having positions of errors (and what would work at that point) would be awesome.

Readme: install

The command is npm install -g nearley, not npm install -g nearleyc.

Complexity of parsing indented comments

I commented out a block of lines for debugging purposes and ran into some strange behaviour: the complexity of parsing a parser definition seems to grow dramatically with the number and indentation level of subsequent indented comments.

$ node_modules/.bin/nearleyc --version
0.2.2

For example, with a series of files named indentn.ne, where a block of comments is indented n spaces, i.e.

$ cat indent0.ne
foo -> "bar"
#1
#2
#3
#4
#5
#6
$ cat indent2.ne
foo -> "bar"
  #1
  #2
  #3
  #4
  #5
  #6

etc., execution time shoots up rapidly:

$ time node_modules/.bin/nearleyc indent0.ne > /dev/null
real    0m0.116s
user    0m0.096s
sys     0m0.019s
$ time node_modules/.bin/nearleyc indent2.ne > /dev/null
real    0m0.763s
user    0m0.720s
sys     0m0.048s
$ time node_modules/.bin/nearleyc indent4.ne > /dev/null
real    0m8.349s
user    0m8.138s
sys     0m0.256s

time node_modules/.bin/nearleyc indent6.ne > /dev/null has not yet finished. 😉

Is full regex supported?

Is this supported? foo -> [0-9A-F?]{1,6}

I'm getting a "no possible parsings" error when generating the grammar.js file.
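
nearley's charclasses don't support counted repetition like {1,6}, so the bound has to be unrolled by hand. A sketch of an equivalent grammar:

# Equivalent of [0-9A-F?]{1,6}: one mandatory char plus up to five optional ones.
foo -> hex hex:? hex:? hex:? hex:? hex:?
hex -> [0-9A-F?]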

Unrolling regexes

@JacobEdelman claims we can take full regexes in a grammar and compile them down to nearley automatically. I believe him, so I hereby assign this to him.

Move nearley.js and generated grammars to strict mode

First I want to thank you for this library. It made me able to experiment freely with the syntax of my toy language. It saved me a lot of time.

Now to the point of this post: while I was making small changes here and there in the code, I noticed that the nearley.js file and the generated grammar JS file are not in strict mode. Adding the 'use strict' directive made the code ~2x faster when parsing my grammar on Node.js 4.0. This may not be the case for other grammars and other environments, but I think it is worth trying.

Add performance tests

To see how large grammars and large inputs perform, and make sure an update doesn't dramatically slow it down.

EBNF support

It would be nice to compile grouping and Kleene operators down to pure BNF.
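
Roughly, the compiler could desugar each EBNF construct into a fresh helper nonterminal, along these lines (a sketch; the generated names are illustrative):

# Source:         a -> b:*
# Desugared BNF:
a -> a$1
a$1 -> null
     | a$1 b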

Comments

Allow comments: anything after %% should be ignored on a line.

Use require architecture

When compiling for Node.js, the compiler should use require('nearley') instead of copying nearley.js out literally.

Free memory in completed states

Completed states in the table can be nulled out and left for the GC. This ought to free up memory (someone correct me if I'm wrong).

Named tokens

Provide a way to name a token, and bind these names by augmenting the data array into an object.

a -> name:string {% function(d) {return d.name;} %}

Allow full unicode literals in strings

From @beaugunderson

another aside: it seems like the only way to include unicode literals (like
\u2e03) is via a character class; is that intentional?

Note that for now, you can include the unicode literal, uh, "literally" like this:

a -> "cafรฉ"

Alternatively, charclasses:

a -> [\uxxxx]

Spurious parses stemming from nullable nonterminals

The following grammar:

@builtin "whitespace.ne"

d -> a

a -> b _ "&"
   | b

b -> letter
   | "(" _ d _ ")"

letter -> [a-z]

when run using nearley-test on (x), generates two (identical) parses [ [ [ '(', null, [ [ [ [ 'x' ] ] ] ], null, ')' ] ] ]. Since the above grammar is unambiguous, this is unexpected.

Notice that this does not occur if you omit the rule a -> b _ "&", which does not even appear in the derivation; that makes it even more unexpected. It has to do with the order of prediction.

I am preparing a pull request that suggests a fix.

feedback from a noob

hi, I am new to the world of language parsers, and I took some notes while wrapping my head around nearley, perhaps they can be useful:

nearley notes

from calculator: "main is the nonterminal that nearley tries to parse, so we define it first."

vs from readme:

"The first nonterminal you define is the one that the parser tries to parse."

glossary

  • nonterminal - basic parser constructions, made up of a name and expansions
  • name - the left side of the -> in a nonterminal
  • expansions - the stuff on the right side of the -> in a nonterminal. you can have many of these if they are | separated
  • postprocessor - defined inside {% %} blocks at the end of production rules
  • production rules - AKA 'meanings', the name for the overall expression including nonterminals and postprocessors
  • id postprocessor - built in postprocessor that is a shorthand for doing function(data) {data[0];}

JS preprocessors

I'm quite accustomed to CoffeeScript, so I like to use it here and there. It would be nice if I can use a custom preprocessor with nearley (e.g. Babel, CoffeeScript, 5to6, PromisedLand...).

Add Leo reductions for right-recursive grammars

Using the tweaks to the algorithm from Leo, Joop, "A general context-free parsing algorithm running in linear time on every LR(k) grammar without using lookahead", Theoretical Computer Science, Vol. 82 (1991).

Why is the generated JS so verbose?

This is not meant as a criticism; rather, I'm wondering why the generated JS is so "human-readable" and preserves all the named tokens. Is there any reason why the JS file generated from the grammar needs to be readable? Nearley doesn't need to know the names of the tokens, does it? Is that better for testing?

I was just taking a look at the parser file for javascript.js, and I think you could probably reduce the generated code to maybe 1/4 of its size. You could probably also reduce the memory footprint by prototyping objects (new Literal("w") vs { "literal": "w" }: objects that have the same "shape" can be optimized by the JIT compiler, whereas { "literal": "w" } and { "literal": "h" } won't necessarily be detected as having the same shape, IIRC).

(Also, the nearley parser could then perform different actions on symbols based on a matching type, rather than reading the property name "literal".)

.... I suppose you're going to make me write this, lol.
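
Concretely, the object-shape suggestion amounts to something like this sketch:

// One prototyped constructor gives every literal the same hidden class...
function Literal(value) { this.literal = value; }
const w = new Literal("w");
const h = new Literal("h");

// ...whereas ad-hoc object literals leave that to the engine's heuristics:
const w2 = { literal: "w" };
const h2 = { literal: "h" };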
