cederberg / grammatica

Grammatica is a C# and Java parser generator (compiler compiler)

Home Page: https://grammatica.percederberg.net

License: Other

C# 36.30% XSLT 0.85% CSS 0.34% Java 62.51%

c-sharp java java-parser-generator library ll-parser parser-generator

grammatica's Issues

Regular expressions are evaluated incorrectly when an optional subexpression ends in a + or *

(Note: I found this in the C# version. I don't know if the bug exists in the Java version, though I expect it does.)

When a Tokenizer is initialized, all terminals defined by regular expressions are converted to NFAs for evaluation during the lexing step.

When an optional subexpression's final term has an "X-or-more" modifier (+ or *), the regex is converted incorrectly, producing an NFA that does not correspond to the specified regular expression. Instead, the optional modifier is applied only to the portion of the subexpression before the final term.

Consider the following regular expression:
((##[^\r\n]*)?\r?\n)+
This contains the optional subexpression:
(##[^\r\n]*)?

This subexpression should match:

The empty string
Two consecutive # symbols followed by zero or more characters that are not CR or LF.

When ParseFact parses the subexpression, it calls ParseAtom, which returns an NFA of the form:

S0:
  '#' -> S1
S1:
  '#' -> S2
S2: (accept/end state)
  [^\r\n] -> S2

Then, the '?' is read, and ParseFact calls ParseAtomModifier to rewrite the NFA.
ParseAtomModifier computes min=0 max=1 and simply adds an epsilon path from the start to the end state, to make it optional. This rewrites the state machine to:

S0:
  '#' -> S1
  nil -> S2
S1:
  '#' -> S2
S2: (accept/end state)
  [^\r\n] -> S2

Because there are outgoing transitions from S2, this changes the expression to (##)?[^\r\n]*, which matches:

The empty string
Any number of non-CR/LF characters

(As an aside, this turns my regex into ((##)?[^\r\n]*\r?\n)+, which will match any string that ends in LF as long as all CRs are immediately followed by LF; since that's going to be pretty much any real string input, the regex matched the entire input, which produced some very confusing error messages.)
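
The difference between the intended and the miscompiled expression can be checked outside Grammatica. The following is a minimal sketch that uses java.util.regex instead of Grammatica's own regex engine; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustration only: compares the regex as written with the regex that the
// miscompiled NFA effectively implements, using java.util.regex.
class OptionalGroupDemo {
    // The optional subexpression as written in the grammar.
    static final Pattern INTENDED = Pattern.compile("(##[^\\r\\n]*)?");
    // What the NFA actually matches when only "##" becomes optional.
    static final Pattern MISCOMPILED = Pattern.compile("(##)?[^\\r\\n]*");

    static boolean intended(String s) {
        return INTENDED.matcher(s).matches();
    }

    static boolean miscompiled(String s) {
        return MISCOMPILED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "## comment", "not a comment"}) {
            System.out.println("\"" + s + "\" intended=" + intended(s)
                    + " miscompiled=" + miscompiled(s));
        }
    }
}
```

Only the last input distinguishes the two: the intended expression rejects text that does not start with ##, while the miscompiled one accepts any text without CR or LF.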

The fix for this is simple: in ParseAtomModifier, before the "handle supported repeaters" comment:

if (end.outgoing.Length > 0) {
   end = end.AddOut(new NFAEpsilonTransition(end));
}

This causes the method to produce the following NFA:

S0:
  '#' -> S1
  nil -> S3
S1:
  '#' -> S2
S2:
  [^\r\n] -> S2
  nil -> S3
S3: (accept/end state)

This will correctly match:

The empty string
Two consecutive '#' characters followed by any number of non-CR/LF characters.
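
To make the two state machines concrete, here is a small self-contained simulation in Java. It is only a sketch: the State class and the helper methods are hypothetical and are not Grammatica's NFAState or NFAEpsilonTransition classes. It builds the NFA from before the fix (epsilon from S0 to S2) and the NFA from after the fix (extra accept state S3) exactly as drawn above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

class NfaSketch {
    // Hypothetical minimal NFA state: character transitions plus epsilons.
    static final class State {
        final List<IntPredicate> tests = new ArrayList<>();
        final List<State> targets = new ArrayList<>();
        final List<State> epsilons = new ArrayList<>();

        void on(IntPredicate test, State target) {
            tests.add(test);
            targets.add(target);
        }
    }

    // Adds every state reachable through epsilon transitions.
    static Set<State> closure(Set<State> states) {
        Set<State> result = new HashSet<>(states);
        Deque<State> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            for (State next : work.pop().epsilons) {
                if (result.add(next)) {
                    work.push(next);
                }
            }
        }
        return result;
    }

    // True if the whole input leads from 'start' to 'accept'.
    static boolean accepts(State start, State accept, String input) {
        Set<State> current = closure(Collections.singleton(start));
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Set<State> next = new HashSet<>();
            for (State s : current) {
                for (int t = 0; t < s.tests.size(); t++) {
                    if (s.tests.get(t).test(c)) {
                        next.add(s.targets.get(t));
                    }
                }
            }
            current = closure(next);
        }
        return current.contains(accept);
    }

    // NFA before the fix: the epsilon from S0 lands on S2, which loops.
    static boolean buggy(String input) {
        State s0 = new State(), s1 = new State(), s2 = new State();
        s0.on(c -> c == '#', s1);
        s1.on(c -> c == '#', s2);
        s2.on(c -> c != '\r' && c != '\n', s2);
        s0.epsilons.add(s2);
        return accepts(s0, s2, input);
    }

    // NFA after the fix: S3 is a fresh accept state with no outgoing edges.
    static boolean fixed(String input) {
        State s0 = new State(), s1 = new State(), s2 = new State(), s3 = new State();
        s0.on(c -> c == '#', s1);
        s1.on(c -> c == '#', s2);
        s2.on(c -> c != '\r' && c != '\n', s2);
        s2.epsilons.add(s3);
        s0.epsilons.add(s3);
        return accepts(s0, s3, input);
    }

    public static void main(String[] args) {
        System.out.println("buggy(\"abc\") = " + buggy("abc"));
        System.out.println("fixed(\"abc\") = " + fixed("abc"));
    }
}
```

The key point is that the optional epsilon must end in a state without outgoing transitions; otherwise the loop on the old end state becomes reachable without consuming the ## prefix.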

Is the C# runtime code mechanically translated from the Java runtime code, or was that a one-time translation that's kept in sync manually now? If the latter, I can submit a patch for both, as well as a patch to fix several other minor errors in the Java->C# translation code.

testQuantifierStackOverflow test fails on 32 bit systems

The testQuantifierStackOverflow test seems to fail on 32-bit systems with the following:

    [junit] Testcase: testQuantifierStackOverflow took 0.284 sec
    [junit] 	Caused an ERROR
    [junit] null
    [junit] java.lang.StackOverflowError
    [junit] 	at net.percederberg.grammatica.parser.re.StringElement.match(StringElement.java:98)
    [junit] 	at net.percederberg.grammatica.parser.re.RepeatElement.findMatches(RepeatElement.java:309)
    [junit] 	at net.percederberg.grammatica.parser.re.RepeatElement.findMatches(RepeatElement.java:321)

[many repeats of the last line follow]

Full build logs: i386, armhf.

Porting to Maven

Hello,
I used Grammatica a gazillion years ago (2005!) for my graduation thesis, and I'm now rediscovering it to show how a parser could be used in one of our internal tools.

Is it possible to have the latest version on Maven? I see that https://mvnrepository.com/artifact/net.percederberg.grammatica/grammatica has version 1.5 only.

how to use grammar file?

ArgumentNullException in C# tokenizer at EOF when using .NET Regex extensions

I tried to create a parser and test it (which was a challenge, because I could find no examples documenting how to properly use the generated code), and I get an ArgumentNullException whenever I call Parse with my parser.

Further research indicates that this happens because I'm using regex syntax that the Grammatica regex engine doesn't support, so it automatically switches to .NET's System.Text.RegularExpressions.Regex class.

Tracing through a parse of a file consisting of a single blank line, here's what I see:

ReaderBuffer.Peek(offset: 0) is called
ReaderBuffer.Peek calls EnsureBuffered(offset: 1)
EnsureBuffered tries to read BLOCK_SIZE characters.
- It calls input.Read, which reads 2 characters (CR and LF)
- It calls input.Read again, which returns 0 because we're at EOF
- It then calls input.Close and sets input to null.
Eventually, control returns to Tokenizer.NextToken
Tokenizer.NextToken soon calls ReaderBuffer.Read(offset: 2), which, despite the comments, seems to actually mean "Read from the current stream position until offset 1"
ReaderBuffer.Read sees that input is null and calls Dispose(), which sets buffer to null
Tokenizer.NextToken is called again to return the next token
NextToken calls regExpMatcher.Match
RegExpMatcher.Match calls (in a loop) REHandler.Match (an abstract method)
When it reaches the .NET-native Regex, that method call resolves to SystemRE.Match
SystemRE.Match calls buffer.ToString()
buffer.ToString() is implemented as return new string(buffer, 0, length) (a different buffer)
Since that buffer is null, the string constructor throws an ArgumentNullException.

I see several problems here:

ReaderBuffer.Read calls Dispose on itself. Dispose is supposed to mean "I am done using this object"; since ReaderBuffer doesn't own itself, it's disposing an object owned by someone else (the Tokenizer) while the Tokenizer is still using it.
Tokenizer.NextToken tries to parse at EOF; it should probably return null immediately at EOF without trying to parse an empty buffer.
ReaderBuffer.ToString can throw an exception (never a good thing).
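
One way to address all three points is to separate "the input is exhausted" from "the buffer is released". The sketch below is written in Java for brevity and is purely hypothetical: EofSafeBuffer, fill, atEnd and nextLine are invented names, not part of the Grammatica runtime, and the line-based nextLine merely stands in for a real tokenizer step.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical buffer, not Grammatica's class: reaching EOF closes the
// reader but keeps the buffered text until the owner is done with it.
class EofSafeBuffer {
    private final StringBuilder text = new StringBuilder();
    private Reader input;

    EofSafeBuffer(Reader input) {
        this.input = input;
    }

    // Reads all remaining input; safe to call again after EOF.
    void fill() {
        if (input == null) {
            return;
        }
        try {
            char[] block = new char[1024];
            int count;
            while ((count = input.read(block)) > 0) {
                text.append(block, 0, count);
            }
            input.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        input = null; // marks EOF only; 'text' stays valid
    }

    // True once the reader is exhausted and 'position' is past the text.
    boolean atEnd(int position) {
        return input == null && position >= text.length();
    }

    // Stand-in for a tokenizer step: returns null at EOF instead of matching.
    String nextLine(int position) {
        fill();
        if (atEnd(position)) {
            return null;
        }
        int end = text.indexOf("\n", position);
        return end < 0 ? text.substring(position) : text.substring(position, end + 1);
    }

    @Override
    public String toString() {
        return text.toString(); // never null, so this cannot throw
    }

    public static void main(String[] args) {
        EofSafeBuffer buffer = new EofSafeBuffer(new StringReader("\r\n"));
        System.out.println(buffer.nextLine(0).length());
        System.out.println(buffer.nextLine(2));
    }
}
```

With this shape the buffer never invalidates itself: the owner decides when to release it, the EOF check happens before any matching is attempted, and toString always has something valid to return.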

Feature request: discard irrelevant tokens.

Hello,

To my understanding, Grammatica currently requires a declaration for every token, even those that are irrelevant to the analyzer. For instance, in this production:

IfStatement = "IF" Predicate "THEN" ThenClause "ELSE" ElseClause "END" ;

Even if the analyzer is interested only in Predicate, ThenClause and ElseClause, the grammar must also declare "IF", "THEN", "ELSE" and "END" as tokens. To discard them, the corresponding "Exit" method must be overridden to return null, but doing so clutters the analyzer code as well as the grammar. Grammatica could instead discard undefined literal strings in %productions%. If you deem this to be error-prone, because a user could forget to define a token, then a grammar parameter could be added to enable this enhancement.

Thanks for your attention.

Error when evaluating `\\`

In grammatica 1.6:

INTERNAL ERROR: An internal error in Grammatica has been found.
    Please report this error to the maintainers (see the web
    site for instructions). Be sure to include the Grammatica
    version number, as well as the information below:

net.percederberg.grammatica.GrammarException: token 'LITERAL_CHAR' is invalid, as regular expression contains error(s): Illegal repetition near index 6
'(.|(\\{[+-]?[0-9a-fA-F_]+}))?'
      ^, on line 41
        at net.percederberg.grammatica.Grammar.createTokenizer(Grammar.java:225)
        at net.percederberg.grammatica.Grammatica.debug(Grammatica.java:416)
        at net.percederberg.grammatica.Grammatica.main(Grammatica.java:163)
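
The "Illegal repetition" wording matches the error that java.util.regex reports for this pattern, so the message can be reproduced without Grammatica. The sketch below assumes the pattern shown in the error is the exact string handed to java.util.regex; the class name and the escaped variant are illustrative and are not taken from the report. One plausible reading is that \\ is consumed as a literal backslash, after which the unescaped { is parsed as the start of a {n,m} quantifier.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class IllegalRepetitionDemo {
    // The pattern from the error message, written as a Java string literal.
    static final String REPORTED = "(.|(\\\\{[+-]?[0-9a-fA-F_]+}))?";
    // Illustrative variant with the braces escaped as literals.
    static final String ESCAPED = "(.|(\\\\\\{[+-]?[0-9a-fA-F_]+\\}))?";

    // Returns the syntax error description, or null if the regex compiles.
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println("reported: " + compileError(REPORTED));
        System.out.println("escaped:  " + compileError(ESCAPED));
        // Backslash, brace, hex digits, brace: matched by the escaped variant.
        System.out.println(Pattern.matches(ESCAPED, "\\{1F}"));
    }
}
```

Whether Grammatica's own regex engine should accept the original pattern before falling back to java.util.regex is a separate question that this sketch does not answer.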

Infinite loop bug

Hi again!
After some testing I've discovered that the for loop at https://github.com/cederberg/grammatica/blob/master/src/csharp/PerCederberg.Grammatica.Runtime/RecursiveDescentParser.cs#L213 runs forever in some cases. And here's some criticism for you: increasing and decreasing the loop counter inside the loop body is pretty much an anti-pattern. It makes it very, very hard to understand what your intention was and what has gone wrong.

NuGet package

@cederberg Have you considered making this available on NuGet?
