cederberg / grammatica

Grammatica is a C# and Java parser generator (compiler compiler)

Home Page: https://grammatica.percederberg.net

License: Other

C# 36.30% XSLT 0.85% CSS 0.34% Java 62.51%
c-sharp java java-parser-generator library ll-parser parser-generator

grammatica's Issues

Regular expressions are evaluated incorrectly when an optional subexpression ends in a + or *

(Note: I found this in the C# version. I don't know if the bug exists in the Java version, though I expect it does.)

When a Tokenizer is initialized, all terminals defined by regular expressions are converted to NFAs for evaluation during the lexing step.

When an optional subexpression's final term has an "X-or-more" modifier (+ or *), the regex is converted incorrectly, producing an NFA that does not correspond to the specified regular expression. Instead of applying to the whole subexpression, the optional modifier is applied only to the portion of the subexpression before the final term.

Consider the following regular expression:
((##[^\r\n]*)?\r?\n)+
This contains the optional subexpression:
(##[^\r\n]*)?

This subexpression should match:

  • The empty string
  • Two consecutive # symbols followed by zero or more characters that are not CR or LF.

When ParseFact parses the subexpression, it calls ParseAtom, which returns an NFA of the form:

S0:
  '#' -> S1
S1:
  '#' -> S2
S2: (accept/end state)
  [^\r\n] -> S2

Then, the '?' is read, and ParseFact calls ParseAtomModifier to rewrite the NFA.
ParseAtomModifier computes min=0 max=1 and simply adds an epsilon path from the start to the end state, to make it optional. This rewrites the state machine to:

S0:
  '#' -> S1
  nil -> S2
S1:
  '#' -> S2
S2: (accept/end state)
  [^\r\n] -> S2

Because there are outgoing transitions from S2, this changes the expression to (##)?[^\r\n]*, which matches:

  • The empty string
  • Any number of non-CR/LF characters

(As an aside, this turns my regex into ((##)?[^\r\n]*\r?\n)+, which will match any string that ends in LF as long as all CRs are immediately followed by LF; since that's going to be pretty much any real string input, the regex matched the entire input, which produced some very confusing error messages.)
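The difference between the two readings can be checked with any regex engine. Here is a minimal sketch using java.util.regex as a stand-in (not Grammatica's own engine); the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class OptionalGroupCheck {
    // The subexpression as written in the grammar.
    static final Pattern INTENDED = Pattern.compile("(##[^\\r\\n]*)?");
    // What the miscompiled NFA effectively accepts, per the analysis above.
    static final Pattern MISCOMPILED = Pattern.compile("(##)?[^\\r\\n]*");

    // True if the whole string matches the subexpression as written.
    static boolean intended(String s) {
        return INTENDED.matcher(s).matches();
    }

    // True if the whole string matches the miscompiled form.
    static boolean miscompiled(String s) {
        return MISCOMPILED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "## comment", "not a comment"}) {
            System.out.println("\"" + s + "\" intended=" + intended(s)
                    + " miscompiled=" + miscompiled(s));
        }
    }
}
```

Only the last input tells the two apart: text without a leading ## should be rejected, but the miscompiled form accepts it.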

The fix for this is simple: in ParseAtomModifier, before the "handle supported repeaters" comment, add:

if (end.outgoing.Length > 0) {
   end = end.AddOut(new NFAEpsilonTransition(end));
}

This causes the method to produce the following NFA:

S0:
  '#' -> S1
  nil -> S3
S1:
  '#' -> S2
S2:
  [^\r\n] -> S2
  nil -> S3
S3: (accept/end state)

This will correctly match:

  • The empty string
  • Two consecutive '#' characters followed by any number of non-CR/LF characters.
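To make the before/after behaviour concrete, here is a small self-contained NFA simulation of the two state machines shown above. It is only a sketch: the State class, build and accepts are hypothetical names, not Grammatica's NFA classes, and it checks whole-string acceptance rather than the tokenizer's prefix matching.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

final class OptionalNfaSketch {
    /** A state with labelled transitions and epsilon (nil) transitions. */
    static final class State {
        final List<IntPredicate> labels = new ArrayList<>();
        final List<State> targets = new ArrayList<>();
        final List<State> epsilons = new ArrayList<>();

        void on(IntPredicate label, State target) {
            labels.add(label);
            targets.add(target);
        }
    }

    static final IntPredicate HASH = c -> c == '#';
    static final IntPredicate NOT_EOL = c -> c != '\r' && c != '\n';

    /** Builds the NFA for (##[^\r\n]*)? and returns {start, accept}. */
    static State[] build(boolean applyFix) {
        State s0 = new State(), s1 = new State(), s2 = new State();
        s0.on(HASH, s1);
        s1.on(HASH, s2);
        s2.on(NOT_EOL, s2);
        State end = s2;
        if (applyFix) {
            // end has outgoing transitions, so move the accept state to a fresh S3
            State s3 = new State();
            end.epsilons.add(s3);
            end = s3;
        }
        s0.epsilons.add(end); // the '?' modifier: nil path from start to end
        return new State[] {s0, end};
    }

    /** Adds every state reachable through nil transitions. */
    static Set<State> closure(Set<State> states) {
        Set<State> seen = new HashSet<>(states);
        Deque<State> todo = new ArrayDeque<>(states);
        while (!todo.isEmpty()) {
            for (State next : todo.pop().epsilons) {
                if (seen.add(next)) {
                    todo.push(next);
                }
            }
        }
        return seen;
    }

    /** True if the NFA accepts the entire input string. */
    static boolean accepts(State[] nfa, String input) {
        Set<State> current = closure(Collections.singleton(nfa[0]));
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Set<State> moved = new HashSet<>();
            for (State s : current) {
                for (int t = 0; t < s.labels.size(); t++) {
                    if (s.labels.get(t).test(c)) {
                        moved.add(s.targets.get(t));
                    }
                }
            }
            current = closure(moved);
        }
        return current.contains(nfa[1]);
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + accepts(build(false), "plain text"));
        System.out.println("with fix:    " + accepts(build(true), "plain text"));
    }
}
```

Without the fix, the nil transition lands on S2, whose self-loop then consumes arbitrary non-CR/LF text. With the fix, the nil transition lands on S3, which has no outgoing transitions, so only the empty string or a ##-prefixed run is accepted.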

Is the C# runtime code mechanically translated from the Java runtime code, or was that a one-time translation that's kept in sync manually now? If the latter, I can submit a patch for both, as well as a patch to fix several other minor errors in the Java->C# translation code.

testQuantifierStackOverflow test fails on 32 bit systems

The testQuantifierStackOverflow test seems to fail on 32-bit systems with the following:

    [junit] Testcase: testQuantifierStackOverflow took 0.284 sec
    [junit] 	Caused an ERROR
    [junit] null
    [junit] java.lang.StackOverflowError
    [junit] 	at net.percederberg.grammatica.parser.re.StringElement.match(StringElement.java:98)
    [junit] 	at net.percederberg.grammatica.parser.re.RepeatElement.findMatches(RepeatElement.java:309)
    [junit] 	at net.percederberg.grammatica.parser.re.RepeatElement.findMatches(RepeatElement.java:321)

[many repeats of the last line follow]

Full build logs: i386, armhf.

ArgumentNullException in C# tokenizer at EOF when using .NET Regex extensions

I tried to create a parser and test it (which was a challenge, because I can find no examples documenting how to properly use the generated code), and I get an ArgumentNullException whenever I call Parse with my parser.

Further research indicates that it's because I'm using regex syntax that the Grammatica regex engine doesn't support, so it's automatically switching to using .NET's System.Text.RegularExpressions.Regex class.

Tracing through a parse of a file consisting of a single blank line, here's what I see:

  • ReaderBuffer.Peek(offset: 0) is called
  • ReaderBuffer.Peek calls EnsureBuffered(offset: 1)
  • EnsureBuffered tries to read BLOCK_SIZE characters.
    • It calls input.Read which reads 2 characters (CR and LF)
    • It calls input.Read again which, because we're at EOF, returns 0
    • It then calls input.Close and then sets input to null.
  • Eventually, control returns to Tokenizer.NextToken
  • Tokenizer.NextToken soon calls ReaderBuffer.Read(offset: 2) -- which, despite the comments, seems to actually mean "Read from the current stream position until offset 1"
  • ReaderBuffer.Read sees that input is null and calls Dispose() which sets buffer to null
  • Tokenizer.NextToken is called again to return the next token
  • NextToken calls regExpMatcher.Match
  • RegExpMatcher.Match calls (in a loop) REHandler.Match (an abstract method)
  • When it reaches the .NET-native Regex, that method call resolves to SystemRE.Match
  • SystemRE.Match calls buffer.ToString()
  • buffer.ToString() is implemented as return new string(buffer, 0, length) (where this buffer is a different field, the internal character array)
  • Since that array is null, the string constructor throws an ArgumentNullException.

I see several problems here:

  • ReaderBuffer.Read calls Dispose on itself. Dispose is supposed to mean "I am done using this object"; since ReaderBuffer doesn't own itself, it's disposing an object owned by someone else (the Tokenizer) while the Tokenizer is still using it
  • Tokenizer.NextToken tries to parse at EOF; it should probably return null immediately at EOF without trying to parse an empty buffer
  • ReaderBuffer.ToString can throw an exception (never a good thing)
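The three points above could be addressed roughly as in the following sketch. It is written in Java for illustration and is not Grammatica's actual code: Buffer, fill, atEnd and nextToken are hypothetical names, and the "token" is simply the rest of the input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class EofSafeBufferSketch {
    /** Hypothetical stand-in for the runtime's reader buffer. */
    static final class Buffer implements AutoCloseable {
        private Reader input;
        private char[] data = new char[16];
        private int length = 0;

        Buffer(Reader input) {
            this.input = input;
        }

        /** Reads all remaining input; at EOF it closes the reader but keeps the data. */
        void fill() {
            if (input == null) {
                return;
            }
            try {
                int n;
                while ((n = input.read(data, length, data.length - length)) > 0) {
                    length += n;
                    if (length == data.length) {
                        data = Arrays.copyOf(data, length * 2);
                    }
                }
                input.close();
                input = null;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        /** True once the input is exhausted and position is past the buffered data. */
        boolean atEnd(int position) {
            return input == null && position >= length;
        }

        /** Never throws: a closed buffer is rendered as the empty string. */
        @Override
        public String toString() {
            return data == null ? "" : new String(data, 0, length);
        }

        /** Releases the buffer; only the owner (the tokenizer) should call this. */
        @Override
        public void close() {
            try {
                if (input != null) {
                    input.close();
                    input = null;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            data = null;
            length = 0;
        }
    }

    /** Returns the rest of the input as one "token", or null at end of input. */
    static String nextToken(Buffer buffer, int position) {
        buffer.fill();
        if (buffer.atEnd(position)) {
            return null; // do not run any matcher against an exhausted buffer
        }
        return buffer.toString().substring(position);
    }

    public static void main(String[] args) {
        try (Buffer buffer = new Buffer(new StringReader("\r\n"))) {
            System.out.println(nextToken(buffer, 0) != null);
            System.out.println(nextToken(buffer, 2) == null);
        }
    }
}
```

The design point is ownership: the buffer releases its reader at EOF but keeps its data until the tokenizer that owns it calls close, so a later ToString-style call cannot observe a null array.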

Feature request: discard irrelevant tokens.

Hello,

To my understanding, Grammatica currently requires a declaration for every token, even when the token is irrelevant to the analyzer. For instance, in this production:

IfStatement = "IF" Predicate "THEN" ThenClause "ELSE" ElseClause "END" ;

even if the analyzer is interested only in Predicate, ThenClause and ElseClause, the grammar must declare "IF", "THEN", "ELSE", "END" as tokens, too. To discard them, the corresponding "Exit" method must be overridden to return null, but doing so clutters the code of the analyzer in addition to the grammar. Grammatica could instead discard undefined literal strings in %productions%. If you deem this to be error-prone, because a user could forget to define a token, then a grammar parameter could be added to enable this enhancement.

Thanks for your attention.

Error when evaluating `\\`

In grammatica 1.6:

INTERNAL ERROR: An internal error in Grammatica has been found.
    Please report this error to the maintainers (see the web
    site for instructions). Be sure to include the Grammatica
    version number, as well as the information below:

net.percederberg.grammatica.GrammarException: token 'LITERAL_CHAR' is invalid, as regular expression contains error(s): Illegal repetition near index 6
'(.|(\\{[+-]?[0-9a-fA-F_]+}))?'
      ^, on line 41
        at net.percederberg.grammatica.Grammar.createTokenizer(Grammar.java:225)
        at net.percederberg.grammatica.Grammatica.debug(Grammatica.java:416)
        at net.percederberg.grammatica.Grammatica.main(Grammatica.java:163)
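The wording "Illegal repetition near index 6" looks like a java.util.regex message, so presumably the pattern ended up in Java's own engine, where an unescaped { after \\ is read as the start of a repetition count. Assuming that is what happened, the sketch below reproduces the failure and shows that escaping the braces avoids it. The class and constant names are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class BraceEscapeSketch {
    // The pattern from the error message, exactly as the regex engine sees it.
    static final String REPORTED = "(.|(\\\\{[+-]?[0-9a-fA-F_]+}))?";
    // The same pattern with the literal braces escaped (an assumed workaround).
    static final String ESCAPED = "(.|(\\\\\\{[+-]?[0-9a-fA-F_]+\\}))?";

    /** Returns null if the pattern compiles, else the engine's error description. */
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println("reported: " + compileError(REPORTED));
        System.out.println("escaped:  " + compileError(ESCAPED));
        // A backslash followed by a braced hex value, e.g. \{+1F}
        System.out.println(Pattern.matches(ESCAPED, "\\{+1F}"));
    }
}
```

Whether the grammar file should need the extra escapes, or Grammatica should handle \\{ itself, is a separate question; the sketch only shows where the message comes from.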
