cederberg / grammatica
Grammatica is a C# and Java parser generator (compiler compiler)
Home Page: https://grammatica.percederberg.net
License: Other
(Note: I found this in the C# version. I don't know if the bug exists in the Java version, though I expect it does.)
When a Tokenizer is initialized, all terminals defined by regular expressions are converted to NFAs for evaluation during the lexing step.
When an optional subexpression's final term has an "X-or-more" modifier (? or *), the regex is converted incorrectly, producing an NFA that does not correspond to the specified regular expression. Instead, the modifier is applied to the portion of the subexpression before the final term.
Consider the following regular expression:
((##[^\r\n]*)?\r?\n)+
This contains the optional subexpression:
(##[^\r\n]*)?
This subexpression should match either the empty string or two # symbols followed by zero or more characters that are not CR or LF.
When ParseFact parses the subexpression, it calls ParseAtom, which returns an NFA of the form:
S0:
'#' -> S1
S1:
'#' -> S2
S2: (accept/end state)
[^\r\n] -> S2
Then, the '?' is read, and ParseFact calls ParseAtomModifier to rewrite the NFA.
ParseAtomModifier computes min=0 max=1 and simply adds an epsilon path from the start to the end state, to make it optional. This rewrites the state machine to:
S0:
'#' -> S1
nil -> S2
S1:
'#' -> S2
S2: (accept/end state)
[^\r\n] -> S2
Because there are outgoing transitions from S2, this changes the expression to (##)?[^\r\n]*, which matches any run of characters that are not CR or LF, with or without a leading ##.
(As an aside, this turns my regex into ((##)?[^\r\n]*\r?\n)+, which will match any string that ends in LF as long as all CRs are immediately followed by LF; since that describes pretty much any real input, the regex matched the entire input, which produced some very confusing error messages.)
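
The difference between the two expressions can be checked independently of Grammatica. The sketch below uses java.util.regex rather than Grammatica's own engine, so it only illustrates what each expression means; the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

public class OptionalGroupRegexDemo {
    // The regular expression as written in the grammar.
    static final Pattern INTENDED =
        Pattern.compile("((##[^\\r\\n]*)?\\r?\\n)+");

    // The expression the faulty NFA effectively implements, per the analysis above.
    static final Pattern MISCONVERTED =
        Pattern.compile("((##)?[^\\r\\n]*\\r?\\n)+");

    static boolean matchesIntended(String s) {
        return INTENDED.matcher(s).matches();
    }

    static boolean matchesMisconverted(String s) {
        return MISCONVERTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "\n", "## comment\r\n\n", "plain text\n", "a\r\nb\n" };
        for (String s : samples) {
            // Escape line breaks so each sample prints on one line.
            String shown = s.replace("\r", "\\r").replace("\n", "\\n");
            System.out.println(shown
                + "  intended=" + matchesIntended(s)
                + "  misconverted=" + matchesMisconverted(s));
        }
    }
}
```

Blank lines and ## comment lines match both expressions, while ordinary text lines match only the misconverted one, which is why the token swallowed the whole input.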
The fix for this is simple: in ParseAtomModifier, before the "handle supported repeaters" comment, add:
if (end.outgoing.Length > 0) {
    end = end.AddOut(new NFAEpsilonTransition(end));
}
This causes the method to produce the following NFA:
S0:
'#' -> S1
nil -> S3
S1:
'#' -> S2
S2:
[^\r\n] -> S2
nil -> S3
S3: (accept/end state)
This will correctly match either the empty string or ## followed by zero or more characters that are not CR or LF.
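
The two state machines above can be reproduced with a small stand-alone NFA simulation. This is a minimal sketch, not Grammatica's code: State, Fragment, optional and accepts are hypothetical names, and the fix flag only mirrors the idea of adding a fresh end state when the old one has outgoing transitions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

public class OptionalNfaSketch {
    /** NFA state with labelled and epsilon transitions (not Grammatica's NFAState). */
    static final class State {
        final List<IntPredicate> labels = new ArrayList<>();
        final List<State> targets = new ArrayList<>();
        final List<State> epsilons = new ArrayList<>();

        void on(IntPredicate label, State target) {
            labels.add(label);
            targets.add(target);
        }

        boolean hasOutgoing() {
            return !labels.isEmpty() || !epsilons.isEmpty();
        }
    }

    /** An NFA fragment with one start state and one end (accept) state. */
    static final class Fragment {
        final State start;
        final State end;

        Fragment(State start, State end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Builds ##[^\r\n]* as S0 -'#'-> S1 -'#'-> S2, with S2 looping on itself. */
    static Fragment hashComment() {
        State s0 = new State(), s1 = new State(), s2 = new State();
        s0.on(c -> c == '#', s1);
        s1.on(c -> c == '#', s2);
        s2.on(c -> c != '\r' && c != '\n', s2);
        return new Fragment(s0, s2);
    }

    /** Applies '?'. With fix=true, an end state that has outgoing transitions is replaced by a fresh one first. */
    static Fragment optional(Fragment f, boolean fix) {
        State end = f.end;
        if (fix && end.hasOutgoing()) {
            State fresh = new State();
            end.epsilons.add(fresh);
            end = fresh;
        }
        f.start.epsilons.add(end); // epsilon path from start to end
        return new Fragment(f.start, end);
    }

    /** Expands a set of states with everything reachable through epsilon transitions. */
    static Set<State> closure(Set<State> states) {
        Set<State> result = new HashSet<>(states);
        List<State> work = new ArrayList<>(states);
        while (!work.isEmpty()) {
            State s = work.remove(work.size() - 1);
            for (State t : s.epsilons) {
                if (result.add(t)) {
                    work.add(t);
                }
            }
        }
        return result;
    }

    /** Returns true if the fragment accepts the whole input string. */
    static boolean accepts(Fragment f, String input) {
        Set<State> current = closure(Collections.singleton(f.start));
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Set<State> next = new HashSet<>();
            for (State s : current) {
                for (int k = 0; k < s.labels.size(); k++) {
                    if (s.labels.get(k).test(c)) {
                        next.add(s.targets.get(k));
                    }
                }
            }
            current = closure(next);
        }
        return current.contains(f.end);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "", "##abc", "abc" }) {
            System.out.println("\"" + s + "\""
                + "  unfixed=" + accepts(optional(hashComment(), false), s)
                + "  fixed=" + accepts(optional(hashComment(), true), s));
        }
    }
}
```

Without the extra end state, the epsilon edge lands on a state that still loops on [^\r\n], so "abc" is accepted; with it, only the empty string and ##-prefixed strings are.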
Is the C# runtime code mechanically translated from the Java runtime code, or was that a one-time translation that's kept in sync manually now? If the latter, I can submit a patch for both, as well as a patch to fix several other minor errors in the Java->C# translation code.
The testQuantifierStackOverflow test seems to fail on 32-bit systems with the following:
[junit] Testcase: testQuantifierStackOverflow took 0.284 sec
[junit] Caused an ERROR
[junit] null
[junit] java.lang.StackOverflowError
[junit] at net.percederberg.grammatica.parser.re.StringElement.match(StringElement.java:98)
[junit] at net.percederberg.grammatica.parser.re.RepeatElement.findMatches(RepeatElement.java:309)
[junit] at net.percederberg.grammatica.parser.re.RepeatElement.findMatches(RepeatElement.java:321)
[many repeats of the last line follow]
Hello,
I used Grammatica a gazillion years ago (2005!) for my graduation thesis, and I'm now rediscovering it to show how a parser can be used in one of our internal tools.
Is it possible to have the latest version on Maven? I see that https://mvnrepository.com/artifact/net.percederberg.grammatica/grammatica has version 1.5 only.
I tried to create a parser and test it (which was a challenge, because I can find no examples documenting how to properly use the generated code), and I get an ArgumentNullException whenever I call Parse with my parser.
Further research indicates that it's because I'm using regex syntax that the Grammatica regex engine doesn't support, so it's automatically switching to .NET's System.Text.RegularExpressions.Regex class.
Tracing through a parse of a file consisting of a single blank line, here's what I see:
1. ReaderBuffer.Peek(offset: 0) is called.
2. ReadBuffer.Peek calls EnsureBuffered(offset: 1).
3. EnsureBuffered tries to read BLOCK_SIZE characters.
   - It calls input.Read, which reads 2 characters (CR and LF).
   - It calls input.Read again which, because we're at EOF, returns 0.
   - It calls input.Close and then sets input to null.
4. Tokenizer.NextToken soon calls ReadBuffer.Read(offset: 2), which, despite the comments, seems to actually mean "Read from the current stream position until offset 1".
5. ReadBuffer.Read sees that input is null and calls Dispose(), which sets buffer to null.
6. Tokenizer.NextToken is called again to return the next token.
7. NextToken calls regExpMatcher.Match.
8. RegExpMatcher.Match calls (in a loop) REHandler.Match (an abstract method), in this case SystemRE.Match.
9. SystemRE.Match calls buffer.ToString().
10. buffer.ToString() is return new string(buffer, 0, length) (a different buffer).
11. Because buffer is null, the string constructor throws an ArgumentNullException.
I see several problems here:
- ReadBuffer.Read calls Dispose on itself. Dispose is supposed to mean "I am done using this object"; since ReadBuffer doesn't own itself, it's disposing an object owned by someone else (the Tokenizer) while the Tokenizer is still using it.
- ReadBuffer.ToString can throw an exception (never a good thing).
Hello,
To my understanding, Grammatica currently requires a declaration for every token, even those that are irrelevant to the analyzer. For instance, in this production:
IfStatement = "IF" Predicate "THEN" ThenClause "ELSE" ElseClause "END" ;
even if the analyzer is interested only in Predicate, ThenClause and ElseClause, the grammar must declare "IF", "THEN", "ELSE" and "END" as tokens, too. To discard them, the corresponding "Exit" method must be overridden to return null, but doing so clutters the analyzer code as well as the grammar. Grammatica could instead discard undefined literal strings in %productions%. If you deem this to be error-prone, because a user could forget to define a token, then a grammar parameter could be added to enable this enhancement.
Thanks for your attention.
In grammatica 1.6:
INTERNAL ERROR: An internal error in Grammatica has been found.
Please report this error to the maintainers (see the web
site for instructions). Be sure to include the Grammatica
version number, as well as the information below:
net.percederberg.grammatica.GrammarException: token 'LITERAL_CHAR' is invalid, as regular expression contains error(s): Illegal repetition near index 6
'(.|(\\{[+-]?[0-9a-fA-F_]+}))?'
^, on line 41
at net.percederberg.grammatica.Grammar.createTokenizer(Grammar.java:225)
at net.percederberg.grammatica.Grammatica.debug(Grammatica.java:416)
at net.percederberg.grammatica.Grammatica.main(Grammatica.java:163)
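
The wording "Illegal repetition near index 6" is what java.util.regex.Pattern reports when a bare { does not start a valid {n,m} quantifier. In the reported pattern, \\ is an escaped backslash, so the { at index 6 is unescaped. The sketch below assumes the intent was a literal backslash followed by a literal brace; that intent is a guess, and the class and method names are made up for this example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class BraceEscapeDemo {
    // The token pattern as reported: an escaped backslash followed by a bare '{'.
    static final String REPORTED = "(.|(\\\\{[+-]?[0-9a-fA-F_]+}))?";

    // Assumed intent: literal backslash, then literal '{' ... '}', with both braces escaped.
    static final String ESCAPED = "(.|(\\\\\\{[+-]?[0-9a-fA-F_]+\\}))?";

    /** Returns true if java.util.regex accepts the pattern. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    /** Returns true if the whole input matches the pattern. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println("reported compiles: " + compiles(REPORTED));
        System.out.println("escaped compiles:  " + compiles(ESCAPED));
        // The input below is the six characters \{+1F}
        System.out.println("escaped matches:   " + matches(ESCAPED, "\\{+1F}"));
    }
}
```

Whether Grammatica should report this as a grammar error instead of an internal error is a separate question; the sketch only shows where the message comes from.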
Hi again!
After some testing I've discovered that the for loop at https://github.com/cederberg/grammatica/blob/master/src/csharp/PerCederberg.Grammatica.Runtime/RecursiveDescentParser.cs#L213 runs forever in some cases. And here's some criticism for you: increasing and decreasing the loop counter inside the loop body is pretty much an anti-pattern. It's very, very hard to understand what your intention was there, and what has gone wrong.
@cederberg Have you considered making this available on NuGet?