Hey @kaby76,
I've just recently discovered your work around ANTLR and C#, and I'm impressed! Thank you very much, first of all.
So I was pretty excited to be able to move to the newer targets, and had good results in some projects. However, I see a very big performance regression with one grammar, and I can't understand why.
The grammar is pretty simple. Essentially, texts can contain square brackets denoting placeholders, which we try to parse so that we can build a data model around them.
The grammar looks like this:
grammar GAEB2000PlainTextTextAdditions;
@parser::members
{
protected const int EOF = Eof;
}
@lexer::members
{
protected const int EOF = Eof;
protected const int HIDDEN = Hidden;
}
/*
* Parser Rules
*/
text : ( textAddition | plainText | brackets )* compileUnit ;
textAddition : OpenBrack (buyer=TA | bidder=TB) identifier+=Digit+ content=textContent CloseBrack ;
plainText : ( Digit | Text | textAddLike )+ ;
textAddLike : TA | TB ;
brackets : OpenBrack | CloseBrack ;
textContent : heading=plainText? OpenBrack body=plainText CloseBrack tail=plainText? ;
compileUnit : EOF ;
/*
* Lexer Rules
*/
TA : 'TA' ;
TB : 'TB' ;
OpenBrack : '[' ;
CloseBrack : ']' ;
Digit : [0-9] ;
Text : . ;
Now, we're building a `Lexer` and a `Parser` from it and trying it out with the following test:
[Fact]
public void CheckAntlrPerformance()
{
var inputString = @"There is some text before
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
[TA11Offered Type: [....................................................]
]
Location : [TA12[.......]]";
for (var i = 0; i < 10_000; i++)
{
PerformAntlrParsing(inputString);
}
}
private void PerformAntlrParsing(string inputString)
{
var inputStream = new AntlrInputStream(inputString);
var lexer = new GAEB2000PlainTextTextAdditionsLexer(inputStream);
var tokenStream = new CommonTokenStream(lexer);
var parser = new GAEB2000PlainTextTextAdditionsParser(tokenStream);
parser.Interpreter.PredictionMode = PredictionMode.SLL;
var result = parser.text();
}
I was hoping for a bit of a performance improvement, but the test actually went from around 2.4 seconds to 29.2 seconds, so roughly 12 times slower. Profiling a few runs doesn't give me much information: memory allocation stays pretty small throughout the test, and there aren't many GCs happening either. I see this after a few thousand iterations:
![image](https://user-images.githubusercontent.com/10274404/202145985-f09d798f-0cf3-4afc-adcc-03d3418221a5.png)
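For anyone who wants to reproduce the numbers, here is a minimal sketch of how the timings could be taken outside the test runner. `ParseTiming`, `Measure`, and `Slowdown` are hypothetical helper names introduced only for this illustration; they are not part of ANTLR or of the project. The sketch assumes that one untimed warm-up call is enough to keep JIT compilation out of the measurement.

```csharp
using System;
using System.Diagnostics;

public static class ParseTiming
{
    // Hypothetical helper (not part of ANTLR): runs `action` once as an
    // untimed warm-up, then `iterations` times under a Stopwatch, and
    // returns the total elapsed wall-clock time of the timed runs.
    public static TimeSpan Measure(Action action, int iterations)
    {
        action(); // warm-up call, excluded from the measurement

        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Ratio of two total timings, e.g. newer target vs. older target.
    public static double Slowdown(TimeSpan before, TimeSpan after)
        => after.TotalSeconds / before.TotalSeconds;
}
```

With the test above, this would be called as `ParseTiming.Measure(() => PerformAntlrParsing(inputString), 10_000)`. Plugging the two totals I measured into `Slowdown` (2.4 s and 29.2 s) gives a factor of about 12.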
So, I'm a bit at a loss as to how to proceed. I've also checked the grammar with `tranalyze`, but if I understand it correctly, the output is just statistics plus a note that each rule is fine (`NonEmpty`):
trparse GAEB2000PlainTextTextAdditions.g4 | tranalyze -s text
7 occurrences of Antlr - nonterminal def
23 occurrences of Antlr - nonterminal ref
6 occurrences of Antlr - terminal def
3 occurrences of Antlr - keyword
5 occurrences of Antlr - literal
Rule text is NonEmpty
Rule textAddition is NonEmpty
Rule plainText is NonEmpty
Rule textAddLike is NonEmpty
Rule brackets is NonEmpty
Rule textContent is NonEmpty
Rule compileUnit is NonEmpty
Rule TA is NonEmpty
Rule TB is NonEmpty
Rule OpenBrack is NonEmpty
Rule CloseBrack is NonEmpty
Rule Digit is NonEmpty
Rule Text is NonEmpty
For reference, the strings we usually encounter in the wild are at most about a kilobyte in size, and usually contain just one or very few such tags.