Comments (22)
I indeed forgot to denote it as a raw string. The Earley parser is slightly faster now. Hopefully I'll find more time to work on the LALR solution. Thanks for your help!
from lark.
Hi, this is simply one of the restrictions of the LALR algorithm. It's notoriously hard to explain, but essentially, the parser needs to always be able to determine which rule it's parsing, by looking at the next token. When there is more than one path for the same token, you get a collision.
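A minimal illustration of such a collision (a hypothetical grammar, not yours): after reading "c" with "x" as the lookahead token, the parser cannot tell whether the "c" it just read should reduce to an `a` or a `b`, so the two paths collide.

```
// Hypothetical LALR(1) reduce/reduce collision: both branches reduce the
// same token "c" under the same lookahead "x", so one token of lookahead
// is not enough to decide which rule is being parsed.
start : a "x"
      | b "x"
a : "c"
b : "c"
```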
Anyway, it was easy to fix. I changed two things:
- `MACRO` is now `macro`. The lexer requires unique terminals to work properly. You can use `?macro: STRING` if you don't want a `macro` branch in your tree.
- I changed `attrs` to `attrs: attr (";" attr)* ";"?`. It seems minor, but it's a style that is more friendly to LALR.
Here is the fixed grammar:
action : STRING ACTION_OPERATOR (ESCAPED_STRING | STRING)
attr : (action | macro | conditional)
attrs : attr (";" attr)* ";"? // Semicolon is only used as a separator, and is thus optional after the final attr
operator : OPERATOR
conditional : STRING "?" STRING ":"
expr : macro OPERATOR attrs
line : expr COMMENT?
file : line+
comment : COMMENT
macro : STRING
COMMENT : /#.*$/m
STRING : /[a-zA-Z0-9_.-]+/
OPERATOR : ":" | "+:"
ACTION_OPERATOR : "===" | "==" | "+=" | "-=" | "="
%import common.WS
%import common.NEWLINE
%import common.ESCAPED_STRING
%ignore WS
%ignore COMMENT
%ignore NEWLINE
P.S. you'll never match `comment` if you `%ignore` it.
Thanks for the suggestions. Interesting that 2) would be preferable; I noticed the same style is used in the JSON example. I have yet to decide what to do with the comments, whether to keep them in the tree or not. It's actually quite useful to extract and keep them, so I may remove the `%ignore`.
Now I get, using this text
foo +: some="text";
foo +: bar1; bar2; other = "multiple words";
the following error:
KeyError Traceback (most recent call last)
/nix/store/n7cb6ca5m0ddsk85kyxwcbs3whcdjqv2-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in get_action(key)
41 try:
---> 42 return states[state][key]
43 except KeyError:
KeyError: 'OPERATOR'
During handling of the above exception, another exception occurred:
UnexpectedToken Traceback (most recent call last)
<ipython-input-158-602f1e4c7647> in <module>()
32 """
33 parser = Lark(grammar, start='file', parser='lalr')
---> 34 parsed = parser.parse(text)
/nix/store/n7cb6ca5m0ddsk85kyxwcbs3whcdjqv2-python3-3.6.3-env/lib/python3.6/site-packages/lark/lark.py in parse(self, text)
186
187 def parse(self, text):
--> 188 return self.parser.parse(text)
189
190 # if self.profiler:
/nix/store/n7cb6ca5m0ddsk85kyxwcbs3whcdjqv2-python3-3.6.3-env/lib/python3.6/site-packages/lark/parser_frontends.py in parse(self, text)
29 def parse(self, text):
30 tokens = self.lex(text)
---> 31 return self.parser.parse(tokens)
32
33
/nix/store/n7cb6ca5m0ddsk85kyxwcbs3whcdjqv2-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in parse(self, seq, set_state)
69 i += 1
70 while True:
---> 71 action, arg = get_action(token.type)
72
73 if action == ACTION_SHIFT:
/nix/store/n7cb6ca5m0ddsk85kyxwcbs3whcdjqv2-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in get_action(key)
44 expected = states[state].keys()
45
---> 46 raise UnexpectedToken(token, expected, seq, i)
47
48 def reduce(rule, size, end=False):
UnexpectedToken: Unexpected token Token(OPERATOR, '+:') at line 3, column 4.
Expected: dict_keys(['STRING', '__SEMICOLON', '$end'])
Context: <no context>
Could that be because `macro` exists on both the lhs and rhs of `+:`?
Interesting: if I remove the semicolon from the first line,
foo +: some="text"
foo +: bar1; bar2; other = "multiple words";
the issue does not occur. The grammar for `attrs` is fine, however.
It seems, from a shallow examination, that your grammar is not deterministic for a lookahead of one. In particular, it's not clear when a `line` ends and a new `line` begins. Not every grammar is LALR-compatible, and even for languages that are, it takes some thought and intention to write an LALR-compatible grammar for them.
Perhaps it will give you a better idea to consider that most languages have a line-terminating character. For example, BASIC ends in a newline, while C and Java end with a semicolon. These languages also have scope characters, such as `{...}` or `begin ... end`.
In this case, each `expr` is written on only one line. Can `NEWLINE` be used for this? I suppose I should not ignore it then.
Currently, the semicolon is used only as a separator, although with some minor changes to the files written in this language, the semicolon could always be required, thus giving
attrs : (attr ";")+
Yes, using the NEWLINE as an "anchor" terminal will help a lot.
If you do so, the optional semicolon shouldn't be a problem (it isn't a problem in Python).
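For example, here is a sketch (reusing rule names from the grammar above) where `NEWLINE` terminates each line instead of being ignored:

```
// Sketch: NEWLINE acts as an anchor terminal, so the parser knows exactly
// where one line ends and the next begins with one token of lookahead.
line : expr NEWLINE
file : line+
%import common.NEWLINE
%import common.WS_INLINE
%ignore WS_INLINE   // keep ignoring spaces and tabs, but NOT the newline
```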
As a side note, it will also make Earley parse it faster, since a lack of determinism is a big performance drain.
Talking about Python, I keep getting
UnexpectedInput: No token defined for: '/' in '/#[^\n' at line 16 col 22
when using, from python2.g,
COMMENT: /#[^\n]*/
instead of
COMMENT: /#.*$/m
I am looking at Python because it handles newlines and comments the same way.
Closing this issue because the main issue has been solved.
Well, the Python2.g grammar is tested and working. Perhaps you copied it wrong?
Do you get the exception when calling "parse", or when instantiating Lark?
When instantiating Lark.
Then you just made a syntax error in the grammar.
If you paste the line that causes the error (and a few lines around it), I can tell you what it is.
action : VARIABLE ACTION_OPERATOR (ESCAPED_STRING | STRING)
attr : (action | parent | conditional)
attrs : attr (";" attr)* ";"? // Semicolon is only used as a separator, and is thus optional after the final attr
//attrs : (attr + ";")+
conditional : STRING "?" STRING ":"
expr : macro operator attrs
line : expr
file : (newline | line)*
parent : macro
?newline : NEWLINE
?macro : MACRO
COMMENT : /#[^\n]*/
VARIABLE : /[a-zA-Z0-9_.]+/
MACRO : /[a-zA-Z0-9_.]+/
STRING : /[a-zA-Z0-9_.-]+/
?operator : OPERATOR
OPERATOR : "+:" | ":"
//?action_operator: ACTION_OPERATOR
ACTION_OPERATOR : "===" | "==" | "+=" | "-=" | "="
WS : /[ \t\f]/+
%import common.NEWLINE
%import common.ESCAPED_STRING
%ignore WS
%ignore COMMENT
//%ignore NEWLINE
This grammar works just fine. If you're inputting it as a string, make sure it's a raw string r""" ... """, so that \n doesn't become a literal newline character.
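To see why the raw string matters, here is a quick check in plain Python (no Lark needed): in a normal string, the `\n` inside the regex collapses into a literal newline, which is likely what produces the truncated `'/#[^\n'` fragment in the error message.

```python
# A normal string turns the two characters backslash + 'n' into one newline,
# silently breaking any regex embedded in the grammar text.
plain = "COMMENT: /#[^\n]*/"   # \n becomes a real newline here
raw = r"COMMENT: /#[^\n]*/"    # \n stays as backslash + 'n'

print("\n" in plain)  # True: the pattern is split across two lines
print("\n" in raw)    # False: the regex survives intact
print(len(raw) - len(plain))  # 1: the raw version keeps both characters
```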
Instead of parsing a file, I could also feed individual lines to Lark. What do you think that would do with performance?
Don't you have any structure in your file that might happen between lines? Like blocks / scopes. Because if you don't, then it should be fine.
No, no such structure exists.
I've been trying further with the LALR parser, now one line at a time:
grammar = r"""
action : VARIABLE ACTION_OPERATOR (ESCAPED_STRING | STRING)
attr : (action | parent | conditional)
attrs : attr (";" attr)* ";"? // Semicolon is only used as a separator, and is thus optional after the final attr
//attrs : (attr + ";")+
conditional : STRING "?" STRING ":"
expr : macro operator attrs
line : expr | newline
file : line*
parent : macro
?newline : NEWLINE
?macro : MACRO
COMMENT : /#[^\n]*/
VARIABLE : /[a-zA-Z0-9_.-]+/
MACRO : /[a-zA-Z0-9_.]+/
STRING : /[a-zA-Z0-9_.-]+/
?operator : OPERATOR
OPERATOR : "+:" | ":"
//?action_operator: ACTION_OPERATOR
ACTION_OPERATOR : "===" | "==" | "+=" | "-=" | "="
WS : /[ \t\f]/+
%import common.NEWLINE
%import common.ESCAPED_STRING
%ignore WS
%ignore COMMENT
//%ignore NEWLINE
"""
text = """
foo +: bar==="Some text";
""".split('\n')
parser = Lark(grammar, start='line', parser='lalr')
n = 1
print(text[n])
parsed = parser.parse(text[n])
results in
foo +: bar==="Some text";
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/nix/store/k7nh2kh8zf35wqbkvwvrk7p7ak9vmry3-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in get_action(key)
41 try:
---> 42 return states[state][key]
43 except KeyError:
KeyError: 'STRING'
During handling of the above exception, another exception occurred:
UnexpectedToken Traceback (most recent call last)
<ipython-input-82-16b7125b0b42> in <module>()
43 n = 1
44 print(text[n])
---> 45 parsed = parser.parse(text[n])
/nix/store/k7nh2kh8zf35wqbkvwvrk7p7ak9vmry3-python3-3.6.3-env/lib/python3.6/site-packages/lark/lark.py in parse(self, text)
186
187 def parse(self, text):
--> 188 return self.parser.parse(text)
189
190 # if self.profiler:
/nix/store/k7nh2kh8zf35wqbkvwvrk7p7ak9vmry3-python3-3.6.3-env/lib/python3.6/site-packages/lark/parser_frontends.py in parse(self, text)
29 def parse(self, text):
30 tokens = self.lex(text)
---> 31 return self.parser.parse(tokens)
32
33
/nix/store/k7nh2kh8zf35wqbkvwvrk7p7ak9vmry3-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in parse(self, seq, set_state)
69 i += 1
70 while True:
---> 71 action, arg = get_action(token.type)
72
73 if action == ACTION_SHIFT:
/nix/store/k7nh2kh8zf35wqbkvwvrk7p7ak9vmry3-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in get_action(key)
44 expected = states[state].keys()
45
---> 46 raise UnexpectedToken(token, expected, seq, i)
47
48 def reduce(rule, size, end=False):
UnexpectedToken: Unexpected token Token(STRING, 'foo') at line 1, column 0.
Expected: dict_keys(['macro', 'MACRO', 'expr', 'NEWLINE', 'newline'])
Context: <no context>
Why would it identify `foo` as a `STRING`? It indeed fulfills the regex, but since it is a left-to-right parser, should it not try `MACRO` first? Or does it go for `STRING` first because it is nested deeper?
Tokenization is a separate stage from parsing. It doesn't always know which terminal to test first, but it returns the first match. Because STRING, VARIABLE and MACRO are the same regex, it will always match one over the others. You should use a single regex for it (a single terminal), and use rules to let the parser figure out which is which.
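The first-match behaviour can be sketched with plain `re` (an analogy for how a lexer might combine terminal patterns into one alternation, not Lark's actual internals):

```python
import re

# Three "terminals" whose patterns all match the same input, combined into
# one alternation. Alternatives are tried left to right, so for input that
# all three could match, the first one always wins; the others are unreachable.
TOKEN = re.compile(
    r"(?P<STRING>[a-zA-Z0-9_.-]+)"
    r"|(?P<VARIABLE>[a-zA-Z0-9_.]+)"
    r"|(?P<MACRO>[a-zA-Z0-9_.]+)"
)

m = TOKEN.match("foo")
print(m.lastgroup)  # STRING -- VARIABLE and MACRO can never be produced
```

This is why collapsing `STRING`, `VARIABLE`, and `MACRO` into a single terminal, and letting rules decide which is which, avoids the problem.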
There is also an experimental feature, activated with `Lark(..., lexer='contextual')`, which helps resolve some terminal collisions, but I don't think it will help in your case.
I was looking at my parser again today, and decided to start off with a simple one and extend it from there. There is one issue with LALR that I can't figure out how to solve. Consider
file: line? (newline line?)* newline
What if my file does not end with a newline? Changing to
file: line? (newline line?)* newline?
will cause a collision. Do you have a suggestion for this case, where `file` may start and/or end with zero or more newlines?
Oh nevermind, just as I posted it I saw the issue:
file: line? (newline line?)*
will solve it.
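As a sanity check, the fixed rule can be mirrored with a plain regex, with `L` standing in for a parsed line (an analogy, not Lark itself): an optional first line, then any number of newlines, each optionally followed by another line.

```python
import re

# Regex analogue of  file: line? (newline line?)*
FILE = re.compile(r"^(?:L)?(?:\n(?:L)?)*$")

# Accepts files with or without a trailing newline, with leading
# newlines, with blank lines in between, and even an empty file:
for text in ["L", "L\nL", "L\nL\n", "\n\nL", ""]:
    assert FILE.match(text)

# Rejects two lines with no newline between them:
assert FILE.match("LL") is None
```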