Comments (22)

FRidh avatar FRidh commented on July 18, 2024 1

I indeed forgot to denote it as a raw string. The Earley parser is slightly faster now. Hopefully I'll find more time to work on the LALR solution. Thanks for your help!

from lark.

erezsh avatar erezsh commented on July 18, 2024

Hi, this is simply one of the restrictions of the LALR algorithm. It's notoriously hard to explain, but essentially, the parser needs to always be able to determine which rule it's parsing, by looking at the next token. When there is more than one path for the same token, you get a collision.

Anyway, it was easy to fix. I changed two things:

  1. MACRO is now macro. The lexer requires unique terminals to work properly. You can use ?macro: STRING if you don't want a macro branch in your tree.
  2. I changed attrs to attrs: attr (";" attr)* ";"?. It seems minor, but it's a style that is more friendly to LALR.

Here is the fixed grammar:

action          : STRING ACTION_OPERATOR (ESCAPED_STRING | STRING)
attr            : (action | macro | conditional)
attrs           : attr (";"  attr)* ";"?  // Semicolon is only used as a separator, and thus optional after the final attr
operator        : OPERATOR
conditional     : STRING "?" STRING ":"

expr            : macro OPERATOR attrs
line            : expr COMMENT?
file            : line+

comment         : COMMENT

macro           : STRING
COMMENT         : /#.*$/m

STRING          : /[a-zA-Z0-9_.-]+/

OPERATOR        : ":" | "+:"
ACTION_OPERATOR : "===" | "==" | "+=" | "-=" | "="

%import common.WS
%import common.NEWLINE
%import common.ESCAPED_STRING
%ignore WS
%ignore COMMENT
%ignore NEWLINE

erezsh avatar erezsh commented on July 18, 2024

P.S. you'll never match comment if you %ignore it.

FRidh avatar FRidh commented on July 18, 2024

Thanks for the suggestions. Interesting that 2) would be preferable. I noticed the same style is used in the JSON example. I have yet to decide what to do with the comments, whether I keep them in the tree or not. It's actually quite useful to get them out and keep them, so I may remove the %ignore.

Now I get, using this text

foo +: some="text";
foo +: bar1; bar2; other = "multiple words";

the following error:

KeyError                                  Traceback (most recent call last)
/nix/store/n7cb6ca5m0ddsk85kyxwcbs3whcdjqv2-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in get_action(key)
     41             try:
---> 42                 return states[state][key]
     43             except KeyError:

KeyError: 'OPERATOR'

During handling of the above exception, another exception occurred:

UnexpectedToken                           Traceback (most recent call last)
<ipython-input-158-602f1e4c7647> in <module>()
     32 """
     33 parser = Lark(grammar, start='file', parser='lalr')
---> 34 parsed = parser.parse(text)

/nix/store/n7cb6ca5m0ddsk85kyxwcbs3whcdjqv2-python3-3.6.3-env/lib/python3.6/site-packages/lark/lark.py in parse(self, text)
    186 
    187     def parse(self, text):
--> 188         return self.parser.parse(text)
    189 
    190         # if self.profiler:

/nix/store/n7cb6ca5m0ddsk85kyxwcbs3whcdjqv2-python3-3.6.3-env/lib/python3.6/site-packages/lark/parser_frontends.py in parse(self, text)
     29     def parse(self, text):
     30         tokens = self.lex(text)
---> 31         return self.parser.parse(tokens)
     32 
     33 

/nix/store/n7cb6ca5m0ddsk85kyxwcbs3whcdjqv2-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in parse(self, seq, set_state)
     69             i += 1
     70             while True:
---> 71                 action, arg = get_action(token.type)
     72 
     73                 if action == ACTION_SHIFT:

/nix/store/n7cb6ca5m0ddsk85kyxwcbs3whcdjqv2-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in get_action(key)
     44                 expected = states[state].keys()
     45 
---> 46                 raise UnexpectedToken(token, expected, seq, i)
     47 
     48         def reduce(rule, size, end=False):

UnexpectedToken: Unexpected token Token(OPERATOR, '+:') at line 3, column 4.
Expected: dict_keys(['STRING', '__SEMICOLON', '$end'])
Context: <no context>

Could that be because macro exists at both the lhs and rhs of +:?

FRidh avatar FRidh commented on July 18, 2024

Interesting, if I remove the semi-colon from the first line,

foo +: some="text"
foo +: bar1; bar2; other = "multiple words";

the issue does not occur. The grammar for attrs is however fine.

erezsh avatar erezsh commented on July 18, 2024

It seems, from a shallow examination, that your grammar is not deterministic for a lookahead of one. In particular, it's not clear when a line ends and a new line begins. Not every grammar is LALR-compatible, and even for languages that are LALR-compatible, it takes some thought and intention to write a LALR-compatible grammar for them.

erezsh avatar erezsh commented on July 18, 2024

Perhaps it will give you a better idea to consider that most languages have a statement-terminating character. For example, statements in BASIC end with a newline, while statements in C and Java end with a semicolon. These languages also have scope delimiters, such as {...} or begin ... end.

FRidh avatar FRidh commented on July 18, 2024

In this case, each expr is written on only one line. Can NEWLINE be used for this? I suppose I should not ignore it then.

Currently, the semicolon is used only as separator, although with some minor changes to the files written in this language, the semicolon could always be required, thus giving

attrs : (attr ";")+

erezsh avatar erezsh commented on July 18, 2024

Yes, using the NEWLINE as an "anchor" terminal will help a lot.
If you do so, the optional semicolon shouldn't be a problem (it isn't a problem in Python).
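A rough way to see why the anchor helps, and a plausible reading of the UnexpectedToken error reported earlier: with newlines ignored, the token after a ";" is a STRING whether it continues the attribute list or starts the next line, so one token of lookahead cannot tell the two apart. The sketch below uses a hand-written `re` tokenizer as a stand-in for the lexer; `TOKEN_RE`, `tokenize` and `after_semicolons` are hypothetical names for illustration and are not part of Lark.

```python
import re

# Hypothetical stand-in for the lexer: one named group per terminal.
TOKEN_RE = re.compile(r'''
    (?P<NEWLINE>\n)
  | (?P<WS>[ \t]+)
  | (?P<ESCAPED_STRING>"[^"\n]*")
  | (?P<ACTION_OPERATOR>===|==|\+=|-=|=)
  | (?P<OPERATOR>\+:|:)
  | (?P<SEMICOLON>;)
  | (?P<STRING>[a-zA-Z0-9_.-]+)
''', re.VERBOSE)


def tokenize(text, keep_newlines):
    """Return (type, value) pairs; whitespace is always dropped."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == 'WS' or (kind == 'NEWLINE' and not keep_newlines):
            continue
        tokens.append((kind, match.group()))
    return tokens


def after_semicolons(tokens):
    """Type of the token that follows each ';' (None at end of input)."""
    kinds = [kind for kind, _ in tokens] + [None]
    return [kinds[i + 1] for i, kind in enumerate(kinds[:-1]) if kind == 'SEMICOLON']


text = 'foo +: some="text";\nfoo +: bar1; bar2; other = "multiple words";\n'

# Newlines ignored: the 'foo' that starts line 2 looks just like 'bar2'.
print(after_semicolons(tokenize(text, keep_newlines=False)))

# Newlines kept: the end of each line is marked by its own token.
print(after_semicolons(tokenize(text, keep_newlines=True)))
```

With the NEWLINE token in the stream, a ";" followed by NEWLINE can only be the optional trailing semicolon, and a ";" followed by STRING can only introduce another attribute.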

erezsh avatar erezsh commented on July 18, 2024

As a side note, it will also make Earley parse it faster, since a lack of determinism is a big performance drain.

FRidh avatar FRidh commented on July 18, 2024

Talking about Python, I keep getting

UnexpectedInput: No token defined for: '/' in '/#[^\n' at line 16 col 22

when using this definition from python2.g

COMMENT: /#[^\n]*/

instead of

COMMENT: /#.*$/m

I am looking at the Python grammar because it handles newlines and comments the same way my language does.
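For comparison, the two COMMENT patterns can be tried directly with Python's `re` module. This is only a sketch of the regex behaviour, not of how Lark compiles terminals, and the sample text is made up. Both forms match a comment up to the end of its line; the `$` form relies on the `m` (multiline) flag.

```python
import re

sample = "a = 1  # first\nb = 2  # second\n"  # made-up input

# /#.*$/m : '.' never crosses a newline, and with MULTILINE '$' matches at every line end
dollar_multiline = re.findall(r'#.*$', sample, flags=re.MULTILINE)

# /#[^\n]*/ : same effect without needing any flag
negated_class = re.findall(r'#[^\n]*', sample)

# Without the flag, '$' only matches at the very end of the text
dollar_plain = re.findall(r'#.*$', sample)

print(dollar_multiline)
print(negated_class)
print(dollar_plain)
```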


Closing this issue because the main issue has been solved.

erezsh avatar erezsh commented on July 18, 2024

Well, the Python2.g grammar is tested and working. Perhaps you copied it wrong?

Do you get the exception when calling "parse", or when instantiating Lark?

FRidh avatar FRidh commented on July 18, 2024

When instantiating Lark.

erezsh avatar erezsh commented on July 18, 2024

Then you just made a syntax error in the grammar.

If you paste the line that causes the error (and a few lines around it) I can tell you what.

FRidh avatar FRidh commented on July 18, 2024

    action          : VARIABLE ACTION_OPERATOR (ESCAPED_STRING | STRING)
    attr            : (action | parent | conditional)
    attrs           : attr (";" attr)* ";"?  // Semicolon is only used as a separator, and thus optional after the final attr
    //attrs           : (attr + ";")+
    conditional     : STRING "?" STRING ":"
    
    expr            : macro operator attrs
    line            : expr
    file            : (newline | line)*

    parent          : macro
    ?newline        : NEWLINE
    ?macro          : MACRO
    
    COMMENT         : /#[^\n]*/ 
    
    VARIABLE        : /[a-zA-Z0-9_.]+/
    MACRO           : /[a-zA-Z0-9_.]+/
    STRING          : /[a-zA-Z0-9_.-]+/
    
    ?operator        : OPERATOR
    OPERATOR        : "+:" | ":"
    
    //?action_operator: ACTION_OPERATOR
    ACTION_OPERATOR : "===" | "==" | "+=" | "-=" | "="
    
    WS              : /[ \t\f]/+
        
    %import common.NEWLINE
    %import common.ESCAPED_STRING
    %ignore WS
    %ignore COMMENT
    //%ignore NEWLINE

erezsh avatar erezsh commented on July 18, 2024

This grammar works just fine. If you're inputting it as a string, make sure it's a raw string r""" ... """, so that \n doesn't become a literal newline character.
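A small sketch of what goes wrong without the `r` prefix, using only the COMMENT line from the grammar above (the variable names are illustrative). In a plain string, Python turns `\n` into a real newline before the grammar text is ever read, so the regex is cut in two, which would fit the reported message ending in `'/#[^\n'`.

```python
# Same characters in the source, different strings at runtime.
plain = "COMMENT : /#[^\n]*/"   # \n becomes an actual newline character
raw = r"COMMENT : /#[^\n]*/"    # backslash and 'n' are kept as two characters

print(repr(plain))
print(repr(raw))

# The plain version now spans two lines, splitting the regex in the middle.
print(plain.splitlines())
print(raw.splitlines())
```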

FRidh avatar FRidh commented on July 18, 2024

Instead of parsing a file, I could also feed individual lines to Lark. What do you think that would do with performance?

erezsh avatar erezsh commented on July 18, 2024

Don't you have any structure in your file that might happen between lines? Like blocks / scopes. Because if you don't, then it should be fine.
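If the file really has no cross-line structure, the per-line approach could look roughly like the sketch below. `parse_lines` is a hypothetical helper, not a Lark function; it takes any callable as `parse_line`, so a Lark parser's `parse` method could be passed in. Here `str.split` is used as a stand-in so the snippet runs on its own.

```python
def parse_lines(text, parse_line):
    """Feed each non-blank line to parse_line; return (line_number, result) pairs."""
    results = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines instead of handing them to the parser
        try:
            results.append((number, parse_line(line)))
        except Exception as exc:
            # Re-raise with the line number in the file, since the parser only sees one line
            raise ValueError(f'line {number}: {exc}') from exc
    return results


# Stand-in "parser" for demonstration only
print(parse_lines("foo +: bar1\n\nfoo +: bar2\n", str.split))
```

One side effect of this design is that any error position reported by the parser is relative to a single line, which is why the helper adds the file's line number itself.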

FRidh avatar FRidh commented on July 18, 2024

No, no such structure exists.

I've been trying further with the LALR parser, now one line at a time:

from lark import Lark

grammar = r"""
    action          : VARIABLE ACTION_OPERATOR (ESCAPED_STRING | STRING)
    attr            : (action | parent | conditional)
    attrs           : attr (";" attr)* ";"?  // Semicolon is only used as a separator, and thus optional after the final attr
    //attrs           : (attr + ";")+
    conditional     : STRING "?" STRING ":"
    
    expr            : macro operator attrs
    line            : expr | newline
    file            : line*

    parent          : macro
    ?newline        : NEWLINE
    ?macro          : MACRO
    
    COMMENT         : /#[^\n]*/
    
    VARIABLE        : /[a-zA-Z0-9_.-]+/
    MACRO           : /[a-zA-Z0-9_.]+/
    STRING          : /[a-zA-Z0-9_.-]+/
    
    ?operator        : OPERATOR
    OPERATOR        : "+:" | ":"
    
    //?action_operator: ACTION_OPERATOR
    ACTION_OPERATOR : "===" | "==" | "+=" | "-=" | "="
    
    WS              : /[ \t\f]/+
    
    
    %import common.NEWLINE
    %import common.ESCAPED_STRING
    %ignore WS
    %ignore COMMENT
    //%ignore NEWLINE
"""

text = """
foo +: bar==="Some text";
""".split('\n')
parser = Lark(grammar, start='line', parser='lalr')

n = 1
print(text[n])
parsed = parser.parse(text[n])

results in

foo +: bar==="Some text";

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/nix/store/k7nh2kh8zf35wqbkvwvrk7p7ak9vmry3-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in get_action(key)
     41             try:
---> 42                 return states[state][key]
     43             except KeyError:

KeyError: 'STRING'

During handling of the above exception, another exception occurred:

UnexpectedToken                           Traceback (most recent call last)
<ipython-input-82-16b7125b0b42> in <module>()
     43 n = 1
     44 print(text[n])
---> 45 parsed = parser.parse(text[n])

/nix/store/k7nh2kh8zf35wqbkvwvrk7p7ak9vmry3-python3-3.6.3-env/lib/python3.6/site-packages/lark/lark.py in parse(self, text)
    186 
    187     def parse(self, text):
--> 188         return self.parser.parse(text)
    189 
    190         # if self.profiler:

/nix/store/k7nh2kh8zf35wqbkvwvrk7p7ak9vmry3-python3-3.6.3-env/lib/python3.6/site-packages/lark/parser_frontends.py in parse(self, text)
     29     def parse(self, text):
     30         tokens = self.lex(text)
---> 31         return self.parser.parse(tokens)
     32 
     33 

/nix/store/k7nh2kh8zf35wqbkvwvrk7p7ak9vmry3-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in parse(self, seq, set_state)
     69             i += 1
     70             while True:
---> 71                 action, arg = get_action(token.type)
     72 
     73                 if action == ACTION_SHIFT:

/nix/store/k7nh2kh8zf35wqbkvwvrk7p7ak9vmry3-python3-3.6.3-env/lib/python3.6/site-packages/lark/parsers/lalr_parser.py in get_action(key)
     44                 expected = states[state].keys()
     45 
---> 46                 raise UnexpectedToken(token, expected, seq, i)
     47 
     48         def reduce(rule, size, end=False):

UnexpectedToken: Unexpected token Token(STRING, 'foo') at line 1, column 0.
Expected: dict_keys(['macro', 'MACRO', 'expr', 'NEWLINE', 'newline'])
Context: <no context>

Why would it identify foo as a STRING? It indeed fulfills the regex, but since it is a left-right parser, should it not try MACRO first? Or does it go for STRING first because it is nested deeper?

erezsh avatar erezsh commented on July 18, 2024

Tokenization is a separate stage from parsing. It doesn't always know which terminal to test first, but it returns the first match. Because STRING, VARIABLE and MACRO are the same regex, it will always match one over the others. You should use a single regex for it (a single terminal), and use rules to let the parser figure out which is which.

There is also an experimental feature, activated with Lark(..., lexer='contextual') which helps resolve some terminal collisions, but I don't think it will help in your case.
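To illustrate the point about tokenization, here is a minimal first-match lexer built with Python's `re` module. It is a sketch of the general idea only: `make_lexer` and `classify` are hypothetical helpers, the terminal order is simply the list order, and Lark's real lexer chooses its own order. With overlapping terminals, one of them wins for every word; with a single NAME terminal, the role of each word comes from its position instead.

```python
import re


def make_lexer(terminals):
    """Build a first-match lexer from (name, pattern) pairs; WS is dropped."""
    master = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in terminals))

    def lex(text):
        return [(m.lastgroup, m.group()) for m in master.finditer(text)
                if m.lastgroup != 'WS']
    return lex


# Overlapping terminals: the lexer has no way to know which one is meant.
overlapping = make_lexer([
    ('STRING', r'[a-zA-Z0-9_.-]+'),
    ('VARIABLE', r'[a-zA-Z0-9_.-]+'),
    ('MACRO', r'[a-zA-Z0-9_.]+'),
    ('OPERATOR', r'\+:|:'),
    ('WS', r'[ \t]+'),
])
print(overlapping('foo +: bar'))  # every word is labelled STRING

# One terminal for all words; position decides the role afterwards.
single = make_lexer([
    ('NAME', r'[a-zA-Z0-9_.-]+'),
    ('OPERATOR', r'\+:|:'),
    ('WS', r'[ \t]+'),
])


def classify(tokens):
    """Assign roles by position: NAME OPERATOR NAME..."""
    (_, macro), (_, operator), *rest = tokens
    return {'macro': macro, 'operator': operator, 'attrs': [value for _, value in rest]}


print(classify(single('foo +: bar')))
```

In a grammar, `classify` corresponds to rules such as `macro: NAME` and `parent: NAME`, so the parser rather than the lexer decides which is which.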

FRidh avatar FRidh commented on July 18, 2024

I was looking at my parser again today, and decided to start off with a simple one and extend it from there. There is one issue with LALR that I can't figure out how to solve. Consider

file: line? (newline line?)* newline

What if my file does not end with a newline? Changing to

file: line? (newline line?)* newline?

will cause a collision. Do you have a suggestion for this case, where the file may start and/or end with zero or more newlines?

FRidh avatar FRidh commented on July 18, 2024

Oh, never mind, just as I posted it I saw the issue:

file: line? (newline line?)*

will solve it.
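A quick way to check which inputs each version of the rule accepts is to treat the rule as a regular expression over a two-letter token alphabet. This is only a sketch: 'L' stands for one `line` and 'N' for one `newline`, and regex acceptance says nothing about LALR conflicts, only about the set of token sequences allowed.

```python
import re

# 'L' = one line, 'N' = one newline token
WITH_TRAILING = re.compile(r'L?(NL?)*N')  # file: line? (newline line?)* newline
SIMPLIFIED = re.compile(r'L?(NL?)*')      # file: line? (newline line?)*

samples = ['', 'L', 'LN', 'NLN', 'LNNL', 'NNLNN', 'LL']
for sample in samples:
    print(repr(sample),
          bool(WITH_TRAILING.fullmatch(sample)),
          bool(SIMPLIFIED.fullmatch(sample)))
```

Because `line` is optional inside the repeated group, the simplified rule already covers a trailing newline, which is why adding `newline?` at the end is redundant.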
