igordejanovic / parglare

A pure Python LR/GLR parser - http://www.igordejanovic.net/parglare/

License: MIT License

parglare's Issues

Option to call semantic actions even though the build_tree is set to True

Currently (I'm on commit 204e239) actions are not called when build_tree is set to True, since they directly affect the process of AST creation. But sometimes it would be useful to call the actions even though build_tree is set to True.

The use-case where I find this useful is the following:

  • There is an ambiguity in the grammar that can be resolved by using actions + dynamic filtering combo. In this case actions are used to collect certain information during parsing, and dynamic filtering to decide whether the parser should call REDUCE or not.
  • Parser needs to return an AST which will be used to generate the code from it.

GLR drops valid parses on lexical ambiguity

There are situations involving lexical ambiguity in which the GLR parser wrongly detects automata looping and discards valid heads, yielding fewer solutions than there actually are. The problem is made worse by the fact that the returned solution, although drawn from the set of correct solutions, is non-deterministic due to the unordered collections used internally.

This code demonstrates the problem:

    from parglare import Grammar, GLRParser

    grammar = """
    model: element+;
    element: title
           | table_with_note
           | table_with_title;
    table_with_title: table_title table_with_note;
    table_with_note: table note*;
    title: /title/;   // <-- This is lexically ambiguous with the next.
    table_title: /title/;
    table: "table";
    note: "note";
    """

    # this input should yield 4 parse trees.
    input = "title table title table"

    g = Grammar.from_string(grammar)
    parser = GLRParser(g)
    results = parser.parse(input)

    # We should have 4 solutions for the input.
    assert len(results) == 4

Parglare LR parser silently ignores input

  • parglare version: HEAD-ish
  • Python version: 3.5.2
  • Operating System: linux ubuntu

Description

I tried to parse a simple type definition using the following specification

grammar = r"""
spec : definition | spec definition ;
definition : typedefinition ;
typedefinition : typedefheader typelines ;

typedefheader : empties "define" typetypes  eol ;
typetypes : "type" | "types" ;

typelines : typeline | typelines typeline ;
typeline : empties NAMETK eol
         | empties NAMETK "is" "a" NAMETK eol ;

empties : EMPTY | empties eol ;
eol : /\n/;

NAMETK :  /[A-Za-z][A-Za-z0-9]*/ ;

LAYOUT : EMPTY | Layout ;
Layout : /[ \t\r]+/ ;

KEYWORD: /[a-z]+/ ;
"""

text = """
define type
    real
    float is a real
    double is a real
"""

from parglare import Grammar, Parser
g = Grammar.from_string(grammar)
parser = Parser(g)
result = parser.parse(text)
print(result)

The language is line-based, so there is an eol token to explicitly match newlines. Also, LAYOUT does not handle newlines; instead, empties skips any empty lines as necessary.

I expected to see the entire input parsed, but I only get the header line and the first real type definition line:

[[[[], '\n'], 'define', 'type', '\n'], [[], 'real', '\n']]

In particular, the 2nd and 3rd type definitions are missing, and there is no parse error reported.

Swapping the type definition lines makes no difference; you always get only the first definition.

I also tried adding actions to the typeline rule, and only one call is made. I would really like it if the parser processed all lines :)

Generate a parser as python code

Parglare is damn slow. Generating the parser is also damn slow. It may be useful, both from a speed and a debugging perspective, to serialize the generated parser into a Python file.
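As a rough illustration of the idea only (the table layout below is invented, not parglare's internals), a computed table could be dumped as an importable Python module so a later run skips table construction entirely:

```python
import importlib.util
import os
import pprint
import tempfile

def dump_table_as_module(table, path):
    # Write the precomputed table as a Python literal so that a later
    # run can just import it instead of rebuilding the table.
    with open(path, "w") as f:
        f.write("TABLE = " + pprint.pformat(table) + "\n")

def load_table_module(path):
    spec = importlib.util.spec_from_file_location("parser_table", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod.TABLE

# Round-trip demo with a toy table.
table = {0: {"a": ("shift", 1)}, 1: {"$": ("reduce", 0)}}
path = os.path.join(tempfile.mkdtemp(), "parser_table.py")
dump_table_as_module(table, path)
restored = load_table_module(path)
```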

Expression syntax sugar

I have been translating a grammar from another syntax to parglare. That grammar had expression syntax: the user creates an expression template and gives it a name, then creates a special block that simply enumerates the operator tokens, their signature (unary or binary, and where the arguments sit), their priority, and the name of the template along with the name of some rule inside that template. I think this could be a nice piece of syntax sugar for parglare.

LAYOUT must allow empty match

  • parglare version: bc33add (Jan 27, 2018)
  • Python version: 2.5.3
  • Operating System: linux, ubuntu

Description

The following specification breaks on missing whitespace at the start of the text:

from parglare import Grammar
from parglare import Parser

gram = """\
words : word | words word ;
word : /[a-z]+/ ;

LAYOUT : WS | comment ;
comment : /#.*/ ;
WS : /[ \t]+/ ;
"""

text = "abc def"

grammar = Grammar.from_string(gram)
parser = Parser(grammar)
result = parser.parse(text)
print(result)

produces

Traceback (most recent call last):
  File "n.py", line 17, in <module>
    result = parser.parse(text)
  File "~/compiler3/parglare/parser.py", line 208, in parse
    position)
  File "~/compiler3/parglare/parser.py", line 475, in _skipws
    input_str, position, context=context)
  File "~/compiler3/parglare/parser.py", line 282, in parse
    nomatch_error(actions.keys()))
parglare.exceptions.ParseError: Error at position 1,0 => "*abc def". Expected: WS or comment

If you take out the LAYOUT and comment rules, it works. It looks like there is a forced LAYOUT at the start of the file. There may also be one at the end of the file, but that is not testable currently, I think.

I know forcing space between tokens is not normal in programming languages, as people tend to be afraid of using the spacebar and write code like a+1-4=b. However, the language I am parsing is a constrained natural language with sentences like power-supply must provide power. Optional white space doesn't make much sense there, and (I think) complicates parsing due to non-existing white-space ambiguities that must be resolved (words should never be broken into two pieces).

Depending on how you see this problem, several options to fix it exists (for as far as I can see):

  • Note in the discussion of LAYOUT that it must be allowed to match empty input.
  • Don't require LAYOUT at the start of the file, i.e. make it optional (the text might start with white-space, but not always).

Likely other options exist as well.
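One possible workaround is to make the layout itself derivable to the empty string by adding an EMPTY alternative. The shape below is only a sketch based on the rules in this report, not a confirmed fix:

```
words : word | words word ;
word : /[a-z]+/ ;

LAYOUT : LayoutItem | LAYOUT LayoutItem ;
LayoutItem : WS | comment | EMPTY ;
comment : /#.*/ ;
WS : /[ \t]+/ ;
```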

Syntactic inheritance

Rules whose RHS is only a single rule reference, or an alternative choice between single rule references, should be treated as syntactic inheritance -- i.e. the LHS is considered a generalization of the RHS rules.
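A possible detection pass for such productions, sketched with an invented production representation (pairs of LHS name and RHS symbol list):

```python
def inheritance_edges(productions, nonterminals):
    """Yield (general, specific) pairs for productions whose RHS is a
    single non-terminal reference -- the candidates for syntactic
    inheritance described above."""
    for lhs, rhs in productions:
        if len(rhs) == 1 and rhs[0] in nonterminals:
            yield (lhs, rhs[0])

# Toy grammar: Statement generalizes its two single-reference alternatives.
productions = [
    ("Statement", ["IfStatement"]),
    ("Statement", ["WhileStatement"]),
    ("Expression", ["Term", "+", "Term"]),
]
nts = {"Statement", "IfStatement", "WhileStatement", "Expression", "Term"}
edges = list(inheritance_edges(productions, nts))
```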

Serialization/load of parser tables and parse trees/ASTs

  • Should be a fast binary format. Usage of e.g. Apache Thrift
  • Interoperability between tools

Use cases:

  • Tables could be constructed faster using e.g. go-parglare (still in inception) and used in Python. Faster development time for very large grammars.
  • Parse trees/ASTs could be transformed and semantic actions can be performed using different tools written using different languages.

No valid parse on lexical ambiguity

  • parglare version: 0.5
  • Python version: 2.7.13
  • Operating System: macOS 10.13.4

Description

I'm migrating some basic NLP to parglare to test it, and bumped into this with some obscure regexes. It's probably not a happy case to have a space there, but I wanted to report it anyway. Here's a minimal version:

from parglare import Grammar, GLRParser

grammar = r"""
S: FOO | BAR;

FOO: /a/ "foo";

BAR: /a / "bar";
"""

g = Grammar.from_string(grammar)
GLRParser(g).parse('a foo')

This raises a ParseError: Error at position 1,2 => "a *foo". Expected: bar

I'm using the GLR parser since I need all possible parses (not that it would be useful with that example). Is there any way to get it to parse, without registering custom_lexical_disambiguation? (in case those regex recognizers are converted to literal strings, I don't even get the chance to fix it in the registered disambiguator)

In-grammar common action definition

parglare provides common actions, but they must be given to the parser using a constructor parameter.

This feature would provide a syntax to specify common actions in the grammar directly.

@collect
some_objects: some_objects some_object | some_object;

User could still override grammar provided action with some other action using constructor parameter.

Move action definition to prod/rule meta-data

Current syntax for defining actions is:

@action_name
Rulename: ...;

If action definitions were moved into the {} block, together with disambiguation and other meta-data (issue #57), fine-grained control would be possible. As the {} block is defined per production, each production could have a different action. This would deprecate the current list-based specification of actions.

After #17 is implemented, additional flexibility is achieved: an action could be given per rule, as it is now, and overridden per production.

The syntax might be similar to the current one, but the @... would be given inside the {} block.

Rulename: .... {@action_name};

Or for each production defined in one rule (after #17 is implemented):

Rulename {@action_name}: ...;

specify common action

Feature request.

The @ syntax for specifying common actions in the grammar is great. It would be really nice if I could specify my own common actions for use with the @ syntax.

parser = Parser(g, actions=actions, common_action=common_actions)

If a key conflicts with a default common action (like collect), I'd like to override the default. (common actions should also be specified in documentation).

Lexical disambiguation by lexeme ordering

In the context of lexical ambiguity it would be nice to have a way to define which lexeme is preferred over which.

For example:

terminals:
a: {>b, >c};
b: ;
c: ;

In case of ambiguity between a and either b or c, a will be preferred.

The ordering graph should be pre-calculated and cycles reported. Disambiguation itself will be done dynamically.
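The pre-calculation step above amounts to cycle detection on the preference graph. A minimal sketch (the graph encoding is invented here), using a standard depth-first search with coloring:

```python
def has_cycle(prefers):
    """Check the lexeme-preference graph for cycles; `prefers` maps a
    lexeme to the lexemes it is preferred over, e.g. {"a": ["b", "c"]}
    for the grammar snippet above."""
    nodes = set(prefers) | {m for ms in prefers.values() for m in ms}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = dict.fromkeys(nodes, WHITE)

    def dfs(n):
        color[n] = GRAY
        for m in prefers.get(n, ()):
            # A gray node on the current DFS path means a cycle.
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in nodes)

ok = has_cycle({"a": ["b", "c"]})
bad = has_cycle({"a": ["b"], "b": ["a"]})
```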

Include rules

In #7, an action will be provided for creating Python objects with attributes set using named matches (#2).
This feature would provide a possibility to reference a rule that uses named matches, with the resulting object attributes created on the referring rule.

For example see this textX issue

This feature can be implemented after #7

Add context argument to grammar recognizers

It is currently possible to extend the LR parser by keeping track of some extra state in global variables that are examined by custom recognizers and modified by custom actions. This has proven useful in parsing indentation, but probably has many other applications as well. It would be nice to have the parser keep track of this extra state and pass it along to both the recognizers and the actions.

Phase I: As mentioned in #5, there is a Context object that can be passed as an argument to the parse() method. This is currently passed along to actions, but not to recognizers. This context object could be used to store any extra state variables that are required. If this context object were passed to recognizers as well, then the external global state could be eliminated.

Phase II: If possible, it would be very useful to make this work for GLR as well. It seems like the context object, or a particular attribute of it, could be duplicated when GLR parser forks, keeping a separate copy for each fork. Perhaps copy.deepcopy() or some sort of custom clone() method on the extra state object could be used to perform this duplication.
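The Phase II duplication could be sketched in plain Python as follows (ExtraState and its fields are invented examples, not parglare API):

```python
import copy

class ExtraState:
    """Invented example of per-head extra parsing state, e.g. the
    indentation levels tracked while parsing an indented language."""
    def __init__(self):
        self.indent_stack = [0]

    def clone(self):
        # Each GLR head gets its own deep copy, so speculative parses
        # cannot corrupt each other's state.
        return copy.deepcopy(self)

head_a = ExtraState()
head_b = head_a.clone()   # fork point: duplicate the extra state
head_b.indent_stack.append(4)
```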

Question: If this proposal seems acceptable, should the extra context argument be added to the beginning or end of the recognizer argument list? The actions include it at the beginning, but it might cause less trouble for existing code to add it at the end.

I have started experimenting with this, and have implemented some of the more straightforward changes for LR parsing here:

https://github.com/codecraftingtools/parglare/tree/recognizer-context

Please take a look at it and let me know what you think. I think this covers most of what is required for phase I.

Another question: In implementing indentation parsing, I needed to allow recognizers to recognize the empty string. This required modifying this line in _token_recognition()

             last_prior = symbol.prior
             tok = symbol.recognizer(input_str, position, context)
-            if tok:
+            if tok is not None:
                 tokens.append(Token(symbol, tok))

to differentiate between an empty string ("") and no match (None). This change is included as a separate commit in the branch mentioned above. It seems like this small change makes things work, but it may break some other things I am not aware of, so I wanted to point it out.

Any comments would be appreciated. Also, please let me know if you would like me to do anything differently to make the collaboration workflow easier. Thanks again for all the work that has gone into this project.

Syntactic sugar for regex-like operators

parglare uses a pure BNF meta-language for grammar specification.
This leads to somewhat verbose grammars. There are many places where the zero-or-more (*) or one-or-more (+) regex constructs could be used. parglare should expand such usages into additional productions, with names derived from the referenced rule and the regex operation used. The implicitly bound action could also be determined automatically -- e.g. @collect for * and +.
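The expansion could look roughly like this. The helper-rule naming scheme and the @collect pairing below are illustrative guesses, not parglare's actual expansion:

```python
def expand_repetition(rule_ref, op):
    """Expand `rule_ref+` or `rule_ref*` into plain BNF helper
    productions, pairing the recursive ones with the @collect action."""
    name = rule_ref + ("_1" if op == "+" else "_0")
    productions = [
        (name, [name, rule_ref], "collect"),  # left-recursive step
        (name, [rule_ref], "collect"),        # base case
    ]
    if op == "*":
        # Zero-or-more additionally allows the empty match.
        productions.append((name, ["EMPTY"], None))
    return name, productions

plus_name, plus_prods = expand_repetition("B", "+")
star_name, star_prods = expand_repetition("B", "*")
```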

How far out is Table Caching?

The documentation mentions that it will be possible to avoid generating the parser on every startup... how far out is that feature?

Even though I had read that in the documentation before I started, I had naively assumed it would still be somehow possible to pickle the state of the parser and reload it on a future run. In fact, it is possible to serialize the parser using the dill module but it doesn't work quite right after I reload it.
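Plain data structures round-trip through pickle without trouble; the snippet below uses an invented stand-in for a parser table. A whole parser object, by contrast, carries unpicklable pieces such as lambda actions, which is plausibly why pickling the parser itself misbehaves after reload:

```python
import pickle

# Invented stand-in for a computed parser table: pure data, so it
# pickles cleanly, unlike a full parser object with lambda actions.
table = {"states": [{"a": ("shift", 1)}, {"$": ("reduce", 0)}]}

blob = pickle.dumps(table)
restored = pickle.loads(blob)
```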

Better shift-reduce disambiguation

Currently there are shift/right and reduce/left disambiguators you can use per production to resolve conflicts during LALR table calculation.

Sometimes it is useful to better specify in which context these disambiguators should be used.

If shift is given, then when the parser sees the production it will choose to shift instead of reduce for any lookahead token. Better control could be achieved if shift could be given one or more lookahead tokens for which it should be used.

Something like:

MyProduction: some terms and non-terms {shift(term1, term2), reduce};

In this case, resolution would be reduce for all tokens except term1 and term2.

In grammar control of `prefer_shift` and `prefer_shift_over_empty` strategies

prefer_shift and prefer_shift_over_empty strategies can be defined globally during parser instantiation. They are overridden by an explicit per-production rule for associativity (left/right).

The GLR parser can investigate both shift and reduce actions. There are situations where we want some of the prefer_* strategies to be applied globally but disabled for particular productions.

Consider this example:

    from parglare import Grammar, GLRParser

    grammar = """
    Program: "begin" statements=Statement* ProgramEnd EOF;
    ProgramEnd: "end" | DOT;
    Statement: "end" "transaction" | "command";
    DOT: ".";
    """
    g = Grammar.from_string(grammar, ignore_case=True)
    parser = GLRParser(g, build_tree=True, prefer_shifts=True)

    parser.parse("""
    begin
        command
        end transaction
        command
        end transaction
        command
    end
    """)

If we blindly use prefer_shifts in GLR, then statements in the Program rule will not be reduced when the final end keyword is encountered but will be shifted in anticipation of an end transaction statement. We actually need 2 tokens of lookahead to decide whether the end token is the program end or the beginning of an end transaction statement. Thus, here we should let GLR investigate both shift (to check whether it's end transaction) and reduce (to check whether it's the end of the program).

The idea is to have nops and nopse disambiguation rules that will disable global settings of prefer_shifts and prefer_shifts_over_empty on a production level.

Named assignments break disambiguation

  • parglare version: HEAD
  • Python version: 3.5.2
  • Operating System: linux ubuntu

Description

Having fun trying to shoehorn a line-based language into the parser, which is somewhat asking for trouble of course. The best solution so far is to make LAYOUT handle only spaces and tabs, and have a dedicated eol non-terminal that handles line endings. That seems to mostly work except the grammar needs to handle truly empty lines explicitly (not added here for simplicity).

Anyway, while experimenting I ran into a weird disambiguation problem. See

from parglare import Grammar, Parser

# --------------------
gram_text = r"""
typedefheader : "define" "type"  eol ;

eol : /\n/
| filecomment
| doccomment
;

filecomment : /#.*\n/ ;
doccomment : /#<.*\n/ {20} ;

LAYOUT: EMPTY | /[ \t]+/ ;
KEYWORD: /[-a-z]+/ ;
"""
# --------------------
text = "define type #< something\n"

g = Grammar.from_string(gram_text)
p = Parser(g)
print(p.parse(text))
print("############################")
print()

# --------------------
gram_text2 = r"""
typedefheader : "define" "type"  eol ;

eol : /\n/
| filecomment
| doccomment
;

filecomment : /#.*\n/ ;
doccomment : docu=/#<.*\n/ {20} ;    // Added "docu="

LAYOUT: EMPTY | /[ \t]+/ ;
KEYWORD: /[-a-z]+/ ;
"""
# --------------------

g2 = Grammar.from_string(gram_text2)
p2 = Parser(g2)
print(p2.parse(text))

In the first grammar, the eol defines a \n possibly prefixed by two different forms of comment, # filecomment and #< documentation comment. Since these two forms of comment are obviously ambiguous, the second form has a nice {20} attached to it to give it priority. So far so good.

Now I wanted to know if it actually worked, so I added docu= to the doccomment rule (printing the result would give me a named instance rather than plain text). However, the parser then crashed on the ambiguity:

['define', 'type', '#< something\n']
############################

Traceback (most recent call last):
  File "/home/hat/projects/psi/psi5/compiler3/parglare/parser.py", line 213, in parse
    ntok = next_token(cur_state, input_str, position)
  File "/home/hat/projects/psi/psi5/compiler3/parglare/parser.py", line 537, in _next_token
    ntok = self._lexical_disambiguation(tokens)
  File "/home/hat/projects/psi/psi5/compiler3/parglare/parser.py", line 672, in _lexical_disambiguation
    raise DisambiguationError(tokens)
parglare.exceptions.DisambiguationError: [<filecomment(#< something
)>, <#<.*\n(#< something
)>]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "named_assignment.py", line 43, in <module>
    print(p2.parse(text))
  File "/home/hat/projects/psi/psi5/compiler3/parglare/parser.py", line 217, in parse
    disambiguation_error(e.tokens))
parglare.exceptions.ParseError: Error at position 1,12 => "fine type *#< somethi". Can't disambiguate between: <filecomment(#< something
)> or <#<.*\n(#< something
)>

Note the second line with hashes in the output, meaning that the first parse was fine.

Disambiguation for GLR with dynamic priorities

Priorities used currently are static in nature, they are used to resolve conflicts during LR table calculation.

It would be nice to have a way to specify a dynamic priority that would be used to choose the right tree from the parse forest after GLR finishes parsing. This priority wouldn't be used for LR tables but for post-processing of the parse forest, eliminating trees in which productions/terminals of lower priority were chosen.
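The post-processing step amounts to keeping only maximal-priority trees. A minimal sketch, where the forest and the scoring callback are invented stand-ins for real parse trees and per-production dynamic priorities:

```python
def best_trees(forest, priority):
    """Keep only the parse trees whose dynamic priority is maximal.
    `priority(tree)` is assumed to sum the per-production dynamic
    priorities used in the tree."""
    if not forest:
        return []
    best = max(priority(tree) for tree in forest)
    return [tree for tree in forest if priority(tree) == best]

# Toy forest: (label, score) pairs standing in for real trees.
forest = [("t1", 1), ("t2", 3), ("t3", 3)]
kept = best_trees(forest, priority=lambda tree: tree[1])
```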

feature request: validate that there is an action for every grammar rule

In updating the code for the new common actions, I realized that I was parsing (A) paragraphs but not handling them. It would be great if I could get:

  • a list of all rules that have no corresponding action
  • an audit of all the rules and the actions that are handling them.

Maybe this could be part of pglr command?
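The requested checks reduce to a set difference between rule names and action keys. A minimal sketch with invented helper names:

```python
def unhandled_rules(rule_names, actions):
    """List grammar rules that have no corresponding action."""
    return sorted(set(rule_names) - set(actions))

def action_audit(rule_names, actions):
    """Map every rule to the action handling it (None if unhandled)."""
    return {rule: actions.get(rule) for rule in rule_names}

actions = {"paragraph": lambda _, nodes: nodes}
missing = unhandled_rules(["paragraph", "heading"], actions)
audit = action_audit(["paragraph", "heading"], actions)
```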

Parentheses grouping

Using parentheses in the body of a rule and applying regex operators to them.

This will require creating new rules behind the scenes for each parentheses usage, as the LR table building process is based on BNF.

Naming of the new rules? Maybe the name of the containing rule plus some suffix (e.g. the index of the parentheses occurrence).

Example:

A: (B C)* D;

translates to the following productions:

A_p1 = B
A_p1 = C
A_p1_0 = A_p1_1
A_p1_0 = EMPTY
A_p1_1 = A_p1_1 A_p1
A_p1_1 = A_p1
A = A_p1_0 D

Add flag to disable lexical disambiguation

As discussed in #40, it would be useful to have a flag to disable lexical disambiguation (and maybe the scanning optimization for building the actions table here), for example to obtain all possible parses in NLP tasks with the GLR parser.

Add support for arbitrary rule metadata.

This could be used in recognizers, actions, disambiguation/error recovery callbacks, error reporting etc. to extend the semantics of the parglare grammar language.

Ideas for the syntax (label is meta-data attached to my_rule grammar rule):

<label: "My Rule">
my_rule: ... ;

or just extending the currently built-in meta-data in {}:

my_rule: ... {label:"My Rule"};

Once defined it could be accessed on grammar symbol as an attribute for example.

I prefer the latter syntax approach, as it wouldn't introduce additional syntax noise and, with #17, it would enable defining meta-data per production as well, without any change in syntax.

Per-grammar-rule disambiguation filter/rule

Currently, static disambiguation filters/rules are defined per production using {} syntax.

This issue deals with defining the same thing at the grammar rule level. Production-level filters should be "stronger" and override those defined at the rule level.
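The override semantics can be sketched as a simple dict merge, with production-level settings winning (the setting names below are invented for illustration):

```python
def effective_settings(rule_level, production_level):
    """Merge disambiguation settings; production-level entries are
    "stronger" and override rule-level ones."""
    merged = dict(rule_level)
    merged.update(production_level)
    return merged

merged = effective_settings(
    {"assoc": "left", "priority": 5},  # defined on the grammar rule
    {"priority": 9},                   # defined on one production
)
```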

Conflict between string match and rule with the same name

If there is a string match in the grammar (i.e. a terminal rule) and another rule with the same name, the two will collide, leading to grammar errors:

Example:

Terminals:
    "Terminals" ":" terminal_list=Terminal* ";"
;

This leads to

Error in the grammar file.
First set empty for grammar symbol "Terminals". An infinite recursion on the grammar symbol.

Definition of grammar and actions for loop behavior

  • parglare version: 0.5
  • Python version: 3.6.3
  • Operating System: Mac OSX High Sierra

Description

I am trying to implement loop behavior, specifically a while loop, using custom actions with my grammar. When I run the parser, it executes the statements before evaluating the loop condition and then runs the loop with no loop body.

What I Did

I defined the grammar as below:
PROGRAM: STATEMENT_LIST;
@pass_nochange
STATEMENT_LIST: STATEMENT STATEMENT_LIST?;
STATEMENT: CONCEPT_STATEMENT
| EXPRESSION_STATEMENT
| ACCEPT_STATEMENT
| ANSWER_STATEMENT
| WHILE_STATEMENT
| IF_STATEMENT
| IMPORT_STATEMENT;
EXPRESSION_STATEMENT: EXPRESSION_STATEMENT AND_OP EXPRESSION_STATEMENT {right, 1}
| EXPRESSION_STATEMENT OR_OP EXPRESSION_STATEMENT {right, 1}
| EXPRESSION_STATEMENT GE_OP EXPRESSION_STATEMENT {right, 2}
| EXPRESSION_STATEMENT LE_OP EXPRESSION_STATEMENT {right, 2}
| EXPRESSION_STATEMENT GT_OP EXPRESSION_STATEMENT {right, 2}
| EXPRESSION_STATEMENT LT_OP EXPRESSION_STATEMENT {right, 2}
| EXPRESSION_STATEMENT EQ_OP EXPRESSION_STATEMENT {right, 2}
| EXPRESSION_STATEMENT NE_OP EXPRESSION_STATEMENT {right, 3}
| NOT_OP EXPRESSION_STATEMENT
| L_PAREN_PN EXPRESSION_STATEMENT R_PAREN_PN
| TAUTOLOGY
| CONTRADICTION
| IDENTIFIER
| NUMBER;
WHILE_STATEMENT: WHILE_KW EXPRESSION_STATEMENT STATEMENT_LIST;

And the actions below:
actions_dict["PROGRAM"] = lambda _, nodes: nodes[0]
actions_dict["STATEMENT"] = lambda _, nodes: nodes[0]
actions_dict["EXPRESSION_STATEMENT"] = [
    lambda _, nodes: nodes[0] and nodes[2],
    lambda _, nodes: nodes[0] or nodes[2],
    lambda _, nodes: nodes[0] >= nodes[2],
    lambda _, nodes: nodes[0] <= nodes[2],
    lambda _, nodes: nodes[0] > nodes[2],
    lambda _, nodes: nodes[0] < nodes[2],
    lambda _, nodes: nodes[0] == nodes[2],
    lambda _, nodes: nodes[0] != nodes[2],
    lambda _, nodes: not nodes[1],
    lambda _, nodes: nodes[1],
    lambda _, nodes: bool(nodes[0]),
    lambda _, nodes: bool(nodes[0]),
    lambda _, nodes: nodes[0],
    lambda _, nodes: float(nodes[0]),
]

def while_func(context, nodes):
    print(nodes)
    while nodes[1]:
        nodes[0]
    return None

actions_dict["WHILE_STATEMENT"] = while_func
actions_dict["ACCEPT_STATEMENT"] = lambda _, nodes: print("the answer is 42")

This parses the statement "WHILE 1==1 accept.request" correctly, but when executing, it will print the answer once and then begin executing the while loop with no output. I have tried executing the statement with and without the build_tree flag, but it gives the same behavior. I am looking for any advice on how to go about this.

Unicode handling

  • parglare version: parglare (0.4.1)
  • Python version: Python 2.7.14 (default, Jan 5 2018, 10:41:29)
    [GCC 7.2.1 20171224] on linux2
  • Operating System: ARCH Linux

Description

Non-ASCII rules are not supported under Python 2.7.
In Python 3 they work fine.
Simplest example:

# coding: utf-8
from parglare import Grammar
from parglare import Parser

grammar = Grammar.from_file("names.pg")
parser = Parser(grammar)
inp = 'МИША МЫЛ РАМУ'
print(inp)
result = parser.parse(inp)
print(result)

grammar:

LINE: FIO|SYMBOL;
FIO: /'МИША'|'САША'/;
SYMBOL: /\w+/;

What I Did

result in python 2.7

python ./names.py 
МИША МЫЛ РАМУ
Traceback (most recent call last):
  File "./names.py", line 9, in <module>
    result = parser.parse(inp)
  File "/usr/lib/python2.7/site-packages/parglare/parser.py", line 206, in parse
    position)
  File "/usr/lib/python2.7/site-packages/parglare/parser.py", line 480, in _skipws
    while position < in_len and input_str[position] in self.ws:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

Terminal rules should be defined in a separate block in grammars

Current syntax which makes terminal and non-terminal rules similar and doesn't enforce separation of rules leads to confusion and unexpected changes in semantics. See #27

Specifying terminals in a separate block in the grammar should lead to a cleaner language semantics.

Obscure behavior when text is longer than the grammar can consume

  • parglare version: 0.5.dev0
  • Python version: 3.6.3
  • Operating System: ArchLinux 4.13.11

Description

I tried to split text into paragraphs, keeping whitespace as significant terms.
As a first step in deriving the grammar, I tried to parse two separate lines instead of paragraphs.
When using a knowingly wrong grammar, which can parse only the first line, I expected an error.
Instead of an error, parglare simply parsed the first line and dropped the rest of the text.
Is this intentional behavior?

What I Did

import re
from parglare import Grammar, Parser

g = Grammar.from_string('T: NL* L+ NL* | NL*; L: /.+/; NL: "\n";', re_flags=re.MULTILINE)
print(Parser(g, ws=False).parse('\nL1\nL2\n\n'))
# [['\n'], ['L1'], ['\n']]

I expected parse error, e.g. "expected STOP but received L2".

Named matches

parglare reduction actions get their subresults by ordinal position.

Named matches would allow getting subresults by name:

my_rule: first=first_match_rule second=second_match_rule;
first_match_rule: ...;
second_match_rule: ...;

Now in your action for my_rule you will get first and second as parameters.

This would make it easy to provide a new common action that will return a Python object with supplied parameters as object attributes.

@obj
my_rule: first=first_match_rule second=second_match_rule;
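The @obj action could behave roughly like the sketch below. The signature (named matches arriving as keyword arguments alongside the ordinal nodes list) is an assumption for illustration, not parglare's confirmed API:

```python
import types

def obj_action(context, nodes, **named):
    # Hypothetical signature: named matches arrive as keyword
    # arguments; the ordinal `nodes` list stays available if needed.
    return types.SimpleNamespace(**named)

result = obj_action(None, ["x", "y"], first="x", second="y")
```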

Strange get_collector() behavior

parglare version: 0.6.1
Python version: Python 2.7.6
Operating System: Ubuntu 14.04

When running this program with Python 2:

from parglare import get_collector

action = get_collector()

def f(context, node):
    return node

action('f_action')(f)

An error occurs:

Traceback (most recent call last):
  File "test_unicode.py", line 9, in <module>
    action('f_action')(f)
  File "parglare/parglare/common.py", line 153, in __call__
    return decorator(name_or_f)
  File "parglare/parglare/common.py", line 140, in decorator
    name = f.__name__
AttributeError: 'str' object has no attribute '__name__'

The error goes away when I add this line to the top of the file:

from __future__ import unicode_literals

The error does not occur when I run the original program with Python 3.4.0.

Wrong associativity behavior

  • parglare version: 0.4.1

  • Python version: Python 2.7.14 (default, Jan 5 2018, 10:41:29)
    [GCC 7.2.1 20171224] on linux2

  • Operating System: Arch Linux

Description

simple calculator test
Grammar:
STMT : STMT ADDOP STMT {left, 1}
| STMT MULOP STMT {left, 2}
| "(" STMT ")" | NUMBER;
ADDOP : "+" | "-";
MULOP : "*"|"/";
NUMBER: /\d+(.\d+)?/;
input:
1 + 2 / (3 - 1 + 5)

What I Did

from parglare import Grammar
from parglare import Parser

grammar = Grammar.from_file("calc_pg.pg")
parser = Parser(grammar)
print('1 + 2 / (3 - 1 + 5)')
result = parser.parse('1 + 2 / (3 - 1 + 5)')

print(result)

Program output:

1 + 2 / (3 - 1 + 5)
['1', u'+', ['2', u'/', [u'(', ['3', u'-', ['1', u'+', '5']], u')']]]

WTF? Why is 1 + 5 done before 3 - 1?
I've tried {left, 1} and {1, left} -- no better.
The classic grammar works better:
STMT : TERM | STMT ADDOP TERM ;
TERM : FACTOR | FACTOR MULOP FACTOR ;
FACTOR : "(" STMT ")" | NUMBER;
ADDOP : "+" | "-";
MULOP : "*"|"/";
NUMBER: /\d+(.\d+)?/;

program output:
1 + 2 / (3 - 1 + 5)
['1', u'+', ['2', u'/', [u'(', [['3', u'-', '1'], u'+', '5'], u')']]]

Readme example needs update

  • parglare version: 0.6.1
  • Python version: 3.6.5
  • Operating System: Archlinux

Description

This is a cool project! But I was miffed to find out that the readme example does not work. It looks like on May 22 terminal symbols were given their own grammar block... that is probably what broke the example.

What I Did

I copied the code from the readme into a python script and tried to run it. I got this result:

ParseError: 9:8:";\nnumber: */\d+(\.\d+" => Expected: Name or StrTerm

Adding a line with the word terminals above the number:... line fixed the example.

Add support for side-effect actions

It seems that custom actions are being used for two separate purposes: 1) constructing a parse result (e.g. AST tree nodes), and 2) manipulating external state for use by recognizers or dynamic filters. This is illustrated by the new option added in #45 and the discussion in #5 about handling significant indentation. I think it would be more straightforward to split up the custom actions as currently implemented into two separate, but very similar, groups.

The purpose of the existing actions would be to generate a parse result, and the purpose of the side-effects actions would be to manipulate some extra parsing state for use by recognizers or dynamic filters. The side-effect actions would always be called during the parse, as they are really part of the parsing process, but the result generation actions would be optional and could be called after-the-fact. The return value of the side-effect actions would be ignored.
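The proposed split can be sketched in a few lines of plain Python (all names here are invented for illustration, not parglare API):

```python
def reduce_production(name, subresults, side_effects, actions, build_result):
    """Sketch of the proposed split: side-effect actions always run
    during the parse and their return value is ignored; result-building
    actions are optional and may be skipped or run after the fact."""
    if name in side_effects:
        side_effects[name](subresults)
    if build_result and name in actions:
        return actions[name](subresults)
    return subresults

log = []
side_effects = {"typeline": lambda nodes: log.append(list(nodes))}
actions = {"typeline": lambda nodes: tuple(nodes)}

r1 = reduce_production("typeline", ["a"], side_effects, actions, build_result=False)
r2 = reduce_production("typeline", ["b"], side_effects, actions, build_result=True)
```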

I assume there would also be side-effect decorators and the option to use a _side_effects.py file as with actions and recognizers.

Question: Should the side-effect list be passed in via the grammar constructor or the parser constructor? My first thought is that side effects are more a part of the grammar and should go there, but passing them in through the parser would be fine as well.

Question: If this were implemented, would it eliminate the need for the extra option introduced in #45? Would the actions used by @alensuljkanovic fall cleanly into the side-effect category?

I have started working on this and think the implementation is very straightforward, but I haven't published my work to a branch yet.

Any comments on this idea would be appreciated. Answers to these questions will also help give me direction, as they all will affect the implementation. Thanks!
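A rough pure-Python sketch of the proposed split (all names here are illustrative, not parglare API): side-effect callables always run on reduce and their return value is ignored, while result-building actions run only when a result is requested:

```python
# Hypothetical dispatcher sketching the proposed two-phase actions;
# none of these names are parglare API.
side_effects = {}   # rule name -> callable mutating external parse state
actions = {}        # rule name -> callable building a result node

def on_reduce(rule, state, subresults, build_result=True):
    # Side effects are part of parsing itself: always invoked,
    # return value ignored.
    if rule in side_effects:
        side_effects[rule](state, subresults)
    # Result-building actions are optional and could run after the fact.
    if build_result and rule in actions:
        return actions[rule](subresults)
    return subresults

state = {"indents": []}
side_effects["line"] = lambda st, subs: st["indents"].append(
    len(subs[0]) - len(subs[0].lstrip()))
actions["line"] = lambda subs: subs[0].strip()

print(on_reduce("line", state, ["    x = 1"]))  # → x = 1
print(state["indents"])                         # → [4]
```

This mirrors the significant-indentation use case from #5: indentation tracking happens unconditionally, while the stripped line only matters when building a result.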

Handler for tokens recognition/disambiguation

parglare currently calls recognizers one by one and, in case of multiple matches, performs lexical disambiguation on its own based on the priorities and the length of the match.

In some cases it would be better to register a handler that receives all possible tokens expected at the current position and decides which of them is the next in the input.

Motivation: there are use-cases where we want to do fuzzy matching of tokens. This approach gives us a chance to decide which token is the best match at the given location in case there are no exact matches. We even get a chance to dynamically decide which token is the right one in case of multiple matches.
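A sketch of what such a handler could look like (pure Python, hypothetical interface): it receives all candidate (terminal, expected string) pairs at the current position, prefers the longest exact match, and falls back to fuzzy matching when nothing matches exactly:

```python
import difflib

def disambiguate(text, pos, candidates):
    """candidates: list of (terminal_name, expected_string) pairs."""
    # Exact matches first: keep the longest one (longest-match rule).
    exact = [(name, s) for name, s in candidates if text.startswith(s, pos)]
    if exact:
        return max(exact, key=lambda c: len(c[1]))
    # No exact match: fuzzy-match the upcoming text against each terminal.
    window = text[pos:pos + max(len(s) for _, s in candidates)]
    scored = [(difflib.SequenceMatcher(None, window, s).ratio(), name, s)
              for name, s in candidates]
    ratio, name, s = max(scored)
    return name, s

cands = [("FOR", "for"), ("FOREACH", "foreach")]
print(disambiguate("foreach x", 0, cands))  # → ('FOREACH', 'foreach')
print(disambiguate("foerach x", 0, cands))  # fuzzy → ('FOREACH', 'foreach')
```

The second call shows the fuzzy case: the typo "foerach" matches no terminal exactly, but the handler still picks FOREACH as the closest candidate.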

Multiple definition for Rule/Class error not consistent

  • parglare version: bc33add (Jan 27, 2018)
  • Python version: 3.5.2
  • Operating System: linux, ubuntu

Description

parglare sometimes accepts multiple definitions of the same non-terminal, and sometimes it doesn't.
Example case:

from parglare import Grammar
from parglare import Parser

text = "a"

gram1 = """\
A : "a" ;
A : "b" ;
"""

grammar = Grammar.from_string(gram1) # line 11
parser = Parser(grammar)
result = parser.parse(text)
print(result)

gram2 = """\
A : t="a" ;
A : t="b" ;
"""

grammar = Grammar.from_string(gram2) # line 21
parser = Parser(grammar)
result = parser.parse(text)
print(result)

Produces

a
Traceback (most recent call last):
  File "m.py", line 21, in <module>
    grammar = Grammar.from_string(gram2)
  File "~/compiler3/parglare/grammar.py", line 615, in from_string
    .parse(grammar_str, context=context),
  File "~/compiler3/parglare/parser.py", line 381, in parse
    context)
  File "~/compiler3/parglare/parser.py", line 619, in _call_reduce_action
    result = sem_action(context, subresults)
  File "~/compiler3/parglare/grammar.py", line 961, in act_production_rule
    .format(name))
parglare.exceptions.GrammarError: Multiple definition for Rule/Class "A"

Note that it crashes at line 21, but not line 11.

EDIT:

The only difference between the two grammars is the t= additions in gram2.

Merging both alternatives with a | avoids the crash.
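The consistent behavior presumably wanted here can be sketched as a single registry keyed by the left-hand-side name that rejects duplicate rule definitions the same way in every case, with or without assignments like t= (illustrative code, not parglare's implementation):

```python
def build_rules(productions, allow_merge=False):
    """productions: list of (lhs, rhs) pairs read from the grammar."""
    rules = {}
    for lhs, rhs in productions:
        if lhs in rules:
            if not allow_merge:
                # Same error for every spelling of the duplicate,
                # with or without assignments in the RHS.
                raise ValueError(f'Multiple definition for Rule "{lhs}"')
            rules[lhs].append(rhs)
        else:
            rules[lhs] = [rhs]
    return rules

# Merging alternatives with | (a single production) is always fine:
print(build_rules([("A", '"a" | "b"')]))  # → {'A': ['"a" | "b"']}

# Two separate A rules consistently raise, mirroring the t= case:
try:
    build_rules([("A", 't="a"'), ("A", 't="b"')])
except ValueError as e:
    print(e)  # → Multiple definition for Rule "A"
```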

Better error reporting

The error message should be based on the full parser state at the position in the input where parsing failed, not only on the current LR state.
In GLR, the error should be reported based on the set of heads (i.e. their full state) that progressed furthest in the input stream.

This boils down to filtering out all invalid LR items in the current state based on the previous states on the parser stack(s).
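A minimal sketch of the GLR part (the data shapes here are illustrative): collect expected symbols only from the heads that reached the furthest input position and report them together:

```python
def furthest_error(heads):
    """heads: list of (position, expected_symbols) for failed GLR heads."""
    max_pos = max(pos for pos, _ in heads)
    expected = set()
    for pos, symbols in heads:
        if pos == max_pos:
            # Only heads that progressed furthest contribute to the report.
            expected |= set(symbols)
    return max_pos, sorted(expected)

heads = [(12, ["ID"]), (17, ["+", "-"]), (17, ["("])]
print(furthest_error(heads))  # → (17, ['(', '+', '-'])
```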

Import grammar from Python package

Currently a grammar can be imported from the file system by giving a path relative to the importing grammar.
In some use-cases one would want to import a grammar deployed with some Python package.

In that case additional import variant could be used in the form:

import some.python.module.grammar as a;

Here grammar should be a string variable in the module some.python.module, containing the path to the grammar file, e.g.:

grammar = os.path.join(os.path.dirname(__file__), 'grammar.pg')
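A sketch of how the importer could resolve such a reference (hypothetical helper, not parglare code): split off the attribute name, import the module, and read the string attribute:

```python
import importlib

def resolve_grammar_path(dotted_ref):
    """Resolve 'some.python.module.grammar' to the value of `grammar`."""
    module_name, _, attr = dotted_ref.rpartition(".")
    module = importlib.import_module(module_name)
    path = getattr(module, attr)
    if not isinstance(path, str):
        raise TypeError(f"{dotted_ref} must name a string attribute")
    return path

# Demonstrated with a stdlib attribute, since some.python.module is
# hypothetical:
print(resolve_grammar_path("os.curdir"))  # → .
```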

Keywords-like recognizers

If a special terminal rule KEYWORD is defined (it must be a regex match), then all string recognizers whose values match KEYWORD will match only when surrounded by whitespace.

This will prevent matching a keyword as a part of some other element.
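The intended behavior can be sketched with a word-boundary check (using word boundaries rather than literal whitespace, and assuming KEYWORD is something like /\w+/): a string recognizer whose value looks like a keyword refuses to match inside a longer word:

```python
import re

KEYWORD = re.compile(r"\w+")

def keyword_recognizer(value):
    """Return a recognizer for `value` that refuses in-word matches."""
    is_keyword_like = KEYWORD.fullmatch(value) is not None

    def recognize(text, pos):
        if not text.startswith(value, pos):
            return None
        end = pos + len(value)
        if is_keyword_like:
            # Reject "if" inside "iffy": neither the following nor the
            # preceding character may extend the word.
            if end < len(text) and re.match(r"\w", text[end]):
                return None
            if pos > 0 and re.match(r"\w", text[pos - 1]):
                return None
        return value

    return recognize

rec = keyword_recognizer("if")
print(rec("if x", 0))    # → if
print(rec("iffy x", 0))  # → None
```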
