igordejanovic / parglare
A pure Python LR/GLR parser - http://www.igordejanovic.net/parglare/
License: MIT License
Currently (I'm on commit 204e239) actions are not called when build_tree is set to True, since they directly affect the process of AST creation. But sometimes it would be useful to call the actions even though build_tree is set to True.
The use-case where I find this useful is the following:
There are situations, in case of lexical ambiguity, where the GLR parser wrongly detects automata looping and discards valid heads, leading to fewer solutions than there are. The problem gets worse because the result, although from the set of right solutions, is non-deterministic due to the unordered collections used.
This code demonstrates the problem:
from parglare import Grammar, GLRParser

grammar = """
model: element+;
element: title
| table_with_note
| table_with_title;
table_with_title: table_title table_with_note;
table_with_note: table note*;
title: /title/; // <-- This is lexically ambiguous with the next.
table_title: /title/;
table: "table";
note: "note";
"""
# this input should yield 4 parse trees.
input = "title table title table"
g = Grammar.from_string(grammar)
parser = GLRParser(g)
results = parser.parse(input)
# We should have 4 solutions for the input.
assert len(results) == 4
I tried to parse a simple type definition using the following specification:
grammar = r"""
spec : definition | spec definition ;
definition : typedefinition ;
typedefinition : typedefheader typelines ;
typedefheader : empties "define" typetypes eol ;
typetypes : "type" | "types" ;
typelines : typeline | typelines typeline ;
typeline : empties NAMETK eol
| empties NAMETK "is" "a" NAMETK eol ;
empties : EMPTY | empties eol ;
eol : /\n/;
NAMETK : /[A-Za-z][A-Za-z0-9]*/ ;
LAYOUT : EMPTY | Layout ;
Layout : /[ \t\r]+/ ;
KEYWORD: /[a-z]+/ ;
"""
text = """
define type
real
float is a real
double is a real
"""
from parglare import Grammar, Parser
g = Grammar.from_string(grammar)
parser = Parser(g)
result = parser.parse(text)
print(result)
The language is line-based, so there is an eol token to explicitly check for newline. Also, LAYOUT does not handle newline; instead, empties skips any empty lines if necessary.
I expected to see the entire input parsed, but I only get the header line and the first type definition line (real):
[[[[], '\n'], 'define', 'type', '\n'], [[], 'real', '\n']]
In particular, the 2nd and 3rd type definitions are missing, and there is no parse error reported.
Swapping type definition lines makes no difference; you always get only the first definition.
I also tried adding actions to the typeline rule, and only one call is made. I would really like the parser to process all lines :)
Parglare is damn slow, and generating the parser is also damn slow. It may be useful, both from a speed and a debugging perspective, to serialize the generated parser into a Python file.
I have been translating a grammar from another syntax to parglare. That grammar had expression syntax: a user creates an expression template and gives it a name, then creates a special block where they enumerate the operator tokens, their signature (unary or binary, and where the arguments are situated) and their priority, and reference the name of the template and of some name inside that template. I think this could be a nice piece of syntactic sugar for parglare.
The following specification breaks on missing whitespace at the start of the text:
from parglare import Grammar
from parglare import Parser
gram = """\
words : word | words word ;
word : /[a-z]+/ ;
LAYOUT : WS | comment ;
comment : /#.*/ ;
WS : /[ \t]+/ ;
"""
text = "abc def"
grammar = Grammar.from_string(gram)
parser = Parser(grammar)
result = parser.parse(text)
print(result)
produces
Traceback (most recent call last):
File "n.py", line 17, in <module>
result = parser.parse(text)
File "~/compiler3/parglare/parser.py", line 208, in parse
position)
File "~/compiler3/parglare/parser.py", line 475, in _skipws
input_str, position, context=context)
File "~/compiler3/parglare/parser.py", line 282, in parse
nomatch_error(actions.keys()))
parglare.exceptions.ParseError: Error at position 1,0 => "*abc def". Expected: WS or comment
If you take out the LAYOUT and comment rules, it works. It looks like there is a forced LAYOUT match at the start of the file. There may also be one at the end of the file, but I don't think that is currently testable.
I know forcing space between tokens is not normal in programming languages, as people tend to be afraid of using the spacebar and write code like a+1-4=b. However, the language I am parsing is a constrained natural language with sentences like power-supply must provide power. Optional white space doesn't make much sense there, and (I think) complicates parsing due to non-existent white-space ambiguities that must be resolved (words should never be broken into two pieces).
Depending on how you see this problem, several options to fix it exist (as far as I can see):
- Require that LAYOUT must allow matching the empty string.
- Don't force LAYOUT at the start of the file, i.e. make it optional (the text might start with white-space, but not always).
Likely other options exist as well.
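For what it's worth, a grammar-side workaround along the lines of the first option is to make the empty match explicit in LAYOUT itself. A sketch in parglare's grammar language (the LayoutItem/WS rule names are my own choice, not from the report above):

```
LAYOUT : LayoutItem | LAYOUT LayoutItem ;
LayoutItem : WS | comment | EMPTY ;
WS : /[ \t]+/ ;
comment : /#.*/ ;
```

With EMPTY as an alternative, LAYOUT can succeed without consuming input, so a file starting directly with a word should no longer trigger the error.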
See this discussion
At the moment, context.extra is treated as a global value which is by default a dict. This issue will deal with providing a mechanism to have an isolated/separate value of context.extra per GLR parsing thread.
Rules whose RHS is only a single rule reference, or a choice between single rule references, should be treated as syntactic inheritance -- i.e. the LHS is considered a generalization of the RHS rules.
Use cases:
I'm migrating some basic NLP to parglare to test it, and bumped into this with some obscure regexes. It's probably not a happy case to have a space there, but I wanted to report it anyway. Here's a minimal version:
grammar = r"""
S: FOO | BAR;
FOO: /a/ "foo";
BAR: /a / "bar";
"""
from parglare import Grammar, GLRParser

g = Grammar.from_string(grammar)
GLRParser(g).parse('a foo')
This raises a ParseError: Error at position 1,2 => "a *foo". Expected: bar
I'm using the GLR parser since I need all possible parses (not that it would be useful with that example). Is there any way to get it to parse without registering custom_lexical_disambiguation? (In case those regex recognizers are converted to literal strings, I don't even get the chance to fix it in the registered disambiguator.)
e.g. FORTRAN as column-based or Python as indentation-based.
parglare provides common actions, but they must be given to the parser using a constructor parameter.
This feature would provide a syntax to specify common actions directly in the grammar.
@collect
some_objects: some_objects some_object | some_object;
The user could still override a grammar-provided action with some other action using a constructor parameter.
This action should be automatically used if there is an assignment (see #2) in the rule. It will create a Python object with attributes set to the values collected by named matches/assignments.
Boolean assignment (?=) should imply that the RHS is matched optionally.
Thus:
Rule: some_attr?=SomeRule?;
could be written a little bit cleaner:
Rule: some_attr?=SomeRule;
Current syntax for defining actions is:
@action_name
Rulename: ...;
If the action definition were moved to the {} block, together with disambiguation and other meta-data (issue #57), fine-grained control would be possible. As the {} block is defined per production, each production could have a different action. This would deprecate the current list-based specification of actions.
After #17 is implemented, additional flexibility is achieved: an action could be given per rule as it is now, and overridden per production.
The syntax might be similar to what it is now, but the @... would be given inside the {} block.
Rulename: .... {@action_name};
Or for each production defined in one rule (after #17 is implemented):
Rulename {@action_name}: ...;
Feature request.
The @ syntax for specifying common actions in the grammar is great. It would be really nice if I could specify my own common actions for use with the @ syntax.
parser = Parser(g, actions=actions, common_action=common_actions)
If a key conflicts with a default common action (like collect), I'd like to override the default. (The available common actions should also be documented.)
In the context of lexical ambiguity it would be nice to have a way to define which lexeme is preferred over which.
For example:
terminals:
a: {>b, >c};
b: ;
c: ;
In case of ambiguity between a and either b or c, a will be preferred.
Ordering graph should be pre-calculated and cycles reported. Disambiguation will be done dynamically.
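The pre-calculation mentioned above is essentially a cycle check over the "preferred-over" relation. A minimal sketch in plain Python (names and data layout are illustrative, not parglare API):

```python
def find_cycle(prefers):
    """Detect a cycle in a preference graph given as
    {terminal: set of terminals it is preferred over}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in prefers}

    def visit(t):
        color[t] = GRAY
        for u in prefers.get(t, ()):
            if color.get(u, WHITE) == GRAY:
                return True            # back edge found: cycle
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(visit(t) for t in prefers if color[t] == WHITE)

# a > b and a > c is a valid ordering; adding b > a creates a cycle.
print(find_cycle({'a': {'b', 'c'}, 'b': set(), 'c': set()}))  # False
print(find_cycle({'a': {'b'}, 'b': {'a'}}))                   # True
```

Running this check once, when the grammar is loaded, would let cycles be reported as grammar errors before any parsing happens.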
In #7 an action will be provided for creating Python objects with attributes set using named matches #2.
This feature would provide the possibility to reference a rule that uses named matches, where the resulting object attributes are created on the referencing rule.
For example see this textX issue
This feature can be implemented after #7
Related to #30
Parsing should not continue in case of conflicts as conflict resolution strategies are used for table construction.
It is currently possible to extend the LR parser by keeping track of some extra state in global variables that are examined by custom recognizers and modified by custom actions. This has proven useful in parsing indentation, but probably has many other applications as well. It would be nice to have the parser keep track of this extra state and pass it along to both the recognizers and the actions.
Phase I: As mentioned in #5, there is a Context object that can be passed as an argument to the parse() method. This is currently passed along to actions, but not to recognizers. This context object could be used to store any extra state variables that are required. If this context object were passed to recognizers as well, then the external global state could be eliminated.
Phase II: If possible, it would be very useful to make this work for GLR as well. It seems like the context object, or a particular attribute of it, could be duplicated when GLR parser forks, keeping a separate copy for each fork. Perhaps copy.deepcopy() or some sort of custom clone() method on the extra state object could be used to perform this duplication.
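A rough sketch of the duplication idea (the Context class here is a stand-in for illustration, not parglare's actual Context):

```python
import copy

class Context:
    """Stand-in for a parser context carrying user state in `extra`."""
    def __init__(self, extra=None):
        self.extra = extra if extra is not None else {}

    def clone(self):
        # Deep-copy the user state so each GLR fork mutates its own copy.
        c = Context()
        c.extra = copy.deepcopy(self.extra)
        return c

main = Context({'indent_stack': [0]})
fork = main.clone()
fork.extra['indent_stack'].append(4)
print(main.extra['indent_stack'])  # [0] -- the original head is unaffected
print(fork.extra['indent_stack'])  # [0, 4]
```

A custom clone() hook, as suggested, would let users opt out of deepcopy for state that is expensive or unsafe to copy wholesale.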
Question: If this proposal seems acceptable, should the extra context argument be added to the beginning or end of the recognizer argument list? The actions include it at the beginning, but it might cause less trouble for existing code to add it at the end.
I have started experimenting with this, and have implemented some of the more straightforward changes for LR parsing here:
https://github.com/codecraftingtools/parglare/tree/recognizer-context
Please take a look at it and let me know what you think. I think this covers most of what is required for phase I.
Another question: In implementing indentation parsing, I needed to allow recognizers to recognize the empty string. This required modifying this line in _token_recognition()
last_prior = symbol.prior
tok = symbol.recognizer(input_str, position, context)
- if tok:
+ if tok is not None:
tokens.append(Token(symbol, tok))
to differentiate between an empty string ("") and no match (None). This change is included as a separate commit in the branch mentioned above. It seems like this small change makes things work, but it may break some other things I am not aware of, so I wanted to point it out.
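The distinction matters because an empty string is falsy in Python, so `if tok:` treats a successful empty match the same as a failed one:

```python
def recognized(tok):
    return tok is not None   # correct: only None means "no match"

def recognized_buggy(tok):
    return bool(tok)         # wrong: an empty-string match is dropped

print(recognized(""))        # True  -- empty match is still a match
print(recognized_buggy(""))  # False -- silently discarded
print(recognized(None))      # False -- genuine non-match
```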
Any comments would be appreciated. Also, please let me know if you would like me to do anything differently to make the collaboration workflow easier. Thanks again for all the work that has gone into this project.
parglare uses a pure BNF meta-language for grammar specification.
This leads to somewhat verbose grammars. There are a lot of places where the zero-or-more (*) or one-or-more (+) regex constructs could be used. parglare should expand these usages to additional productions with a name derived from the referenced rule and the regex operation used. The implicitly bound action could also be determined -- e.g. @collect for * and +.
In the documentation there is mention that it will be possible to not generate the parser on every startup... how far out is that feature?
Even though I had read that in the documentation before I started, I had naively assumed it would still be somehow possible to pickle the state of the parser and reload it on a future run. In fact, it is possible to serialize the parser using the dill module, but it doesn't work quite right after I reload it.
Currently there are shift/right and reduce/left disambiguators you can use per production to resolve conflicts during LALR table calculation.
Sometimes it is useful to better specify in which context these disambiguators should be used.
If shift is given, then when the parser sees the production it will choose to shift instead of reduce for any lookahead token. Better control could be achieved if shift could be given one or more lookahead tokens for which it should be used.
Something like:
MyProduction: some terms and non-terms {shift(term1, term2), reduce};
In this case, resolution would be reduce for all tokens except term1 and term2.
prefer_shifts and prefer_shifts_over_empty strategies can be defined globally during parser instantiation. They are overridden by explicit per-production rules for associativity (left/right).
The GLR parser can investigate both shift and reduce actions. There are situations where we want some of the prefer_* strategies to be applied globally but disabled for some productions.
Consider this example:
from parglare import Grammar, GLRParser

grammar = """
Program: "begin" statements=Statement* ProgramEnd EOF;
ProgramEnd: "end" | DOT;
Statement: "end" "transaction" | "command";
DOT: ".";
"""
g = Grammar.from_string(grammar, ignore_case=True)
parser = GLRParser(g, build_tree=True, prefer_shifts=True)
parser.parse("""
begin
command
end transaction
command
end transaction
command
end
""")
If we blindly use prefer_shifts in GLR, then statements in the Program rule will not be reduced when the final end keyword is encountered, but will be shifted in anticipation of an end transaction statement. Actually, we need 2 tokens of lookahead to decide whether the end token is the program end or the beginning of an end transaction statement. Thus, here we should let GLR investigate both shift (to check if it's end transaction) and reduce (to check if it's the end of the program).
The idea is to have nops and nopse disambiguation rules that will disable the global settings of prefer_shifts and prefer_shifts_over_empty at the production level.
Having fun trying to shoehorn a line-based language into the parser, which is somewhat asking for trouble of course. The best solution so far is to make LAYOUT handle only spaces and tabs, and have a dedicated eol non-terminal that handles line endings. That seems to mostly work, except the grammar needs to handle truly empty lines explicitly (not added here for simplicity).
Anyway, while experimenting I ran into a weird disambiguation problem. See:
# --------------------
from parglare import Grammar, Parser

gram_text = r"""
typedefheader : "define" "type" eol ;
eol : /\n/
| filecomment
| doccomment
;
filecomment : /#.*\n/ ;
doccomment : /#<.*\n/ {20} ;
LAYOUT: EMPTY | /[ \t]+/ ;
KEYWORD: /[-a-z]+/ ;
"""
# --------------------
text = "define type #< something\n"
g = Grammar.from_string(gram_text)
p = Parser(g)
print(p.parse(text))
print("############################")
print()
# --------------------
gram_text2 = r"""
typedefheader : "define" "type" eol ;
eol : /\n/
| filecomment
| doccomment
;
filecomment : /#.*\n/ ;
doccomment : docu=/#<.*\n/ {20} ; // Added "docu="
LAYOUT: EMPTY | /[ \t]+/ ;
KEYWORD: /[-a-z]+/ ;
"""
# --------------------
g2 = Grammar.from_string(gram_text2)
p2 = Parser(g2)
print(p2.parse(text))
In the first grammar, eol is defined as \n possibly prefixed by two different forms of comment: # filecomment and #< documentation comment. Since these two forms of comment are obviously ambiguous, the second form has a nice {20} attached to it to give it priority. So far so good.
Now I wanted to know if it actually worked, so I added docu= to the doccomment rule (printing the result would give me a named instance rather than plain text). However, the parser then crashed on the ambiguity:
['define', 'type', '#< something\n']
############################
Traceback (most recent call last):
File "/home/hat/projects/psi/psi5/compiler3/parglare/parser.py", line 213, in parse
ntok = next_token(cur_state, input_str, position)
File "/home/hat/projects/psi/psi5/compiler3/parglare/parser.py", line 537, in _next_token
ntok = self._lexical_disambiguation(tokens)
File "/home/hat/projects/psi/psi5/compiler3/parglare/parser.py", line 672, in _lexical_disambiguation
raise DisambiguationError(tokens)
parglare.exceptions.DisambiguationError: [<filecomment(#< something
)>, <#<.*\n(#< something
)>]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "named_assignment.py", line 43, in <module>
print(p2.parse(text))
File "/home/hat/projects/psi/psi5/compiler3/parglare/parser.py", line 217, in parse
disambiguation_error(e.tokens))
parglare.exceptions.ParseError: Error at position 1,12 => "fine type *#< somethi". Can't disambiguate between: <filecomment(#< something
)> or <#<.*\n(#< something
)>
Note the second line with hashes in the output, meaning that the first parse was fine.
Priorities as currently used are static in nature; they are used to resolve conflicts during LR table calculation.
It would be nice to have a way to specify dynamic priorities that will be used to choose the right tree from the parse forest after GLR finishes parsing. These priorities wouldn't be used for the LR tables but for post-processing of the parse forest, eliminating trees in which productions/terminals of lower priority were chosen.
In updating the code for the new common actions, I realized that I was parsing (A)
paragraphs but not handling them. It would be great if I could get:
Maybe this could be part of pglr command?
Using parentheses in the body of a rule and applying regex operators to them.
This will require creating new rules behind the scenes for each parentheses usage, as the LR table building process is based on BNF.
Naming of new rules? Maybe the name of the containing rule plus some suffix (e.g. the index of the parentheses occurrence).
Example:
A: (B C)* D;
translates to the following productions:
A_p1 = B
A_p1 = C
A_p1_0 = A_p1_1
A_p1_0 = EMPTY
A_p1_1 = A_p1_1 A_p1
A_p1_1 = A_p1
A = A_p1_0 D
This could be used in recognizers, actions, disambiguation/error recovery callbacks, error reporting etc. to extend the semantics of the parglare grammar language.
Ideas for the syntax (label is meta-data attached to the my_rule grammar rule):
<label: "My Rule">
my_rule: ... ;
or just extending the currently built-in meta-data in {}:
my_rule: ... {label:"My Rule"};
Once defined it could be accessed on grammar symbol as an attribute for example.
I like the latter syntax approach more, as it wouldn't introduce additional syntax noise, and with #17 it would also enable defining meta-data per production without any change in syntax.
Currently, static disambiguation filters/rules are defined per production using the {} syntax.
This issue will deal with defining the same at the grammar rule level. Production-level filters should be "stronger" and override those defined at the rule level.
An import statement for grammars, similar to textX.
If there is a string match in the grammar (i.e. a terminal rule) and another rule with the same name, the two will collide, leading to grammar errors:
Example:
Terminals:
"Terminals" ":" terminal_list=Terminal* ";"
;
This leads to
Error in the grammar file.
First set empty for grammar symbol "Terminals". An infinite recursion on the grammar symbol.
I am trying to implement loop behavior, specifically a while loop, using custom actions with my grammar. When I run the parser, it executes the statements before evaluating the loop conditions and then continues running the loop with no loop body.
I defined the grammar as below:
PROGRAM: STATEMENT_LIST;
@pass_nochange
STATEMENT_LIST: STATEMENT STATEMENT_LIST?;
STATEMENT: CONCEPT_STATEMENT
| EXPRESSION_STATEMENT
| ACCEPT_STATEMENT
| ANSWER_STATEMENT
| WHILE_STATEMENT
| IF_STATEMENT
| IMPORT_STATEMENT;
EXPRESSION_STATEMENT: EXPRESSION_STATEMENT AND_OP EXPRESSION_STATEMENT {right, 1}
| EXPRESSION_STATEMENT OR_OP EXPRESSION_STATEMENT {right, 1}
| EXPRESSION_STATEMENT GE_OP EXPRESSION_STATEMENT {right, 2}
| EXPRESSION_STATEMENT LE_OP EXPRESSION_STATEMENT {right, 2}
| EXPRESSION_STATEMENT GT_OP EXPRESSION_STATEMENT {right, 2}
| EXPRESSION_STATEMENT LT_OP EXPRESSION_STATEMENT {right, 2}
| EXPRESSION_STATEMENT EQ_OP EXPRESSION_STATEMENT {right, 2}
| EXPRESSION_STATEMENT NE_OP EXPRESSION_STATEMENT {right, 3}
| NOT_OP EXPRESSION_STATEMENT
| L_PAREN_PN EXPRESSION_STATEMENT R_PAREN_PN
| TAUTOLOGY
| CONTRADICTION
| IDENTIFIER
| NUMBER;
WHILE_STATEMENT: WHILE_KW EXPRESSION_STATEMENT STATEMENT_LIST;
And the actions below:
actions_dict["PROGRAM"] = lambda _, nodes: nodes[0]
actions_dict["STATEMENT"] = lambda _, nodes: nodes[0]
actions_dict["EXPRESSION_STATEMENT"] = [lambda _, nodes: nodes[0] and nodes[2],
lambda _, nodes: nodes[0] or nodes[2],
lambda _, nodes: nodes[0] >= nodes[2],
lambda _, nodes: nodes[0] <= nodes[2],
lambda _, nodes: nodes[0] > nodes[2],
lambda _, nodes: nodes[0] < nodes[2],
lambda _, nodes: nodes[0] == nodes[2],
lambda _, nodes: nodes[0] != nodes[2],
lambda _, nodes: not nodes[1],
lambda _, nodes: nodes[1],
lambda _, nodes: bool(nodes[0]),
lambda _, nodes: bool(nodes[0]),
lambda _, nodes: nodes[0],
lambda _, nodes: float(nodes[0])]
def while_func(context, nodes):
    print(nodes)
    while nodes[1]:
        nodes[0]
    return None
actions_dict["WHILE_STATEMENT"] = while_func
actions_dict["ACCEPT_STATEMENT"] = lambda _, nodes: print("the answer is 42")
This parses the statement "WHILE 1==1 accept.request" correctly, but when executing, it will print the answer once and then begin to execute the while loop with no output. I have tried executing the statement with and without the build-tree flag, but it gives the same behavior. I am looking for any advice on how to go about this.
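One reason for this behavior is that actions run exactly once, bottom-up, during reduction, so by the time the WHILE_STATEMENT action fires, its condition and body have already been evaluated to plain values. A common workaround is to have actions return zero-argument callables (thunks) and defer evaluation; a minimal sketch of the idea, independent of parglare:

```python
# Actions return thunks instead of values, so a while node can
# re-evaluate its condition and body on every iteration.
def make_number(n):
    return lambda: n

def make_less_than(lhs, rhs):
    return lambda: lhs() < rhs()

def make_while(cond, body):
    def run():
        while cond():
            body()
    return run

counter = {'n': 0}
def increment():
    counter['n'] += 1

loop = make_while(make_less_than(lambda: counter['n'], make_number(3)),
                  increment)
loop()
print(counter['n'])  # 3
```

In a parglare action dict this would mean every action returns a callable and the top-level result is invoked once after parsing; this is a sketch of the technique, not an existing parglare feature.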
Non-ASCII rules are not supported under Python 2.7. In Python 3 they work fine.
Simplest example:
# coding: utf-8
from parglare import Grammar
from parglare import Parser
grammar = Grammar.from_file("names.pg")
parser = Parser(grammar)
inp = 'МИША МЫЛ РАМУ'
print(inp)
result = parser.parse(inp)
print(result)
grammar:
LINE: FIO|SYMBOL;
FIO: /'МИША'|'САША'/;
SYMBOL: /\w+/;
Result in Python 2.7:
python ./names.py
МИША МЫЛ РАМУ
Traceback (most recent call last):
File "./names.py", line 9, in <module>
result = parser.parse(inp)
File "/usr/lib/python2.7/site-packages/parglare/parser.py", line 206, in parse
position)
File "/usr/lib/python2.7/site-packages/parglare/parser.py", line 480, in _skipws
while position < in_len and input_str[position] in self.ws:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
The current syntax, which makes terminal and non-terminal rules look similar and doesn't enforce separation between them, leads to confusion and unexpected changes in semantics. See #27.
Specifying terminals in a separate block in the grammar should lead to cleaner language semantics.
I tried to split text into paragraphs, keeping whitespace as significant terms.
As a first step of deriving the grammar, I tried to parse two separate lines instead of paragraphs.
When using a knowingly wrong grammar, which can parse only the first line, I expected an error. Instead, parglare simply parsed the first line and dropped the rest of the text.
Is this intentional behavior?
import re

from parglare import Grammar, Parser

g = Grammar.from_string('T: NL* L+ NL* | NL*; L: /.+/; NL: "\n";', re_flags=re.MULTILINE)
print(Parser(g, ws=False).parse('\nL1\nL2\n\n'))
# [['\n'], ['L1'], ['\n']]
I expected parse error, e.g. "expected STOP but received L2".
parglare reduction actions get their subresults by ordinal positions.
Named matches would provide getting subresults by names:
my_rule: first=first_match_rule second=second_match_rule;
first_match_rule: ...;
second_match_rule: ...;
Now in your action for my_rule you will get first and second as parameters.
This would make it easy to provide a new common action that will return a Python object with supplied parameters as object attributes.
@obj
my_rule: first=first_match_rule second=second_match_rule;
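A sketch of what such an obj common action could look like, assuming named matches are delivered to the action as keyword arguments (that delivery mechanism is the hypothetical part; it is not how parglare actions are called today):

```python
from types import SimpleNamespace

def obj(context, nodes, **named):
    # Hypothetical common action: turn named matches into object attributes.
    return SimpleNamespace(**named)

result = obj(None, [], first='a', second='b')
print(result.first, result.second)  # a b
```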
It would be hugely helpful if the * indicating where the parser failed were bright red. I hacked something together for arpeggio with regexp and http://click.pocoo.org/5/utils/#ansi-colors, but it would be great if this were built in to parglare.
parglare version: 0.6.1
Python version: Python 2.7.6
Operating System: Ubuntu 14.04
When running this program with Python 2:
from parglare import get_collector
action = get_collector()
def f(context, node):
return node
action('f_action')(f)
An error occurs:
Traceback (most recent call last):
File "test_unicode.py", line 9, in <module>
action('f_action')(f)
File "parglare/parglare/common.py", line 153, in __call__
return decorator(name_or_f)
File "parglare/parglare/common.py", line 140, in decorator
name = f.__name__
AttributeError: 'str' object has no attribute '__name__'
The error goes away when I add this line to the top of the file:
from __future__ import unicode_literals
The error does not occur when I run the original program with Python 3.4.0.
The diagram would depict the structure of the language.
Similar to textX meta-model visualization.
parglare version: 0.4.1
Python version: Python 2.7.14 (default, Jan 5 2018, 10:41:29)
[GCC 7.2.1 20171224] on linux2
Operating System: Arch Linux
simple calculator test
Grammar:
STMT : STMT ADDOP STMT {left, 1}
| STMT MULOP STMT {left, 2}
| "(" STMT ")" | NUMBER;
ADDOP : "+" | "-";
MULOP : "*"|"/";
NUMBER: /\d+(\.\d+)?/;
input:
1 + 2 / (3 - 1 + 5)
from parglare import Grammar
from parglare import Parser
grammar = Grammar.from_file("calc_pg.pg")
parser = Parser(grammar)
print('1 + 2 / (3 - 1 + 5)')
result = parser.parse('1 + 2 / (3 - 1 + 5)')
print(result)
1 + 2 / (3 - 1 + 5)
['1', u'+', ['2', u'/', [u'(', ['3', u'-', ['1', u'+', '5']], u')']]]
WTF? Why is 1 + 5 done before 3 - 1?
I've tried {left, 1} and {1, left} -- no better.
The classic grammar works better:
STMT : TERM | STMT ADDOP TERM ;
TERM : FACTOR | FACTOR MULOP FACTOR ;
FACTOR : "(" STMT ")" | NUMBER;
ADDOP : "+" | "-";
MULOP : "*"|"/";
NUMBER: /\d+(\.\d+)?/;
program output:
1 + 2 / (3 - 1 + 5)
['1', u'+', ['2', u'/', [u'(', [['3', u'-', '1'], u'+', '5'], u')']]]
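Evaluating the two nested-list results shows the difference numerically. Below is a throwaway evaluator over parglare's default list output (the '(' check is specific to this grammar's parenthesized production):

```python
def ev(node):
    if isinstance(node, str):
        return float(node)
    if node[0] == '(':            # ["(", expr, ")"] from the paren rule
        return ev(node[1])
    left, op, right = node        # binary node: [lhs, operator, rhs]
    ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
           '*': lambda a, b: a * b, '/': lambda a, b: a / b}
    return ops[op](ev(left), ev(right))

wrong = ['1', '+', ['2', '/', ['(', ['3', '-', ['1', '+', '5']], ')']]]
good  = ['1', '+', ['2', '/', ['(', [['3', '-', '1'], '+', '5'], ')']]]
print(ev(wrong))  # 1 + 2/(3 - (1 + 5)) = about 0.333
print(ev(good))   # 1 + 2/((3 - 1) + 5) = about 1.286
```

So the first grammar computes a genuinely different value, not just a different tree shape.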
There should be a way to control greedy behaviour and nops on repetitions.
This is a cool project! But I was miffed to find out that the readme example does not work. It looks like on May 22 terminal symbols were given their own grammar block... that is probably what broke the example.
I copied the code from the readme into a python script and tried to run it. I got this result:
ParseError: 9:8:";\nnumber: */\d+(\.\d+" => Expected: Name or StrTerm
Adding a line with the word terminals above the number: ... line fixed the example.
It seems that custom actions are being used for two separate purposes: 1) constructing a parse result (e.g. AST tree nodes), and 2) manipulating external state for use by recognizers or dynamic filters. This is illustrated by the new option added in #45 and the discussion in #5 about handling significant indentation. I think it would be more straightforward to split up the custom actions as currently implemented into two separate, but very similar, groups.
The purpose of the existing actions would be to generate a parse result, and the purpose of the side-effects actions would be to manipulate some extra parsing state for use by recognizers or dynamic filters. The side-effect actions would always be called during the parse, as they are really part of the parsing process, but the result generation actions would be optional and could be called after-the-fact. The return value of the side-effect actions would be ignored.
I assume there would also be side-effect decorators and the option to use a _side_effects.py file as with actions and recognizers.
Question: Should the side-effects list be passed in via the grammar constructor or the parser constructor? My first thought is that side-effects are more a part of the grammar and should go there, but passing them in through the parser would be fine as well.
Question: If this were implemented, would it eliminate the need for the extra option introduced in #45? Would the actions used by @alensuljkanovic fall cleanly into the side-effect category?
I have starting working on this, and think the implementation is very straight-forward, but I haven't published my work to a branch yet.
Any comments on this idea would be appreciated. Answers to these questions will also help give me direction, as they all will affect the implementation. Thanks!
parglare currently calls recognizers one by one and, in case of multiple matches, does lexical disambiguation on its own based on the priorities and the length of the match.
In some cases it would be better to register a handler that will receive all possible tokens expected at the current position and let the handler decide which token is next in the input.
Motivation: there are use-cases where we want to do fuzzy matching of tokens. This approach gives us a chance to decide which token is the best match at the given location in case there are no exact matches. We even get a chance to dynamically decide which token is the right one in case of multiple matches.
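A sketch of what such a handler could do for fuzzy matching, using difflib from the standard library (the handler signature here is made up for illustration; it is not an existing parglare callback):

```python
import difflib

def choose_token(input_str, position, expected_values):
    """Pick the expected token value closest to the next word in the input."""
    parts = input_str[position:].split()
    word = parts[0] if parts else ''
    matches = difflib.get_close_matches(word, expected_values, n=1, cutoff=0.6)
    return matches[0] if matches else None

# "begn" is not an exact match for any expected token, but "begin" is close.
print(choose_token("begn transaction", 0, ["begin", "end", "command"]))  # begin
```

Handing the full set of expected tokens to one callback like this is what makes the "no exact match" and "multiple matches" cases resolvable in one place.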
parglare sometimes accepts multiple definitions of the same non-terminal, and sometimes it doesn't.
Example case:
from parglare import Grammar
from parglare import Parser
text = "a"
gram1 = """\
A : "a" ;
A : "b" ;
"""
grammar = Grammar.from_string(gram1) # line 11
parser = Parser(grammar)
result = parser.parse(text)
print(result)
gram2 = """\
A : t="a" ;
A : t="b" ;
"""
grammar = Grammar.from_string(gram2) # line 21
parser = Parser(grammar)
result = parser.parse(text)
print(result)
Produces
a
Traceback (most recent call last):
File "m.py", line 21, in <module>
grammar = Grammar.from_string(gram2)
File "~/compiler3/parglare/grammar.py", line 615, in from_string
.parse(grammar_str, context=context),
File "~/compiler3/parglare/parser.py", line 381, in parse
context)
File "~/compiler3/parglare/parser.py", line 619, in _call_reduce_action
result = sem_action(context, subresults)
File "~/compiler3/parglare/grammar.py", line 961, in act_production_rule
.format(name))
parglare.exceptions.GrammarError: Multiple definition for Rule/Class "A"
Note that it crashes at line 21, but not line 11.
EDIT:
The only difference between the two grammars is the t= additions in gram2.
Merging both alternatives with a | avoids the crash.
The error message should be based on the full parser state at the position in the input where parsing failed, not only on the current LR state.
In GLR, the error should be reported based on the set of heads (i.e. their full state) that got furthest in the input stream before failing.
This boils down to filtering out all invalid LR items in the current state based on the previous states on the parser stack(s).
Currently grammar can be imported from file system by giving relative path from the importing grammar.
In some use-cases one would want to import grammar deployed with some Python package.
In that case additional import variant could be used in the form:
import some.python.module.grammar as a;
grammar should be a Python variable of type string on the module some.python.module, containing a path to the grammar file, like:
grammar = os.path.join(os.path.dirname(__file__), 'grammar.pg')
If a special terminal rule KEYWORD is defined (it must be a regex match), then all string recognizers whose value matches the KEYWORD will match only if surrounded by white-space.
This will prevent matching a keyword as part of some other element.
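The usual way to implement this kind of check is with word-boundary anchors; whether done via \b or literal white-space tests, the effect is similar to the following (illustrative only, not parglare internals):

```python
import re

keyword = re.compile(r'\bbegin\b')
print(bool(keyword.match('begin transaction')))  # True: standalone keyword
print(bool(keyword.match('beginning')))          # False: part of a longer word
```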