parsimonious's Introduction

Parsimonious

Parsimonious aims to be the fastest arbitrary-lookahead parser written in pure Python—and the most usable. It's based on parsing expression grammars (PEGs), which means you feed it a simplified sort of EBNF notation. Parsimonious was designed to undergird a MediaWiki parser that wouldn't take 5 seconds or a GB of RAM to do one page, but it's applicable to all sorts of languages.

Code: https://github.com/erikrose/parsimonious/
Issues: https://github.com/erikrose/parsimonious/issues
License: MIT License (MIT)
Package: https://pypi.org/project/parsimonious/

Goals

  • Speed
  • Frugal RAM use
  • Minimalistic, understandable, idiomatic Python code
  • Readable grammars
  • Extensible grammars
  • Complete test coverage
  • Separation of concerns. Some Python parsing kits mix recognition with instructions about how to turn the resulting tree into some kind of other representation. This is limiting when you want to do several different things with a tree: for example, render wiki markup to HTML or to text.
  • Good error reporting. I want the parser to work with me as I develop a grammar.

Install

To install Parsimonious, run:

$ pip install parsimonious

Example Usage

Here's how to build a simple grammar:

>>> from parsimonious.grammar import Grammar
>>> grammar = Grammar(
...     """
...     bold_text  = bold_open text bold_close
...     text       = ~"[A-Z 0-9]*"i
...     bold_open  = "(("
...     bold_close = "))"
...     """)

You can have forward references and even right recursion; it's all taken care of by the grammar compiler. The first rule is taken to be the default start symbol, but you can override that.
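For instance, you can re-root the grammar at another rule. A hedged sketch: the default() method is per the 0.6 changelog entry below, and indexing a grammar by rule name yields that rule's expression:

text_grammar = grammar.default('text')   # new grammar whose start symbol is "text"
text_grammar.parse('chunky bacon')
# One-off alternative: grammar['text'].parse('chunky bacon')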

Next, let's parse something and get an abstract syntax tree:

>>> print(grammar.parse('((bold stuff))'))
<Node called "bold_text" matching "((bold stuff))">
    <Node called "bold_open" matching "((">
    <RegexNode called "text" matching "bold stuff">
    <Node called "bold_close" matching "))">

You'd typically then use a nodes.NodeVisitor subclass (see below) to walk the tree and do something useful with it.

Another example would be to implement a parser for .ini files. Consider the following:

grammar = Grammar(
    r"""
    expr        = (entry / emptyline)*
    entry       = section pair*

    section     = lpar word rpar ws
    pair        = key equal value ws?

    key         = word+
    value       = (word / quoted)+
    word        = ~r"[-\w]+"
    quoted      = ~'"[^\"]+"'
    equal       = ws? "=" ws?
    lpar        = "["
    rpar        = "]"
    ws          = ~"\s*"
    emptyline   = ws+
    """
)

We could now implement a subclass of NodeVisitor like so:

class IniVisitor(NodeVisitor):
    def visit_expr(self, node, visited_children):
        """ Returns the overall output. """
        output = {}
        for child in visited_children:
            output.update(child[0])
        return output

    def visit_entry(self, node, visited_children):
        """ Makes a dict of the section (as key) and the key/value pairs. """
        key, values = visited_children
        return {key: dict(values)}

    def visit_section(self, node, visited_children):
        """ Gets the section name. """
        _, section, *_ = visited_children
        return section.text

    def visit_pair(self, node, visited_children):
        """ Gets each key/value pair, returns a tuple. """
        key, _, value, *_ = node.children
        return key.text, value.text

    def generic_visit(self, node, visited_children):
        """ The generic visit method. """
        return visited_children or node

And call it like this:

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

data = """[section]
somekey = somevalue
someotherkey=someothervalue

[anothersection]
key123 = "what the heck?"
key456="yet another one here"

"""

tree = grammar.parse(data)

iv = IniVisitor()
output = iv.visit(tree)
print(output)

This would yield:

{'section': {'somekey': 'somevalue', 'someotherkey': 'someothervalue'}, 'anothersection': {'key123': '"what the heck?"', 'key456': '"yet another one here"'}}

Status

  • Everything that exists works. Test coverage is good.
  • I don't plan on making any backward-incompatible changes to the rule syntax in the future, so you can write grammars with confidence.
  • It may be slow and use a lot of RAM; I haven't measured either yet. However, I have yet to begin optimizing in earnest.
  • Error reporting is now in place. repr methods of expressions, grammars, and nodes are clear and helpful as well. The Grammar ones are even round-trippable!
  • The grammar extensibility story is underdeveloped at the moment. You should be able to extend a grammar by simply concatenating more rules onto the existing ones; later rules of the same name should override previous ones. However, this is untested and may not be the final story.
  • Sphinx docs are coming, but the docstrings are quite useful now.
  • Note that there may be API changes until we get to 1.0, so be sure to pin to the version you're using.

Coming Soon

  • Optimizations to make Parsimonious worthy of its name
  • Tighter RAM use
  • Better-thought-out grammar extensibility story
  • Amazing grammar debugging

A Little About PEG Parsers

PEG parsers don't draw a distinction between lexing and parsing; everything is done at once. As a result, there is no lookahead limit, as there is with, for instance, Yacc. And, due to both of these properties, PEG grammars are easier to write: they're basically just a more practical dialect of EBNF. With caching, they take O(grammar size * text length) memory (though I plan to do better), but they run in O(text length) time.

More Technically

PEGs can describe a superset of LL(k) languages, any deterministic LR(k) language, and many others—including some that aren't context-free (http://www.brynosaurus.com/pub/lang/peg.pdf). They can also deal with what would be ambiguous languages if described in canonical EBNF. They do this by trading the | alternation operator for the / operator, which works the same except that it makes priority explicit: a / b / c first tries matching a. If that fails, it tries b, and, failing that, moves on to c. Thus, ambiguity is resolved by always yielding the first successful recognition.
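A tiny example (a hedged sketch with a made-up rule) makes the ordering concrete:

from parsimonious.grammar import Grammar

# "in" is tried before "int", so "int" can never win; ordered choice
# settles the would-be ambiguity purely by the order you wrote.
g = Grammar('keyword = "in" / "int"')
g.parse('in')    # succeeds via the first alternative
g.parse('int')   # raises IncompleteParseError: "in" matched, leaving "t"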

Writing Grammars

Grammars are defined by a series of rules. The syntax should be familiar to anyone who uses regexes or reads programming language manuals. An example will serve best:

my_grammar = Grammar(r"""
    styled_text = bold_text / italic_text
    bold_text   = "((" text "))"
    italic_text = "''" text "''"
    text        = ~"[A-Z 0-9]*"i
    """)

You can wrap a rule across multiple lines if you like; the syntax is very forgiving.
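For example, this is equivalent to the grammar above (a hedged sketch; splitting rules across lines has been supported since 0.5, per the version history below):

my_grammar = Grammar(r"""
    styled_text = bold_text /
                  italic_text
    bold_text   = "((" text "))"
    italic_text = "''" text "''"
    text        = ~"[A-Z 0-9]*"i
    """)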

If you want to save your grammar in a separate file, you should name it with the .ppeg extension.

Syntax Reference

"some literal" Used to quote literals. Backslash escaping and Python conventions for "raw" and Unicode strings help support fiddly characters.
b"some literal" A bytes literal. Using bytes literals and regular expressions allows your grammar to parse binary files. Note that all literals and regular expressions must be of the same type within a grammar. In grammars that process bytestrings, you should make the grammar string an r"""string""" so that byte literals like \xff work correctly.
[space] Sequences are made out of space- or tab-delimited things. a b c matches spots where those 3 terms appear in that order.
a / b / c Alternatives. The first to succeed of a / b / c wins.
thing? An optional expression. This is greedy, always consuming thing if it exists.
&thing A lookahead assertion. Ensures thing matches at the current position but does not consume it.
!thing A negative lookahead assertion. Matches if thing isn't found here. Doesn't consume any text.
things* Zero or more things. This is greedy, always consuming as many repetitions as it can.
things+ One or more things. This is greedy, always consuming as many repetitions as it can.
~r"regex"ilmsuxa Regexes have ~ in front and are quoted like literals. Any flags (asilmx) follow the end quotes as single chars. Regexes are good for representing character classes ([a-z0-9]) and optimizing for speed. The downside is that they won't be able to take advantage of our fancy debugging, once we get that working. Ultimately, I'd like to deprecate explicit regexes and instead have Parsimonious dynamically build them out of simpler primitives. Parsimonious uses the regex library instead of the built-in re module.
~br"regex" A bytes regex; required if your grammar parses bytestrings.
(things) Parentheses are used for grouping, like in every other language.
thing{n} Exactly n repetitions of thing.
thing{n,m} Between n and m repetitions (inclusive).
thing{,m} At most m repetitions of thing.
thing{n,} At least n repetitions of thing.
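To make the lookahead operators concrete, here's the classic keyword-exclusion idiom (a hedged sketch; the rule names are made up):

from parsimonious.grammar import Grammar

# !keyword succeeds only where "if"/"while" does NOT appear as a whole
# word, so "ifx" parses as an identifier while bare "if" is rejected.
g = Grammar(r"""
    identifier = !keyword letters
    keyword    = ("if" / "while") !letters
    letters    = ~r"[a-z]+"
    """)
g.parse('ifx')   # succeeds: "if" is followed by more letters
g.parse('if')    # raises ParseError: the input is exactly a keyword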

Optimizing Grammars

Don't Repeat Expressions

If you need a ~"[a-z0-9]"i at two points in your grammar, don't type it twice. Make it a rule of its own, and reference it from wherever you need it. You'll get the most out of the caching this way, since cache lookups are by expression object identity (for speed).

Even if you have an expression that's very simple, not repeating it will save RAM, as there can, at worst, be a cached int for every char in the text you're parsing. In the future, we may identify repeated subexpressions automatically and factor them up while building the grammar.

How much should you shove into one regex, versus how much should you break them up to not repeat yourself? That's a fine balance and worthy of benchmarking. More stuff jammed into a regex will execute faster, because it doesn't have to run any Python between pieces, but a broken-up one will give better cache performance if the individual pieces are re-used elsewhere. If the pieces of a regex aren't used anywhere else, by all means keep the whole thing together.
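A hedged before-and-after sketch of the advice:

from parsimonious.grammar import Grammar

# Before: the same regex spelled twice compiles to two distinct
# expression objects, so their cache entries are never shared.
duplicated = Grammar(r"""
    pair = ~"[a-z0-9]+"i "=" ~"[a-z0-9]+"i
    """)

# After: one "ident" rule referenced twice is a single shared object,
# so a cache hit at a given position serves both uses.
factored = Grammar(r"""
    pair  = ident "=" ident
    ident = ~"[a-z0-9]+"i
    """)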

Quantifiers

Bring your ? and * quantifiers up to the highest level you can. Otherwise, lower-level patterns could succeed but be empty and put a bunch of useless nodes in your tree that didn't really match anything.
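A hedged sketch of the difference:

from parsimonious.grammar import Grammar

# Quantifier pushed down: every word drags along a ws? node, empty or not.
low = Grammar(r"""
    words   = word_ws+
    word_ws = word ws?
    word    = ~r"[a-z]+"
    ws      = ~r"\s+"
    """)

# Quantifier hoisted: the optional whitespace lives at the sequence level,
# so words with no trailing space don't sprout empty placeholder nodes.
high = Grammar(r"""
    words = word (ws word)*
    word  = ~r"[a-z]+"
    ws    = ~r"\s+"
    """)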

Processing Parse Trees

A parse tree has a node for each expression matched, even if it matched a zero-length string, like "thing"? might.

The NodeVisitor class provides an inversion-of-control framework for walking a tree and returning a new construct (tree, string, or whatever) based on it. For now, have a look at its docstrings for more detail. There's also a good example in grammar.RuleVisitor. Notice how we take advantage of nodes' iterability by unpacking the visited children inside the visitor (tuple unpacking in parameter lists is Python 2 syntax and is gone in Python 3):

def visit_or_term(self, node, visited_children):
    slash, _, term = visited_children
    ...

For reference, here is the production the above unpacks:

or_term = "/" _ term

When something goes wrong in your visitor, you get a nice error like this:

[normal traceback here...]
VisitationException: 'Node' object has no attribute 'foo'

Parse tree:
<Node called "rules" matching "number = ~"[0-9]+"">  <-- *** We were here. ***
    <Node matching "number = ~"[0-9]+"">
        <Node called "rule" matching "number = ~"[0-9]+"">
            <Node matching "">
            <Node called "label" matching "number">
            <Node matching " ">
                <Node called "_" matching " ">
            <Node matching "=">
            <Node matching " ">
                <Node called "_" matching " ">
            <Node called "rhs" matching "~"[0-9]+"">
                <Node called "term" matching "~"[0-9]+"">
                    <Node called "atom" matching "~"[0-9]+"">
                        <Node called "regex" matching "~"[0-9]+"">
                            <Node matching "~">
                            <Node called "literal" matching ""[0-9]+"">
                            <Node matching "">
            <Node matching "">
            <Node called "eol" matching "
            ">
    <Node matching "">

The parse tree is tacked onto the exception, and the node whose visitor method raised the error is pointed out.

Why No Streaming Tree Processing?

Some have asked why we don't process the tree as we go, SAX-style. There are two main reasons:

  1. It wouldn't work. With a PEG parser, no parsing decision is final until the whole text is parsed. If we had to change a decision, we'd have to backtrack and redo the SAX-style interpretation as well, which would involve reconstituting part of the AST and quite possibly scuttling whatever you were doing with the streaming output. (Note that some bursty SAX-style processing may be possible in the future if we use cuts.)
  2. It interferes with the ability to derive multiple representations from the AST: for example, turning wiki markup into first HTML and then text.

Future Directions

Rule Syntax Changes

  • Maybe support left-recursive rules like PyMeta, if anybody cares.
  • Ultimately, I'd like to get rid of explicit regexes and break them into more atomic things like character classes. Then we can dynamically compile bits of the grammar into regexes as necessary to boost speed.

Optimizations

  • Make RAM use almost constant by automatically inserting "cuts", as described in http://ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf. This would also improve error reporting, as we wouldn't backtrack out of everything informative before finally failing.
  • Find all the distinct subexpressions, and unify duplicates for a better cache hit ratio.
  • Think about having the user (optionally) provide some representative input along with a grammar. We can then profile against it, see which expressions are worth caching, and annotate the grammar. Perhaps there will even be positions at which a given expression is more worth caching. Or we could keep a count of how many times each cache entry has been used and evict the most useless ones as RAM use grows.
  • We could possibly compile the grammar into VM instructions, like in "A parsing machine for PEGs" by Medeiros.
  • If the recursion gets too deep in practice, use trampolining to dodge it.

Niceties

Version History

(Next release)
  • ...
0.10.0
  • Fix infinite recursion in __eq__ in some cases. (FelisNivalis)
  • Improve error message in left-recursive rules. (lucaswiman)
  • Add support for range {min,max} repetition expressions (righthandabacus)
  • Fix bug in * and + for token grammars (lucaswiman)
  • Add support for grammars on bytestrings (lucaswiman)
  • Fix LazyReference resolution bug #134 (righthandabacus)
  • ~15% speedup on benchmarks with a faster node cache (ethframe)

Warning

This release makes backward-incompatible changes:

  • Fix precedence of string literal modifiers u/r/b. This will break grammars with no spaces between a reference and a string literal. (lucaswiman)
0.9.0
  • Add support for Python 3.7, 3.8, 3.9, 3.10 (righthandabacus, Lonnen)
  • Drop support for Python 2.x, 3.3, 3.4 (righthandabacus, Lonnen)
  • Remove six and go all in on Python 3 idioms (Lonnen)
  • Replace re with regex for improved handling of unicode characters in regexes (Oderjunkie)
  • Dropped nose for unittest (swayson)
  • Grammar.__repr__() now correctly escapes backslashes (ingolemo)
  • Custom rules can now be class methods in addition to functions (James Addison)
  • Make the ascii flag available in the regex syntax (Roman Inflianskas)
0.8.1
  • Switch to a function-style print in the benchmark tests so we work cleanly as a dependency on Python 3. (Edward Betts)
0.8.0
  • Make Grammar iteration ordered, making the __repr__ more like the original input. (Lucas Wiman)
  • Improve text representation and error messages for anonymous subexpressions. (Lucas Wiman)
  • Expose BadGrammar and VisitationError as top-level imports.
  • No longer crash when you try to compare a Node to an instance of a different class. (Esben Sonne)
  • Pin six at 1.9.0 to ensure we have python_2_unicode_compatible. (Sam Raker)
  • Drop Python 2.6 support.
0.7.0
  • Add experimental token-based parsing, via TokenGrammar class, for those operating on pre-lexed streams of tokens. This can, for example, help parse indentation-sensitive languages that use the "off-side rule", like Python. (Erik Rose)
  • Common codebase for Python 2 and 3: no more 2to3 translation step (Mattias Urlichs, Lucas Wiman)
  • Drop Python 3.1 and 3.2 support.
  • Fix a bug in Grammar.__repr__ which fails to work on Python 3 since the string_escape codec is gone in Python 3. (Lucas Wiman)
  • Don't lose parentheses when printing representations of expressions. (Michael Kelly)
  • Make Grammar an immutable mapping (until we add automatic recompilation). (Michael Kelly)
0.6.2
  • Make grammar compilation 100x faster. Thanks to dmoisset for the initial patch.
0.6.1
  • Fix bug which made the default rule of a grammar invalid when it contained a forward reference.
0.6

Warning

This release makes backward-incompatible changes:

  • The default_rule arg to Grammar's constructor has been replaced with a method, some_grammar.default('rule_name'), which returns a new grammar just like the old except with its default rule changed. This is to free up the constructor kwargs for custom rules.
  • UndefinedLabel is no longer a subclass of VisitationError. This matters only in the unlikely case that you were catching VisitationError exceptions and expecting to thus also catch UndefinedLabel.
  • Add support for "custom rules" in Grammars. These provide a hook for simple custom parsing hooks spelled as Python lambdas. For heavy-duty needs, you can put in Compound Expressions with LazyReferences as subexpressions, and the Grammar will hook them up for optimal efficiency--no calling __getitem__ on Grammar at parse time.
  • Allow grammars without a default rule (in cases where there are no string rules), which leads to also allowing empty grammars. Perhaps someone building up grammars dynamically will find that useful.
  • Add @rule decorator, allowing grammars to be constructed out of notations on NodeVisitor methods. This saves looking back and forth between the visitor and the grammar when there is only one visitor per grammar.
  • Add parse() and match() convenience methods to NodeVisitor. This makes the common case of parsing a string and applying exactly one visitor to the AST shorter and simpler.
  • Improve exception message when you forget to declare a visitor method.
  • Add unwrapped_exceptions attribute to NodeVisitor, letting you name certain exceptions which propagate out of visitors without being wrapped by VisitationError exceptions.
  • Expose much more of the library in __init__, making your imports shorter.
  • Drastically simplify reference resolution machinery. (Vladimir Keleshev)
0.5

Warning

This release makes some backward-incompatible changes. See below.

  • Add alpha-quality error reporting. Now, rather than returning None, parse() and match() raise ParseError if they don't succeed. This makes more sense, since you'd rarely attempt to parse something and not care if it succeeds. It was too easy before to forget to check for a None result. ParseError gives you a human-readable unicode representation as well as some attributes that let you construct your own custom presentation.
  • Grammar construction now raises ParseError rather than BadGrammar if it can't parse your rules.
  • parse() now takes an optional pos argument, like match().
  • Make the __str__() method of UndefinedLabel return the right type.
  • Support splitting rules across multiple lines, interleaving comments, putting multiple rules on one line (but don't do that) and all sorts of other horrific behavior.
  • Tolerate whitespace after opening parens.
  • Add support for single-quoted literals.
0.4
  • Support Python 3.
  • Fix import * for parsimonious.expressions.
  • Rewrite grammar compiler so right-recursive rules can be compiled and parsing no longer fails in some cases with forward rule references.
0.3
  • Support comments, the ! ("not") operator, and parentheses in grammar definition syntax.
  • Change the & operator to a prefix operator to conform to the original PEG syntax. The version in Parsing Techniques was infix, and that's what I used as a reference. However, the unary version is more convenient, as it lets you spell AB & A as simply A &B.
  • Take the print statements out of the benchmark tests.
  • Give Node an evaluate-able __repr__.
0.2
  • Support matching of prefixes and other not-to-the-end slices of strings by making match() public and able to initialize a new cache. Add match() callthrough method to Grammar.
  • Report a BadGrammar exception (rather than crashing) when there are mistakes in a grammar definition.
  • Simplify grammar compilation internals: get rid of superfluous visitor methods and factor up repetitive ones. Simplify rule grammar as well.
  • Add NodeVisitor.lift_child convenience method.
  • Rename VisitationException to VisitationError for consistency with the standard Python exception hierarchy.
  • Rework repr and str values for grammars and expressions. Now they both look like rule syntax. Grammars are even round-trippable! This fixes a unicode encoding error when printing nodes that had parsed unicode text.
  • Add tox for testing. Stop advertising Python 2.5 support, which never worked (and won't unless somebody cares a lot, since it makes Python 3 support harder).
  • Settle (hopefully) on the term "rule" to mean "the string representation of a production". Get rid of the vague, mysterious "DSL".
0.1
  • A rough but usable preview release

Thanks to Wiki Loves Monuments Panama for showing their support with a generous gift.

parsimonious's People

Contributors

andreacrotti, boxconnect, cknv, edwardbetts, erikrose, ethframe, felisnivalis, ingolemo, jayaddison, jwhitlock, kkirsche, kolanich, lonnen, lucaswiman, mdamien, moreati, nbrunoloff, oderjunkie, pavel-kirienko, righthandabacus, rominf, smurfix, swayson, timgates42, zhuzilin

parsimonious's Issues

Finalize grammar composition story

Reconsider dict-like grammars:

However, pros:

  • You can .update() them to compose. OTOH, won't we have to redo a bunch of compilation and optimization on update? That seems like it should be named something more explicit, like compose(). [Ed: update() is not a very good parallel for the type of thing extends will do; extends maintains Grammar boundaries (for delegation), while update() just paves over keys, one at a time.]
  • You can enumerate the rules without getting repr and such back.
  • You can use Python keywords as rule names.
  • Rule names don't collide with other Grammar method names.

If we were to store expressions on methods of a grammar, we could change or alias Expression.parse to Expression.__call__.

Strange behavior

I'm getting this strange behavior using the latest "master":

>>> g = """
...  digits = digit+ 
...  int = digits 
...  digit = ~"[0-9]" 
...  number = int 
...  main = number 
... """
>>> Grammar(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "parsimonious/grammar.py", line 109, in __repr__
    return "Grammar('%s')" % str(self).encode('string_escape')
  File "parsimonious/utils.py", line 9, in __str__
    return self.__unicode__().encode('utf-8')
  File "parsimonious/grammar.py", line 105, in __unicode__
    return '\n'.join(expr.as_rule() for expr in exprs)
  File "parsimonious/grammar.py", line 105, in <genexpr>
    return '\n'.join(expr.as_rule() for expr in exprs)
AttributeError: 'LazyReference' object has no attribute 'as_rule'
>>> Grammar(g)['main']
u'int'

How do I attach callbacks to terminal nodes?

Continuing from the previous example - I'd like to be able to define some kind of callback which will be called any time the parser encounters a terminal.

So continuing on from my previous example, supposing the parser had successfully parsed the expression:

"go north, south"

And supposing that matches a particular rule for 'movement instruction' - I'd like a particular callback to be invoked any time we have one of those.

Can it be done?

Generate a parse tree

Each Expression can collect its bit of the tree and tag it with the LHS of the rule that birthed it (and maybe line number, column number, and whatever else we can think of). If it's a subexpression and has no LHS, do something quieter, I guess.

Make one-visitor-to-one-grammar case simpler

Many situations have only a single visitor coupled to one grammar. It's great that Parsimonious makes it easy to decouple those (rendering plain text and HTML out of wiki text, for instance), but it would also be nice not to have to have something like this in everybody's client code:

def create_handler(grammar, visitor):
    visitor_inst = visitor()
    return lambda x: visitor_inst.visit(grammar.parse(x))

Maybe we should be able to register a default visitor with a grammar. Actually, that's upside down. It's very unlikely we could recycle a visitor for use with more than one grammar, so let's make visitors aware of their grammars. NodeVisitor should have parse() and match() methods which run a string through the registered grammar (stored on an attr or something), visit it, and return the result:

@classmethod
def parse(cls, *args, **kwargs):
    return cls().visit(cls.grammar.parse(*args, **kwargs))

So then we could say just...

visitor.parse(x)

Preserve parentheses in repr for Grammar

Using latest PyPI version:

Python 2.7.3 (default, Dec 17 2012, 20:20:42) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from parsimonious.grammar import Grammar
>>> Grammar('foo = "bar" ("baz" "eggs")* "spam"')
Grammar('foo = "bar" "baz" "eggs"* "spam"')

Easy way to handle consecutive whitespaces?

It is common in programming languages to treat consecutive whitespace characters as a single delimiter, but it seems that there is no easy way to do it in parsimonious. I am new to PEG, so maybe I am wrong.

Currently I'm writing a grammar like this

type = (space simple_type space "<" space type_list space ">" space) / ( space simple_type space )
simple_type = ......
type_list = ......
......
space = ~"[ \t]*"

It is so tedious to insert a space everywhere, and easily forgettable because I hit the Space key on my keyboard and then forget to insert a space.

Generalizing and augmenting PEGs

I wish to parse (1) binary data using PEGs, or (2) token streams (like #40). Also, I wish to parse (3) more grammars than PEGs allow by somehow augmenting the grammar with some Python code and maybe some state, because I don't want to throw away my grammar and rewrite the parser manually as soon as I want to add a feature that is not parseable by PEGs, like, say, whitespace-sensitivity.

What do you think?

Support \n etc. more easily

It's awkward to express LFs, CRs, etc. in grammars, because Python tends to replace them with actual newlines, which are no-ops. It works in the grammar DSL's grammar because they're wrapped in regexes, but that shouldn't be required. Ford's original PEG grammar supports \n\r\t'"{}\ and some numerics. We should probably go that way.

Lots of examples + online doc

It looks like a good library to me, but I'm unable to use it properly, so I can't comment on how good it is.

There are almost no examples (only two) and only the ReadMe as documentation.

To make it popular, I feel it needs examples and full documentation.

I really want to use it. I'm not a Python pro, but I can understand many things easily, and examples surely help in learning and using a library.

If I missed anything, please let me know.

Thanks,
Yash
KineticWing IDE

Specifying operator precedence and associativity declaratively

In yacc, it is very convenient to use %left and %right to specify operator associativity, as well as precedence. Handling associativity and precedence manually is error-prone.

I have no API in mind, but, maybe, the extensibility story can somehow allow this.

Simple test failed

I can't find any tutorial or examples, but I found this:

    g = Grammar('''
                polite_greeting = greeting ", my good sir"
                greeting        = Hi / Hello
                ''')
    g.parse('Hello, my good sir')

    g['greeting'].parse('Hi')

But I got:

KeyError: 'greeting'

And I checked the source code; it seems the PEG isn't processed at all in Grammar's __init__(). What should I do next? Is it a bug?

Unnamed nodes should make up useful names for themselves

When you print a node, you get something like this:

<Node called "char_range" matching "c-a">
    <Node called "class_char" matching "c">
        <RegexNode matching "c">
    <Node matching "-">
    <Node called "class_char" matching "a">
        <RegexNode matching "a">

Named nodes get names, but unnamed ones (like the one matching "-") don't. Especially when debugging—for instance, ending up in generic_visit and not knowing what node triggered that dispatch—it would be helpful if nodes would identify themselves by their rule expression: <Node "class_item*" matching "abc">. Then we take the "called" out of the named nodes, and everything is frighteningly consistent.

Implement Grammar object

Implement parsing of a PEG DSL to make grammars easy to define. In particular, recursion and forward references are a pain in pure Python, and the DSL can be leaps and bounds more concise. (At least I always had trouble reading PyParsing's definitions, and Pijnu's DSL is quite legible.)

This can conceivably wait a while—at least until I can't bear to write out the expression webs by hand anymore.

Make Grammar less of a dict

The write-oriented dict methods don't make sense on Grammars. We shouldn't invite calls to them, at least not until we have some pretty comprehensive recompilation machinery hooked up to them.

  • update()
  • __setitem__()
  • pop() and popitem()
  • setdefault()
  • __delitem__()

Arbitrary deviations in syntax from original PEG paper

Maybe there is a good reason for this, but why does parsimonious deviate in syntax from the original PEG paper?

I can understand addition of regular expression syntax (for practicality), but why not a proper superset of PEG ASCII syntax?

Also, PEG grammar allows for multi-line rules thanks to:

Primary <- Identifier !LEFTARROW

but parsimonious does not (apparently)

And, BTW, thanks for the great library, it is by far my favorite PEG implementation.

Allow for rules that don't create separate nodes

It's often useful to define some basic rules that can be composited to form larger expressions:

digit = ~r"[0-9]"
number = digit+

However, there is not always any reason why each separate digit needs its own node, even if it's useful as a "building block" in the grammar. As such, it would be nice to be able to mark a rule as "non-node creating". The library simpleparse supports this by creating 'unreported productions' that would translate into something like this for parsimonious:

<digit> = ~r"[0-9]"
number = digit+

Then, any number node could simply be the number, and we don't get a bunch of (unneeded) nodes for each component – because the angle brackets signify that this rule should not be "reported" separately.

I suppose this could either be implemented while parsing the grammar – replacing any reference to such a rule with simply its contents (at a preprocessing stage), or as part of the parsing of the text...

peg unused

In grammar.py:_rules_from_peg, the "peg" argument is unused in the function, even though it's passed from the constructor.

Is it maybe not useful anymore?

Write real documentation, add Sphinx

I never really wrote meaningful documentation; the readme started out as notes to myself and just metastasized. Run Sphinx over the project, write some real docs, and publish to readthedocs.

setting rule precedence

I've translated the entire Modelica EBNF grammar to PEG and used parsimonious to create a pure-Python Modelica compiler.

https://github.com/jgoppert/modpeg/blob/master/modpeg/parser.py

Please check it out and let me know if you see any obvious mistakes as this is my first adventure with PEG.

My issue is that some of the statements are matching a generic regex before my keywords. Is there a way to set precedence on rules?

original EBNF grammar

    element_list = ((element semicolon)/(annotation semicolon))*

modified PEG grammar where end/equation/algorithm are keywords that I have to ensure don't match the generic name of an element in the list

   element_list = (!(end/equation/algorithm)
        ((element semicolon)/(annotation semicolon)))*

Tree transforms

It would be better if parsimonious had something like Suppress in pyparsing; that way we could check for a keyword's existence but not represent it as a node in the tree.

Reduce visitor pain

It's a pain to write visitors. I can never remember what the formal params are. I want to write a code generator for them, but that's an antipattern. How can we make those go away or shrink?

Remove whitespace: ignore it automatically, as pyparsing does

test = visibility ws* function_keyword ws* word ws* arguments* ws*

This rule matches

sample2 = """public function __construct( )"""

The problem: I don't want to mention ws* everywhere; it should be handled automatically, like pyparsing does.

How do I enable that?

I tried

test = [visibility function_keyword word arguments*]

And

test = (visibility function_keyword word arguments*)

With no luck.

I don't know if it's a bug, if I'm doing something wrong, or if this is a feature request.

I just want to ignore all types of whitespace automatically. How can I achieve this?

Thanks
Yash

/ operator only allows single terms

Based on experience with other PEG parser generators, I'd think this should work:

>>> from parsimonious import Grammar
>>> Grammar("a = b c / d")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eevee/.local/lib/python3.4/site-packages/parsimonious/grammar.py", line 63, in __init__
    exprs, first = self._expressions_from_rules(rules)
  File "/home/eevee/.local/lib/python3.4/site-packages/parsimonious/grammar.py", line 78, in _expressions_from_rules
    tree = rule_grammar.parse(rules)
  File "/home/eevee/.local/lib/python3.4/site-packages/parsimonious/grammar.py", line 83, in parse
    return self.default_rule.parse(text, pos=pos)
  File "/home/eevee/.local/lib/python3.4/site-packages/parsimonious/expressions.py", line 42, in parse
    raise IncompleteParseError(text, node.end, self)
parsimonious.exceptions.IncompleteParseError: Rule 'rules' matched in its entirety, but it didn't consume all the text. The non-matching portion of the text begins with '/ d' (line 1, column 9).

But / is defined with:

or_term = "/" _ term
ored = term or_term+

Wrapping everything in parentheses works, but is a little inconvenient when e.g. parsing a language with operators and trying to consume the whitespace after all of them :)
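For reference, the parenthesized workaround mentioned above looks like this (a hedged sketch, with stub rules added so the grammar compiles):

from parsimonious import Grammar

Grammar("""
    a = (b c) / d
    b = "b"
    c = "c"
    d = "d"
    """)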

Make @rule support subclassing properly

The @rule decorator isn't very reliable in subclassed visitors. The skipped test at

def test_rule_decorator_subclassing():
    """Make sure we can subclass and override visitor methods without blowing
    away the rules attached to them."""
    class OverridingFormatter(CoupledFormatter):
        def visit_text(self, node, visited_children):
            """Return the text capitalized."""
            return node.text.upper()

        @rule('"not used"')
        def visit_useless(self, node, visited_children):
            """Get in the way. Tempt the metaclass to pave over the
            superclass's grammar with a new one."""

    raise SkipTest("I haven't got around to making this work yet.")
    eq_(OverridingFormatter().parse('((hi))'), '<b>HI</b>')

demonstrates what's left to be done.

import *

In grammar.py there is an import *.
It's a bit annoying, especially with tools like PyLint, which can't really check whether there are symbols not bound to anything.

If there are many symbols even something like

import parsimonious.expressions as e
might be better imho.

Printing a unicode node may raise an error

The code like this:

#! /usr/bin/env python
#coding=utf-8
from parsimonious.grammar import Grammar, peg_grammar
from parsimonious.expressions import *

class TextGrammar(Grammar):
    def _rules_from_peg(self, peg=None):
        string = Regex(r'\S+', unicode=True)

        peg_rules = {}
        for k, v in ((x, y) for (x, y) in locals().iteritems() if isinstance(y, Expression)):
            v.name = k
            peg_rules[k] = v
        return peg_rules, string

if __name__ == '__main__':
    peg = TextGrammar('')
    print peg['string'].parse(u'中文')

And when I ran it on console, I got:

Traceback (most recent call last):
  File "t2.py", line 18, in <module>
    print peg['string'].parse(u'中文')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-10: ordinal not in range(128)

So you can see there is an exception. And if I change one line like this:

print unicode(peg['string'].parse(u'中文'))

I got the right result:

<string "中文">

And I think the Node code doesn't deal with unicode very well when it needs a string result, so I changed the Node code like this:

def __unicode__(self):
    s = self.prettily()
    if isinstance(s, str):
        return unicode(s, 'utf8')
    else:
        return s

def __str__(self):
    s = self.prettily()
    if isinstance(s, unicode):
        return s.encode('utf8')
    else:
        return s

__repr__ = __str__

And for __str__ and __repr__ it'll return utf8 encoded string.

But here I just assume that all strings should be encoded in UTF-8; maybe there should be a suitable parameter to pass, for example when parsing:

peg.parse(text, encoding)

* rules should always return a list to visitors

When feeding Grammar an empty string, ugly things happen. This is an example of a pain point that makes it harder to write visitors than it should be. I had to work around it at

# isinstance() is a temporary hack around the fact that * rules don't
# always get transformed into lists by NodeVisitor. We should fix that;
# it's surprising and requires writing lame branches like this.
return rule_map, rules[0] if isinstance(rules, list) and rules else None
.

Here's a pdb session as I tried to track it down:

 170             method = getattr(self, 'visit_' + node.expr_name, self.generic_visit)
 171             if method.__name__ == 'visit_rules':
 172                 import pdb;pdb.set_trace()
 173
 174             # Call that method, and show where in the tree it failed if it blows
 175             # up.
 176  ->         try:
 177                 return method(node, [self.visit(n) for n in node])
 178             except VisitationError:
 179                 # Don't catch and re-wrap already-wrapped exceptions.
 180                 raise
 181             except Exception as e:
 182                 # Catch any exception, and tack on a parse tree so it's easier to
 183                 # see where it went wrong.
 184                 exc_class, exc, tb = sys.exc_info()
 185                 raise VisitationError, (exc, exc_class, node), tb
(Pdb++) node
s = ''
Node('rules', s, 0, 0, children=[Node('_', s, 0, 0), Node('', s, 0, 0)])

Where does Node('', s, 0, 0) come from? It's the childless rule*. But that should have been transmuted to a list by the list comp in NodeVisitor. Why wasn't it?

Is it because the node doesn't get named 'rule' when it comes back empty? And thus visit_rule doesn't get called? I think + and * should always turn into lists. + probably does, since it always matches at least one instance when it succeeds.

Unicode all the things

I'm assuming that this shouldn't happen so I'mma filin' an issue

Traceback (most recent call last):
  File "main.py", line 24, in <module>
    stylesheet = do_parse(inp.read())
  File "/cygdrive/e/opt/crass/parse.py", line 411, in do_parse
    tree = css.parse(raw)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/grammar.py", line 83, in parse
    return self.default_rule.parse(text, pos=pos)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 40, in parse
    node = self.match(text, pos=pos)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 55, in match
    node = self._match(text, pos, {}, error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 94, in _match
    error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 230, in _uncached_match
    node = m._match(text, new_pos, cache, error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 94, in _match
    error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 230, in _uncached_match
    node = m._match(text, new_pos, cache, error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 94, in _match
    error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 307, in _uncached_match
    node = self.members[0]._match(text, pos, cache, error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 94, in _match
    error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 230, in _uncached_match
    node = m._match(text, new_pos, cache, error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 94, in _match
    error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 252, in _uncached_match
    node = m._match(text, pos, cache, error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 94, in _match
    error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 230, in _uncached_match
    node = m._match(text, new_pos, cache, error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 94, in _match
    error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 322, in _uncached_match
    node = self.members[0]._match(text, new_pos, cache, error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 94, in _match
    error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 252, in _uncached_match
    node = m._match(text, pos, cache, error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 94, in _match
    error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 252, in _uncached_match
    node = m._match(text, pos, cache, error)
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 98, in _match
    print self, "doesn't match at", repr(text[pos:pos + 20])
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/utils.py", line 16, in __str__
    return self.__unicode__().encode('utf-8')
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 119, in __unicode__
    self.as_rule(),
  File "/home/basta/.virtualenvs/crass/lib/python2.7/site-packages/parsimonious/expressions.py", line 128, in as_rule
    return ((u'%s = %s' % (self.name, self._as_rhs())) if self.name else
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3: ordinal not in range(128)

LazyReference resolution is busted

If a LazyReference points to a LazyReference, the former gets replaced with just another lazy reference, which isn't a big help. As a result, trying to parse with the grammar fails.

Lose the tuple unpacking in visitors

It doesn't exist in Python 3, because Python 3 hates beauty. ;-)

As @reclosedev says…

There is no tuple parameter unpacking in Python 3 pep-3113. Solution is to use variable-length arguments by unpacking it in visit(). Also it saves from typing extra parentheses in visitors.

Example not working with latest parsimonious

https://github.com/erikrose/parsimonious/blob/adamfeuer-example_programs/parsimonious/examples/string_expression_language.py

I pulled the latest source and installed it. I tried one of the examples, which fails with this error message:

root_node = string_expression_grammar.parse(string_expressions)
File "build/bdist.linux-x86_64/egg/parsimonious/grammar.py", line 112, in parse
File "build/bdist.linux-x86_64/egg/parsimonious/expressions.py", line 109, in parse
parsimonious.exceptions.IncompleteParseError: Rule 'program' matched in its entirety, but it didn't    consume all the text. The non-matching portion of the text begins with '{
# test program' (line 2, column 1).

Input used:
{
# test program
a = "xyz"
b = "abc"
c = "def"
c = "333" # overwrites def
d = c + a + b
}
in a separate file.

I learned a few things from this example, but now it's not working, so again I'm stuck. Please resolve this.

Thanks
Yash

Support right recursion

A grammar like…

digits = digit digits?
digit = ~r"[0-9]"

…has no way of being built. We need to construct Sequence(Regex(…)) and then, afterward, append it as its own last member.

Also, getting reprs of such grammars recurses infinitely. Check for twice-visited nodes or something.

JSON example. Some questions and thoughts

I was interested in parsimonious because of readable grammars and separation of concerns, but didn't find any examples (except rule_syntax), so I've tried to write a simple JSON parser with a demo and benchmark. https://gist.github.com/reclosedev/5222560

I'm not sure that I've used the correct way to express the grammar. For example, comma-separated values and members: this grammar allows a comma after the last member/value (JSON doesn't). How should it be written?

Can we mark some term or rule as excluded from the tree? Examples: whitespace, braces, commas. It would allow us to reuse and simplify some visit_* methods.

Suggestion: NodeVisitor.lift_child could be more useful if it accepted rules with more than one child, e.g.:

values = value ws? ","? ws?
def lift_child(self, node, visited_children):
    """Lift the sole child of ``node`` up to replace the node."""
    return visited_children[0]

Or it can be separate method.

I think it would be great to have more real grammar examples with benchmarks in parsimonious.

Allow multi-line rules

= is used for only one thing in the rule-defining grammar, so we can do multi-line rules without requiring surrounding parens. Just be sure to require a newline at the end of each rule lest things get hard for humans to parse.

Parse error rule hierarchy

It would be nice if a parse error told me the rule hierarchy of the currently broken rule.

example:

error in rule assignment at line 1 col 1
class_definition->class_body->assignment

Umlaut Problem

Following Example:

#!/usr/bin/env python
# -*- coding: utf-8 -*- 

from parsimonious.grammar import Grammar
g = Grammar(
  """
  styled_text = bold_text / italic_text
  bold_text   = "((" text "))"
  italic_text = "''" text "''"
  text        = ~r"[\w\s]*"
  """
)
print g.parse(u'((böld))')

Traceback:

Traceback (most recent call last):
  File "parsi.py", line 12, in <module>
    print g.parse(u'((böld))')
  File "/home/mariusz/.local/lib/python2.7/site-packages/parsimonious/grammar.py", line 83, in parse
    return self.default_rule.parse(text, pos=pos)
  File "/home/mariusz/.local/lib/python2.7/site-packages/parsimonious/expressions.py", line 40, in parse
    node = self.match(text, pos=pos)
  File "/home/mariusz/.local/lib/python2.7/site-packages/parsimonious/expressions.py", line 57, in match
    raise error
parsimonious.exceptions.ParseError: Rule <Literal "))" at 0x21525040> didn't match at 'öld))' (line 1, column 4).

Decide whether Parsimonious is for Unicode, bytestrings, or both

First, we should probably stop supporting the re.L flag; it's unreliable and worse than re.U, as http://docs.python.org/3/library/re.html observes.

In order to simplify things and make the API work uniformly across Python 2 and 3, I propose we adopt the convention from Python 3's re lib: grammars defined in Unicode can match only Unicode strings, and those defined by bytestrings can match only bytestrings. We drop support for the re.U flag, letting it be determined at Grammar construction time by what sort of string is passed in. Support re.A if you want, but I'd be content to make people spell out what they mean by \s, \w, and \d explicitly. (What about \b?)

To support the naive use of grammars, we can try to promote bytestrings to Unicode if an attempt is made to parse them with a Unicode grammar. But people defining grammars should know better.

Remember to address ParseError.line() and column(), which assume '\n' will be a bytestring in 2 and a Unicode in 3 atm.

AttributeError: 'LazyReference' object has no attribute 'parse'

I'm not having any luck doing any actual parsing. I've used a PEG parser before (PEG.js) but I'm not understanding how to use this one. If, for example, you simply modify that styled text example with styled_text = text, you'll generate an error like so. Why is that not legal? Should it be?

$ python convert.py
Traceback (most recent call last):
  File "convert.py", line 17, in <module>
    print my_grammar.parse("TEST")
  File "build\bdist.win32\egg\parsimonious\grammar.py", line 112, in parse
AttributeError: 'LazyReference' object has no attribute 'parse'

Whitespace before first item in parenthesis breaks Grammar

>>> from parsimonious.grammar import Grammar

>>> Grammar('foo = ("baz" "bar")+').parse('bazbar')
s = 'bazbar'
Node(u'foo', s, 0, 6, children=[Node('', s, 0, 6, children=[Node('', s, 0, 3), Node('', s, 3, 6)])])

>>> Grammar('foo = ( "baz" "bar")+').parse('bazbar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/parsimonious/grammar.py", line 63, in __init__
    exprs, first = self._expressions_from_rules(rules)
  File "/Library/Python/2.7/site-packages/parsimonious/grammar.py", line 80, in _expressions_from_rules
    raise BadGrammar('There is an error in your grammar definition. '
parsimonious.exceptions.BadGrammar: There is an error in your grammar definition. Sorry for the vague error reporting at the moment.
>>> 
