
sly's Introduction

SLY (Sly Lex-Yacc)

SLY is a 100% Python implementation of the lex and yacc tools commonly used to write parsers and compilers. Parsing is based on the same LALR(1) algorithm used by many yacc tools. Here are a few notable features:

  • SLY provides very extensive error reporting and diagnostic information to assist in parser construction. The original implementation was developed for instructional purposes. As a result, the system tries to identify the most common types of errors made by novice users.
  • SLY provides full support for empty productions, error recovery, precedence specifiers, and moderately ambiguous grammars.
  • SLY uses various Python metaprogramming features to specify lexers and parsers. There are no generated files or extra steps involved. You simply write Python code and run it.
  • SLY can be used to build parsers for "real" programming languages. Although it is not ultra-fast due to its Python implementation, SLY can be used to parse grammars consisting of several hundred rules (as might be found for a language like C).

SLY originates from the PLY project. However, it's been modernized a bit. In fact, don't expect any code previously written for PLY to work. That said, most of the things that were possible in PLY are also possible in SLY.

SLY is a modern library for performing lexing and parsing. It implements the LALR(1) parsing algorithm, commonly used for parsing and compiling various programming languages.

Important Notice: October 11, 2022

The SLY project is no longer making package-installable releases. It's fully functional, but if you choose to use it, you should vendor the code into your application. SLY has zero dependencies. Although I am semi-retiring the project, I will respond to bug reports and may still decide to make future changes to it depending on my mood. I'd like to thank everyone who has contributed to it over the years. --Dave

Requirements

SLY requires the use of Python 3.6 or greater. Older versions of Python are not supported.

An Example

SLY is probably best illustrated by an example. Here's what it looks like to write a parser that can evaluate simple arithmetic expressions and store variables:

# -----------------------------------------------------------------------------
# calc.py
# -----------------------------------------------------------------------------

from sly import Lexer, Parser

class CalcLexer(Lexer):
    tokens = { NAME, NUMBER, PLUS, TIMES, MINUS, DIVIDE, ASSIGN, LPAREN, RPAREN }
    ignore = ' \t'

    # Tokens
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    NUMBER = r'\d+'

    # Special symbols
    PLUS = r'\+'
    MINUS = r'-'
    TIMES = r'\*'
    DIVIDE = r'/'
    ASSIGN = r'='
    LPAREN = r'\('
    RPAREN = r'\)'

    # Ignored pattern
    ignore_newline = r'\n+'

    # Extra action for newlines
    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')

    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

class CalcParser(Parser):
    tokens = CalcLexer.tokens

    precedence = (
        ('left', PLUS, MINUS),
        ('left', TIMES, DIVIDE),
        ('right', UMINUS),
        )

    def __init__(self):
        self.names = { }

    @_('NAME ASSIGN expr')
    def statement(self, p):
        self.names[p.NAME] = p.expr

    @_('expr')
    def statement(self, p):
        print(p.expr)

    @_('expr PLUS expr')
    def expr(self, p):
        return p.expr0 + p.expr1

    @_('expr MINUS expr')
    def expr(self, p):
        return p.expr0 - p.expr1

    @_('expr TIMES expr')
    def expr(self, p):
        return p.expr0 * p.expr1

    @_('expr DIVIDE expr')
    def expr(self, p):
        return p.expr0 / p.expr1

    @_('MINUS expr %prec UMINUS')
    def expr(self, p):
        return -p.expr

    @_('LPAREN expr RPAREN')
    def expr(self, p):
        return p.expr

    @_('NUMBER')
    def expr(self, p):
        return int(p.NUMBER)

    @_('NAME')
    def expr(self, p):
        try:
            return self.names[p.NAME]
        except LookupError:
            print(f'Undefined name {p.NAME!r}')
            return 0

if __name__ == '__main__':
    lexer = CalcLexer()
    parser = CalcParser()
    while True:
        try:
            text = input('calc > ')
        except EOFError:
            break
        if text:
            parser.parse(lexer.tokenize(text))
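
Running the script starts an interactive loop. A short illustrative session (output reconstructed from the rules above, not captured from a real run):

calc > a = 5
calc > a + 3 * 2
11
calc > b
Undefined name 'b'
0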

Documentation

Further documentation can be found at https://sly.readthedocs.io/en/latest.

Talks

Resources

For a detailed overview of parsing theory, consult the excellent book "Compilers: Principles, Techniques, and Tools" by Aho, Sethi, and Ullman. The topics found in "Lex & Yacc" by Levine, Mason, and Brown may also be useful.

The GitHub page for SLY can be found at:

https://github.com/dabeaz/sly

Please direct bug reports and pull requests to the GitHub page. To contact me directly, send email to [email protected] or contact me on Twitter (@dabeaz).

-- Dave

P.S.

You should come take a course!

sly's People

Contributors

abhaikollara, akuli, cdeil, dabeaz, danshorstein, lordmauve, pydanny, shadchin


sly's Issues

Vanished _lrtable

Dave,

I don't know if I missed something, but I'm facing a very strange problem. I created a lexer (which works as expected) and a parser (called Myparser). I managed to write some rules and tried to parse some syntax, but ran into:

File "/usr/local/lib/python3.6/dist-packages/sly-0.4-py3.6.egg/sly/yacc.py", line 1849, in parse
AttributeError: 'Parser' object has no attribute '_lrtable'

I did some checking in the source code and everything works as expected, except when I try to instantiate my parser! The '_lrtable' is properly set during class creation (I checked in ParserMeta.__new__), but once my Myparser object is instantiated, at least one class variable doesn't seem to be inherited from the Parser class and is then missing. ParserMeta.__new__ returns the class containing the '_lrtable' attribute, but it seems to be wiped out on the final object. All other attributes seem to be there, but no '_lrtable'.

I know that my grammar is currently unstable and unfinished, and I wonder if it could play a role...

What else could I investigate ? Anything I could try ?

Support for the third-party regex module?

Currently sly uses a hard-coded import re. I need the third-party regex library for things like \p{Ll} to match Unicode lowercase letters. This means that I need to import sly like this:

import re
import sys

import regex

# ugly hack
try:
    sys.modules['re'] = regex
    import sly
finally:
    sys.modules['re'] = re

There are some problems with this approach:

  • If sly imports any modules that haven't been imported earlier, they'll get the regex module instead of the re module too. That might cause some issues even though regex is supposed to be "a drop-in replacement" for re.
  • The implementation of sly might change later so that this breaks. For example, if sly imports re after import sly has returned, it gets the real re module instead of the regex module.

It would be nice if sly had some kind of way to specify which regex module to use with a Lexer subclass.
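
A possible sketch of such an option, assuming the regex_module class attribute that other code in this document references (Lexer.regex_module); whether your installed SLY version supports overriding it this way should be verified:

import regex              # third-party regex module (pip install regex)
from sly import Lexer

class UnicodeLexer(Lexer):
    # Assumption: SLY compiles token patterns with this module if it is overridden.
    regex_module = regex

    tokens = { LOWER_WORD }
    ignore = ' \t'

    # \p{Ll} matches a Unicode lowercase letter (regex-module syntax)
    LOWER_WORD = r'\p{Ll}+'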

YaccProduction index property causes AttributeError if index is 0.

The index property of class YaccProduction is defined as

    @property
    def index(self):
        for tok in self._slice:
            if isinstance(tok, YaccSymbol):
                continue
            index = getattr(tok, 'index', None)
            if index:
                return index
        raise AttributeError('No index attribute found')

An index value of 0 is valid if the token is at the very start of the source text, but the if index: test for it is false so an AttributeError is raised.

Changing the test to if index is not None: fixes the problem.
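
For reference, the property with the suggested change applied:

    @property
    def index(self):
        for tok in self._slice:
            if isinstance(tok, YaccSymbol):
                continue
            index = getattr(tok, 'index', None)
            if index is not None:    # 0 is a valid index
                return index
        raise AttributeError('No index attribute found')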

Undeclared identifiers in tokens assignment?

Please pardon my ignorance, but I'm having a very difficult time understanding what happens when I assign the tokens attribute of a Lexer instance.

I see in the documentation

Token names should be specified using all-caps as shown.

and some testing confirms that any identifier seems to be legal so long as its name is capitalized, but I don't understand what happens in Python when these identifiers are provided; nor do I recognize anything in lex.py that enables this behavior.

What is so special about tokens that allows it to accept previously undeclared names?
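
For what it's worth, the trick is that the Lexer metaclass supplies a custom class-body namespace via __prepare__, so looking up an undefined all-caps name inside the class body never raises NameError. A minimal illustration of the idea (not SLY's actual code):

class _AutoNames(dict):
    # Unknown all-caps names looked up in the class body resolve to themselves.
    def __missing__(self, key):
        if key.isupper():
            return key
        raise KeyError(key)

class DemoMeta(type):
    @classmethod
    def __prepare__(mcls, name, bases):
        # This mapping becomes the namespace the class body executes in.
        return _AutoNames()

    def __new__(mcls, name, bases, namespace):
        return super().__new__(mcls, name, bases, dict(namespace))

class DemoLexer(metaclass=DemoMeta):
    # NAME and NUMBER were never assigned, yet no NameError is raised:
    # the namespace resolves them to the strings 'NAME' and 'NUMBER'.
    tokens = { NAME, NUMBER }

print(DemoLexer.tokens)    # {'NAME', 'NUMBER'} (set order may vary)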

Benchmarking a JSON parser

I've written a JSON parser in SLY for the purpose of benchmarking against other parsing libraries (including another LALR parser, Lark, and some PEG-like parsers: pyparsing, parsimonious, pe). The repo is here: https://github.com/goodmami/python-parsing-benchmarks

The README has the results for a run on CPython and on PyPy (for both, the mean time for 5 rounds of parsing a ~5MB JSON file). The code for SLY is here: https://github.com/goodmami/python-parsing-benchmarks/blob/master/bench/sly/json.py

I'm the author of one other library (pe), and I want the benchmarks to be fair, so my question is:

  • Can a more experienced user of SLY please take a look at the code (it's < 100 LOC) to see if I'm missing any obvious optimizations?

Regarding the code, a couple of things to note:

  • The implementation is wrapped in a function just so I can benchmark parser instantiation
  • The STRING regex is complicated because this is a strict implementation of the JSON spec, which whitelists valid characters and escapes

Thanks for any help!

Support for inclusive lexing states

Is there support in SLY for the equivalent of PLY's inclusive lexing states? It seems the default behaviour in SLY is equivalent to PLY's 'exclusive'. I find no mention of "exclusive" or "inclusive" in the source code.

EDIT: perhaps the idea is that, no, there is no inclusive lexing, but the lexer you define for the new state can inherit from the base one to get its tokens?

Allow optional tokens in rules

I've spent a bit of time working with a project that uses SLY, and one thing I think could be a huge benefit would be to allow optionals (0 or 1 occurrences) in rules.

For example, I've found myself having to define something like the following quite often:

@_('TOKEN')
@_('')
def maybe_token(self, p):
    pass

@_('... maybe_token ...')
def other_rule(self, p):
    ...

Where I don't really care about TOKEN but I do need to account for it because it is contextually useful (such as in a language like Python where whitespace is important... sometimes).

It would be much simpler and easier to read to be able to do the following:

@_('... [TOKEN] ...')
def other_rule(self, p):
    ...

Where [TOKEN] indicates that it is optional in the production rule, expanding into one rule with it and one without.

Feature idea: better error messages for yacc

Currently the only error information I can report to the user of my compiler from a sly.Parser is "invalid syntax at this line and this column in this file". That's not as good as it could be IMO. Before using sly I had a hand-written parser, and its error messages were way better than "syntax error", as in "trailing comma not allowed in argument list" or "you can't do a == b == c, you need to do a == b and b == c instead".

Even though parser generators can't give very good error messages compared to hand-written parsers, some kind of error messages other than calling everything "syntax error" would be nice. For example, I would guess that it's easy to implement some kind of error message handling for a == b == c when there is precedence = (..., ('nonassoc', EQ), ...) in the sly.Parser subclass.

Cannot use '[]' with EBNF?

I tried to write this formula for a variable or a reference of element in one array:

@_("'$' IDENTIFIER [ '[' consts ']' ]")
def vars(self, p):
    print(len(p))

As is well known, the part in '[]' is optional in this rule, but when I try to get the length of p, it is always 3: if I input $i as a plain variable rather than a reference, p[2] is the tuple (None, None, None).

How can I solve this? I would expect the length to be 5 if my input looks like $x[0] and 2 if my input looks like $i.

Getting the derivation of a parse

  • is there a way to get a derivation of the string? (don't know if I'm using "derivation" in the correct sense)
  • something which says that the rules were applied in this order: rule 0 , 1, 5, 2, 3, 4, ...

Here's what I've been able to come up with so far:

    def to_index(self, p):
        prods = self._grammar.Productions
        o = prods[1]  # the first one is dummy
        idx = [
            (o.number, o.name)
            for o in self._grammar.Productions
            if o.namemap == p._namemap
        ]
        print(p, idx.number)

    @_("SELECT cols FROM tables SEMI")
    def query(self, p):
        self.to_index(p)
        return [p.cols, p.tables]
  • Is there a better way to do this, short of editing sly code?

Error: the calc_ebnf example cannot be executed.

The calc_ebnf example does not run with Python 3.8.3; the following error is reported:

sly.yacc.YaccError: Unable to build grammar.
C:\ProgramData\Miniconda3\lib\site-packages\sly\yacc.py:1690: Symbol 'PLUS|MINUS' used, but not defined as a token or a rule
C:\ProgramData\Miniconda3\lib\site-packages\sly\yacc.py:1690: Symbol 'TIMES|DIVIDE' used, but not defined as a token or a rule

How to make ebnf work with sly?

Pickling/saving a generated grammar

Is there any way to take a given lexer/parser and pickle the results once the class is instantiated, and then reuse it later (to save us some startup time)? I tried to pickle a created Parser class with rules but the resulting file didn't seem to have any of the grammar data in it.

if else - both statements always executing

I'm trying to parse a language called pinescript which has if-then-else statements. I've extended the calc example to support many of its features. I'm currently stuck on this one. It appears that the else block is always executing no matter the result of the expr. Here is the output from the terminal when I parse the offending statement:

WARNING: 34 shift/reduce conflicts
WARNING: 11 reduce/reduce conflicts
calc > if true x = 10 else x = -1
DEBUG: ID ASSIGN expr x 10
DEBUG: ID ASSIGN expr x -1
DEBUG: {'x': -1} True 10 -1
10
calc > x
-1

As you can see I've added a cexpr (compound expression) along with expr to the grammar.

    @_('cexpr')
    def statement(self, p):
        return p.cexpr

    @_('ID ASSIGN expr')
    def statement(self, p):
        print('DEBUG:', 'ID ASSIGN expr', p.ID, p.expr)
        self.names[p.ID] = p.expr
        return p.expr
...
    @_('IF expr statement ELSE statement')
    def cexpr(self, p):
        print('DEBUG:', names, p.expr, p.statement0, p.statement1)
        return p.statement0 if p.expr else p.statement1

    @_('IF expr expr ELSE expr')
    def cexpr(self, p):
        # print(p, p.expr0, p.expr1, p.expr2)
        return p.expr1 if p.expr0 else p.expr2

Any idea how I can make only the correct branch statement evaluate? Please let me know if any additional info is needed to understand what's happening.
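
For what it's worth, a common way around this (a general technique, not something stated in this thread) is to have all of the grammar rules build unevaluated nodes and walk them in a separate pass after parsing, so that only the selected branch is ever executed. A minimal sketch of what the if-rule would return under that approach:

    @_('IF expr statement ELSE statement')
    def cexpr(self, p):
        # Build a node instead of evaluating; the branch choice is made later,
        # in an interpreter pass over the returned tree.
        return ('if', p.expr, p.statement0, p.statement1)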

Handling end of file in the lexer

PLY has t_eof token support so you can detect things like open brackets. Is it possible to do this in SLY? If not, the code from PLY seems (at least at a glance) to be quite straightforward - is this something you'd accept a PR for?

ignore case sensitivity

I'm trying to make my language case-insensitive.

I'm doing this:

class FortyLexer(Lexer):
    """Main Lexer Class for FortyFor"""
    Lexer.reflags = Lexer.regex_module.IGNORECASE

    tokens = {
        ID, NUMBER, PLUS, MINUS, TIMES, DIVIDE, ASSIGN, LPAR, RPAR,
        IF, ELSE
    }

    ignore = ' \t'
    ignore_comment = r'\!.*'
    ignore_newline = r'\n+'

    ID      = r'[a-zA-Z_][a-zA-Z0-9_]*'
    ID[r'if'] = IF
    ID[r'else'] = ELSE
...

"else" is recognized as an ELSE token, but "ELSE" is still recognized as an "ID".
Am i doing it wrong ?

i also tried with " Lexer.reflags = Lexer.regex_module.RegexFlag.IGNORECASE"

distribute a wheel

I was trying to use sly through Pyodide in the browser, but since it is only a source distribution on PyPI it doesn't work. Would it be possible to add a bdist_wheel to the PyPI distribution?

Change lexer state according to parser state on the fly

If we want to change lexer states according to parser states, e.g. according to certain token combination patterns, it is not possible to do so, because the parse method of the Parser class consumes tokens produced by the Lexer's tokenize method.

To be exact, I would like to implement behavior similar to asn2wrs.py, which changes the lexer state after WITH SYNTAX or in specific states, e.g. p_ObjectSet, p_ParameterList... .

Is there any alternative way to do it?

Thanks.

calclex.py example produces NameError

The calclex.py example from the "Writing a Lexer" section of https://github.com/dabeaz/sly/blob/master/docs/sly.rst raises a NameError when run. Console output to demonstrate:

$ (master) python3 --version
Python 3.6.5

$ (master) cat calclex.py 
# calclex.py

from sly import Lexer

class CalcLexer(Lexer):
    # Set of token names.   This is always required
    tokens = { ID, NUMBER, PLUS, MINUS, TIMES,
               DIVIDE, ASSIGN, LPAREN, RPAREN }

    # String containing ignored characters between tokens
    ignore = ' \t'

    # Regular expression rules for tokens
    ID      = r'[a-zA-Z_][a-zA-Z0-9_]*'
    NUMBER  = r'\d+'
    PLUS    = r'\+'
    MINUS   = r'-'
    TIMES   = r'\*'
    DIVIDE  = r'/'
    ASSIGN  = r'='
    LPAREN  = r'\('
    RPAREN  = r'\)'


if __name__ == '__main__':
    data = 'x = 3 + 42 * (s - t)'
    lexer = CalcLexer()
    for tok in lexer.tokenize(data):
        print('type=%r, value=%r' % (tok.type, tok.value))

$ (master) python3 calclex.py 
Traceback (most recent call last):
  File "calclex.py", line 5, in <module>
    class CalcLexer(Lexer):
  File "calclex.py", line 7, in CalcLexer
    tokens = { ID, NUMBER, PLUS, MINUS, TIMES,
NameError: name 'ID' is not defined

I am running sly v0.3. I take it I must be doing something wrong, but I do not see what.

graphviz file with automaton

Dear all,
I would like to know if there is a way to generate a dot file for the LALR(1) automaton in sly.
Something like what yacc does.

AttributeError: 'function' object has no attribute 'rules' in the sly Parser

I recently got this error on Ubuntu 18.04:

Traceback (most recent call last):
  File "chain_interpreter.py", line 1, in <module>
    import chain_lexer, chain_parser
  File "/home/jonathan/Documents/programming/python/Programming Languages/chAIn/chain/src/chain_parser.py", line 6, in <module>
    class ChainParser(Parser):
  File "/home/jonathan/.local/lib/python3.6/site-packages/sly/yacc.py", line 1586, in __new__
    cls._build(list(attributes.items()))
  File "/home/jonathan/.local/lib/python3.6/site-packages/sly/yacc.py", line 1785, in _build
    if not cls.__build_grammar(rules):
  File "/home/jonathan/.local/lib/python3.6/site-packages/sly/yacc.py", line 1677, in __build_grammar
    parsed_rule = _collect_grammar_rules(func)
  File "/home/jonathan/.local/lib/python3.6/site-packages/sly/yacc.py", line 1545, in _collect_grammar_rules
    for rule, lineno in zip(func.rules, range(lineno+len(func.rules)-1, 0, -1)):

AttributeError: 'function' object has no attribute 'rules'

This is interesting because a few months ago when I was working on this project, I did not have this problem. I have not changed any of the source code since then. I am on Python 3. The start of my chain_parser.py file looks something like this:

from sly import Parser
from chain_lexer import ChainLexer
import pprint

class ChainParser(Parser):  # error occurs on this line?
    tokens = ChainLexer.tokens
    debugfile = 'parser.out'

    @_("program statement")
    def program(self, p):
        return p.program + (p.statement, )

I have looked online but as far as I can tell, no one has gotten this issue before...

All help would be appreciated.

Why is my string being split into individual characters?

So I have this regex expression STRING = r'[^\[\][^\n"]+' to capture text to save into a variable, but I also have another part of my program which uses literals of ( and ) to do math, such as 5 add ( 5 minus 2 ) = 8. However, my issue is that when I run my test, and when I actually type the same thing into the interpreter, it causes an error and I'm not sure why. I expected ( 5 add 5 ) to be tokenized as individual tokens, ( NUMBER ADD NUMBER ), rather than as a single STRING.

WARNING: 5 reduce/reduce conflicts
Token(type='DQUOTE', value='"', lineno=2, index=5)
Token(type='STRING', value='hello', lineno=2, index=6)
Token(type='DQUOTE', value='"', lineno=2, index=11)
Token(type='STRING', value='( 5 add 5 )', lineno=3, index=17)
engpy > (5 add 5)
sly: Syntax error at line 1, token=STRING
engpy > ( 5 add 5 )
sly: Syntax error at line 1, token=STRING
test = '''
    "hello"
    ( 5 add 5 )'''

for t in lexer.tokenize(test):
    print(t)

Lacking gratitude on contributions

The thread from these tweets led me to raise this issue - there's not enough gratitude in OSS.

I've been learning about parsing for over a year or so. From your open source projects (such as PLY and SLY), talks, and insights, I've come to learn about a number of parsing concepts. Some resources that introduced me to this field include:

Not to mention I learned a great amount of Python over the years from your talks on many, many other topics.

There is not enough gratitude in open source, nor many tools to express it. Nevertheless, I thank you.

Parsing binary SCPI data

Hi,
this is not an issue, more a question:
I would like to parse binary SCPI data of the form #{LoL}{L}{data}.
LoL is the length of the length field L.
L is the length of the binary data.
e.g. #210abcdefghij would be valid binary data with LoL=2, L=10, and abcdefghij as the data.
The problem here is that the length of the data has to be determined by parsing the header
and somehow feeding the length information to something like ".{L}"

Something like "#([1-9])([0-9]{\1})+(.{\2})" would be nice, where I could use the result of a group
inside of the current regex.
Embedded actions looked promising, but they happen in the parser.

small doc err

Small error/fix in the documentation: under "Token Remapping", the following text/example:

# Base ID rule
    ID = r'[a-zA-Z_][a-zA-Z0-9_]*'

    # Special cases
    ID['if'] = IF
    ID['else'] = ELSE
    ID['while'] = WHILE

is a bit misleading, as it gives no (intuitive) way to trigger a yacc rule that includes 'if', 'else' or 'while' (assuming the assignment is made with IF, ELSE and WHILE keywords/tokens specified earlier). If, however, the example used uppercase for the dict keys:

    # Special cases
    ID['IF'] = IF
    ID['ELSE'] = ELSE
    ID['WHILE'] = WHILE

a parser rule such as
@_('IF NUMBER "=" NUMBER')
no longer fails.

Cheaper/simpler way to get column information out-of-the-box

In SLY, the recommended way to get column information for a token is to use its stored offset, find the offset of the last newline, and use that to compute the column.

While coding a scanner, I realized there is a simpler/cheaper way to get column information for each token.

If you instead store the offset of the start of the line at the same time you update the line number information, and copy that line start offset into each token together with the line number, the token has enough information to compute its column by itself.

Abstracting this to "a position-information" object makes this even simpler. Instead of copying a line number, copy an object named position or something similar. Create a new Position object with line number and line start offset each time a newline is found, and done.
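
A minimal sketch of the proposal, following the lexer conventions of the README example (the line_start attribute and column() helper are additions for illustration, not existing SLY features):

from sly import Lexer

class ColLexer(Lexer):
    tokens = { NAME, NUMBER }
    ignore = ' \t'

    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    NUMBER = r'\d+'

    def __init__(self):
        super().__init__()
        self.line_start = 0    # offset of the first character of the current line

    ignore_newline = r'\n+'
    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')
        # The new line begins right after the newline(s) just consumed.
        self.line_start = t.index + len(t.value)

    def column(self, t):
        # 1-based column, computed from the token's index and the stored line start.
        return t.index - self.line_start + 1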

How to extend a Lexer

This may be related in lots of ways to #12: is there a simple way to extend a Lexer?

I have this AbstractLexer class that has many rules that should be shared by child classes, and it would be very inelegant to have to copy these rules in all the "child" lexers.

Conditional lexing

Dear all,

I am having problems trying to do conditional lexing:
Here is a sample code:

import sly

class CalcLexer(sly.Lexer):
    tokens = { NUMBER, PLUS, MINUS, TIMES, DIVIDE, LBRACE }
    ignore = ' \t\n'

    NUMBER = r'\d+'
    PLUS = r'\+'
    TIMES = r'\*'
    MINUS = r'-'
    DIVIDE = r'/'
    LBRACE = r'\{'

    def LBRACE(self, t):
        raise sly.LexerStateChange(BlockLexer, t)

class BlockLexer(sly.Lexer):
    tokens = { RBRACE, NAME, VALUE }
    ignore = ' \t\n'

    NAME = r'[a-zA-Z_][a-zA-Z0-9_]+'
    VALUE = r'\d+'
    RBRACE = r'\}'

    def RBRACE(self, t):
        raise sly.LexerStateChange(CalcLexer, t)


if __name__ == '__main__':
    lexer = CalcLexer()
    for tok in lexer.tokenize('3 + 4 { foo bar 1234 } * 6'):
        print(tok)

The expected output should be:

Token(type='NUMBER', value='3', lineno=1, index=0)
Token(type='PLUS', value='+', lineno=1, index=2)
Token(type='NUMBER', value='4', lineno=1, index=4)

However, the result is:
Token(type='NUMBER', value='3', lineno=1, index=0)
Token(type='PLUS', value='+', lineno=1, index=2)
Token(type='NUMBER', value='4', lineno=1, index=4)
Traceback (most recent call last):
  File "prueba.py", line 31, in <module>
    for tok in lexer.tokenize('3 + 4 { foo bar 1234 } * 6'):
  File "/home/user/Git_Repositories/sly/sly/lex.py", line 400, in tokenize
    tok = _token_funcs[tok.type](self, tok)
  File "prueba.py", line 15, in LBRACE
    raise sly.LexerStateChange(BlockLexer, t)
sly.lex.LexerStateChange: (<class '__main__.BlockLexer'>, Token(type='LBRACE', value='{', lineno=1, index=6))

How do I link Parsing Rules?

In PLY, we would use a docstring like
'''program : program statement | statement'''
and then later something like
'''statement : INTEGER command NEWLINE'''
to link parsing rules to a symbol for use later.

How would you do something similar in sly?
All the examples I have seen are something like
@_('program statement', 'statement')

So, how do I link the rules ('program statement', 'statement') to a token (program)?
Or rather define those rules as that token?

Is the function name the token that defines those rules in the decorator?
So I'd then use the function name in the decorators?

Like so ?
@_('program statement', 'statement')
def program(p):
@_('INTEGER command NEWLINE')
def statement(p):
(I might not be using the correct terminology.)
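
Roughly, yes: in SLY the method name is the nonterminal on the left-hand side of the rule, and each string passed to @_() is one alternative for that nonterminal. A sketch mirroring the README example (MyLexer and the token names here are placeholders):

class MyParser(Parser):
    tokens = MyLexer.tokens          # placeholder lexer

    # program : program statement
    #         | statement
    @_('program statement', 'statement')
    def program(self, p):
        ...

    # statement : INTEGER command NEWLINE
    @_('INTEGER command NEWLINE')
    def statement(self, p):
        ...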

How do I determine the nestedness level

I have a working lexer and a semi-working parser. My data contains name=value pairs, and a value can itself be an object. I am trying to determine how deep in the tree I am when my method for the nested object is called. I tried to add indentLevel as a member of Parser, but that fails. I understand why - it is because the nested elements are reduced before the object itself - but I do not see a proper way to do it. Please advise. See the code snippet below.

    tokens = {ELEM_NAME, ELEM_VALUE,
              LBRACKET, RBRACKET}

.......

    @_('LBRACKET elements RBRACKET')
    def aggregate(self, p):
        return p.elements

    @_('element elements')
    def elements(self, p):
        return p.element + "\n" + p.elements

    @_('element')
    def elements(self, p):
        return p.element

    @_('ELEM_NAME LBRACKET elements RBRACKET')
    def element(self, p):
        self.d_identLevel += 1
        retStr = p.ELEM_NAME + "\n" + p.elements
        self.d_identLevel -= 1
        return retStr

    @_('ELEM_NAME ELEM_VALUE')
    def element(self, p):
        return self.ident() + p.ELEM_NAME + "=" + p.ELEM_VALUE

Support getattr

In a Parser, given a function with multiple rules where a symbol is present in one rule but not the other, it would be useful to be able to use getattr on the p arg for that attribute, but currently a KeyError bubbles up.

    @_('NAME "(" ")"',
       'NAME "(" opt ")"')
    def rule(self, p):
        opt = getattr(p, 'opt', 'value')
        ...

When the first rule matches (so there is no value for opt), calling getattr results in a KeyError:

  File ".../python3.6/site-packages/sly/yacc.py", line 1900, in parse
    value = p.func(self, pslice)
  File "...", line ..., in function
    args = getattr(p, 'arguments_list', [])
  File ".../python3.6/site-packages/sly/yacc.py", line 147, in __getattr__
    return self._slice[self._namemap[name]].value
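
A minimal workaround sketch (not official SLY guidance): because the failure surfaces as a KeyError rather than an AttributeError, getattr's default is never used, so catch the exception explicitly instead:

    @_('NAME "(" ")"',
       'NAME "(" opt ")"')
    def rule(self, p):
        try:
            opt = p.opt
        except (AttributeError, KeyError):   # KeyError is what currently escapes
            opt = 'value'
        ...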

Problem where import Lexer and/or parser

Hi, how are you? I have a problem when importing Lexer, Parser, or both, on both Python 2.7 and 3.5.

python3 import-test.py:

Traceback (most recent call last):
  File "import-test.py", line 1, in <module>
    from sly import Lexer, Parser
  File "/home/nbalmaceda/.local/lib/python3.5/site-packages/sly/__init__.py", line 2, in <module>
    from .lex import *
  File "/home/nbalmaceda/.local/lib/python3.5/site-packages/sly/lex.py", line 78
    return f'Token(type={self.type!r}, value={self.value!r}, lineno={self.lineno}, index={self.index})'
    ^
SyntaxError: invalid syntax

python import-test.py:

Traceback (most recent call last):
  File "import-test.py", line 1, in <module>
    from sly import Lexer, Parser
  File "/home/nbalmaceda/.local/lib/python2.7/site-packages/sly/__init__.py", line 6
    __all__ = [ *lex.__all__, *yacc.__all__ ]
    ^
SyntaxError: invalid syntax

debug level

Dear all,
Is there a way to see in a sly parser, the shift reduce operations for an input?
The equivalent command in yacc is yydebug=1 and it shows you the stack, the automaton and a
step by step execution of the parser.

Silencing sly warnings?

Is there a way to disable the sly warnings? Here is an example of one:

WARNING: Token 'EQEQ' defined, but not used

I have tried the PLY way of disabling them (using NullLogger) but that doesn't work. Disabling the warnings isn't really documented and I can't find anything online.

Any help would be appreciated.

Python 3.6 minimum required

Some folks are still using Debian 9 which defaults to only Python 3.5.x.

This repo requires a minimum of Python 3.6 due to its use of f-strings (e.g. print(f'oneline')).

Perhaps a little note stating this minimum requirement somewhere in the Wiki?

How to extend a Parser?

If I have

class MyLexer(Lexer):
     ...

class MyParser(Parser):
    tokens = MyLexer.tokens

    ... # some expressions

How can I extend MyParser to write another parser that builds on the first's expressions?

This example does not appear to work.

class MySecondParser(MyParser, Parser):
    tokens = MyParser.tokens

    ... # some expressions

I get the following error:
Symbol <expression from MyParser> used, but not defined as a token or a rule

Feature request: automatic support for optional symbols and repetitions

Grammars often include rules for optional symbols, or for symbols repeated 0 or more, or 1 or more times. For example, I have just written a PLY grammar with these rules:

def p_request_0n(t):
    """ request_0n : request_0n request
                   | empty
    """

def p_ixLocation_1n(t):
    """ ixLocation_1n : ixLocation_1n ixLocation
                      | ixLocation
    """

It would be nice if SLY could do this automatically, perhaps with:

  • ? for optional
  • * for 0 or more
  • + for 1 or more

Alternately, SLY could use _01, _0n and _1n suffixes on symbols, and if rules for them aren't supplied then sensible defaults would be used (alternately, the user could write their own rule instead of the default, e.g. to insert a comma between each occurrence).
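
For comparison, here is how the 0-or-more and 1-or-more rules above can currently be written by hand in SLY's decorator style (a sketch using the same symbol names as the PLY rules; an empty alternative is written as ''):

    @_('request_0n request', '')
    def request_0n(self, p):
        pass

    @_('ixLocation_1n ixLocation', 'ixLocation')
    def ixLocation_1n(self, p):
        pass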

Update PyPI Entry

version 0.4 was uploaded over a month ago but is still not the version available from PyPI.

teach PyCharm to work with sly

As David stated in his PyCon 2018 talk, sly breaks PyCharm by heavily using modern features of Python in a unique way.

PyCharm states in its documentation

PyCharm's code analysis is flexibly configurable. You can enable/disable each code inspection and change its severity, create profiles with custom sets of inspections, apply inspections differently in different scopes, suppress inspections in specific pieces of code, and more.

see https://www.jetbrains.com/help/pycharm/code-inspection.html

It would be interesting if we could teach PyCharm not to break, via a modified code inspection.

I think sly is really promising and approaches parsing in a much better way than most other tools. Thanks, David.

Parse tree generation gets short shrift in documentation

Thank you for a very cool tool. The annotation approach is elegant.

In the documentation, the primary example for PLY is the calculator. This can be a little misleading for noobs (such as a colleague of mine) because the calculator's parser rules actually conflate parsing proper (the grammatical analysis of the input stream) and the evaluation of the parsed stream. This is very clever, and it can easily be argued (as you implicitly do) that if one wants a parse tree as output, then the evaluation returns the parse tree, not values.

I'd suggest breaking the documentation for the parser up into two separate sections, one about parse tree generation, and one using your calculator example, so that noobs do not immediately think that the output of parsing is the final evaluation of the input.

If I have time, I will submit a PR for this.
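
To illustrate the distinction, a minimal sketch (not taken from the SLY documentation) of calculator-style rules that return parse tree nodes instead of evaluated values:

    @_('expr PLUS expr')
    def expr(self, p):
        # Build a tree node; evaluation happens in a later pass over the tree.
        return ('add', p.expr0, p.expr1)

    @_('NUMBER')
    def expr(self, p):
        return ('num', int(p.NUMBER))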

Forcing a fixed (but smaller) regex prior to m.match()/m.group() at a certain state?

Maybe I bit off more than I can chew, but I'm trying to develop a BIND9 configuration parser using Sly (and formerly Ply).

The basic problem is Sly/Ply's auto-typing of multiple ID (identifier) tokens, and whether I should generalize all my variable fields into just one ID type or not, given the constraint put forth by the sly (or ply) design.

BIND9 configuration is a weird mixture of C-style/Python-style comments, include statements, alias dictionaries, multiple-LBRACE/RBRACE nesting, and ignored newlines, centered on using SEMICOLON as the statement terminator. I got all that working except for one thing: an ID type discriminator (via multi-token regex).

include named-options.conf;
server example.com;

My first attempt to further subdivide/specialize that generic ID token was to break it up into multiple ID-type tokens and define SERVER ALIASNAME and INCLUDE FILESPEC using:

    t_SERVER_NAME = r'[A-Za-z0-9_\-\.]*'
    t_FILESPEC = r'([/\\:\._\-0-9A-Za-z]+)(?=[ \t]*;)'

I ran into that classic problem where a certain state identifies the "ID" as the wrong token type.

After much reading of Google Groups, StackOverflow, and GitHub forums/issues, I've concluded that any attempt to discriminate identifiers (variable, alias name, full domain name) is futile, due to the inability of a regex alone to properly classify these identifiers.

Then I thought: why not forcibly pre-assign the smaller regex for that certain state (heck, for most states) at initialization time?

At any rate, I see three choices ahead of me:

  1. Is there a way to pre-select a lone (but smaller) token regex after entering a new state, instead of using the more generalized multi-token ID type identification regex?

  2. Or is verification of the variable naming convention (using just t_ID) best done inside the state-specific parser function (i.e., p_clause_server and p_clause_include) and not at the token level (i.e., t_FILESPEC and t_SERVER_NAME)?

  3. Or did I overlook another tip?

If I can nail this, the NGINX configuration file format will soon follow, and I can post the result in its entirety here on GitHub for other security researchers to use.

Getting line information from other terminals.

In the scanner, all tokens get position information. In the parser, however, you can only get line information from a single token, namely the first one. This makes it impossible to report a position for errors related to other tokens.

As a trivial example, consider a rule like FUNCTION code END. The END can be just as wrong as the FUNCTION, and it doesn't need to be on the same line, especially if code is not short.

In the current implementation, I would have to change the value of every token in the scanner to include position information, which is silly.

Feature request: rename Lexer and Parser decorators

Is there any particular reason for naming the decorators defined in the Lexer and Parser classes '_'? It's not very meaningful and I find them ugly in use since @_(' ..... ') essentially starts with four punctuation marks.
I messed around with the code and changed them to @token and @rule respectively. Suitably modified versions of the calculator examples seemed to work OK.

Add alternate function name for rule definitions in parsers

Because the @_(str) rule decorator is "special" to this library, my static analysis plugins raise alarm bells on every invocation because they can't find the name _, which makes it difficult to see when there actually is a problem.

It would be nice to have an "alias" (exported from sly) called something like @rule(str) that does the same thing it does now, but makes static analysis tools not raise a bunch of errors.

Panic Mode Recovery at End of File

Background

Ideally I want to be able to parse out some specially formatted C++ comments and
the function which they are documenting. (Think a bespoke form of Doxygen).

After some reading it sounded a lot like using a Lexer/Parser had already solved
the hard part of this.

The possible problem is that I'm trying to be lazy and ignore all the surrounding C++ code.
So, outside of my golden comment blocks (and, later, the function being documented),
there's a sea of syntax errors.

I was hoping I could easily pull out the interesting parts and ignore everything
else. I'm starting to think this might be outside intended operating conditions
of such a parser though...

Sly

I've been testing out Sly which I've proved will easily do what I want when there is
no unexpected text.

However, I can't quite seem to get the rather extreme error handling to do what I'd like.
Currently the problem appears to be when the unexpected text is between a valid
statement and the EOF.

Looking at the state debugfile, it looks like I need to get either a
COMMENT_OPEN or an $end to reduce what should be a complete expression on
the stack. However, I'm entering error() handling before hitting the end of the
file and I wonder if I need to be signaling this somehow?

I've got some simplified test code below.

Test Code

#! /usr/bin/env python3

from sly import Parser
from sly import Lexer
from pprint import pprint


class CommentLexer(Lexer):
    tokens = {COMMENT_OPEN, COMMENT_CLOSE, WORD, SEMI}

    COMMENT_OPEN = r"/\* COMMENT:"
    COMMENT_CLOSE = r"\*/"
    WORD = r"[^; \*\t\n\r\f\v]+"
    SEMI = r";"

    ignore_astrix = r"\*"
    ignore_newline = r"\n"
    ignore_space = r" "

    def ignore_newline(self, t):
        self.lineno += t.value.count("\n")

    def error(self, t):
        print("Line %d: Bad character %r" % (self.lineno, t.value[0]))
        self.index += 1


class CommentParser(Parser):
    tokens = CommentLexer.tokens
    debugfile = "comment_parser.out"

    def __init__(self):
        self.comments = []

    @_("comment_doc comment_doc")
    def comment_doc(self, p):
        pass

    @_("COMMENT_OPEN string COMMENT_CLOSE")
    def comment_doc(self, p):
        print("#########")
        print(f"Got: {p.string}")
        print("#########")
        self.comments.append(p.string)
        return p.string

    @_("string string")
    def string(self, p):
        return p[0] + " " + p[1]

    @_("WORD")
    def string(self, p):
        return p.WORD

    def error(self, p):
        pprint(p)

        if not p:
            print("Hit the end of the file!")
            return

        print(f"Syntax error at type: {p.type} value: {p.value} line: {p.lineno}")
        while True:
            tok = next(self.tokens, None)

            if tok == None:
                print("Error Tok: Hit None")
                return tok

            if tok.type == "COMMENT_OPEN":
                print("Error Tok: Found new comment")
                return tok

            print(f"Ignoring: {tok.type}")


def test_one_comment_recovery_after():
    lexer = CommentLexer()

    test_data = """
    /* COMMENT: This is the
       only comment string I'd
       like to parse out
    */

    /* I don't care about this one. */

    """

    parser = CommentParser()
    parser.parse(lexer.tokenize(test_data))
    assert len(parser.comments) == 1


def test_one_comment_recovery_before():
    lexer = CommentLexer()

    test_data = """
    /* I don't care about this one. */

    /* COMMENT: This is the
       only comment string I'd
       like to parse out
    */

    """

    parser = CommentParser()
    parser.parse(lexer.tokenize(test_data))
    assert len(parser.comments) == 1

Aborting a parse

Thanks for SLY - very useful. Can you help? I have a small parser for reading a 'C' source code file and performing a translation on lines starting with "#rom". This is a non-standard statement available in Custom Computer Service's PIC compiler.
The modification I'm doing now requires the translation to be target processor specific so the first #rom statement must be the device type, some have 14 bit wide ROM words and others 16 bit. There is no point continuing if that statement is missing.
I have an SLY parser working but my solution to aborting the parse if an ordinary #rom statement appears before a #rom device statement seems awkward and heavy handed. Definitely not neat.
The 'deviceStmntErr error' production simply raises an exception.

@_( 'deviceStmnt statements' )
def root( self, p ):
	print("hv root")
	return True

@_( 'deviceStmntErr error' )
def root( self, p ):
	print("No device stmnt")
	raise CCSParserExcept("Device type statement missing")


@_( 'deviceStmnt' )
def deviceStmnt( self, p ):
	print("hv deviceStmnt:deviceStmnt")
	return

@_( 'CCODE deviceStmnt' )
def deviceStmnt( self, p ):
	print("hv deviceStmnt:CCODE deviceStmnt")
	return

@_( '' )	#empty production
def deviceStmntErr( self, p ):
	print("hv deviceStmntErr")
	return

@_( 'HASHROM DEVICE STRING' )
def deviceStmnt( self, p ):
	print("hv deviceStmnt")
	etc
HASHROM, DEVICE & STRING are terminals from the lexer. The following is a valid device statement,
'#rom device "PIC16F73"'.

What I'm wondering is: is there a parser function in SLY that can tell the parser to receive an end of file as the next token, rather than calling the lexer for the next token? I could then return an abort code or message instead of raising the exception.
Any comments or feedback would be appreciated.

Using a class property to automatically populate tokens set

For token-heavy grammars, instead of manually defining a set of token names, you can populate that set using a class property.

class classproperty:
    def __init__(self, fget):
        self.fget = fget

    def __get__(self, owner_self, owner_cls):
        return self.fget(owner_cls)


class MyLexer(sly.Lexer):
    @classproperty
    def tokens(cls):
        return {x for x in cls.__dict__ if x.isupper()}

In the example above, the tokens class property returns a set of token names from the MyLexer object's uppercased field names.

This eliminates the need to manually add each token name as a string to a set as in:

class MyLexer(sly.Lexer):
    tokens = {
        'NE',
        'DIGITS'
    }

    NE = '!='

    @_(r'\b([0-9.]+)[f]{0,1}\b')
    def DIGITS(self, t):
        pass

It would be nice if the user could instruct sly.lexer._build to automatically collect token names from the lexer object though.

item from token

Please forgive me if this is a very dumb question; I am a beginner with yacc parsers.

I cannot understand why the minimalist example below throws a syntax error. The behavior I would
expect from the grammar, letters -> LETTER, does not seem to work in this case.

test_string = 'At'

class TestLexer(Lexer):
    tokens = {LETTER}
    LETTER = '[a-zA-Z]'

class TestParser(Parser):    
    tokens = TestLexer.tokens

    @_('LETTER')
    def letters(self, p):
        return [p[0]]
    
t_lexer = TestLexer()
t_parser = TestParser()
t_parser.parse(t_lexer.tokenize(test_string))
