
pylr1's People

Contributors

horazont, sebastianriese


pylr1's Issues

Lexical Tie-Ins

The parser generator should support lexical tie-ins (implemented via lexer-states).
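A lexical tie-in means a parser action can switch the lexer into a different start condition, so that subsequent input is tokenized differently. A minimal sketch of the idea (class and method names here are hypothetical, not pyLRp's actual API):

```python
# Sketch of a lexical tie-in: a parser action flips the lexer into a
# different start condition so the next tokens are lexed differently.
class Lexer:
    def __init__(self, data):
        self.data = data
        self.state = "$INITIAL"  # current start condition

    def push_state(self, state):
        self.state = state

class Parser:
    def __init__(self, lexer):
        self.lexer = lexer

    def on_typedef(self, name):
        # classic C example: after "typedef int foo;" the lexer must
        # report `foo` as a TYPE_NAME token, not a plain IDENTIFIER
        self.lexer.push_state("TYPE_NAMES")

lexer = Lexer("typedef int foo;")
Parser(lexer).on_typedef("foo")
```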

Unicode support

The parser generator should accept Unicode input, and its regexes should offer features that make handling Unicode easy.

Lexers are larger than necessary

Currently a lextable is written for each start condition. This is not optimal: many parsers have start conditions that cannot be distinguished. For example, most parsers never explicitly use the start-of-line or start-of-file conditions; in those cases they could simply share the lextable of the $INITIAL start condition, and some code in the body of the lexing function could be dropped as well.

Also, some identical actions are written to the lexer multiple times. This should be avoided as well.
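The table-sharing idea above amounts to deduplicating by content before emission. A sketch, assuming tables are represented as hashable values keyed by start-condition name (the representation is an assumption of this sketch, not pyLRp's actual one):

```python
# Sketch: map each start condition to its DFA table; conditions with
# identical tables share one emitted table.
def dedup_tables(condition_tables):
    """condition_tables: {condition_name: table}, tables hashable here."""
    emitted = {}   # table -> index in the output list
    tables = []    # unique tables, in emission order
    mapping = {}   # condition_name -> index of its (shared) table
    for name, table in condition_tables.items():
        if table not in emitted:
            emitted[table] = len(tables)
            tables.append(table)
        mapping[name] = emitted[table]
    return tables, mapping

tables, mapping = dedup_tables({
    "$INITIAL": (1, 2, 3),
    "$SOL": (1, 2, 3),      # start of line: same table as $INITIAL
    "STRING": (4, 5, 6),
})
# $INITIAL and $SOL now share a single emitted table
```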

The Great Syntax Regularization

The current syntax has some irregularities (e.g. some block directives use colons, others don't). We aim to make the syntax regular to improve usability.

Overhead due to stand-alone parsers/lexers

Some of the source code of the generated files could be split out into a runtime library. This should remain optional: it has to be easy to ship the runtime with any other package using generated parsers.

An extension-module runtime library would also allow speeding up the parser transparently where available (perhaps the lex/parse tables would have to be moved out of Python data structures for more efficient access).

We should also think about a cleaner design of the Lexer and Parser classes, and about extensibility.

Alternative Byte Sources for the Lexer

Currently the generated lexers can only open files by name, and the files must be mmap-able.

It would therefore be nice to be able to lex from other sources, such as strings, sockets, pipes, stdin etc. (On 32-bit machines mmap-ing very large files may also be a problem, due to limited address space.)

While lexing from strings is trivially added (only the constructor of the Lexer class has to be adapted), lexing from other sources is an interesting task, especially if the lookahead should be limited (compare flex's interactive mode).

Additionally, a stack of input files would be nice, to enable inclusion of source files.
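One way to decouple the lexer from mmap would be a small byte-source abstraction that the lexer pulls from. A sketch with hypothetical names (not pyLRp's current API):

```python
import io

# Sketch: the lexer asks a ByteSource for more bytes instead of
# mmap-ing a named file; sources for strings and file-like objects.
class ByteSource:
    def read(self, n):
        """Return up to n bytes, b"" at EOF."""
        raise NotImplementedError

class StringSource(ByteSource):
    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)
    def read(self, n):
        return self._buf.read(n)

class FileSource(ByteSource):
    # works for regular files, sys.stdin.buffer, socket.makefile("rb"), ...
    def __init__(self, fileobj):
        self._f = fileobj
    def read(self, n):
        return self._f.read(n)

src = StringSource(b"A B C")
```

An include stack would then just be a list of such sources, popping back to the outer source when the inner one hits EOF.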

Create a library with common classes

The Position class and some other support classes do not need to be re-created for each parser.

Putting them in a shared library allows for use with type hints in AST modules without running into trouble with circular imports.
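For illustration, a shared Position class could look like the following (the module layout and field names are assumptions; the string format mirrors the position output shown in the %empty issue below):

```python
# Sketch: a Position class in a shared runtime module, importable by
# AST modules for type hints without circular imports against the
# generated parser module.
from dataclasses import dataclass

@dataclass(frozen=True)
class Position:
    file: str
    line0: int
    col0: int
    line1: int
    col1: int

    def __str__(self):
        return (f"{self.file} Line {self.line0}:{self.col0}"
                f" - {self.line1}:{self.col1}")

pos = Position("text", 1, 0, 1, 1)
```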

Use sane internal format for pyblobs

The current handling of the Python actions is quite broken. The self-hosted grammar-file parser shows some promise of improving on this, as it understands quite a bit of Python's lexeme structure. At the least, instead of splitting at newlines and inserting indentation afterwards, lists of logical-line strings should be provided. Directly serializing semi-parsed Python in the writer would be even better (see the PyBlob class of the self-hosted grammar-file parser for some ideas; this would also allow cleaner stack-var replacement).

More error resistant parse stack var references

The $[0-9]+ stack references in the parser are exceedingly error prone. Adding support for $SYMBOLNAME[0-9]+ and $[name] plus symbol[name] syntax in the production would provide more stability. The required .sem is also not elegant. Using @[0-9]+ for position marks and $[0-9]+ for the semantic value, as in bison/yacc, should be considered. Problem: the @ sign may actually appear in valid Python code, so correct replacement becomes more difficult.
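The named-reference idea could be implemented as a textual substitution over the action blob, given the production's right-hand side. A sketch (the `stack[...].sem` target form is an assumption; it only handles uniquely named symbols):

```python
import re

# Sketch: replace $name and $N stack references in an action with
# indexed stack accesses, using the production's RHS symbol names.
def replace_stack_refs(action, rhs_symbols):
    # assumption of this sketch: each symbol name occurs at most once;
    # repeated symbols would need $name1, $name2 disambiguation
    index = {sym: i + 1 for i, sym in enumerate(rhs_symbols)}

    def sub(match):
        name = match.group(1)
        if name.isdigit():              # plain $1, $2, ...
            return f"stack[{name}].sem"
        return f"stack[{index[name]}].sem"

    return re.sub(r"\$([A-Za-z_0-9]+)", sub, action)

code = replace_stack_refs("$$= $left + $right", ["left", "op", "right"])
```

Position marks (@N) would need the same pass, with the extra care noted above because @ is legal in Python (decorators, matrix multiplication).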

Split up grammar spec to multiple files

For complex grammars it may be desirable to split the lexer and parser across multiple files.
The problem: when using stokens, the two are fully interdependent. So a syntax like:

# header section
%lexer("path/to/lexer.pyLR")
%parser("path/to/parser.pyLR")
%footer
# footer section

or

%include("path/to/include.pyLR")

might be the best solution. The %include version is also interesting for parsers and lexers that share structure (sample use case: adding support for actions in other languages to the self-hosting pyLRp parser; most of it will be similar, or even the same, but the lang-blobs will have a different lexical structure/grammar).

Position information is broken for ``%empty``

For the following parser definition (syntax):

%lexer

\n+ %restart
A   A
B   B
X   X

%parser

document:
    foo:
        $$.sem = [$1.sem]
    document X foo:
        $$.sem = $1.sem
        $$.sem.append($3.sem)

foo:
    %empty:
        $$.sem = $$.pos
    A:
        $$.sem = $$.pos
    B:
        $$.sem = $$.sem

Together with a script bar.py:

#!/usr/bin/env python3
import pprint
import sys

from foo import Parser, Lexer

l = Lexer(open(sys.argv[1], "rb"), filename=sys.argv[1])
p = Parser(l)

result = p.Parse()

for item in result:
    print(item)

And the following input (text):

AX

We get the following:

$ python3 -m pyLRp -lL3Td -o foo.py syntax && python3 bar.py text
0 # (0, 4) A "A"
0 4 # (1, 3) X "X"
0 6 # (1, 0) X "X"
0 1 # (0, 2) X "X"
0 1 2 # (1, 2) $EOF ""
0 1 2 3 # (1, 1) $EOF ""
0 1 # (1, 5) $EOF ""
text Line 1:0 - 1:1
 Line 0:0 - 1:2

As you can see, even the file name is missing from the position information for the %empty production. I understand that it might be difficult to produce a coherent range of characters, but the file name should be reported correctly.

If possible, col0 == col1 and line0 == line1 would be nice, too, but I don’t know if it makes sense.

Non-random ordering of tables

Some of the generated data is randomly ordered, or inherently random because it is generated from unordered items or depends on hashes that are ids (read: the addresses of the objects in CPython) and therefore inherently random. For the sake of debugging, regression testing and version-control interaction, some stability would be nice.

While some randomness is easily removed, for other objects this goal is more difficult to achieve. A list of some of the random/randomly ordered items follows:

  • lexaction methods in the generated Lexer (easily fixed by writing them in their numeric order)
  • lextable states; this one is much more difficult, because several set-operation-based algorithms are run in series

The parsetable is much less random, as the LR-graph generation is quite predictable and assigns the numbers accordingly. The symbol numbers are also nonrandom (being assigned incrementally by first appearance), and the start-condition numbers in the lexer are predictable as well.
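The easy case, emitting the lexaction methods in numeric order, boils down to sorting by key before writing. A trivial sketch (the dict-of-actions representation is assumed):

```python
# Sketch: emit items derived from sets/dicts in a sorted, deterministic
# order, so generated output is stable across runs and hash seeds.
def emit_actions(actions):
    """actions: {action_number: source_text}; emit in numeric order."""
    return [actions[n] for n in sorted(actions)]

emitted = emit_actions({2: "b", 0: "a", 1: "c"})
```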

The pywriter is a mess

The pywriter is in many places inelegant, error prone and difficult to understand.

The implementation of generation options is completely ad hoc and hard to follow; it should be replaced by an elegant, extensible, and clear general mechanism.

Preferably, one assembles a Composite whose components each represent an option; the correct code is then written automagically.
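The Composite suggestion could look like this (all class names are hypothetical; the point is that each option becomes a component the writer simply walks):

```python
# Sketch of the suggested Composite: each generation option is a
# component; the writer walks the tree and each component contributes
# its lines to the output.
class WriterComponent:
    def write(self, out):
        raise NotImplementedError

class CompositeWriter(WriterComponent):
    def __init__(self, *components):
        self.components = list(components)
    def write(self, out):
        for component in self.components:
            component.write(out)

class HeaderWriter(WriterComponent):
    def write(self, out):
        out.append("#!/usr/bin/env python3")

class DebugTraceWriter(WriterComponent):
    # only assembled into the composite when debugging is requested
    def write(self, out):
        out.append("TRACE = True")

out = []
CompositeWriter(HeaderWriter(), DebugTraceWriter()).write(out)
```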

Tool support for whitespace structured languages

While LALR(1) parsers with regex-based lexers are very comfortable tools for building parsers for expressions and C-like languages, formulating an indentation-based grammar like Python's is difficult. Support from the lexer would therefore be nice: instead of specifying the INDENT and DEDENT tokens by regexen, the lexer computes them directly when a certain switch is given.

Line continuations (\ at EOL in Python) and free-form segments (anything in parens, braces or brackets in Python, which do not rely on whitespace for structure) have to be easily definable.
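The INDENT/DEDENT computation in question is the standard indentation-stack algorithm (as used by CPython's tokenizer). A sketch over logical lines, ignoring continuations and free-form segments:

```python
# Sketch: the standard indentation-stack algorithm. Compare each
# logical line's indent to the top of a stack; deeper pushes INDENT,
# shallower pops DEDENTs until the levels match.
def indent_tokens(lines):
    stack = [0]
    tokens = []
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        if indent > stack[-1]:
            stack.append(indent)
            tokens.append("INDENT")
        while indent < stack[-1]:
            stack.pop()
            tokens.append("DEDENT")
        tokens.append("LINE")
    while stack[-1] > 0:        # close any open blocks at EOF
        stack.pop()
        tokens.append("DEDENT")
    return tokens

toks = indent_tokens(["if x:", "    y", "z"])
```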

An elegant general mechanism would be preferred to a single purpose switch.

Trailing context

Trailing-context support (re/context, re$ in flex) should be added to the lexer. I don't care too much for /context, but I consider $ to be exceedingly important. While at it, the flex <>-feature might be implemented as well (trailing context which only matches at EOF).

Conflict resolution at runtime

Languages like Haskell support the definition of the fixity of operators at runtime.

A parser generator could help the implementation of such features by deferring conflict resolution to run time. Instead of hard-coding the resolution into the parse table, a conflict-resolution entry could be set which uses information available at parse time to determine how the conflict shall be resolved.

The way this shall be exposed is not obvious, but a fixity like %runtime could be used, and methods could be added to the parser to control fixity resolution.
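The deferred resolution could reduce to a shift/reduce decision driven by a mutable fixity table, as with Haskell's infixl/infixr declarations. A sketch (function names and the table shape are hypothetical):

```python
# Sketch: resolve a shift/reduce conflict between the operator on the
# stack and the lookahead operator using a fixity table that the
# parsed program itself can extend at parse time.
fixity = {}  # operator -> (associativity, precedence)

def declare_fixity(op, assoc, prec):
    fixity[op] = (assoc, prec)

def resolve(stack_op, lookahead_op):
    stack_assoc, stack_prec = fixity[stack_op]
    _, look_prec = fixity[lookahead_op]
    if stack_prec != look_prec:
        # higher precedence on the stack binds tighter: reduce first
        return "reduce" if stack_prec > look_prec else "shift"
    # equal precedence: associativity decides
    return "reduce" if stack_assoc == "left" else "shift"

declare_fixity("+", "left", 6)   # like Haskell's infixl 6 +
declare_fixity("*", "left", 7)   # like Haskell's infixl 7 *
```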

Custom methods and initialization in the lexer and parser classes

It would be nice if it were possible to define custom methods and custom initialization code in the lexer and parser classes. It is often useful to factor out common code from productions; currently one can define functions in the %footer section, but often it would be better/cleaner if they were methods of the Parser class. Additionally, often the interesting result of a parse is the assembly of a structure, which is done by method calls on a resulting object rather than via the semantic value of the root node. Also, the %function lexaction would better be a %method lexaction.

Current workaround: just use the methods in the actions, and use derived classes of Lexer and Parser which contain the init code and supply the methods.

Another possible workaround: use stupid and complete %AST generation and derive a visitor accordingly. This may be desirable for building up legacy structures anyway.
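The first workaround, subclassing the generated classes, could look like this (the base `Parser` here is a stand-in for the generated one; the helper names are invented for illustration):

```python
# Sketch of the subclassing workaround: add init code and helper
# methods in a subclass of the generated Parser, then use the helpers
# from the grammar's actions via self.
class Parser:                    # stand-in for the generated Parser
    def __init__(self, lexer):
        self.lexer = lexer

class MyParser(Parser):
    def __init__(self, lexer):
        super().__init__(lexer)
        self.symbols = {}        # custom initialization state

    def intern(self, name):      # helper callable from actions
        """Map each distinct name to a small integer id."""
        return self.symbols.setdefault(name, len(self.symbols))

p = MyParser(lexer=None)
```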

Error recovery in the parser

Add support for error recovery in the parser. See the Dragon Book for ideas. Good error messages for parsing errors would also be very desirable.
