
PackCC

Overview

PackCC is a parser generator for C. Its main features are as follows:

  • Generates your parser in C from a grammar described in a PEG,
  • Gives your parser great efficiency by packrat parsing,
  • Supports direct and indirect left-recursive grammar rules.

The grammar of your parser can be described in a PEG (Parsing Expression Grammar). A PEG is a top-down parsing language similar in appearance to regular expressions. Compared with a bottom-up parsing language such as Yacc's, a PEG is much more intuitive and cannot be ambiguous. A PEG does not require tokenization to be a separate step; tokenization rules can be written in the same way as any other grammar rules.
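As a sketch of this point, tokenization rules can sit alongside higher-level rules in one grammar (the rule names here are illustrative):

```
assignment <- name _ '=' _ number
name       <- [a-zA-Z_] [a-zA-Z0-9_]*   # a tokenization rule, written like any other rule
number     <- [0-9]+
_          <- [ \t]*                    # optional whitespace
```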

Your generated parser can parse inputs very efficiently thanks to packrat parsing. Packrat parsing is recursive descent parsing accelerated by memoization. With it, any input can be parsed in linear time. Without it, the resulting parser could exhibit exponential time performance in the worst case because of the unlimited look-ahead capability.

Unlike common packrat parsers, PackCC can support direct and indirect left-recursive grammar rules. This powerful feature enables you to describe your language grammar in a much simpler way. (The algorithm is based on the paper "Packrat Parsers Can Support Left Recursion" authored by A. Warth, J. R. Douglass, and T. Millstein.)

Some additional features are as follows:

  • Thread-safe and reentrant,
  • Supports UTF-8 multibyte characters (version 1.4.0 or later),
  • Generates easy-to-understand parser source code,
  • Consists of just a single compact source file,
  • Released under the MIT license (not under a contagious license!)

The generated code is formatted and kept as easy to understand as possible. It does use many goto statements, but the control flow is far more traceable than the goto spaghetti produced by Yacc and other parser generators. This matters little to ordinary users, but it helps PackCC developers debug the generator.

PackCC itself is under MIT license, but you can distribute your generated code under any license you like.

Installation

You can obtain the executable packcc by compiling src/packcc.c using your favorite C compiler. For convenience, build environments for GCC, Clang, and Microsoft Visual Studio are prepared under the build directory.

Using GCC

Other than MinGW

packcc will be built in both directories build/gcc/debug/bin and build/gcc/release/bin using gcc by executing the following commands:

cd build/gcc
make
make check  # bats-core and uncrustify are required (see tests/README.md)

packcc in the directory build/gcc/release/bin is suitable for practical use.

MinGW

packcc will be built in both directories build/mingw-gcc/debug/bin and build/mingw-gcc/release/bin using gcc by executing the following commands:

cd build/mingw-gcc
make
make check  # bats-core and uncrustify are required (see tests/README.md)

packcc in the directory build/mingw-gcc/release/bin is suitable for practical use.

Using Clang

Other than MinGW

packcc will be built in both directories build/clang/debug/bin and build/clang/release/bin using clang by executing the following commands:

cd build/clang
make
make check  # bats-core and uncrustify are required (see tests/README.md)

packcc in the directory build/clang/release/bin is suitable for practical use.

MinGW

packcc will be built in both directories build/mingw-clang/debug/bin and build/mingw-clang/release/bin using clang by executing the following commands:

cd build/mingw-clang
make
make check  # bats-core and uncrustify are required (see tests/README.md)

packcc in the directory build/mingw-clang/release/bin is suitable for practical use.

Using Microsoft Visual Studio

You have to install Microsoft Visual Studio 2019 in advance. After that, you can build packcc.exe by following these steps:

  • Open the solution file build\msvc\msvc.sln,
  • Select a preferred solution configuration (Debug or Release) and a preferred solution platform (x64 or x86),
  • Invoke the Build Solution menu item.

packcc.exe will appear in build\msvc\XXX\YYY directory. Here, XXX is x64 or x86, and YYY is Debug or Release. packcc.exe in the directory build\msvc\XXX\Release is suitable for practical use.

Usage

Command

You must prepare a PEG source file in advance. For details of the PEG source syntax, see the section "Syntax". Here, the file name example.peg is used as an example.

packcc example.peg

By running this, the parser source example.h and example.c are generated.

If no PEG file name is specified, the PEG source is read from the standard input, and -.h and -.c will be generated.

The base name of the parser source files can be changed with the -o option.

packcc -o parser example.peg

By running this, the parser source parser.h and parser.c are generated. This option can be specified only once.

A directory to search for import files can be added with the -I option (version 2.0.0 or later). This option can be specified as many times as needed; the directories are searched in the order they are specified on the command line.

packcc -I foo -I bar/baz example.peg

By running this, the directory foo is searched first, and the directory bar/baz is searched next. The directories specified by this option have higher priority than those specified in the environment variable PCC_IMPORT_PATH and the default directories. For more details of import, see the explanation of %import written in the section "Syntax".

If you want to disable UTF-8 support, specify the command line option -a or --ascii (version 1.4.0 or later).

If you want to insert #line directives in the generated source and header files, specify the command line option -l or --lines (version 1.7.0 or later). It is helpful to trace compilation errors of the generated source and header files back to the codes written in the PEG source file.

If you want to check the version of the packcc command, execute the following.

packcc -v

Syntax

A grammar consists of a set of named rules. A rule definition can be split into multiple lines.

rulename <- pattern

The rulename is the name of the rule to define. The pattern is a text pattern that contains one or more of the following elements.

rulename

The element stands for the entire pattern in the rule with the name given by rulename.

variable:rulename

The element stands for the entire pattern in the rule with the name given by rulename. The variable is an identifier associated with the semantic value returned from the rule by assigning to $$ in its action. The identifier can be referred to in subsequent actions as a variable. The example is shown below.

term <- l:term _ '+' _ r:factor { $$ = l + r; }

A variable identifier must consist of letters (uppercase and lowercase), digits, and underscores, and must begin with a letter. C reserved keywords cannot be used.

sequence1 / sequence2 / ... / sequenceN

Each sequence is tried in turn until one of them matches, at which time matching for the overall pattern succeeds. If no sequence matches then the matching for the overall pattern fails. The operator slash (/) has the least priority. The example is shown below.

'foo' rule1 / 'bar'+ [0-9]? / rule2

This pattern tries matching of the first sequence ('foo' rule1). If it succeeds, then the overall pattern matching succeeds and ends without evaluating the subsequent sequences. Otherwise, it tries matching of the next sequence ('bar'+ [0-9]?). If it succeeds, then the overall pattern matching succeeds and ends without evaluating the subsequent sequence. Finally, it tries matching of the last sequence (rule2). If it succeeds, then the overall pattern matching succeeds. Otherwise, the overall pattern matching fails.

'string'

A character or string enclosed in single quotes is matched literally. The ANSI C escape sequences are recognized within the characters. The UNICODE escape sequences (ex. \u20AC) are also recognized including surrogate pairs, if the command line option -a is not specified (version 1.4.0 or later). The example is shown below.

'foo bar'

"string"

A character or string enclosed in double quotes is matched literally. The ANSI C escape sequences are recognized within the characters. The UNICODE escape sequences (ex. \u20AC) are also recognized including surrogate pairs, if the command line option -a is not specified (version 1.4.0 or later). The example is shown below.

"foo bar"

[character class]

A set of characters enclosed in square brackets matches any single character from the set. The ANSI C escape sequences are recognized within the characters. The UNICODE escape sequences (ex. \u20AC) are also recognized including surrogate pairs, if the command line option -a is not specified (version 1.4.0 or later). If the set begins with a caret (^), the set is negated (the element matches any character not in the set). Any pair of characters separated with a dash (-) represents the range of characters from the first to the second, inclusive. The examples are shown below.

[abc]
[^abc]
[a-zA-Z0-9_]

.

A dot (.) matches any single character. Note that the only time this fails is at the end of input, where there is no character to match.

element ?

The element is optional. If present on the input, it is consumed and the match succeeds. If not present on the input, no text is consumed and the match succeeds anyway.

element *

The element is optional and repeatable. If present on the input, one or more occurrences of the element are consumed and the match succeeds. If no occurrence of the element is present on the input, the match succeeds anyway.

element +

The element is repeatable. If present on the input, one or more occurrences of the element are consumed and the match succeeds. If no occurrence of the element is present on the input, the match fails.
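A sketch combining the three repetition operators (the rule names are illustrative):

```
integer <- sign? digit+    # an optional sign followed by one or more digits
sign    <- '+' / '-'
digit   <- [0-9]
spaces  <- [ \t]*          # zero or more blanks; always succeeds
```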

& element

The predicate succeeds only if the element can be matched. The input text scanned while matching element is not consumed from the input and remains available for subsequent matching.

! element

The predicate succeeds only if the element cannot be matched. The input text scanned while matching element is not consumed from the input and remains available for subsequent matching. A popular idiom is the following, which matches the end of input, after the last character of the input has already been consumed.

!.
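For instance, a start rule that must consume the entire input can combine this idiom with repetition (statement is a hypothetical rule):

```
file <- statement* !.
```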

( pattern )

Parentheses are used for grouping (modifying the precedence of the pattern).

< pattern >

Angle brackets are used for grouping (modifying the precedence of the pattern) and text capturing. The captured text is numbered in evaluation order, and can be referred to later using $1, $2, etc.

$n

A dollar ($) followed by a positive integer represents a text previously captured. The positive integer corresponds to the order of capturing. A $1 represents the first captured text. The examples are shown below.

< [0-9]+ > 'foo' $1

This matches 0foo0, 123foo123, etc.

'[' < '='* > '[' ( !( ']' $1 ']' ) . )* ( ']' $1 ']' )

This matches [[...]], [=[...]=], [==[...]==], etc.

{ c source code }

Curly braces surround an action. The action is arbitrary C source code to be executed at the end of matching. Any braces within the action must be properly nested. Note that braces in directive lines and in comments (/*...*/ and //...) are appropriately ignored. One or more actions can be inserted in any places between elements in the pattern. Actions are not executed where matching fails.

[0-9]+ 'foo' { puts("OK"); } 'bar' / [0-9]+ 'foo' 'baz'

In this example, if the input is 012foobar, the action { puts("OK"); } is executed, but if the input is 012foobaz, it is not. All matched actions are guaranteed to be executed exactly once.

In the action, the C source code can use the predefined variables below.

  • $$ : The output variable, to which the result of the rule is stored. The data type is the one specified by %value. The default data type is int.
  • auxil : The user-defined data that has been given via the API function pcc_create(). The data type is the one specified by %auxil. The default data type is void *.
  • variable : The result of another rule that has already been evaluated. If the rule has not been evaluated, it is ensured that the value is zero-cleared (version 1.7.1 or later). The data type is the one specified by %value. The default data type is int.
  • $n : The string of the captured text. The n is the positive integer that corresponds to the order of capturing. The variable $1 holds the string of the first captured text.
  • $ns : The start position in the input of the captured text, inclusive. The n is the positive integer that corresponds to the order of capturing. The variable $1s holds the start position of the first captured text.
  • $ne : The end position in the input of the captured text, exclusive. The n is the positive integer that corresponds to the order of capturing. The variable $1e holds the end position of the first captured text.
  • $0 : The string of the text between the position in the input at which the rule pattern begins to match and the current position at which the element immediately before the action finishes matching.
  • $0s : The start position in the input at which the rule pattern begins to match.
  • $0e : The current position in the input at which the element immediately before the action finishes matching.

An example is shown below.

term <- l:term _ '+' _ r:factor { $$ = l + r; }
factor <- < [0-9]+ >            { $$ = atoi($1); }
_ <- [ \t]*

Note that the string data held by $n and $0 are discarded immediately after evaluation of the action. If the string data are needed after the action, they must be copied in $$ or auxil. If they are required to be copied in $$, it is recommended to define a structure as the type of output data using %value, and to copy the necessary string data in its member variable. Similarly, if they are required to be copied in auxil, it is recommended to define a structure as the type of user-defined data using %auxil, and to copy the necessary string data in its member variable.

The position values are 0-based; that is, the first position is 0. The data type is size_t (before version 1.4.0, it was int).
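As a sketch, the capture position variables can be used to report where a match occurred (the printf is purely illustrative):

```
number <- < [0-9]+ > { printf("number \"%s\" at [%zu, %zu)\n", $1, $1s, $1e); $$ = atoi($1); }
```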

element ~ { c source code }

Curly braces following tilde (~) surround an error action. The error action is arbitrary C source code to be executed at the end of matching only if the preceding element matching fails. Any braces within the error action must be properly nested. Note that braces in directive lines and in comments (/*...*/ and //...) are appropriately ignored. One or more error actions can be inserted in any places after elements in the pattern. The operator tilde (~) binds less tightly than any other operator except alternation (/) and sequencing. The error action is intended to make error handling and recovery code easier to write. In the error action, all predefined variables described above are available as well. The examples are shown below.

rule1 <- e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }
rule2 <- (e1 e2 e3) ~{ error("one of e[123] has failed"); }

%header { c source code }

The specified C source code is copied verbatim to the C header file before the generated parser API function declarations. Any braces in the C source code must be properly nested. Note that braces in directive lines and in comments (/*...*/ and //...) are appropriately ignored. When %header is used multiple times, the respective C source codes are copied in order of their appearance.

%source { c source code }

The specified C source code is copied verbatim to the C source file before the generated parser implementation code. Any braces in the C source code must be properly nested. Note that braces in directive lines and in comments (/*...*/ and //...) are appropriately ignored. When %source is used multiple times, the respective C source codes are copied in order of their appearance.

%common { c source code }

The specified C source code is copied verbatim to both the C header file and the C source file, before the generated parser API function declarations and the implementation code respectively. This has the same effect as %header { c source code } %source { c source code }. Any braces in the C source code must be properly nested. Note that braces in directive lines and in comments (/*...*/ and //...) are appropriately ignored.

%earlyheader { c source code }

%earlysource { c source code }

%earlycommon { c source code }

Same as %header, %source, and %common, respectively. The only difference is that these directives place the code at the very beginning of the generated file, before any code or includes generated by PackCC. This can be useful, for example, when it is necessary to modify the behavior of standard libraries via a macro definition.
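For example, a macro that must be defined before any standard header is included can be placed with %earlysource (using _GNU_SOURCE purely as an illustration):

```
%earlysource {
#define _GNU_SOURCE  /* must precede the first #include of a standard header */
}
```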

%value "output data type"

The type of output data, which is output as $$ in each action and can be retrieved from the parser API function pcc_parse(), is changed to the specified one from the default int. This can be used only once and cannot be used in imported files.

%auxil "user-defined data type"

The type of user-defined data, which is passed to the parser API function pcc_create(), is changed to the specified one from the default void *. This can be used only once and cannot be used in imported files.

%prefix "prefix"

The prefix of the parser API functions is changed to the specified one from the default pcc. This can be used only once and cannot be used in imported files.

%import "import file name"

The content of the specified import file is expanded at the text location of %import (version 2.0.0 or later). This can be used multiple times anywhere and can be used also in imported files. The import file name can be a relative path to the current directory or an absolute path. If it is a relative path, the directories listed below are searched for the import file in the listed order.

  1. the directory where the file that imports the import file is located
  2. the directories specified with -I options
    • They are prioritized in order of their appearance in the command line.
  3. the directories specified by the environment variable PCC_IMPORT_PATH
    • They are prioritized in order of their appearance in the value of this variable.
    • The character used as a delimiter between directory names is the colon ':' if PackCC is built for a Unix-like platform such as Linux, macOS, and MinGW. The character is the semicolon ';' if PackCC is built as a native Windows executable. (This is exactly the same manner as the environment variable PATH.)
  4. the per-user default directory
    • This is the subdirectory .packcc/import in the home directory if PackCC is built for a Unix-like platform, and in the user profile directory, "C:\Users\username" for example, if PackCC is built as a native Windows executable.
  5. the system-wide default directory
    • This is the directory /usr/share/packcc/import if PackCC is built for a Unix-like platform, and is the subdirectory packcc/import in the common application data directory, "C:\ProgramData" for example.

Note that a file that has already been imported is silently ignored if it is imported again.

#comment

A comment can be inserted between # and the end of the line.

%%

A double percent %% terminates the section for rule definitions of the grammar. All text following %% is copied verbatim to the C source file after the generated parser implementation code.

(The specification is determined by referring to peg/leg developed by Ian Piumarta.)

Import Files

The following import files are currently bundled.

For details, see here.

Macros

Some macros are prepared to customize the parser. The macro definitions should be placed in a %source section in the PEG source.

%source {
#define PCC_GETCHAR(auxil) get_character((auxil)->input)
#define PCC_BUFFERSIZE 1024
}

The following macros are available.

PCC_GETCHAR(auxil)

The function macro to get a character from the input. The user-defined data passed to the API function pcc_create() can be retrieved from the argument auxil. It can be ignored if there is no user-defined data. This macro must return a character code as an int, or -1 if the input ends.

The default is defined as below.

#define PCC_GETCHAR(auxil) getchar()
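For example, to read input from an in-memory string instead of the standard input, a cursor structure can be passed as the user-defined data and PCC_GETCHAR redefined to advance it. The struct and function names below are illustrative, not part of PackCC:

```c
#include <stddef.h>

/* Hypothetical cursor over an in-memory string, passed to pcc_create() as auxil. */
typedef struct string_input {
    const char *str;
    size_t pos;
} string_input_t;

/* Returns the next character code as an int, or -1 when the string is exhausted. */
static int string_getchar(string_input_t *in) {
    return in->str[in->pos] ? (unsigned char)in->str[in->pos++] : -1;
}

/* In the PEG source, inside %source { ... }:
 * #define PCC_GETCHAR(auxil) string_getchar(auxil)
 */
```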

PCC_ERROR(auxil)

The function macro to handle a syntax error. The user-defined data passed to the API function pcc_create() can be retrieved from the argument auxil. It can be ignored if there is no user-defined data. This macro need not return a value. It may abort the process (by using exit() for example) when a fatal error occurs, and can also return normally to deal with warnings.

The default is defined as below.

#define PCC_ERROR(auxil) pcc_error()
static void pcc_error(void) {
    fprintf(stderr, "Syntax error\n");
    exit(1);
}

PCC_MALLOC(auxil,size)

The function macro to allocate a memory block. The user-defined data passed to the API function pcc_create() can be retrieved from the argument auxil. It can be ignored if there is no user-defined data. The argument size is the number of bytes to allocate. This macro must return a pointer to the allocated memory block, or NULL if sufficient memory is not available.

The default is defined as below.

#define PCC_MALLOC(auxil, size) pcc_malloc_e(size)
static void *pcc_malloc_e(size_t size) {
    void *p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "Out of memory\n");
        exit(1);
    }
    return p;
}

PCC_REALLOC(auxil,ptr,size)

The function macro to reallocate an existing memory block. The user-defined data passed to the API function pcc_create() can be retrieved from the argument auxil. It can be ignored if there is no user-defined data. The argument ptr is the pointer to the previously allocated memory block. The argument size is the new number of bytes to reallocate. This macro must return a pointer to the reallocated memory block, or NULL if sufficient memory is not available. The contents of the memory block should be left unchanged if the reallocation fails.

The default is defined as below.

#define PCC_REALLOC(auxil, ptr, size) pcc_realloc_e(ptr, size)
static void *pcc_realloc_e(void *ptr, size_t size) {
    void *p = realloc(ptr, size);
    if (p == NULL) {
        fprintf(stderr, "Out of memory\n");
        exit(1);
    }
    return p;
}

PCC_FREE(auxil,ptr)

The function macro to free an existing memory block. The user-defined data passed to the API function pcc_create() can be retrieved from the argument auxil. It can be ignored if there is no user-defined data. The argument ptr is the pointer to the previously allocated memory block. This macro need not return a value.

The default is defined as below.

#define PCC_FREE(auxil, ptr) free(ptr)

PCC_DEBUG(auxil,event,rule,level,pos,buffer,length)

The function macro for debugging (version 1.5.0 or later). Sometimes, especially for complex parsers, it is useful to see how exactly the parser processes the input. This macro is called on important events and allows you to log or display the current state of the parser. The argument rule is a string that contains the name of the currently evaluated rule. The non-negative integer level specifies how deep in the rule hierarchy the parser currently is. The argument pos holds the position from the start of the current context in bytes. When event == PCC_DBG_MATCH, the argument buffer holds the matched input and length is its size. For other events, buffer and length indicate the part of the currently loaded input that is being used to evaluate the current rule.

Caution: Since version 1.6.0, the first argument auxil is added to this macro. The user-defined data passed to the API function pcc_create() can be retrieved from this argument.

There are currently three supported events:

  • PCC_DBG_EVALUATE (= 0) - called when the parser starts to evaluate rule
  • PCC_DBG_MATCH (= 1) - called when rule is matched, at which point buffer holds entire matched string
  • PCC_DBG_NOMATCH (= 2) - called when the parser determines that the input does not match currently evaluated rule

A very simple implementation could look like this:

static const char *dbg_str[] = { "Evaluating rule", "Matched rule", "Abandoning rule" };
#define PCC_DEBUG(auxil, event, rule, level, pos, buffer, length) \
    fprintf(stderr, "%*s%s %s @%zu [%.*s]\n", (int)((level) * 2), "", dbg_str[event], rule, pos, (int)(length), buffer)

The default is to do nothing:

#define PCC_DEBUG(auxil, event, rule, level, pos, buffer, length) ((void)0)

PCC_BUFFERSIZE

The initial size (the number of characters) of the text buffer. The text buffer is expanded as needed. The default is 256.

PCC_ARRAYSIZE

The initial size (the number of elements) of the internal arrays other than the text buffer. The arrays are expanded as needed. The default is 2.

API

The parser API consists of just the three simple functions below.

pcc_context_t *pcc_create(void *auxil);

Creates a parser context. This context needs to be passed to the functions below. The auxil can be used to pass user-defined data to be bound to the context. NULL can be specified if there is no user-defined data.

int pcc_parse(pcc_context_t *ctx, int *ret);

Parses an input text (from standard input by default) and returns the result in ret. The ret can be NULL if no output data is needed. This function returns 0 if no text is left to be parsed, or a nonzero value otherwise.

void pcc_destroy(pcc_context_t *ctx);

Destroys the parser context. All resources allocated in the parser context are released.

The type of output data ret can be changed. If you want to change it to char *, specify %value "char *" in the PEG source. The default is int.

The type of user-defined data auxil can be changed. If you want to change it to long, specify %auxil "long" in the PEG source. The default is void *.

The prefix pcc can be changed. If you want to change it to foo, specify %prefix "foo" in the PEG source. The default is pcc.

After the above settings, the API functions change as shown below.

foo_context_t *foo_create(long auxil);
int foo_parse(foo_context_t *ctx, char **ret);
void foo_destroy(foo_context_t *ctx);

The typical usage of the API functions is shown below.

int ret;
pcc_context_t *ctx = pcc_create(NULL);
while (pcc_parse(ctx, &ret));
pcc_destroy(ctx);

Examples

Desktop Calculator

A simple example that performs interactive integer arithmetic with the four basic operations is shown here. Note that left-recursive grammar rules are defined in this example.

%prefix "calc"

%source {
#include <stdio.h>
#include <stdlib.h>
}

statement <- _ e:expression _ EOL { printf("answer=%d\n", e); }
           / ( !EOL . )* EOL      { printf("error\n"); }

expression <- e:term { $$ = e; }

term <- l:term _ '+' _ r:factor { $$ = l + r; }
      / l:term _ '-' _ r:factor { $$ = l - r; }
      / e:factor                { $$ = e; }

factor <- l:factor _ '*' _ r:unary { $$ = l * r; }
        / l:factor _ '/' _ r:unary { $$ = l / r; }
        / e:unary                  { $$ = e; }

unary <- '+' _ e:unary { $$ = +e; }
       / '-' _ e:unary { $$ = -e; }
       / e:primary     { $$ = e; }

primary <- < [0-9]+ >               { $$ = atoi($1); }
         / '(' _ e:expression _ ')' { $$ = e; }

_      <- [ \t]*
EOL    <- '\n' / '\r\n' / '\r' / ';'

%%
int main() {
    calc_context_t *ctx = calc_create(NULL);
    while (calc_parse(ctx, NULL));
    calc_destroy(ctx);
    return 0;
}

An execution example is as follows.

$ ./calc↵
1+2*(3+4*(5+6))↵
answer=95
5*6*7*8/(1*2*3*4)↵
answer=70

Simple AST builder

An example which builds an AST (abstract syntax tree) and dumps it is shown here. This example accepts the same inputs as Desktop Calculator shown above.

%prefix "calc"

%value "pcc_ast_node_t *"    # <-- must be set

%auxil "pcc_ast_manager_t *" # <-- must be set

%header {
#define PCC_AST_NODE_CUSTOM_DATA_DEFINED /* <-- enables node custom data */

typedef struct text_data_tag { /* <-- node custom data type */
    char *text;
} pcc_ast_node_custom_data_t;
}

%source {
#include <stdio.h>
#include <string.h>
}

statement <- _ e:expression _ EOL { $$ = e; }
           / ( !EOL . )* EOL      { $$ = NULL; }

expression <- e:term { $$ = e; }

term <- l:term _ '+' _ r:factor { $$ = pcc_ast_node__create_2(l, r); $$->custom.text = strdup("+"); }
      / l:term _ '-' _ r:factor { $$ = pcc_ast_node__create_2(l, r); $$->custom.text = strdup("-"); }
      / e:factor                { $$ = e; }

factor <- l:factor _ '*' _ r:unary { $$ = pcc_ast_node__create_2(l, r); $$->custom.text = strdup("*"); }
        / l:factor _ '/' _ r:unary { $$ = pcc_ast_node__create_2(l, r); $$->custom.text = strdup("/"); }
        / e:unary                  { $$ = e; }

unary <- '+' _ e:unary { $$ = pcc_ast_node__create_1(e); $$->custom.text = strdup("+"); }
       / '-' _ e:unary { $$ = pcc_ast_node__create_1(e); $$->custom.text = strdup("-"); }
       / e:primary     { $$ = e; }

primary <- < [0-9]+ >               { $$ = pcc_ast_node__create_0(); $$->custom.text = strdup($1); }
         / '(' _ e:expression _ ')' { $$ = e; }

_      <- [ \t]*
EOL    <- '\n' / '\r\n' / '\r' / ';'

%import "code/pcc_ast.peg"   # <-- provides AST build functions

%%
void pcc_ast_node_custom_data__initialize(pcc_ast_node_custom_data_t *obj) { /* <-- must be implemented when enabling node custom data */
    obj->text = NULL;
}

void pcc_ast_node_custom_data__finalize(pcc_ast_node_custom_data_t *obj) {   /* <-- must be implemented when enabling node custom data */
    free(obj->text);
}

static void dump_ast(const pcc_ast_node_t *obj, int depth) {
    if (obj) {
        switch (obj->type) {
        case PCC_AST_NODE_TYPE_NULLARY:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "nullary", obj->custom.text);
            break;
        case PCC_AST_NODE_TYPE_UNARY:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "unary", obj->custom.text);
            dump_ast(obj->data.unary.node, depth + 1);
            break;
        case PCC_AST_NODE_TYPE_BINARY:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "binary", obj->custom.text);
            dump_ast(obj->data.binary.node[0], depth + 1);
            dump_ast(obj->data.binary.node[1], depth + 1);
            break;
        case PCC_AST_NODE_TYPE_TERNARY:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "ternary", obj->custom.text);
            dump_ast(obj->data.ternary.node[0], depth + 1);
            dump_ast(obj->data.ternary.node[1], depth + 1);
            dump_ast(obj->data.ternary.node[2], depth + 1);
            break;
        case PCC_AST_NODE_TYPE_VARIADIC:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "variadic", obj->custom.text);
            {
                size_t i;
                for (i = 0; i < obj->data.variadic.len; i++) {
                    dump_ast(obj->data.variadic.node[i], depth + 1);
                }
            }
            break;
        default:
            printf("%*s%s: \"%s\"\n", 2 * depth, "", "(unknown)", obj->custom.text);
            break;
        }
    }
    else {
        printf("%*s(null)\n", 2 * depth, "");
    }
}

int main(int argc, char **argv) {
    pcc_ast_manager_t mgr;
    pcc_ast_manager__initialize(&mgr);
    {
        calc_context_t *ctx = calc_create(&mgr);
        pcc_ast_node_t *ast = NULL;
        while (calc_parse(ctx, &ast)) {
            dump_ast(ast, 0);
            pcc_ast_node__destroy(ast);
        }
        calc_destroy(ctx);
    }
    pcc_ast_manager__finalize(&mgr);
    return 0;
}

The key point is the line %import "code/pcc_ast.peg". The import file code/pcc_ast.peg makes it easier to build ASTs. For more details, see here.

An execution example is as follows.

$ ./ast-calc↵
1+2*(3+4*(5+6))↵
binary: "+"
  nullary: "1"
  binary: "*"
    nullary: "2"
    binary: "+"
      nullary: "3"
      binary: "*"
        nullary: "4"
        binary: "+"
          nullary: "5"
          nullary: "6"
5*6*7*8/(1*2*3*4)↵
binary: "/"
  binary: "*"
    binary: "*"
      binary: "*"
        nullary: "5"
        nullary: "6"
      nullary: "7"
    nullary: "8"
  binary: "*"
    binary: "*"
      binary: "*"
        nullary: "1"
        nullary: "2"
      nullary: "3"
    nullary: "4"

AST Builder for Tiny-C

You can find a more practical example in the directory examples/ast-tinyc. It builds an AST from an input source file written in Tiny-C and dumps the AST.

packcc's People

Contributors

arithy, dolik-rce, masatake, mingodad, wataash


packcc's Issues

Memory usage compared to gcc compiler

Looking for a ready-to-use C PEG grammar, I found this one https://github.com/pointlander/peg/blob/master/grammars/c/c.peg and needed to make some changes (see attached) to build and parse the generated C file using packcc. I then compared time and memory usage against gcc compiling the same file and got the results shown below: the generated parser compiled with -O2 uses 12x more memory than gcc compiling without optimization, and 6.6x more than gcc compiling with -O2.

packcc -l -o c99-mouse c99-mouse.peg

gcc -E c99-mouse.c > c99-mouse.pp.c

/usr/bin/time gcc -g -o c99-mouse c99-mouse.c
0.55user 0.02system 0:00.58elapsed 100%CPU (0avgtext+0avgdata 60484maxresident)k
0inputs+568outputs (0major+16781minor)pagefaults 0swaps

/usr/bin/time gcc -g -O2 -o c99-mouse c99-mouse.c
2.79user 0.07system 0:02.86elapsed 99%CPU (0avgtext+0avgdata 110656maxresident)k
0inputs+1304outputs (0major+36607minor)pagefaults 0swaps

/usr/bin/time ./c99-mouse c99-mouse.pp.c
1.16user 0.21system 0:01.38elapsed 99%CPU (0avgtext+0avgdata 740176maxresident)k
0inputs+0outputs (0major+184749minor)pagefaults 0swaps

c99-mouse.peg.zip

Characters higher than \x7f are being changed to \xff during parser generation

Hi, I had a rule in my grammar that referred to characters higher than 127 ('\x7f'), and it changed all the character escape codes in my rule to \xff.
This is the rule I started out with:

id_uc <-
	[\xc0-\xdf][\x80-\xbf] /
	[\xe0-\xef][\x80-\xbf][\x80-\xbf] /
	[\xf0-\xff][\x80-\xbf][\x80-\xbf][\x80-\xbf]

As I kept simplifying it to track down the problem, the generator kept changing the escape codes to \xff, and leaving lower ones (e.g. \x41) as they are.

I found the cause of the problem too:
On line 487, in the function escape_character, you have a bit that says (unsigned)ch instead of (unsigned char)ch.
Here is the code block that's in:

        if (ch >= '\x20' && ch < '\x7f')
            snprintf(*buf, 5, "%c", ch);
        else
            snprintf(*buf, 5, "\\x%02x", (unsigned)ch);

In your code, it seems you are using the unsigned cast to make sure the higher char values are being fed properly to snprintf, but the C type unsigned is equal to unsigned int, so snprintf is writing strings like \xffffffc0 into the string buffer.
Changing the cast to (unsigned char) makes it write them correctly (e.g. \xc0).

So can you change that line to this for me?:

            snprintf(*buf, 5, "\\x%02x", (unsigned char)ch);

(Actually it would be for other users, because I've already fixed it in my downloaded copy.)

And thanks for the great tool.

How to do good syntax error handling?

Hi,
I can use special rules to catch common errors and point out which row they occur on. I keep track of rows and store it in auxil:

_ <- (WS / Comments)*
__ <- (WS / Comments)+
WS <- [ \t\r\n] {
    if ($0[0] == '\n') {
        auxil->row++;
    }
}
Comments <- SingleLineComment / BlockComment
SingleLineComment <- "//" (!EOL .)* EOL?
EOL <- ("\r\n" / "\n" / "\r") { auxil->row++; }
BlockComment <- "/*" (BlockCommentContent / EOL)* "*/"
BlockCommentContent <- (!("*/" / EOL) .)

I can then use a special rule to catch a common error, e.g.

Block <- e:Expr { $$ = CN(BLOCK, 1, e); } ( _ CommaSeparator _ e:Expr { AC($$, e); })*
CommaSeparator <- ("," / ";") {
    if (strcmp($0, ";") == 0) {
        fprintf(stderr, "%d: Use ',' to separate expressions in blocks", auxil->row);
    }
}

But with unexpected syntax errors everything breaks down and I cannot point out which row the error occurred on.

As a workaround I added the following:

    static int ROW = 1;

    static int satie_getchar(satie_auxil_t* _auxil) {
        int c = getchar();
        if (c == '\n') {
            ROW++;
        }
        return c;
    }

    static void satie_error(satie_auxil_t* auxil) {
        panic("Syntax error near line %d", ROW);
    }

It works and I have re-invented awk-like error handling. :-) It's crude though.

Ideally I would like to point out syntax errors very precisely with both row and column info.

I haven't been able to figure out how to do that. Any hints?

Cheers
/Joakim

packcc-1.3.0-linux-x64.tar.gz isn't gzipped

packcc-1.3.0-linux-x64.tar.gz should be named packcc-1.3.0-linux-x64.tar as it is a tarball, not a gzipped tarball.

It's only a minor issue, as the GNU tar command typically copes fine. (I only noticed as I explicitly passed -z which isn't really necessary.)

cannot match backslash

Matching a backslash doesn't seem to be possible.

file <- '\\' EOL
EOL <-  ("\r\n"  / "\n" / "\r" )

input

\

EBCDIC (and UNICODE)

For probably irrational reasons, I have got a rough port of packcc running on a VM/370 mainframe (emulated). All good :-)

I need to change some of the code to support EBCDIC - e.g. [A-Z] is weird in EBCDIC. I am assuming that you would not be too interested in "polluting" your code with EBCDIC - however I will also try to do it in a way that may support UNICODE (but perhaps this is already done). Again, just interested in your thoughts.

Thanks

Adrian

Simple grammar goes into an infinite loop instead of erroring

I have a simple grammar like so:

root <- foo*
foo <- "foo"

If I give the parser something like 0, fo, or bar it goes into an infinite loop instead of exiting. I'm using the example main code like this:

int main() {
    pcc_context_t *ctx = pcc_create(NULL);
    while (pcc_parse(ctx, NULL));
    pcc_destroy(ctx);
    return 0;
}

[Bug] Conflict caused by FALSE, TRUE macros and bool_tag enum using same name

I am unable to build packcc on AIX 7.2 using gcc version 8.3.0. The details are described in universal-ctags issue #3044.

That issue was closed/rejected because I thought this was a bug in universal-ctags, which is not the case. This is the complete output of the make run:

mkdir -p debug/bin/ && gcc -std=gnu89 -Wall -Wextra -Wno-unused-parameter -Wno-overlength-strings -pedantic -O0 -g2 -o debug/bin/packcc ../../src/packcc.c
In file included from /opt/freeware/lib/gcc/powerpc-ibm-aix7.2.0.0/8.3.0/include-fixed/stdio.h:503,
                 from ../../src/packcc.c:40:
../../src/packcc.c:94:5: error: expected identifier before numeric constant
     FALSE = 0,
     ^~~~~
../../src/packcc.c: In function 'unescape_string':
../../src/packcc.c:621:31: warning: comparison is always false due to limited range of data type [-Wtype-limits]
                         if (d < 0) break;
                               ^
../../src/packcc.c:647:31: warning: comparison is always false due to limited range of data type [-Wtype-limits]
                         if (d < 0) break;
                               ^
../../src/packcc.c:672:39: warning: comparison is always false due to limited range of data type [-Wtype-limits]
                                 if (d < 0) break;
                                       ^
../../src/packcc.c: In function 'populate_bits':
../../src/packcc.c:914:12: warning: right shift count >= width of type [-Wshift-count-overflow]
     x |= x >> 32;
            ^~
make: *** [Makefile:34: debug/bin/packcc] Error 1

Memory-exhaustion, and infinite loops, on certain grammars

Certain grammars result in the generation of a parser which may enter an infinite loop. I've noticed that in version 1.2.2 (but not in 1.2.1), a warning is shown: packcc: Warning: Infinite loop detected in generated code.

Also, certain grammars result in the generation of a parser which may quickly exhaust memory and then terminate. I saw no warning this time.

Can these be fixed? Needless to say, these possibilities are off-putting.

Here's a minimal example to recreate both:

Usage: echo -n -e "aaaa" | ./kaboom

%prefix "kaboom"

%header
{
  static void my_pcc_error(void);
  #define PCC_ERROR(auxil) my_pcc_error()
}

# top <- ( "a" ([ \t]*) * ) ## Infinite loop

top <- ( "a" ws * ) ## Out-of-memory error

ws  <- [ \t]*

%%


// #include <stdlib.h>
#include <stdio.h>

static void my_pcc_error(void) {
    fputs("Syntax error.\n", stderr);
}

int main(int argc, char *argv[]) {

    kaboom_context_t *ctx = kaboom_create(NULL);

    puts("Time to call kaboom_parse...");

    const int textRemains = kaboom_parse(ctx, NULL);

    /* We never get this far */

    puts(textRemains ? "Text remains" : "No text remains");

    kaboom_destroy(ctx);
    return 0;
}


%value and memory management

I'm trying to parse some input into a struct but I don't understand how to use %value to get a pointer in and out of the parser. A very stripped-down example of what I'd like to do:

//thing.h
enum STATE {
    FOO,
    BAR,
    BAZ
};
typedef struct {
    enum STATE state;
} cmds_t;
//the grammar
%header {
    #include "thing.h"
}

%value "cmds_t*"

COMMAND <- FOOBARBAZ
FOOBARBAZ <-
    "foo" { $$->state = FOO; } /
    "bar" { $$->state = BAR; } /
    "baz" { $$->state = BAZ; }
//main.c fragment
cmds_t* commands;
pcc_context_t* ctx = pcc_create(NULL);
int ret = pcc_parse(ctx, &commands);
//use *commands
//free(*commands) maybe?
pcc_destroy(ctx);

but obviously no memory is allocated, so the FOOBARBAZ actions are null pointer dereferences. Even if I pre-allocate memory or insert a dummy rule before FOOBARBAZ that allocates some memory, that pointer only lives for the lifetime of the action and then leaks. I don't see a way to propagate a single pointer through the rules without methods that make all the subsequent rules significantly more complex. The simplicity of the actions is a huge positive for me.

I see in the TinyC example that a second data structure is passed around in auxil that looks like it holds the AST. Is this the intended approach for all applications? I certainly could use %auxil cmds_t* and then have all my actions use auxil->... but I feel like I'm missing something simple that would allow the above.

Parser seems to be successful but doesn't return 0.

Hello I'm testing this nice project with a minimal example.

%prefix "w"

test <-
    word   {puts("OK");}

word <-
	[_a-zA-Z]*

%%

int main()
{
        w_context_t  * ctx = w_create(NULL);
        printf("parse res: %d\n", w_parse(ctx, NULL));
        w_destroy(ctx);
       return 0;
}

When this parser reads a word and prints "OK", I guess the "test" rule is successful. But the printf prints 1. There is no "Syntax error" default message.
Can a kind person explain what I'm doing wrong, if it is not an issue?

parser reads more data than necessary

If you have a grammar like:

foo <- "foobar\n" / "foo\n"

and input the string "foo\n..........", then packcc will read 8 bytes for this rule when it really only needed to read 4.

As an example of where this is a problem, consider an interactive parser where the user enters data line by line, if the user types "foo\n", the parser will request two or more lines of input from the user when only one was actually needed.

Need additional conditions to not shift by 32 on additional 32 bit platforms

Given this ifndef, only 32-bit Windows will not include the x |= x >> 32 code.

https://github.com/arithy/packcc/blob/master/src/packcc.c#L913

Other 32-bit platforms will include this code. On 32-bit FreeBSD, clang 10 and 11 (and it appears all others) produce a crashing program.

For example, after building packcc, universal-ctags attempts to run packcc with a sample tag file. Output from FreeBSD ports on a 32-bit system building the universal-ctags port using arithy/packcc:

http://beefy15.nyi.freebsd.org/data/130releng-i386-default/46fc7df8540c/logs/universal-ctags-p5.9.20210411.0.log

In this case populate_bits() returns -1 with optimized compiling.

I believe x >> 32 is undefined behavior on 32-bit systems where x is 4 bytes, resulting in a compiler warning:

./misc/packcc/src/packcc.c:914:12: warning: shift count >= width of type [-Wshift-count-overflow]
x |= x >> 32; 
           ^  ~~  
1 warning generated.

assuming x is > 0

Optimized gcc appears to do the right thing, resulting in 0; unoptimized gcc, however, leaves x's original value.

Optimized clang results in the maximum value of the given type, i.e. -1, on 32-bit; unoptimized clang, like gcc, results in the original value.

I'm not totally sure if this is a FreeBSD only problem. It's possible you can get lucky with gcc compiling optimized or unoptimized and it will run without issue?

I don't know if there's a single ifndef or macro that could be used to handle all platforms? Would it make sense to check the sizeof x during runtime and put the x |= x >> 32 in an if block?

For the short term, for FreeBSD, I can patch it out on 32-bit systems.

Changes made in Universal Ctags project

I should make pull requests. However, I cannot find time to do it now.
So, allow me to just list some of them here:

If possible, could you cherry-pick some of them?
If you know the same change is already introduced, let me know. I will remove the associated item from the list.

stop parser on error

I'm trying to implement errors like this:

expression
     <- <term> _ { printf("expression: >%s< \n", $1); }
     / <(!EOL .)*> .* { printf("line %d: error: expected expression: %s\n", ((State*)auxil)->line, $0); }

but of course occasionally the .* doesn't consume the entire input to the end, so you get multiple errors for the same line.
Is there a way to just stop the parser immediately?

AST Mode

Continues the discussion from #51.

The main use case for a parser-generator is to build an Abstract Syntax Tree, or AST. Therefore, it would be nice if there were a cleaner way to do so built into packcc. Right now the user must, for each grammar rule which should be part of the structure, write boilerplate code to allocate and return an AST node using the return values from the other grammar rules that make it up. There is also not an easy way to associate extra information with these nodes, which would be useful for type checking and semantic analysis. These concerns could be taken care of by the parser-generator.

It's still unclear exactly how this should be done.

Peg only mode?

I was benchmarking leg vs packcc for https://github.com/andrewchambers/minias and saw that leg is both 10x faster and uses 10x less RAM for my examples - I think in many cases the overhead of packrat parsing might not be worth it.

I was wondering if there is any chance for a peg only mode or peg only port, as packcc has many other advantages over peg/leg.

Pre type checking possible?

I tried to build a simple PEG grammar that should accept these:

1+1;
a*1;
1*a;
true&&true;
a&&true;

But fail on these:

1+true;
2&&1;
1&&true;

Below is my latest approach. It is almost there but fails on, for example, a*1;
I'm starting to believe that it's impossible and that I should just let my compiler do type analysis at a later stage. It would be nice to catch these errors early on, though.

Cheers
/J

%prefix "test"
Program <- Statement+
Statement <- Assignment / Expression ';'
Assignment <- Variable '=' Expression

# Expressions
Expression <- LogicalExpr / ArithmeticExpr

# Logical expressions
LogicalExpr <- OrExpr
OrExpr <- AndExpr ('||' AndExpr)*
AndExpr <- NotExpr ('&&' NotExpr)*
NotExpr <- '!' LogicalPrimary / LogicalPrimary
LogicalPrimary <- BooleanLiteral / NonArithmetic / Variable / FunctionCall
NonArithmetic <- (Variable / FunctionCall) !NumberLiteral

# Arithmetic expressions
ArithmeticExpr <- AdditiveExpr
AdditiveExpr <- MultiplicativeExpr (('+' / '-') MultiplicativeExpr)*
MultiplicativeExpr <- UnaryExpr (('*' / '/') UnaryExpr)*
UnaryExpr <- ('+' / '-')? ArithmeticPrimary
ArithmeticPrimary <- NumberLiteral / NonLogical / Variable/ FunctionCall
NonLogical <- (Variable / FunctionCall) !BooleanLiteral

# Handling of literals and variables
NumberLiteral <- [0-9]+ ('.' [0-9]+)?
BooleanLiteral <- 'true' / 'false'
Variable <- [a-zA-Z_][a-zA-Z0-9_]*
FunctionCall <- Variable '(' (Expression (',' Expression)*)? ')'

%%
int main() {
    test_context_t *context = test_create(NULL);
    test_parse(context, NULL);
    test_destroy(context);
    return 0;
}

Make generated code easier to read

Hello @arithy.

The README file states:

The generated code is beautified and as ease-of-understanding as possible

However, with the latest changes, each rule that matches character classes has about 40 lines dealing with Unicode. I believe the code for converting bytes to Unicode code points could (and should) be easily separated into a function. That would make the generated code significantly shorter and much easier to read.

What do you think? I can send PR if you're interested.

Uninitialized variables

Hello @arithy,

I have tested my application that uses PackCC generated parser with valgrind and I have noticed, that it reports conditional jumps depending on uninitialised values. Here is a simplified grammar to reproduce:

%value "int"

%source {
#include <stdio.h>
}

integer <- u:unary? d:digit {
    if (u) {
        printf("RESULT: %d\n", u * d);
    } else {
        printf("RESULT: %d\n", d);
    }
}

unary <- "-" { $$ = -1; } / "+" { $$ = 1; }
digit <- [0-9]+ { $$ = atoi($0); }

%%
int main() {
    pcc_context_t *ctx = pcc_create(NULL);
    pcc_parse(ctx, NULL);
    pcc_destroy(ctx);
    return 0;
}

If you compile this and run echo 42 | valgrind ./example, it reports (ignoring the uninteresting parts for brevity):

Conditional jump or move depends on uninitialised value(s)
   at 0x10BD4F: pcc_action_integer_0 (example.c:1045)
   by 0x10BCF2: pcc_do_action (example.c:1026)
   by 0x10BD13: pcc_do_action (example.c:1029)
   by 0x10C5D3: pcc_parse (example.c:1252)
   by 0x10C667: main (example.c:1266)

So far I'm using an ugly workaround, checking whether the optional part matched something:

integer <- <u:unary?> d:digit {
    if ($1s != $1e) {
        ...

But it's not very nice. Would it be possible to make sure that the variable is initialized to 0 (or another appropriate value, e.g. NULL if it is a pointer)? Or alternatively, would it be possible to add some syntax to check whether the variable is actually present in the rule? I mean something like if ($u) ..., that would return true if the optional variable is present.

By the way: Another place where similar problem pops up is in alternations:

ruleA <- (b:ruleB / c:ruleC) EOF {
    // do something with b or c, depending on which one was matched
} 

This can be usually worked around by moving the alternation into separate rule, but if we could simply do if (b) ... (or if ($b)), then it would make the grammar easier to read.

Parsing a "switch { case n: m }"

I tried to write a grammar to parse the following toy example:

switch 1 {
  case 2:
    3,
    42
  case 4:
    5
}

I tried with the following but it is not the way to do it:

SwitchStmt <- "switch" _ Expr _ "{" _ CaseStmt+ _ "}"
CaseStmt <- "case" _ Expr _ ":" _ ExprList
ExprList <- Expr (_ "," _ Expr)*
Expr <- [0-9]+
_ <- WS*
WS <- " " / "\t" / "\n" / "\r"

Any hints?

Cheers
/J

Segfaulting parser

Grammar:

%source {
static const char *dbg_str[] = { "Evaluating rule", "Matched rule", "Abandoning rule" };
#define PCC_DEBUG(event, rule, level, pos, buffer, length) \
    fprintf(stderr, "%*s%s %s @%d [%.*s]\n", level * 2, "", dbg_str[event], rule, pos, length, buffer)
}

file <- (a / _)+
a <- "A;"
_ <- [ \t\n]*

Input:

A;
A

Expected output:

Syntax error should be reported.

Debugger session:

(gdb) run < tmp.d/input.txt
Starting program: /home/h/prog/packcc/tests/tmp.d/parser < tmp.d/input.txt
Evaluating rule file @0 []
  Evaluating rule a @0 []
  Matched rule a @0 [A;]
  Evaluating rule a @2 []
  Abandoning rule a @2 []
  Evaluating rule _ @2 [
A]
  Matched rule _ @2 [
]
  Evaluating rule a @3 [A]
  Abandoning rule a @3 []
  Evaluating rule _ @3 [A
]
  Matched rule _ @3 []
Matched rule file @0 [A;
]
Evaluating rule file @0 [A
]
  Evaluating rule a @3 [A
]
  Abandoning rule a @3 []
  Evaluating rule _ @3 [A
]
  Matched rule _ @3 []
Matched rule file @0 [A

]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7f46801 in __memmove_avx_unaligned_erms () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7f46801 in __memmove_avx_unaligned_erms () from /usr/lib/libc.so.6
#1  0x00005555555571c0 in pcc_commit_buffer (ctx=0x55555555c2a0) at tmp.d/parser.c:820
#2  0x0000555555558581 in pcc_parse (ctx=0x55555555c2a0, ret=0x7fffffffe8ac) at tmp.d/parser.c:1128
#3  0x0000555555558623 in main (argc=1, argv=0x7fffffffe9b8) at main.c:17
(gdb) f 1
#1  0x00005555555571c0 in pcc_commit_buffer (ctx=0x55555555c2a0) at tmp.d/parser.c:820
820	    memmove(ctx->buffer.buf, ctx->buffer.buf + ctx->pos, ctx->buffer.len - ctx->pos);
(gdb) p *ctx
$1 = {pos = 3, level = 0, buffer = {buf = 0x55555555c310 "A\n\nA\n", max = 256, len = 2}, lrtable = {buf = 0x55555555c420, max = 256, len = 4}, lrstack = {buf = 0x55555555cc30, max = 2, 
    len = 0}, auxil = 0x0}

Somehow, ctx->pos > ctx->buffer.len, which leads to segfault, because size_t is unsigned, so this overflows and it effectively tries to copy 18446744073709551615 bytes of memory, which is waaaay outside of the allocated memory.

C predicates

Would it be possible to allow C predicates to participate in matching?

For example:

foonumber <- <[0-9]+> ?? { atoi($1) > 50 }

Lexical state support

JFlex has nice support for controlling lexical state. I assume that Flex does as well. In JFlex you call yybegin(int state) to start a new state, and then any rules that are wrapped by the state will get invoked:

%%
{
{myrule} { only gets recognized if state==MYSTATE }
}

Go here and search for "lexical state": https://jflex.de/manual.html

This is really useful. Does PEG or Packcc have a similar concept?

Strange bad generated parser

When trying to build a parser for the lpeg-re grammar to use with packcc, packcc accepts the grammar without any error and generates the parser, but when trying to compile the parser there is the error shown below; it seems to have trouble with the S rule in the grammar.

gcc -o lpeg-re lpeg-re.c
lpeg-re.c: In function ‘pcc_evaluate_rule_suffix’:
lpeg-re.c:1286:25: error: expected expression before ‘)’ token
 1286 |                         )) goto L0005;
      |                         ^

The grammar:

%prefix "lpeg_re"

pattern         <- exp !.
exp             <- S (grammar / alternative)

alternative     <- seq ('/' S seq)*
seq             <- prefix*
prefix          <- '&' S prefix / '!' S prefix / suffix
suffix          <- primary S (([+*?]
                            / '^' [+-]? num
                            / '->' S (string / '{}' / name / num)
                            / '=>' S name) S)*

primary         <- '(' exp ')' / (string / keyword) / class / defined
                 / '{:' (name ':')? exp ':}'
                 / '=' name
                 / '@' exp
                 / '{}'
                 / '{~' exp '~}'
                 / '{|' exp '|}'   # missing table capture
                 / '{' exp '}'
		 / '~?' # Expected match
		 / '~>' S ( 'foldleft' / 'foldright' / 'rfoldleft' / 'rfoldright' )
		 / '$' (string / name / num) # Arbitrary capture
                 / '.'
                 / name S !(asttag / arrow )
                 / '<' name '>'          ## old-style non terminals
		 / '^' name
		 / '%{' S name S '}'

grammar         <- definition+
definition      <- name S (asttag S)? arrow exp

class           <- '[' '^'? item (!']' item)* ']'
item            <- defined / range / .
range           <- . '-' [^\]]

S               <- ([ \t\f\r\n]  /  '--' [^\r\n]*)*  # spaces and comments
name            <- [A-Za-z_]([A-Za-z0-9_] / '-' !'>' )*
arrow           <- (  '<--' / '<==' / '<-|'  / '<-' )
num             <- [0-9]+
string          <- '"' [^"]* '"' / "'" [^']* "'"
defined         <- '%' name
keyword     <-  '`' [^`]+ '`'
asttag         <- ':' S name

%%
int main() {
    lpeg_re_context_t *ctx = lpeg_re_create(NULL);
    while (lpeg_re_parse(ctx, NULL));
    lpeg_re_destroy(ctx);
    return 0;
}

Error Recovery

Hello!

Sorry for opening two "issues"- they are not really issues but relate to different topics!

What is your thinking concerning re-syncing after a parse/recognition error? I am thinking of a strategy of maybe trying to insert "the" missing token (rule, I guess), and then, if that doesn't help, trying to find a sync point where the input meets a rule.

Happy to look at this - but what are your thoughts / advice / thinking on how relevant this might be for you and its feasibility?

Thanks

Adrian

Parser very slow with repeated parse calls

bench.peg:

%prefix "asm"

%earlyheader {
typedef struct {int x; int y; int z;} Parsev;
}

%value "Parsev"

# Uncomment for 10X speed.
# file <- line+

line <- s:stmt eol
      / eol
      / .

stmt <- d:directive 
      / i:instr
      / l:label

directive <- ".glob" "o"? "l" ws i:ident
           / ".data"
           / ".text"
           / ".balign" ws n:number 
           / ".byte" ws n:number

label <- i:ident ':'

instr <- "nop"
       / "leave"
       / "ret"
       / i:jmp
       / i:add

jmp <- "jmp" ws i:ident

add <- "add" 'q'? ws s:m ws? ',' ws? d:r64
     / "add" 'q'? ws s:imm ws? ',' ws? d:r64
     / "add" 'q'? ws s:r64 ws? ',' ws? d:m
     / "add" 'q'? ws s:r64 ws? ',' ws? d:r64
     / "addq" ws s:imm ws? ',' ws? d:m

m <- '(' ws? r:r64 ws? ')'
   / <'-'?[0-9]+> ws? '(' ws? r:r64 ws? ')'
   / i:ident  ws? '(' ws? r:r64 ws? ')'

r64 <- "%rax"
     / "%rcx"
     / "%rdx"
     / "%rbx"
     / "%rsp"
     / "%rbp"
     / "%rsi"
     / "%rdi"
     / "%r8" 
     / "%r9" 
     / "%r10"
     / "%r11"
     / "%r12"
     / "%r13"
     / "%r14"
     / "%r15"

imm <- '$' i:ident
     / '$' <'-'?[0-9]+>

ident <- <[_a-zA-Z][_a-zA-Z0-9]*>

number <- <'-'?[0-9]+>

ws <- [ \t]+

eol <- ws? ("\n" / (! .))

%%
int main() {
    asm_context_t *ctx = asm_create(NULL);
    while (asm_parse(ctx, NULL));
    asm_destroy(ctx);
    return 0;
}

bench.txt:

for i in `seq 100000`;  do  echo "addq %rax, (%rax)"  >> bench.txt ; done

First run, one parse call per line:

$ ./packcc ./bench.peg && clang -O2 -g bench.c -o bench

$ time ./bench  < bench.txt
real    1m22.178s
user    1m21.425s
sys     0m0.428s

Now uncomment the file <- line+ part of the benchmark:

$ ./packcc ./bench.peg && clang -O2 -g bench.c -o bench
$ time ./bench  < bench.txt

real    0m1.221s
user    0m1.026s
sys     0m0.192s

If I benchmark the first case, I find that the majority of the work is memmove of the internal packcc arrays; my intuition is that this extra work is not correct.

Null pointer error on broken grammar

Hello @arithy,

I have noticed very small bug that might crash PackCC, but only on some very broken grammars.

If there is only one rule and it contains syntax error, then null pointer is dereferenced at packcc.c:3148.

Grammar to test:

main <- ( "A"

If the grammar contains two or more rules, then it fails correctly (returns a non-zero code, but doesn't crash).

I am aware that this is pretty far fetched corner case (I have only found it due to another bug in my application 🙂) and it's up to you if you want to fix it or not. I just wanted to let you know.

`$n` should be independent across alternatives

The sequence of $n is successive across the alternatives.

start <- <rule_y> <rule_x> { printf("%s: rule_y(%s) rule_x(%s)\n", $0, $1, $2); }
       / <rule_x>          { printf("%s: rule_x(%s)\n", $0, $3); }  # <- not $1, but $3
       / <rule_y>          { printf("%s: rule_y(%s)\n", $0, $4); }  # <- not $1, but $4
rule_x <- 'xxx'
rule_y <- 'yyy'

This is unnatural; we expect the numbers to be independent:

start <- <rule_y> <rule_x> { printf("%s: rule_y(%s) rule_x(%s)\n", $0, $1, $2); }
       / <rule_x>          { printf("%s: rule_x(%s)\n", $0, $1); }
       / <rule_y>          { printf("%s: rule_y(%s)\n", $0, $1); }
rule_x <- 'xxx'
rule_y <- 'yyy'

Actions that run before the end of parsing.

I want to count lines and columns in order to display better error messages. But when a syntax error occurs, the line/col counting doesn't happen because actions don't run.
The older "peg/leg" tool has expression predicates that run at parse time; I haven't found a similar facility in packcc.

I can create many rules that try to handle all the possible errors, so that parsing is always successful. It's tedious, and it's a shame not to use the error actions " ~{} ".

My question is probably very naive, because counting line/col is pretty common, so if you know the usual packcc solution feel free to answer.

Questions not answered in the README

Questions

These are things that I've been wondering. Other people probably have the same questions, so they should probably be covered in the README.

  1. What is the storage duration of $$? There's actually a StackOverflow question about this. https://stackoverflow.com/questions/66145396/how-to-use-the-value-returned-by-a-packcc-parser

  2. How can I read from a file instead of stdin?

  3. Suppose I want to generate an AST. Am I supposed to generate it with $$ inside the rules manually? Is there some way to make this easier? Or is generating an AST beyond the scope of this parser generator?

  4. Can I define an action and an error action on the same rule?

  5. What's the deal with whitespace? It seems to be ignored. What if my language were whitespace-sensitive? Does it suck the whitespace out of string literals?

  6. My language has C style single and multi-line comments in it. How can I ignore those like whitespace seems to be ignored?

Thanks, and you've got a very impressive project.

Passing auxil to PCC_DEBUG

To control what should be printed or not in PCC_DEBUG definition, I would like to pass auxil to the definition of PCC_DEBUG.

#define PCC_DEBUG(event, rule, level, pos, buffer, length) baseDebug(event, rule, level, pos, buffer, length)
static void baseDebug(int event, const char *rule, size_t level, size_t pos, const char *buffer, size_t len)
{
	if (strcmp(rule, "Identifier") != 0)
		return;

PCC_DEBUG can print too many things. In the example, the rule name is examined to limit the output to "Identifier" only.
However, I have to hardcode "Identifier".

What I would like to do:

#define PCC_DEBUG(auxil, event, rule, level, pos, buffer, length) baseDebug(auxil, event, rule, level, pos, buffer, length)

static void baseDebug(struct parserCtx *auxil, int event, const char *rule, size_t level, size_t pos, const char *buffer, size_t len)
{
	if (!isMember(auxil->debug_rules_dict, rule))
		return;
	/* ... print the event ... */
}

PCC_DEBUG is already documented in the README file, so I wonder whether extending its signature like this would be acceptable.

Adding example of how to use generated parser

Hi, sorry to ask, but would it be possible to add an example of how to produce output by applying the parser to a file? I tried to look at the test scripts, but they are a bit confusing. Something like a build.sh showing the few required steps would be enough. Thank you so much for this project!

value passthrough

Is it possible to have something like this

expression <- x:(a / b / c / d) { $$ = x; }

instead of this

expression <- a { $$ = a; } / b { $$ = b; } / c { $$ = c; } / d { $$ = d; }

?

raising error from action?

It is unclear to me whether it is possible to raise an error from an action. In your calculator example it would make sense, for instance, for division by zero or for INT_MIN / -1.

thank you for this awesome project!

Lookahead woes and more

I have been trying to write a small PEG to parse the following in a file named simple.sa:

42,
a = 42,
a = b + c * d,
a.b,
a.b.c,
a.b(1, 2),
42.b,
4711.b(),
a[666],
a[777].foo

It works except for the last a[777].foo.

At times I think there must be a bug in packcc's lookahead pattern support, and then I realize it must be me misunderstanding something central. I wish there were more documentation on how to use positive and negative lookahead successfully. I'm stumped. Can anyone have mercy on me and point out what I need to change in the PEG below to make it work as expected?

Cheers
/J

# File: simple.peg
# Test: packcc simple.peg && gcc -o simple simple.c && ./simple < simple.sa

%prefix "satie"

%earlysource {
    static const char *dbg_str[] = { "Evaluating rule", "Matched rule", "Abandoning rule" };
    #define PCC_DEBUG(auxil, event, rule, level, pos, buffer, length) \
        fprintf(stderr, "%*s%s %s @%zu [%.*s]\n", (int)((level) * 2), "", dbg_str[event], rule, pos, (int)(length), buffer)
}

Program            <- _ TopLevelExpr (_ "," _ TopLevelExpr)* EOF
TopLevelExpr       <- Binding / Expr

Binding            <- MatchPattern _ "=" _ Expr
MatchPattern       <- Literal / FieldAccess / Symbol

Expr               <- Add
Add                <- Multiplicate (_ "+" _  Multiplicate)*
Multiplicate       <- Indexing (_ "*" _  Indexing)*
Indexing           <- Symbol _ "[" _ Expr _ "]"  / FunctionCall
FunctionCall       <- Symbol _ "(" _ ExprSequence? _ ")" / FieldAccess
FieldAccess        <- HasField (_ "." _ HasField)* / Primary
HasField           <- Symbol !(_ ("(" / "[")) / Literal / Indexing / FunctionCall
Primary            <- Literal / Symbol
ExprSequence       <- Expr (_ "," _ Expr)*
Literal            <- NumberLiteral
NumberLiteral      <- [0-9]+
Symbol             <- [a-zA-Z_][a-zA-Z0-9_]*

_                  <- WS*
WS                 <- [ \t\r\n]
EOF                <- _ !.

%%
int main() {
    satie_context_t *context = satie_create(NULL);
    satie_parse(context, NULL);
    satie_destroy(context);
    return 0;
}

Consider adding support for UCD (Unicode Character Database) rule patterns

I think it would be good to add Unicode support, either through built-in rules (e.g. UPPERCASE_LETTER | LOWERCASE_LETTER) as in pest, or through Unicode property patterns as in regexes (e.g. \p{Lu} | \p{Ll}).

I guess either an external library such as PCRE or an embeddable code header file would be needed.
