GithubHelp home page GithubHelp logo

vnmakarov / yaep Goto Github PK

View Code? Open in Web Editor NEW
130.0 130.0 13.0 3.03 MB

Yet Another Earley Parser

License: Other

Shell 0.07% C 2.18% C++ 1.15% Yacc 0.37% Makefile 0.18% CMake 0.29% M4 0.01% Lex 0.03% SWIG 95.70%
ambiguous-grammars ast earley-parser error-recovery grammar library minimal-cost-ast

yaep's People

Contributors

sanssecours avatar thecount avatar vnmakarov avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

yaep's Issues

Implementation of error recovery?

Hi,

This project looks very interesting to me as I'm trying to build earley parsers with different strategies. @vnmakarov I was wondering if it's possible for you to briefly mention how the error handling algorithm works? Or a few pointers to where related code relies will also be useful!

Great project!

Thanks and Best!

Make libyaep reentrant

YAEP currently manipulates a lot of static variables during its operation. This makes it impossible to use multiple parsers asynchronously.

One simple preliminary fix would be to declare all these variables thread_local. While this does not make libyaep reentrant, it would at least become possible to use multiple parsers concurrently, as long as each parser has a thread of its own and stays on that thread. This preliminary fix would, strictly speaking, break the API, though, because it would no longer be possible to use the same parser from different threads.

The better fix would be to make libyaep fully reentrant by eliminating all static variables. Unless extra parameters are required for this for exposed functions, this could be implemented without breaking the API. Parsers could then be used asynchronously and, as long the caller adheres to a few simple rules, would also be thread-safe.

mm

What is the best way to make sure my application memory management doesn't interfere with the speed (or function) of yaep's memory management?

I'm not sure the best way to work with your memory manager (OS_CREATE etc)...
yaep_parse has parse_alloc and parse_free functions (i read from yaep.txt that these functions are only used for the translation (*root) memory.

i note your test code (yaep.tst.in) defines

static void *
test_parse_alloc (int size)
{
  void *result;

  OS_TOP_EXPAND (mem_os, size);
  result = OS_TOP_BEGIN (mem_os);
  OS_TOP_FINISH (mem_os);
  return result;
}

and you pass yaep_parse and NULL for the yaep_parse functions

it seems you've optimized memory management and i don't feel as though i should use malloc/free in my own code in case i change your memory management characteristics.

but in the end, i'm going to link libyaep.a into code that uses malloc/realloc/free anyway, so...

what is the best way to make sure my application memory management doesn't interfere with the speed (or function) of yaep's memory management?

Joop Leo's optimization?

It is stated that YAEP doesn't use Joop Leo's optimization, and that right recursion is fast anyways. And because of the lookahead I do believe this to be true for some grammars, but not all LR(k) grammars. For example, the following LR(2) grammar:

S -> A 'a' 'b'
A -> 'a' A
A ->

Generates increasingly large set cores when parsing "aaa...aaab", leading to (at least) quadratic behavior.

Perhaps an optimization similar to Joop Leo's could be made, making YAEP linear time for all LR(k) grammars?

Licensing confusion

There is some confusion about which license(s) apply to YAEP.

  • The README states YAEP is licensed under LGPL version 2
  • The copyright file states YAEP is dual licensed under GPL and LGPL, and then references /usr/share/common-licenses/(GPL|LGPL), which, in my case, are symlinks to the newest versions (3.0) of the respective licenses.
  • The headers of some of the source files (e. g., yaep.c, second comment block) specifically forbid the use of the LGPL.

@vnmakarov could you please clarify:

  • Which license do you want to apply to YAEP?
    • GPL?
    • LGPL?
    • GPL/LGPL dual license?
    • Something else?
  • Which version(s) of the license(s) do you want to apply?
    • GPL version 2?
    • GPL version 3?
    • LGPL version 2.0 ("library" license)?
    • LGPL version 2.1 ("lesser" license)?
    • LGPL version 3.0?
    • Something else?
  • What about future versions of licenses?
    • Do you want to allow later versions (the "version X, or (at your option) any later version" clause)?
    • Do you want to allow only the specified versions?
    • Something else?

I can make a pull request fixing those headers/copyright files for you once you indicate what you want.

Install header files in separate subdirectory

Currently, make install installs three header files, allocate.h, cocom-config.h, and yaep.h, directly under /usr/local/include. Especially allocate.h and cocom-config.h might clash with headers of the same name from other packages, so I propose to create a dedicated subdirectory /usr/local/include/yaep/ and put all the headers there.

Some compiler warnings about possible uninitialized variables

Adding -Wall -Wextra with gcc give a few warnings and some of then I fixed as shown bellow:

diff --git a/src/yaep.c b/src/yaep.c
index 3719017..b68cce8 100644
--- a/src/yaep.c
+++ b/src/yaep.c
@@ -5263,8 +5263,8 @@ static vlo_t *tnodes_vlo;
 static struct yaep_tree_node *
 prune_to_minimal (struct yaep_tree_node *node, int *cost)
 {
-  struct yaep_tree_node *child, *alt, *next_alt, *result;
-  int i, min_cost;
+  struct yaep_tree_node *child, *alt, *next_alt, *result = NULL;
+  int i, min_cost = 0;
 
   assert (node != NULL);
   switch (node->type)
@@ -5431,7 +5431,7 @@ make_parse (int *ambiguous_p)
   struct yaep_tree_node *parent_anode, *anode, root_anode;
   int parent_disp;
   int saved_one_parse_p;
-  struct yaep_tree_node **term_node_array;
+  struct yaep_tree_node **term_node_array = NULL;
 #ifndef __cplusplus
   vlo_t stack, orig_states;
 #else
@@ -6420,7 +6420,7 @@ read_rule (const char ***rhs, const char **anode, int *anode_cost,
 
 /* The following variable is the current number of next input
    token. */
-static int ntok;
+static size_t ntok;
 
 /* The following function imported by Earley's algorithm (see comments
    in the interface file). */

Info on the fastest Earley impl

I made an Earley engine that is much faster in benchmarks than YAEP: gearley. Here are the results for 88248 tokens (10K lines of C).

gearley: bench_parse_c ... bench:  53,125,029 ns/iter (+/- 3,652,318)
gearley with CompactBocage: bench_parse_c ... bench:  87,122,025 ns/iter (+/- 2,744,845)
YAEP: parse time 0.38

Memory use

gearley: all:12669488b forest:12664888b
gearley with CompactBocage: all:2.217MB forest:2.212MB
YAEP: memory=12.724MB

So, the Earley sets take up only 4600 bytes. That's one byte every 19 tokens. Gearley with a compact forest takes only 17.4% of the memory YAEP does. However, a compact forest slows down the benchmark by 34 milliseconds. With normal forest, the parse takes 14% time compared to a parse with YAEP, and the same amount of memory.

Compact forest takes 25 bytes per token. Gearley with the default forest takes 602 ns per token.

I invite you to make your own benchmarks.

Exploring ASTs and productions

Hey there, I am really struggling to print out the tree that the parser does produce. The root node the parse method provides, does not afford any children node connections rather some info and the code of the terminal character. Could you please explain me how can I get access to all the derivation trees of any ambiguously parsed string?

Build and testing system needs an overhaul

While preparing #16, I noticed that the build/testing system is a bit onerous to work with. Specifically, the various *.in files appear to be some intermediate products (from autoconf?), probably the result of YAEP being part of a different software suite before it was split off as a standalone package. A particular headache were the bulky *.tst.in test files.

So before I continue working on #12, my next step would be to overhaul the building and testing process. I'm thinking CMake/CTest: it's compatible to more environments than autoconf, supports separate out-of-tree builds for C and C++ easily, and is generally less pain to work with. I'll also split the *.tst.in files into smaller files and put them in a separate directory.

Memory leak in yaep_free_grammar()

The following very simple program:

#include<yaep.h>

int main( void ) {
        struct grammar * g1 = yaep_create_grammar();
        yaep_free_grammar( g1 );
        struct grammar * g2 = yaep_create_grammar();
        yaep_free_grammar( g2 );
}

leaks memory. It's not quite clear to me what happens here. Apparently, yaep_free_grammar() deallocates some substructures of struct grammar, but not the memory for struct grammar itself (the first two lines of main() are sufficient to expose this problem). But in addition, yaep_create_grammar() and yaep_free_grammar() seem to handle a global grammar object (which kind of defeats the purpose of having multiple grammars) and thus some memory becomes lost permanently.

dynamic rule creation

Hi there. Great work! (too many c++only runtime parser generators out there)

Is there any chance to easily create rules at runtime? I mean, earley_parse_grammar(g, description) accepts string as grammar description for now, but I would like to fill some BNF-shaped (recursive) struct and use it in hypothetic earley_load_grammar(g, myBNF). I could create a patch myself, but I'm not sure my parser-generator background is strong enough to just start without any hints. Thanks!

Trying to output multiple parse trees with equal costs doesn't work

The following mode has problems (ignoring the glaring obvious issues with memory cleanup):

yaep_set_one_parse_flag(g,0); yaep_set_cost_flag(g,1);

The attached code yields:

Translation:
0: ALTERNATIVE: node=1, next=2
1: ABSTRACT: a2 ( 3 4 )
3: TERMINAL: code=97, repr='a'
4: TERMINAL: code=98, repr='b'
2:

while it should yield something like

Translation:
0: ALTERNATIVE: node=1, next=2
1: ABSTRACT: a1 ( 3 4 )
3: TERMINAL: code=97, repr='a'
4: TERMINAL: code=98, repr='b'
2: ALTERNATIVE: node=5, next=nil
5: ABSTRACT: a2 ( 3 4 )

ambig2_test.zip

fail fast when ambiguity is detected?

Hi,

I am considering using yaep to add limited user-customizable syntax to a domain-specific language. In #24 (comment), you express concern that a non-trivial expertise is required to avoid unintended ambguities in grammars. If I understand correctly, this is because Earley parsing is much slower on ambiguous grammars. In that case, would it be feasible to have yaep abort parsing once ambiguity is detected, ideally with an informative error message?

Thanks,
Andres

make install doesn't install allocate.h

after make install, i try

~$ echo '#include<yaep.h>' >test.c
~$ gcc test.c
In file included from test.c:1:0:
/usr/local/include/yaep.h:47:22: fatal error: allocate.h: No such file or directory
compilation terminated.

Parallel Build Fails

Issue Description

While creating a Homebrew formula for YAEP, I noticed that building the software in parallel fails sometimes.

Steps to Reproduce

mkdir build
cd build
# I use Ninja (https://ninja-build.org) below, since it seems to
# fail consistently (unlike e.g. `make` using the switch `-j9`, 
# which only fails sometimes).
cmake -GNinja .. 
ninja

Expected Result

The build system creates the library without any problems.

Actual Result

Ninja prints the following error message:

[12/211] Building C object src/CMakeFiles/yaep_test.dir/yaep.c.o
FAILED: src/CMakeFiles/yaep_test.dir/yaep.c.o
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc  -Isrc -DYAEP_TEST -DYAEP_DEBUG -std=gnu90 -MD -MT src/CMakeFiles/yaep_test.dir/yaep.c.o -MF src/CMakeFiles/yaep_test.dir/yaep.c.o.d -o src/CMakeFiles/yaep_test.dir/yaep.c.o   -c ../src/yaep.c
../src/yaep.c:3497:10: fatal error: 'sgramm.c' file not found
#include "sgramm.c"
         ^~~~~~~~~~
1 error generated.
[16/211] Building CXX object src/CMakeFiles/yaep++_test.dir/yaep.cpp.o
FAILED: src/CMakeFiles/yaep++_test.dir/yaep.cpp.o
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++   -Isrc -DYAEP_TEST -DYAEP_DEBUG -std=gnu++11 -MD -MT src/CMakeFiles/yaep++_test.dir/yaep.cpp.o -MF src/CMakeFiles/yaep++_test.dir/yaep.cpp.o.d -o src/CMakeFiles/yaep++_test.dir/yaep.cpp.o -c ../src/yaep.cpp
In file included from ../src/yaep.cpp:61:
In file included from ../src/yaep.c:3497:
build/src/sgramm.c:398:2: error: unterminated conditional directive
#if (! defined yyoverflow \
 ^
../src/yaep.cpp:99:10: error: use of undeclared identifier 'yaep_parse_grammar'
  return yaep_parse_grammar (this->grammar, strict_p, description);
         ^
2 errors generated.
[18/211] [BISON][bison_ansic] Building parser with bison 3.2.2
ansic.y: warning: 1 shift/reduce conflict [-Wconflicts-sr]
ninja: build stopped: subcommand failed.

.

Additional Information

  • OS: macOS 10.14.1
  • CMake: cmake version 3.13.1

Add First Release

It would be really nice if you could add a git tag, such as v1.0, 1.0.0 or something similar, for the current version of YAEP. This would make it easier to create a binary package based on the repository code. For example, Homebrew requires a stable, tagged version for a formula (a description on how to build a Homebrew package, in a Ruby based Domain Specific Language).

Empty root node returned

When I parse a string with YAEP, I'm getting back an YAEP_NIL root node. I've enabled the debug logging, and it appears that it is parsing correctly, but is not returning the root of this tree. Here's the debug output:

Parsing start...
  Set core = 0
      1 $S : .error $eof, error, 0
      2 $S : .S $eof, DT NN, 0
    -----------
      3 S : .NP0SBJ ADVP0TMP VP PUNCTUATION, DT NN, 0
      4 S : .NP0SBJ VP PUNCTUATION, DT NN, 0
      5 NP0SBJ : .DT NNS, DT, 0
      6 NP0SBJ : .NN, NN, 0

Reading 0=NN(108), Current set=0
New set=1
  Set core = 1
      7 NP0SBJ : NN., RB VBP VBD, 1
      9 S : NP0SBJ .VP PUNCTUATION, VBP VBD, 1
    -----------
     10 VP : .VBP PP0CLR, VBP, 0
     11 VP : .VBD NP0PRD COMMA ADVP0CLR, VBD, 0

Reading 1=VBP(101), Current set=1
New set=2
  Set core = 2
     12 VP : VBP .PP0CLR, IN, 1
    -----------
     13 PP0CLR : .IN NP, IN, 0

Reading 2=IN(105), Current set=2
New set=3
  Set core = 3
     14 PP0CLR : IN .NP, CD NNS NNP JJ, 1
    -----------
     15 NP : .NP0QP PP0TMP, CD JJ, 0
     16 NP : .NP PP, CD NNS NNP JJ, 0
     17 NP : .NNP, NNP, 0
     18 NP : .NNS, NNS, 0
     19 NP : .NP0QP PP, CD JJ, 0
     20 NP0QP : .CD CD, CD, 0
     21 NP0QP : .JJ NNS, JJ, 0

Reading 3=NNS(102), Current set=3
New set=4
  Set core = 4
     22 NP : NNS., PUNCTUATION IN, 1
     23 PP0CLR : IN NP., PUNCTUATION, 2
     25 VP : VBP PP0CLR., PUNCTUATION, 3
     26 S : NP0SBJ VP .PUNCTUATION, PUNCTUATION, 4
    -----------

Reading 4=PUNCTUATION(104), Current set=4
New set=5
  Set core = 5
     27 S : NP0SBJ VP PUNCTUATION., $eof, 5
     28 $S : S .$eof, $eof, 5
    -----------

Reading 5=$eof(-1), Current set=5
New set=6
  Set core = 6
     29 $S : S $eof.,, 6
    -----------
Translation:
      0: EMPTY

Here's the grammar. I'm passing in string leifh.

"\n"
"TERM DT = 97 COMMA = 98 RB = 99 CD = 100 VBP = 101 NNS = 102 NNP = 103 PUNCTUATION = 104 IN = 105 VBD = 106 JJ = 107 NN = 108;\n"
"S : NP0SBJ VP PUNCTUATION\n"
"  | NP0SBJ ADVP0TMP VP PUNCTUATION\n"
"  ;\n"
"NP : NP0QP PP\n"
"  | NNS\n"
"  | NNP\n"
"  | NP PP\n"
"  | NP0QP PP0TMP\n"
"  ;\n"
"ADVP0CLR : RB PP\n"
"  ;\n"
"VP : VBD NP0PRD COMMA ADVP0CLR\n"
"  | VBP PP0CLR\n"
"  ;\n"
"PP0CLR : IN NP\n"
"  ;\n"
"PP : IN NP\n"
"  ;\n"
"ADVP0TMP : RB\n"
"  ;\n"
"PP0TMP : IN NP\n"
"  ;\n"
"NP0SBJ : NN\n"
"  | DT NNS\n"
"  ;\n"
"NP0QP : JJ NNS\n"
"  | CD CD\n"
"  ;\n"
"NP0PRD : CD CD NNS\n"
"  ;\n"

Everything looks OK to me during the parse. Am I doing something wrong? Thanks!

Installing on Linux Fails After Compilation with Clang

Steps to Reproduce the Problem

  1. Clone the Repository

    git clone git clone https://github.com/vnmakarov/yaep.git
  2. Compile and Install YAEP using Clang

    ./configure
    cd src
    make
    make test
    sudo make install

Expected Result

The last command should work without problems.

Actual Result

The last command fails printing the following output:

bison -y ./sgramm.y
mv y.tab.c sgramm.c
clang -c -g  -Ofast -DHAVE_CONFIG_H -I.. -I. -I. -DNDEBUG -fPIC -o C/yaep.o ../src/../src/yaep.c
make: clang: Command not found
make: *** [C/yaep.o] Error 127

System Information

OS: Ubuntu Trusty Tahr
Compiler: Clang 5.0

Additional Information

HOW CAN I RUN IT?

Hi,
I follow these steps to install your library:

/configure --srcdir= --prefix=
cd src
make
make test (optional)
make install

But, How can I run it in terminal? (I use OSX)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.