
iparse's Introduction

IParse

IParse is an interpreting parser, meaning that it reads a grammar and interprets it to parse another file. It also uses a parsing-driven scanner approach, where the parser calls the scanner to see if a certain type of scanner symbol is found at the current position in the input. A number of scanners are provided, including a raw scanner, which gives access to the raw input. The grammar allows definition of character ranges and white space terminals, thus allowing a scanner to be specified in the grammar.
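The parsing-driven scanner idea can be illustrated with a small sketch. The class `DemandScanner`, its `accept_*` functions, and the `assignment` rule are hypothetical names chosen for this illustration; they are not IParse's actual scanner interface. The point is only that the parser decides which kind of symbol to ask for, and the scanner answers for the current position.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical scanner interface; IParse's real scanner classes differ.
class DemandScanner {
public:
    explicit DemandScanner(const std::string& text) : _text(text), _pos(0) {}

    // Each accept_* function tries one symbol kind at the current position.
    // On success the symbol is consumed; on failure the position is unchanged.
    bool accept_ident(std::string& out) {
        std::size_t p = after_space();
        const std::size_t start = p;
        if (p == _text.size() || !(is_alpha(_text[p]) || _text[p] == '_')) return false;
        while (p < _text.size() && (is_alnum(_text[p]) || _text[p] == '_')) ++p;
        out = _text.substr(start, p - start);
        _pos = p;
        return true;
    }

    bool accept_number(long& out) {
        std::size_t p = after_space();
        if (p == _text.size() || !is_digit(_text[p])) return false;
        long value = 0;
        while (p < _text.size() && is_digit(_text[p])) {
            value = value * 10 + (_text[p] - '0');
            ++p;
        }
        out = value;
        _pos = p;
        return true;
    }

    bool accept_literal(const std::string& lit) {
        assert(!lit.empty());
        const std::size_t p = after_space();
        if (_text.compare(p, lit.size(), lit) != 0) return false;
        _pos = p + lit.size();
        return true;
    }

    bool at_end() const { return after_space() == _text.size(); }

private:
    static bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
    static bool is_alnum(char c) { return std::isalnum(static_cast<unsigned char>(c)) != 0; }
    static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

    // Position of the first non-space character at or after _pos.
    std::size_t after_space() const {
        std::size_t p = _pos;
        while (p < _text.size() && std::isspace(static_cast<unsigned char>(_text[p]))) ++p;
        return p;
    }

    std::string _text;
    std::size_t _pos;
};

// The parser drives the scanner: for the (hypothetical) rule
//   assignment := ident '=' number ';'
// it asks for exactly the symbol kind it needs at each step.
bool parse_assignment(DemandScanner& s, std::string& name, long& value) {
    return s.accept_ident(name) && s.accept_literal("=") &&
           s.accept_number(value) && s.accept_literal(";");
}
```

Because the scanner is only asked about the symbol the parser expects, the same input text can be scanned differently depending on the grammar rule that is active at that position.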

Several parsing algorithms are provided and can be selected from the command line. The default parsing algorithm is a back-tracking parser that uses memoization (it remembers the result of parsing a non-terminal at a given input position), resulting in good overall performance.
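As a rough illustration of back-tracking with memoization, here is a minimal sketch for a toy grammar. It is not IParse's BTParser: the class `MemoParser`, the two-rule grammar, and the memo table keyed on (rule, position) are assumptions made for this example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy grammar (hypothetical, not one of IParse's grammars):
//   expr := term '+' expr | term
//   term := digit | '(' expr ')'
const long kFail = -1;

class MemoParser {
public:
    explicit MemoParser(const std::string& input)
        : _input(input), _size(static_cast<long>(input.size())), _evaluations(0) {}

    // True when the whole input is a single expr.
    bool parse_all() { return parse(EXPR, 0) == _size; }

    // Number of (rule, position) pairs that had to be computed,
    // i.e. that were not answered from the memo table.
    unsigned long evaluations() const { return _evaluations; }

private:
    enum Rule { EXPR, TERM };
    typedef std::map<std::pair<Rule, long>, long> Memo;

    // Returns the end position of the match, or kFail.
    long parse(Rule rule, long pos) {
        assert(pos >= 0 && pos <= _size);
        const std::pair<Rule, long> key(rule, pos);
        Memo::const_iterator it = _memo.find(key);
        if (it != _memo.end()) return it->second;  // remembered result
        ++_evaluations;
        const long end = (rule == EXPR) ? parse_expr(pos) : parse_term(pos);
        _memo[key] = end;
        return end;
    }

    long parse_expr(long pos) {
        // First alternative: term '+' expr
        const long t = parse(TERM, pos);
        if (t != kFail && t < _size && _input[t] == '+') {
            const long e = parse(EXPR, t + 1);
            if (e != kFail) return e;
        }
        // Back-track to pos and try the second alternative: term.
        // The term at pos is not parsed again; it comes from the memo table.
        return parse(TERM, pos);
    }

    long parse_term(long pos) {
        if (pos >= _size) return kFail;
        const char c = _input[pos];
        if (c >= '0' && c <= '9') return pos + 1;
        if (c == '(') {
            const long e = parse(EXPR, pos + 1);
            if (e != kFail && e < _size && _input[e] == ')') return e + 1;
        }
        return kFail;
    }

    std::string _input;
    long _size;
    Memo _memo;
    unsigned long _evaluations;
};
```

Without the memo table, every back-track would re-parse the same `term` from scratch; with it, each (rule, position) pair is evaluated at most once, which is what keeps back-tracking affordable on larger inputs.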

IParse has a proven track record in many applications (including a commercial application), but some parts are still under construction, such as the LL1HeapColourParser. The ParParser, an experimental parallel parser, has poor performance.

RcTransl, a tool for language translation between Windows resource files, is also still under development.

http://www.iwriteiam.nl/MM.html

Compiling

Compiling with g++ (version 7.5) in the software folder:

g++ -fno-operator-names all_IParse.cpp -o IParse

Compiling with clang (version 6.0) in the software folder:

clang++ -fno-operator-names all_IParse.cpp -o IParse

On Windows, use Visual C++ 2008 Express Edition with the IParse.sln file.

Testing

On Linux, in the root folder:

software/IParse software/c.gr others/scan.pc -p scan_pc_output
diff scan_pc_output others/scan_pc_output
software/IParse software/c.gr others/scan.pc -unparse unparse_scan.pc
diff unparse_scan.pc others/unparse_scan.pc

The diff should not find any differences.

MarkDownC

MarkDownC is a tool for performing literate programming with MarkDown files like the ones that are supported by GitHub. The idea is that you give this program a list of MarkDown files with fragments of C code and that the program figures out how to put these fragments together in a single C file, such that the file can be compiled. (I wrote about the conception of this idea in the blog article Literate programming with Markdown.) For examples on how to use the program, see:

Issue the following commands to build the MarkDownC processor:

cd software
g++ -fno-operator-names all_IParse.cpp -o IParse
./IParse c_md.gr -o MarkDownCGrammar.cpp
g++ -g -fno-operator-names -Wall MarkDownC.cpp -o MarkDownC

Or:

cd software
clang++ -fno-operator-names all_IParse.cpp -o IParse
./IParse c_md.gr -o MarkDownCGrammar.cpp
clang++ -g -fno-operator-names -Wall MarkDownC.cpp -o MarkDownC

iparse's People

Contributors

fransfaase, mingodad

iparse's Issues

Back-tracking heap parser not working

The command IParse c.gr -BTHeap scan.pc does not parse the whole file, although it is expected to. It looks like the back-tracking heap parser is not working correctly.

Segfault when trying to parse js.gr

While reading http://www.iwriteiam.nl/MM.html I found js.gr, decided to try it, and got this output:

valgrind software/IParse software/js.gr lexer.js 
==7911== Memcheck, a memory error detector
==7911== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==7911== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==7911== Command: software/IParse software/js.gr lexer.js
==7911== 
Iparse, Version: 1.7 of February 17, 2021.
Processing: software/js.gr
Processing: lexer.js
==7911== Invalid read of size 8
==7911==    at 0x10C873: AbstractParseTreeBase::string() const (AbstractParseTree.cpp:576)
==7911==    by 0x1184DC: Grammar::make_rule(AbstractParseTree::iterator, GrammarOrRule*) (ParserGrammar.cpp:307)
==7911==    by 0x11A222: Grammar::loadGrammar(AbstractParseTree const&) (ParserGrammar.cpp:602)
==7911==    by 0x13D9B2: main (IParse.cpp:715)
==7911==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==7911== 
==7911== 
==7911== Process terminating with default action of signal 11 (SIGSEGV)
==7911==  Access not within mapped region at address 0x8
==7911==    at 0x10C873: AbstractParseTreeBase::string() const (AbstractParseTree.cpp:576)
==7911==    by 0x1184DC: Grammar::make_rule(AbstractParseTree::iterator, GrammarOrRule*) (ParserGrammar.cpp:307)
==7911==    by 0x11A222: Grammar::loadGrammar(AbstractParseTree const&) (ParserGrammar.cpp:602)
==7911==    by 0x13D9B2: main (IParse.cpp:715)
==7911==  If you believe this happened as a result of a stack
==7911==  overflow in your program's main thread (unlikely but
==7911==  possible), you can try to increase the size of the
==7911==  main thread stack using the --main-stacksize= flag.
==7911==  The main thread stack size used in this run was 8388608.
==7911== 
==7911== HEAP SUMMARY:
==7911==     in use at exit: 95,978 bytes in 2,701 blocks
==7911==   total heap usage: 3,946 allocs, 1,245 frees, 273,948 bytes allocated
==7911== 
==7911== LEAK SUMMARY:
==7911==    definitely lost: 0 bytes in 0 blocks
==7911==    indirectly lost: 0 bytes in 0 blocks
==7911==      possibly lost: 0 bytes in 0 blocks
==7911==    still reachable: 95,978 bytes in 2,701 blocks
==7911==         suppressed: 0 bytes in 0 blocks
==7911== Rerun with --leak-check=full to see details of leaked memory
==7911== 
==7911== For lists of detected and suppressed errors, rerun with: -s
==7911== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
/home/mingo/bin/myvalgrind: line 2:  7911 Segmentation fault      (core dumped) valgrind $*

lexer.js

/*
From https://eli.thegreenplace.net/2013/07/16/hand-written-lexer-in-javascript-compared-to-the-regex-based-ones
*/

'use strict';

var Lexer = exports.Lexer = function() {
  this.pos = 0;
  this.buf = null;
  this.buflen = 0;

  // Operator table, mapping operator -> token name
  this.optable = {
    '+':  'PLUS',
    '-':  'MINUS',
    '*':  'MULTIPLY',
    '.':  'PERIOD',
    '\\': 'BACKSLASH',
    ':':  'COLON',
    '%':  'PERCENT',
    '|':  'PIPE',
    '!':  'EXCLAMATION',
    '?':  'QUESTION',
    '#':  'POUND',
    '&':  'AMPERSAND',
    ';':  'SEMI',
    ',':  'COMMA',
    '(':  'L_PAREN',
    ')':  'R_PAREN',
    '<':  'L_ANG',
    '>':  'R_ANG',
    '{':  'L_BRACE',
    '}':  'R_BRACE',
    '[':  'L_BRACKET',
    ']':  'R_BRACKET',
    '=':  'EQUALS'
  };
}

// Initialize the Lexer's buffer. This resets the lexer's internal
// state and subsequent tokens will be returned starting with the
// beginning of the new buffer.
Lexer.prototype.input = function(buf) {
  this.pos = 0;
  this.buf = buf;
  this.buflen = buf.length;
}

// Get the next token from the current buffer. A token is an object with
// the following properties:
// - name: name of the pattern that this token matched (taken from rules).
// - value: actual string value of the token.
// - pos: offset in the current buffer where the token starts.
//
// If there are no more tokens in the buffer, returns null. In case of
// an error throws Error.
Lexer.prototype.token = function() {
  this._skipnontokens();
  if (this.pos >= this.buflen) {
    return null;
  }

  // The char at this.pos is part of a real token. Figure out which.
  var c = this.buf.charAt(this.pos);

  // '/' is treated specially, because it starts a comment if followed by
  // another '/'. If not followed by another '/', it's the DIVIDE
  // operator.
  if (c === '/') {
    var next_c = this.buf.charAt(this.pos + 1);
    if (next_c === '/') {
      return this._process_comment();
    } else {
      return {name: 'DIVIDE', value: '/', pos: this.pos++};
    }
  } else {
    // Look it up in the table of operators
    var op = this.optable[c];
    if (op !== undefined) {
      return {name: op, value: c, pos: this.pos++};
    } else {
      // Not an operator - so it's the beginning of another token.
      if (Lexer._isalpha(c)) {
        return this._process_identifier();
      } else if (Lexer._isdigit(c)) {
        return this._process_number();
      } else if (c === '"') {
        return this._process_quote();
      } else {
        throw Error('Token error at ' + this.pos);
      }
    }
  }
}

Lexer._isnewline = function(c) {
  return c === '\r' || c === '\n';
}

Lexer._isdigit = function(c) {
  return c >= '0' && c <= '9';
}

Lexer._isalpha = function(c) {
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
         c === '_' || c === '$';
}

Lexer._isalphanum = function(c) {
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') ||
         c === '_' || c === '$';
}

Lexer.prototype._process_number = function() {
  var endpos = this.pos + 1;
  while (endpos < this.buflen &&
         Lexer._isdigit(this.buf.charAt(endpos))) {
    endpos++;
  }

  var tok = {
    name: 'NUMBER',
    value: this.buf.substring(this.pos, endpos),
    pos: this.pos
  };
  this.pos = endpos;
  return tok;
}

Lexer.prototype._process_comment = function() {
  var endpos = this.pos + 2;
  // Skip until the end of the line
  var c = this.buf.charAt(this.pos + 2);
  while (endpos < this.buflen &&
         !Lexer._isnewline(this.buf.charAt(endpos))) {
    endpos++;
  }

  var tok = {
    name: 'COMMENT',
    value: this.buf.substring(this.pos, endpos),
    pos: this.pos
  };
  this.pos = endpos + 1;
  return tok;
}

Lexer.prototype._process_identifier = function() {
  var endpos = this.pos + 1;
  while (endpos < this.buflen &&
         Lexer._isalphanum(this.buf.charAt(endpos))) {
    endpos++;
  }

  var tok = {
    name: 'IDENTIFIER',
    value: this.buf.substring(this.pos, endpos),
    pos: this.pos
  };
  this.pos = endpos;
  return tok;
}

Lexer.prototype._process_quote = function() {
  // this.pos points at the opening quote. Find the ending quote.
  var end_index = this.buf.indexOf('"', this.pos + 1);

  if (end_index === -1) {
    throw Error('Unterminated quote at ' + this.pos);
  } else {
    var tok = {
      name: 'QUOTE',
      value: this.buf.substring(this.pos, end_index + 1),
      pos: this.pos
    };
    this.pos = end_index + 1;
    return tok;
  }
}

Lexer.prototype._skipnontokens = function() {
  while (this.pos < this.buflen) {
    var c = this.buf.charAt(this.pos);
    if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
      this.pos++;
    } else {
      break;
    }
  }
}

Comparing 'this == 0' doesn't make sense

Hello!
Compiling on Linux with warnings enabled shows several places (see below) where a non-null this is compared with 0 (NULL):

../AbstractParseTree.cpp: In member function ‘void tree_t::print(FILE*, bool)’:
../AbstractParseTree.cpp:151:5: warning: nonnull argument ‘this’ compared to NULL [-Wnonnull-compare]
  151 |     if (this == 0)
      |     ^~
../AbstractParseTree.cpp: In member function ‘void tree_t::release()’:
../AbstractParseTree.cpp:123:5: warning: nonnull argument ‘this’ compared to NULL [-Wnonnull-compare]
  123 |     if (this == 0)
      |     ^~
../AbstractParseTree.cpp: In member function ‘tree_t* tree_t::clone()’:
../AbstractParseTree.cpp:248:2: warning: nonnull argument ‘this’ compared to NULL [-Wnonnull-compare]
  248 |  if (this == 0)
      |  ^~
../AbstractParseTree.cpp: In member function ‘bool tree_cursor_t::make_private_copy()’:
../AbstractParseTree.cpp:445:2: warning: nonnull argument ‘this’ compared to NULL [-Wnonnull-compare]
  445 |  if (this == 0)
      |  ^~
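One common way to get rid of this warning is to move the null check out of the member function, so that it is done on an ordinary pointer before any member is entered. Calling a member function through a null pointer is undefined behaviour in C++, so a compiler is allowed to assume that this is never null and may drop the check altogether. The sketch below uses a hypothetical `node_t` type and a `live_nodes` counter added only for the demonstration; it is not the real `tree_t` and not necessarily how the project should fix it.

```cpp
#include <cassert>

// Counts nodes that are currently allocated (for demonstration only).
static int live_nodes = 0;

// Hypothetical stand-in for a reference-counted tree node; not the real tree_t.
struct node_t {
    int refcount;
    node_t* child;

    node_t() : refcount(1), child(nullptr) { ++live_nodes; }
    ~node_t() { --live_nodes; }

    // Null-tolerant entry point. Because it is static, it receives the
    // pointer as a normal argument and the null test is well defined.
    static void release(node_t* node) {
        if (node == nullptr) return;
        assert(node->refcount > 0);
        if (--node->refcount == 0) {
            release(node->child);  // child may be null, which is fine here
            delete node;
        }
    }
};
```

Callers then write `node_t::release(ptr)` instead of `ptr->release()`, and the `if (this == 0)` test inside the member function is no longer needed.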

Memory leaks detected by valgrind

Testing this project with valgrind reports these mismatched deallocations and memory leaks:

valgrind --leak-check=full software/IParse  software/c.gr others/scan.pc -p scan_pc_output
==9186== Memcheck, a memory error detector
==9186== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9186== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==9186== Command: software/IParse software/c.gr others/scan.pc -p scan_pc_output
==9186== 
Iparse, Version: 1.7 of February 17, 2021.
Processing: software/c.gr
==9186== Mismatched free() / delete / delete []
==9186==    at 0x4C336CE: operator delete(void*, unsigned long) (vg_replace_malloc.c:593)
==9186==    by 0x11F41C: BTParser::free_solutions() (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x11F6CA: BTParser::parse(TextFileBuffer const&, Ident, AbstractParseTree&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x13CF7E: main (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==  Address 0x5be5980 is 0 bytes inside a block of size 55,424 alloc'd
==9186==    at 0x4C32BCF: operator new[](unsigned long) (vg_replace_malloc.c:431)
==9186==    by 0x11F30D: BTParser::init_solutions() (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x11F615: BTParser::parse(TextFileBuffer const&, Ident, AbstractParseTree&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x13CF7E: main (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186== 
==9186== Mismatched free() / delete / delete []
==9186==    at 0x4C336CE: operator delete(void*, unsigned long) (vg_replace_malloc.c:593)
==9186==    by 0x13E1EF: TextFileBuffer::release() (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x13CFC5: main (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==  Address 0x5bdb830 is 0 bytes inside a block of size 6,928 alloc'd
==9186==    at 0x4C32BCF: operator new[](unsigned long) (vg_replace_malloc.c:431)
==9186==    by 0x12F3E8: PlainFileReader::read(_IO_FILE*, TextFileBuffer&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x13CB77: main (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186== 
Processing: others/scan.pc
Processing: -p
tree:

--------------
==9186== 
==9186== HEAP SUMMARY:
==9186==     in use at exit: 643,065 bytes in 19,135 blocks
==9186==   total heap usage: 44,380 allocs, 25,245 frees, 2,382,641 bytes allocated
==9186== 
==9186== 4,760 bytes in 119 blocks are definitely lost in loss record 1,517 of 1,545
==9186==    at 0x4C31E83: malloc (vg_replace_malloc.c:307)
==9186==    by 0x10B2E1: tree_t::operator new(unsigned long) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x10C8AE: AbstractParseTree::createIdent(Ident) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x13D9B6: AbstractParseTree::operator=(Ident) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x10F1FE: BasicScanner::accept_ident(TextFileBuffer&, AbstractParseTree&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x10EC2D: BasicScanner::acceptTerminal(TextFileBuffer&, Ident, AbstractParseTree&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x11BCA1: BTParser::parse_term(GrammarTerminal*, AbstractParseTree&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x11D58D: BTParser::parse_rule(GrammarRule*, ParsedValue*, Ident, AbstractParseTree&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x11C619: BTParser::parse_nt(GrammarNonTerminal*, AbstractParseTree&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x11D61C: BTParser::parse_rule(GrammarRule*, ParsedValue*, Ident, AbstractParseTree&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x11CBAA: BTParser::parse_or(GrammarOrRule*, ParsedValue*, AbstractParseTree&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186==    by 0x11EDC1: BTParser::parse_seq(GrammarRule*, char const*, AbstractParseTree, ParsedValue*, Ident, AbstractParseTree&) (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186== 
==9186== 66,470 (13,120 direct, 53,350 indirect) bytes in 2 blocks are definitely lost in loss record 1,545 of 1,545
==9186==    at 0x4C324E2: operator new(unsigned long) (vg_replace_malloc.c:342)
==9186==    by 0x13CBBB: main (in /home/mingo/dev/c/A_grammars/IParse-0/software/IParse)
==9186== 
==9186== LEAK SUMMARY:
==9186==    definitely lost: 17,880 bytes in 121 blocks
==9186==    indirectly lost: 53,350 bytes in 1,099 blocks
==9186==      possibly lost: 0 bytes in 0 blocks
==9186==    still reachable: 571,835 bytes in 17,915 blocks
==9186==         suppressed: 0 bytes in 0 blocks
==9186== Reachable blocks (those to which a pointer was found) are not shown.
==9186== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==9186== 
==9186== For lists of detected and suppressed errors, rerun with: -s
==9186== ERROR SUMMARY: 6 errors from 4 contexts (suppressed: 0 from 0)

Here are some easy fixes (other leaks still need to be tracked down):

  • software/TextFileReader.h:
@@ -7,11 +7,11 @@ class TextFileBuffer : public TextFilePos
 {
 public:
 	TextFileBuffer();
 
 	void assign(const char* str, unsigned long len, bool utf8encoded = false);
-	void release() { delete (char*)_buffer; _buffer = 0; }
+	void release() { delete [] (char*)_buffer; _buffer = 0; }
 	unsigned long length() { return _len; }
 
 	TextFileBuffer& operator=(const TextFileBuffer& lhs)
 	{
 		assign(lhs._buffer, lhs._len, lhs._utf8encoded);
  • software/BTParser.cpp:
@@ -583,11 +583,11 @@ void BTParser::free_solutions()
 		{	ParseSolution* next_sol = sol->next;
 			delete sol;
 			sol = next_sol;
 		}
   	}
-	delete _solutions;
+	delete [] _solutions;
 	_solutions = 0;
 }
 
 ParseSolution* BTParser::find_solution(unsigned long filepos, Ident nt)
 {
  • software/IParse.cpp:
@@ -756,10 +756,12 @@ int main(int argc, char *argv[])
 				{
 					parser->printExpected(stdout, filename, textBuffer);
 					return 0;
 				}
 				textBuffer.release();
+                                delete parser;
+                                delete scanner;
 
                	tree.attach(new_tree);
 
                 fclose(fin);
             }
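The first two fixes pair `new[]` with `delete[]` by hand. An alternative that avoids this class of mismatch altogether is to let `std::unique_ptr<char[]>` own the array, since it always releases with `delete[]`. The following is only a sketch under that assumption: `OwnedBuffer` is a hypothetical, simplified stand-in for a text buffer and does not reproduce the real `TextFileBuffer` interface (it requires C++11, unlike the Visual C++ 2008 build).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical, simplified buffer class; not the real TextFileBuffer.
class OwnedBuffer {
public:
    OwnedBuffer() : _len(0) {}

    // Copies len bytes from str and appends a terminating '\0'.
    void assign(const char* str, std::size_t len) {
        assert(str != nullptr || len == 0);
        std::unique_ptr<char[]> copy(new char[len + 1]);
        if (len > 0) std::memcpy(copy.get(), str, len);
        copy[len] = '\0';
        _buffer = std::move(copy);  // any previous array is freed with delete[]
        _len = len;
    }

    // Frees the array; unique_ptr<char[]> uses delete[] automatically.
    void release() {
        _buffer.reset();
        _len = 0;
    }

    std::size_t length() const { return _len; }
    const char* data() const { return _buffer.get(); }

private:
    std::unique_ptr<char[]> _buffer;
    std::size_t _len;
};
```

With this ownership model, forgetting to call `release()` no longer leaks either, because the destructor of `std::unique_ptr` frees the array when the buffer object goes out of scope.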
