phorward / libphorward

C/C++ library for dynamic data structures, regular expressions, lexical analysis & more...

License: MIT License

Shell 7.85% C 89.20% Makefile 1.49% M4 0.51% Objective-C 0.30% C++ 0.66%
linked-list hash-table dynamic-arrays lexical-analysis regular-expressions regular-expression-engine documentation-tool prototype-generator regular-expression c

libphorward's Introduction


libphorward is a generic C/C++ library and toolbox providing platform-independent utilities for a variety of purposes.

Data structures

  • parray - Dynamically managed arrays & stacks
  • pccl - Character-classes
  • plex - Lexical analysis
  • plist - Linked lists, hash-tables, queues & stacks
  • pregex - Regular expressions

Generic helpers

Command-line tools

  • pdoc - C source code documentation tool
  • pinclude - Generate big files from various smaller ones
  • plex - Lexical analyzer generator and interpreter
  • pproto - C function prototype generator
  • pregex - Regular expressions match/find/split/replace
  • ptest - C program test facilities

Documentation

Recently updated, full documentation is available online, and is also available locally after installation.

Building

Building phorward is as simple as building any GNU-style open source program. Extract the downloaded release tarball or clone the source repository into a directory of your choice.

Then, run

./configure
make
make install

And you're ready to go!

Alternative development build

There is also a simpler method for setting up a local build system for development and testing purposes.

To do so, type

make -f Makefile.gnu make_install
make

This locally compiles the library or parts of it, and is ideal for development purposes.

Stand-alone copy

The entire library including its tools can be made available in one target directory by using the script ./standalone.sh.

This makes it possible to integrate the entire library into other projects without a prior installation or porting step, and yields packages that are easier to maintain.

The generated stand-alone package contains a Makefile and can directly be built.

Credits

libphorward is developed and maintained by Jan Max Meyer, Phorward Software Technologies.

Contributions by Heavenfighter and AGS.

License

You may use, modify and distribute this software under the terms and conditions of the MIT license. The full license terms can be obtained from the file LICENSE.

Copyright (C) 2006-2021 by Phorward Software Technologies, Jan Max Meyer.

libphorward's People

Contributors

freebasic-programmer, heavenfighter, phorward


libphorward's Issues

Checking when parser is done on an infinite stream

First of all, thank you so much for open-sourcing this wonderful library. I have a custom requirement: I have an infinite stream of tokens, and the pattern I want to capture is embedded in the stream. It can start anywhere and end anywhere. I am using pp_parctx_next to update the parser context when the next token in the stream is available. However, I am clueless as to how to detect the PPPAR_STATE_DONE state after every pp_parctx_next call. Even when the valid statement tokens (generated by the grammar) have been passed, the context state is still PPPAR_STATE_NEXT, and it changes to PPPAR_STATE_DONE only when pp_parctx_next(ctx, pp_sym_get( grm, 0 ), NULL) is called. But then I cannot add any more tokens. Is there a way to detect the PPPAR_STATE_DONE state, or a way to roll back the parser context?

Superfluous states in dfa

Using the following patterns (scanner defined as a plex ptr), I get a dfa that consists of more states than expected. (Note: the $ prefix keeps me from having to use \ escapes, as $ keeps the FreeBASIC parser from interpreting escape sequences inside the string.)

plex_define(scanner,@$"as[ \t]*",1,  PREGEX_COMP_NOANCHORS | PREGEX_COMP_NOREF)
plex_define(scanner,@$"dim[ \t]*",2,  PREGEX_COMP_NOANCHORS | PREGEX_COMP_NOREF)
plex_define(scanner,@$"if[ \t]*",3,  PREGEX_COMP_NOANCHORS | PREGEX_COMP_NOREF)
plex_define(scanner,@$"then[ \t]*",4,  PREGEX_COMP_NOANCHORS | PREGEX_COMP_NOREF)
plex_define(scanner,@$"[a-zA-Z][a-zA-Z0-9_]*[ \t]*",5,  PREGEX_COMP_NOANCHORS | PREGEX_COMP_NOREF)
plex_define(scanner,"_[a-zA-Z0-9_]+[ \t]*",6,  PREGEX_COMP_NOANCHORS | PREGEX_COMP_NOREF)

The scanner resulting from compiling the patterns should accept either
any of the words as dim if then (followed by an arbitrary amount of whitespace)
or
an identifier (followed by an arbitrary amount of whitespace).

After calling plex_prepare the resulting dfa looks like this (printed in a 'human readable format')

0 (row size = 35, ID = 0, FLAGS = 0, REF = 0, DEF = 26,"_",1,"a",2,"d",3,"i",4,"t",5,"A","Z",6,"b","c",6,"e","h",6,"j","s",6,"u","z",6)
1 (row size = 17, ID = 0, FLAGS = 0, REF = 0, DEF = 26,"0","9",7,"A","Z",7,"_",7,"a","z",7)
2 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"s",8,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","r",10,"t","z",10)
3 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"i",11,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","h",10,"j","z",10)
4 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"f",12,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","e",10,"g","z",10)
5 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"h",13,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","g",10,"i","z",10)
6 (row size = 23, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","z",10)
7 (row size = 23, ID = 6, FLAGS = 0, REF = 0, DEF = 26,"TAB",14,"SPACE",14,"0","9",7,"A","Z",7,"_",7,"a","z",7)
8 (row size = 23, ID = 1, FLAGS = 0, REF = 0, DEF = 26,"TAB",15,"SPACE",15,"0","9",10,"A","Z",10,"_",10,"a","z",10)
9 (row size = 11, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"TAB",9,"SPACE",9)
10 (row size = 23, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","z",10)
11 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"m",16,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","l",10,"n","z",10)
12 (row size = 23, ID = 3, FLAGS = 0, REF = 0, DEF = 26,"TAB",17,"SPACE",17,"0","9",10,"A","Z",10,"_",10,"a","z",10)
13 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"e",18,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","d",10,"f","z",10)
14 (row size = 11, ID = 6, FLAGS = 0, REF = 0, DEF = 26,"TAB",14,"SPACE",14)
15 (row size = 11, ID = 1, FLAGS = 0, REF = 0, DEF = 26,"TAB",19,"SPACE",19)
16 (row size = 23, ID = 2, FLAGS = 0, REF = 0, DEF = 26,"TAB",20,"SPACE",20,"0","9",10,"A","Z",10,"_",10,"a","z",10)
17 (row size = 11, ID = 3, FLAGS = 0, REF = 0, DEF = 26,"TAB",21,"SPACE",21)
18 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"n",22,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","m",10,"o","z",10)
19 (row size = 11, ID = 1, FLAGS = 0, REF = 0, DEF = 26,"TAB",15,"SPACE",15)
20 (row size = 11, ID = 2, FLAGS = 0, REF = 0, DEF = 26,"TAB",23,"SPACE",23)
21 (row size = 11, ID = 3, FLAGS = 0, REF = 0, DEF = 26,"TAB",17,"SPACE",17)
22 (row size = 23, ID = 4, FLAGS = 0, REF = 0, DEF = 26,"TAB",24,"SPACE",24,"0","9",10,"A","Z",10,"_",10,"a","z",10)
23 (row size = 11, ID = 2, FLAGS = 0, REF = 0, DEF = 26,"TAB",20,"SPACE",20)
24 (row size = 11, ID = 4, FLAGS = 0, REF = 0, DEF = 26,"TAB",25,"SPACE",25)
25 (row size = 11, ID = 4, FLAGS = 0, REF = 0, DEF = 26,"TAB",24,"SPACE",24)

Format of the above: first number on line is the state number, the rest is the content of the row for a given state (the transitions start after DEF = 26). There are no default transitions and flags/references are both 0 (partly due to the use of PREGEX_COMP_NOANCHORS | PREGEX_COMP_NOREF). The word "TAB" denotes the character \t and the word "SPACE" denotes the single space character.

The part of the dfa that seems 'wrong' is restricted to the states where the scanner has found one of the words as dim if then. After scanning one of those words the next character decides how matching will continue. If the next character in the input is [ \t] then the word is found and the scanner continues consuming the trailing [ \t].
If the next character in the input is [a-zA-Z0-9_] then the word was only a prefix of an identifier and the scanner continues scanning for an identifier.

States 15, 17, 19, 20, 21, 23, 24 and 25 are the ones that make me go 'hmmm...'.

15 -> [ \t] -> 19  [id = 1]
19 -> [ \t] -> 15  [id = 1]  
17 -> [ \t] -> 21  [id = 3]
21 -> [ \t] -> 17  [id = 3]
20 -> [ \t] -> 23  [id = 2]
23 -> [ \t] -> 20  [id = 2]  
24 -> [ \t] -> 25  [id = 4]
25 -> [ \t] -> 24  [id = 4]

The matching paths for as dim if and then look like this (last entry on the line is (ID) where ID is the ID for the word that was recognized)

as:  0 a 2 s 8             [ \t] 15 (1)
dim: 0 d 3 i 11 m 16       [ \t] 20 (2)
if:   0 i 4 f 12           [ \t] 17 (3)
then: 0 t 5 h 13 e 18 n 22 [ \t] 24 (4)

The reason why I found the dfa somewhat strange is the fact that state 9 and state 14 behave as expected. Those states are entered after the scanner has matched an identifier and found a whitespace character.

In state 9 the only possible transition is on [ \t]. Since the id does not change, the src and dst of the only transition in state 9 are both state 9.
In state 14 the only possible transition is on [ \t]. Since the id does not change, the src and dst of the only transition in state 14 are both state 14.

Unless I am getting something wrong here, states 15, 17, 20 and 24 should behave like states 9 and 14: the src and dst of the transition in those states should equal the number of the state itself.
So you'd get

15 -> [ \t] -> 15 (1)
17 -> [ \t] -> 17 (3)
20 -> [ \t] -> 20 (2)
24 -> [ \t] -> 24 (4)

Given the above transitions, states 19, 21, 23 and 25 seem superfluous.
The routine that utilizes the dfa (plex_lex) does work as expected, so no problems there.
But the dfa produced by plex_prepare is bigger than expected (it looks 'suspicious').

States 9 and 14 are the ones that make me think that there might be an issue with the dfa creation algorithm.

plist_subsort() always crashes on Termux

Inside a Termux system on an Android device, the function plist_subsort() always crashes when run without valgrind; the crash involves invalid and uninitialized pointers.

The function should also generally be refactored.

Translating DFA (plex) to graphviz format (.dot)

I've finished the C version of the routine that, given a plex-created DFA, creates a file in graphviz format representing the state machine. The code can be found at the end of this comment.

I already posted the code in another issue, but that issue got closed (apparently I closed it myself, though I have no clue how). If you saw my previous post then you can forget about this one, as it is a repeat.

I finished the translation of the C code to FreeBASIC (UniCC). I should get the FreeBASIC template code working in the upcoming week.

#include <wctype.h> /* for iswprint(); may already be pulled in via phorward.h */
#include "phorward.h"

#define SIZE (lex->trans[i][0])
#define FROM (lex->trans[i][j])
#define TO   (lex->trans[i][j+1])
#define DST  (lex->trans[i][j+2])
#define MATCH_FLAGS (lex->trans[i][2])
#define REF_FLAGS (lex->trans[i][3])
#define FINAL_STATE (lex->trans[i][1] > 0)
#define ID (lex->trans[i][1])
#define DEFAULT_TRANSITION (lex->trans[i][4] != lex->trans_cnt)

void write_edge(FILE* transitions, wchar_t from, wchar_t to);
int write_dfa(plex *lex, char* filename);

void write_edge(FILE* transitions, wchar_t from, wchar_t to)
{  
  if (iswprint(from)) 
    fprintf(transitions,"&#x%x;",from);
  else
    fprintf(transitions,"0x%x",from);
  if (to != from)
  {
    if (iswprint(to))
      fprintf(transitions," - &#x%x;",to);
    else
      fprintf(transitions," - 0x%x",to);
  };
};

int write_dfa(plex *lex, char* filename)
{
  int i,j;
  FILE *transitions;
  int dst;
  
  if (strlen(filename) == 0) 
    return -1;
  transitions = fopen(filename,"w+b");
  if (transitions == 0)
    return -1;
    
  /* write start of graph */
  fprintf(transitions,"digraph {\n");
  fprintf(transitions,"  rankdir=LR;\n");
  fprintf(transitions,"  node [shape = circle];\n");

  for (i = 0;i < lex->trans_cnt;i++)
  {
    /* size = lex->trans[i][0]; */
    fprintf(transitions,"  n%d [",i);
    /* change shape and label of node if state is a final state */
    if (FINAL_STATE)
      fprintf(transitions,"shape=doublecircle,");
    fprintf(transitions,"label = \" n%d\\nmatch_flags = %d\\nref_flags = %d\\n",i,MATCH_FLAGS,REF_FLAGS);
    if (FINAL_STATE)
      fprintf(transitions,"id = %d\\n", ID);
    fprintf(transitions,"\"];\n");
    /* default transition */
    if (DEFAULT_TRANSITION) 
      fprintf(transitions,"  n%d -> n%d [style=bold];\n",i,lex->trans[i][4]);
    /* a state with size < 5 is a final state: it has no outgoing transitions */
    if (SIZE > 5)
    {
      j = 5;
      while (1)
      {
        fprintf(transitions,"  n%d -> n%d [label = <",i,DST);
        write_edge(transitions,FROM,TO);
        dst = lex->trans[i][j+2];
        j += 3;
        while (1) 
        {
          /*  no more transitions to write */
          if (j >= SIZE)
          {
            fprintf(transitions,">];\n");
            break;
          };
          if (lex->trans[i][j+2] == dst)
          {
            fprintf(transitions,"<br/>");
            write_edge(transitions,FROM,TO);            
            j += 3;
            continue;
          };
          /* no more transitions to write */
          fprintf(transitions,">];\n");
          break;
        };
        if (j >= SIZE)
          break;
      };
    };
  };
  fprintf(transitions,"}\n");
  fclose(transitions);
  return 0;
};

int main (int argc, char** argv)
{
  int i;
  plex *s;
  pregex_ptn *err_; 
  int err;

  enum
  {
    ID1 = 1,
    ID2 = 2,
    ID3 = 3,
    ID4 = 4,
    ID5 = 5,
    ID6 = 6,
    ID7 = 7
  };
  
  struct 
  {
    char* pat;
    int id;
    int flags;
  } patterns[] =  
  {
    {"wstring",ID1,PREGEX_COMP_STATIC},
    {"zstring",ID2,PREGEX_COMP_STATIC},
    {"string",ID3,PREGEX_COMP_STATIC},
    {"wstr",ID4,PREGEX_COMP_STATIC},
    {"str",ID5,PREGEX_COMP_STATIC},
    {"bling",ID6,PREGEX_COMP_STATIC},
    {"wing",ID7,PREGEX_COMP_STATIC},
    {"keywords.dot",0,-1},
    {"_[a-zA-Z0-9_]+",ID1,0},
    {"[a-zA-Z][a-zA-Z0-9_]*",ID1,0},        
    {"identifier.dot",0,-1},
    {"abc.*def.*ghi",ID1,0},
    {"large.dot",0,0},
    {"",-1,-1}
  };
  
  i = 0;
  while (1)
  {
    s =  plex_create(0);
    if (s == 0) 
    {
      fprintf(stderr,"Could not create lexical analyzer, exiting");
      return -1;
    };
    while (patterns[i].id != 0)  
    {
      err_ = plex_define(s,patterns[i].pat,patterns[i].id,patterns[i].flags);
      if (err_ == 0)
      {
        fprintf(stderr,"Problem while creating pattern, exiting");
        plex_free(s);
        return -1;
      };
      i++;
    };
    err = plex_prepare(s);  
    if (err == 0)
    {
      fprintf(stderr,"Could not generate dfa");
      plex_free(s);
      return -1;
    };
    err = write_dfa(s,patterns[i].pat);
    if (err)
    {
      fprintf(stderr,"Could not create file %s",patterns[i].pat);
      plex_free(s);
      return -1;
    };      
    plex_free(s);
    i++;
    if (patterns[i].id == -1 && patterns[i].flags == -1)
      return 0;    
  };
};

Problems with json parser


I have written a simple json parser to test the pbnf parser, and it works just fine. Sometimes.
If I run the parser numerous times from the command line, it will either work (and produce
a parse tree) or it will crash. The message I get when it crashes (the message is FreeBASIC specific):

Aborting due to runtime error 12 ("segmentation violation" signal) in json.bas::PARSE_JSON()

If I run the program within the context of an editor (Programmer's Notepad) I get different results.

I can get the same message as above or I get multiple lines that read

base/list.c, 515: Function called with wrong or incomplete parameters, fix your call!

The message ends with the line

base/list.c, 516: !CORE!

Aborting due to runtime error 12 ("segmentation violation" signal) in json.bas::PARSE_JSON()

And sometimes the parser works (the ast gets created) and I get the message

base/list.c, 43: Function called with wrong or incomplete parameters, fix your call!
base/list.c, 43: Function called with wrong or incomplete parameters, fix your call!

If I get the above message I do not get a segmentation violation.

The parser sometimes works and sometimes crashes. And the messages I am getting seem to be related
to the list implementation. I'd say the problem is pointer related?

The problem could also be with the version of mingw I am using (using mingw 3.4.5 with options -gdwarf-2 and -g3).
Or with the grammar I have written. I am using FB 1.0.5 (compilation flags: -g -v -gen gas -w pedantic -exx -maxerr inf -R). This is the BASIC code I wrote to implement the json parser

  #include "file.bi"
  #include "crt/stdio.bi"
  
  #inclib "phorward4"
  
  enum runtime_errors
    no_error = 0
    illegal_function_call= 1
    file_not_found_signal
    file_io_error
    out_of_memory
    illegal_resume
    out_of_bounds_array_access
    null_pointer_access
    no_privileges
    interrupted_signal
    illegal_instruction_signal
    floating_point_error_signal
    segmentation_violation_signal
    termination_request_signal
    abnormal_termination_signal
    quit_request_signal
    return_without_gosub
    end_of_file 
  end enum

  extern "c"
    type ppgram as any
    type plex as any
    type pppar as any
    type ppast as any
    declare function pp_gram_from_pbnf( byval g as ppgram ptr, byval src as zstring ptr) as byte
    declare function pp_gram_create() as ppgram ptr
    declare function pp_par_create( byval g as ppgram ptr) as pppar ptr
    declare function pp_par_parse( byval root as ppast ptr, byval par as pppar ptr, byval start as zstring ptr) as byte
    declare function pp_par_autolex(byval par as pppar ptr) as integer
    declare sub pp_ast_dump( byval stream as FILE ptr, byval ast as ppast ptr)
    declare sub pp_ast_dump_short( byval stream as FILE ptr, byval ast as ppast ptr)
    declare function pp_ast_free(byval node as ppast ptr) as ppast ptr
    declare function pp_gram_free( byval g as ppgram ptr) as ppgram ptr
    declare function pp_par_free( byval p as pppar ptr) as pppar ptr
  end extern


  function parse_json(byval json_text as zstring ptr) as integer
    dim lexer as string = _
      $"%skip /[\s]+/;"_
      $"NUMBER : /((\+|\-)?[\d]+)((\.[\d]+([Ee](\+|\-)?[0-9]+)?)|([Ee](\+|\-)[\d]+))?/ = nnumber;"_
      $"NULL : /null/=nil;"_
      $"TRUE : /true/=true ;"_
      $"FALSE : /false/=false ;"_
      $"SCONST : /""((\x5c(\x5c|a|""|b|f|n|r|t|u[a-fA-F0-9][a-fA-F0-9][a-fA-F0-9][a-fA-F0-9]|v))|[^\x5c""])*""/=sconst;"_      
      $"CCONST : /'((\x5c(\x5c|a|'|b|f|n|r|t|v))|[^\x5c'])*'/=cconst;"_      
      $"ID    : /[a-zA-Z_][A-Za-z0-9_]*/ =id;"_
      $"COLON : "":"";"_
      $"COMMA : "","";"_
      $"LC    : ""{"";"_
      $"RC    : ""}"";"_  
      $"LB    : ""["";"_
      $"RB    : ""]"";"

      
    dim parser as string = _
     $"json$ : value;"_
     $"value : object =nobject| array =narray| ID | NUMBER | NULL | TRUE | FALSE | CCONST | SCONST;"_
     $"object : LC RC | LC pair_list RC ;"_
     $"pair_list : pair | pair_list COMMA pair;"_
     $"key : SCONST | CCONST | ID ;"_
     $"pair : key COLON value = pair;"_
     $"array : LB RB | LB value_list RB ;"_
     $"value_list : value | value_list COMMA value;"


    dim grm as ppgram ptr
    grm = pp_gram_create()
    dim s as string 
    s = lexer & parser
    var result = pp_gram_from_pbnf( grm,strptr(s))
    if (result = 0) then 
      print "Creating grammar failed"
      error illegal_function_call
    end if
    dim root as ppast ptr
    dim par as pppar ptr
    par = pp_par_create(grm)
    if (par = 0) then
      print "Creating parser failed"
      error illegal_function_call
    end if
    var result2 = pp_par_autolex(par)
    if result2 = 0 then
      error illegal_instruction_signal
    end if
    var v1 = timer()
    var result3 = pp_par_parse( @root, par, json_text)
    var v2 = timer()    
    if (result3 = 0) then
      error illegal_instruction_signal
    end if
    print using "Timing : ####.#####";v2 - v1
    dim fresult as FILE ptr
    fresult = fopen("ast_tree","w+b")
    if (fresult = 0) then
      print "Could not write ast"
      error file_io_error
    end if
    pp_ast_dump_short(fresult,root)
    fclose(fresult)
    pp_ast_free(root)
    pp_par_free(par)
    pp_gram_free(grm)
  return 0
end function

sub get_json(byref filename as string, byval flag as ubyte)
  if flag then    
    if (fileexists(filename) = 0) then
      print "Could not find file ";filename
      error file_not_found_signal
    end if
    dim fh as integer = freefile()
    if (fh = 0) then
      print "Could not create file descriptor"
      error file_io_error
    end if
    open filename for binary access read as #fh
    var err_ = err()
    if (err_ <> 0) then
      print "Could not open file ";filename
      error file_io_error
    end if  
    dim json as ubyte ptr = callocate(lof(fh)+1,sizeof(ubyte))
    if (json = 0) then
      error out_of_memory
    end if
    get #fh,,*json,lof(fh)
    err_ = err()
    if (err_ <> 0) then
      error err_
    end if
    json[lof(fh)] = asc(!"\0")
    close #fh
    parse_json(cast(zstring ptr,json))
    deallocate(json)
  else
    parse_json(strptr(filename))
  end if
   
end sub

get_json("generated.json",1)

Writing the parser was fun. If anything can be done to keep the compiled parser from crashing then that would be nice. Version of libphorward used: 0.22.2 (using json data generated by https://www.json-generator.com)

Compiler warnings (mingw)

When compiling the code (version 0.20.0) using msys/mingw(32 bit) on windows 7 (64 bit pro) I get the following warnings:

string/convert.c: In function `pdbl_to_wcs':
string/convert.c:164: warning: passing arg 2 of `swprintf' makes pointer from integer without a cast

string/string.c: In function `pwcscatchar':
string/string.c:884: warning: passing arg 2 of `swprintf' makes pointer from integer without a cast

string/utf8.c:33: warning: large integer implicitly truncated to unsigned type
string/utf8.c:34: warning: large integer implicitly truncated to unsigned type
string/utf8.c:34: warning: large integer implicitly truncated to unsigned type
string/utf8.c:35: warning: large integer implicitly truncated to unsigned type
string/utf8.c: In function `u8_toutf8':
string/utf8.c:235: warning: comparison is always true due to limited range of data type
string/utf8.c:242: warning: comparison is always true due to limited range of data type
string/utf8.c: In function `u8_wc_toutf8':
string/utf8.c:268: warning: comparison is always true due to limited range of data type
string/utf8.c:274: warning: comparison is always true due to limited range of data type
string/utf8.c: In function `u8_escape_wchar':
string/utf8.c:468: warning: comparison is always false due to limited range of data type
string/utf8.c:470: warning: comparison is always true due to limited range of data type

libtool: warning: undefined symbols not allowed in i686-pc-mingw32 shared libraries; building static only

The swprintf problem can be resolved by adding the following snippet to both string.c and convert.c

#if _WIN32
#define swprintf _snwprintf
#endif

The utf8.c related warnings have to do with the size of wchar_t on windows (unsigned 16 bit integer). On Linux wchar_t is defined as a 32 bit integer which means that utf8.c will most likely compile without a problem on Linux.

The author you got the utf8 code from has this to say about his utf8 library:

I now use and recommend utf8proc instead of this library.

libphorward documentation

I forked the libphorward project and made (a lot of) changes to the documentation files (.t2t). And... that turned out to be a waste of time: the documentation is generated from the comments found in the C source code, and the .t2t files get rewritten every time they are regenerated.

I figured it would be better to correct the typos and then issue a pull request. Less work for you (otherwise I'd post an issue with all the typos and you'd have to correct all the files yourself). Of course, I should have made the corrections in the right place (i.e. not in the .t2t files).

Am I correct in thinking that in order to correct typos in the documentation of libphorward I have to change the comments as found in the C source code? And when I am satisfied with the changes I made I can issue a pull request. Leaving it up to you to decide whether or not to accept the pull request.

I looked at the way parsing (lalr1) is performed in libphorward. And I have something of a feature request.

Would it be possible to have an external scanner call the parser ('push parsing')? The interface to the libphorward generated parser would consist of a single function that should get called by the scanner for every token that the parser should parse.

Having an external scanner call the parser could make it a lot easier to parse input from anything besides a static string (eg stream-ish style parsing). Also the parser could be part of a 'larger' parser that would decide what subparser would get fed a certain token. It would be great to use x subparsers to parse some 'exotic' language (for example C++ :)).

Parsing one token using push parsing could look something like this.

token = get_token(scanner)
parse(parse_state,token)
if (parse_state->error_) then
  ''error message
end if

I am assuming the parser needs a place (parse_state) to keep track of parsing state. The numbering scheme for tokens could be straightforward: the first token to appear in the grammar would get assigned a possibly user-defined number (or 1, or something else), and subsequent tokens appearing in the grammar would get numbered upward (or downward) from that first, possibly user-defined number.

parse_state could be used to store other, user-defined information (pppar could be extended to contain some user defined data in the shape of an any ptr or a plist or...?).

It would also be nice if the user could specify a routine that gets executed at given points during the parse. With some restrictions (not every position in the grammar might be appropriate).

Specifying where the routine should get called could be done by inserting special symbols in the grammar.

Whenever the parser reaches a point during the parse that corresponds with the position of the symbol the user defined routine gets executed.

An example.

struct_type: typedef struct s_identifier '{' struct_declaration_list'}' init_suffix;
init_suffix : init_declarator_list | ';';
s_identifier : identifier ;#

The user defined routine would need some type of info (context) so it can figure out what to do. A unique name (or number) should be added to every #

struct_type: typedef struct s_identifier '{' struct_declaration_list'}' init_suffix;
init_suffix : init_declarator_list | ';';
s_identifier : identifier ;#s_identifier

It's up to the user to come up with the names or the numbers after the #. The parser will simply call the user defined routine passing the name or number as an argument.

That's it for now. It's a shame I misunderstood the way documentation for libphorward gets generated. Otherwise I could have contributed something to libphorward.

Typos in files

I found typos in a bunch of files. Most (if not all) of the typos have to do with the use of the word trough (=> through).

https://github.com/phorward/phorward/blob/fc7c7bcffd1d2737627c93043aae77e317447daf/src/base/array.h#L30
trough => through

The whole construct can be rewritten (iterating through and iterating are equivalent).

Macro that expands into a for-loop iterating a parray-object

Same thing in
https://github.com/phorward/phorward/blob/fc7c7bcffd1d2737627c93043aae77e317447daf/src/base/list.h#L67

https://github.com/phorward/phorward/blob/91eaf8e1f44bcc013d3f4c57a20ca736542cfbe8/src/regex/misc.c#L55
trough => through

https://github.com/phorward/phorward/blob/91eaf8e1f44bcc013d3f4c57a20ca736542cfbe8/src/parse/gram.c#L370
trough => through

https://github.com/phorward/phorward/blob/91eaf8e1f44bcc013d3f4c57a20ca736542cfbe8/src/parse/gram.c#L523
trough => through

https://github.com/phorward/phorward/blob/91eaf8e1f44bcc013d3f4c57a20ca736542cfbe8/src/any/any.c#L115
trough => through

https://github.com/phorward/phorward/blob/712f9b21d8b7bd733783b44254c5198e1ecb4dc4/src/regex/nfa.c#L263
https://github.com/phorward/phorward/blob/712f9b21d8b7bd733783b44254c5198e1ecb4dc4/src/regex/nfa.c#L325

trough => through

https://github.com/phorward/phorward/blob/e709b9e6901257d04bd9a212e8a0175499781512/doc/phorward.t2t#L191

trough => throughout and with a bit of a rewrite you get

These functions are used throughout libphorward's internal object mapping functions.

https://github.com/phorward/phorward/blob/712f9b21d8b7bd733783b44254c5198e1ecb4dc4/src/regex/ptn.c#L510

Again trough => through. After a rewrite you get

/* Iterate ccl... */

Iterate implies 'accessing every element in'. The above could be read as 'access every element in ccl' which is what you want to convey.

https://github.com/phorward/phorward/blob/e709b9e6901257d04bd9a212e8a0175499781512/doc/ref.t2t#L121
https://github.com/phorward/phorward/blob/e709b9e6901257d04bd9a212e8a0175499781512/doc/ref.t2t#L156

In both of the above cases you could use

Macro that expands into a for-loop iterating a 

https://github.com/phorward/phorward/blob/e709b9e6901257d04bd9a212e8a0175499781512/doc/phorward.html#L1005

rewrite (same as before)

These functions are used throughout libphorward's internal object mapping functions.

https://github.com/phorward/phorward/blob/e709b9e6901257d04bd9a212e8a0175499781512/doc/phorward.html#L2684

rewrite (same as before)

Macro that expands into a for-loop iterating a parray-object

I still have to test your bug-fix (plex_tokenize). I looked at the fix and thought "looks good to me" and then forgot to download and test it. I'd be surprised if there were a problem with the fix, but I will download and test it regardless. I tried plex_tokenize in the first place because I wanted to see whether calling plex_tokenize once would result in faster lexing than calling plex_lex for each and every token.

Infinite loop in plex_tokenize

The function plex_tokenize contains an infinite loop. The infinite loop looks like this

while( start )
	{
		if( ( start = plex_next( lex, start, &id, &end ) ) && matches )
		{
			if( ! *matches )
				*matches = parray_create( sizeof( prange ), 0 );

			r = (prange*)parray_malloc( *matches );
			r->id = id;
			r->begin = start;
			r->end = end;
		}

		start = end;
		count++;
	}

At end-of-input plex_next will return 0x0 and start becomes 0x0. But the following statement

               start = end;

assigns a value to start that is never 0x0, which means start never becomes 0 and the loop never finishes.

When there is no more input *end is 0 but end itself contains the address of a character (which happens to be \0). *end is 0 but end is not.

I found the error while testing a scanner for the C programming language. Using the function plex_lex I was able to tokenize a file from the phorward library (pregex.c) without a problem. But when I tried to do the same using plex_tokenize I ran into the issue described above.

It's great fun playing around with plex. Thanks for creating the Phorward Toolkit.

Aside: I found a typo in the latest unicc documentation (it's in the index):

2.6. Implementing a comiler..................................................................................................................26

comiler => compiler

I found some more typos in the unicc manual. I'll read the entire manual and then post all the typos I find at https://github.com/phorward/unicc

plist hashing function with fewer collisions

Issue created from a comment in #10 by @FreeBASIC-programmer:

I tried an alternative to the current hash function used by libphorward. When hashing a list of roughly 255 keywords, the number of collisions produced by the current hash function is 71. Using the proposed hashing scheme, the number of collisions is 47 (size of the hash table: 769).

The new hash function is about as trivial as the old one (assuming key is a zero-terminated string):

while( *key )
{
	hashval += ( hashval << 1 ) + *key;
	key++;
}

The above function gives better results (fewer collisions) than the current hash function. Whether a decrease in collisions influences runtime performance in a big way remains to be seen, but fewer collisions are a good thing.

plex_lex calling plex_prepare without using value returned by plex_prepare

I am experimenting with an alternative internal format for the dfa and found two lines of code in plex_lex that looked suspicious.
Before calling plex_lex, a user is supposed to call plex_prepare. Failing to do so leads to the execution of the following lines of code (line 251, plex_lex):

if( !lex->trans_cnt )
  plex_prepare( lex );

After the call to plex_prepare, scanning commences as if plex_prepare had succeeded in creating a dfa.
But what if it fails? There will be a problem when plex_lex tries to use lex->trans if trans was not created (which is the case when plex_prepare returns FALSE).

I found exactly the same lines of code as the ones shown above in plex_next. So if there is a problem with plex_lex then there is a problem with plex_next as well.

I hope I have found a bug. If not then I humbly apologize for misinterpreting your code.

User-defined actions during parsing

Originally from issue #6, by @FreeBASIC-programmer

It would also be nice if the user could specify a routine that gets executed at given points during the parse, with some restrictions (not every position in the grammar might be appropriate).

Specifying where the routine should get called could be done by inserting special symbols in the grammar.

Whenever the parser reaches a point during the parse that corresponds to the position of the symbol, the user-defined routine gets executed.

An example.

struct_type: typedef struct s_identifier '{' struct_declaration_list'}' init_suffix;
init_suffix : init_declarator_list | ';';
s_identifier : identifier ;#

The user-defined routine would need some type of info (context) so it can figure out what to do. A unique name (or number) should be added to every #:

struct_type: typedef struct s_identifier '{' struct_declaration_list'}' init_suffix;
init_suffix : init_declarator_list | ';';
s_identifier : identifier ;#s_identifier

It's up to the user to come up with the names or numbers after the #. The parser will simply call the user-defined routine, passing the name or number as an argument.

Push parsing

Originally from issue #6, by @FreeBASIC-programmer

I looked at the way parsing (lalr1) is performed in libphorward. And I have something of a feature request.

Would it be possible to have an external scanner call the parser ('push parsing')? The interface to the libphorward generated parser would consist of a single function that should get called by the scanner for every token that the parser should parse.

Having an external scanner call the parser could make it a lot easier to parse input from anything besides a static string (eg stream-ish style parsing). Also the parser could be part of a 'larger' parser that would decide what subparser would get fed a certain token. It would be great to use x subparsers to parse some 'exotic' language (for example C++ :)).

Parsing one token using push parsing could look something like this.

token = get_token(scanner)
parse(parse_state,token)
if (parse_state->error_) then
  ''error message
end if

I am assuming the parser needs a place (parse_state) to keep track of parsing state. The numbering scheme for tokens could be straightforward: the first token to appear in the grammar would get assigned a possibly user-defined number (or 1, or something else). Subsequent tokens appearing in the grammar would get numbered upward (or downward) from that first, possibly user-defined number.

parse_state could be used to store other, user-defined information (pppar could be extended to contain some user-defined data in the shape of an any ptr, or a plist, or ...?).

Unburden libphorward

Several personal decisions have been taken to unburden libphorward and make it what it was primarily meant to be: a useful, lightweight C library with tools for (mostly) general-purpose software development.

Further specialization on parsing, and probably lexing, will be continued in unicc as a universal parser generator and framework.

The following list displays the current feature set of libphorward. It shall be reduced to the items displayed in bold. Anything else shall be moved to unicc as its place to remain, or into new projects, which may or may not have a future (e.g. the virtual machine feature).

  • Parser development tools
    • ppgram for grammar definition
    • pppar provides a modular LALR(1) parser generator
    • ppast is a representation of a browsable abstract syntax tree (AST)
  • Lexer development tools
    • regular expressions and pattern definition interface
    • plex provides a lexical analyzer
    • pregex for definition and execution of regular expressions
    • pccl for unicode-enabled character classes
    • tools for regex and lexer deployment
    • string functions for regular expression match, split and replace
  • Runtime evaluation tools
    • construction of dynamic intermediate languages and interpreters
    • pany is a data object for handling different data-types in one object
    • pvm for defining stack-based virtual machine instruction sets
  • Dynamic data structures
    • plist for linked lists with built-in hash table support,
    • parray for arrays and stacks.
  • Extended string management functions
    • concat, extend, tokenize and short-hand allocation of strings and wide-character strings
    • consistent byte- and wide-character (unicode) function support
    • unicode support for UTF-8 in byte-character functions
  • Universal system-specific functions for platform-independent C software development
    • Unix-style command-line parser
    • Mapping files to strings
  • Debug and trace facilities
  • Consistent object-oriented build-up of all function interfaces (e.g. plist, parray, pregex, pparse, ...)

Currently, UniCC v1.5 is released and depends only on those parts of libphorward that are not bolded above. So, logically, it's most convenient to continue the parser-related developments in UniCC v2 and reduce libphorward's feature list to more general usage.

Then it will be time for v1.0, finally.

Example (C scanner) and a question

I wanted to add something to a previous issue, but it was already closed.
In that issue I mentioned an example I created (a C scanner), and you asked whether I could post the code.

I did some more testing using the C scanner and it failed on several inputs. Apart from failing on certain inputs, the scanner is written in FreeBASIC, which makes it less usable as an example that could be included in libphorward. I am willing and able to translate the example scanner to C once I get it to scan most C code (I have to do some serious testing before I can claim to have created anything close to a scanner that can tokenize C code).

The question relates to what to do when I find typos in some part of libphorward (either the manual or the source code). To me a typo is not really an 'issue', not something I'd want to create an issue for.
But minimizing the number of typos in manuals and source code seems like a good thing.

Is there some way I can notify you of typos without having to create an issue? I'd prefer to create issues only for 'real' bugs and not for typos.

Increasing scanner performance

I wanted to speed up a plex-generated scanner and found that using the option PREGEX_RUN_UCHAR made a huge difference. If you do not use PREGEX_RUN_UCHAR, plex will perform the following actions for every byte in the input:

  ch = u8_char( ptr );
  ptr += u8_seqlen( ptr );

After using the option PREGEX_RUN_UCHAR, I figured it would be possible to take the plex-generated dfa and turn it into a hard-coded scanner. Maybe that would lead to an even bigger increase in performance.
So I wrote a program that, given a dfa:
--> writes one label (sometimes two labels) for every state in the dfa;
--> writes case statements to represent transitions;
--> writes goto statements to perform transitions.

BASIC syntax allows representing case statements in a fairly straightforward manner.
An example: suppose the valid range of transitions is a to z. The BASIC code looks like this (foo denotes the label of the state, bar is the destination label):

foo:
  idx += 1
  select case as const input_string[idx]
  case asc("a") to asc("z"): goto bar

In case there is a transition from state x to state x and state x is a final state, two labels get created: one used as jump target for internal transitions and one used as jump target for external transitions. The difference between the two is that at the first label (the one used as target for external transitions) the id is set.
At the second label the id does not get set, as there is no need to set the id again and again due to internal transitions. An example of a state that represents matching the second part of an identifier (first character of the id already read, valid continuation of the id is [a-zA-Z0-9_]*):

foo:
  id = IDENTIFIER
bar:
  idx += 1
  match = idx + 1
  select case as const input_string[idx]
  case asc("a") to asc("z"),_
          asc("A") to asc("Z"),_
          asc("0") to asc("9"),_
          asc("_"): goto bar
  case else: goto exit_label
  end select

Case else is the default transition. Whenever a mismatch occurs, a transition is made to exit_label (located near the end of the scanner routine).

There are some caveats when turning the dfa into code. In plex, flags can be set at two levels:
--> the level of a single token (when defining a single token) or
--> the level of the scanner.

In a hard-coded scanner the flags can only be set at the level of a single token. And the scanner works
as if the user set the options PREGEX_RUN_UCHAR, PREGEX_RUN_NOANCHORS and PREGEX_RUN_NOREF.
Using those options it is still possible to process UTF-8 encoded text, by defining the patterns using multiple bytes per code point (hex escape sequences).

Long story short: you can turn a plex-generated dfa into code (with some restrictions). But is it worth it?
If the hard-coded scanner generated tokens at a rate lower than plex_lex, then creating the tool that turns a dfa into a hard-coded scanner would have been a waste of time.

Luckily for me, the hard-coded scanner generated tokens five times faster than plex_lex.
That increase in performance did not come as a surprise to me: a re2c-generated scanner likewise consists of a combination of switch statements and goto statements, and a re2c-generated scanner generates tokens at the speed of a handwritten scanner.

As hard-coding the scanner worked so well, an obvious next step would be to hard-code the parser as well. In a separate issue I have posted a problem with a parser generated by libphorward.
If the parser-related issue gets resolved (assuming there is a problem and that it is fixable), it should be very possible to hard-code the parser as well. By combining the hard-coded scanner and the hard-coded parser, I should be able to get a LALR parser that parses input at a speed that puts the Flex/Bison combination to shame.

The way I see it, hard-coding the scanner/parser is only worthwhile if performance becomes an issue. If pp_par_parse is able to parse input quickly enough, there is no need to generate code.

But if performance does become an issue, it's nice to know you can squeeze some more performance out of libphorward by simply hard-coding the scanner/parser.
