blynn / nex

Lexer for Go

Home Page: http://cs.stanford.edu/~blynn/nex/

License: GNU General Public License v3.0


nex's Introduction

Nex

Nex is a lexer similar to Lex/Flex that:

  • generates Go code instead of C code

  • integrates with Go’s yacc instead of YACC/Bison

  • supports UTF-8

  • supports nested structural regular expressions.

See Structural Regular Expressions by Rob Pike. I wrote this code to get acquainted with Go and also to explore some of the ideas in the paper. Also, I’ve always been meaning to implement algorithms I learned from a compilers course I took many years ago. Back then, we never coded them; merely understanding the theory was enough to pass the exam.

Go's standard library has a less general scanner package, which is especially suited to tokenizing Go code.
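For comparison, here is a minimal sketch using text/scanner; a sketch only, but it shows the package's Go-flavored defaults (Go-style tokens are recognized and comments are skipped):

package main
import ("fmt";"strings";"text/scanner")
func main() {
  var s scanner.Scanner
  s.Init(strings.NewReader("x := 3.14 // pi"))
  // Scan returns token classes such as scanner.Ident and scanner.Float.
  for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
    fmt.Println(scanner.TokenString(tok), s.TokenText())
  }
}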

Installation

$ export GOPATH=/tmp/go
$ go get github.com/blynn/nex
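With the GOPATH above, the nex binary lands in /tmp/go/bin; add that directory to your PATH to invoke nex as in the examples below:

$ export PATH=$PATH:$GOPATH/bin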

Example

One simple example in the Flex manual is a scanner that counts characters and lines. The program is similar in Nex:

/\n/{ nLines++; nChars++ }
/./{ nChars++ }
//
package main
import ("fmt";"os")
func main() {
  var nLines, nChars int
  NN_FUN(NewLexer(os.Stdin))
  fmt.Printf("%d %d\n", nLines, nChars)
}

The syntax resembles Awk more than Flex: each regex must be delimited. An empty regex terminates the rules section and signifies the presence of user code, which is printed on standard output with NN_FUN replaced by the generated scanner.

Name the above example lc.nex. Then compile and run it by typing:

$ nex -r -s lc.nex

The program runs on standard input and output. For example:

$ nex -r -s lc.nex < /usr/share/dict/words
99171 938587

To generate Go code for a scanner without compiling and running it, type:

$ nex -s < lc.nex  # Prints code on standard output.

or:

$ nex -s lc.nex  # Writes code to lc.nn.go

The NN_FUN macro is primitive, but I was unable to think of another way to achieve an Awk-esque feel. Purists unable to tolerate text substitution will need more code:

/\n/{ lval.l++; lval.c++ }
/./{ lval.c++ }
//
package main
import ("fmt";"os")
type yySymType struct { l, c int }
func main() {
  v := new(yySymType)
  NewLexer(os.Stdin).Lex(v)
  fmt.Printf("%d %d\n", v.l, v.c)
}

and must run nex without the -s option:

$ nex lc.nex

We could avoid defining a struct by using globals instead, but even then we need a throwaway definition of yySymType.
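A sketch of that globals variant, following the line-count example above (the empty yySymType is the throwaway definition):

/\n/{ nLines++; nChars++ }
/./{ nChars++ }
//
package main
import ("fmt";"os")
type yySymType struct{} // throwaway: the generated Lex method refers to it
var nLines, nChars int
func main() {
  NewLexer(os.Stdin).Lex(new(yySymType))
  fmt.Printf("%d %d\n", nLines, nChars)
}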

The yy prefix can be changed with the -p option. When using yacc, both tools must be given the same prefix:

$ nex -p YY lc.nex && go tool yacc -p YY && go run lc.nn.go y.go

Toy Pascal

The Flex manual also exhibits a scanner for a toy Pascal-like language, though last I checked, its comment regex was a little buggy. Here is a modified Nex version, without string-to-number conversions:

/[0-9]+/          { println("An integer:", txt()) }
/[0-9]+\.[0-9]*/  { println("A float:", txt()) }
/if|then|begin|end|procedure|function/
                  { println( "A keyword:", txt()) }
/[a-z][a-z0-9]*/  { println("An identifier:", txt()) }
/\+|-|\*|\//      { println("An operator:", txt()) }
/[ \t\n]+/        { /* eat up whitespace */ }
/./               { println("Unrecognized character:", txt()) }
/{[^\{\}\n]*}/    { /* eat up one-line comments */ }
//
package main
import "os"
func main() {
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
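Saving the above as, say, pas.nex, a run should look something like:

$ echo 'if 3 then x1' | nex -r -s pas.nex
A keyword: if
An integer: 3
A keyword: then
An identifier: x1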

Enough simple examples! Let us see what nesting can do.

Peter into silicon

In "Structural Regular Expressions", Pike imagines a newline-agnostic Awk that operates on matched text, rather than on the whole line containing a match, and writes code converting an input array of characters into descriptions of rectangles. For example, given an input such as:

    #######
   #########
  ####  #####
 ####    ####   #
 ####      #####
####        ###
########   #####
#### #########
#### #  # ####
## #  ###   ##
###    #  ###
###    ##
 ##   #
  #   ####
  # #
##   #   ##

we wish to produce something like:

rect 5 12 1 2
rect 4 13 2 3
rect 3 7 3 4
rect 9 14 3 4
...
rect 10 12 16 17

With Nex, we don’t have to imagine: such programs are real. Below are practical Nex programs that strongly resemble their theoretical counterparts. The one-character-at-a-time variant:

/ /{ x++ }
/#/{ println("rect", x, x+1, y, y+1); x++ }
/\n/{ x=1; y++ }
//
package main
import "os"
func main() {
  x, y := 1, 1
  NN_FUN(NewLexer(os.Stdin))
}

The one-run-at-a-time variant:

/ +/{ x+=len(txt()) }
/#+/{ println("rect", x, x+len(txt()), y, y+1); x+=len(txt()) }
/\n/{ x=1; y++ }
//
package main
import "os"
func main() {
  x, y := 1, 1
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}

The programs are more verbose than Awk because Go is the backend.

Rob but not robot

Pike demonstrates how nesting structural expressions leads to a few simple text editor commands to print all lines containing "rob" but not "robot". Though Nex fails to separate looping from matching, a corresponding program is bearable:

/[^\n]*\n/ < { isrobot = false; isrob = false }
  /robot/    { isrobot = true }
  /rob/      { isrob = true }
>            { if isrob && !isrobot { fmt.Print(lex.Text()) } }
//
package main
import ("fmt";"os")
func main() {
  var isrobot, isrob bool
  lex := NewLexer(os.Stdin)
  NN_FUN(lex)
}

The "<" and ">" delimit nested expressions, and work as follows. On reading a line, we find it matches the first regex, so we execute the code immediately following the opening "<".

Then it’s as if we run Nex again, except we focus only on the patterns and actions up to the closing ">", with the matched line as the entire input. Thus we look for occurrences of "rob" and "robot" in just the matched line and set flags accordingly.

After the line ends, we execute the code following the closing ">" and return to our original state, scanning for more lines.
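For instance, saving the program as, say, rob.nex, a run might look like:

$ printf 'rob\nrobot\nmicrobe\n' | nex -r -s rob.nex
rob
microbe

("microbe" contains "rob" but not "robot", so it is printed; "robot" is not, because the longer match robot wins at that position.)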

Word count

We can simultaneously count lines, words, and characters with Nex thanks to nesting:

/[^\n]*\n/ < {}
  /[^ \t\r\n]*/ < {}
    /./  { nChars++ }
  >      { nWords++ }
  /./    { nChars++ }
>        { nLines++ }
//
package main
import ("fmt";"os")
func main() {
  var nLines, nWords, nChars int
  NN_FUN(NewLexer(os.Stdin))
  fmt.Printf("%d %d %d\n", nLines, nWords, nChars)
}

The first regex matches entire lines: each line is passed to the first level of nested regexes. Within this level, the first regex matches words in the line: each word is passed to the second level of nested regexes. Within the second level, a regex causes every character of the word to be counted.

Lastly, we also count whitespace characters, a task performed by the second regex of the first level of nested regexes. We could remove this statement to count only non-whitespace characters.

UTF-8

The following Nex program converts Eastern Arabic numerals to the digits used in the Western world, and converts Chinese number phrases (the analogue of something like "one hundred and fifty-three") into digits.

/[零一二三四五六七八九十百千]+/ { fmt.Print(zhToInt(txt())) }
/[٠-٩]/ {
  // The above character class might show up right-to-left in a browser.
  // The equivalent of 0 should be on the left, and the equivalent of 9 should
  // be on the right.
  //
  // The Eastern Arabic numerals are ٠١٢٣٤٥٦٧٨٩.
  fmt.Print([]rune(txt())[0] - rune('٠'))
}
/./ { fmt.Print(txt()) }
//
package main
import ("fmt";"os")
func zhToInt(s string) int {
  n := 0
  prev := 0
  f := func(m int) {
    if 0 == prev { prev = 1 }
    n += m * prev
    prev = 0
  }
loop:
  for _, c := range s {
    // Digits 一 (1) through 九 (9).
    for m, v := range []rune("一二三四五六七八九") {
      if v == c {
        prev = m + 1
        continue loop
      }
    }
    // Multipliers; f folds the pending digit into n.
    switch c {
    case '零':
    case '十': f(10)
    case '百': f(100)
    case '千': f(1000)
    }
  }
  n += prev
  return n
}
func main() {
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}

nex and Go’s yacc

The parser generated by go tool yacc exports so little that it’s easiest to keep the lexer and the parser in the same package.

%{
package main
import "fmt"
%}

%union {
  n int
}

%token NUM
%%
input:    /* empty */
       | input line
;

line:     '\n'
       | exp '\n'      { fmt.Println($1.n); }
;

exp:     NUM           { $$.n = $1.n;        }
       | exp exp '+'   { $$.n = $1.n + $2.n; }
       | exp exp '-'   { $$.n = $1.n - $2.n; }
       | exp exp '*'   { $$.n = $1.n * $2.n; }
       | exp exp '/'   { $$.n = $1.n / $2.n; }
	/* Unary minus    */
       | exp 'n'       { $$.n = -$1.n;       }
;
%%

We must import fmt even if we don’t use it, since code generated by yacc needs it. Also, the %union is mandatory; it generates yySymType.
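For reference, the %union above should make yacc generate a definition roughly like this (the yys field is the parser's internal state; exact output varies by version):

type yySymType struct {
  yys int
  n   int
}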

Call the above rp.y. Then a suitable lexer, say rp.nex, might be:

/[ \t]/  { /* Skip blanks and tabs. */ }
/[0-9]*/ { lval.n,_ = strconv.Atoi(yylex.Text()); return NUM }
/./ { return int(yylex.Text()[0]) }
//
package main
import ("os";"strconv")
func main() {
  yyParse(NewLexer(os.Stdin))
}

Compile the two with:

$ nex rp.nex && go tool yacc rp.y && go build y.go rp.nn.go
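The result is a reverse-Polish calculator. Since go build names the binary after the first source file, a session might look like:

$ echo '3 4 + 2 *' | ./y
14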

For brevity, we work in the main package. In a larger project we might want to write a package that exports a function wrapped around yyParse(). This is fine, provided the parser and the lexer are both in the same package.

Alternatively, we could use yacc’s -p option to change the prefix from yy to one that begins with an uppercase letter.
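For example, a hypothetical wrapper (package and function names invented for illustration):

package calc
import "io"
// Parse runs the generated parser on src; yyParse returns 0 on success.
func Parse(src io.Reader) bool {
  return yyParse(NewLexer(src)) == 0
}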

Matching the beginning and end of input

We can simulate awk’s BEGIN and END blocks with a regex that matches the entire input:

/.*/ < { println("BEGIN") }
  /a/  { println("a") }
>      { println("END") }
//
package main
import "os"
func main() {
  NN_FUN(NewLexer(os.Stdin))
}

However, this causes Nex to read the entire input into memory. To solve this problem, Nex supports the following syntax:

<      { println("BEGIN") }
  /a/  { println("a") }
>      { println("END") }
package main
import "os"
func main() {
  NN_FUN(NewLexer(os.Stdin))
}

In other words, if a bare '<' appears as the first pattern, then its action is executed before reading the input. The last pattern must be a bare '>', and its action is executed on end of input.

Additionally, no empty regex is needed to mark the beginning of the Go program. (Fortunately, an empty regex is also a Go comment, so there’s no harm done if present.)

Matching Nuances

Among rules in the same scope, the longest matching pattern takes precedence. In event of a tie, the first pattern wins.
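For example, given the rules:

/if/     { println("keyword") }
/[a-z]+/ { println("identifier") }

the input "iffy" triggers the second rule, which matches more text, while "if" matches both rules at the same length, so the first one wins.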

Unanchored patterns never match the empty string. For example,

/(foo)*/ {}

matches "foo" and "foofoo", but not "".

Anchored patterns can match the empty string at most once; after the match, the start or end null strings are "used up" so will not match again.

Internally, this is implemented by omitting the very first check to see if the current state is accepted when running the DFA corresponding to the regex. An alternative would be to simply ignore matches of length 0, but I chose to allow anchored empty matches just in case there turn out to be applications for them. I’m open to changing this behaviour.

Contributing and Testing

Check out this repo (or a clone) into a directory with the following structure:

mkdir -p nex/src
cd nex/src
git clone https://github.com/blynn/nex.git

The Makefile will put the binary into, e.g., nex/bin.

Reference

// NewLexer creates a new Lexer reading from in.
func NewLexer(in io.Reader) *Lexer

// NewLexerWithInit creates a new Lexer object, runs the given callback on it,
// then returns it.
func NewLexerWithInit(in io.Reader, initFun func(*Lexer)) *Lexer

// Lex runs the lexer and always returns 0.
// When the -s option is given, this function is not generated;
// instead, the NN_FUN macro runs the lexer.
func (yylex *Lexer) Lex(lval *yySymType) int

// Text returns the matched text.
func (yylex *Lexer) Text() string

// Line returns the current line number.
// The first line is 0.
func (yylex *Lexer) Line() int

// Column returns the current column number.
// The first column is 0.
func (yylex *Lexer) Column() int
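For example, an action can use these methods to report token positions (a sketch):

/[0-9]+/ { fmt.Printf("number %q at line %d, column %d\n", yylex.Text(), yylex.Line(), yylex.Column()) }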


nex's Issues

Allow use of - (dash) inside a character group

Currently, if you try to put a dash (-) inside a character class (at the beginning or at the end of it), nex will panic with bad range in character class:

/[-a-z]/ {
}
//
package main
import ("fmt";"os")
func main() {
  NN_FUN(NewLexer(os.Stdin))
}

I wish to be able to use dashes like that, because otherwise the regex becomes cumbersome if your identifiers can contain a dash.

The example with go yacc doesn't work

I was having trouble using nex with go yacc, so I copied the example to test it out. It compiled, but it doesn't work: as soon as I enter some text, it breaks.

Using:
Mac OS 10.10.1
Go version go1.3.3 darwin/amd64

panic: syntax error

goroutine 16 [running]:
runtime.panic(0xa6300, 0x2081aa300)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/panic.c:279 +0xf5
main.Lexer.Error(0x2081ae0c0, 0x2081cc240, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0xe67f0, ...)
    /Users/donnanicolas/go/src/tu-carrito.com/backend/extractor/rp/rp.nn.go:239 +0x67
main.(*Lexer).Error(0x2081c21e0, 0xe67f0, 0xc)
    <autogenerated>:3 +0xa5
main.yyParse(0x22081bc340, 0x2081c21e0, 0x170bc8)
    /Users/donnanicolas/go/src/tu-carrito.com/backend/extractor/rp/yaccpar:155 +0x563
main.main()
    /Users/donnanicolas/go/src/tu-carrito.com/backend/extractor/rp/rp.nn.go:263 +0x84

goroutine 19 [finalizer wait]:
runtime.park(0x17f90, 0x170b90, 0x16ff09)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0x170b90, 0x16ff09)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/proc.c:1445

goroutine 20 [chan send]:
main.func·003(0x2081ae120, 0x2081ae0c0, 0x2081ea000, 0x3, 0x3, 0x0, 0x1)
    /Users/donnanicolas/go/src/tu-carrito.com/backend/extractor/rp/rp.nn.go:136 +0x9c8
created by main.NewLexerWithInit
    /Users/donnanicolas/go/src/tu-carrito.com/backend/extractor/rp/rp.nn.go:193 +0x677

nested expression question

Using nex to parse SQL, I ran into the following problem:
Say we have the two statements "BETWEEN expr AND expr" and "IF( expr AND expr )"; the two "AND" tokens are different. In flex we can write rules like this:

%s BTWMODE
<BTWMODE>AND { BEGIN INITIAL; return AND; }
AND { return ANDOP; }
BETWEEN { BEGIN BTWMODE; return BETWEEN; }

(%s declares an inclusive-mode start condition; %x declares an exclusive one.)

What is the equivalent in nex?
In addition, flex has a "case-insensitive" option to ignore case. nex seems to have no such switch, so I have to write rules like /[Ss][Ee][Ll][Ee][Cc][Tt]/ to represent the "select" keyword. Is there a recommended way to achieve this?

ternary operators throw unexpected error

Given the grammar file:

%{
package main

%}

%union {
    n int
}


%token NUMBER
%token ADD SUB MUL DIV ABS
%token EOL

%%

start:           term
        ;
term:           NUMBER
        |       ABS term { $$ = $2 >= 0? $2 : - $2; }
        ;
%%

Got:

# command-line-arguments
desk.y:33[/Users/drew/src/snazzle/desk/y.go:468]: syntax error: unexpected ?

Expected:
$2 is returned as positive number

regex does not match with parens correctly

I have the following regex:

/[a-z]([a-z0-9:]*[a-z0-9]+)?/

This is intended to match:

g
hey42
hey:there

but NOT:

blah:

Trailing colons should not be allowed, yet it matches.

I need more extensive testing, but AFAIK this is a pretty serious matching bug!

I also tested

/[a-z]+([a-z0-9:]*[a-z0-9]+)?/

which has the same problem :/

Digging deeper... Help welcome =D

Lexer doesn't return a terminal newline token?

So I was trying to migrate a basic bison/flex parser written in C to Go, and settled on goyacc and nex. But I was seeing bizarre behavior where lines were not processed until I entered another newline. To eliminate questions about my grammar, I tried building the rp test program and saw the same behavior: e.g. I enter '111 111 * 3 /', and nothing is printed until I enter another newline (or ctrl-d), at which point it prints the answer. I patched yyDebug to 3 to observe the grammar actions, and saw that the parse wasn't completing because the terminal newline was not being returned to the parser.

Output with yyDebug==3:

reduce 1 in:
state-0
111 111 * 3 /
lex NUM(57346)
reduce 5 in:
state-5
lex NUM(57346)
reduce 5 in:
state-5
lex '*'(42)
reduce 8 in:
state-11
lex NUM(57346)
reduce 5 in:
state-5
lex '/'(47)
reduce 9 in:
state-12

(notice how it hasn't printed anything yet?) Now I enter another newline:

lex '\n'(10)
reduce 4 in:
state-6
4107
reduce 2 in:
state-2

and it finishes up.

p.s. I cloned this straight out of github...

Inconsistent behaviour of nested regular expressions

The following code

/[a-z]+: [a-z]+/ <
  { fmt.Println("BEGIN"); }
  /[a-z]+:/ {
    fmt.Println(1, yylex.Text())
  }
  /[a-z]+/ {
    fmt.Println(2, yylex.Text())
  }
> { fmt.Println("END"); }
//
package main
import ("fmt";"os")
func main() {
  NN_FUN(NewLexer(os.Stdin))
}

(when processed with nex and executed as echo name: value | ./testcase.nn) prints

BEGIN
1 name:
2 value
END

...as you would expect. However, if you change the second nested expression to /.+/:

/[a-z]+: [a-z]+/ <
  { fmt.Println("BEGIN"); }
  /[a-z]+:/ {
    fmt.Println(1, yylex.Text())
  }
  /.+/ {
    fmt.Println(2, yylex.Text())
  }
> { fmt.Println("END"); }
//
package main
import ("fmt";"os")
func main() {
  NN_FUN(NewLexer(os.Stdin))
}

it will print only

BEGIN
2 name: value
END

That is, in the last case the first nested expression is never matched.

strange lexing behavior

I'm having a hard time understanding the behavior of the lexer in the following case:

/\(/   { fmt.Printf("-> %q\n", yylex.Text()) }
/\)/   { fmt.Printf("-> %q\n", yylex.Text()) }
/[^( ][^ ]*[^ )]/ { fmt.Printf("-> %q\n", yylex.Text()) }
//

package main
import ("fmt")
func main() {
  fmt.Printf("lexing %q:\n", "(rule)")
    NN_FUN(NewLexer(strings.NewReader("(rule)")))
  fmt.Printf("lexing %q:\n", "( rule  )")
    NN_FUN(NewLexer(strings.NewReader("( rule )")))
}

Output of nex -r -s huh.nex:

lexing "(rule)":
-> "("
-> "rule"
lexing "( rule  )":
-> "("
-> "rule"
-> ")"

Why is the lexer swallowing the trailing bracket when there is no space between the content and the surrounding brackets? This looks like a bug to me. It has something to do with the second character class in the content regex ([^ ]*): when I change that to also not match ), it works.

Case insensitive regular expressions

Is there a way to define case-insensitive regular expressions? I am making a SQL parser and I need to be able to parse SELECT, select, or any of its variants correctly.

I do know that Flex has %option caseless, is there something similar for Nex?

Different output on reruns

It seems that running nex multiple times on the same file generates different output sometimes. Example:

/a|A/ { return A }
//
package main

I get two different versions (filenames are their shas):

→ diff 55b29ae5b5679cab92281c8e9d3d493cb0f5be6c.go 73edf909a04d60e4f253d456fc4bf8b3ca6eee9c.go
155,156c155,156
<       case 65: return 1
<       case 97: return 2

---
>       case 65: return 2
>       case 97: return 1

I haven't looked at your code yet, but a possible source of randomness could be go's map iteration order (c.f. http://stackoverflow.com/a/9621526/220918).

May hang in an infinite loop on nested Kleene closures

I wrote a complex regex and ran nex, but got an 'out of memory' error.
My VPS has 4 GB of memory, so I suspected something had gone wrong.
I tried to find a simpler regex that causes the same problem; finally, I found this one:

/(([ \t]*)*)*/ { /* eat up whitespace */ }

It seems that nex can hang in an infinite loop on nested Kleene closures.
The crash looks like this:

fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x546c99, 0x16)
/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.sysMap(0xc440200000, 0x20000000, 0xc4401a3e00, 0x5ff718)
/usr/local/go/src/runtime/mem_linux.go:219 +0x1d0
runtime.(*mheap).sysAlloc(0x5e6ec0, 0x20000000, 0x0)
/usr/local/go/src/runtime/malloc.go:407 +0x37a
runtime.(*mheap).grow(0x5e6ec0, 0x10000, 0x0)
/usr/local/go/src/runtime/mheap.go:726 +0x62
runtime.(*mheap).allocSpanLocked(0x5e6ec0, 0x10000, 0x0)
/usr/local/go/src/runtime/mheap.go:630 +0x4f2
runtime.(*mheap).allocStack(0x5e6ec0, 0x10000, 0x0)
/usr/local/go/src/runtime/mheap.go:597 +0x62
runtime.stackalloc(0xc420000000, 0xc4301a3ee0, 0x8000000, 0xc4200001a0, 0x0, 0x0)
/usr/local/go/src/runtime/stack.go:395 +0x2ed
runtime.copystack(0xc4200001a0, 0x20000000, 0x5e3d01)
/usr/local/go/src/runtime/stack.go:839 +0x83
runtime.newstack()
/usr/local/go/src/runtime/stack.go:1070 +0x370
runtime.morestack()
/usr/local/go/src/runtime/asm_amd64.s:366 +0x7f

goroutine 1 [copystack]:
main.gen.func17.1(0x2)
/home/kongdeyu/goprojs/src/github.com/blynn/nex.go:522 fp=0xc4301a42b8 sp=0xc4301a42b0
main.gen.func17.1(0x1)
/home/kongdeyu/goprojs/src/github.com/blynn/nex.go:527 +0xf8 fp=0xc4301a4308 sp=0xc4301a42b8
main.gen.func17.1(0x2)
/home/kongdeyu/goprojs/src/github.com/blynn/nex.go:527 +0xf8 fp=0xc4301a4358 sp=0xc4301a4308
main.gen.func17.1(0x1)
/home/kongdeyu/goprojs/src/github.com/blynn/nex.go:527 +0xf8 fp=0xc4301a43a8 sp=0xc4301a4358
main.gen.func17.1(0x2)
/home/kongdeyu/goprojs/src/github.com/blynn/nex.go:527 +0xf8 fp=0xc4301a43f8 sp=0xc4301a43a8
main.gen.func17.1(0x1)

Difficulty with matching a "string"

Writing a little programming language, and trying to lex a "string":

Eg:

$foo = "hello, world"

This is kind of tricky; in particular, the docs say under "Matching Nuances":

"Among rules in the same scope, the longest matching pattern takes precedence. In event of a tie, the first pattern wins."

This means that if I have other similar patterns in the code, this will match when I don't expect it. Additionally, is there some way to specify a string? I don't want to write in every unicode char, and if you do a /".*"/ match, it matches too much!!

issue with y.(*Lexer).p

hi,

could the following hack be added by nex?
printf '/NEX_END_OF_LEXER_STRUCT/i\np *Tacky\n.\nw\nq\n' | ed -s tacky.nn.go

also the nex file expects a newline at the end of the file... why?

Unpaired curly inside string confuses nex

Quite hilariously, an unpaired curly brace inside nex golang code causes lex/parse errors with nex itself!

Example:

/\$[a-z][a-z0-9]*{[0-9]+}/
		{
			yylex.pos(lval) // our pos
			s := yylex.Text()
			a := strings.Split(s, "{") // XXX: close match here: }
			lval.str = a[0]
			return IDENTIFIER
		}

Note the comment I added with a close brace. I added that as a workaround so that this works. Remove it and you'll see nex errors:

panic: unmatched '{'

goroutine 16 [running]:
runtime.gopanic
	../../../libgo/go/runtime/panic.go:493
main.$nested34
	/builddir/build/BUILD/nex-5344f151fd3251726650dffd30a531d3f1bddc17/nex.go:1027
main.$nested35
	/builddir/build/BUILD/nex-5344f151fd3251726650dffd30a531d3f1bddc17/nex.go:1094
main.process
	/builddir/build/BUILD/nex-5344f151fd3251726650dffd30a531d3f1bddc17/nex.go:1099
main.main
	/builddir/build/BUILD/nex-5344f151fd3251726650dffd30a531d3f1bddc17/main.go:81
runtime_main
	../../../libgo/runtime/proc.c:606

HTH

"undefined: os" error in example.in readme.

Hi,

After typing in the line-count example from the readme, I redirected its output to a file and ran "go build" on it, and got an "undefined: os" error for NewLexer(os.Stdin).

The readme says the generated code already imports os, but mine (1eadfa3) didn't include it. I am on Windows 7.

Unmatched characters are skipped!

hey there,

playing with nex, and using the below... The problem is I had to add an ERROR identifier to match anything not previously caught, since otherwise random typos in the lexed code are silently skipped. Any advice on how to improve this?

Thanks!

/[ \t\n]/	{ /* Skip blanks and tabs. */ }
/{/		{ return OPEN_CURLY }
/}/		{ return CLOSE_CURLY }
/if/		{ return IF }
/else/		{ return ELSE }
/=/		{ return EQUALS }
/\$[a-z]+/	{
			s := yylex.Text()
			lval.str = s[1:len(s)] // remove the leading $
			return VAR_IDENTIFIER
		}
/[a-z]+/	{
			lval.str = yylex.Text()
			return IDENTIFIER
		}
/./		{ return ERROR }
//
package main
import ()

Size parameter should also be returned

Description

In the source there are func (yylex *Lexer) Line() int and func (yylex *Lexer) Column() int. I think it would also be useful to have func (yylex *Lexer) Size() int, which returns the position in bytes relative to the start of the file.

Current approach

I'd like to lex and parse a number of different files together. The approach I'm attempting is to concatenate them together with a https://golang.org/pkg/io/#MultiWriter. I also store a list of cumulative file.Size() offsets for each file. I get this via Stat when opening each file for the MultiWriter.

At the end of lexing/parsing if I get an error, I work backwards from position of error to line number and file.

Problem

The problem is that I'm doing the "math" with line numbers instead of Size offsets. This means that I need to loop through each file (before lexing/parsing) and count all the newlines. If Size were available in addition to Column and Line, this would be much more direct.

I haven't yet worked out how I'd get from the correct size offset to Column and Line, but I figure that's doable somehow.

Help

Help is appreciated if anyone can contribute. This currently works with line number alone, but as I mentioned a Size offset might be preferable. If there's an alternative technique for lexing/parsing multiple files together as one, please let me know!

Thanks!

Getting the AST without a global...

Is this possible? Currently I have to do something like:

in parser.y:

top:	prog
	{
		langGlobal = $1.prog
	}
;

in main.go:

	lexer := NewLexer(os.Stdin)
	yyParse(lexer) // writes the result to langGlobal

	log.Printf("behold, the AST: %+v\n", langGlobal)

I know it's not directly nex related, but I was hoping someone here might know! Thanks :)

Thread safety

I'm guessing that Nex is not thread safe if it uses globals, am I wrong? My program needs to be able to parse in multiple threads and I'm wondering if I will need to lock a mutex for every time Lex will be called.

Solved: How to match single line comments (using negation!)

I previously couldn't figure out how to match single-line comments. Now it's solved, so I'm opening and closing this issue so it will be seen as a reference for future users.

Match the single line comment:

/#[^\n]*/
	{	// this matches a (#) pound char followed by any
		// number of chars that aren't the (\n) newline!
		s := yylex.Text()
		lval.str = s[1:len(s)] // remove the leading #
		log.Printf("lang: lexer: comment: `%s`", lval.str)
		//return COMMENT // skip return to avoid parsing
	}

The trick is that the ^ negates the newline char, so the class matches all the junk you want, up until the newline. In the source this is the negate property.

Sadly, it seems as if the regexp code was copy+pasted incompletely from the golang lib, and as a result a lot of standard regexp features are missing, such as [[:alpha:]].

Hope this hidden tutorial helped you!

Multiline Comments

Are multiline comments possible with nex? I've been playing with nex quite a bit and I'm very impressed. However, I'm having trouble figuring out how to handle multiline comments. Can this be done with the nesting feature? I've tried but could not figure it out. Below is one thing I tried, but the regex is greedy and the nesting grabs the rest of the document.

/[0-9]+/            { println("INTEGER:", txt()) }
/[0-9]+\.[0-9]*/    { println("FLOAT:", txt()) }
/if|then|begin|end|procedure|function/
                    { println( "KEYWORD:", txt()) }
/[a-z][a-z0-9]*/    { println( "ID:", txt()) }
/\+|-|\*|\//        { println("OP:", txt()) }
/[ \t]+/            { /* eat up whitespace */ }
/\/\*.*|\n*\*\// <  { println("BEG_MULTI:", txt()) }
  /.*|\n*\*\//      { println("BEG_COMMENT:", txt()) }
  /\n/              { println("NEWLINE") }
>                   { println("END_COMMENT", txt()) }
/\n/                { println("NEWLINE") }
/./                 { println("UNRECOGNIZED CHAR:", txt()) }
//
package main
import (
  "os"
)

func main() {
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
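For what it's worth, the classic DFA-friendly flex pattern for C-style comments might be adapted here instead of nesting (a sketch, untested with nex): it never lets the match run past the first closing */.

/\/\*([^*]|\*+[^*\/])*\*+\// { /* eat multiline comments */ }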

no way to change func (yylex Lexer) Error(e string)

As far as I can tell, there doesn't seem to be a way to change the func (yylex Lexer) Error(e string) function. This function gets called automatically from a go tool yacc generated parser, so it would be nice to be able to customize it.

Do we really need channels and goroutines?

I haven't done super serious profiling yet, but in a rather large application we run, our lexer keeps coming up in goroutine dumps at what seems to be a disproportionately high rate, especially given that we lex very small strings as a tiny part of everything our app does. I found a bunch of routines awaiting chan receive in the generated lexer, which kind of immediately raised my eyebrow. Given what the lexer does, does it really need to use channels and goroutines to accomplish its job? I'm not familiar with the design constraints or exactly what kind of guarantees we want to make for the user code that gets embedded in the generator, but it seems like this could all be done more efficiently single-threaded.

Thoughts?

shortest possible match | non greedy match

There is currently no possibility I could find to get the shortest possible match (non-greedy behaviour).

There should be a possibility to split the following snippet:

<?php
  b
?>
text
<?php
  a
?>

to these 2 matches:

<?php
  b
?>

and

<?php
  a
?>

Currently the regex /<\?php.*\?>/ matches the whole text.

Or did I simply miss something?
Thanks

generating syntax errors

hi,

how can I generate a syntax error using nex and add more information to the syntax error?

I tried to create a custom error function for nex:
nex -e=true lexer.nex

here is my nex file:

/while/ { return WHILE }
/print/ { return PRINT }
/;/ { return END_LINE }
/\+|-/ { lval.s = yylex.Text(); return ADD_OP }
/\*|\// { lval.s = yylex.Text(); return MUL_OP }
/=/ { return ASSIGN }
/\(/ { return BEGIN_EXPRESSION }
/\)/ { return END_EXPRESSION }
/{/ { return BEGIN_BLOCK }
/}/ { return END_BLOCK }
/[0-9]+/ { lval.s = yylex.Text(); return NUMBER }
/[a-z][a-z0-9]*/ { lval.s = yylex.Text(); return IDENTIFIER }
/\n+/ { lineno += len(yylex.Text()) }
//
package dsl
import "fmt"

var lineno int // idk if that's good to put it global?

func (yylex Lexer) Error(e string) {
  yylex.p.err = fmt.Sprintf("Syntax error in line %d", lineno+1)
}

I can print the line number where the syntax error happened.
EDIT: there are these two variables in the Lexer struct:
l, c int // line number and character position
Are they incremented automatically, and can I remove my lineno variable?

How can I add more information to the error message? I want to include the token(s) that were not recognized during parsing.

E.g., how can I keep track of the last token I sent to the parser, and then on error just say Printf("error after %s on line %d", humanReadable(lastToken, lastLine))?
