Using the following patterns (the scanner is defined as a plex ptr), I get a dfa that consists of more states than expected. Note: the $ prefix on the strings keeps the FreeBASIC parser from interpreting escape sequences inside them, so I don't have to add extra \ characters.
After calling plex_prepare, the resulting dfa looks like this (printed in a 'human readable' format):
0 (row size = 35, ID = 0, FLAGS = 0, REF = 0, DEF = 26,"_",1,"a",2,"d",3,"i",4,"t",5,"A","Z",6,"b","c",6,"e","h",6,"j","s",6,"u","z",6)
1 (row size = 17, ID = 0, FLAGS = 0, REF = 0, DEF = 26,"0","9",7,"A","Z",7,"_",7,"a","z",7)
2 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"s",8,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","r",10,"t","z",10)
3 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"i",11,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","h",10,"j","z",10)
4 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"f",12,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","e",10,"g","z",10)
5 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"h",13,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","g",10,"i","z",10)
6 (row size = 23, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","z",10)
7 (row size = 23, ID = 6, FLAGS = 0, REF = 0, DEF = 26,"TAB",14,"SPACE",14,"0","9",7,"A","Z",7,"_",7,"a","z",7)
8 (row size = 23, ID = 1, FLAGS = 0, REF = 0, DEF = 26,"TAB",15,"SPACE",15,"0","9",10,"A","Z",10,"_",10,"a","z",10)
9 (row size = 11, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"TAB",9,"SPACE",9)
10 (row size = 23, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","z",10)
11 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"m",16,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","l",10,"n","z",10)
12 (row size = 23, ID = 3, FLAGS = 0, REF = 0, DEF = 26,"TAB",17,"SPACE",17,"0","9",10,"A","Z",10,"_",10,"a","z",10)
13 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"e",18,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","d",10,"f","z",10)
14 (row size = 11, ID = 6, FLAGS = 0, REF = 0, DEF = 26,"TAB",14,"SPACE",14)
15 (row size = 11, ID = 1, FLAGS = 0, REF = 0, DEF = 26,"TAB",19,"SPACE",19)
16 (row size = 23, ID = 2, FLAGS = 0, REF = 0, DEF = 26,"TAB",20,"SPACE",20,"0","9",10,"A","Z",10,"_",10,"a","z",10)
17 (row size = 11, ID = 3, FLAGS = 0, REF = 0, DEF = 26,"TAB",21,"SPACE",21)
18 (row size = 29, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"n",22,"TAB",9,"SPACE",9,"0","9",10,"A","Z",10,"_",10,"a","m",10,"o","z",10)
19 (row size = 11, ID = 1, FLAGS = 0, REF = 0, DEF = 26,"TAB",15,"SPACE",15)
20 (row size = 11, ID = 2, FLAGS = 0, REF = 0, DEF = 26,"TAB",23,"SPACE",23)
21 (row size = 11, ID = 3, FLAGS = 0, REF = 0, DEF = 26,"TAB",17,"SPACE",17)
22 (row size = 23, ID = 4, FLAGS = 0, REF = 0, DEF = 26,"TAB",24,"SPACE",24,"0","9",10,"A","Z",10,"_",10,"a","z",10)
23 (row size = 11, ID = 2, FLAGS = 0, REF = 0, DEF = 26,"TAB",20,"SPACE",20)
24 (row size = 11, ID = 4, FLAGS = 0, REF = 0, DEF = 26,"TAB",25,"SPACE",25)
25 (row size = 11, ID = 4, FLAGS = 0, REF = 0, DEF = 26,"TAB",24,"SPACE",24)
Format of the above: the first number on each line is the state number, and the rest is the content of the row for that state (the transitions start after DEF = 26). There are no default transitions, and flags and references are both 0 (partly due to the use of PREGEX_COMP_NOANCHORS | PREGEX_COMP_NOREF). The word "TAB" denotes the character \t and the word "SPACE" denotes a single space character.
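To make the row format concrete, here is a small Python sketch that reads one line of the dump. It is not part of plex: parse_row and both regular expressions are hypothetical helpers. It assumes that one quoted character before a target is a single-character range and two quoted characters are an inclusive range. That reading is consistent with the row sizes above if every transition takes three entries after five header fields, but the layout is inferred from the dump, not from the library.

```python
import re

# Header fields as printed in the dump, followed by the transition list.
ROW = re.compile(
    r'^(\d+) \(row size = (\d+), ID = (\d+), FLAGS = \d+, REF = \d+, DEF = \d+(.*)\)$'
)
# A token is either a quoted character name or a bare target state number.
TOKEN = re.compile(r'"([^"]*)"|(\d+)')


def parse_row(line):
    """Return (state, row_size, accept_id, [(lo, hi, target), ...])."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError("not a dfa row: " + line)
    state, size, accept_id = int(m.group(1)), int(m.group(2)), int(m.group(3))
    transitions, pending = [], []
    for quoted, number in TOKEN.findall(m.group(4)):
        if number:
            # One quoted char means lo == hi, two mean an inclusive range.
            transitions.append((pending[0], pending[-1], int(number)))
            pending = []
        else:
            pending.append(quoted)
    return state, size, accept_id, transitions


row9 = '9 (row size = 11, ID = 5, FLAGS = 0, REF = 0, DEF = 26,"TAB",9,"SPACE",9)'
state, size, accept_id, transitions = parse_row(row9)
# Assumed layout: 5 header entries plus 3 entries per transition.
print(state, accept_id, transitions, size == 5 + 3 * len(transitions))
```

Under that assumption the size check also holds for state 1 (17 = 5 + 3 * 4) and state 0 (35 = 5 + 3 * 10).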
The part of the dfa that seems 'wrong' is restricted to the states reached after the scanner has found one of the words as, dim, if or then. After scanning one of those words, the next character decides how matching continues. If the next character in the input is [ \t], the word is found and the scanner continues consuming the trailing [ \t].
If the next character in the input is [a-zA-Z0-9_] then the word was only a prefix of an identifier and the scanner continues scanning for an identifier.
States 15, 17, 19, 20, 21, 23, 24 and 25 are the ones that make me go 'hmmm...'.
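As an illustration of what happens in those states, the sketch below walks a hand-copied excerpt of the table above (only the transitions needed for the word as followed by whitespace). The DFA dictionary and the trace function are hypothetical helpers written for this post, not plex API.

```python
# Excerpt of the dump: only the edges used by "as" plus trailing [ \t].
DFA = {
    0: {"a": 2},
    2: {"s": 8},
    8: {"\t": 15, " ": 15},
    15: {"\t": 19, " ": 19},
    19: {"\t": 15, " ": 15},
}


def trace(text, start=0):
    """Return the list of states visited while consuming text."""
    path = [start]
    for ch in text:
        nxt = DFA.get(path[-1], {}).get(ch)
        if nxt is None:
            break  # no transition in this excerpt
        path.append(nxt)
    return path


print(trace("as \t "))  # [0, 2, 8, 15, 19, 15]
```

The trailing whitespace makes the scanner bounce between states 15 and 19, both of which carry ID = 1, instead of staying in one state.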
What makes the dfa look strange to me is that states 9 and 14 do behave as expected. Those states are entered after the scanner has matched an identifier and then found a whitespace character.
In state 9 the only possible transition is on [ \t]. Since the ID does not change, the src and dst of that transition are both state 9.
In state 14 the only possible transition is on [ \t]. Since the ID does not change, the src and dst of that transition are both state 14.
Unless I am getting something wrong here, states 15, 17, 20 and 24 should behave like states 9 and 14: the src and dst of the transition in each of those states should be that state itself.
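So you'd get
15 (row size = 11, ID = 1, FLAGS = 0, REF = 0, DEF = 26,"TAB",15,"SPACE",15)
17 (row size = 11, ID = 3, FLAGS = 0, REF = 0, DEF = 26,"TAB",17,"SPACE",17)
20 (row size = 11, ID = 2, FLAGS = 0, REF = 0, DEF = 26,"TAB",20,"SPACE",20)
24 (row size = 11, ID = 4, FLAGS = 0, REF = 0, DEF = 26,"TAB",24,"SPACE",24)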
Given the above transitions, states 19, 21, 23 and 25 seem superfluous.
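One way to check this claim independently of plex is a textbook partition refinement (Moore-style minimization) over the whitespace states. The sketch below is such a check under stated assumptions: the two tables are copied by hand from the dump, the function name equivalence_classes is hypothetical, and nothing here says whether plex_prepare performs a minimization step at all.

```python
# Accepting ID of each whitespace state, copied from the dump.
ACCEPT_ID = {9: 5, 14: 6, 15: 1, 19: 1, 17: 3, 21: 3, 20: 2, 23: 2, 24: 4, 25: 4}
# TAB and SPACE always share a target in these states, so one entry covers both.
NEXT = {9: 9, 14: 14, 15: 19, 19: 15, 17: 21, 21: 17, 20: 23, 23: 20, 24: 25, 25: 24}


def equivalence_classes(accept_id, nxt):
    """Group states that no input can tell apart (partition refinement)."""
    block = dict(accept_id)  # start: states with the same ID share a block
    while True:
        # A state's signature is its own block plus the block of its target.
        signature = {s: (block[s], block[nxt[s]]) for s in nxt}
        labels = {}
        refined = {s: labels.setdefault(signature[s], len(labels)) for s in nxt}
        if len(set(refined.values())) == len(set(block.values())):
            break  # no block was split, the partition is stable
        block = refined
    classes = {}
    for s, b in refined.items():
        classes.setdefault(b, []).append(s)
    return sorted(sorted(c) for c in classes.values())


print(equivalence_classes(ACCEPT_ID, NEXT))
# [[9], [14], [15, 19], [17, 21], [20, 23], [24, 25]]
```

Each pair ends up in one class, so a minimized table would need a single self-looping state per keyword, just like states 9 and 14.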
The routine that uses the dfa (plex_lex) works as expected, so no problems there.
But the dfa produced by plex_prepare is bigger than expected (it looks 'suspicious').
States 9 and 14 are the ones that make me think there might be an issue with the dfa creation algorithm, since they show that a single self-looping state is enough to consume trailing whitespace.