telekons / one-more-re-nightmare


A fast regular expression compiler in Common Lisp

Home Page: https://applied-langua.ge/projects/one-more-re-nightmare/

License: BSD 2-Clause "Simplified" License

Common Lisp 100.00%

common-lisp regular-expression-engine compiler regex lisp

one-more-re-nightmare's Issues

Tiered compilation

The compiler is a bit slow. It's faster than compilers for many programming languages, but it's weird for a funny-looking regular expression to take 150ms or so to compile. I've done my part in optimizing DFA generation, so most compile-time is taken by the Common Lisp compiler. On the other hand, the compiler also generates really good code - I wouldn't be surprised if most projects could go without full compilation. While we could sink DFA compilation into compile time when the regular expression is known at compile time, we do provide a function which conceptually accepts a string at runtime, and so it would be nice to do something useful if we need to compile at runtime.

So here is an idea: we do something more like JIT compilation and have a tiered compiler. The current compiler is used for "hot" regexes which are used to scan lots of vectors, and we use a chain of closures for cold code. Some heuristic would be used to make the switch to hot code, such as if we've matched more than some length with a cached chain of closures, and the optimizing compiler could run concurrently in another thread, too.

The chain of closures would take some of the benefits of DFA generation: we know how many registers are needed (though, without a compiler with copy propagation we overestimate substantially; oh well) and we still have linear complexity while scanning. We also could still use type splitting for chains of closures. All in all I wouldn't be too surprised if the cold compiler was still faster than cl-ppcre and other bytecode/closure-based implementations.
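
The tier switch could be sketched roughly as below. Every name here (tiered-matcher, tiered-match, the 4096-character threshold) is hypothetical and not part of one-more-re-nightmare, and the "hot" compilation runs synchronously rather than in another thread, to keep the sketch short.

```lisp
;; Hypothetical sketch of a two-tier matcher; not one-more-re-nightmare's API.
(defstruct tiered-matcher
  cold              ; cheap matcher, e.g. a chain of closures: (lambda (string) ...)
  make-hot          ; thunk that runs the slow optimising compiler, returns a matcher
  (hot nil)         ; optimised matcher, once it has been compiled
  (scanned 0)       ; total characters scanned on the cold tier
  (threshold 4096)) ; assumed heuristic: promote after this many characters

(defun tiered-match (matcher string)
  "Run MATCHER on STRING, promoting it to the hot tier once enough text was scanned."
  (let ((hot (tiered-matcher-hot matcher)))
    (when hot
      (return-from tiered-match (funcall hot string))))
  (incf (tiered-matcher-scanned matcher) (length string))
  (when (> (tiered-matcher-scanned matcher) (tiered-matcher-threshold matcher))
    ;; Compiled synchronously here; a real implementation might hand this
    ;; to a background thread and keep using the cold tier in the meantime.
    (setf (tiered-matcher-hot matcher)
          (funcall (tiered-matcher-make-hot matcher))))
  (funcall (tiered-matcher-cold matcher) string))
```

The call that crosses the threshold still runs on the cold tier; only later calls see the hot matcher.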

Shorter package nickname?

I know people can use package-local nicknames by themselves, but what about omrn as a built-in package nickname?

Make character ranges inclusive

Character ranges currently seem to be exclusive, rather than inclusive of the last character.

> (all-string-matches "[a-b]" "a")
(#("a"))
> (ren:all-string-matches "[a-b]" "b")
nil

I don't think this is in line with POSIX and based on the examples I doubt it's intentional. Can the range be made inclusive instead?
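
For illustration, the difference between the two readings of a range comes down to one comparison. The function names below are made up for this sketch, and it is only an assumption that the library's behaviour corresponds to the half-open test.

```lisp
;; Hypothetical predicates; not taken from one-more-re-nightmare's source.
(defun in-range-exclusive-p (c low high)
  "Half-open test [LOW, HIGH): matches the behaviour reported above."
  (and (char<= low c) (char< c high)))

(defun in-range-inclusive-p (c low high)
  "Closed test [LOW, HIGH], which is how POSIX bracket expressions read a range."
  (char<= low c high))
```

With these, #\b is rejected by the exclusive test for [a-b] and accepted by the inclusive one.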

Matching problems with `[0-9][0-9]`

CL-USER> (one-more-re-nightmare:all-matches "[0-1][0-9][0-9]" "192")
NIL
CL-USER> (one-more-re-nightmare:all-matches "[0-9][0-9]" "192")
NIL
CL-USER> (one-more-re-nightmare:all-matches "[0-9][0-9]" "1921")
(#(2 4))

Using latest Quicklisp (20220331) with SBCL 2.2.4.
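
One possible explanation, not confirmed anywhere in this report, is that the upper bound of a range is treated as exclusive, so that [0-9] does not accept 9. The toy matcher below is a sketch written only to check that hypothesis against the output above; match-ranges-at and all-range-matches are hypothetical names and share no code with the library.

```lisp
;; Toy matcher for a sequence of character ranges, given as (LOW . HIGH) pairs.
(defun match-ranges-at (ranges string start inclusive)
  "Return the end position if RANGES match successive characters from START, else NIL."
  (let ((pos start))
    (dolist (range ranges pos)
      (unless (and (< pos (length string))
                   (char<= (car range) (char string pos))
                   (if inclusive
                       (char<= (char string pos) (cdr range))
                       (char< (char string pos) (cdr range))))
        (return nil))
      (incf pos))))

(defun all-range-matches (ranges string &key inclusive)
  "Collect #(start end) vectors for non-overlapping matches, scanning left to right."
  (let ((matches '())
        (start 0))
    (loop while (< start (length string))
          do (let ((end (match-ranges-at ranges string start inclusive)))
               (if end
                   (progn (push (vector start end) matches)
                          (setf start end))
                   (incf start))))
    (nreverse matches)))
```

With exclusive upper bounds, two [0-9] ranges find nothing in "192" and only positions 2 to 4 in "1921", which is the same as the transcript; with :inclusive t the first two characters of "192" match.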

Customise prefix code generation

one-more-re-nightmare splits a regular expression into a prefix string and the rest of the RE. The prefix is scanned for more efficiently using Boyer-Moore-Horspool, before entering the DFA body generated for the rest of the RE. (This technique is also used in cl-ppcre and GNU grep to my knowledge.)
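
For readers unfamiliar with it, a textbook Boyer-Moore-Horspool search looks roughly like this. It is a plain runtime function written for illustration (horspool-search is a hypothetical name), whereas the library generates specialised code for each prefix.

```lisp
;; Textbook Boyer-Moore-Horspool over strings; illustrative only.
(defun horspool-search (prefix text)
  "Return the index of the first occurrence of PREFIX in TEXT, or NIL."
  (let ((m (length prefix))
        (n (length text))
        (skip (make-hash-table)))
    (when (zerop m)
      (return-from horspool-search 0))
    ;; For every character but the last, record its distance from the end
    ;; of PREFIX; later (rightmost) occurrences overwrite earlier ones.
    (loop for i from 0 below (1- m)
          do (setf (gethash (char prefix i) skip) (- m 1 i)))
    (loop with pos = 0
          while (<= (+ pos m) n)
          do (if (string= prefix text :start2 pos :end2 (+ pos m))
                 (return pos)
                 ;; Shift by the skip of the character under the last
                 ;; position of the window; unknown characters skip M.
                 (incf pos (gethash (char text (+ pos m -1)) skip m)))
          finally (return nil))))
```

The shift depends only on the last character of the current window, which is what makes the inner loop cheap compared with checking every position.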

However, it is possible that BMH is not ideal for processing prefixes. Modern processors have vector arithmetic units which can perform many element comparisons at once, and it is not too hard to design a substring search which uses vectorised code. When combined with heuristics based on properties known about the vector to search, vectorised searching can be faster than clever serial searching algorithms.

It should be possible to allow the client to generate their own prefix scanning code. The interface to the BMH code generator provides most of the required information to generate code, with the lambda list (start length vector prefix aref-generator fail). The first three arguments are symbols naming internal variables o-m-r-n uses, then the prefix sequence, then the :aref-generator argument to compile-regular-expression, then some code to evaluate when there are no more matches.

Match start and end of string

Great project. Is it intentional that a caret ^ does not match the start of the string and a dollar $ does not match the end? Wikipedia seems to think those are included in POSIX regex.

Compiler objects

If we proceed with allowing more client specialization of code generated by one-more-re-nightmare, it would be necessary to introduce a representation for information for the compiler, and separate caches for code compiled with different specializations. Following the "backend" object Petalisp uses, we could introduce an object which stores this information, and classes which can be specialized on to introduce changes to the compiler.

It may also be useful to allow for different policies for holding onto cached code; a LRU cache may be more appropriate if many regular expressions are used, or a concurrent hash table if contention* over the cache table is somehow a problem.

The high level interface could be reused by providing a dynamic variable, say *compiler* which would default to a compiler object with the standard optimisations and hash table. The variable could be re-bound with another compiler object, and the provided functions would work with that compiler object.
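
A minimal sketch of that shape, assuming CLOS and one hash table per compiler object. The names compiler, compile-regex and find-code are placeholders, the "compiled code" is a stand-in closure doing a literal substring search, and the cache is still not thread-safe.

```lisp
;; Hypothetical compiler object with its own code cache.
(defclass compiler ()
  ((cache :initform (make-hash-table :test 'equal)
          :reader compiler-cache)))

(defgeneric compile-regex (compiler regex)
  (:documentation "Compile REGEX; subclasses of COMPILER may specialise this."))

(defmethod compile-regex ((compiler compiler) regex)
  ;; Stand-in for real code generation: search for REGEX as a literal string.
  (lambda (string) (search regex string)))

(defvar *compiler* (make-instance 'compiler)
  "The compiler object used by the high-level interface.")

(defun find-code (regex &optional (compiler *compiler*))
  "Return cached code for REGEX, compiling it with COMPILER on a miss."
  (let ((cache (compiler-cache compiler)))
    (or (gethash regex cache)
        (setf (gethash regex cache)
              (compile-regex compiler regex)))))
```

Rebinding *compiler* around a call, as in (let ((*compiler* (make-instance 'compiler))) (find-code "ab")), then uses a separate cache without changing the calling code.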

*Our cache is currently not thread-safe. Well, that was stupid not to do.

Some systems failed to build for Quicklisp dist

Building with SBCL 2.2.3.156-e971aa48f / ASDF 3.3.5 for quicklisp dist creation.

Trying to build commit id 14c62db

one-more-re-nightmare-simd fails to build with the following error:

; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-AND) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-OR) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-NOT) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V32>) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V8>) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V32=) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V8=) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V8-) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-BROADCAST32) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-BROADCAST8) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-MOVEMASK32) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-MOVEMASK8) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-LOAD32) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
;   (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-LOAD8) ...))
;   SIMD-PACK-256-INT is not a defined primitive type.
...
Unhandled UIOP/LISP-BUILD:COMPILE-FILE-ERROR in thread #<SB-THREAD:THREAD "main thread" RUNNING {1001C50003}>: COMPILE-FILE-ERROR while compiling #<CL-SOURCE-FILE "one-more-re-nightmare-simd" "sbcl-x86-64">

Full log here

Runtime feedback for SIMD and repetition

Consider the regular expression "(¬")*", which will match some text enclosed in double quotes.

There are two ways to run the "tight loop" currently:

1. We may use the "scalar" DFA and repeatedly check each character, exiting the loop when we encounter a " character.
2. We may use SIMD and scan several characters at a time.

The latter is faster than the former for long texts, but not for shorter texts. (I suspect less than one SIMD vector wide? idk) So to achieve best performance, we should choose the more appropriate tight loop implementation. There is no indication of what lengths are expected in the regular expression, and we have no way to annotate such things, so we must rely on runtime feedback (which is probably for the better, in terms of performance/effort on behalf of the user).

According to Cliff Click (from a conversation in the coffee compiler club) it should suffice to maintain a count of how many times the tight loop is iterated. Upon entering the loop, we compare the last count to some crossover point. If we have a higher count, we take the SIMD route, else we use scalar code.
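
As a sketch of that scheme: the closure below remembers how far the previous run of the tight loop got, and compares it to a crossover point on the next entry. The names and the crossover value of 32 are assumptions for illustration, and both "loops" are ordinary functions taking a string and a start position.

```lisp
;; Hypothetical feedback-driven choice between two loop implementations.
(defparameter *crossover* 32
  "Assumed iteration count above which the SIMD loop is expected to win.")

(defun make-feedback-loop (scalar-loop simd-loop)
  "Return a function of (STRING START) which runs SIMD-LOOP if the previous
entry iterated more than *CROSSOVER* times, and SCALAR-LOOP otherwise.
Both loops must return the position at which they stopped."
  (let ((last-count 0))
    (lambda (string start)
      (let* ((use-simd (> last-count *crossover*))
             (end (funcall (if use-simd simd-loop scalar-loop) string start)))
        ;; Record this run's length as feedback for the next entry.
        (setf last-count (- end start))
        (values end (if use-simd :simd :scalar))))))

(defun scan-to-quote (string start)
  "Stand-in for either loop: stop at the next double quote or at the end."
  (or (position #\" string :start start) (length string)))
```

Because only the last count is kept, one long string switches the next entry to SIMD, and one short string switches it back.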

There is also a case for unrolling the scalar code according to Gilbert Baumann, but I haven't tried that yet. For very long loop counts, unrolling the SIMD code may also produce some benefit, but we may face more code bloat; I still intend to compile everything eagerly, because it is simpler, and the duplicate code is not too large anyway.

Equivalent of ? and {n,m} for POSIX ERE feature parity

Title says it all, having equivalents would be nice.

Include in quicklisp

I would be interested in accessing this through quicklisp. It may be there already but I didn't see it. Thanks.

Lint regular expressions

The compiler macro should complain if a RE or some submatch will never match, at the least. Make the user think we're being productive while we're compiling the same code four times.

CL-PPCRE compatible functions

It would be great if there were CL-PPCRE compatible functions, at least a few of them -- then it might be possible to just load OMRN instead and provide CL-PPCRE as local nickname.

Eg. ALL-MATCHES has a different result format; perhaps that could be changed yet? Not sure whether a transforming function (to get compatibility) would be fast enough ;)
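
For what it's worth, a transforming function could be quite small. This sketch assumes each match is a vector whose first two elements are the start and end positions, as in the (#(2 4)) output shown in an earlier issue, and flattens them into the (start1 end1 start2 end2 ...) list that cl-ppcre:all-matches returns. The name matches->ppcre-positions is made up.

```lisp
;; Hypothetical adapter from a list of match vectors to CL-PPCRE's flat list.
(defun matches->ppcre-positions (matches)
  "Flatten (#(start end ...) ...) into (start1 end1 start2 end2 ...)."
  (loop for match in matches
        collect (aref match 0)
        collect (aref match 1)))
```

The cost is one pass over the list of matches, so it should be small next to the scan itself unless matches are extremely dense.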

Thanks!

Character classes

Hello, and first, thanks a lot for trying to make CL regexps faster (I come from mariomka/regex-benchmark#43).

At first I wanted to emulate \w by using [[:alnum:]_<other ranges>] but I realized that simple disjoint character ranges (e.g. [0-9a-z]) were either unimplemented or had a different syntax.

So, is there a way to get this (other than [[:alnum:]] => ([a-z]|[A-Z]|[0-9]))? Related question: what about POSIX character classes, and maybe the more useful Unicode classes?

Unbreak submatching

Submatching is horrendously broken, and I can't seem to find a way out with any code that has been committed so far. Gilbert Baumann has kindly provided me a draft of a paper about a technique for creating DFAs from tagged extended regular expressions, which will probably be used to correctly implement submatching.
