telekons / one-more-re-nightmare
A fast regular expression compiler in Common Lisp
Home Page: https://applied-langua.ge/projects/one-more-re-nightmare/
License: BSD 2-Clause "Simplified" License
The compiler is a bit slow. It's faster than compilers for many programming languages, but it's weird for a funny-looking regular expression to take 150ms or so to compile. I've done my part in optimizing DFA generation, so most compile-time is taken by the Common Lisp compiler. On the other hand, the compiler also generates really good code - I wouldn't be surprised if most projects could go without full compilation. While we could sink DFA compilation into compile time when the regular expression is known at compile time, we do provide a function which conceptually accepts a string at runtime, and so it would be nice to do something useful if we need to compile at runtime.
So here is an idea: we do something more like JIT compilation and have a tiered compiler. The current compiler is used for "hot" regexes which are used to scan lots of vectors, and we use a chain of closures for cold code. Some heuristic would be used to make the switch to hot code, such as if we've matched more than some length with a cached chain of closures, and the optimizing compiler could run concurrently in another thread, too.
The chain of closures would take some of the benefits of DFA generation: we know how many registers are needed (though, without a compiler with copy propagation we overestimate substantially; oh well) and we still have linear complexity while scanning. We also could still use type splitting for chains of closures. All in all I wouldn't be too surprised if the cold compiler was still faster than cl-ppcre and other bytecode/closure-based implementations.
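The tiering heuristic described above can be sketched in portable Common Lisp. This is only an illustration under stated assumptions: `*hot-threshold*`, `make-cold-scanner`, `make-hot-scanner` and `make-tiered-scanner` are hypothetical names, the "regular expression" is reduced to counting one character, and promotion happens synchronously rather than in another thread.

```lisp
;; Hypothetical sketch: none of these names are part of one-more-re-nightmare.
(defvar *hot-threshold* 1000
  "Total characters scanned by the cold tier before switching to the hot tier.")

(defun make-cold-scanner (char)
  "Cheap tier: a closure, created without invoking the Lisp compiler."
  (lambda (string) (count char string)))

(defun make-hot-scanner (char)
  "Expensive tier: generate specialised source and run it through COMPILE."
  (compile nil `(lambda (string)
                  (declare (type string string))
                  (count ,char string))))

(defun make-tiered-scanner (char)
  "Return a scanner that starts cold and promotes itself once it has
scanned more than *HOT-THRESHOLD* characters. The second value is a
function that reports the current tier."
  (let ((scanned 0)
        (tier :cold)
        (scanner (make-cold-scanner char)))
    (values
     (lambda (string)
       (when (and (eq tier :cold)
                  (> (incf scanned (length string)) *hot-threshold*))
         (setf scanner (make-hot-scanner char)
               tier :hot))
       (funcall scanner string))
     (lambda () tier))))
```

A scanner created this way stays on the closure tier for short inputs and only pays for `compile` once enough text has gone through it:

```lisp
(multiple-value-bind (scan tier) (make-tiered-scanner #\a)
  (funcall scan "banana")                                  ; still cold
  (funcall scan (make-string 2000 :initial-element #\a))   ; crosses the threshold
  (funcall tier))
```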
I know people can use package-local nicknames by themselves, but what about omrn as a built-in package nickname?
Character ranges currently seem to be exclusive, rather than inclusive of the last character.
> (all-string-matches "[a-b]" "a")
(#("a"))
> (ren:all-string-matches "[a-b]" "b")
nil
I don't think this is in line with POSIX and based on the examples I doubt it's intentional. Can the range be made inclusive instead?
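For reference, the inclusive behaviour being asked for can be written as a one-line predicate. This is a sketch of the expected semantics only; `char-in-range-p` is a hypothetical name and says nothing about how the library represents ranges internally.

```lisp
;; Hypothetical helper, not part of one-more-re-nightmare.
(defun char-in-range-p (char low high)
  "True when CHAR lies in the bracket range [LOW-HIGH], both ends included."
  (char<= low char high))
```

With this definition both `(char-in-range-p #\a #\a #\b)` and `(char-in-range-p #\b #\a #\b)` are true, while `(char-in-range-p #\c #\a #\b)` is false.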
CL-USER> (one-more-re-nightmare:all-matches "[0-1][0-9][0-9]" "192")
NIL
CL-USER> (one-more-re-nightmare:all-matches "[0-9][0-9]" "192")
NIL
CL-USER> (one-more-re-nightmare:all-matches "[0-9][0-9]" "1921")
(#(2 4))
Using latest Quicklisp (20220331) with SBCL 2.2.4.
one-more-re-nightmare splits a regular expression into a prefix string and the rest of the RE. The prefix is scanned for more efficiently using Boyer-Moore-Horspool, before entering the DFA body generated for the rest of the RE. (This technique is also used in cl-ppcre and GNU grep to my knowledge.)
However, it is possible that BMH is not ideal for processing prefixes. Modern processors have vector arithmetic units which can perform many element comparisons at once, and it is not too hard to design a substring search which uses vectorised code. When combined with heuristics based on properties known about the vector to search, vectorised searching can be faster than clever serial searching algorithms.
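As a point of comparison for any vectorised replacement, the serial Boyer-Moore-Horspool search can be sketched as an ordinary function. This is illustrative only: `bmh-search` is a hypothetical name, it works on strings, and it is not the code generator one-more-re-nightmare actually uses.

```lisp
;; Illustrative only: a plain-function Boyer-Moore-Horspool search.
(defun bmh-search (prefix vector &key (start 0) (end (length vector)))
  "Return the index of the first occurrence of the string PREFIX in the
string VECTOR between START and END, or NIL if there is none."
  (let ((m (length prefix))
        (skip (make-hash-table)))
    (when (zerop m)
      (return-from bmh-search start))
    ;; For every character except the last, record how far the window may
    ;; shift when that character is aligned with the end of the window.
    (loop for i from 0 below (1- m)
          do (setf (gethash (char prefix i) skip) (- m 1 i)))
    (loop with position = start
          while (<= (+ position m) end)
          do (when (loop for i from (1- m) downto 0
                         always (char= (char prefix i)
                                       (char vector (+ position i))))
               (return position))
             ;; Characters absent from the table allow a full-length shift.
             (incf position
                   (gethash (char vector (+ position m -1)) skip m))
          finally (return nil))))
```

For example, `(bmh-search "nan" "banana")` examines the window at index 0, shifts by two because the window ends in `n`, and finds the match at index 2.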
It should be possible to allow the client to generate their own prefix scanning code. The interface to the BMH code generator provides most of the required information to generate code, with the lambda list (start length vector prefix aref-generator fail). The first three arguments are symbols naming internal variables o-m-r-n uses, then the prefix sequence, then the :aref-generator argument to compile-regular-expression, then some code to evaluate when there are no more matches.
Great project. Is it intentional that a caret ^ does not match the start of the string and a dollar $ does not match the end? Wikipedia seems to think those are included in POSIX regex.
If we proceed with allowing more client specialization of code generated by one-more-re-nightmare, it would be necessary to introduce a representation for information for the compiler, and separate caches for code compiled with different specializations. Following the "backend" object Petalisp uses, we could introduce an object which stores this information, and classes which can be specialized on to introduce changes to the compiler.
It may also be useful to allow for different policies for holding onto cached code; a LRU cache may be more appropriate if many regular expressions are used, or a concurrent hash table if contention* over the cache table is somehow a problem.
The high-level interface could be reused by providing a dynamic variable, say *compiler*, which would default to a compiler object with the standard optimisations and hash table. The variable could be re-bound with another compiler object, and the provided functions would work with that compiler object.
*Our current cache is not thread-safe. Well, that was stupid not to do.
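A minimal sketch of the proposed compiler object, assuming a CLOS class that owns its own cache. The class `compiler`, the generic function `compile-with` and the function `find-matcher` are hypothetical names chosen for illustration, the "compilation" is a stand-in closure, and the plain hash table here is no more thread-safe than the current cache.

```lisp
;; Hypothetical sketch: these names are assumptions, not the actual
;; one-more-re-nightmare API.
(defclass compiler ()
  ((cache :initform (make-hash-table :test 'equal)
          :reader compiler-cache))
  (:documentation "Holds compiler settings and the code cache that goes with them."))

(defvar *compiler* (make-instance 'compiler)
  "The compiler object used by the high-level interface.")

(defgeneric compile-with (compiler regex)
  (:documentation "Produce a matcher for REGEX; subclasses may specialise this.")
  (:method ((compiler compiler) regex)
    ;; Stand-in for real code generation: a closure doing a literal search.
    (lambda (string) (search regex string))))

(defun find-matcher (regex)
  "Look REGEX up in the cache of the current *COMPILER*, compiling on a miss."
  (let ((cache (compiler-cache *compiler*)))
    (or (gethash regex cache)
        (setf (gethash regex cache)
              (compile-with *compiler* regex)))))
```

Re-binding the variable gives a separate cache, so code compiled with different specialisations never mixes:

```lisp
(let ((*compiler* (make-instance 'compiler)))
  (find-matcher "ab"))   ; cached only in the re-bound compiler object
```

A subclass could swap the `cache` slot for an LRU structure or add methods on `compile-with` without touching the callers.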
Building with SBCL 2.2.3.156-e971aa48f / ASDF 3.3.5 for quicklisp dist creation.
Trying to build commit id 14c62db
one-more-re-nightmare-simd fails to build with the following error:
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-AND) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-OR) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-NOT) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V32>) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V8>) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V32=) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V8=) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V8-) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-BROADCAST32) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-BROADCAST8) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-MOVEMASK32) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-MOVEMASK8) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-LOAD32) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
; caught ERROR:
; (during macroexpansion of (DEFINE-VOP (ONE-MORE-RE-NIGHTMARE.VECTOR-PRIMOPS:V-LOAD8) ...))
; SIMD-PACK-256-INT is not a defined primitive type.
...
Unhandled UIOP/LISP-BUILD:COMPILE-FILE-ERROR in thread #<SB-THREAD:THREAD "main thread" RUNNING {1001C50003}>: COMPILE-FILE-ERROR while compiling #<CL-SOURCE-FILE "one-more-re-nightmare-simd" "sbcl-x86-64">
Consider the regular expression "(¬")*", which will match some text enclosed in double quotes.
There are two ways to run the "tight loop" currently:
" character.
The latter is faster than the former for long texts, but not for shorter texts. (I suspect less than one SIMD vector wide? idk) So to achieve best performance, we should choose the more appropriate tight loop implementation. There is no indication of what lengths are expected in the regular expression, and we have no way to annotate such things, so we must rely on runtime feedback (which is probably for the better, in terms of performance/effort on behalf of the user).
According to Cliff Click (from a conversation in the coffee compiler club) it should suffice to maintain a count of how many times the tight loop is iterated. Upon entering the loop, we compare the last count to some crossover point. If we have a higher count, we take the SIMD route, else we use scalar code.
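The counting scheme can be sketched as a small wrapper around two loop implementations. Everything here is an assumption made for illustration: the name `make-adaptive-loop`, the `*crossover*` value of 32, and the convention that each loop is a function of a string and a start index returning the index where it stopped.

```lisp
;; Hypothetical sketch; the crossover value and all names are assumptions.
(defvar *crossover* 32
  "Iteration count above which the SIMD loop is assumed to win.")

(defun make-adaptive-loop (scalar-loop simd-loop)
  "SCALAR-LOOP and SIMD-LOOP are functions of (string start) that return the
index where the tight loop stopped. The returned function remembers how many
elements the previous run consumed and uses that to pick the next variant."
  (let ((last-count 0))
    (lambda (string start)
      (let ((end (if (> last-count *crossover*)
                     (funcall simd-loop string start)
                     (funcall scalar-loop string start))))
        (setf last-count (- end start))
        end))))
```

With stand-in loops that just record which variant ran, a long scan followed by two short ones takes the scalar, SIMD, then scalar route:

```lisp
(let* ((taken '())
       (scan (lambda (name)
               (lambda (string start)
                 (push name taken)
                 (or (position #\" string :start start) (length string)))))
       (adaptive (make-adaptive-loop (funcall scan :scalar)
                                     (funcall scan :simd))))
  (funcall adaptive (make-string 100 :initial-element #\x) 0)
  (funcall adaptive "short\"" 0)
  (funcall adaptive "again\"" 0)
  (reverse taken))
```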
There is also a case for unrolling the scalar code according to Gilbert Baumann, but I haven't tried that yet. For very long loop counts, unrolling the SIMD code may also produce some benefit, but we may face more code bloat; I still intend to compile everything eagerly, because it is simpler, and the duplicate code is not too large anyway.
Title says it all, having equivalents would be nice.
I would be interested in accessing this through quicklisp. It may be there already but I didn't see it. Thanks.
The compiler macro should complain if a RE or some submatch will never match, at the least. Make the user think we're being productive while we're compiling the same code four times.
It would be great if there were CL-PPCRE compatible functions, at least a few of them -- then it might be possible to just load OMRN instead and provide CL-PPCRE as a local nickname.
E.g. ALL-MATCHES has a different result format; perhaps that could still be changed? Not sure whether a transforming function (to get compatibility) would be fast enough ;)
Thanks!
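A transforming function of the kind mentioned above could look roughly like this. It assumes that each match is a vector whose first two elements are the start and end of the whole match, as the `(#(2 4))` output earlier on this page suggests, and that the target is the flat start/end list that CL-PPCRE's ALL-MATCHES returns; `flatten-matches` is a hypothetical name and neither library needs to be loaded to try it.

```lisp
;; Hypothetical adapter, not part of either library.
(defun flatten-matches (matches)
  "Convert a list of match vectors, each beginning with a start and an end
index, into a flat list (start1 end1 start2 end2 ...)."
  (loop for match in matches
        collect (aref match 0)
        collect (aref match 1)))
```

The cost is one pass over the match list and two conses per match, so whether it is fast enough depends mostly on how many matches there are.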
Hello, and first, thanks a lot for trying to make CL regexps faster (I come from mariomka/regex-benchmark#43).
At first I wanted to emulate \w by using [[:alnum:]_<other ranges>] but I realized that simple disjoint character ranges (e.g. [0-9a-z]) were either unimplemented or had a different syntax.
So, is there a way to get this (other than [[:alnum:]] => ([a-z]|[A-Z]|[0-9]))? Related question, but what about POSIX character classes and maybe more useful Unicode classes?
Submatching is horrendously broken, and I can't seem to find a way out with any code that has been committed so far. Gilbert Baumann has kindly provided me a draft of a paper about a technique for creating DFAs from tagged extended regular expressions, which will probably be used to correctly implement submatching.