
pcre2project / pcre2


PCRE2 development is now based here.

License: Other

Perl 0.25% CMake 1.19% Makefile 0.59% Shell 2.24% Batchfile 1.00% M4 1.22% Python 1.27% C 91.34% Starlark 0.04% Zig 0.06% DIGITAL Command Language 0.80%

pcre2's People

Contributors

addisoncrump, adsr, akien-mga, alejandro-colomar, aminya, andreygorbachev, ayesh, carenas, cbouc, jetxujing, jrtc27, larinsv, lrzlin, ltrzesniewski, mango0x45, mizzrym1, pagabuc, philiphazel, pkeir, pkuzco, player-two, rosscomputerguy, spaceim, star-tek-mb, teo-tsirpanis, theeragon, tobil4sk, wrowe, yselkowitz, zherczeg


pcre2's Issues

Caseless ASCII matching

This is #2625 in the old Bugzilla, submitted by Rich Siegel.

RS: Given this text represented as UTF-16:

this is a test

Search using this pattern, which as written should not match any ASCII characters:

[\x{00FF}-\x{FFEE}]

If the pattern was compiled with PCRE2_CASELESS turned on, pcre2_match() will return a match at the first "s" in the subject text, even though that is outside the explicit range of characters. (And the uppercase version "S" would be, as well.)

Further testing shows that "k" and "K" are matching as well, presumably with the same underlying cause.

ZH: This is expected. Please check:
http://www.unicode.org/Public/12.1.0/ucd/CaseFolding.txt

Examples:

017F; C; 0073; # LATIN SMALL LETTER LONG S
212A; C; 006B; # KELVIN SIGN

In other words, a caseless 0x212A code point matches K.
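To make the effect concrete, here is a minimal sketch in C of a caseless range check. It is not how PCRE2 implements caseless classes; the table holds only the two CaseFolding.txt entries quoted above, and the names fold_entry, simple_fold and caseless_in_range are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Only the two entries quoted in the discussion; a real engine
   would use the complete Unicode case-folding data. */
typedef struct { uint32_t from; uint32_t to; } fold_entry;

static const fold_entry folds[] = {
    { 0x017F, 0x0073 },  /* LATIN SMALL LETTER LONG S -> s */
    { 0x212A, 0x006B },  /* KELVIN SIGN -> k */
};
static const size_t nfolds = sizeof folds / sizeof folds[0];

static int in_range(uint32_t cp, uint32_t lo, uint32_t hi) {
    return cp >= lo && cp <= hi;
}

/* Simple fold: ASCII A-Z -> a-z, plus the tiny table above. */
static uint32_t simple_fold(uint32_t cp) {
    if (cp >= 'A' && cp <= 'Z') return cp + ('a' - 'A');
    for (size_t i = 0; i < nfolds; i++)
        if (folds[i].from == cp) return folds[i].to;
    return cp;
}

/* Returns 1 if cp, or any code point known here to share its
   folded form, lies in [lo, hi]; otherwise 0. */
int caseless_in_range(uint32_t cp, uint32_t lo, uint32_t hi) {
    assert(lo <= hi);
    uint32_t f = simple_fold(cp);
    if (in_range(cp, lo, hi) || in_range(f, lo, hi)) return 1;
    /* ASCII upper-case partner of the folded form */
    if (f >= 'a' && f <= 'z' && in_range(f - ('a' - 'A'), lo, hi)) return 1;
    /* non-ASCII code points that fold to the same value */
    for (size_t i = 0; i < nfolds; i++)
        if (folds[i].to == f && in_range(folds[i].from, lo, hi)) return 1;
    return 0;
}
```

With the range 0x00FF to 0xFFEE, this check accepts "s", "S", "k" and "K" (through U+017F and U+212A) and rejects "t", which mirrors the behaviour reported above.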

RS: Thank you! I'm a little ashamed to admit that Unicode case folding didn't even occur to me. This may pose more of a UI and/or documentation challenge, since I'm quite certain that it won't occur to end users either. :-)

Does it make sense to consider a flag in the PCRE2_EXTRA* space which would limit case folding to the ASCII range when PCRE2_CASELESS is specified? (I'm not yet advocating for it; I can see some clear limitations and disadvantages, and trying to express all of the possible variations could rapidly turn into a snake pit.)

PH: I think the best way of doing this would be to add PCRE2_ASCII_CASELESS to the main options, because having two separated flags seems very untidy. There are only two bits left in the main options, so I am slightly reluctant, but on the other hand leaving them unused just in case something more important comes along could last for ever. Zoltan, what do you think? Implementing this would need changes to JIT as well as the interpreters.

On further reflection, I've changed my mind and think that a PCRE2_EXTRA option would be better, as you suggested, partly because there may be a number of variations needed. And indeed, some additions to the documentation.
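As a rough illustration of what "limit case folding to the ASCII range" could mean, the sketch below compares two code points while folding only A-Z. It is an assumption about the intended semantics, not an existing PCRE2 function; ascii_fold and ascii_caseless_equal are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Fold only ASCII upper-case letters; everything else is unchanged. */
static uint32_t ascii_fold(uint32_t cp) {
    return (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
}

/* Caseless equality restricted to ASCII folding. */
int ascii_caseless_equal(uint32_t a, uint32_t b) {
    assert(a <= 0x10FFFF && b <= 0x10FFFF);  /* valid code points only */
    return ascii_fold(a) == ascii_fold(b);
}
```

Under this rule "K" still equals "k", but U+212A (KELVIN SIGN) and U+017F (LONG S) no longer match any ASCII letter.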

ZH:
I feel this is a non-trivial change, and it can easily be done at the pattern level.

The issue here is caseless, and you can temporarily disable it:

(?-i:[\x{00FF}-\x{FFEE}])

Or use an assertion that separates ASCII from the rest.

crosscompiling shared library on macos

I am trying to cross-compile the shared library on macOS. My system uses an Intel processor and I want to compile for ARM.

When I pass the --host argument to configure, shared library creation is disabled:

$ ./configure --enable-jit --enable-pcre2-32 --disable-pcre2-8 --prefix=/Users/sbarex/Downloads/pcre2-32 --host arm64-apple-macos11 --enable-shared --disable-static

checking for a BSD-compatible install... /usr/local/bin/ginstall -c
checking whether build environment is sane... yes
checking for arm64-apple-macos11-strip... no
checking for strip... strip
checking for a race-free mkdir -p... /usr/local/bin/gmkdir -p
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for arm64-apple-macos11-gcc... no
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether the compiler supports GNU C... yes
checking whether gcc accepts -g... yes
checking for gcc option to enable C11 features... none needed
checking whether gcc understands -c and -o together... yes
checking whether make supports the include directive... yes (GNU style)
checking dependency style of gcc... gcc3
checking for stdio.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for strings.h... yes
checking for sys/stat.h... yes
checking for sys/types.h... yes
checking for unistd.h... yes
checking for wchar.h... yes
checking for minix/config.h... no
checking whether it is safe to define __EXTENSIONS__... yes
checking whether _XOPEN_SOURCE should be defined... no
checking for arm64-apple-macos11-ar... no
checking for arm64-apple-macos11-lib... no
checking for arm64-apple-macos11-link... no
checking for ar... ar
checking the archiver (ar) interface... ar
checking for int64_t... yes
checking build system type... x86_64-apple-darwin21.2.0
checking host system type... aarch64-apple-macos11
checking how to print strings... printf
checking for a sed that does not truncate output... /usr/local/bin/gsed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc... /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld
checking if the linker (/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld) is GNU ld... no
checking for BSD- or MS-compatible name lister (nm)... no
checking for arm64-apple-macos11-dumpbin... no
checking for arm64-apple-macos11-link... no
checking for dumpbin... no
checking for link... link -dump
checking the name lister (nm) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 786432
checking how to convert x86_64-apple-darwin21.2.0 file names to aarch64-apple-macos11 format... func_convert_file_noop
checking how to convert x86_64-apple-darwin21.2.0 file names to toolchain format... func_convert_file_noop
checking for /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld option to reload object files... -r
checking for arm64-apple-macos11-objdump... no
checking for objdump... objdump
checking how to recognize dependent libraries... unknown
checking for arm64-apple-macos11-dlltool... no
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for arm64-apple-macos11-ar... ar
checking for archiver @FILE support... no
checking for arm64-apple-macos11-strip... strip
checking for arm64-apple-macos11-ranlib... no
checking for ranlib... ranlib
checking command to parse nm output from gcc object... ok
checking for sysroot... no
checking for a working dd... /bin/dd
checking how to truncate binary pipes... /bin/dd bs=4096 count=1
checking for arm64-apple-macos11-mt... no
checking for mt... no
checking if : is a manifest tool... no
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... yes
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... no
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld) supports shared libraries... no
checking dynamic linker characteristics... no
checking how to hardcode library paths into programs... unsupported
checking whether stripping libraries is possible... no
checking if libtool supports shared libraries... no
checking whether to build shared libraries... no
checking whether to build static libraries... yes
checking whether ln -s works... yes
checking whether the -Werror option is usable... yes
checking for simple visibility declarations... yes
checking for __attribute__((uninitialized))... yes
checking for limits.h... yes
checking for sys/types.h... (cached) yes
checking for sys/stat.h... (cached) yes
checking for dirent.h... yes
checking for windows.h... no
checking for sys/wait.h... yes
checking for an ANSI C-conforming const... yes
checking for size_t... yes
checking for bcopy... yes
checking for memfd_create... no
checking for memmove... yes
checking for mkostemp... yes
checking for realpath... yes
checking for secure_getenv... no
checking for strerror... yes
checking for zlib.h... yes
checking for gzopen in -lz... yes
checking for bzlib.h... yes
checking for libbz2... yes
checking for the pthreads library -lpthreads... no
checking whether pthreads work without any flags... yes
checking for joinable pthread attribute... PTHREAD_CREATE_JOINABLE
checking if more special flags are required for pthreads... no
checking for PTHREAD_PRIO_INHERIT... yes
checking whether Intel CET is enabled... no
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating libpcre2-8.pc
config.status: creating libpcre2-16.pc
config.status: creating libpcre2-32.pc
config.status: creating libpcre2-posix.pc
config.status: creating pcre2-config
config.status: creating src/pcre2.h
config.status: creating src/config.h
config.status: executing depfiles commands
config.status: executing libtool commands
config.status: executing script-chmod commands
config.status: executing delete-old-chartables commands

pcre2-10.39 configuration summary:

    Install prefix ..................... : /Users/sbarex/Downloads/pcre2-32
    C preprocessor ..................... : 
    C compiler ......................... : gcc
    Linker ............................. : /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld
    C preprocessor flags ............... : 
    C compiler flags ................... :  -O2 -fvisibility=hidden
    Linker flags ....................... : 
    Extra libraries .................... :  

    Build 8-bit pcre2 library .......... : no
    Build 16-bit pcre2 library ......... : no
    Build 32-bit pcre2 library ......... : yes
    Include debugging code ............. : no
    Enable JIT compiling support ....... : yes
    Use SELinux allocator in JIT ....... : unsupported
    Enable Unicode support ............. : yes
    Newline char/sequence .............. : lf
    \R matches only ANYCRLF ............ : no
    \C is disabled ..................... : no
    EBCDIC coding ...................... : no
    EBCDIC code for NL ................. : n/a
    Rebuild char tables ................ : no
    Internal link size ................. : 2
    Nested parentheses limit ........... : 250
    Heap limit ......................... : 20000000 kibibytes
    Match limit ........................ : 10000000
    Match depth limit .................. : MATCH_LIMIT
    Build shared libs .................. : no
    Build static libs .................. : yes
    Use JIT in pcre2grep ............... : yes
    Enable callouts in pcre2grep ....... : yes
    Enable fork in pcre2grep callouts .. : yes
    Initial buffer size for pcre2grep .. : 20480
    Maximum buffer size for pcre2grep .. : 1048576
    Link pcre2grep with libz ........... : no
    Link pcre2grep with libbz2 ......... : no
    Link pcre2test with libedit ........ : no
    Link pcre2test with libreadline .... : no
    Valgrind support ................... : no
    Code coverage ...................... : no
    Fuzzer support ..................... : no
    Use %zu and %td .................... : auto

Also, if I try to configure a host with the same CPU architecture (like x86_64-apple-macos10.15), the shared library cannot be created.

If I remove the --host argument, the shared library is built (for the current arch):

$ ./configure --enable-jit --enable-pcre2-32 --disable-pcre2-8 --prefix=/Users/sbarex/Downloads/pcre2-32 -enable-shared --disable-static

checking for a BSD-compatible install... /usr/local/bin/ginstall -c
checking whether build environment is sane... yes
checking for a race-free mkdir -p... /usr/local/bin/gmkdir -p
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether the compiler supports GNU C... yes
checking whether gcc accepts -g... yes
checking for gcc option to enable C11 features... none needed
checking whether gcc understands -c and -o together... yes
checking whether make supports the include directive... yes (GNU style)
checking dependency style of gcc... gcc3
checking for stdio.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for strings.h... yes
checking for sys/stat.h... yes
checking for sys/types.h... yes
checking for unistd.h... yes
checking for wchar.h... yes
checking for minix/config.h... no
checking whether it is safe to define __EXTENSIONS__... yes
checking whether _XOPEN_SOURCE should be defined... no
checking for ar... ar
checking the archiver (ar) interface... ar
checking for int64_t... yes
checking build system type... x86_64-apple-darwin21.2.0
checking host system type... x86_64-apple-darwin21.2.0
checking how to print strings... printf
checking for a sed that does not truncate output... /usr/local/bin/gsed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc... /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld
checking if the linker (/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld) is GNU ld... no
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 786432
checking how to convert x86_64-apple-darwin21.2.0 file names to x86_64-apple-darwin21.2.0 format... func_convert_file_noop
checking how to convert x86_64-apple-darwin21.2.0 file names to toolchain format... func_convert_file_noop
checking for /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for archiver @FILE support... no
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for a working dd... /bin/dd
checking how to truncate binary pipes... /bin/dd bs=4096 count=1
checking for mt... no
checking if : is a manifest tool... no
checking for dsymutil... dsymutil
checking for nmedit... nmedit
checking for lipo... lipo
checking for otool... otool
checking for otool64... no
checking for -single_module linker flag... yes
checking for -exported_symbols_list linker flag... yes
checking for -force_load linker flag... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... yes
checking for gcc option to produce PIC... -fno-common -DPIC
checking if gcc PIC flag -fno-common -DPIC works... yes
checking if gcc static flag -static works... no
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld) supports shared libraries... yes
checking dynamic linker characteristics... darwin21.2.0 dyld
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... no
checking whether ln -s works... yes
checking whether the -Werror option is usable... yes
checking for simple visibility declarations... yes
checking for __attribute__((uninitialized))... yes
checking for limits.h... yes
checking for sys/types.h... (cached) yes
checking for sys/stat.h... (cached) yes
checking for dirent.h... yes
checking for windows.h... no
checking for sys/wait.h... yes
checking for an ANSI C-conforming const... yes
checking for size_t... yes
checking for bcopy... yes
checking for memfd_create... no
checking for memmove... yes
checking for mkostemp... yes
checking for realpath... yes
checking for secure_getenv... no
checking for strerror... yes
checking for zlib.h... yes
checking for gzopen in -lz... yes
checking for bzlib.h... yes
checking for libbz2... yes
checking whether pthreads work with -pthread... yes
checking for joinable pthread attribute... PTHREAD_CREATE_JOINABLE
checking if more special flags are required for pthreads... -D_THREAD_SAFE
checking for PTHREAD_PRIO_INHERIT... yes
checking whether Intel CET is enabled... no
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating libpcre2-8.pc
config.status: creating libpcre2-16.pc
config.status: creating libpcre2-32.pc
config.status: creating libpcre2-posix.pc
config.status: creating pcre2-config
config.status: creating src/pcre2.h
config.status: creating src/config.h
config.status: src/config.h is unchanged
config.status: executing depfiles commands
config.status: executing libtool commands
config.status: executing script-chmod commands
config.status: executing delete-old-chartables commands

pcre2-10.39 configuration summary:

    Install prefix ..................... : /Users/sbarex/Downloads/pcre2-32
    C preprocessor ..................... : 
    C compiler ......................... : gcc
    Linker ............................. : /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld
    C preprocessor flags ............... : 
    C compiler flags ................... : -D_THREAD_SAFE -pthread -O2 -fvisibility=hidden
    Linker flags ....................... : 
    Extra libraries .................... :  

    Build 8-bit pcre2 library .......... : no
    Build 16-bit pcre2 library ......... : no
    Build 32-bit pcre2 library ......... : yes
    Include debugging code ............. : no
    Enable JIT compiling support ....... : yes
    Use SELinux allocator in JIT ....... : unsupported
    Enable Unicode support ............. : yes
    Newline char/sequence .............. : lf
    \R matches only ANYCRLF ............ : no
    \C is disabled ..................... : no
    EBCDIC coding ...................... : no
    EBCDIC code for NL ................. : n/a
    Rebuild char tables ................ : no
    Internal link size ................. : 2
    Nested parentheses limit ........... : 250
    Heap limit ......................... : 20000000 kibibytes
    Match limit ........................ : 10000000
    Match depth limit .................. : MATCH_LIMIT
    Build shared libs .................. : yes
    Build static libs .................. : no
    Use JIT in pcre2grep ............... : yes
    Enable callouts in pcre2grep ....... : yes
    Enable fork in pcre2grep callouts .. : yes
    Initial buffer size for pcre2grep .. : 20480
    Maximum buffer size for pcre2grep .. : 1048576
    Link pcre2grep with libz ........... : no
    Link pcre2grep with libbz2 ......... : no
    Link pcre2test with libedit ........ : no
    Link pcre2test with libreadline .... : no
    Valgrind support ................... : no
    Code coverage ...................... : no
    Fuzzer support ..................... : no
    Use %zu and %td .................... : auto

Support the `/n` pattern modifier

From pcre2test and the docs, I don't see a way to name the whole string match, i.e. begin the pattern with a ?<name> variant, which would be useful. I realise it's possible to wrap the entire expression in a named subpattern, but that creates an unnecessary extra match group.

It seems like it would be a backward-compatible change as beginning a pattern with ? currently produces an error like the following, so there can't be any valid patterns that would begin with those sequences.

Failed: error 109 at offset 0: quantifier does not follow a repeatable item

Missing public key after ftp.pcre.org retirement.

Please could you republish the public key for PCRE? Without it, it is impossible to verify PCRE releases and guard against supply chain vulnerabilities.
It was previously on ftp.pcre.org at https://ftp.pcre.org/pub/pcre/Public-Key, and I see other failures in the wild. http://exim.mirror.iphh.net/ftp/pcre/Public-Key is not a valid key; the one we need is 45F68D54BBE23FB3039B46E59766E084FB0F43D8
eg I imported that key, and it

C02DC0SHMD6W:web-agents alex.levin$ gpg --import ~/Downloads/Public-Key
gpg: key 9766E084FB0F43D8: public key "Philip Hazel [email protected]" imported
gpg: Total number processed: 1
gpg: imported: 1
C02DC0SHMD6W:web-agents alex.levin$ gpg --verify libs/pcre2-10.39.tar.gz.sig ph.gpg
gpg: Signature made Fri 29 Oct 17:07:03 2021 BST
gpg: using RSA key 45F68D54BBE23FB3039B46E59766E084FB0F43D8
gpg: BAD signature from "Philip Hazel [email protected]" [unknown]

Case-insensitive search gets exponentially bad with long buffers

This is bug 2793 from the old Bugzilla, posted by Thomas Tempelmann, who, after some discussion, provided a proposed patch. Here is some relevant discussion and the patch:

Here's what probably happens:

  1. Fast Scan for "E". Takes long because it's down 5 MB
  2. Now it goes back to the start, finds "e" and checks the rest of the search string.
  3. No match. So, it moves forward.

At this point in the loop, instead of going on with step 2, it goes back to step 1, where it again searches ahead for 5 MB until it runs into the first "E".
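The cost of that loop can be estimated with a small model. The sketch below is not PCRE2 code; rescan_cost is a hypothetical helper that counts how many bytes are examined if the scan for one character is restarted from every start offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* For each of the first `attempts` start offsets, search for `target`
   again from that offset and add up the bytes examined. */
size_t rescan_cost(const char *subject, size_t len, int target,
                   size_t attempts)
{
    assert(attempts <= len);
    size_t examined = 0;
    for (size_t i = 0; i < attempts; i++) {
        const char *p = memchr(subject + i, target, len - i);
        /* bytes looked at: up to and including the hit, or to the end */
        examined += (p != NULL) ? (size_t)(p - (subject + i)) + 1
                                : len - i;
    }
    return examined;
}
```

For the five-byte subject "eeeeE" and four start offsets this examines 5 + 4 + 3 + 2 = 14 bytes. The total grows roughly with the square of the distance to the first "E", which is why a 5 MB gap becomes so expensive.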

I can think of several remedies:

Change the fast scan to include searching all possible options. In my example, it has to scan for both "e" and "E". I assume this benefits from specialized CPU instructions that can scan for a byte (because, if not, you'd simply do a loop where you get one byte and check it against both "e" and "E")? So, what you'd need to do is to use that search operation in small ranges, e.g. over 1000 bytes, looking for "e", and then the same 1000 bytes looking for "E". If neither hits, move forward. But if one hits, the nearer one is processed (and the farther one's position can be cached so that you won't need to search for it again until you've moved there).

But currently, instead, I suspect it's searching for "E" and eventually gives up when it reaches the cut-off point you mentioned. But then repeats the same long search again and again.

So, the next possible optimization, which may be much easier to implement than the first suggestion, is to simply cache the point at which the "E" was found or not, and then not repeat looking for "E" before that point.

Actually, can you tell me where this happens (if you don't have time to look now, can you give me some pointers where to look)? I'd like to try the caching myself; it shouldn't be too hard, I hope.

Through the macro option to suppress the fast scan, I located the relevant code areas.
Around line 6800 in pcre2_match.c the comment explains that, in caseless mode, it does indeed consider looking for both cases.

Alright, it's as I suspected:

First off, the code already does what I suggested to do: Scan for both "e" and "E" and then process the nearer one.

The "bug" is that the found locations are not cached, so the next time both chars are searched again from the current position, even if it has already been determined that there's no such char for a while.

Adding some caching for both found locations should fix this.

Change

BOOL memchr_not_found_first_cu;
BOOL memchr_not_found_first_cu2;

into

PCRE2_SPTR memchr_found_first_cu;
PCRE2_SPTR memchr_found_first_cu2;

Change

memchr_not_found_first_cu = FALSE;
memchr_not_found_first_cu2 = FALSE;

into

memchr_found_first_cu = NULL;
memchr_found_first_cu2 = NULL;

Change

      if (!memchr_not_found_first_cu)
        {
        pp1 = memchr(start_match, first_cu, end_subject - start_match);
        if (pp1 == NULL) memchr_not_found_first_cu = TRUE;
          else cu2size = pp1 - start_match;
        }

      /* If pp1 is not NULL, we have arranged to search only as far as pp1,
      to see if the other case is earlier, so we can set "not found" only
      when both searches have returned NULL. */

      if (!memchr_not_found_first_cu2)
        {
        pp2 = memchr(start_match, first_cu2, cu2size);
        memchr_not_found_first_cu2 = (pp2 == NULL && pp1 == NULL);
        }

into

      if (start_match <= memchr_found_first_cu) {
        pp1 = memchr_found_first_cu;
        if (pp1 == end_subject) {
          pp1 = NULL;
        }
      } else {
        pp1 = memchr(start_match, first_cu, cu2size);
        if (pp1 == NULL) {
          memchr_found_first_cu = end_subject;
        } else {
          memchr_found_first_cu = pp1;
        }
      }

      /* If pp1 is not NULL, we have arranged to search only as far as pp1,
      to see if the other case is earlier, so we can set "not found" only
      when both searches have returned NULL. */

      if (start_match <= memchr_found_first_cu2) {
        pp2 = memchr_found_first_cu2;
        if (pp2 == end_subject) {
          pp2 = NULL;
        }
      } else {
        pp2 = memchr(start_match, first_cu2, cu2size);
        if (pp2 == NULL) {
          memchr_found_first_cu2 = end_subject;
        } else {
          memchr_found_first_cu2 = pp2;
        }
      }

This means two changes to the algorithm:

  1. Instead of using a flag to tell whether memchr() found something, it now stores the last found position and re-uses it as long as the start_match pointer has not yet passed it.

  2. The size parameter for the second memchr() is not getting reduced any more (formerly, if the first found a location, the second one would be limited to search until there). Since we now cache each location, there's no need to shorten the searches any more.
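The two changes can be sketched as a standalone function. This is an illustration of the caching idea only, not the code in pcre2_match.c; first_cu_cache, cached_find and find_first_either are hypothetical names, and the memchr_calls counter exists purely to make the saving visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One remembered position per case variant. NULL means "not searched
   yet"; the end pointer means "searched, nothing found". */
typedef struct {
    const char *found1;
    const char *found2;
    unsigned long memchr_calls;  /* instrumentation only */
} first_cu_cache;

static const char *cached_find(const char *start, const char *end, int c,
                               const char **slot, unsigned long *calls)
{
    /* Reuse the cached answer while start has not moved past it. */
    if (*slot != NULL && start <= *slot)
        return (*slot == end) ? NULL : *slot;
    const char *p = memchr(start, c, (size_t)(end - start));
    (*calls)++;
    *slot = (p != NULL) ? p : end;
    return p;
}

/* Returns the nearest occurrence of c1 or c2 in [start, end), or NULL. */
const char *find_first_either(const char *start, const char *end,
                              int c1, int c2, first_cu_cache *cache)
{
    assert(start <= end);
    const char *p1 = cached_find(start, end, c1, &cache->found1,
                                 &cache->memchr_calls);
    const char *p2 = cached_find(start, end, c2, &cache->found2,
                                 &cache->memchr_calls);
    if (p1 == NULL) return p2;
    if (p2 == NULL) return p1;
    return (p1 < p2) ? p1 : p2;
}
```

With a zero-initialised cache and the subject "eeeeE", the first call performs two memchr() searches; a second call starting one byte later performs only one, because the cached position of "E" is still ahead of the start pointer.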

pcre2 causes crash with alignment issue (Bus Error) in exim on SPARC

The exim mailer daemon switched to pcre2 in version 4.95. Ever since, the exim daemon crashes with a Bus Error on SPARC on Linux:

glaubitz@gcc202:~/exim$ ./src/build-Linux-sparc64/exim
Bus error
glaubitz@gcc202:~/exim$

Bisecting the issue led to the change which switched exim from pcre to pcre2 (Exim/exim@22ed7a5).

Running the exim binary in gdb led to the following backtrace:

(gdb) bt
#0  pcre2_general_context_create_8 (private_malloc=0x1000004f680 <function_store_malloc>, 
    private_free=0x1000004f650 <function_store_free>, memory_data=0x0) at src/pcre2_context.c:123
#1  0x00000100000517a8 in main ()
(gdb)

Access to a SPARC machine running Linux or Solaris can be obtained through the GCC Compile Farm, see: https://gcc.gnu.org/wiki/CompileFarm
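The report does not show the root cause. As general background, SPARC traps on misaligned loads and stores, so memory handed out by a custom allocator must be suitably aligned for the structure stored in it. The sketch below shows the kind of check and rounding involved, assuming a power-of-two alignment; is_aligned and align_up are hypothetical helpers, not part of exim or PCRE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is a multiple of `alignment` (a power of two). */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Rounds n up to the next multiple of `alignment` (a power of two). */
size_t align_up(size_t n, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}
```

A debugging build could call is_aligned() on the pointer returned by the private malloc function before the library writes through it, to confirm or rule out this kind of problem.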

yocto dunfell warning

Yocto dunfell, an LTS version of Yocto, is complaining about a missing release:

WARNING: libpcre2-native-10.34-r0 do_fetch: Failed to fetch URL https://github.com/PhilipHazel/pcre2/releases/download/pcre2-10.34/pcre2-10.34.tar.bz2, attempting MIRRORS if available

I noticed that this release has been removed from GitHub.

RunTests.bat on Windows fails test 8 for me

I don't know what the cause of this might be. I have a third-party port of pcre2 to the Meson build system (for embedding into other projects using Meson), and as part of the port we try to run the test suite.

For reference, here is the meson build description: https://github.com/mesonbuild/wrapdb/blob/pcre2/subprojects/packagefiles/pcre2/meson.build

In our GitHub CI, everything works fine on Ubuntu. On Windows using MSVC, I get this error instead:

https://github.com/mesonbuild/wrapdb/runs/4637121959?check_suite_focus=true#step:6:261

Test 8: "Internal offsets and code size tests"

          failed comparison: fc /n D:\a\wrapdb\wrapdb\subprojects\pcre2-10.39\testdata\testoutput8-8-2 testout8\testoutput8-8-2

Do you have any idea what the problem might be? Suggestions for figuring out the problem?

Is this an expected failure? There isn't currently any publicly visible CI for pcre2 (and an open PR only adds some for Linux), so it is difficult to know for sure...

Test issue

This is a first, test issue to initialize the issue tracker for the PCRE2 repo.

Install of PDB files does not work (with a possible fix?)

For me the installation of debugger pdb files in MSVC does not work. I do not know if this is caused by some build flags that I'm using or if this is a general issue(?)

I looked a bit at CMakeLists.txt and was under the impression that the library names in the PDB install command are incorrect. Here is a patch that I found makes the install work for me. If this works and is sensible, I would be more than happy if this could be included (no copyright attached):

 IF(MSVC AND INSTALL_MSVC_PDB)
-    INSTALL(FILES ${PROJECT_BINARY_DIR}/pcre2.pdb
-                  ${PROJECT_BINARY_DIR}/pcre2posix.pdb
+    INSTALL(FILES ${PROJECT_BINARY_DIR}/pcre2-8.pdb
+                  ${PROJECT_BINARY_DIR}/pcre2-16.pdb
+                  ${PROJECT_BINARY_DIR}/pcre2-32.pdb
+                  ${PROJECT_BINARY_DIR}/pcre2-posix.pdb
             DESTINATION bin
             CONFIGURATIONS RelWithDebInfo)
-    INSTALL(FILES ${PROJECT_BINARY_DIR}/pcre2d.pdb
-                  ${PROJECT_BINARY_DIR}/pcre2posixd.pdb
+    INSTALL(FILES ${PROJECT_BINARY_DIR}/pcre2-8d.pdb
+                  ${PROJECT_BINARY_DIR}/pcre2-16d.pdb
+                  ${PROJECT_BINARY_DIR}/pcre2-32d.pdb
+                  ${PROJECT_BINARY_DIR}/pcre2-posixd.pdb
             DESTINATION bin
             CONFIGURATIONS Debug)
 ENDIF(MSVC AND INSTALL_MSVC_PDB)

legacy PCRE: segfault in pcre_exec (pcre_study'ed without PCRE_STUDY_JIT_COMPILE)

Disclaimer: I know this affects only legacy PCRE, but the bug is serious, PCRE is still used in many applications (e.g. in exim or nginx), a PCRE bug cannot be reported in the exim bug tracker, and even pcre.org suggests reporting bugs here...

A non-JIT pcre_exec can cause a segmentation fault or stack overflow when called with a greedy RE containing a group of two character tokens, repeated with one of the common quantifiers, that matches a large block of input (ca. 12K for x86 and ca. 34K for x64).
It seems to enter endless recursion, so it can cause a stack overflow or a segmentation fault, depending on conditions.

Simplest PoC using pcregrep:

For a 32-bit build of the engine it is enough to supply a 12k buffer that matches the RE:

$ printf '%10000s' | pcregrep -c --no-jit '(?:\s\s)+'
1
$ printf '%12000s' | pcregrep -c --no-jit '(?:\s\s)+'
Segmentation fault

$ printf '%10000s' | tr ' ' x | pcregrep -c --no-jit '(?:\w\w)+'
1
$ printf '%12000s' | tr ' ' x | pcregrep -c --no-jit '(?:\w\w)+'
Segmentation fault

For the 64-bit version one needs to supply a 34k buffer that matches:

$ printf '%30000s' | pcregrep -c --no-jit '(?:\s\s)+'
1
$ printf '%34000s' | pcregrep -c --no-jit '(?:\s\s)+'
Segmentation fault

$ printf '%30000s' | tr ' ' x | pcregrep -c --no-jit '(?:\w\w)+'
1
$ printf '%34000s' | tr ' ' x | pcregrep -c --no-jit '(?:\w\w)+'
Segmentation fault

To trigger this, the RE must contain two tokens in a repeatable group, such as \s. or .\w; it does not segfault with 1, 3 or 4 tokens (in that simplest form). The quantifier can be *, but the group must still match a long input. For example, this also segfaults:

printf '%34000s' x | pcregrep -c --no-jit '(?:\s.)*x'

It affects only non-JIT-compiled (studied) REs.

RunGrepTest issue on OpenBSD

This is #2680 in the old Bugzilla, submitted by Nam Nguyen.

RunGrepTest fails on Test 132 for me on OpenBSD.

./testdata/grepoutput expects an 'a' at the end.
---------------------------- Test 132 -----------------------------
match 1:
 a
match 2:
 b
---
 a
RC=0

I actually get no 'a' when I run RunGrepTest:

---------------------------- Test 132 -----------------------------
match 1:
 a
match 2:
 b
---
RC=0

Manually running this test I can't see how it produces 'a' after the '---':

$ (pcre2grep -m1 -A3 '^match'; echo '---'; head -1) < testdata/grepinput
match 1:
 a
match 2:
 b
---

PH: The idea of this test is to check that the standard input is left in the right place when pcre2grep stops because it has reached the -m limit. The "a" line is generated by the "head -1" command when run under Linux (which is all I have):

$ (./pcre2grep -m1 -A3 '^match'; echo '---'; head -1) < testdata/grepinput
match 1:
 a
match 2:
 b
---
 a

Looks like this is yet another Linux/BSD difference. Sigh. I think the relevant code is around line 2589 in pcre2grep.c:

  /* If the -m option set a limit for the number of matched or non-matched
  lines, check it here. A limit of zero means that no matching is ever done.
  For stdin from a file, set the file position. */                         
                                                                            
  if (count_limit >= 0 && count_matched_lines >= count_limit)             
    {                                                                   
    if (frtype == FR_PLAIN && filename == stdin_name && !is_file_tty(handle))
      (void)fseek(handle, (long int)filepos, SEEK_SET);             
    rc = (count_limit == 0)? 1 : 0;                                     
    break;                                                             
    }
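The mechanism that the fseek() call above is meant to handle can be sketched in Python. This is only an illustration under stated assumptions, not pcre2grep's code: a buffered reader reads ahead of the line it returns, so the shared file offset has to be moved back before a later reader (here standing in for "head -1") can see the next line. The helper name and the sample data are hypothetical, and the sketch does not explain the OpenBSD difference itself.

```python
import os
import tempfile

def read_line_and_rest(path, reposition):
    """Read one line through a buffered reader, then read the rest unbuffered.

    The duplicated descriptor shares its file offset with fd, just as a
    subshell's commands share the offset of a redirected stdin.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with os.fdopen(os.dup(fd), "rb") as reader:
            first = reader.readline()
            logical_pos = reader.tell()   # where the consumer really stopped
        if reposition:
            # Same idea as the fseek() call: undo the read-ahead.
            os.lseek(fd, logical_pos, os.SEEK_SET)
        rest = os.read(fd, 4096)          # the "next command" on the same input
        return first, rest
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"match 1:\n a\nmatch 2:\n b\n")
    sample_path = tmp.name

with_seek = read_line_and_rest(sample_path, reposition=True)
without_seek = read_line_and_rest(sample_path, reposition=False)
os.unlink(sample_path)
```

With repositioning, the second reader sees everything after the first line; without it, the read-ahead has already consumed the whole (small) file and nothing is left.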

Incorrect DFA example in documentation

This is #2756 in the old Bugzilla, submitted by S. Shuck.

The DFA example in the docs demonstrating finding every match does not work as expected (details omitted).

PH: This is not a bug, but a misunderstanding. You used pcre2_match_data_create_from_pattern() to set up a match data block. As your pattern contains no capturing parentheses, this will create a block with a very small ovector (enough to hold just the whole match, no captured groups). However, when you use the DFA matcher, the ovector is used in a different way, as explained in the pcre2api page:

"On success, the yield of the function is a number greater than zero, which is
the number of matched substrings. The offsets of the substrings are returned in
the ovector, and can be extracted by number in the same way as for
pcre2_match(), but the numbers bear no relation to any capture groups
that may exist in the pattern, because DFA matching does not support capturing."

As your example should yield 3 matches, the ovector is not big enough, and therefore the yield is zero. If you change the match data creation to create a match data block with at least 3 ovector pairs, your example should return 3.

SS: Thanks for the insight. I'm unblocked for the moment.

The docs for pcre2_match_data_create_from_pattern() say "The ovector is created to be exactly the right size to hold all the substrings a pattern might capture." I guess I could have figured out that this number is not computable in the general case for DFA matching. Nevertheless, this sentence is false without a disclaimer about this case.

PH: Yes, I've noted that the documentation needs clarification, but it's too late for 10.37, which has been released today. I'll update the doc in due course - I suspect that DFA matching is in practice not used very much.

Disable \K in lookarounds

This is bug 2792 from the old Bugzilla, posted by firas. Perl used to allow \K in lookarounds, but it now throws an error. PCRE2 currently supports \K in positive lookarounds, and ignores it in negative ones. However, naive implementations can cause loops. After some discussion on the old list, the following was my (PH) conclusion:

I should have looked more closely at the code in pcre2demo. It has special code to deal with this case. Here is the comment:

/* If the previous match was not an empty string, there is one tricky case to
consider. If a pattern contains \K within a lookbehind assertion at the
start, the end of the matched string can be at the offset where the match
started. Without special action, this leads to a loop that keeps on matching
the same substring. We must detect this case and arrange to move the start on
by one character. The pcre2_get_startchar() function returns the starting
offset that was passed to pcre2_match(). */

OK, so now all is understood (pcre2test no doubt does the same). Perhaps the best thing to do here is to forbid \K in assertions, but to implement a new option in the PCRE2_EXTRA series to allow the current implementation. Then anyone who really needs the current behaviour can get it. We can put lots of warnings in the docs.

Document improvement for pcre2test delimiters

This was #2770 in the old Bugzilla. The poster said:

I noticed that pcre2test gives an error when attempting to use the delimiter within \Q..\E, but accepted the pattern when I escaped it as: [-+*/]

My (PH) response was this:

This is perhaps a lack of detail in the documentation, no more. You are using pcre2test, which is a program for testing the PCRE2 library, and running small regex tests. The way it works is to identify a pattern by delimiters, before passing it to the library for interpretation. It makes no interpretation of the pattern itself, except that, if '\' is encountered, the next character is not checked for being the delimiter. This is an easy fudge for simple cases. I see no reason why pcre2test should implement sophisticated regex parsing such as \Q...\E interpretation itself. Note also that pcre2test is not intended for use in any kind of production situation. Sometime before the next release I will take a look at the documentation to see if it can be made more clear.
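The delimiter scan described above can be sketched in a few lines of Python. This is a simplified illustration of the stated rule (a backslash means the next character is not checked for being the delimiter), not pcre2test's actual code; the function name is hypothetical.

```python
def split_pattern(line):
    """Split a pcre2test-style line into (pattern, trailing text).

    The first character is taken as the delimiter. A backslash causes the
    following character to be skipped without checking it; no other regex
    syntax (such as \\Q...\\E) is interpreted.
    """
    delimiter = line[0]
    i = 1
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            i += 2              # skip the escaped character unexamined
            continue
        if ch == delimiter:
            return line[1:i], line[i + 1:]
        i += 1
    raise ValueError("missing closing delimiter")

escaped = split_pattern(r"/[-+*\/]/")
quoted = split_pattern(r"/\Q[-+*/]\E/")
```

In the first call the escaped slash is skipped, so the whole class is the pattern. In the second call the scan knows nothing about \Q...\E, so the pattern ends at the slash inside the quoted section and the remainder is left as trailing text, which is why such input is rejected.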

How to verify the GPG signature?

Hi, sorry for stupid non pcre2 questions here ;-)

In the releases, there's a .sig file containing a verification signature for the tarballs. So far my attempts to verify it have failed, because I have not been able to find an importable key for it. GPG itself reports

gpg: key 4AEE18F83AFDEB23: new key but contains no user ID - skipped

for a random keyserver I tried. Loading the key from github tells me:

$ curl https://github.com/PhilipHazel.gpg
-----BEGIN PGP PUBLIC KEY BLOCK-----
Note: This user hasn't uploaded any GPG keys.


=twTO
-----END PGP PUBLIC KEY BLOCK-----%

which seems odd, because you've verified commits and signed tar-balls. Any hints?

Heap buffer overflow

Hi,

I'm a developer in the Qt project, which uses pcre2. I think I found an issue. To demonstrate this, I'll use your oss-fuzz image with a changed fuzz target which mimics how Qt uses your library. I did not change anything in pcre2 itself. Should any of my steps be incorrect, please let me know.

  1. Check out my branch of oss-fuzz.
    This will clone your latest sources to the oss-fuzz image with changes I did to the fuzz target in my fork of your repo.
  2. Build your fuzzer in oss-fuzz:
    python infra/helper.py build_image pcre2
    python infra/helper.py build_fuzzers --engine libfuzzer --sanitizer address --architecture x86_64 pcre2
  3. Run this with the input file I will send to you via mail.
    python infra/helper.py reproduce pcre2 pcre2_fuzzer <input_file>
    You will see output like:

Running: /testcase
=================================================================
==18==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6030000001a1 at pc 0x00000058a03f bp 0x7ffd1c26fab0 sp 0x7ffd1c26faa8
READ of size 1 at 0x6030000001a1 thread T0
SCARINESS: 12 (1-byte-read-heap-buffer-overflow)
#0 0x58a03e in get_ucp /src/pcre2/src/pcre2_compile.c
#1 0x56f0a6 in parse_regex /src/pcre2/src/pcre2_compile.c:3152:14
#2 0x56452e in pcre2_compile_8 /src/pcre2/src/pcre2_compile.c:10147:13
#3 0x55e3df in LLVMFuzzerTestOneInput /src/pcre2/src/pcre2_fuzzsupport.c:68:23
#4 0x455283 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) cxa_noexception.cpp
#5 0x440ec2 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#6 0x44671c in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) cxa_noexception.cpp
#7 0x46f522 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#8 0x7fce9188c0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
#9 0x41f67d in _start (/out/pcre2_fuzzer+0x41f67d)

DEDUP_TOKEN: get_ucp--parse_regex--pcre2_compile_8
0x6030000001a1 is located 0 bytes to the right of 17-byte region [0x603000000190,0x6030000001a1)
allocated by thread T0 here:
#0 0x52510d in __interceptor_malloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:129:3
#1 0x436e97 in operator new(unsigned long) cxa_noexception.cpp
#2 0x440ec2 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#3 0x44671c in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) cxa_noexception.cpp
#4 0x46f522 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#5 0x7fce9188c0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

DEDUP_TOKEN: __interceptor_malloc--operator new(unsigned long)--fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long)
SUMMARY: AddressSanitizer: heap-buffer-overflow /src/pcre2/src/pcre2_compile.c in get_ucp
Shadow bytes around the buggy address:
0x0c067fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c067fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c067fff8000: fa fa 00 00 00 fa fa fa 00 00 00 00 fa fa 00 00
0x0c067fff8010: 00 fa fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa
0x0c067fff8020: fd fd fd fa fa fa 00 00 00 00 fa fa 00 00 01 fa
=>0x0c067fff8030: fa fa 00 00[01]fa fa fa fa fa fa fa fa fa fa fa
0x0c067fff8040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c067fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c067fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c067fff8070: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c067fff8080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==18==ABORTING

I'd appreciate it if you could look into this. If I call your code incorrectly, please tell me so I can correct this in Qt.

PCRE2_ENDANCHORED option missing?

Hello,

the docs say matching has the following option flag:
PCRE2_ENDANCHORED Pattern can match only at end of subject
but it seems to be undefined. PCRE2_ANCHORED on the other hand works as expected, so I wanted to check if this is a known issue.

Regex times out and crashes Apache when pcre.jit is enabled

Recently I've installed PHP8.1.0RC6 and Laravel 9 to try out some of the new features. I was unable to load any pages using Laravel, as Apache would crash and I would get a connection reset message.

I've posted a bug report on Laravel's repository. We've tracked down the issue to a regex. It was suggested that I post the bug on php.net, which I did. After further investigation, it appears that the issue is related to a recent patch to this repository.

The detailed bug description can be found in the links below, but as a short explanation, the following piece of code will crash Apache (using 2.4.51) when pcre.jit = 1, but does not cause a problem when it's executed via CLI:

var_dump(
	preg_match(
		'(([\\r\\n]{1,1000})|([^\\S\\r\\n]{1,1000})|(\\\\)|(\')|(")|(\\#)|(\\$)|(([^(\\s\\\\\'"\\#\\$)]|\\(|\\)){1,1000}))A',
		'Laravel',
		$matches
	)
);

Bug report on Laravel repository:
laravel/framework#39716

Bug report on bugs.php.net:
https://bugs.php.net/bug.php?id=81647

Backwards incompatible change between 10.37 and 10.38

When PHP is compiled with pcre2-10.39 (the currently packaged version) but run with pcre2-10.36 (the previously packaged version), there's a BC break:

=== pcre2-10.35 ===
PHP Warning:  preg_match(): Compilation failed: unrecognised compile-time option bit(s) at offset 0 in Command line code on line 1
=== pcre2-10.36 ===
PHP Warning:  preg_match(): Compilation failed: unrecognised compile-time option bit(s) at offset 0 in Command line code on line 1
=== pcre2-10.37 ===
PHP Warning:  preg_match(): Compilation failed: unrecognised compile-time option bit(s) at offset 0 in Command line code on line 1
=== pcre2-10.38 ===
=== pcre2-10.39 ===

This breaks just too many things, so perhaps either the change needs to be reverted/fixed or the SONAME bumped.

I manually bisected the problem to commit 21c2669:

ondrej@calcifer:~/Projects/tmp/pcre2 ((eea410b...))$
PHP Warning:  preg_match(): Compilation failed: unrecognised compile-time option bit(s) at offset 0 in Command line code on line 1

vs nothing printed here:

ondrej@calcifer:~/Projects/tmp/pcre2 ((21c2669...))$

i.e. PHP compiled with a pcre2 that includes 21c2669 doesn't run without warnings when linked with a pcre2 that doesn't include 21c2669.
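The warning text points at a check that rejects option bits the library does not recognise. A minimal sketch of that kind of check, assuming hypothetical bit values (the real option constants and the specific bit involved are not given in this report):

```python
# Hypothetical bit masks, for illustration only.
OLD_LIBRARY_KNOWN_BITS = 0b0111   # options an older library understands
NEW_HEADER_EXTRA_BIT = 0b1000     # option that exists only in newer headers

def options_recognised(options, known_bits):
    """Return True if every requested option bit is known to the library."""
    return options & ~known_bits == 0

# A caller built against newer headers requests the new bit.
requested = 0b0001 | NEW_HEADER_EXTRA_BIT

accepted_by_new = options_recognised(
    requested, OLD_LIBRARY_KNOWN_BITS | NEW_HEADER_EXTRA_BIT)
accepted_by_old = options_recognised(requested, OLD_LIBRARY_KNOWN_BITS)
```

Under these assumptions the newer library accepts the request while the older one rejects it, which matches the pattern of the output above: the failure appears only when the runtime library is older than the headers used at build time.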

(*SKIP)(*F) within a (?(DEFINE)) does not skip position

This is #2725 from the old Bugzilla.

PH: It is documented at the end of pcre2pattern.3 that COMMIT, PRUNE, and SKIP are confined within a subroutine call in PCRE2, and just cause it to fail to match. I cannot remember why this is so. Subroutine calls appeared in PCRE before they did in Perl, so it might be that this behaviour dates from then, but it might also be because Perl has exhibited some conflicting behaviour in the past.

PH: Experiments show certain inconsistencies in Perl, which documents that (*ACCEPT) stays within a subroutine call, but is not explicit about the others, though it does state that a subroutine is processed as an independent subpattern. For the moment, we are not going to change anything in PCRE, partly because though this is an easy change in the interpreter, it is a substantial upgrade for the JIT.

OnlineCop wrote: The language of pcre2pattern.3 states:

(*SKIP)

This verb, when given without a name, is like (*PRUNE), except that if the pattern is unanchored, the "bumpalong" advance is not to the next character, but to the position in the subject where (*SKIP) was encountered. (*SKIP) signifies that whatever text was matched leading up to it cannot be part of a successful match if there is a later mismatch.

(*FAIL) in a group called as a subroutine has its normal effect: it forces an immediate backtrack.
(*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when triggered by being backtracked to in a group called as a subroutine. There is then a backtrack at the outer level.

There is no mention of (*SKIP) or (*PRUNE) being unable to modify the bumpalong of the outer level.

Perl appears to modify the bumpalong before the subroutine match fails, which (like PCRE2) then causes a backtrack at the outer level.

I believe that all other verbs, including (*ACCEPT), are fine to stay as-is, and the only change here being that (*SKIP) should be able to modify the outer level's bumpalong advance.

pcre2unicode.3 is installed twice when building with CMake

If I run the following:

tar -xf pcre2-10.39.tar.bz2
cd pcre2-10.39
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/tmp/pcre2test -G Ninja ..
ninja
ninja install

ninja install prints the following:

-- Installing: /tmp/pcre2test/man/man3/pcre2syntax.3
-- Installing: /tmp/pcre2test/man/man3/pcre2unicode.3
-- Up-to-date: /tmp/pcre2test/man/man3/pcre2unicode.3

This is because pcre2unicode.3 is listed twice in cmake_install.cmake. It's also listed twice in install_manifest.txt, which means that if you uninstall using xargs rm < install_manifest.txt as recommended by CMake, you get an error.
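Until the duplicate entry is fixed, one possible workaround is to remove duplicates from the manifest before deleting anything. A small sketch, with a hypothetical helper name and the paths taken from the output above:

```python
def unique_manifest_entries(manifest_text):
    """Return manifest paths in original order, dropping repeated lines."""
    seen = set()
    unique = []
    for line in manifest_text.splitlines():
        path = line.strip()
        if path and path not in seen:
            seen.add(path)
            unique.append(path)
    return unique

sample = (
    "/tmp/pcre2test/man/man3/pcre2syntax.3\n"
    "/tmp/pcre2test/man/man3/pcre2unicode.3\n"
    "/tmp/pcre2test/man/man3/pcre2unicode.3\n"
)
entries = unique_manifest_entries(sample)
```

Each remaining path can then be removed once, so the second removal of pcre2unicode.3 never happens.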

Inexplicable error -8 while matching certain pattern

Hi!

There is an error when matching the following pattern:

a(.|\s)*?asdf

against:

a                b b bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbf

Specifically: a, followed by newline, followed by 16 spaces, followed by b, space, b, space, 35 b's, then an f.

pcregrep 'a(.|\s)*?asdf' returns:

pcregrep: pcre_exec() gave error -8 while matching this text:

a                b b bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbf


pcregrep: Error -8, -21 or -27 means that a resource limit was exceeded.
pcregrep: Check your regex for nested unlimited loops.

Using pcre2 10.37_1, as installed on macOS 11.6 via Homebrew.

Am I missing something obvious? This should work, right?

perl compatible matching of '\n' with '.' when (*NUL)

Perl doesn't match '\n' with '.' unless the "s" modifier is provided, regardless of what the input separator is, as shown by:

$ printf '\n\na\0' | perl -ne 'BEGIN { $/="\0" } /(?<=\n)(.*)$/ and print $1' | od -c | head -1
0000000    a  \0
$ printf '\n\na\0' | perl -ne 'BEGIN { $/="\0" } /(?<=\n)(.*)$/s and print $1' | od -c | head -1
0000000   \n   a  \0

GNU grep (which uses the old PCRE) shows similar behaviour when using NUL as a line delimiter (-z is for NUL-separated input, and also sets the line terminator of the output, just like perl's "-0 -l"):

$ printf '\n\na\0' | ggrep -Pzo '(?<=\n).*$' | od -c | head -1
0000000    a  \0

PCRE2 does not behave this way if the newline is not LF (or a compatible ANY or ANYCRLF). That behaviour is actually validated by the test suite (set 2) and IMHO makes more sense, but it will prevent grep (which has recently been updated to PCRE2 in its unreleased version) from using PCRE2_NEWLINE_NUL, as this change of behaviour might be considered a regression.

$ printf '\n\na\0' | pcre2grep -o -NNUL '(?<=\n).*$' | od -c | head -1
0000000   \n   a  \n

The documentation explicitly says that no changes in matching are expected when the newline definition is changed. When the newline is '\n', an equivalent "s" mode is provided through PCRE2_DOTALL, making the result the same as Perl's for the modes that have '\n' as a valid newline delimiter, but not if CR or NUL is used, so there is at least a possibility that this is a "bug"?

FWIW, I confirmed that at least it is not a regression: 8.x, while not having NUL, behaves the same when using CR, which is consistent with the behaviour observed in 10.x.
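For comparison, Python's re module (a different engine, used here only as a reference point) treats '.' the same way as the Perl commands above: it never matches '\n' unless the DOTALL flag is set. A short sketch using the same subject string:

```python
import re

subject = "\n\na\0"   # same bytes as printf '\n\na\0'

# Without DOTALL, '.' stops at '\n', so the match starts after the last '\n'.
without_s = re.search(r"(?<=\n)(.*)$", subject).group(1)

# With DOTALL (Perl's "s" modifier), '.' also matches '\n'.
with_s = re.search(r"(?<=\n)(.*)$", subject, re.DOTALL).group(1)
```

The two captured values correspond to the two od outputs shown for Perl: "a\0" without the modifier and "\na\0" with it.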

[Makefile:1018: ossec-agentd]

I'm trying to install an agent on my servers, but I have two servers with the same error and I just can't seem to figure it out :S Could you help?

  • I've installed packages: wget git vim unzip make gcc build-essential php php-cli php-common libapache2-mod-php apache2-utils inotify-tools libpcre2-dev zlib1g-dev libz-dev libssl-dev libevent-dev libssl-dev

  • I also have: pcre2-10.32 and pcre2-10.39 within /ossec-hids/src/external

  • I've tried 'make clean' within /src

  • I downloaded the ossec-hids from: git clone https://github.com/ossec/ossec-hids.git and I'm using version 3.6.0 C:

I'm still pretty new to networking, so I'm sure I'm overlooking something C:

5- Installing the system
 - Running the Makefile
cc -I./external/compat -DMAX_AGENTS=2048 -DOSSECHIDS -DDEFAULTDIR=\"/var/ossec\" -DUSER=\"ossec\" -DREMUSER=\"ossecr\" -DGROUPGLOBAL=\"ossec\" -DMAILUSER=\"ossecm\" -DLinux -DINOTIFY_ENABLED -DHAVE_SYSTEMD -DZLIB_SYSTEM -DUSE_PCRE2_JIT -DLIBOPENSSL_ENABLED -DCLIENT -Wall -Wextra -I./ -I./headers/  client-agent/agentd.o client-agent/config.o client-agent/event-forward.o client-agent/intcheck_op.o client-agent/main.o client-agent/notify.o client-agent/receiver.o client-agent/receiver-win.o client-agent/sendmsg.o client-agent/start_agent.o os_crypto.a config.a shared.a os_net.a os_regex.a os_xml.a os_zlib.a  -lm -lpthread -lsystemd -lpcre2-8 -lssl -lcrypto -lz  ./external/compat/imsg.c ./external/compat/imsg-buffer.c -o ossec-agentd
/usr/bin/ld: cannot find -lsystemd
collect2: error: ld returned 1 exit status
make: *** [Makefile:1018: ossec-agentd] Error 1

pcre2.h missing on CentOS 7, even after installing pcre2

Hi,

I am not sure if this is the right place to raise this issue.
On CentOS 7, when we install pcre2 with the below command.
yum install pcre2

It doesn't install the pcre2.h file along with the other files such as the .so.
Am I doing something wrong, or is it a bug in the CentOS packaging?

I also tried installing via rpm, with the same result.
https://centos.pkgs.org/7/centos-x86_64/pcre2-10.23-2.el7.x86_64.rpm.html
As the Files section of this page shows, that package doesn't install pcre2.h either.

Thanks in advance for help.

pcre2grep should detect symlink recursions

This is issue 2794 from the old Bugzilla, posted by Thomas Tempelmann. This is the proposed patch:

Assuming this is line 3330:

char buffer[FNBUFSIZ];

Then please rename "buffer" to "childpath" for better readability.

Then insert right after 3352 ("sprintf(..."):

  #if 1 // <-- replace with test for Linux and BSD (macOS/Darwin)
    // prevent endless recursion due to a symlink pointing to a parent dir (Bug 2794)
    char resolvedpath[PATH_MAX];
    if (realpath(childpath, resolvedpath) == NULL)
      continue;     // this path is invalid - we can skip processing this
    BOOL isSame = strcmp(pathname, resolvedpath) == 0;
    if (isSame)
      continue;    // we have a recursion
    strlcat(resolvedpath, "/", sizeof(resolvedpath));
    size_t rlen = strlen(resolvedpath);
    BOOL contained = strncmp(pathname, resolvedpath, rlen) == 0;
    if (contained)
      continue;    // we have a recursion
    resolvedpath[rlen-1] = 0;   // removes the added "/"
    strlcpy(childpath, resolvedpath, sizeof(childpath));
  #endif

I've tested this to work successfully on macOS with my screwed-up symlink.

The tricky part is to tell whether the resolved path is pointing back to where we already were.

With the first strcmp() I check whether it equals the current (parent) directory path. This assumes, though, that "pathname" is also already resolved. If the user passed an unresolved path that points to itself, this recursion detection will fail the first time, but then, in the recursion, the paths will be the same (because I replace childpath with the resolved path further down) and thus the recursion will be stopped. You could just as well resolve the path at the top of the function, but that's wasteful, IMO.

The second comparison (strncmp) then checks whether the resolved path is the current path's parent or one of its ancestors. I do this by adding a "/" to the resolved path so that I do not get a false match in the case where the resolved path is "/a/b" and the parent is "/a/b2".
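The two comparisons described above can be sketched as a pure string check in Python. This is a simplified illustration of the idea in the patch, not a port of it: both arguments are assumed to be already-resolved absolute paths, and the function name is hypothetical.

```python
def points_back_to_ancestor(pathname, resolved):
    """Return True if `resolved` is `pathname` itself or one of its ancestors.

    pathname: directory currently being scanned (assumed resolved).
    resolved: result of resolving a child entry, e.g. via os.path.realpath().
    """
    if pathname == resolved:
        return True                        # symlink points at the current dir
    # Append "/" so that "/a/b" is not mistaken for a prefix of "/a/b2".
    prefix = resolved.rstrip("/") + "/"
    return pathname.startswith(prefix)

same_dir = points_back_to_ancestor("/a/b", "/a/b")
ancestor = points_back_to_ancestor("/a/b/c", "/a/b")
sibling = points_back_to_ancestor("/a/b2", "/a/b")
descendant = points_back_to_ancestor("/a/b", "/a/b/c")
```

Only the first two cases indicate a recursion; a sibling directory with a common name prefix and an ordinary descendant are both left alone.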

Porting guide from pcre1

Hi,

I don't suppose there's any chance of a porting guide from pcre1 to pcre2, is there, please?

I know you want to be shot of pcre1; I've recently filed bugs against the outstanding packages in Debian which still build against pcre1, and there are a lot of responses of the form "is there any guidance on porting to pcre2?" I don't feel I have deep enough knowledge of the two libraries (especially the older one) to do so myself, but I think having something to point folk at might help in getting more of the remaining ~200(!) packages that still need old-pcre ported, which in turn will make it plausible for me to drop old-pcre from Debian...

Thanks :)

Duplicate names and callouts

I have a function that searches for the nearest named group by name, by measuring the distance between the name table entries and the capture_last field of the callout structure.

It seems to be buggy.

How do I get the most recent named group (normally the only one in the current subroutine) when duplicate names are allowed?

[Conan] Building on Windows 10 x86 - linking error

Hello!
I am trying to compile on Windows with Conan.

Library: pcre/8.45
Operating System: Windows 10 (x86)
Compiler version: VS 15
Conan version: conan 1.43.0
Python version: Python 3.8.0

Conan profile:

C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools>conan profile show default
Configuration for profile default:

[settings]
os=Windows
os_build=Windows
arch=x86
arch_build=x86
compiler=Visual Studio
compiler.version=15
build_type=Debug

Error: Full log

pcregrep.obj : error LNK2019: unresolved external symbol _BZ2_bzopen@8 referenced in function _grep_or_recurse [C:\Users\mhanu\.conan\data\pcre\8.45\_\_\build\752948ce548bd345ffbb13d47bc67547d791cc18\build_subfolder\source_subfolder\pcregrep.vcxproj]
pcregrep.obj : error LNK2019: unresolved external symbol _BZ2_bzread@12 referenced in function _pcregrep [C:\Users\mhanu\.conan\data\pcre\8.45\_\_\build\752948ce548bd345ffbb13d47bc67547d791cc18\build_subfolder\source_subfolder\pcregrep.vcxproj]
pcregrep.obj : error LNK2019: unresolved external symbol _BZ2_bzclose@4 referenced in function _grep_or_recurse [C:\Users\mhanu\.conan\data\pcre\8.45\_\_\build\752948ce548bd345ffbb13d47bc67547d791cc18\build_subfolder\source_subfolder\pcregrep.vcxproj]
pcregrep.obj : error LNK2019: unresolved external symbol _BZ2_bzerror@8 referenced in function _grep_or_recurse [C:\Users\mhanu\.conan\data\pcre\8.45\_\_\build\752948ce548bd345ffbb13d47bc67547d791cc18\build_subfolder\source_subfolder\pcregrep.vcxproj]
C:\Users\mhanu\.conan\data\pcre\8.45\_\_\build\752948ce548bd345ffbb13d47bc67547d791cc18\build_subfolder\bin\pcregrep.exe : fatal error LNK1120: 4 unresolved externals [C:\Users\mhanu\.conan\data\pcre\8.45\_\_\build\752948ce548bd345ffbb13d47bc67547d791cc18\build_subfolder\source_subfolder\pcregrep.vcxproj]
pcre/8.45:
pcre/8.45: ERROR: Package '752948ce548bd345ffbb13d47bc67547d791cc18' build failed

[JIT] Performance regression with some regexs

I'm forwarding this from https://bugs.php.net/81424, but it seems that this is actually a PCRE2 issue. Consider the following (bad) regex:

/[^{};\/\n]+\{\}/

When run on a large string (e.g. https://pastebin.com/WVBR4f9T), this was fast with PCRE2 10.34 JIT; with PCRE2 10.35 and later it is more than a hundred times slower.

If the regex is rewritten to use a lookbehind assertion (/(?<![{};\/\n]+)\{\}/), performance with the different PCRE2 versions is on par, so you may not consider this worth fixing. :)

There is no performance regression without JIT, so I wonder whether this regex isn't jitted anymore as of PCRE2 10.35.

Revise and extend character classes

This issue records several potential upgrades to the handling of character classes in PCRE2. This could be a lot of work in both the interpreters and the JIT.

  1. The current code in the compiler has been hacked into an untidy mess and the compiled code is also messy. A revised implementation is needed that is more uniform and can better handle Unicode characters so as to make matching more efficient. For example, bitmaps could be used for runs of characters other than just 0-0xFF. Or some better coding scheme could be devised.

  2. Perl has an experimental extended class feature as in this example:

/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/

Any new compiled format should be able to handle such things.

  3. There was a request for a way of re-defining \w (and therefore \W, \b, and \B). An in-pattern sequence such as (?w=[...]) was suggested. Easiest way would be simply to inline the class, with lookarounds for \b and \B. Ideally the setting should last till the end of the group, which means remembering all previous settings; maybe a fixed amount of stack would do - how deep would anyone want to nest these things? Of course, this idea also suggests redefining \d and \s. Is this worth doing, given that named groups can be used? It would be more efficient, because it can be processed at compile time.
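The "inline the class, with lookarounds for \b" idea from the last item can be tried out by hand in any engine that supports lookarounds. A minimal sketch using Python's re module, assuming a hypothetical redefinition of \w in which the hyphen also counts as a word character:

```python
import re

# Hypothetical redefinition of \w: ASCII word characters plus hyphen.
WORD = r"[A-Za-z0-9_-]"

# Inlined replacement for \b: a position where exactly one side is a word char.
BOUNDARY = rf"(?:(?<={WORD})(?!{WORD})|(?<!{WORD})(?={WORD}))"

custom_word = re.compile(rf"{BOUNDARY}{WORD}+{BOUNDARY}")

text = "well-known re-entrant code"
custom = custom_word.findall(text)
builtin = re.findall(r"\b\w+\b", text)
```

With the inlined definition, hyphenated words are kept whole, while the built-in \w and \b split them at the hyphen. A compile-time (?w=[...]) setting would in effect perform this substitution for the user.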

Declaring named capture groups

Named capture groups have been supported for a while.
What do you think about an extension that would make it possible to declare bindings of regular expressions to identifiers before those identifiers are reused in other places?

Add some additional configuration

This is #2767 from the old Bugzilla.

I found that the source code has no exported functions for these settings:
pcre2_set_max_name_count
pcre2_set_max_name_length
I think these may be needed in some scenarios, and I hope you can add these two exported functions
as well as the corresponding pcre2_config options.

Release of 10.39

Hi,

I wanted to ask whether there is a rough estimate of when 10.39 with the fix is going to be released.

We are currently using PHP 7.4 packaged from Sury (https://deb.sury.org/) which includes libpcre2-8-0 in version 10.38.

Because of the bug, some of our regexes no longer work, such as this one:

<?php
$matches = [];
preg_match('/(^.*phv-0*([1-9]\d{4,})(\D|$).*$)|((\D|^)([1-9]\d{7})(\D|$))/is', 'Mein Vertrag 12345678', $matches);
var_dump($matches);

It would be super cool if you could release the fix (which seems to be already implemented/merged).
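As a reference point for what a match should look like, the same pattern and subject can be run through Python's re module. This is a different engine from PCRE2, so it only shows the result expected from ordinary backtracking semantics, not PHP's actual output:

```python
import re

pattern = re.compile(
    r"(^.*phv-0*([1-9]\d{4,})(\D|$).*$)|((\D|^)([1-9]\d{7})(\D|$))",
    re.IGNORECASE | re.DOTALL,   # equivalent of the /is modifiers
)

match = pattern.search("Mein Vertrag 12345678")
whole = match.group(0)    # text matched by the second alternative
digits = match.group(6)   # the eight-digit number
```

The first alternative cannot match because the subject contains no "phv-", so the match comes from the second alternative: the space before the number followed by the eight digits.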

FTP/HTTP/Conan not working?

Hello.
I have a problem downloading from conan-center.

pcre/8.45: Configuring sources in /root/.conan/data/pcre/8.45/_/_/source
ERROR: Error downloading file https://ftp.pcre.org/pub/pcre/pcre-8.45.tar.gz: 'HTTPSConnectionPool(host='ftp.pcre.org', port=443): Max retries exceeded with url: /pub/pcre/pcre-8.45.tar.gz (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f2dd13fa640>, 'Connection to ftp.pcre.org timed out. (connect timeout=60.0)'))'

Target: https://ftp.pcre.org/pub/pcre

CMake version needs incrementing

This is a consequence of old Bugzilla #2785, by Jan-Willem Blokland. The patch in that issue was applied, but the minimum CMake version in CMakeLists.txt needs updating (it is very old). These are the final comments:

JWB: Interesting that you see this Deprecation Warning. It turns out that CMake deprecated versions older than 2.8.12:

Compatibility with versions of CMake older than 2.8.12 is now deprecated and will be removed from a future version. Calls to cmake_minimum_required() or cmake_policy() that set the policy version to an older value now issue a deprecation diagnostic.

For more details see https://cmake.org/cmake/help/v3.20/release/3.19.html. We could decide to increase the minimum required version to something more recent, like version 3.0 to avoid this warning. If we do so, I am willing to make another update to the CMake build configuration.

PH: Yes, I think we can usefully up the number to 3.0.0, which was released in 2014, so it's unlikely to catch anybody. I will do it sometime.

Support for more Unicode properties?

Regarding which Unicode properties are supported, the manual says:

The property names represented by xx above are limited to the Unicode script names, the general category properties, "Any", which matches any character (including newline), and some special PCRE properties (described in the next section). Other Perl properties such as "InMusicalSymbols" are not currently supported by PCRE. Note that \P{Any} does not match any characters, so always causes a match failure.

We have users who want support for the Bidi_Control property (semgrep/semgrep#3974), which is supported by Perl (also, by Go's regexp library). I'm not familiar with any of these implementations and I'm wondering why PCRE doesn't support all Unicode properties. Is it because they were added late and PCRE needs to catch up or for a technical reason?

Note that we're using PCRE from OCaml for which there hasn't been an effort to migrate to pcre2. So if we extend PCRE2 with support for more Unicode properties, we'll be unable to use it from OCaml unless we also port these changes to the old PCRE or we change the OCaml bindings to support the new API. It's really a separate issue but I thought I should mention it.
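
Until such a property is available, one possible workaround is an explicit character class, because the Bidi_Control property covers a small fixed set of code points in the Unicode Character Database (U+061C, U+200E to U+200F, U+202A to U+202E, U+2066 to U+2069). The sketch below uses Python's `re` purely for illustration. The helper name `find_bidi_controls` and the sample string are hypothetical, and the list should be checked against the Unicode version in use.

```python
import re

# Code points with Bidi_Control=Yes in the Unicode Character Database (PropList.txt):
# U+061C, U+200E..U+200F, U+202A..U+202E, U+2066..U+2069.
BIDI_CONTROL = re.compile("[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]")


def find_bidi_controls(text):
    """Return (index, 'U+XXXX') for each bidi control character found in text."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in BIDI_CONTROL.finditer(text)
    ]


# Hypothetical input containing RIGHT-TO-LEFT OVERRIDE (U+202E).
sample = "access\u202Elevel"
print(find_bidi_controls(sample))
```

The same class can be written for PCRE in UTF mode with `\x{...}` escapes, for example `[\x{061C}\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}]`.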

Another regression in JIT matching

This is #2762 in the old Bugzilla, submitted by Milian Wolff.

This is probably related to BUG 2621 except that I'm running with PCRE2 version 10.37 2021-05-26 and the specific issue from that bug doesn't reproduce anymore.

Instead, I'm running into the following reduced issue:

works:

printf '%s\n%s\n' '/\/([^\/]+)\/\d+/' '/A/B/0' | pcre2test 
PCRE2 version 10.37 2021-05-26
/\/([^\/]+)\/\d+/
/A/B/0
 0: /B/0
 1: B

does not work:

printf '%s\n%s\n' '/\/([^\/]+)\/\d+/' '/A/B/0' | pcre2test -jit
PCRE2 version 10.37 2021-05-26
/\/([^\/]+)\/\d+/
/A/B/0
No match

Slight changes to the pattern make the issue go away.

Also, please note that Bugzilla is missing an entry for version 10.37, so I selected N/A for now.
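
As a sanity check that the non-JIT output is the expected one, the reduced pattern can be run through an unrelated engine. The sketch below uses Python's `re`, which is not PCRE2 and says nothing about the JIT itself. It only confirms what the pattern should match on this subject.

```python
import re

# Reduced pattern and subject from the report (without pcre2test's / delimiters).
PATTERN = re.compile(r"\/([^\/]+)\/\d+")
SUBJECT = "/A/B/0"

match = PATTERN.search(SUBJECT)

# An attempt starting at the first slash fails because "B" is not a digit,
# so the match is expected to start at the second slash.
print(match.group(0), match.group(1))
```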

Add support for the RISC-V architecture

This is a low-priority feature request, but one that would make sense to be done eventually.

https://github.com/PhilipHazel/pcre2/tree/master/src/sljit

Based on the files in this folder, it seems that currently PCRE supports many architectures, including x86, ARM, PPC, MIPS, SPARC, and S390X. RISC-V is a new open source ISA supported by several Linux distros, and it would be nice if PCRE supported RISC-V.

Specifically, I am only interested in 64-bit RISC-V aka RV64 (32-bit and 128-bit exist too), and as for extensions, I recommend just supporting the general purpose extensions, so -march=rv64g (or -march=rv64gc). The abbreviation of RISC-V to just RV is very common and is the preferred option in some situations, so feel free to use that name.
