Comments (6)

PhilipHazel avatar PhilipHazel commented on July 18, 2024

PCRE2 already supports different "newline" conventions, but the convention in force does not change when PCRE2_MULTILINE is set. You can choose between CR (only), LF (only), CRLF (i.e. the two-character sequence), any of the previous three, any Unicode newline sequence, or NUL. A default can be set when PCRE2 is built; it can be overridden by a function call, which in turn can be overridden within the pattern string.

If you were to set ANYCRLF as the newline, it would almost agree with your "not s" mode, except that a CR followed by an LF would count as just one newline, not two. It sounds as if you have full control over the regex. In that case, when you are going to set the "m" option, you could also set LF as the only newline. So my suggestion is:

Default: start the pattern with (*ANYCRLF) which will give you correct "." behaviour, that is, "." will not match CR or LF.

If the "s" option is wanted, start the pattern with (?s) and "." will match any character.

If the "m" option is wanted, start the pattern with (*LF)(?m) and "." will match any character except LF.

If both options are wanted, start with (*LF)(?ms).

That seems to me to give you the wanted behaviour, except that in the default case CRLF counts as just one newline. Making PCRE2 recognize either CR or LF as a newline, but treat CRLF as two newlines would require a new newline mode.
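
For illustration, the four cases above can be folded into a small helper. This is only a sketch in Python that builds the pattern string; `pcre2_prefix` and `with_prefix` are hypothetical names, not part of PCRE2, and nothing here calls the library.

```python
def pcre2_prefix(multiline: bool = False, dotall: bool = False) -> str:
    """Return the pattern prefix suggested above for the "m"/"s" options.

    Hypothetical helper: it only assembles text and does not touch PCRE2.
    """
    if multiline and dotall:
        return "(*LF)(?ms)"   # both options wanted
    if multiline:
        return "(*LF)(?m)"    # "." then matches anything except LF
    if dotall:
        return "(?s)"         # "." matches any character
    return "(*ANYCRLF)"       # default: "." matches neither CR nor LF


def with_prefix(pattern: str, multiline: bool = False, dotall: bool = False) -> str:
    """Put the prefix at the very start of the pattern, where (*LF) and
    (*ANYCRLF) have to appear."""
    return pcre2_prefix(multiline, dotall) + pattern
```

For example, `with_prefix("^a.b$", multiline=True)` gives `(*LF)(?m)^a.b$`.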

from pcre2.

cohomology avatar cohomology commented on July 18, 2024

The problem with your approach is that in "m" mode without "s" you would set (*LF)(?m), so a single dot would match CR, which it shouldn't according to the standard. The problem you mentioned at the end also remains.

My colleague suggested:

Always set (*LF)

Transform the regex so that "." is never emitted. Instead:

no "s" mode: dot is transformed to [^\r\n]
in "s" mode: dot is transformed to [\s\S]

Would that work?
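
A rough Python sketch of that transformation, for illustration. `replace_dots` is a hypothetical name, and the scanner rests on two assumptions: literal brackets inside a character class are backslash-escaped, and the pattern uses neither \Q...\E quoting nor extended-mode comments.

```python
def replace_dots(pattern: str, dotall: bool = False) -> str:
    """Rewrite every unescaped "." outside a character class and
    prepend (*LF), following the two rules listed above."""
    dot = r"[\s\S]" if dotall else r"[^\r\n]"
    out = []
    depth = 0  # greater than 0 while inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escaped pair such as \. or \] through unchanged.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif ch == "." and depth == 0:
            # A real "any character" dot: substitute the class.
            out.append(dot)
            i += 1
            continue
        out.append(ch)
        i += 1
    return "(*LF)" + "".join(out)
```

With this sketch, `replace_dots("a.b")` gives `(*LF)a[^\r\n]b`, while an escaped dot or a dot inside a class, as in `a\.[.]`, is left alone.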

PhilipHazel avatar PhilipHazel commented on July 18, 2024

It's somewhat inconsistent to have "." not match CR or LF while at the same time only recognizing LF as newline. However, I think your approach would work, though for "s" mode you could just set PCRE2_DOTALL (or (?s)) which would be more efficient.

cohomology avatar cohomology commented on July 18, 2024

Yeah, thanks!

I really want to know why W3C decided to do it that way. The XML people should be very clever, shouldn't they? Do you have a clue?

The replacement syntax (i.e. substitute) is even more difficult to get used to. They don't accept ${num}, only $num, so if there are only 22 groups, $223 is equivalent to PCRE's ${22}3. Writing the equivalent of ${2}23 is impossible in XPath in this case.
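
To make the $num rule concrete, here is a small Python sketch that rewrites such references into the braced form. `xpath_to_pcre2_replacement` is a hypothetical name, and the sketch covers only the rule described here: take the longest run of digits that still names an existing group and treat the remaining digits as literal text. Backslash-escaped pairs are copied through unchanged and would still need translating, and a single-digit reference to a missing group is passed along as it is.

```python
DIGITS = "0123456789"


def xpath_to_pcre2_replacement(replacement: str, group_count: int) -> str:
    """Rewrite $num references into ${num} form (illustrative sketch)."""
    out = []
    i = 0
    n = len(replacement)
    while i < n:
        ch = replacement[i]
        if ch == "\\" and i + 1 < n:
            # Escaped pair: copy as is so an escaped "$" is not a reference.
            out.append(replacement[i:i + 2])
            i += 2
            continue
        if ch == "$" and i + 1 < n and replacement[i + 1] in DIGITS:
            j = i + 1
            while j < n and replacement[j] in DIGITS:
                j += 1
            digits = replacement[i + 1:j]
            # Drop trailing digits until the number names an existing
            # group, or only one digit is left.
            k = len(digits)
            while k > 1 and int(digits[:k]) > group_count:
                k -= 1
            out.append("${" + digits[:k] + "}" + digits[k:])
            i = j
            continue
        out.append(ch)
        i += 1
    return "".join(out)
```

With 22 groups, `xpath_to_pcre2_replacement("$223", 22)` gives `${22}3`, matching the case above.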

PhilipHazel avatar PhilipHazel commented on July 18, 2024

Who knows? Reading the doc suggests to me that they thought about "^" and "$" completely separately from "." whereas PCRE ties them all to the concept of a logical "newline". The replacement rules seem totally weird.

cohomology avatar cohomology commented on July 18, 2024

Thanks!
