
Comments (4)

eliben commented on May 27, 2024

Interesting find. If you have a fix, please submit a pull request.


TysonAndre commented on May 27, 2024

The third-party regex module (unlike the stdlib re) supports atomic grouping, (?>...), which prevents the engine from backtracking into an already-matched group. I'd recommend adding a dependency on regex and using that.

With atomic grouping, these commands complete in milliseconds (both matches and failures) instead of minutes or longer (string_char and bad_escape are the pattern fragments defined in pycparser's c_lexer.py):

>>> import regex
>>> bad_string_literal = '"(?>'+string_char+'*)'+bad_escape+string_char+'*"'
>>> string_to_match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\`\\xFA";\n\n\n\n\n\nin'
>>> regex.match(bad_string_literal, string_to_match)
<regex.Match object; span=(0, 93), match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\\`\\xFA"'>
>>> string_to_match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\\xFA";\n\n\n\n\n\nin'
>>> regex.match(bad_string_literal, r'"\123\123\123\123\123\123\123\123\123\123\123\123\123\123\123";')
>>> regex.match(bad_string_literal, r'"\123\123\123\123\123\123\`\123\123\123\123\123\123\123\123";')
<regex.Match object; span=(0, 60), match='"\\123\\123\\123\\123\\123\\123\\`\\123\\123\\123\\123\\123\\123\\123\\123"'>

Aside: I'm not familiar with the C standard. The regex looks like it allows exactly one invalid escape per string. Should it allow one or more invalid escapes instead (combine the valid and invalid escape patterns and repeat)?


TysonAndre commented on May 27, 2024

Also, it looks like my patch to use regex will cause the test to fail because it no longer rejects "jx\9", but the test seems to have relied on a bug in the old re-based implementation.

I assume the test is checking that invalid octal escapes are rejected, but the implementation in c_lexer.py is deliberately permissive and tries to allow decimal escapes (which would permit \99, \9, etc.):

        self.assertLexerError(r'"jx\9"', ERR_STRING_ESCAPE)

The relevant comment in c_lexer.py:

    # character constants (K&R2: A.2.5.2)
    # Note: a-zA-Z and '.-~^_!=&;,' are allowed as escape chars to support #line
    # directives with Windows paths as filenames (..\..\dir\file)
    # For the same reason, decimal_escape allows all digit sequences. We want to
    # parse all correct code, even if it means to sometimes parse incorrect
    # code.


TysonAndre commented on May 27, 2024

> Sorry, I'm not going to accept this. The bug isn't realistically important enough to add another dependency to pycparser. At this time pycparser has no external deps (PLY is vendored), and adding one is a big step function.

Would you accept PRs for any of these three options?

  1. Try to import regex. If it's available, use the new atomic-grouping pattern; if not, fall back to the old pattern with re.

  2. Run three regex matches with re and return a dummy re.Match emulating the correct behavior:

    1. For the valid string prefix
    2. For a single invalid escape sequence
    3. For a combination of valid and invalid sequences, followed by the closing quote
  3. Try to fix the ambiguity with a negative lookahead.


Also, the root cause is still there. Thinking about this again, \123 can be parsed as \1 followed by the literal characters 23, or as \12 followed by 3, or as \123, so there are three ways of parsing that escape sequence, and every one of them is tried when the regex fails later (each escape sequence makes worst-case matching roughly three times longer).
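To make that ambiguity concrete, here is a toy model (a hypothetical helper, not pycparser code) that enumerates the ways a backslash followed by a digit run can be read:

```python
def octal_parses(digits: str) -> list[tuple[str, str]]:
    """All ways to read '\\' + digits: the escape consumes 1..len(digits)
    digits, and the remainder is read as plain characters."""
    return [('\\' + digits[:i], digits[i:]) for i in range(1, len(digits) + 1)]

# Three readings of \123, so k such escapes give 3**k combinations to try.
print(octal_parses('123'))  # [('\\1', '23'), ('\\12', '3'), ('\\123', '')]
```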

I think the problem is that the BAD_STRING_LITERAL regular expression is doing backtracking, resulting in exponential running time. Add another '\123' to the string and it will take roughly 3 times as long.

Adding a lookahead assertion that the character after \[0-9]+ is not another digit (0-9) might be a better approach to avoiding this exponential backtracking (the same applies to anything else that accepts a variable number of characters). I don't know if Python has had any bugs related to lookahead that I'd need to know about.


https://docs.python.org/2/library/re.html#regular-expression-syntax

(?!...)
Matches if `...` doesn’t match next. This is a negative lookahead assertion. For example, `Isaac (?!Asimov)` will match 'Isaac ' only if it’s not followed by 'Asimov'.
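A sketch of that lookahead fix with the stdlib re, using simplified stand-ins for pycparser's pattern fragments (the names and exact character classes here are assumptions, not the real c_lexer.py definitions):

```python
import re

plain_char = r'[^"\\\n]'            # anything but quote, backslash, newline
octal_fixed = r'\\[0-9]+(?![0-9])'  # lookahead: the escape must swallow every digit
bad_escape = r'\\`'                 # one escape treated as invalid

piece = '(' + plain_char + '|' + octal_fixed + ')'
bad_string_literal = '"' + piece + '*' + bad_escape + piece + '*"'

# A bad escape is present -> matches.
print(bool(re.match(bad_string_literal, r'"\123\123\`\123"')))        # True
# Only valid escapes: the lookahead leaves exactly one parse per \123,
# so failure is found quickly instead of after ~3**30 backtracking attempts.
print(bool(re.match(bad_string_literal, '"' + r'\123' * 30 + '";')))  # False
```

The lookahead removes the ambiguity that fuels the backtracking: \123 can no longer be split as \1 + 23 or \12 + 3, so the engine tries each position only once.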

