
Comments (4)

eliben commented on May 27, 2024

Interesting find. If you have a fix, please submit a pull request.


TysonAndre commented on May 27, 2024

The third-party regex module (unlike the stdlib re) supports atomic grouping, (?>...), which prevents the engine from backtracking into an already-matched group. I'd recommend adding a dependency on regex and using that.

With atomic grouping, these commands complete in milliseconds (both matches and failures) instead of minutes or longer (string_char and bad_escape are the pattern fragments defined in pycparser's c_lexer.py):

>>> import regex
>>> bad_string_literal = '"(?>'+string_char+'*)'+bad_escape+string_char+'*"'
>>> string_to_match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\`\\xFA";\n\n\n\n\n\nin'
>>> regex.match(bad_string_literal, string_to_match)
<regex.Match object; span=(0, 93), match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\\`\\xFA"'>
>>> string_to_match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\\xFA";\n\n\n\n\n\nin'
>>> regex.match(bad_string_literal, r'"\123\123\123\123\123\123\123\123\123\123\123\123\123\123\123";')
>>> regex.match(bad_string_literal, r'"\123\123\123\123\123\123\`\123\123\123\123\123\123\123\123";')
<regex.Match object; span=(0, 60), match='"\\123\\123\\123\\123\\123\\123\\`\\123\\123\\123\\123\\123\\123\\123\\123"'>

Aside: I'm not familiar with the C standard. The regex looks like it allows exactly one invalid escape per string. Should it allow one or more invalid escapes instead (combine the valid and invalid escape patterns and repeat)?


TysonAndre commented on May 27, 2024

Also, it looks like my patch to use regex will cause the test to fail because it no longer rejects "jx\9", but the test seems to have relied on a bug in the old re-based implementation.

I assume the test is checking that invalid octal escapes are rejected, but the implementation in c_lexer.py is deliberately permissive and tries to allow decimal escapes (which would permit \99, \9, etc.):

        self.assertLexerError(r'"jx\9"', ERR_STRING_ESCAPE)

The relevant comment in c_lexer.py:

    # character constants (K&R2: A.2.5.2)
    # Note: a-zA-Z and '.-~^_!=&;,' are allowed as escape chars to support #line
    # directives with Windows paths as filenames (..\..\dir\file)
    # For the same reason, decimal_escape allows all digit sequences. We want to
    # parse all correct code, even if it means to sometimes parse incorrect
    # code.


TysonAndre commented on May 27, 2024

> Sorry, I'm not going to accept this. The bug isn't realistically important enough to add another dependency to pycparser. At this time pycparser has no external deps (PLY is vendored), and adding one is a big step function.

Would you accept PRs for any of these three options?

  1. Try to import regex. If it's available, use the new atomic-grouping pattern; if not, fall back to the old pattern with re.

  2. Run three regex matches with re and return a dummy re.Match emulating the correct behavior:

    1. For the valid string prefix
    2. For a single invalid escape sequence
    3. For a combination of valid and invalid sequences, followed by the closing quote
  3. Try to fix the ambiguity with a negative lookahead.


Also, the root cause is still there. Thinking about this again, \123 can be parsed as \1 followed by the literal characters 23, or as \12 followed by 3, or as \123, so there are three ways of parsing that escape sequence, and every one of them is tried when the regex fails later (each escape sequence makes worst-case matching roughly three times longer).
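To make that ambiguity concrete, here is a toy model (a hypothetical helper, not pycparser code) that enumerates the ways a backslash followed by a digit run can be read:

```python
def octal_parses(digits: str) -> list[tuple[str, str]]:
    """All ways to read '\\' + digits: the escape consumes 1..len(digits)
    digits, and the remainder is read as plain characters."""
    return [('\\' + digits[:i], digits[i:]) for i in range(1, len(digits) + 1)]

# Three readings of \123, so k such escapes give 3**k combinations to try.
print(octal_parses('123'))  # [('\\1', '23'), ('\\12', '3'), ('\\123', '')]
```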

I think the problem is that the BAD_STRING_LITERAL regular expression is doing backtracking, resulting in exponential running time. Add another '\123' to the string and it will take roughly 3 times as long.

Adding a lookahead assertion that the character after \[0-9]+ is not another digit (0-9) might be a better approach to avoiding this exponential backtracking (the same applies to anything else that accepts a variable number of characters). I don't know if Python has had any bugs related to lookahead that I'd need to know about.


https://docs.python.org/2/library/re.html#regular-expression-syntax

(?!...)
Matches if `...` doesn’t match next. This is a negative lookahead assertion. For example, `Isaac (?!Asimov)` will match 'Isaac ' only if it’s not followed by 'Asimov'.
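A sketch of that lookahead fix with the stdlib re, using simplified stand-ins for pycparser's pattern fragments (the names and exact character classes here are assumptions, not the real c_lexer.py definitions):

```python
import re

plain_char = r'[^"\\\n]'            # anything but quote, backslash, newline
octal_fixed = r'\\[0-9]+(?![0-9])'  # lookahead: the escape must swallow every digit
bad_escape = r'\\`'                 # one escape treated as invalid

piece = '(' + plain_char + '|' + octal_fixed + ')'
bad_string_literal = '"' + piece + '*' + bad_escape + piece + '*"'

# A bad escape is present -> matches.
print(bool(re.match(bad_string_literal, r'"\123\123\`\123"')))        # True
# Only valid escapes: the lookahead leaves exactly one parse per \123,
# so failure is found quickly instead of after ~3**30 backtracking attempts.
print(bool(re.match(bad_string_literal, '"' + r'\123' * 30 + '";')))  # False
```

The lookahead removes the ambiguity that fuels the backtracking: \123 can no longer be split as \1 + 23 or \12 + 3, so the engine tries each position only once.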

