Comments (4)
Interesting find. If you have a fix, please submit a pull request
from pycparser.
The `regex` module (not `re`) has functionality that seems like it would help with that: atomic grouping, which prevents the engine from backtracking into the group. I'd recommend adding a dependency on it and using it.
These commands now complete in milliseconds (both matches and failures) instead of minutes or longer.
>>> bad_string_literal = '"(?>'+string_char+'*)'+bad_escape+string_char+'*"'
>>> string_to_match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\`\\xFA";\n\n\n\n\n\nin'
>>> regex.match(bad_string_literal, string_to_match)
<regex.Match object; span=(0, 93), match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\\`\\xFA"'>
>>> string_to_match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\\xFA";\n\n\n\n\n\nin'
>>> regex.match(bad_string_literal, r'"\123\123\123\123\123\123\123\123\123\123\123\123\123\123\123";')
>>> regex.match(bad_string_literal, r'"\123\123\123\123\123\123\`\123\123\123\123\123\123\123\123";')
<regex.Match object; span=(0, 60), match='"\\123\\123\\123\\123\\123\\123\\`\\123\\123\\123\\123\\123\\123\\123\\123"'>
Aside: I'm not familiar with the C standard. The regex looks like it allows exactly 1 invalid escape per string. Should it allow 1 or more invalid escapes instead (combine the valid and invalid escape patterns and repeat)?
Also, it looks like my patch to use `regex` will cause the test to fail because it no longer rejects `"jx\9"`, but the test seems to have relied on a bug in the old `re` implementation.
I assume that it's testing that invalid octal escapes are rejected, but the implementation in c_lexer.py is permissive and tries to allow decimal escapes (which would permit `\99`, `\9`, etc.?)
self.assertLexerError(r'"jx\9"', ERR_STRING_ESCAPE)
# character constants (K&R2: A.2.5.2)
# Note: a-zA-Z and '.-~^_!=&;,' are allowed as escape chars to support #line
# directives with Windows paths as filenames (..\..\dir\file)
# For the same reason, decimal_escape allows all digit sequences. We want to
# parse all correct code, even if it means to sometimes parse incorrect
# code.
Sorry, I'm not going to accept this. The bug isn't realistically important enough to add another dependency to pycparser. At this time pycparser has no external deps (PLY is vendored), and adding one is a big step function.
Would you accept PRs for any of these 3 options?
- Load `regex`. If it exists, use the new regex. If it doesn't, use the old regex with `re`.
- Run 3 regex matches with `re`, and return a dummy re.Match emulating the correct behavior:
  - For the valid string prefix
  - For a single invalid escape sequence
  - For a combination of valid and invalid sequences, followed by the closing quote
- Try to fix ambiguity with negative lookahead.
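A minimal sketch of the first option, for illustration only. The fragment names (`VALID_ESCAPE`, `STRING_CHAR`, `BAD_ESCAPE`) and their character classes are simplified stand-ins that I am assuming here, not the real definitions in c_lexer.py. The idea is to import `regex` if it is installed and use the atomic-group pattern, and otherwise fall back to `re` with the original pattern.

```python
import re

try:
    import regex as engine  # optional third-party module with atomic groups
    USE_ATOMIC = True
except ImportError:
    engine = re             # standard library fallback
    USE_ATOMIC = False

# Simplified stand-ins (assumptions), NOT the real c_lexer.py definitions.
VALID_ESCAPE = r"""\\(?:[a-zA-Z\\'"?]|\d+)"""
STRING_CHAR = r"""(?:[^"\\\n]|""" + VALID_ESCAPE + ")"
BAD_ESCAPE = r"""\\[^a-zA-Z\\'"?0-9\n]"""

if USE_ATOMIC:
    # (?>...) keeps the engine from re-trying other splits of the valid prefix.
    BAD_STRING_LITERAL = '"(?>' + STRING_CHAR + "*)" + BAD_ESCAPE + STRING_CHAR + '*"'
else:
    # Same pattern without the atomic group; still prone to backtracking.
    BAD_STRING_LITERAL = '"' + STRING_CHAR + "*" + BAD_ESCAPE + STRING_CHAR + '*"'

bad_string_re = engine.compile(BAD_STRING_LITERAL)

# Short inputs, so both variants answer quickly.
has_bad_escape = bad_string_re.match(r'"abc\`def"') is not None
no_bad_escape = bad_string_re.match(r'"abc"') is not None
```

With this layout the dependency stays optional, but the fallback path keeps the slow worst case, so it only helps users who have `regex` installed.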
Also, the root cause is still there. Thinking about this again, I assume that `\123` can be parsed as `\1`, then `2`+`3`, or `\12`, then `3`, or `\123`, which means there would be 3 ways of parsing that escape sequence, and every single one would be tried if the regex failed somewhere (so each escape sequence makes worst-case matching 3 times longer).
I think the problem is that the BAD_STRING_LITERAL regular expression is doing backtracking, resulting in exponential running time. Add another '\123' to the string and it will take roughly 3 times as long.
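To make the "3 times as long" estimate concrete, here is a small sketch that counts how many ways a string body can be split into characters when a decimal escape is a backslash followed by one or more digits. The helper `count_parses` is hypothetical and only models the ambiguity described above; it is not part of pycparser.

```python
def count_parses(body: str) -> int:
    """Count the ways to split `body` into string characters.

    Assumed model: a backslash followed by one or more digits is an escape
    of any length, and every other character stands for itself.
    """
    n = len(body)
    ways = [0] * (n + 1)
    ways[n] = 1  # one way to parse the empty remainder
    for i in range(n - 1, -1, -1):
        if body[i] == "\\":
            j = i + 1
            # Each possible escape length is a separate choice point.
            while j < n and body[j] in "0123456789":
                j += 1
                ways[i] += ways[j]
        else:
            ways[i] = ways[i + 1]
    return ways[0]


one_escape = count_parses(r"\123")          # 3
fifteen_escapes = count_parses(r"\123" * 15)  # 3 ** 15
```

Under this model the count triples with every additional `\123`, which is the set of alternatives a backtracking engine may have to exhaust before it can report a failed match.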
Adding a lookahead assertion that the character after `\[0-9]+` is not a digit (0 through 9) might be a better approach to avoiding this exponential backtracking (same for anything else accepting variable numbers of characters). I don't know if Python has had any bugs related to lookahead that I need to know about.
https://docs.python.org/2/library/re.html#regular-expression-syntax
(?!...)
Matches if `...` doesn’t match next. This is a negative lookahead assertion. For example, `Isaac (?!Asimov)` will match 'Isaac ' only if it’s not followed by 'Asimov'.
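A rough sketch of the lookahead idea using only the standard `re` module. The fragments below are simplified stand-ins that I am assuming for illustration, not the actual c_lexer.py patterns. The `(?!\d)` after `\d+` means a decimal escape can only end where the digits end, so each escape has a single parse.

```python
import re

# Simplified stand-ins (assumptions), NOT the real c_lexer.py definitions.
DECIMAL_ESCAPE = r"\\\d+(?!\d)"        # digits must not be followed by a digit
OTHER_ESCAPE = r"""\\[a-zA-Z\\'"?]"""
STRING_CHAR = r"""(?:[^"\\\n]|""" + DECIMAL_ESCAPE + "|" + OTHER_ESCAPE + ")"
BAD_ESCAPE = r"""\\[^a-zA-Z\\'"?0-9\n]"""
BAD_STRING_LITERAL = '"' + STRING_CHAR + "*" + BAD_ESCAPE + STRING_CHAR + '*"'

bad_string_re = re.compile(BAD_STRING_LITERAL)

# A valid literal: the pattern should fail, and fail quickly.
valid_literal = '"' + r"\123" * 15 + '";'
# One invalid escape (\`) in the middle: the pattern should match.
invalid_literal = '"' + r"\123" * 6 + r"\`" + r"\123" * 8 + '";'

no_match = bad_string_re.match(valid_literal)
match = bad_string_re.match(invalid_literal)
```

Here `no_match` is `None` and `match` covers the whole quoted literal. The same kind of lookahead would be needed on any other variable-length escape, such as hex escapes.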