This appears to trigger exponential parsing time:

Extremely slow parsing for certain pathological input (sirthias/pegdown)

Comments (15)

eamelink commented on August 28, 2024

I noticed the same thing in 1.1.0, incidentally also in a piece of SQL. I guess the underscores have something to do with it?

This is the simplified input:

'a, a(a_a, ';', a_a), a_a_a_a=`a-a-a` a_a=`a-a-a` a, a(a_a, ';', a_a), a_a_a_a=`a-a-a` a_a=a(`a-a-a`,1,2) a, a(a_a, ';', a_a), a_a_a_a=`a-a-a` a_a=a(`a-a-a`,1,1) a, a(a_a, ';', a_a), a_a_a_a=`a-a-a`'

from pegdown.

sonson commented on August 28, 2024

I also have problems with slow parsing of words with inline underscores, for example:

A_1 A_2 A_3 A_4 A_5 A_6 A_7 A_8 A_9 A_0
B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_0

If I add even more words like C_1 to the text, parsing becomes much slower.

The problem still exists in the 1.2.0 release.

Elmervc commented on August 28, 2024

Trying to find the culprit.

Focussing on StrongOrEmph, and for simplicity only on the _-character: the problem probably relates to the parser trying to match the remaining symbols (the suffix) after a _-char as part of a single StrongOrEmph-sequence. If that suffix contains another _, the parser first tries to match it as a UlOrStarLine-sequence (this is the first rule in Inline's method-chain of rules that matches the _ inside the StrongOrEmph-sequence) and then continues to match the new suffix. If parsing of the suffix eventually finishes, the parser still needs to match the closing _, which might already have been eaten by the suffix-parsing. That suffix-parsing is then invalidated, and the parser tries to find the next valid path that conforms to the parsing rules, which in turn may lead to the same behavior within that suffix, and so on. As a result, parse time grows exponentially with the number of added characters.

Even when simple alphanumerics are added on a new line, the parse time seems to explode, which is interesting, because a StrongOrEmph-sequence only allows Inline constructs. That might hint at a different rule causing this parser behavior.

I will debug this further. Still, my intuition says that we should somehow model the rules such that the parser is encouraged to treat the first StrongOrEmph-char (_/__/*/**) it encounters, in the context of an unfinished StrongOrEmph-sequence xSeq, as the closing StrongOrEmph-char for xSeq.
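The blow-up described above can be reproduced with a toy model. The sketch below is an assumption-laden illustration in Python, not pegdown's actual Java grammar: it implements a hypothetical four-rule PEG (`Inline <- Emph / Char`, `Emph <- '_' (!Close Inline)+ Close`, `Close <- '_' !Alnum`) without memoization and counts how often the `Emph` rule is entered at an underscore.

```python
def count_emph_attempts(text):
    """Parse text with a toy, non-memoizing PEG; return the number of Emph attempts."""
    attempts = 0

    def close(i):
        # Close <- '_' !Alnum  (an underscore that is not followed by a letter or digit)
        if i < len(text) and text[i] == "_":
            if not (i + 1 < len(text) and text[i + 1].isalnum()):
                return i + 1
        return None

    def emph(i):
        # Emph <- '_' (!Close Inline)+ Close
        nonlocal attempts
        if i >= len(text) or text[i] != "_":
            return None
        attempts += 1
        j = i + 1
        matched = 0
        while j < len(text) and close(j) is None:
            j = inline(j)  # may recurse into a nested Emph at every later '_'
            matched += 1
        return close(j) if matched else None

    def inline(i):
        # Inline <- Emph / Char  (ordered choice: Emph is always tried first)
        end = emph(i)
        return end if end is not None else i + 1

    i = 0
    while i < len(text):
        i = inline(i)
    return attempts
```

With an input such as `"a_1 " * n`, where no underscore can ever close a sequence, every failed `Emph` attempt re-tries all later underscores, so this toy performs `2**n - 1` attempts: each extra word roughly doubles the work, matching the behavior reported in the comments above.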

Elmervc commented on August 28, 2024

An update: when I only allow a Space() character as the first one in the StrongOrEmph-sequence, parse time becomes normal on some sample input (36ms, whereas the original takes ... waiting ... 1369627ms).

This is definitely not a fix yet, as it does not allow StrongOrEmph at a line-start. Trying to figure out how the parser should only allow a StrongOrEmph-construct after a whitespace OR at line-start. Any suggestions?
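One way to express the "after whitespace OR at line-start" condition is as a check on the character before the delimiter. The following is a minimal Python sketch of that idea only; `may_open_emphasis` is a hypothetical helper, not part of pegdown or parboiled, and it says nothing about how the check would be wired into the grammar.

```python
def may_open_emphasis(text, i):
    """Return True if the delimiter at index i may open an emph/strong sequence.

    Assumption: an opener is allowed at the very start of the input, at the
    start of a line, or directly after any whitespace character.
    """
    if text[i] not in "_*":
        return False
    # A preceding newline counts as whitespace, which covers the line-start case.
    return i == 0 or text[i - 1].isspace()
```

For the sample input `s_o_m_e _emph_`, the underscores inside `s_o_m_e` are rejected as openers, while the one before `emph` is accepted.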

Furthermore, I think this restriction will make the output closer to the user's purpose of using _/__/*/**.
Here is some output from my test.
Note: with the patched parser the last line is not strong, because it starts at the start of a line.

---------Input--------
s_o_m_e _emph_ t_e_x_t_ 
 some words on a new line 
 some o_t_h_e_r _e_m_p_ t_e_xt
 _emphasized sentence_
 _emphasized multi-
sentence_
__A _v_e_r_y strong sentence__
----------------------
-----Output-patched-----
<p>s_o_m_e <em>emph</em> t_e_x_t_<br/>
 some words on a new line<br/>
 some o_t_h_e_r <em>e_m_p</em> t_e_xt<br/>
 <em>emphasized sentence</em><br/>
 <em>emphasized multi-<br/>
sentence</em><br/>
__A <em>v_e_r_y strong sentence</em>_</p>
----------------------
Took: 36ms

------Output-orig-------
<p>s_o_m_e <em>emph</em> t_e_x<em>t</em><br/>
 some words on a new line<br/>
 some o_t_h_e_r _e_m<em>p</em> t_e<em>xt<br/>
 <em>emphasized sentence</em><br/>
 <em>emphasized multi-<br/>
sentence</em><br/>
</em>_A _v_e<em>r<em>y strong sentence</em></em></p>
----------------------
Took: 1369627ms

Elmervc commented on August 28, 2024

I've created a patched parser that checks if it is allowed to enter a StrongOrEmph-sequence. See this commit: Elmervc@bd636ee

It seems to work nicely, does not cause much overhead, and treats emph/strong more sensibly, as it only allows them after whitespace or at the start of a line. It also handles nested emph/strong constructs nicely.

Some tests and comparisons with original parser, including parse times:
http://pastebin.com/c87ksj7w

jbcpollak commented on August 28, 2024

I'm not sure how relevant this is, but this issue seems related to an issue I filed: #78.

Forgive me if this doesn't add any new information, but GitHub's markdown processor ignores * (or, I assume, _) characters that don't complete before the hard return. For example:

_This one line
_this is a second

When I add an _ at the end of each line:

This one line
this is a second

I don't know how flexible the parboiled processor is, but I would suggest just ignoring any unbalanced _ or *'s when a newline is encountered. This is what GitHub's parser does:

foo emp foo emp foo _bar

Which is:

    foo _emp_ foo _emp_ foo _bar
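The suggestion above (drop unbalanced delimiters at a newline) can be sketched in a few lines of Python. This is only an illustration of the idea for `_`, under the assumption that emphasis never spans a line and never nests; it does not reproduce GitHub's or pegdown's actual rules, and `render_emphasis` is a hypothetical name.

```python
import re

# An underscore, then one or more characters that are neither '_' nor a
# newline, then a closing underscore. The pattern cannot cross a line break.
_EMPH = re.compile(r"_([^_\n]+)_")

def render_emphasis(text):
    """Wrap balanced _..._ pairs on a single line in <em>; leave the rest as-is."""
    return _EMPH.sub(r"<em>\1</em>", text)
```

Applied to `foo _emp_ foo _emp_ foo _bar`, the two balanced pairs become `<em>emp</em>` and the trailing `_bar` is left untouched, which is the graceful degradation described above.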

Elmervc commented on August 28, 2024

Hi jbcpollak,

The slow parsing mentioned in #78 is indeed related to the StrongOrEmph-construct parsing. My patched parser is not able to parse the input provided in #78 within a 10 sec timeout.

However, if I change the StrongOrEmph-construct to not allow line breaks, it finishes in 6ms :)

Follow-up question is:

Should we allow line breaks ?

If we follow the original Markdown specification, assuming the dingus follows that specification... the answer should be a conditional yes. It does the same as GitHub does (except for the hard wraps):
Input:

_what does
github_ do?

_what does
_ github_ do?

_what does  
_github do?

output:
what does
github do?

what does
_ github do?

_what does
_github do?

jbcpollak commented on August 28, 2024

I don't really like the idea of breaking a specification, even one as loose as the Markdown one. However, I think the utility of tracking StrongOrEmph across hard returns is limited. It doesn't strike me as something most people would expect to work or rely on.

It's also something users can quickly recover from when they realize it doesn't work.

The alternative gracefully degrades and the behavior is manageable to the user, so I'd vote for stopping analysis after a hard return.

It also solves the issue sooner rather than later. :)

sirthias commented on August 28, 2024

Elmer,
thanks for your very detailed analysis of this problem!
Unfortunately PEG parsing, while being conceptually simple, has this problem of potentially exponential run-time and large and complex grammars like the one we are dealing with here can be a real PITA to fix.
The solution for curbing exponential run-time (which has worked a number of times on other areas of the grammar) is to add one or more "simple" syntactic predicates (i.e. Test or TestNot rules) at the right places that prevent the parser from entering a pathological rule pattern.
I don't currently have time to dig into this issue deeper and I bet you currently have a better grasp of how the parser works in that particular area, so maybe you have an idea of what predicate we might want to add?
Until we see clearly that a syntactic predicate is not going to help I'd be reluctant to break the current linebreak handling.
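For readers unfamiliar with syntactic predicates, here is a rough Python sketch of what they do. The helpers `lit`, `seq`, `lookahead` and `not_lookahead` are hypothetical stand-ins written for this example (the last two play the role of the Test and TestNot rules mentioned above); they are not parboiled's API. A rule here is a function that takes the text and a position and returns the new position, or `None` on failure.

```python
def lit(s):
    """Rule that matches the literal string s."""
    def rule(text, i):
        return i + len(s) if text.startswith(s, i) else None
    return rule

def seq(*rules):
    """Rule that matches all given rules one after another."""
    def rule(text, i):
        for r in rules:
            i = r(text, i)
            if i is None:
                return None
        return i
    return rule

def lookahead(rule):
    """And-predicate: succeeds if rule matches here, but consumes nothing."""
    def pred(text, i):
        return i if rule(text, i) is not None else None
    return pred

def not_lookahead(rule):
    """Not-predicate: succeeds if rule does NOT match here; consumes nothing."""
    def pred(text, i):
        return i if rule(text, i) is None else None
    return pred

# Example guard: an opening '_' is only accepted when no space follows it.
opener = seq(lit("_"), not_lookahead(lit(" ")))
```

Because a predicate only inspects the input and never consumes it, a cheap check like `opener` can reject a position before a more expensive rule is entered.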

Another hint: the grammar is largely based on this one from peg-markdown, so it might be worth checking if (and how) John might have fixed this issue on his side...

Thanks again for your help!

jbcpollak commented on August 28, 2024

Playing with GitHub's renderer, it seems to ignore all _ followed by non-whitespace when looking for a closing tag. Does that help at all?

Elmervc commented on August 28, 2024

sirthias wrote:

Unfortunately PEG parsing, while being conceptually simple, has this problem of potentially exponential run-time and large and complex grammars like the one we are dealing with here can be a real PITA to fix.

I cannot disagree ;)

The solution for curbing exponential run-time (which has worked a number of times on other areas of the grammar) is to add one or more "simple" syntactic predicates (i.e. Test or TestNot rules) at the right places that prevent the parser from entering a pathological rule pattern.

I've played a lot with these predicates. Unfortunately, in the context of StrongOrEmph, I often ran into a problem related to the SuperNode children becoming out of sync for some reason. Need some time to understand why.

Another hint: the grammar is largely based on this one from peg-markdown, so it might be worth checking if (and how) John might have fixed this issue on his side...

I've had a quick look but it seems that this grammar has the same problem. (not completely sure yet)

jbcpollak wrote:

Playing with GitHub's renderer, it seems to ignore all _ followed by non-whitespace when looking for a closing tag. Does that help at all?

That behaviour is already implemented in the current pegdown parser (code here).

I just tested the examples from my previous comment. It seems that both the unpatched and patched pegdown parsers fail to parse the second snippet "correctly".
It also revealed that some checks are missing in the patched parser.

I will try to improve the patched parser further, hopefully keeping multi-line emph/strong supported.

Update

Some clarification on the problem:
I think the general problem is that it is a PEG-parser. From wikipedia:

the choice operator selects the first match in PEG

Take the fragment from issue #78 with multiple *-characters distributed over multiple Inline-constructs. The parser will try to match the 1st * as StrongOrEmph, the 2nd * as a nested StrongOrEmph inside the 1st, the 3rd * as part of ..., and so on, without any guarantee that enough closing * characters are left for the stack of unfinished StrongOrEmph-constructs. This forces the parser to traverse an exponential number of paths.
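The "first match wins" behavior of the PEG choice operator can be shown with a tiny Python sketch. `lit` and `choice` are hypothetical helpers written for this example, not parboiled functions; a rule returns the position after the match, or `None` on failure.

```python
def lit(s):
    """Rule that matches the literal string s."""
    def rule(text, i):
        return i + len(s) if text.startswith(s, i) else None
    return rule

def choice(*rules):
    """PEG ordered choice: commit to the first alternative that matches."""
    def rule(text, i):
        for r in rules:
            end = r(text, i)
            if end is not None:
                return end  # later alternatives are never considered
        return None
    return rule
```

On the input `ab`, `choice(lit("a"), lit("ab"))` stops after `a` because the first alternative already matched, whereas `choice(lit("ab"), lit("a"))` consumes both characters. The order of alternatives therefore decides which interpretation of a delimiter the parser tries first.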

Elmervc commented on August 28, 2024

Progressing...

I've now implemented a restriction that disallows starting to parse a StrongOrEmph within a StrongOrEmph iff the parent StrongOrEmph is of the same type (using a new node type which extends the previously used SuperType). This significantly drops the parse times for some of my tests (eg 39ms->3ms), while still allowing multi-line emph/strong constructs. And it does parse the #78 fragment in 61ms 👍
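The effect of such a restriction can be illustrated on a toy grammar. The Python sketch below is an assumption, not the patch itself: it models a single emphasis type (`_`) with the hypothetical rules `Inline <- Emph / Char`, `Emph <- '_' (!Close Inline)+ Close`, `Close <- '_' !Alnum`, and a flag that forbids entering `Emph` while already inside one.

```python
def emph_attempts(text, allow_same_type_nesting):
    """Count Emph attempts of a toy PEG, with or without same-type nesting."""
    attempts = 0

    def close(i):
        # Close <- '_' !Alnum
        if i < len(text) and text[i] == "_":
            if not (i + 1 < len(text) and text[i + 1].isalnum()):
                return i + 1
        return None

    def emph(i, inside):
        nonlocal attempts
        if i >= len(text) or text[i] != "_":
            return None
        if inside and not allow_same_type_nesting:
            return None  # the restriction: no Emph inside an unfinished Emph
        attempts += 1
        j = i + 1
        matched = 0
        while j < len(text) and close(j) is None:
            j = inline(j, True)
            matched += 1
        return close(j) if matched else None

    def inline(i, inside):
        # Inline <- Emph / Char
        end = emph(i, inside)
        return end if end is not None else i + 1

    i = 0
    while i < len(text):
        i = inline(i, False)
    return attempts
```

For the input `"a_1 " * 10`, this toy makes 1023 `Emph` attempts when nesting is allowed and 10 when it is not, since every underscore is then tried only once at the top level.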

TODO: have this snippet working:

_what does
_ github_ do?

sirthias commented on August 28, 2024

Nice, Elmer!
Let me know when you think you have something that's ready for merging...

Elmervc commented on August 28, 2024

Will probably do a pull request this week. I need to do some more testing and fix some corner cases.

Elmervc commented on August 28, 2024

After repeatedly rewriting and retesting, I finally managed to patch the parser in such a way that it does exactly the same as original Markdown, except for one case: __where and _emph__ may start within a strong and end_ outside it _and __vice_ versa__. But that's no problem: you can easily rewrite such input to get the same result if you really, really want such a construct ;)

The patch now includes various checks within the StrongOrEmph-rules. An important change is that it now accepts unclosed strong/emph-sequences as AST nodes to be serialized (parsing time for #78 drops further to 6ms). The ToHtmlSerializer then decides whether to print the child inline nodes decorated with emph/strong tags (for a closed sequence) or only preceded by the opening char (for an unclosed sequence).
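The division of labor described above (the parser always produces a node, the serializer decides how to print it) might look roughly like this. The sketch is in Python with hypothetical `Text` and `Emph` classes and a `closed` flag; pegdown's real node types and its ToHtmlSerializer are Java classes and are not reproduced here. HTML escaping is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Text:
    value: str

@dataclass
class Emph:
    children: List[Union["Text", "Emph"]]
    closed: bool          # False when no closing delimiter was found
    marker: str = "_"     # the opening delimiter as written in the source

def to_html(node):
    """Serialize a node; unclosed sequences fall back to their literal opener."""
    if isinstance(node, Text):
        return node.value
    inner = "".join(to_html(child) for child in node.children)
    if node.closed:
        return "<em>" + inner + "</em>"
    return node.marker + inner
```

Since an unclosed sequence is still a valid parse result in this scheme, the parser does not have to backtrack and re-parse its children when the closer turns out to be missing.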

Preview:
Here is the output of 18 tests I used to validate the patch. (includes patched and unpatched output + parse times)

Parse-times may drop further when I'm done cleaning up code + removing debug statements. Then it's ready for pull-request. Will be there soon.
