Comments (15)
I noticed the same thing in 1.1.0, incidentally also in a piece of SQL. I guess the underscores have something to do with it?
This is the simplified input:
'a, a(a_a, ';', a_a), a_a_a_a=`a-a-a` a_a=`a-a-a` a, a(a_a, ';', a_a), a_a_a_a=`a-a-a` a_a=a(`a-a-a`,1,2) a, a(a_a, ';', a_a), a_a_a_a=`a-a-a` a_a=a(`a-a-a`,1,1) a, a(a_a, ';', a_a), a_a_a_a=`a-a-a`'
from pegdown.
I also have problems with slow parsing of words with inline underscores, for example:
A_1 A_2 A_3 A_4 A_5 A_6 A_7 A_8 A_9 A_0
B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_0
If I add even more words like C_1 to the text, the parsing becomes much slower.
The problem still exists in the 1.2.0 release.
Trying to find the culprit.
Focusing on StrongOrEmph, and for simplicity only on the _-character for now: it probably relates to the parser trying to match the remaining symbols (the suffix) after a _-char as part of a single StrongOrEmph-sequence. If that suffix contains another _, the parser will first try to match it as a UlOrStarLine-sequence (as this is the first rule in the method-chain of rules that matches the _ in rule Inline, which is part of the StrongOrEmph-sequence) and then continues to match the new suffix. If parsing of the suffix eventually finishes, the parser still needs to match the closing _, which might already have been eaten by the suffix-parsing. That suffix-parsing is now invalidated, and the parser tries to find the next valid path that conforms to the parsing rules, which in turn may lead to the same behavior even within that suffix, and so on -> parse time grows exponentially with the number of added characters. Even when simple alphanumerics are added on a new line, the parse time seems to explode, which is interesting, because a StrongOrEmph-sequence only allows Inline constructs. That might hint at a different rule causing this parser behavior.
I will debug some further. Still, my intuition says that we should somehow model the rules such that the parser is encouraged to treat the first StrongOrEmph-char (_/__/*/**) it encounters, in the context of an unfinished StrongOrEmph-sequence xSeq, as the closing StrongOrEmph-char for xSeq.
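The backtracking described above can be reproduced with a toy recursive-descent matcher. This is only a sketch under stated assumptions: the class ToyEmphParser, its three rules (Inline, Emph, Close) and the helper words() are made up for illustration and are much simpler than pegdown's real grammar. It counts how often the Emph rule is entered on input like A_0 A_1 A_2, where no underscore can act as a closer because each one is followed by an alphanumeric.

```java
// Toy grammar (hypothetical, NOT pegdown's actual rules):
//   Inline <- Emph / AnyChar
//   Emph   <- '_' (!Close Inline)+ Close
//   Close  <- '_' !Alnum
final class ToyEmphParser {
    private final String in;
    long emphAttempts = 0; // how often Emph was entered at an '_' character

    ToyEmphParser(String in) { this.in = in; }

    /** Inline at pos (pos must be inside the input). Returns the position after the match. */
    int inline(int pos) {
        int end = emph(pos);
        return end >= 0 ? end : pos + 1; // ordered choice: fall back to a plain character
    }

    /** Emph at pos. Returns the end position, or -1 if no closer is found. */
    int emph(int pos) {
        if (in.charAt(pos) != '_') return -1;
        emphAttempts++;
        int p = pos + 1;
        int children = 0;
        while (p < in.length() && !isClose(p)) {
            p = inline(p); // may recurse into a nested Emph that re-scans the rest
            children++;
        }
        return (children > 0 && p < in.length()) ? p + 1 : -1;
    }

    /** Close: an '_' that is not followed by a letter or digit. */
    private boolean isClose(int p) {
        return in.charAt(p) == '_'
                && (p + 1 >= in.length() || !Character.isLetterOrDigit(in.charAt(p + 1)));
    }

    /** Parses the whole input as Inline* and returns the number of Emph attempts. */
    static long countAttempts(String input) {
        ToyEmphParser parser = new ToyEmphParser(input);
        int pos = 0;
        while (pos < input.length()) pos = parser.inline(pos);
        return parser.emphAttempts;
    }

    /** Builds m words of the form A_<digit>, separated by single spaces. */
    static String words(int m) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < m; i++) {
            if (i > 0) sb.append(' ');
            sb.append("A_").append(i % 10);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int m = 1; m <= 16; m++) {
            System.out.println(m + " words -> " + countAttempts(words(m)) + " Emph attempts");
        }
    }
}
```

In this toy grammar each added word roughly doubles the number of Emph attempts, which is consistent with the exponential growth suspected above. It says nothing about which pegdown rule is actually responsible.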
Some update: When I only allow a Space() character as the first one in the StrongOrEmph-sequence, parse time becomes normal on some sample input (36ms, where the original takes ... waiting ... 1369627ms).
This is definitely not a fix yet, as it does not allow StrongOrEmph at a line start. I'm trying to figure out how the parser can allow a StrongOrEmph-construct only after whitespace OR at a line start. Any suggestions?
Furthermore, I think this restriction will bring the output closer to what the user intends when using _/__/*/**.
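The restriction discussed above (emphasis may only open at a line start or after whitespace) boils down to a one-character look-behind. A minimal sketch in Java, assuming a hypothetical helper EmphOpenGuard.mayOpenAt that is not part of pegdown or parboiled; how such a check would be wired into the grammar is left open here:

```java
final class EmphOpenGuard {
    /**
     * Returns true if an emphasis opener at index pos sits at the start of the
     * text, at the start of a line, or directly after a space or tab.
     */
    static boolean mayOpenAt(CharSequence text, int pos) {
        if (pos == 0) return true; // start of input counts as a line start
        char prev = text.charAt(pos - 1);
        return prev == '\n' || prev == '\r' || prev == ' ' || prev == '\t';
    }

    public static void main(String[] args) {
        String sample = "s_o_m_e _emph_ t_e_x_t_";
        for (int i = 0; i < sample.length(); i++) {
            if (sample.charAt(i) == '_') {
                System.out.println("'_' at index " + i + " may open: " + mayOpenAt(sample, i));
            }
        }
    }
}
```

On the sample line, the underscores inside s_o_m_e are rejected as openers, while the one directly before emph is accepted.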
Here is some output from my test:
note: the last line using the patched parser is not strong, because it starts at line start
---------Input--------
s_o_m_e _emph_ t_e_x_t_
some words on a new line
some o_t_h_e_r _e_m_p_ t_e_xt
_emphasized sentence_
_emphasized multi-
sentence_
__A _v_e_r_y strong sentence__
----------------------
-----Output-patched-----
<p>s_o_m_e <em>emph</em> t_e_x_t_<br/>
some words on a new line<br/>
some o_t_h_e_r <em>e_m_p</em> t_e_xt<br/>
<em>emphasized sentence</em><br/>
<em>emphasized multi-<br/>
sentence</em><br/>
__A <em>v_e_r_y strong sentence</em>_</p>
----------------------
Took: 36ms
------Output-orig-------
<p>s_o_m_e <em>emph</em> t_e_x<em>t</em><br/>
some words on a new line<br/>
some o_t_h_e_r _e_m<em>p</em> t_e<em>xt<br/>
<em>emphasized sentence</em><br/>
<em>emphasized multi-<br/>
sentence</em><br/>
</em>_A _v_e<em>r<em>y strong sentence</em></em></p>
----------------------
Took: 1369627ms
I've created a patched parser that checks whether it is allowed to enter a StrongOrEmph-sequence. See this commit: Elmervc@bd636ee
It seems to work nicely, does not cause much overhead and treats emph/strong more sensibly, as it only allows them after whitespace or at the start of a new line. It also handles nested emph/strong constructs nicely.
Some tests and comparisons with original parser, including parse times:
http://pastebin.com/c87ksj7w
I'm not sure how relevant this is, but this issue seems related to an issue I filed: #78.
Forgive me if this doesn't add any new information, but GitHub's markdown processor ignores * (or, I assume, _) characters that aren't closed before the hard return. For example:
_This one line
_this is a second
When I add an _ at the end of each line:
This one line
this is a second
I don't know how flexible the parboiled processor is, but I would suggest just ignoring any unbalanced _ or *'s when a newline is encountered. This is what GitHub's parser does:
foo emp foo emp foo _bar
Which is:
foo _emp_ foo _emp_ foo _bar
Hi jbcpollak,
The slow parsing mentioned in #78 is indeed related to the StrongOrEmph-construct parsing. My patched parser is not able to parse the input provided in #78 within a 10 sec timeout.
However, if I change the StrongOrEmph-construct to not allow line breaks, it finishes in 6ms :)
The follow-up question is: should we allow line breaks?
If we follow the original Markdown specification, and assuming the dingus follows that specification, the answer should be a conditional yes. The dingus does the same as GitHub does (except for the hard wraps):
Input:
_what does
github_ do?
_what does
_ github_ do?
_what does
_github do?
output:
what does
github do?
what does
_ github do?
_what does
_github do?
I don't really like the idea of breaking a specification, even one as loose as the Markdown one. However, I think the utility of tracking StrongEmp across hard returns is limited. It doesn't strike me as something most people would expect to work or rely on.
It's also something users can quickly recover from once they realize it doesn't work.
The alternative degrades gracefully and the behavior is manageable for the user, so I'd vote for stopping analysis after a hard return.
It also solves the issue sooner rather than later. :)
Elmer,
thanks for your very detailed analysis of this problem!
Unfortunately PEG parsing, while being conceptually simple, has this problem of potentially exponential run-time, and large and complex grammars like the one we are dealing with here can be a real PITA to fix.
The solution for curbing exponential run-time (which has worked a number of times in other areas of the grammar) is to add one or more "simple" syntactic predicates (i.e. Test or TestNot rules) at the right places that prevent the parser from entering a pathological rule pattern.
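To illustrate the kind of cheap check such a predicate might express, here is a minimal sketch in Java. It is an assumption on my side, not an existing pegdown rule: the class CloserLookahead and the method closerAhead are hypothetical, and the definition of a closer (an _ that is not followed by a letter or digit, found before the next blank line) is a simplification. The idea is to scan ahead once, in linear time, before entering the expensive emphasis rule.

```java
final class CloserLookahead {
    /**
     * Scans forward from index 'from' up to the next blank line (or the end of
     * the input) and reports whether some '_' exists that is not followed by a
     * letter or digit, i.e. something that could plausibly close an emphasis.
     */
    static boolean closerAhead(CharSequence text, int from) {
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean last = i + 1 == text.length();
            if (c == '\n' && !last && text.charAt(i + 1) == '\n') {
                return false; // blank line: the paragraph ends without a closer
            }
            if (c == '_' && (last || !Character.isLetterOrDigit(text.charAt(i + 1)))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(closerAhead("A_1 A_2 A_3", 2));
        System.out.println(closerAhead("_emph_ text", 1));
    }
}
```

A guard like this only prunes the case where no closer exists at all. Input with a single late closer would still pass the check, so it is at best a partial measure.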
I don't currently have time to dig into this issue deeper and I bet you currently have a better grasp of how the parser works in that particular area, so maybe you have an idea of what predicate we might want to add?
Until we see clearly that a syntactic predicate is not going to help I'd be reluctant to break the current linebreak handling.
Another hint: the grammar is largely based on this one from peg-markdown, so it might be worth checking if (and how) John might have fixed this issue on his side...
Thanks again for your help!
Playing with GitHub's renderer, it seems to ignore all _ followed by non-whitespace when looking for a closing tag. Does that help at all?
sirthias wrote:
Unfortunately PEG parsing, while being conceptually simple, has this problem of potentially exponential run-time and large and complex grammars like the one we are dealing with here can be a real PITA to fix.
I cannot disagree ;)
The solution for curbing exponential run-time (which has worked a number of times on other areas of the grammar) is to add one or more "simple" syntactic predicates (i.e. Test or TestNot rules) at the right places that prevent the parser from entering a pathological rule pattern.
I've played a lot with these predicates. Unfortunately, in the context of StrongOrEmph, I often ran into a problem where the SuperNode children got out of sync for some reason. I need some time to understand why.
Another hint: the grammar is largely based on this one from peg-markdown, so it might be worth checking if (and how) John might have fixed this issue on his side...
I've had a quick look but it seems that this grammar has the same problem. (not completely sure yet)
jbcpollak wrote:
Playing with GitHub's renderer, it seems to ignore all _ followed by non-whitespace when looking for a closing tag. Does that help at all?
That behaviour is already implemented in the current pegdown parser (code here).
I just tested the examples from my previous comment. It seems that both the unpatched and the patched pegdown parsers fail to parse the second snippet "correctly".
It also revealed some checks are missing in the patched parser.
I will try to improve the patched parser further, hopefully keeping multi-line emph/strong supported.
Update
Some clarification on the problem:
I think the general problem is that it is a PEG parser. From Wikipedia:
the choice operator selects the first match in PEG
Take the fragment of issue #78 with multiple *-characters distributed over multiple Inline-constructs. The parser will try to match the 1st * as StrongOrEmph, the 2nd * as a nested StrongOrEmph in the 1st StrongOrEmph, the 3rd * as part of .. etc, without the guarantee that enough closing * are left for the stack of unfinished StrongOrEmph-constructs, which forces the parser to traverse an exponential number of paths.
Progressing...
I've now implemented a restriction that disallows starting to parse a StrongOrEmph within a StrongOrEmph iff the parent StrongOrEmph is of the same type (using a new node type which extends the previously used SuperType). This significantly drops the parse times for some of my tests (e.g. 39ms->3ms), while still allowing multi-line emph/strong constructs. And it does parse the #78 fragment in 61ms 👍
TODO: have this snippet working:
_what does
_ github_ do?
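The nesting restriction described in this comment can be sketched as a small bookkeeping class. This is a hypothetical illustration in Java, not the code from the patch: EmphNesting, Kind, mayEnter, enter and leave are names invented here, and the real patch tracks the parent through a node type rather than a separate stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class EmphNesting {
    enum Kind { EMPH, STRONG }

    // Currently open emphasis constructs, innermost first.
    private final Deque<Kind> open = new ArrayDeque<>();

    /** A construct may be entered unless its direct parent is of the same kind. */
    boolean mayEnter(Kind kind) {
        return open.peek() != kind; // peek() is null when nothing is open
    }

    void enter(Kind kind) {
        if (!mayEnter(kind)) {
            throw new IllegalStateException("parent is already " + kind);
        }
        open.push(kind);
    }

    void leave() {
        open.pop();
    }

    public static void main(String[] args) {
        EmphNesting nesting = new EmphNesting();
        nesting.enter(Kind.EMPH);
        System.out.println("emph inside emph allowed: " + nesting.mayEnter(Kind.EMPH));
        System.out.println("strong inside emph allowed: " + nesting.mayEnter(Kind.STRONG));
    }
}
```

Because the parser no longer opens an emph directly inside an emph (or a strong directly inside a strong), the next delimiter of the same kind can only act as a closer or as plain text, which removes one source of nested retries.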
Nice, Elmer!
Let me know when you think you have something that's ready for merging...
Will probably do a pull request this week. I still need to do some more testing and fix some corner cases.
After repeatedly rewriting and retesting, I finally managed to patch the parser in such a way that it does exactly the same as original Markdown, except for 1 case: __where an _emph__ may start within a strong and end_ outside it _and __vice_ versa__. But that's no problem, you can easily rewrite this to do the same if you really really want to have such a construct ;)
The patch now includes various checks within the StrongOrEmph-rules. An important change is that it now accepts unclosed strong/emph-sequences as AST nodes to be serialized (-> parsing time for #78 drops further to 6ms). The ToHtmlSerializer will decide to print the child inline-nodes decorated with emph/strong tags (in case of a closed sequence), or only preceded by the opening char (in case of an unclosed sequence).
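The serializer decision described above can be sketched as follows. This is a minimal Java illustration under stated assumptions: EmphNodeSketch and its fields are hypothetical and do not mirror the node classes of the patch, and the children are represented as an already rendered string to keep the example short.

```java
final class EmphNodeSketch {
    final String delimiter;        // the opening chars: "_", "__", "*" or "**"
    final String renderedChildren; // inline children, already rendered to HTML
    final boolean closed;          // whether a matching closer was found

    EmphNodeSketch(String delimiter, String renderedChildren, boolean closed) {
        this.delimiter = delimiter;
        this.renderedChildren = renderedChildren;
        this.closed = closed;
    }

    /** Closed sequences get em/strong tags; unclosed ones keep their opening chars as text. */
    String toHtml() {
        if (!closed) {
            return delimiter + renderedChildren;
        }
        String tag = delimiter.length() == 2 ? "strong" : "em";
        return "<" + tag + ">" + renderedChildren + "</" + tag + ">";
    }

    public static void main(String[] args) {
        System.out.println(new EmphNodeSketch("_", "emph", true).toHtml());
        System.out.println(new EmphNodeSketch("_", "what does", false).toHtml());
    }
}
```

Accepting the unclosed case as a node means the parser does not have to fail and re-parse the same span as plain text, which is presumably where the additional speed-up comes from.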
Preview:
Here is the output of 18 tests I used to validate the patch. (includes patched and unpatched output + parse times)
Parse-times may drop further when I'm done cleaning up code + removing debug statements. Then it's ready for pull-request. Will be there soon.