A strange parsing behaviour can be observed with the markdown library when used on Lis

Strange and inconsistent parsing of lists with headers and multiple lines [CLOSED]

Andre601 commented on September 25, 2024

Strange and inconsistent parsing of lists with headers and multiple lines

Comments (6)

Andre601 commented on September 25, 2024

Forgot to add another solution/workaround.
Adding an empty line after the header also prevents the code block issue.

I would assume that this is some block-related rendering behaviour?
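The workaround above (an empty line after each header) can be applied mechanically before the text reaches the parser. The following is only a minimal sketch using the standard library: the function name `add_blank_line_after_headings` and the pattern `HEADING_RE` are hypothetical and not part of Python-Markdown, and the pattern only covers `#`-style headings, optionally behind a `-`, `*` or `+` list marker. It does not skip fenced code blocks, so a `# comment` line inside code would also be treated as a heading.

```python
import re

# Hypothetical pattern for this sketch: a "#"-style heading, optionally
# indented and optionally preceded by a bullet list marker.
HEADING_RE = re.compile(r"^\s*(?:[-*+]\s+)?#{1,6}\s+\S")


def add_blank_line_after_headings(text: str) -> str:
    """Insert an empty line after each heading that is directly followed by text."""
    lines = text.split("\n")
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        # Treat the end of the input like a blank line: nothing to separate.
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if HEADING_RE.match(line) and next_line.strip():
            out.append("")
    return "\n".join(out)


source = "-   ### List 1\n    Entry 1.1\n\n-   ### List 2\n    Entry 2.1"
print(add_blank_line_after_headings(source))
```

Headings that already have an empty line after them are left untouched, so running the function twice gives the same result as running it once.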

facelessuser commented on September 25, 2024

I do agree it is weird that there are some cases where the paragraph under the header is getting turned into code blocks. I'm not sure if this is a list issue or a header extension issue within lists. I do know that lists especially have a few quirky issues like this. I do think behavior should be more consistent in lists. The fact that headers handle this case outside of lists fine but have issues in lists should probably be looked into.

With that said, for most consistent behavior, it is always best to keep blocks separate. Generally, Python Markdown expects blocks to have new lines between them. For example, with a blank line after each heading:

import markdown

MD = """
-   ### List 1

    Entry 1.1

    Entry 1.2

-   ### List 2

    Entry 2.1

    Entry 2.2

-   ### List 3

    Entry 3.1

    Entry 3.2
"""

html = markdown.markdown(
    MD,
    extensions=[],
)

print(html)

Output:

<ul>
<li>
<h3>List 1</h3>
<p>Entry 1.1</p>
<p>Entry 1.2</p>
</li>
<li>
<h3>List 2</h3>
<p>Entry 2.1</p>
<p>Entry 2.2</p>
</li>
<li>
<h3>List 3</h3>
<p>Entry 3.1</p>
<p>Entry 3.2</p>
</li>
</ul>

waylan commented on September 25, 2024

I haven't looked closely at each example given yet (I will when I have time), but the first thing I would check is the reference implementation. Is our behavior any different? For any example where our behavior matches the reference implementation, I would expect that to be the correct behavior (unless it is clearly a bug in the reference implementation, which does happen on occasion). If, however, the behavior differs between implementations, then we probably have a bug here.

As a general observation, there are a lot of subtleties with list parsing, especially when you get into the differences between tight lists (no blank lines between items) and loose lists (blank lines between items). As loose list items always contain block level children, I can see an argument that any list item which contains a heading (which is clearly a block level element) should get loose list behavior even without the blank lines, but that is not how the reference implementation works, so we don't either. I'm assuming that this is what is leading to the unexpected output.
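To make the tight/loose distinction concrete, here is a minimal sketch that classifies a flat bullet list by looking for a blank line directly before a list item. The name `is_loose` and the pattern `ITEM_RE` are hypothetical, and this only mirrors the definition; it is not how Python-Markdown or the reference implementation decide internally, and it ignores nested and ordered lists.

```python
import re

# Hypothetical pattern for this sketch: a top-level bullet item
# ("-", "*" or "+" in the first column, followed by whitespace).
ITEM_RE = re.compile(r"^[-*+]\s+")


def is_loose(list_source: str) -> bool:
    """Return True if a blank line directly precedes any list item after the first."""
    lines = list_source.strip("\n").split("\n")
    # Compare each line with the one before it.
    for prev, line in zip(lines, lines[1:]):
        if ITEM_RE.match(line) and not prev.strip():
            return True
    return False


print(is_loose("- one\n- two"))    # tight list: False
print(is_loose("- one\n\n- two"))  # loose list: True
```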

"With that said, for most consistent behavior, it is always best to keep blocks separate. Generally, Python Markdown expects blocks to have new lines between them."

This is generally good advice. Yes, it is true that Markdown can work with all sorts of weird edge cases. However, for consistent results across all implementations, I always format all of my Markdown according to the strictest linting rules, such as always including a blank line between all block level elements, no matter what. That has become especially important with the popularity of CommonMark, which handles many edge cases differently than old-school Markdown. My Markdown always renders the same with both CommonMark (on GitHub) and Python-Markdown (on my own sites) because I follow those strict linting rules and avoid the various weird behaviors raised here.

To be clear, I am not suggesting that we shouldn't bother to fix an edge case where the behavior is clearly wrong just because it can be avoided by using a stricter set of rules. What I am saying is that because the correct behavior (as defined by Markdown's rather vague syntax rules) is not always clear, it is easier to avoid surprises if you stick to those stricter rules. In fact, for the documentation on this project, we run all proposed changes through the linter tool to enforce those stricter rules.

Andre601 commented on September 25, 2024

Something I want to point out real quick.

The linting rule you linked shows 2 spaces as the proper indent, which is also its default, yet your Markdown parser requires 4 spaces, no matter what, for proper indents.
Why?

waylan commented on September 25, 2024

Had a chance to look at these.

Test 1 and Test 2 both demonstrate the same bug. There should be no code blocks (paragraphs instead). What is really strange is that the first item is correct, but the subsequent items are wrong.

Test 3 looks correct, but when you check against the reference implementation, it is also wrong. I think this one is interesting in that because there are no blank lines, the reference implementation sees it as a tight list. Presumably, the idea is that a tight list item does not contain any block level children. Therefore it is parsed as inline text only. markdown.pl returns the following result:

<ul>
<li>### List 1
Entry 1.1</li>
<li>### List 2
Entry 2.1</li>
<li>### List 3
Entry 3.1</li>
</ul>

According to Babelmark, there is a lot of variability across implementations with this one. Not sure what to think about it. Regardless, I am inclined to not treat this as the same bug. In fact, I may ignore it altogether.

I'm not sure what is going on with Test 4 as a heading should never be more than one line (a heading always ends at the first newline). However, what is even more curious, is that this specific edge case results in the bug in Tests 1 and 2 being avoided. Add the additional indentation, and we get those issues back. It looks like Test 5 is a workaround to avoid the issues. I suspect Tests 4 and 5 will help in working out what is causing the issues in Tests 1 and 2.

Thanks for posting this. This is clearly a bug. A bug I never would have found as I always follow a heading with a blank line in my own documents.

waylan commented on September 25, 2024

These edge cases are not so clear because lists support hanging indents. For example, these two list items are parsed the same way:

-   line one of one paragraph
    line 2 of the same paragraph

-   line one of one paragraph
line 2 of the same paragraph

However, because a heading can only ever be one line, the second line is forced to start a new paragraph, which is unintuitive. For example, the following two list items get parsed very differently:

- # A Heading
A paragraph in the list item.

- # A Heading

A paragraph outside the list.

Yet, when we take those out of a list, then they get parsed the same.

# A Heading
A paragraph

# A Heading

A paragraph

All of these differences make for a challenge when developing a parser that works consistently and unsurprisingly. An additional complication is that the rules are not comprehensive, and some edge cases of the reference implementation don't seem to be consistent with what one might expect having read the rules. I suppose that is why CommonMark completely abandoned the original rules and implemented a completely different scheme for parsing list items. But we are not a CommonMark parser, so we are stuck with the weirdness that is old-school Markdown.
