tree-sitter-grammars / tree-sitter-markdown

Markdown grammar for tree-sitter

License: MIT License

JavaScript 35.23% Rust 13.64% C 49.55% Scheme 1.36% Makefile 0.22%

markdown parser tree-sitter

tree-sitter-markdown's Introduction

tree-sitter-markdown

A Markdown parser for tree-sitter.

The parser is designed to read Markdown according to the CommonMark Spec, but it also includes some extensions from other sources, such as GitHub Flavored Markdown. These can be toggled on or off at compile time. For specifics, see Extensions.

Goals

Even though this parser has existed for some time and the obvious issues are mostly solved, there are still many inaccuracies in the output. These stem from fitting a format as complex as Markdown into tree-sitter's fairly restrictive parsing rules.

As such, it is not recommended to use this parser where correctness is important. The main goal of this parser is to provide syntactic information for syntax highlighting in editors such as neovim and helix.

Contributing

All contributions are welcome. For details refer to CONTRIBUTING.md.

Extensions

Extensions can be enabled at compile time through environment variables. Some of them are on by default; these can be disabled with the environment variable NO_DEFAULT_EXTENSIONS.

Name	Environment variable	Specification	Default	Also enables
Github flavored markdown	`EXTENSION_GFM`	link	✓	Task lists, strikethrough, pipe tables
Task lists	`EXTENSION_TASK_LIST`	link	✓
Strikethrough	`EXTENSION_STRIKETHROUGH`	link	✓
Pipe tables	`EXTENSION_PIPE_TABLE`	link	✓
YAML metadata	`EXTENSION_MINUS_METADATA`	link	✓
TOML metadata	`EXTENSION_PLUS_METADATA`	link	✓
Tags	`EXTENSION_TAGS`	link
Wiki Link	`EXTENSION_WIKI_LINK`	link
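
The table above can be read as a small resolution rule. The following is a minimal sketch of that rule in Python, not the project's actual build logic. It assumes that a variable counts as set whenever it is present in the environment (regardless of value), and that an explicitly set extension stays enabled even when NO_DEFAULT_EXTENSIONS is set; both are assumptions. The function name `enabled_extensions` is hypothetical.

```python
from typing import Mapping, Set

# Extensions marked as default in the table above.
DEFAULT_ON = {
    "EXTENSION_GFM",
    "EXTENSION_TASK_LIST",
    "EXTENSION_STRIKETHROUGH",
    "EXTENSION_PIPE_TABLE",
    "EXTENSION_MINUS_METADATA",
    "EXTENSION_PLUS_METADATA",
}
# Extensions that are off unless requested.
DEFAULT_OFF = {"EXTENSION_TAGS", "EXTENSION_WIKI_LINK"}
# The "Also enables" column.
ALSO_ENABLES = {
    "EXTENSION_GFM": {
        "EXTENSION_TASK_LIST",
        "EXTENSION_STRIKETHROUGH",
        "EXTENSION_PIPE_TABLE",
    },
}


def enabled_extensions(env: Mapping[str, str]) -> Set[str]:
    """Model which extensions end up enabled for a given environment."""
    # Assumption: presence of the variable is enough, its value is ignored.
    enabled = set() if "NO_DEFAULT_EXTENSIONS" in env else set(DEFAULT_ON)
    enabled |= {name for name in DEFAULT_ON | DEFAULT_OFF if name in env}
    for name in list(enabled):
        enabled |= ALSO_ENABLES.get(name, set())
    return enabled
```

Under these assumptions, setting only NO_DEFAULT_EXTENSIONS yields an empty set, while adding EXTENSION_GFM brings back GFM together with task lists, strikethrough, and pipe tables.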

Usage in Editors

For guides on how to use this parser in a specific editor, refer to that editor's specific documentation, e.g.

neovim
helix

Standalone usage

To use the two grammars, first parse the document with the block grammar. Then perform a second parse with the inline grammar using ts_parser_set_included_ranges to specify which parts are inline content. These parts are marked as inline nodes. Children of those inline nodes should be excluded from these ranges. For an example implementation see lib.rs in the bindings folder.
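
The range computation described above can be sketched independently of any tree-sitter binding. The Python below is an illustration only: it models nodes as half-open byte ranges and uses a hypothetical helper `inline_ranges`. A real implementation would build TSRange values (which also carry row/column points) and pass them to ts_parser_set_included_ranges.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open byte range: (start, end)


def inline_ranges(inline: Range, children: List[Range]) -> List[Range]:
    """Return the parts of an inline node that are not covered by its children."""
    ranges = []
    cursor = inline[0]
    for start, end in sorted(children):
        if start > cursor:
            # Gap before this child belongs to the inline content.
            ranges.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < inline[1]:
        ranges.append((cursor, inline[1]))
    return ranges
```

For example, an inline node spanning bytes 0 to 20 with children at (5, 7) and (12, 14), such as block quote markers on continuation lines, would be parsed by the inline grammar over (0, 5), (7, 12), and (14, 20).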

Usage with WASM

Unfortunately, using this parser with WASM/web-tree-sitter does not work out of the box at the moment. This is because the parser uses some C functions that are not exported by tree-sitter by default. To fix this, you can statically link the parser to tree-sitter. See also tree-sitter/tree-sitter#949, #126, and #93.

tree-sitter-markdown's Issues

Multiple code spans with HTML comments do not work properly

It seems that using multiple code spans with HTML comments isn't working properly. Using a single code span with an HTML comment works fine, though. Consider the following examples.

A single code span with a comment at the end of a paragraph works fine.

foo `<!--comment-->`

✅ (document (paragraph (html_tag)))

A single comment at the beginning of a paragraph works fine as well.

`<!--comment-->` foo

✅ (document (paragraph (html_tag)))

However, having two code spans with HTML comments produces an incorrect tree.

`<!--comment-->` foo `<!--comment-->`

❌ (document (paragraph (html_tag) (code_span (code_span_delimiter) (code_span_delimiter)) (html_tag)))

I'm not experienced with writing Tree-sitter parsers, so I'm unsure how the tree should look when parsing the last example.
However, I think it's supposed to look like this:

(document (paragraph (html_tag) (html_tag)))

Links are not recognised with parentheses

Describe the bug
Links in the format [Link](https://google.com) are not recognised at all.

Code example
These links are not recognised.

[Link](https://google.com)
[Link2](https://gmail.com)

But these are recognised.

[Link]: https://google.com
[Link2]: https://gmail.com

Expected behaviour
I would expect [Link](https://google.com) is recognised as a link.

Actual behaviour
It is not recognised as a link, so it is not highlighted.

Codespan gets parsed as autolink

See example 355.

s/parenthethis/parenthesis

I was just checking in on progress on this project and the typo 'parenthethis' for 'parenthesis' caught my eye.

I guess it doesn't really matter for the internal implementation, but it would be a shame to have this show up in the parse tree.

Add support for c++ as a valid language identifier along with cpp

Currently, in an md file, this is what C++ code blocks look like:

Some projects use c++ instead of cpp to mark C++ code blocks; it would be nice if that were recognised as well.
I would have made a PR, but I am not well acquainted with the ways of tree-sitter, hence filing an issue.

Thanks!

Markdown links containing parentheses are not highlighted

A link such as https://en.wikipedia.org/wiki/Ping_(networking_utility)#ICMP_packet is not highlighted properly because it contains parentheses.
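
For context, CommonMark allows parentheses inside a bare link destination as long as they are backslash-escaped or balanced. The sketch below shows that rule in Python; it is a simplified illustration, not the grammar's implementation. The function name `destination_end` is hypothetical, and the angle-bracket form of destinations is not handled.

```python
def destination_end(text: str, start: int) -> int:
    """Return the index just past a bare link destination that begins at
    `start`, or -1 if its parentheses are unbalanced."""
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in "()":
            i += 2  # escaped parenthesis does not affect nesting
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                break  # this parenthesis closes the link itself
            depth -= 1
        elif ch.isspace() or ord(ch) < 0x20 or ch == "\x7f":
            break  # spaces and control characters end the destination
        i += 1
    return i if depth == 0 else -1
```

Applied to `[ICMP](https://en.wikipedia.org/wiki/Ping_(networking_utility)#ICMP_packet)`, starting just after `](`, the scan stops at the final `)` and the whole URL, including `(networking_utility)`, is the destination.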

No `code_span` nodes after latest update

After I synced neovim's packer, it updated the markdown parser, and it stopped working for me, since I use code_span and code_span_delimiter nodes in my custom queries. On another machine, where I did not update, I see with the TSPlayground that these nodes are there, but with updated parser, they are just gone. Is it intentional, and your parser will not recognize code spans like this one any more?

Code fence aliases

Tim Pope's implementation includes the following

let g:markdown_fenced_languages = ['bash=sh']

It would be nice to implement in here as well.

No support for shell code blocks

I just tried editing markdown in Neovim with treesitter, and my shell code blocks don't get any highlighting. It works fine when I rename them to bash, but not sh. Should this parser support the same languages as GitHub does? As far as I can tell from https://github.com/github/linguist/blob/master/lib/linguist/languages.yml, the possible aliases are:

shell
sh
shell-script
bash
zsh
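
An alias lookup like the one requested here could be sketched as a plain mapping applied to the language name before a parser is chosen. The Python below is only an illustration: the table `ALIASES` and the function `resolve_language` are hypothetical, mapping every shell alias (including zsh) to bash is an assumption about which parser is installed, and the c++ entry reflects the separate request to accept c++ alongside cpp.

```python
# Hypothetical alias table: fence language name -> parser name.
ALIASES = {
    "sh": "bash",
    "shell": "bash",
    "shell-script": "bash",
    "zsh": "bash",  # assumption: no dedicated zsh parser is available
    "c++": "cpp",
}


def resolve_language(name: str) -> str:
    """Map a fence language name to the parser that should handle it."""
    key = name.strip().lower()
    return ALIASES.get(key, key)
```

Names without an alias pass through unchanged, so `resolve_language("bash")` stays `bash` and unknown languages are left for the editor to handle.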

Feature request: add front matter support

Rationale

Markdown is a popular choice for writing content for static sites and documentation. The majority of existing tooling supports so-called "front matter", which allows attaching metadata to a document.

See example https://gohugo.io/content-management/front-matter/

Suggestion

Front matter is neither part of the standard nor standardized among existing tooling. Nevertheless, the community seems to have settled on two front matter formats:

YAML is used in case of --- separator. E.g:

---
tags: [foo, bar]
summary: a short summary
---

# My blog post

...

TOML is used in case of +++ separator. E.g:

+++
tags = ["foo", "bar"]
summary = "a short summary"
+++

# My blog post

...

It would be nice to support both YAML and TOML injections for these two most popular choices. I'm sure that would cover 99% of cases.
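
To make the suggestion concrete, here is a minimal Python sketch of how the two delimiters could be told apart. It only looks at the very first line of the document, so a `---` later in the text is not treated as front matter. The function name `split_front_matter` and its return shape are hypothetical, and the metadata is returned as raw text rather than parsed.

```python
from typing import Optional, Tuple

DELIMITERS = {"---": "yaml", "+++": "toml"}


def split_front_matter(text: str) -> Tuple[Optional[str], str, str]:
    """Return (format, metadata, body); format is None without front matter."""
    lines = text.split("\n")
    delimiter = lines[0].rstrip()
    fmt = DELIMITERS.get(delimiter)
    if fmt is None:
        return None, "", text
    for i in range(1, len(lines)):
        if lines[i].rstrip() == delimiter:
            # Metadata sits between the two delimiter lines.
            return fmt, "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    # Opening delimiter without a closing one: treat as ordinary content.
    return None, "", text
```

The returned format name is what an injection would use to pick the YAML or TOML parser for the metadata block.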

Multiple thematic breaks

text 1

---

text 2

---

text 3

produces

paragraph [0, 0] - [1, 0]
thematic_break [2, 0] - [3, 0]
paragraph [4, 0] - [5, 0]
  document [4, 0] - [4, 6]
    flow_node [4, 0] - [4, 6]
      plain_scalar [4, 0] - [4, 6]
        string_scalar [4, 0] - [4, 6]
thematic_break [6, 0] - [7, 0]
paragraph [8, 0] - [9, 0]

when I would have expected

paragraph [0, 0] - [1, 0]
thematic_break [2, 0] - [3, 0]
paragraph [4, 0] - [5, 0]
thematic_break [6, 0] - [7, 0]
paragraph [8, 0] - [9, 0]

A smaller example would be

a
***
b
***

paragraph [0, 0] - [1, 0]
thematic_break [1, 0] - [2, 0]
paragraph [2, 0] - [3, 0]
  document [2, 0] - [2, 1]
    flow_node [2, 0] - [2, 1]
      plain_scalar [2, 0] - [2, 1]
        string_scalar [2, 0] - [2, 1]
thematic_break [3, 0] - [4, 0]

instead of

paragraph [0, 0] - [1, 0]
thematic_break [1, 0] - [2, 0]
paragraph [2, 0] - [3, 0]
thematic_break [3, 0] - [4, 0]

a
***
b

works as expected — it's the second break that changes it

Brackets containing only whitespace parsed as shortcut link

See example 560.

Html tag containing multiple html tags gets parsed wrong

See example 635.

Invalid link reference definition highlight

Describe the bug

Code example

-   [Link][]
-   [Link2][]

  [Link]: https://google.com
  [Link2]: https://gmail.com

Expected behavior

Link reference definitions are highlighted as such. The CommonMark spec allows up to three spaces of indentation for them (src).

Actual behavior

Error on markdown images

If two consecutive lines have images the parser fails.

Example

### test doc

![img1](link1)
![img2](link2)

what's going on?
### why an error after two image links ?

ast from TSPlayground

atx_heading [0, 0] - [1, 0]
  atx_h3_marker [0, 0] - [0, 3]
  heading_content [0, 3] - [0, 12]
ERROR [2, 0] - [7, 0]
  image [3, 0] - [3, 14]
    image_description [3, 2] - [3, 6]
    link_destination [3, 8] - [3, 13]
  ERROR [4, 0] - [5, 0]
  ERROR [6, 0] - [6, 3]

But if the newline between the images is removed, or an additional blank line is added between the two images, the parser works.

That is, these work but the one above doesn't:

![img1](link1)![img2](link2)

![img1](link1)

![img2](link2)

wikilinks

Maybe this should not be part of tree-sitter-markdown, since it's not part of the CommonMark Spec or GitHub Flavored Markdown, so I'm just opening up the question of supporting wikilinks.

nvim becomes unresponsive when a fenced code block uses attributes

Hi there and thank you for your parser! Fixing this issue might be out of scope, as it is specific to Pandoc's extension of Markdown, but since it leads to a crash / nvim becoming unresponsive, I am reporting it anyway because people might come across this by accident.

Pandoc allows attributes to be added to block level and inline elements (https://pandoc.org/MANUAL.html#extension-attributes) using a syntax with curly braces (https://pandoc.org/MANUAL.html#extension-header_attributes). These attributes can also be added to (fenced) code blocks (https://pandoc.org/MANUAL.html#fenced-code-blocks), which is commonly used by literate programming formats combining computation and documentation such as Rmarkdown to note down the language of a code block for execution. Unfortunately this syntax leads to a crash.

So while

```R
1 + 1
```

works,

```{R}
1 + 1
```

becomes very slow once it is typed out and gives warnings about the missing closing } while typing. This must be somehow related to how the info_string after the three backticks is parsed. If this could allow for curly braces it would help a lot of pandoc and Rmarkdown users.

HTML tag can sometimes be parsed as code span

See example 354

Highlight group suggestions

Thanks for the awesome parser!

I have some suggestions that could improve the highlighting:

quote symbol (>) should have TSPunctSpecial highlight group
fenced code block symbol (backtick) and language are currently using TSLiteral group, I think it would be better to use TSPunctDelimiter
link delimiters ([]()) should use TSPunctDelimiter or TSPunctBracket highlight group
Emphasis characters (_ and *) should use TSPunctDelimiter highlight group

Thanks again!

Link reference definition with newline in link label gets parsed as shortcut link

See example 177.

Code fence indents

Hi, not too sure about the current implementation, but would it be possible to get treesitter indenting according to the parser for the language corresponding to code blocks, just like how treesitter highlights are correctly used based on the language?

How to enable that for rmarkdown?

Link destination gets parsed as html

See example 588.

Support tag convention

Hello!
Thanks for this great parser; it's helping me do some great things with parsing my notes.

I currently use tags derived from Obsidian (and other note management tools that use tags) and want to see if it's possible for them to be treated as tags in TS.
The syntax is #word, where word can consist of any letters or digits.

References:

https://help.obsidian.md/How+to/Working+with+tags
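
As a rough illustration of the requested syntax, the Python sketch below finds `#word` tags with a regular expression. It is a guess at reasonable boundaries rather than Obsidian's exact rules: a tag must not directly follow a word character or another `#`, which keeps ATX headings such as `# Title` and text like `issue#12` from matching. The name `find_tags` is hypothetical.

```python
import re
from typing import List

# '#' not preceded by a word character or '#', followed by letters/digits.
# [^\W_] matches any Unicode letter or digit (a word character except '_').
TAG = re.compile(r"(?<![\w#])#([^\W_]+)")


def find_tags(text: str) -> List[str]:
    """Return the tag names (without the leading '#') found in `text`."""
    return TAG.findall(text)
```

Because headings require a space after the `#`, a line like `## Heading` produces no tags under this pattern.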

Titles not detected after an empty inner list item (bullet point)

I think titles after an empty inner list item are not detected.
Here is an example:

Thank you!

Publish to crates.io

Hello!

If possible could you publish the rust crate to crates.io so that it can be consumed by other rust crates?

Cheers!

setext heading level 1 not parsed

Describe the bug
The second heading with = underline isn't highlighted

Code example

title
=====

asdf

title
=====

asdf

Expected behavior
maybe something like this?

section [0, 0] - [9, 0]
  setext_heading [0, 0] - [2, 0]
    heading_content: paragraph [0, 0] - [1, 0]
      inline [0, 0] - [0, 5]
    setext_h1_underline [1, 0] - [1, 5]
  paragraph [3, 0] - [4, 0]
    inline [3, 0] - [3, 4]
  section [5, 0] - [9, 0]
    setext_heading [5, 0] - [7, 0]
      heading_content: paragraph [5, 0] - [6, 0]
        inline [5, 0] - [5, 5]
      setext_h1_underline [6, 0] - [6, 5]
    paragraph [8, 0] - [9, 0]
      inline [8, 0] - [8, 4]

Actual behavior

section [0, 0] - [9, 0]
  setext_heading [0, 0] - [2, 0]
    heading_content: paragraph [0, 0] - [1, 0]
      inline [0, 0] - [0, 5]
    setext_h1_underline [1, 0] - [1, 5]
  paragraph [3, 0] - [4, 0]
    inline [3, 0] - [3, 4]
  paragraph [5, 0] - [6, 0]
    inline [5, 0] - [5, 5]
  paragraph [6, 0] - [7, 0]
    inline [6, 0] - [6, 5]
  paragraph [8, 0] - [9, 0]
    inline [8, 0] - [8, 4]

Text in square brackets conceals like a link

Text contained in square brackets will be rendered as link, despite the absence of a link destination in parenthesis.

Code example

This [link] renders like a link when it should not.

Expected behavior
When no link destination is provided square brackets render as text.

Actual behavior
When no link destination is provided square brackets render as link.
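
In CommonMark, bracketed text is a shortcut reference link only when a matching link reference definition exists somewhere in the document; labels are compared after case folding and whitespace collapsing. The Python sketch below illustrates that check with simplified regular expressions. It ignores code spans, escapes, and nested brackets, and the names `normalize_label` and `resolved_shortcuts` are hypothetical.

```python
import re
from typing import List

# A definition may be indented by up to three spaces.
DEFINITION = re.compile(r"^ {0,3}\[([^\[\]]+)\]:\s*\S+", re.MULTILINE)
# Bracketed text not followed by '[', '(' or ':' is a shortcut candidate.
SHORTCUT = re.compile(r"\[([^\[\]]+)\](?![\[(:])")


def normalize_label(label: str) -> str:
    """Collapse internal whitespace and case-fold, as label matching requires."""
    return " ".join(label.split()).casefold()


def resolved_shortcuts(text: str) -> List[str]:
    """Return the bracketed labels that match a link reference definition."""
    defined = {normalize_label(m.group(1)) for m in DEFINITION.finditer(text)}
    return [
        m.group(1)
        for m in SHORTCUT.finditer(text)
        if normalize_label(m.group(1)) in defined
    ]
```

With this check, `[link]` in the example sentence above is plain text, because the document defines no `[link]: ...` reference. Since resolving a label needs the whole document, a highlighter may have to apply this as a separate step after parsing.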

Segfault in neovim on roc FAQ.md

Hi ! First thanks a lot for maintaining this markdown parser :)

Neovim crashes with a segfault when opening the FAQ.md file in the source code of roc lang (https://www.roc-lang.org/). I believe it's related to the markdown parser, but I'm not 100% sure because it could be related to some other parser that is injected.

I can take a look when I have some time this week if you wish.

List markers highlight?

Is there a way for the list markers to get some highlighting? Thanks!

Add highlighting support for code spans

As seen here https://github.github.com/gfm/#code-spans

Code block highlights

Is it possible to highlight blocks of code surrounded by triple backticks according to the language set for the block?

Invalid HTML comments containing hyphens

See examples 645 and 646.

Bug: code block in quote block

Something like this will break the AST:

> ```bash
> git add --all
> git commit -m "msg"
> ```

playground shows that:

block_quote [0, 0] - [4, 0]
  block_quote_marker [0, 0] - [0, 2]
  fenced_code_block [0, 2] - [4, 0]
    info_string [0, 5] - [0, 9]
      language [0, 5] - [0, 9]
    block_quote_marker [1, 0] - [1, 2]
    code_fence_content [1, 2] - [3, 2]
      ERROR [1, 2] - [3, 1]
        command [1, 2] - [1, 15]
          name: command_name [1, 2] - [1, 5]
            word [1, 2] - [1, 5]
          argument: word [1, 6] - [1, 9]
          argument: word [1, 10] - [1, 15]
        command [2, 0] - [2, 21]
          file_redirect [2, 0] - [2, 5]
            destination: word [2, 2] - [2, 5]
          name: command_name [2, 6] - [2, 12]
            word [2, 6] - [2, 12]
          argument: word [2, 13] - [2, 15]
          argument: string [2, 16] - [2, 21]
      block_quote_marker [2, 0] - [2, 2]
      block_quote_marker [3, 0] - [3, 2]

It seems this problem is very hard to fix.

comma in info string prevents language specific parsing of code block

I really appreciate code parsing in fenced code blocks. In Rmarkdown, the language identifier (r) is often followed by a comma and further "chunk options". The comma prevents the language from being recognized. Could we allow comma in addition to whitespace after language identifier?

```{r, fig.height=4, fig.width=7}
x <- rnorm(100)                  # no syntax highlighting in these blocks
plot(x)
```

```{r chunk-name, fig.height=4, fig.width=7}
x <- rnorm(100)                  # this block has proper highlighting
plot(x)
```
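
One way to handle such chunk headers is to normalise the info string before looking up the language. The Python sketch below strips a surrounding pair of curly braces and then splits on whitespace or commas. It is an illustration of the request, not existing parser behaviour, and `chunk_language` is a hypothetical name. Pandoc attribute forms such as `{.haskell .numberLines}` are not covered.

```python
import re


def chunk_language(info_string: str) -> str:
    """Return the language name from a plain or Rmarkdown-style info string."""
    info = info_string.strip()
    if info.startswith("{") and info.endswith("}"):
        info = info[1:-1].strip()  # drop the braces around chunk options
    # The language ends at the first whitespace or comma.
    return re.split(r"[\s,]+", info, maxsplit=1)[0]
```

Both `{r, fig.height=4, fig.width=7}` and `{r chunk-name, fig.height=4, fig.width=7}` resolve to `r` under this rule.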

Link concealing?

I've been using plasticboy/vim-markdown for markdown syntax highlighting and other stuff for years.
I really missed the link concealing feature of that plugin when I switched to tree-sitter-markdown today.
Link concealing shows [text](link) as text and makes the whole document less cluttered, especially if the link is very long.

Not sure if link concealing is within the scope of tree-sitter.

Shortcut Link and Code Span Edge Case

Hello! I ran into the following edge case with a combination of shortcut_link and code_span:

- `x[0]` is equivalent to `*x`

...is parsed as:

  list_item [5, 0] - [7, 0]
    list_marker_minus [5, 0] - [5, 2]
    paragraph [5, 2] - [6, 0]
      shortcut_link [5, 4] - [5, 7]
        link_text [5, 5] - [5, 6]
      code_span [5, 7] - [5, 27]
        code_span_delimiter [5, 7] - [5, 8]
        code_span_delimiter [5, 26] - [5, 27]

...when it should be:

  list_item [5, 0] - [7, 0]
    list_marker_minus [5, 0] - [5, 2]
    paragraph [5, 2] - [6, 0]
      code_span [5, 2] - [5, 8]
        code_span_delimiter [5, 2] - [5, 3]
        code_span_delimiter [5, 7] - [5, 8]

The smallest example I could come up with is:

`[a]`b`*c`

...which is parsed as:

paragraph [7, 0] - [8, 0]
  shortcut_link [7, 1] - [7, 4]
    link_text [7, 2] - [7, 3]
  code_span [7, 4] - [7, 7]
    code_span_delimiter [7, 4] - [7, 5]
    code_span_delimiter [7, 6] - [7, 7]

...instead of:

paragraph [7, 0] - [8, 0]
  code_span [7, 0] - [7, 5]
    code_span_delimiter [7, 0] - [7, 1]
    code_span_delimiter [7, 4] - [7, 5]

This is a very "edge case" kind of scenario, but I figured I should mention it.

PS: thanks for your work on this, I was waiting for a tree sitter markdown grammar and I attempted to write one myself a few months ago but gave up cause I couldn't figure it out 🤣, so I appreciate the effort this must've taken.

Error: Ranges can only be made from 6 element long tables or nodes.

Haven't investigated this fully yet but it seems to be related to the latest updates.

I'm using this parser to generate vimdoc from markdown, I've narrowed it down to the part of code that's failing:

this fails even with an empty file

  local parser = vim.treesitter.get_string_parser(
    -- contents of a sample markdown file consisting of 2 lines
    -- title and body
    "# title\nbody\n\n",
    "markdown"
  )
  parser:parse()

The call to parser:parse() fails with:

Error detected while processing /home/bhagwan/test.lua:
E5113: Error while calling lua chunk: /usr/share/nvim/runtime/lua/vim/treesitter/languagetree.lua:115:
Ranges can only be made from 6 element long tables or nodes.
stack traceback:
        [C]: in function 'set_included_ranges'
        /usr/share/nvim/runtime/lua/vim/treesitter/languagetree.lua:115: in function 'parse'
        /usr/share/nvim/runtime/lua/vim/treesitter/languagetree.lua:149: in function 'parse'
        /home/bhagwan/test.lua:3: in main chunk

Bug: weird indexed list

1. a
  1. b
    1. c
    2. d
      1. e
      2. f

parsed as:

list [0, 0] - [6, 0]
  list_item [0, 0] - [1, 2]
    list_marker_dot [0, 0] - [0, 3]
    paragraph [0, 3] - [1, 2]
  list_item [1, 2] - [6, 0]
    list_marker_dot [1, 2] - [1, 5]
    paragraph [1, 5] - [4, 6]
    list [4, 6] - [6, 0]
      list_item [4, 6] - [5, 6]
        list_marker_dot [4, 6] - [4, 9]
        paragraph [4, 9] - [5, 6]
      list_item [5, 6] - [6, 0]
        list_marker_dot [5, 6] - [5, 9]
        paragraph [5, 9] - [6, 0]

As a result, the list markers before c and d have no TS highlight:

Table support?

First, great job on this and thank you for taking ownership of Markdown in neovim treesitter :-)

Are tables supported? I'm using treesitter to auto-generate vimdoc for my plugin from the README. I used to do this with https://github.com/ikatyang/tree-sitter-markdown, which recognized the below as a table; this seems not to be the case here.

Are tables supported?

For reference, here's how it's parsed with the ikatyang:

Feature request: Strikethrough

check here: https://github.github.com/gfm/#strikethrough-extension-

Error: query: invalid node type at position 152

Hi again @MDeiml,

I seem to encounter a new error when using the markdown parser with the nightly, using get_string_parser fails:

vim.treesitter.get_string_parser("test", "cpp")              -- success
vim.treesitter.get_string_parser("test", "markdown_inline")  -- success
vim.treesitter.get_string_parser("test", "markdown")         -- FAIL

With error:

The reason the runtime paths are weird is because I'm using the nightly appimage

E5108: Error executing lua ...488e/usr/share/nvim/runtime/lua/vim/treesitter/query.lua:174: query: invalid node type at position 152
stack traceback:
        [C]: in function '_ts_parse_query'
        ...488e/usr/share/nvim/runtime/lua/vim/treesitter/query.lua:174: in function 'get_query'
        ...r/share/nvim/runtime/lua/vim/treesitter/languagetree.lua:35: in function 'get_string_parser'
        .../site/pack/vendor/start/ts-vimdoc.nvim/lua/ts-vimdoc.lua:23: in function 'docgen'
        [string ":lua"]:1: in main chunk

Missing headings level names

The headings level names are missing:

Heading      Default      Treesitter
-----------------------------------------
  # Level 1  markdownH1   markdownTSTitle
 ## Level 2  markdownH2   markdownTSTitle
### Level 3  markdownH3   markdownTSTitle
...
-----------------------------------------

Implementing the levels would significantly improve the readability via color schemes.
This module is amazing, and the H levels would make it even more spectacular.
❤️

Injection for LaTeX blocks

Although it does not appear to have been added to the GFM spec, GitHub recently added support for LaTeX rendering on their website using $ and $$ blocks. It is more formally outlined here.

Accordingly, it would be amazing if the parser could be extended to detect LaTeX blocks as marked by $ and $$ and provide injections for them.

missing case for `[TIGHT|LOOSE]_LIST_ITEM + 3`?

A random thing I noticed while looking over the code the other day: the cases for list_item + 3 seem to be missing from https://github.com/MDeiml/tree-sitter-markdown/blob/921ce062919d356ebf424eddc709cd2182e7b903/src/scanner.cc#L194-L201

Possibly there's something clever going on and this case is indeed special and deliberately missing; but it looks like it might be an accident!

Proposal: `link_text` match only content

@MDeiml great job. I'm using it on neovim.

link_text includes [ and ]. Should it only capture the content, like link_destination?

Emphasis does not always work as intended

Additional parameters in info string break injection.

For example as in this example from the specs

~~~~    ruby startline=3 $%@#$
def foo(x)
  return 3
end
~~~~~~~

Only the first word of the info string should be treated as a language. To solve this we probably need to add a new node so the output looks like

(fenced_code_block
  (info_string
    (language))
  (fenced_code_block_content))
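
The "first word is the language" rule can be sketched as follows. This Python snippet is an illustration of the CommonMark behaviour, not the grammar's scanner: it recognises an opening fence of at least three backticks or tildes with up to three spaces of indentation, rejects backtick fences whose info string contains a backtick, and takes the first whitespace-separated word as the language. The name `parse_fence_opener` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Up to three spaces of indentation, then a fence, then the info string.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})[ \t]*(.*?)[ \t]*$")


def parse_fence_opener(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (fence, info_string, language), or None if not an opening fence."""
    match = FENCE.match(line)
    if match is None:
        return None
    fence, info = match.group(1), match.group(2)
    if fence[0] == "`" and "`" in info:
        return None  # info strings of backtick fences cannot contain backticks
    language = info.split()[0] if info else ""
    return fence, info, language
```

For the spec example above, the info string is `ruby startline=3 $%@#$` and only `ruby` is reported as the language, which is the part an injection query would need.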

Is there any way that I can get the conceal math in Markdown?

Greetings,
I have been working on getting my note-taking setup ready for my classes and would love to have some sort of plugin for math conceal. As you can see, even though I have preservim/vim-markdown installed, the conceal doesn't work with Treesitter.

With TS enabled:

With TS Disabled:

block_quote_marker includes whitespace

I've played around with the Treesitter playground (using Neovim), with the aim of having nice-looking markdown right in the nvim buffer (using conceal). I found that the block_quote_marker nodes include not only > but also the following whitespace. Is this dictated by Markdown syntax? If so, is it possible to break them down into two nodes?
