
tree-sitter-grammars / tree-sitter-markdown


Markdown grammar for tree-sitter

License: MIT License

JavaScript 35.23% Rust 13.64% C 49.55% Scheme 1.36% Makefile 0.22%
markdown parser tree-sitter

tree-sitter-markdown's Introduction

tree-sitter-markdown


A Markdown parser for tree-sitter.


The parser is designed to read Markdown according to the CommonMark Spec, but some extensions to the spec from other sources, such as GitHub Flavored Markdown, are also included. These can be toggled on or off at compile time. For specifics, see Extensions.

Goals

Even though this parser has existed for some time and the obvious issues are mostly solved, there are still many inaccuracies in the output. These stem from fitting a format as complex as Markdown into tree-sitter's quite restrictive parsing rules.

As such, it is not recommended to use this parser where correctness is important. The main goal of this parser is to provide syntactic information for syntax highlighting in editors such as Neovim and Helix.

Contributing

All contributions are welcome. For details refer to CONTRIBUTING.md.

Extensions

Extensions can be enabled at compile time through environment variables. Some of them are on by default; these can be disabled with the environment variable NO_DEFAULT_EXTENSIONS.

Name                      Environment variable      Specification  Default  Also enables
GitHub Flavored Markdown  EXTENSION_GFM             link                    Task lists, strikethrough, pipe tables
Task lists                EXTENSION_TASK_LIST       link
Strikethrough             EXTENSION_STRIKETHROUGH   link
Pipe tables               EXTENSION_PIPE_TABLE      link
YAML metadata             EXTENSION_MINUS_METADATA  link
TOML metadata             EXTENSION_PLUS_METADATA   link
Tags                      EXTENSION_TAGS            link
Wiki Link                 EXTENSION_WIKI_LINK       link

Usage in Editors

For guides on how to use this parser in a specific editor, refer to that editor's own documentation.

Standalone usage

To use the two grammars, first parse the document with the block grammar. Then perform a second parse with the inline grammar using ts_parser_set_included_ranges to specify which parts are inline content. These parts are marked as inline nodes. Children of those inline nodes should be excluded from these ranges. For an example implementation see lib.rs in the bindings folder.
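
As a rough illustration of the range bookkeeping described above, the following Python sketch subtracts child spans from an inline node's span. It is not the code in lib.rs: the `Span` alias and the `included_ranges` helper are hypothetical names, and only byte offsets are tracked, whereas the `TSRange` values passed to ts_parser_set_included_ranges also carry row/column points.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical half-open byte span [start, end)


def included_ranges(inline: Span, children: List[Span]) -> List[Span]:
    """Return the parts of `inline` that are not covered by any child span."""
    start, end = inline
    ranges: List[Span] = []
    cursor = start
    for child_start, child_end in sorted(children):
        # Clamp each child to the inline span so stray input cannot widen it.
        child_start = max(child_start, start)
        child_end = min(child_end, end)
        if child_start > cursor:
            ranges.append((cursor, child_start))
        cursor = max(cursor, child_end)
    if cursor < end:
        ranges.append((cursor, end))
    return ranges


# An inline node spanning bytes 0..20 with two children to skip.
print(included_ranges((0, 20), [(5, 7), (12, 14)]))
```

The resulting list is what the second (inline) parse would be restricted to. An inline node without children simply yields its own span.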

Usage with WASM

Unfortunately using this parser with WASM/web-tree-sitter does not work out of the box at the moment. This is because the parser uses some C functions that are not exported by tree-sitter by default. To fix this you can statically link the parser to tree-sitter. See also tree-sitter/tree-sitter#949, #126, and #93

tree-sitter-markdown's People

Contributors

amaanq, aminya, boltlessengineer, danielpunkass, dimbleby, dstoc, dzhou121, ghishadow, hendrikvanantwerpen, ibash, jsorge, lewis6991, liskin, mattmassicotte, maxbrunsfeld, mdeiml, mikavilpas, observeroftime, ryleelyman, wookayin


tree-sitter-markdown's Issues

Multiple code spans with HTML comments do not work properly

It seems that using multiple code spans with HTML comments isn't working properly. Using a single code span with an HTML comment works fine, though. Consider the following examples.

A single code span with a comment at the end of a paragraph works fine.

foo `<!--comment-->`
✅ (document (paragraph (html_tag)))

A single comment at the beginning of a paragraph works fine as well.

`<!--comment-->` foo
✅ (document (paragraph (html_tag)))

However, having two code spans with HTML comments produces an incorrect tree.

`<!--comment-->` foo `<!--comment-->`
❌ (document (paragraph (html_tag) (code_span (code_span_delimiter) (code_span_delimiter)) (html_tag)))

I'm not experienced with writing Tree-sitter parsers, so I'm unsure how the tree should look when parsing the last example.
However, I think it's supposed to look like this:

(document (paragraph (html_tag) (html_tag)))

Links with parentheses are not recognised

Describe the bug
Links in the format [Link](https://google.com) are not recognised at all.

Code example
These links are not recognised.

[Link](https://google.com)
[Link2](https://gmail.com)

But these are recognised.

[Link]: https://google.com
[Link2]: https://gmail.com 

Expected behaviour
I would expect [Link](https://google.com) to be recognised as a link.

Actual behaviour
It is not recognised as a link, so it is not highlighted.

s/parenthethis/parenthesis

I was just checking in on progress on this project and the typo 'parenthethis' for 'parenthesis' caught my eye.

I guess it doesn't really matter for the internal implementation, but it would be a shame to have this show up in the parse tree.

Add support for c++ as a valid language identifier along with cpp

Currently, in an md file, this is what cplusplus code blocks look like:

Screenshot 2022-06-11 at 7 29 32 PM

Some projects use c++ instead of cpp to mark cplusplus code blocks, it would be nice if that was recognised as well.
I would've made a PR, but I am not well acquainted with the ways of tree-sitter, hence filing an issue.

Thanks!

No `code_span` nodes after latest update

After I synced neovim's packer, it updated the markdown parser, and it stopped working for me, since I use code_span and code_span_delimiter nodes in my custom queries. On another machine, where I did not update, I see with the TSPlayground that these nodes are there, but with updated parser, they are just gone. Is it intentional, and your parser will not recognize code spans like this one any more?

Code fence aliases

Tim Pope's implementation includes the following

let g:markdown_fenced_languages = ['bash=sh']

It would be nice to implement in here as well.
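
To make the request concrete, here is a minimal Python sketch of how such alias entries could be resolved. It assumes the `name=target` convention of the Vim setting above (text before `=` is the fence label, text after it is the language to use). `build_alias_table` and `resolve_language` are hypothetical helpers, not part of this parser or of any editor API.

```python
from typing import Dict, Iterable


def build_alias_table(entries: Iterable[str]) -> Dict[str, str]:
    """Turn entries such as 'bash=sh' or 'python' into a lookup table."""
    table: Dict[str, str] = {}
    for entry in entries:
        name, sep, target = entry.partition("=")
        # An entry without '=' maps the label to itself.
        table[name] = target if sep else name
    return table


def resolve_language(fence_label: str, table: Dict[str, str]) -> str:
    """Return the aliased language, or the label itself when no alias exists."""
    return table.get(fence_label, fence_label)


aliases = build_alias_table(["bash=sh", "python"])
print(resolve_language("bash", aliases))
```

Unknown labels fall through unchanged, so adding an alias never breaks a fence label that already names a language directly.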

Feature request: add front matter support

Rationale

Markdown is quite a popular choice for writing content for static sites or documentation. The majority of existing tooling supports so-called "front matter", which allows attaching metadata to the document.

See example https://gohugo.io/content-management/front-matter/

Suggestion

Front matter is neither part of the standard nor standardized across existing tooling. Nevertheless, the community seems to have pretty much settled on two front matter formats:

  • YAML is used in case of --- separator. E.g:

    ---
    tags: [foo, bar]
    summary: a short summary
    ---
    
    # My blog post
    
    ...
  • TOML is used in case of +++ separator. E.g:

    +++
    tags = ["foo", "bar"]
    summary = "a short summary"
    +++
    
    # My blog post
    
    ...

It would be nice to support both YAML and TOML injections for these two most popular choices. I'm sure it would cover 99% of cases.
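
The delimiter-to-language mapping suggested above could be sketched roughly as follows in Python. This only illustrates the idea and is not how the grammar implements metadata blocks. `DELIMITERS` and `split_front_matter` are hypothetical names, and the sketch assumes the front matter must start on the very first line.

```python
from typing import Optional, Tuple

# Assumed mapping from delimiter line to the language to inject.
DELIMITERS = {"---": "yaml", "+++": "toml"}


def split_front_matter(text: str) -> Tuple[Optional[str], str, str]:
    """Return (language, front_matter, body); language is None if absent."""
    lines = text.split("\n")
    opener = lines[0].rstrip()
    language = DELIMITERS.get(opener)
    if language is None:
        return None, "", text
    for index in range(1, len(lines)):
        # The block ends at the next line equal to the opening delimiter.
        if lines[index].rstrip() == opener:
            front = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return language, front, body
    # No closing delimiter: treat the whole text as ordinary Markdown.
    return None, "", text


doc = "---\ntags: [foo, bar]\nsummary: a short summary\n---\n\n# My blog post\n"
print(split_front_matter(doc)[0])
```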

Multiple thematic breaks

text 1

---

text 2

---

text 3

produces

paragraph [0, 0] - [1, 0]
thematic_break [2, 0] - [3, 0]
paragraph [4, 0] - [5, 0]
  document [4, 0] - [4, 6]
    flow_node [4, 0] - [4, 6]
      plain_scalar [4, 0] - [4, 6]
        string_scalar [4, 0] - [4, 6]
thematic_break [6, 0] - [7, 0]
paragraph [8, 0] - [9, 0]

when I would have expected

paragraph [0, 0] - [1, 0]
thematic_break [2, 0] - [3, 0]
paragraph [4, 0] - [5, 0]
thematic_break [6, 0] - [7, 0]
paragraph [8, 0] - [9, 0]

A smaller example would be

a
***
b
***
paragraph [0, 0] - [1, 0]
thematic_break [1, 0] - [2, 0]
paragraph [2, 0] - [3, 0]
  document [2, 0] - [2, 1]
    flow_node [2, 0] - [2, 1]
      plain_scalar [2, 0] - [2, 1]
        string_scalar [2, 0] - [2, 1]
thematic_break [3, 0] - [4, 0]

instead of

paragraph [0, 0] - [1, 0]
thematic_break [1, 0] - [2, 0]
paragraph [2, 0] - [3, 0]
thematic_break [3, 0] - [4, 0]
a
***
b

works as expected. It is the second break that changes the result.

Invalid link reference definition highlight

Describe the bug

Code example

-   [Link][]
-   [Link2][]

  [Link]: https://google.com
  [Link2]: https://gmail.com

Expected behavior

Link reference definitions are highlighted as such. The CommonMark spec allows up to three spaces of indentation for them (src).
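
The indentation rule can be illustrated with a deliberately simplified Python sketch. The pattern below is an assumption made for illustration: it ignores titles, angle-bracket destinations, escaped brackets and multi-line definitions, so it is much looser than the spec and unrelated to how the grammar recognises definitions.

```python
import re
from typing import Optional, Tuple

# Zero to three leading spaces, a bracketed label, a colon, then a destination.
LINK_REF_DEF = re.compile(r"^ {0,3}\[([^\]]+)\]:\s*(\S+)")


def parse_link_reference(line: str) -> Optional[Tuple[str, str]]:
    """Return (label, destination) for a simple one-line definition."""
    match = LINK_REF_DEF.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_link_reference("  [Link]: https://google.com"))
```

With four or more leading spaces the line no longer matches, which mirrors the point at which CommonMark would treat it as indented code instead.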

Actual behavior
Screenshot 2022-08-04 at 16 38 12

Error on markdown images

If two consecutive lines have images the parser fails.

Example

### test doc

![img1](link1)
![img2](link2)

what's going on?
### why an error after two image links ?

AST from TSPlayground

atx_heading [0, 0] - [1, 0]
  atx_h3_marker [0, 0] - [0, 3]
  heading_content [0, 3] - [0, 12]
ERROR [2, 0] - [7, 0]
  image [3, 0] - [3, 14]
    image_description [3, 2] - [3, 6]
    link_destination [3, 8] - [3, 13]
  ERROR [4, 0] - [5, 0]
  ERROR [6, 0] - [6, 3]

However, if the newline between the images is removed, or an additional blank line is added between the two images, the parser works.

That is, both of the following work, while the example above does not:

![img1](link1)![img2](link2)

![img1](link1)

![img2](link2)

nvim becomes unresponsive when a fenced code block uses attributes

Hi there, and thank you for your parser! Fixing this issue might be out of scope, as it is specific to Pandoc's extension of Markdown, but since it leads to a crash / nvim becoming unresponsive I am reporting it anyway, because people might come across this by accident.

Pandoc allows attributes to be added to block level and inline elements (https://pandoc.org/MANUAL.html#extension-attributes) using a syntax with curly braces (https://pandoc.org/MANUAL.html#extension-header_attributes). These attributes can also be added to (fenced) code blocks (https://pandoc.org/MANUAL.html#fenced-code-blocks), which is commonly used by literate programming formats combining computation and documentation such as Rmarkdown to note down the language of a code block for execution. Unfortunately this syntax leads to a crash.

So while

```R
1 + 1
```

works,

```{R}
1 + 1
```

becomes very slow once it is typed out and gives warnings about the missing closing } while typing. This must be somehow related to how the info_string after the three backticks is parsed. If this could allow for curly braces it would help a lot of pandoc and Rmarkdown users.

Code fence indents

Hi, not too sure about the current implementation, but would it be possible to get treesitter indenting according to the parser for the language corresponding to code blocks, just like how treesitter highlights are correctly used based on the language?

Support tag convention

Hello!
Thanks for this great parser, it's helping me do some great things with parsing my notes.

I currently use tags as found in Obsidian (and other note management tools that use tags) and would like to know whether they could be treated as tags in tree-sitter.
The syntax is #word, where word can consist of any letters or digits.

References:

https://help.obsidian.md/How+to/Working+with+tags
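
A small Python sketch of the requested `#word` convention, for illustration only. The pattern encodes two assumptions that go beyond the description above: a tag must start at the beginning of the text or after whitespace, and it must end at a word boundary. `TAG` and `find_tags` are hypothetical names.

```python
import re
from typing import List

# '#' not preceded by a non-space character, then letters/digits only.
# [^\W_] matches any Unicode letter or digit but excludes the underscore.
TAG = re.compile(r"(?<!\S)#([^\W_]+)\b")


def find_tags(text: str) -> List[str]:
    """Return the tag names (without the leading '#') found in text."""
    return TAG.findall(text)


print(find_tags("# Heading\nnotes on #rust and #2024, see a#b"))
```

Because an ATX heading requires a space after its `#` characters, a heading line such as `# Heading` does not match this pattern.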

setext heading level 1 not parsed

Describe the bug
The second heading with = underline isn't highlighted
Screen Shot 2022-09-04 at 1 50 29 PM

Code example

title
=====

asdf

title
=====

asdf

Expected behavior
maybe something like this?

section [0, 0] - [9, 0]
  setext_heading [0, 0] - [2, 0]
    heading_content: paragraph [0, 0] - [1, 0]
      inline [0, 0] - [0, 5]
    setext_h1_underline [1, 0] - [1, 5]
  paragraph [3, 0] - [4, 0]
    inline [3, 0] - [3, 4]
  section [5, 0] - [9, 0]
    setext_heading [5, 0] - [7, 0]
      heading_content: paragraph [5, 0] - [6, 0]
        inline [5, 0] - [5, 5]
      setext_h1_underline [6, 0] - [6, 5]
    paragraph [8, 0] - [9, 0]
      inline [8, 0] - [8, 4]

Actual behavior

section [0, 0] - [9, 0]
  setext_heading [0, 0] - [2, 0]
    heading_content: paragraph [0, 0] - [1, 0]
      inline [0, 0] - [0, 5]
    setext_h1_underline [1, 0] - [1, 5]
  paragraph [3, 0] - [4, 0]
    inline [3, 0] - [3, 4]
  paragraph [5, 0] - [6, 0]
    inline [5, 0] - [5, 5]
  paragraph [6, 0] - [7, 0]
    inline [6, 0] - [6, 5]
  paragraph [8, 0] - [9, 0]
    inline [8, 0] - [8, 4]

Text in square brackets conceals like a link

Text contained in square brackets is rendered as a link, despite the absence of a link destination in parentheses.

Code example

This [link] renders like a link when it should not.

Expected behavior
When no link destination is provided square brackets render as text.

Actual behavior
When no link destination is provided square brackets render as link.

Segfault in neovim on roc FAQ.md

Hi! First, thanks a lot for maintaining this markdown parser :)

Neovim crashes with a segfault when opening the FAQ.md file in the source code of the Roc language (https://www.roc-lang.org/). I believe it's related to the markdown parser, but I'm not 100% sure because it could be related to some other parser that is injected.

I can take a look when I have some time this week if you wish.

Code block highlights

Is it possible to highlight blocks of code surrounded by triple backticks according to the specified language?

Bug: code block in block quote

something like this will break the AST:

> ```bash
> git add --all
> git commit -m "msg"
> ```

playground shows that:

block_quote [0, 0] - [4, 0]
  block_quote_marker [0, 0] - [0, 2]
  fenced_code_block [0, 2] - [4, 0]
    info_string [0, 5] - [0, 9]
      language [0, 5] - [0, 9]
    block_quote_marker [1, 0] - [1, 2]
    code_fence_content [1, 2] - [3, 2]
      ERROR [1, 2] - [3, 1]
        command [1, 2] - [1, 15]
          name: command_name [1, 2] - [1, 5]
            word [1, 2] - [1, 5]
          argument: word [1, 6] - [1, 9]
          argument: word [1, 10] - [1, 15]
        command [2, 0] - [2, 21]
          file_redirect [2, 0] - [2, 5]
            destination: word [2, 2] - [2, 5]
          name: command_name [2, 6] - [2, 12]
            word [2, 6] - [2, 12]
          argument: word [2, 13] - [2, 15]
          argument: string [2, 16] - [2, 21]
      block_quote_marker [2, 0] - [2, 2]
      block_quote_marker [3, 0] - [3, 2]

It seems this problem is very hard to fix.

comma in info string prevents language specific parsing of code block

I really appreciate code parsing in fenced code blocks. In Rmarkdown, the language identifier (r) is often followed by a comma and further "chunk options". The comma prevents the language from being recognized. Could we allow comma in addition to whitespace after language identifier?

```{r, fig.height=4, fig.width=7}
x <- rnorm(100)                  # no syntax highlighting in these blocks
plot(x)
```

```{r chunk-name, fig.height=4, fig.width=7}
x <- rnorm(100)                  # this block has proper highlighting
plot(x)
```
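
The relaxation asked for here could look roughly like the Python sketch below: strip the surrounding braces and treat a comma as a separator in addition to whitespace. This is an assumption about desired behaviour, not what the grammar currently does, and `rmarkdown_language` is a hypothetical helper. Pandoc attribute forms such as `{.python}` are not handled.

```python
import re


def rmarkdown_language(info: str) -> str:
    """Extract the language from an info string such as '{r, fig.height=4}'."""
    text = info.strip()
    # Rmarkdown wraps the chunk header in curly braces; drop them if present.
    if text.startswith("{") and text.endswith("}"):
        text = text[1:-1].strip()
    # The language ends at the first run of whitespace and/or commas.
    return re.split(r"[\s,]+", text, maxsplit=1)[0]


print(rmarkdown_language("{r, fig.height=4, fig.width=7}"))
```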

Link concealing?

I've been using plasticboy/vim-markdown for markdown syntax highlighting and other stuff for years.
I really missed the link concealing feature of that plugin when I switched to tree-sitter-markdown today.
Link concealing shows [text](link) as text and makes the whole document less cluttered, especially if the link is very long.

Not sure if link concealing is within the scope of tree-sitter.

Shortcut Link and Code Span Edge Case

Hello! I ran into the following edge case with a combination of shortcut_link and code_span:

- `x[0]` is equivalent to `*x`

...is parsed as:

  list_item [5, 0] - [7, 0]
    list_marker_minus [5, 0] - [5, 2]
    paragraph [5, 2] - [6, 0]
      shortcut_link [5, 4] - [5, 7]
        link_text [5, 5] - [5, 6]
      code_span [5, 7] - [5, 27]
        code_span_delimiter [5, 7] - [5, 8]
        code_span_delimiter [5, 26] - [5, 27]

...when it should be:

  list_item [5, 0] - [7, 0]
    list_marker_minus [5, 0] - [5, 2]
    paragraph [5, 2] - [6, 0]
      code_span [5, 2] - [5, 8]
        code_span_delimiter [5, 2] - [5, 3]
        code_span_delimiter [5, 7] - [5, 8]

The smallest example I could come up with is:

`[a]`b`*c`

...which is parsed as:

paragraph [7, 0] - [8, 0]
  shortcut_link [7, 1] - [7, 4]
    link_text [7, 2] - [7, 3]
  code_span [7, 4] - [7, 7]
    code_span_delimiter [7, 4] - [7, 5]
    code_span_delimiter [7, 6] - [7, 7]

...instead of:

paragraph [7, 0] - [8, 0]
  code_span [7, 0] - [7, 5]
    code_span_delimiter [7, 0] - [7, 1]
    code_span_delimiter [7, 4] - [7, 5]

This is a very "edge case" kind of scenario, but I figured I should mention it.

PS: thanks for your work on this, I was waiting for a tree sitter markdown grammar and I attempted to write one myself a few months ago but gave up cause I couldn't figure it out 🤣, so I appreciate the effort this must've taken.

Error: Ranges can only be made from 6 element long tables or nodes.

Haven't investigated this fully yet but it seems to be related to the latest updates.

I'm using this parser to generate vimdoc from markdown, I've narrowed it down to the part of code that's failing:

This fails even with an empty file:

  local parser = vim.treesitter.get_string_parser(
    -- contents of a sample markdown file consisting of 2 lines
    -- title and body
    "# title\nbody\n\n",
    "markdown"
  )
  parser:parse()

The call to parser:parse() fails with:

Error detected while processing /home/bhagwan/test.lua:
E5113: Error while calling lua chunk: /usr/share/nvim/runtime/lua/vim/treesitter/languagetree.lua:115:
Ranges can only be made from 6 element long tables or nodes.
stack traceback:
        [C]: in function 'set_included_ranges'
        /usr/share/nvim/runtime/lua/vim/treesitter/languagetree.lua:115: in function 'parse'
        /usr/share/nvim/runtime/lua/vim/treesitter/languagetree.lua:149: in function 'parse'
        /home/bhagwan/test.lua:3: in main chunk

Bug: weird indexed list

1. a
  1. b
    1. c
    2. d
      1. e
      2. f

parsed as:

list [0, 0] - [6, 0]
  list_item [0, 0] - [1, 2]
    list_marker_dot [0, 0] - [0, 3]
    paragraph [0, 3] - [1, 2]
  list_item [1, 2] - [6, 0]
    list_marker_dot [1, 2] - [1, 5]
    paragraph [1, 5] - [4, 6]
    list [4, 6] - [6, 0]
      list_item [4, 6] - [5, 6]
        list_marker_dot [4, 6] - [4, 9]
        paragraph [4, 9] - [5, 6]
      list_item [5, 6] - [6, 0]
        list_marker_dot [5, 6] - [5, 9]
        paragraph [5, 9] - [6, 0]

As a result, the list markers before c and d get no Tree-sitter highlight:

image

Table support?

First, great job on this, and thank you for taking ownership of Markdown in Neovim's tree-sitter :-)

Are tables supported? I'm using tree-sitter to auto-generate vimdoc for my plugin from the README. I used to do this with https://github.com/ikatyang/tree-sitter-markdown, which recognized the table below, but that does not seem to be the case here.

Are tables supported?

screenshot-1639535526

For reference, here's how it's parsed with the ikatyang:
screenshot-1639536128

Error: query: invalid node type at position 152

Hi again @MDeiml,

I seem to encounter a new error when using the markdown parser with the nightly, using get_string_parser fails:

vim.treesitter.get_string_parser("test", "cpp")              -- success
vim.treesitter.get_string_parser("test", "markdown_inline")  -- success
vim.treesitter.get_string_parser("test", "markdown")         -- FAIL

With error:

The reason the runtime paths are weird is because I'm using the nightly appimage

E5108: Error executing lua ...488e/usr/share/nvim/runtime/lua/vim/treesitter/query.lua:174: query: invalid node type at position 152
stack traceback:
        [C]: in function '_ts_parse_query'
        ...488e/usr/share/nvim/runtime/lua/vim/treesitter/query.lua:174: in function 'get_query'
        ...r/share/nvim/runtime/lua/vim/treesitter/languagetree.lua:35: in function 'get_string_parser'
        .../site/pack/vendor/start/ts-vimdoc.nvim/lua/ts-vimdoc.lua:23: in function 'docgen'
        [string ":lua"]:1: in main chunk

Missing headings level names

The headings level names are missing:

Heading      Default      Treesitter
-----------------------------------------
  # Level 1  markdownH1   markdownTSTitle
 ## Level 2  markdownH2   markdownTSTitle
### Level 3  markdownH3   markdownTSTitle
...
-----------------------------------------

Implementing the levels would significantly improve the readability via color schemes.
This module is amazing, and the H levels would make it even more spectacular.
❤️

Injection for LaTeX blocks

Although it does not appear to have been added to the GFM spec, GitHub recently added support for LaTeX rendering on their website using $ and $$ blocks. It is more formally outlined here.

Accordingly, it would be amazing if the parser could be extended to detect LaTeX blocks as marked by $ and $$ and provide injections for them.

Emphasis does not always work as intended

Additional parameters in info string break injection.

For example, as in this example from the spec:

~~~~    ruby startline=3 $%@#$
def foo(x)
  return 3
end
~~~~~~~

Only the first word of the info string should be treated as a language. To solve this we probably need to add a new node so the output looks like

(fenced_code_block
  (info_string
    (language))
  (fenced_code_block_content))
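
The "first word only" rule can be stated in a few lines of Python. This is a sketch of the rule itself, not of the grammar change proposed above, and `language_from_info_string` is a hypothetical helper.

```python
def language_from_info_string(info: str) -> str:
    """Return the first whitespace-separated word of an info string."""
    words = info.split()
    # An empty info string means no language was given.
    return words[0] if words else ""


print(language_from_info_string("    ruby startline=3 $%@#$"))
```

Everything after the first word stays part of the info string but would not be used to pick the injected language.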

block_quote_marker includes whitespace

I've played around with the Treesitter playground (using Neovim), with the aim of having nice-looking markdown right in the nvim buffer (using conceal). I found that the block_quote_marker nodes include not only > but also the following whitespace. Is this dictated by Markdown syntax? If so, is it possible to break them down into two nodes?
