lezer-markdown

This is an incremental Markdown (CommonMark, with support for extensions) parser that integrates well with the Lezer parser system. It does not in fact use the Lezer runtime (which runs LR parsers, and Markdown can't really be parsed that way), but it produces Lezer-style compact syntax trees and consumes fragments of such trees for its incremental parsing.

Note that this only parses the document, producing a data structure that represents its syntactic form, and doesn't help with outputting HTML. Also, in order to be single-pass and incremental, it doesn't do some things that a conforming CommonMark parser is expected to do—specifically, it doesn't validate link references, so it'll parse [a][b] and similar as a link, even if no [b] reference is declared.
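For orientation, here is a minimal usage sketch (it assumes a Node-style setup with @lezer/markdown installed); it parses a document and walks the resulting Lezer tree:

```typescript
import {parser} from "@lezer/markdown"

// Parse a document into a compact Lezer tree and print each node's
// name and document range.
const tree = parser.parse("# Heading\n\nSome *emphasis*.")
tree.iterate({
  enter: node => {
    console.log(node.name, node.from, node.to)
  },
})
```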

The @codemirror/lang-markdown package integrates this parser with CodeMirror to provide Markdown editor support.

The code is licensed under an MIT license.

Interface

parser: MarkdownParser

The default CommonMark parser.

class MarkdownParser extends Parser

A Markdown parser configuration.

nodeSet: NodeSet

The parser's syntax node types.

configure(spec: MarkdownExtension) → MarkdownParser

Reconfigure the parser.
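For example (a sketch using the Strikethrough extension exported by this package), a reconfigured parser is derived without mutating the original:

```typescript
import {parser, Strikethrough} from "@lezer/markdown"

// Derive a new parser that understands ~~strikethrough~~ and has the
// setext-heading block parser removed. The original parser is unchanged.
const myParser = parser.configure([
  Strikethrough,
  {remove: ["SetextHeading"]},
])
```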

parseInline(text: string, offset: number) → Element[]

Parse the given piece of inline text at the given offset, returning an array of Element objects representing the inline content.

interface MarkdownConfig

Objects of this type are used to configure the Markdown parser.

props⁠?: readonly NodePropSource[]

Node props to add to the parser's node set.

defineNodes⁠?: readonly (string | NodeSpec)[]

Define new node types for use in parser extensions.

parseBlock⁠?: readonly BlockParser[]

Define additional block parsing logic.

parseInline⁠?: readonly InlineParser[]

Define new inline parsing logic.

remove⁠?: readonly string[]

Remove the named parsers from the configuration.

wrap⁠?: ParseWrapper

Add a parse wrapper (such as a mixed-language parser) to this parser.

type MarkdownExtension = MarkdownConfig | readonly MarkdownExtension[]

To make it possible to group extensions together into bigger extensions (such as the Github-flavored Markdown extension), reconfiguration accepts nested arrays of config objects.
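A bundle is therefore just a (possibly nested) array; a sketch using extensions exported by this package:

```typescript
import {GFM, Subscript, Superscript, parser} from "@lezer/markdown"
import type {MarkdownExtension} from "@lezer/markdown"

// Nested arrays of configs are flattened during reconfiguration.
const MyFlavor: MarkdownExtension = [GFM, [Subscript, Superscript]]
const myParser = parser.configure(MyFlavor)
```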

parseCode(config: Object) → MarkdownExtension

Create a Markdown extension to enable nested parsing on code blocks and/or embedded HTML.

config
codeParser?: fn(info: string) → Parser | null

When provided, this will be used to parse the content of code blocks. info is the string after the opening ``` marker, or the empty string if there is no such info or this is an indented code block. If there is a parser available for the code, it should return a function that can construct the parse.

htmlParser⁠?: Parser

The parser used to parse HTML tags (both block and inline).
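A sketch of wiring this up, assuming @lezer/javascript is available for code-block content (swap in whatever Lezer parsers you actually use):

```typescript
import {parser, parseCode} from "@lezer/markdown"
import {parser as jsParser} from "@lezer/javascript"

const codeExt = parseCode({
  codeParser(info) {
    // `info` is the text after the opening ``` fence ("" for indented
    // code blocks). Return null to leave the block unparsed.
    return info == "js" || info == "javascript" ? jsParser : null
  },
})

const myParser = parser.configure(codeExt)
```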

GitHub Flavored Markdown

GFM: MarkdownConfig[]

Extension bundle containing Table, TaskList, Strikethrough, and Autolink.

Table: MarkdownConfig

This extension provides GFM-style tables, using syntax like this:

| head 1 | head 2 |
| ---    | ---    |
| cell 1 | cell 2 |
TaskList: MarkdownConfig

Extension providing GFM-style task list items, where list items can be prefixed with [ ] or [x] to add a checkbox.

Strikethrough: MarkdownConfig

An extension that implements GFM-style Strikethrough syntax using ~~ delimiters.

Autolink: MarkdownConfig

Extension that implements autolinking for www./http:///https:///mailto:/xmpp: URLs and email addresses.

Other extensions

Subscript: MarkdownConfig

Extension providing Pandoc-style subscript using ~ markers.

Superscript: MarkdownConfig

Extension providing Pandoc-style superscript using ^ markers.

Emoji: MarkdownConfig

Extension that parses two colons with only letters, underscores, and numbers between them as Emoji nodes.
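The matching rule can be sketched as a plain function (an illustration of the rule, not the extension's actual code):

```typescript
// Emoji rule sketch: two colons with at least one letter, digit, or
// underscore between them, e.g. ":smile:".
const emojiRE = /^:[a-zA-Z0-9_]+:/

// Returns the end position of the emoji token at `pos`, or -1 if the
// text at `pos` does not match.
function matchEmoji(text: string, pos: number): number {
  const m = emojiRE.exec(text.slice(pos))
  return m ? pos + m[0].length : -1
}
```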

Extension

The parser can, to a certain extent, be extended to handle additional syntax.

interface NodeSpec

Used in the configuration to define new syntax node types.

name: string

The node's name.

block⁠?: boolean

Should be set to true if this type represents a block node.

composite?: fn(cx: BlockContext, line: Line, value: number) → boolean

If this is a composite block, this should hold a function that, at the start of a new line where that block is active, checks whether the composite block should continue (return value) and optionally adjusts the line's base position and registers nodes for any markers involved in the block's syntax.

style⁠?: Tag | readonly Tag[] | Object<Tag | readonly Tag[]>

Add highlighting tag information for this node. The value of this property may either be a tag or array of tags to assign directly to this node, or an object in the style of styleTags's argument to assign more complicated rules.
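A sketch of a NodeSpec-based config (the Callout names are hypothetical; the tags come from @lezer/highlight):

```typescript
import {tags as t} from "@lezer/highlight"
import type {MarkdownConfig} from "@lezer/markdown"

// Hypothetical nodes: a "Callout" block plus its marker, with
// highlighting tags attached through the `style` property.
const CalloutNodes: MarkdownConfig = {
  defineNodes: [
    {name: "Callout", block: true, style: t.quote},
    {name: "CalloutMark", style: t.processingInstruction},
  ],
}
```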

class BlockContext implements PartialParse

Block-level parsing functions get access to this context object.

lineStart: number

The start of the current line.

parser: MarkdownParser

The parser configuration used.

depth: number

The number of parent blocks surrounding the current block.

parentType(depth⁠?: number = this.depth - 1) → NodeType

Get the type of the parent block at the given depth. When no depth is passed, return the type of the innermost parent.

nextLine() → boolean

Move to the next input line. This should only be called by (non-composite) block parsers that consume the line directly, or leaf block parser nextLine methods when they consume the current line (and return true).

prevLineEnd() → number

The end position of the previous line.

startComposite(type: string, start: number, value?: number = 0)

Start a composite block. Should only be called from block parser functions that return null.

addElement(elt: Element)

Add a block element. Can be called by block parsers.

addLeafElement(leaf: LeafBlock, elt: Element)

Add a block element from a leaf parser. This makes sure any extra composite block markup (such as blockquote markers) inside the block are also added to the syntax tree.

elt(type: string, from: number, to: number, children?: readonly Element[]) → Element

Create an Element object to represent some syntax node.

interface BlockParser

Block parsers handle block-level structure. There are three general types of block parsers:

  • Composite block parsers, which handle things like lists and blockquotes. These define a parse method that starts a composite block and returns null when it recognizes its syntax.

  • Eager leaf block parsers, used for things like code or HTML blocks. These can unambiguously recognize their content from its first line. They define a parse method that, if it recognizes the construct, moves the current line forward to the line beyond the end of the block, adds a syntax node for the block, and returns true.

  • Leaf block parsers that observe a paragraph-like construct as it comes in, and optionally decide to handle it at some point. This is used for "setext" (underlined) headings and link references. These define a leaf method that checks the first line of the block and returns a LeafBlockParser object if it wants to observe that block.
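A sketch of the second (eager) style, registering a hypothetical Divider block for lines consisting of "%%%":

```typescript
import type {BlockContext, Line, MarkdownConfig} from "@lezer/markdown"

// Hypothetical eager block parser: a line containing exactly "%%%"
// becomes a Divider block. A sketch only; a real parser would also
// deal with trailing whitespace, indentation limits, and so on.
const Divider: MarkdownConfig = {
  defineNodes: [{name: "Divider", block: true}],
  parseBlock: [{
    name: "Divider",
    parse(cx: BlockContext, line: Line): boolean {
      if (line.text.slice(line.pos) != "%%%") return false
      const start = cx.lineStart + line.pos
      cx.nextLine() // consume the matched line
      cx.addElement(cx.elt("Divider", start, start + 3))
      return true
    },
    before: "HorizontalRule",
  }],
}
```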

name: string

The name of the parser. Can be used by other block parsers to specify precedence.

parse?: fn(cx: BlockContext, line: Line) → boolean | null

The eager parse function, which can look at the block's first line and return false to do nothing, true if it has parsed (and moved past a block), or null if it has started a composite block.

leaf?: fn(cx: BlockContext, leaf: LeafBlock) → LeafBlockParser | null

A leaf parse function. If no regular parse functions match for a given line, its content will be accumulated for a paragraph-style block. This method can return an object that overrides that style of parsing in some situations.

endLeaf?: fn(cx: BlockContext, line: Line, leaf: LeafBlock) → boolean

Some constructs, such as code blocks or newly started blockquotes, can interrupt paragraphs even without a blank line. If your construct can do this, provide a predicate here that recognizes lines that should end a paragraph (or other non-eager leaf block).

before⁠?: string

When given, this parser will be installed directly before the block parser with the given name. The default configuration defines block parsers with names LinkReference, IndentedCode, FencedCode, Blockquote, HorizontalRule, BulletList, OrderedList, ATXHeading, HTMLBlock, and SetextHeading.

after⁠?: string

When given, the parser will be installed directly after the parser with the given name.

interface LeafBlockParser

Objects that are used to override paragraph-style blocks should conform to this interface.

nextLine(cx: BlockContext, line: Line, leaf: LeafBlock) → boolean

Update the parser's state for the next line, and optionally finish the block. This is not called for the first line (the object is constructed at that line), but for any further lines. When it returns true, the block is finished. It is okay for the function to consume the current line or any subsequent lines when returning true.

finish(cx: BlockContext, leaf: LeafBlock) → boolean

Called when the block is finished by external circumstances (such as a blank line or the start of another construct). If this parser can handle the block up to its current position, it should finish the block and return true.

class Line

Data structure used during block-level per-line parsing.

text: string

The line's full text.

baseIndent: number

The base indent provided by the composite contexts (that have been handled so far).

basePos: number

The string position corresponding to the base indent.

pos: number

The position of the next non-whitespace character beyond any list, blockquote, or other composite block markers.

indent: number

The column of the next non-whitespace character.

next: number

The character code of the character after pos.

skipSpace(from: number) → number

Skip whitespace after the given position, return the position of the next non-space character or the end of the line if there's only space after from.
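The contract can be illustrated with a standalone sketch (not the class's actual code):

```typescript
// Illustrative reimplementation of Line.skipSpace's contract: starting
// at `from`, return the position of the next non-whitespace character,
// or the end of the line if only whitespace remains after `from`.
function skipSpace(text: string, from: number): number {
  let pos = from
  while (pos < text.length && (text[pos] == " " || text[pos] == "\t")) pos++
  return pos
}
```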

moveBase(to: number)

Move the line's base position forward to the given position. This should only be called by composite block parsers or markup skipping functions.

moveBaseColumn(indent: number)

Move the line's base position forward to the given column.

addMarker(elt: Element)

Store a composite-block-level marker. Should be called from markup skipping functions when they consume any non-whitespace characters.

countIndent(to: number, from?: number = 0, indent?: number = 0) → number

Find the column position at to, optionally starting at a given position and column.

findColumn(goal: number) → number

Find the position corresponding to the given column.

class LeafBlock

Data structure used to accumulate a block's content during leaf block parsing.

parsers: LeafBlockParser[]

The block parsers active for this block.

start: number

The start position of the block.

content: string

The block's text content.

class InlineContext

Inline parsing functions get access to this context object, and use it to read the content and emit syntax nodes.

parser: MarkdownParser

The parser that is being used.

text: string

The text of this inline section.

offset: number

The starting offset of the section in the document.

char(pos: number) → number

Get the character code at the given (document-relative) position.

end: number

The position of the end of this inline section.

slice(from: number, to: number) → string

Get a substring of this inline section. Again uses document-relative positions.

addDelimiter(type: DelimiterType, from: number, to: number, open: boolean, close: boolean) → number

Add a delimiter at this given position. open and close indicate whether this delimiter is opening, closing, or both. Returns the end of the delimiter, for convenient returning from parse functions.

addElement(elt: Element) → number

Add an inline element. Returns the end of the element.

findOpeningDelimiter(type: DelimiterType) → number | null

Find an opening delimiter of the given type. Returns null if no delimiter is found, or an index that can be passed to takeContent otherwise.

takeContent(startIndex: number) → Element[]

Remove all inline elements and delimiters starting from the given index (which you should get from findOpeningDelimiter), resolve delimiters inside of them, and return them as an array of elements.

skipSpace(from: number) → number

Skip space after the given (document) position, returning either the position of the next non-space character or the end of the section.

elt(type: string, from: number, to: number, children?: readonly Element[]) → Element

Create an Element for a syntax node.

interface InlineParser

Inline parsers are called for every character of parts of the document that are parsed as inline content.

name: string

This parser's name, which can be used by other parsers to indicate a relative precedence.

parse(cx: InlineContext, next: number, pos: number) → number

The parse function. Gets the next character and its position as arguments. Should return -1 if it doesn't handle the character, or add some element or delimiter and return the end position of the content it parsed if it can.
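A sketch of an inline parser for a hypothetical ==highlight== syntax, using a resolving delimiter so matching is handled automatically:

```typescript
import type {InlineContext, MarkdownConfig} from "@lezer/markdown"

// The delimiter's `resolve` name wraps matched content in a Highlight
// node; `mark` turns each "==" token into a HighlightMark node.
const HighlightDelim = {resolve: "Highlight", mark: "HighlightMark"}

const Highlight: MarkdownConfig = {
  defineNodes: ["Highlight", "HighlightMark"],
  parseInline: [{
    name: "Highlight",
    parse(cx: InlineContext, next: number, pos: number): number {
      if (next != 61 /* '=' */ || cx.char(pos + 1) != 61) return -1
      // Register as both opening and closing; resolution decides which
      // role each occurrence plays.
      return cx.addDelimiter(HighlightDelim, pos, pos + 2, true, true)
    },
  }],
}
```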

before⁠?: string

When given, this parser will be installed directly before the parser with the given name. The default configuration defines inline parsers with names Escape, Entity, InlineCode, HTMLTag, Emphasis, HardBreak, Link, and Image. When no before or after property is given, the parser is added to the end of the list.

after⁠?: string

When given, the parser will be installed directly after the parser with the given name.

interface DelimiterType

Delimiters are used during inline parsing to store the positions of things that might be delimiters, if another matching delimiter is found. They are identified by objects with these properties.

resolve⁠?: string

If this is given, the delimiter should be matched automatically when a piece of inline content is finished. Such delimiters will be matched with delimiters of the same type according to their open and close properties. When a match is found, the content between the delimiters is wrapped in a node whose name is given by the value of this property.

When this isn't given, you need to match the delimiter eagerly using the findOpeningDelimiter and takeContent methods.

mark⁠?: string

If the delimiter itself should, when matched, create a syntax node, set this to the name of the syntax node.

class Element

Elements are used to compose syntax nodes during parsing.

type: number

The node's id.

from: number

The start of the node, as an offset from the start of the document.

to: number

The end of the node.

People

Contributors

acnebs, craftzdog, danon, losingle, marijnh, valadaptive, willcrichton


Issues

Accessing block type in block context

BlockContext.block is marked as internal and CompositeBlock is not exported. So in a leaf parser, you can't check what the parent block type is (or any of the blocks further up the stack).

My use case is almost the exact same as what's in the extensions for task item, but I need to allow any character for the marker rather than just space or x.

Maybe having a separate field or function that has the types would be possible? Right now I'm just declaring the block and type fields to access them.

Table extension does not follow GFM spec (cannot interrupt a paragraph)

In the GFM spec, all leaf blocks are allowed to interrupt a paragraph, for example a fenced code block can appear in the text immediately after other text with no line breaks, like so:

const x = 3

And this markdown parser respects that part of the spec. However, I've noticed that the Table extension provided by the lib does not follow this convention, and if a Table interrupts a paragraph lezer does not recognize it as such.

Easiest way to demonstrate is probably that Github's markdown parser, which follows the spec exactly AFAIK, allows this behavior, like so:

| Syntax | Description |
| --- | --- |
| Header | Title |
| Paragraph | Text |

Anyway just thought I'd point this out, thanks for making such a cool parsing library!

Line objects should maybe have a char() function equivalent to InlineContext.char()

Preface: This is a really minor "issue".

Currently, line objects implement similar methods to InlineContext objects but notably do not have a char() function, which feels disjointed / mildly disorientating (to me), considering the similarity of the two interfaces.

I would think this function would be document-relative, just like InlineContext.char().

To demonstrate my use-case:

function matches(points: number[], pos: number, cx: InlineContext | Line) {
  if (cx instanceof InlineContext)
    return points.every((ch, idx) => cx.char(pos + idx) === ch)
  else if (cx instanceof Line)
    return points.every((ch, idx) => cx.text.codePointAt(pos + idx) === ch)
  else return false
}

This function is a handy helper function to check if an array of specified code points can be found at a chosen position within the content of a Line or InlineContext object. Currently, this function behaves differently depending on whether you pass a Line object or if you pass a InlineContext object. With lines, the position is relative, and with inline, the position is document-relative.

I believe the reason why this isn't a thing is because Line is agnostic to its position within the document, so I understand if you don't want to implement this feature. I will say that, at least for me, the most common usage of Line is in the context of block parsers. Unless I am mistaken, in that context you will always have cx.lineStart available to you. Maybe a separate Line object type that includes document-relative position info and helper methods could be handy for block parsers?

And with this issue: I am done with my spam. Again, great work on this! It's awesome and I'm somehow having a lot of fun re-implementing my grammar for a third time.

Incorrect offset of children's children when using addElement in a BlockParser eager parser

Here is a representative example of my eager parser (BlockParser['parse']) code:

/** Returns a `MarkdownConfig` extension that adds a line-start based syntax using the given string.
 *  @example const ext = line('#', 'Heading1') */
function line(str: string, name: string, { interrupt, consume }: LineOpts): MarkdownConfig {

  const mark = name + 'Mark'
  const chars = toPoints(str)

  const parse = (cx: BlockContext, line: Line) => {
    if (matches(chars, line.pos, line)) {
      const start = cx.lineStart + line.pos
      const len = chars.length
      const offset = start + len

      // starting marker (e.g. '//' in a line comment)
      const children = [cx.elt(mark, start, offset)]
      if (!consume) children.push(...cx.parseInline(line.text.slice(len), offset))

      cx.nextLine()
      cx.addElement(cx.elt(name, start, cx.prevLineEnd(), children))

      return true
    }
    return false
  }

  return {
    defineNodes: [{ name, block: true }, mark],
    parseBlock: [{
      name,
      parse,
      endLeaf: !interrupt ? undefined : (_, line) => matches(chars, line.pos, line),
      before: 'Blockquote'
    }]
  }
}

The intention of this code is to generically support 'line' syntax, like ATX headings or JS line comments.

The way this malfunctions is as following:

// this is added as an extension
const ext = line('//', 'LineComment', { interrupt: true })

// given the following input (truncated): 
/*
Markdown Test Page
***
// `inline code`
*/

// I get this in the tree:
/*
└─ Document [0..22192]
   ├─ Paragraph [0..18]: 'Markdown Test Page'
   ├─ HorizontalRule [19..22]: '***'
   ├─ LineComment [23..39]
   │  ├─ LineCommentMark [23..25]: '//'
   │  └─ InlineCode [26..39]
   │     ├─ CodeMark [75..76]: 'i'
   │     └─ CodeMark [87..88]: ''
*/

As you can probably see, the 'children's children' have incorrect offsets. I've done some poking around, and while I'm still not sure why this is happening, I am fairly confident that the issue is not within the cx.inlineParse function.

(Screenshot omitted: the output of console.log(children).) To me, everything seems correct there; I think the issue is with addElement.

Table extension does not recognize tables that have any empty header cells


As far as I know, this is valid syntax for GFM tables. Well, actually, I can prove it!

see ma no headers

Just to demonstrate real-world use: with the right styling, using this lets you get simple unmarked tables. (Screenshot omitted.)

Great work on the extension support btw, its API is so much cleaner than markdown-it and natively supports more complex syntax. In particular, multi-character delimiters I've found very easy to add. I will absolutely attempt to make an HTML renderer for it.

EDIT: Hah, I just took a look at the GFM spec and it doesn't even mention this edge case. Don't you love Markdown?

Duplicate re-parsing when inserting an empty list item?

Hi! I'd like to ask a question about the Markdown block parser.
Say you have the below Markdown (| represents the current cursor position):

* List item|

```js
function foo() {
  return false
}
```

When inserting a new list item by hitting Enter key:

* List item
* |

```js
function foo() {
  return false
}
```

The JS codeblock after the list gets re-parsed. I could see the JavaScript parser running again.
Could you tell me why it happens? If we could avoid that, it would greatly improve performance.

The custom `endLeaf` function has no effect on parsing

I've been experimenting with markdown extensions these days. I have a markdown config object like this

const GoodBlock = {
    defineNodes: [
        { name: "GoodBlock", block: true }, "GoodMark",
    ],
    parseBlock: [{
        name: "GoodBlock",
        parse: GoodBlockParser,
        endLeaf: (_, line) => isGoodBlock(line) >= 0,
        after: "FencedCode",
    }]
};

but for me the endLeaf function has no effect on parsing. After some digging into the source code I think the problem is on line

for (let stop of parser.endLeafBlock) if (stop(this, line)) break lines

Of course, I don't have a real understanding of the source code, but it seems to me that the parser.endLeafBlock here refers only to the DefaultEndLeaf array, so my custom endLeaf function has no chance to affect the parsing.

But after replacing parser.endLeafBlock with this.parser.endLeafBlock, my endLeaf function becomes active and everything seems OK. Is this correct?

Checklists/todos are highlighted as such without a trailing space

(Originally posted in CodeMirror issue tracker but figured here is better)

In GFM, the task list syntax is like this:

- [ ] my task item

Lezer's markdown language highlights this correctly, but it also highlights this

- [ ]my task item

Which according to the spec is not a valid task list item, but is rather just a list item that starts with some [] characters.

You can see this in practice in GitHub's markdown renderer:

  • this is a task
  • [ ]this is not

Probably lezer's markdown parser should behave that way as well.

Table folding does not work


Without loading GFM, the table can be folded, but there is no header highlighting:

(Screenshots omitted.)

markdown({
    base: commonmarkLanguage,
    addKeymap: true,
    completeHTMLTags: true
}),

The link/image node does not seem to be constructed

The tree hierarchy resulting from parsing an example markdown is shown below.
There is a node with a blank node type name, but I assume that there must be link/image nodes there (is this assumption correct?).

Links:

[link](/uri "title")
+ "Document"
    + "Paragraph"
        + ""

Link reference definitions:

[foo]: /url "title"

[foo]
+ "Document"
    + "LinkReference"
        + ""
    + "Paragraph"
        + ""

Images:

![foo](/url "title")
+ "Document"
    + "Paragraph"
        + ""

Thanks for the great project.

Potential issue with escaping pipes in tables

Preface: I am unsure what is actually supposed to happen with this behavior.

This is easiest to explain with my use case:

I've added some custom syntax that looks like this:

`js|console.log('Inline code!')`
#font sans|Here is the sans font.|#

As you can see, both use the pipe character.

When you use them in a table, you need to escape the pipe. This matches markdown-it, which I'll use as a reference because I don't have much else to go off of.

| | | |
| :--: | :-- | :-- |
| `` `...` `` | Monospace | `Monospaced text.`
| `` `lang\|...` ``| Inline Code | `js\|console.log('Inline code!')`
| `$...$` | Math (TeX) | $\int_{-\infty}^\infty e^{-x^2}\,dx =\sqrt{\pi}$
| `@@...@@`| Escaped | @@/This text is __escaped__, and **will only be rendered as plain text.**/@@

The issue arises when comparing what markdown-it outputs with what Lezer outputs. (Screenshots omitted: in the markdown-it output, the inline code bit renders with the pipe recognized; the blue pipe character indicates that the syntax has been successfully recognized. I haven't gotten nesting working yet.)

Fundamentally, the table extension "carries over" the escape character into the inline parser, when maybe it shouldn't do that, like markdown-it. I'm not sure what the GFM spec implies to do here, if anything. I think the root issue is with the GFM spec introducing a crummy syntax, because allowing table delimiters inside inline-spans was a fairly absurd decision. MultiMarkdown doesn't do this (and has imo entirely superior tables).

Nested parser methods should maybe be public (cx.startNested)

Some flavors of Markdown add $ (inline) and $$ (or $$$) (block) for embedding TeX expressions. This is perfectly feasible to add to lezer-markdown using a MarkdownConfig extension, but it requires using private methods on the context objects.

Exposing the parser methods also opens up some support for other common extensions, like YAML front-matter, or something radical like MDX. Front-matter in particular is extremely common, and it might even benefit from a 'native' extension.

Parser assumes that IndentedCode will not be disabled

Certain bits of lezer-markdown check for the indentation level because it may invalidate a block or whatever entity due to IndentedCode becoming valid.

e.g. in BlockContext.advance:

if (line.indent < line.baseIndent + 4) {
  for (let stop of parser.endLeafBlock) if (stop(this, line)) break lines
}

However, because you are allowed to disable IndentedCode (which I appreciate!), these checks should no longer run if IndentedCode is disabled, but that edge case isn't checked. At least, that's what I would think - I understand that this is basically an extendable CommonMark implementation, and that this is what the spec describes.

The parse function in the BlockParser interface is missing the Line argument in the docs

The eager parser function in BlockParser seems to be documented incorrectly - in the readme, it shows that the function is only passed a context object, which does not expose enough information to actually do anything with. This is because, I assume unintentionally, the second argument passing a Line object is not shown in the documentation.

I assume it's supposed to be:

type EagerParser = (cx: BlockContext, line: Line) => boolean | null

MDX support

What would it take to support an MDX format, basically a combination of TSX and MD? These 2 are mixed together without any indication of which line of code is which language.

Spec deviation with angle-bracketed link destination followed by title

CommonMark links have two ways to write link destinations. You can write out the plain URL, or surround it in angle brackets (<>). Here's the relevant portion of the spec.

Regardless of whether or not you wrap your link destination in angle brackets, the spec is pretty clear that it must be separated from a following link title by whitespace:

If both link destination and link title are present, they must be separated by spaces, tabs, and up to one line ending.

@lezer/markdown does not follow this rule: it allows a link title to immediately follow a link destination that is wrapped in angle brackets, e.g.

[Click here](<http://example.com>"This link goes to example.com")

This will be parsed as a link with a destination and title, despite the two not being separated by whitespace.

Here's a CommonMark reference implementation demo that shows when link titles are and are not parsed.

Is removing language features possible?

Hi, I want to remove certain language features (namely setext-style headers) from parsing. Is this possible in an extensible way or will I need to modify the parser itself in order to do this? Had a look at the code but it wasn't clear to me if this is possible. Basically I want to do this but for CM6.

Thanks!

Add alerts to GFM

Github supports alerts like these:

Note

Alerts, notes and more

with a format like so:

> [!NOTE]
> [Alerts](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#alerts), notes and more

A simple parser can look like this:

export const Alert: MarkdownConfig = {
    defineNodes: [{
        name: "Alert",
    }, {
        name: "AlertInfo"
    }, {
        name: "AlertMark"
    }],
    parseBlock: [{
        name: "Alert",
        before: "Blockquote",
        parse(cx: BlockContext, line: Line): boolean {
            let cline = line.text;
            if (cline.slice(0, 2) != "> ") return false;
            let match = cline.match(/\[!(\w+)\]/);
            if (!match) return false;
            let type = match[1];
            let typeStart = cx.lineStart + cline.indexOf("[!") + 2;
            let typeEnd = typeStart + type.length;

            let start = cx.lineStart;
            let end = 0;
            let marks = [
                cx.elt("AlertMark", start, typeStart),
                cx.elt("AlertMark", typeEnd, typeEnd + 1)
            ];
            while (cx.nextLine()) {
                cline = line.text;
                end = cx.lineStart - 1;
                if (cline.slice(0, 1) != ">") {
                    break;
                }
                marks.push(cx.elt("AlertMark", cx.lineStart, cx.lineStart + 1));
            }
            if (end === 0) return false;

            cx.addElement(cx.elt("Alert", start, end, [
                ...marks,
                cx.elt("AlertInfo", typeStart, typeEnd)
            ]));
            return true;
        }
    }]
}

except I'm bad at writing parsers and the info is parsed wrongly (it marks markers correctly, but not the alertinfo, not sure why), and I'm not sure if you want to introduce a new node like Alert for what is essentially a quote with a colored title.

Some browsers can't catch SyntaxError in regexp at runtime

In markdown.ts, on line 1357 there's this code:

let Punctuation = /[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~\xA1\u2010-\u2027]/;
try { 
  Punctuation = /[\p{Pc}|\p{Pd}|\p{Pe}|\p{Pf}|\p{Pi}|\p{Po}|\p{Ps}]/u 
} catch (_) {
}

I understand that it's supposed to provide some form of default punctuation regexp, and if it's supported, then use this \p notation. \p is supported in browsers since Ecma2018, but I see you also want to support older browsers, because you provide the alternative notation without the \p, and to do that you use try/catch, which is supposed to catch SyntaxError, and it actually works, on some browsers.

Some engines compile the regexp literal at runtime, throw the exception, and can catch it; this is also how Node.js works. But other browsers, some based on the Firefox engine, parse regexp literals at compile time, so the catch simply never runs. It's as if parsing the whole script fails on them.

The fix could be very easy: simply write it as

try { 
  Punctuation = new RegExp('[\\p{Pc}|\\p{Pd}|\\p{Pe}|\\p{Pf}|\\p{Pi}|\\p{Po}|\\p{Ps}]', 'u');
} catch (_) {
}

In that case the pattern is guaranteed to be compiled at runtime, which means any SyntaxError thrown will be caught by the catch. One example of a browser that doesn't work without this is SeaMonkey:

[screenshot of the SyntaxError in SeaMonkey]

This fix would of course have to be applied everywhere unsupported tokens with the /u flag are used, but this is the only occurrence I found.

Some reference documentation is not updated

The `style` property of `NodeSpec`, which was added recently, is not documented in the markdown reference documentation in the README.md.

markdown/src/markdown.ts

Lines 961 to 966 in 3c5f5dc

/// Add highlighting tag information for this node. The value of
/// this property may either be a tag or array of tags to assign
/// directly to this node, or an object in the style of
/// [`styleTags`](https://lezer.codemirror.net/docs/ref/#highlight.styleTags)'s
/// argument to assign more complicated rules.
style?: Tag | readonly Tag[] | {[selector: string]: Tag | readonly Tag[]},

Unable to run "prepare" script

With Node 18.12.0 (LTS) and npm 8.19.2, I am unable to run the "prepare" script in this repository. Is there a specific combination of Node and npm expected, or perhaps a different package manager altogether?

Since this repository doesn't include a package lockfile, I'm also curious whether a breaking change in some dependency is causing this. Does it make sense to check a lockfile in, as is common for npm package development?

Context: I am considering modifying this package for more robust Markdown parsing in a product using CodeMirror. For example, the current parser does not differentiate between Markdown header content and whitespace (i.e. it parses {##}{ Content}, but I need it to parse {##}{ }{Content}).

View error output
~/s/g/l/markdown → npm i                                                                                                                                            (main → …1)

> @lezer/[email protected] prepare
> rollup -c rollup.config.js

Error loading `tslib` helper library.
[!] Error: Package subpath './package.json' is not defined by "exports" in /Users/jclem/src/github.com/lezer-parser/markdown/node_modules/tslib/package.json
Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath './package.json' is not defined by "exports" in /Users/jclem/src/github.com/lezer-parser/markdown/node_modules/tslib/package.json
    at new NodeError (node:internal/errors:393:5)
    at throwExportsNotFound (node:internal/modules/esm/resolve:358:9)
    at packageExportsResolve (node:internal/modules/esm/resolve:668:3)
    at resolveExports (node:internal/modules/cjs/loader:522:36)
    at Module._findPath (node:internal/modules/cjs/loader:562:31)
    at Module._resolveFilename (node:internal/modules/cjs/loader:971:27)
    at Module._load (node:internal/modules/cjs/loader:833:27)
    at Module.require (node:internal/modules/cjs/loader:1051:19)
    at require (node:internal/modules/cjs/helpers:103:18)
    at Object.<anonymous> (/Users/jclem/src/github.com/lezer-parser/markdown/node_modules/rollup-plugin-typescript2/dist/rollup-plugin-typescript2.cjs.js:25170:26)

npm ERR! code 1
npm ERR! path /Users/jclem/src/github.com/lezer-parser/markdown
npm ERR! command failed
npm ERR! command sh -c -- rollup -c rollup.config.js

npm ERR! A complete log of this run can be found in:
npm ERR!     /Users/jclem/.npm/_logs/2022-11-03T13_31_33_247Z-debug-0.log

The `line` parameter in `BlockParser` `parse()` methods should not be a side effect

First of all, thank you for all the work you've already poured into CodeMirror 6. I'm currently migrating and I really like the cleaner and faster API.

However, while I am an absolute fan of the pure design of the system, I noticed one very awkward side effect in the markdown package. Specifically, one can define additional BlockParsers with a parse method that receives a line parameter. A block parser is, as I've learned, expected to parse every line that belongs to the block and add the corresponding tree elements in one single go.

At first, I was very confused, since I only get a line object once and nextLine() only returns a boolean. It took me some time to realize that the line object magically updates after each call to nextLine. I just had a look, and indeed it's a class instance that updates itself on every call, not, as one might think when looking at the docs for the core package, a plain immutable object.

This feels very weird and out of place given the very pure and functional style of the whole ecosystem. Wouldn't it be cleaner, or at least more transparent to users, if nextLine returned a Line object or undefined instead of a boolean?

I do see the benefit of being able to write something like while(ctx.nextLine() && line.text !== someCondition) instead of always having to remember the line, but it does have its drawbacks.

An alternative would be to make it clear in the documentation that the line parameter is not a Line object (as defined in the core library), but actually kind of an iterator. Maybe even rename that class? That would make the side effect a little bit more obvious.

A third alternative might be to move the Line class instance to the context object, since ctx.line makes it clearer that the context and the line are coupled.

What do you think? Thanks a bunch in advance! Sorry that I'm starting off with such a nitty-gritty issue 🙈

Autolinks and regular link URLs are both parsed as "URL" nodes

The parser currently parses autolinks as URL nodes. However, the URL part of a regular link is also parsed as a URL node, and the two are semantically very different-- an autolink is a complete visible element in and of itself, while a URL in a regular link is completely invisible. In addition, the URL in a Link is the actual text of a URL, whereas the URL from an autolink includes the opening and closing brackets.

This is a problem for my use case--I'm trying to render Markdown to HTML (I know this library was designed for syntax highlighting, but it's the only incremental Markdown parser I can find). There's no way to tell whether a URL node should be rendered directly as an <a> (in the case of an autolink) or treated as metadata and ignored (in the case of a Link node) without walking up the tree to see if we're inside a Link node.

A fix for this would be to create an Autolink node like the existing Link node, with the opening and closing brackets and URL nested inside it as LinkMark and URL nodes.

Proposed exports

For Nota, I have implemented a new syntax as a Markdown extension using the interface in @lezer-parser/markdown: https://github.com/nota-lang/nota/blob/markdown/packages/nota-syntax/lib/parse.ts

Everything worked quite well -- great job on the flexible API design! But I ran into a few cases where I needed functionality that's currently private to markdown.ts. I'd like to show what those cases are, and in turn propose they be exported.

  1. Type should be exported from index.ts: I need Type in order to post-process the Markdown AST. However, although Type is exported from markdown.ts, it's not accessible through the module root. My bundler can't seem to resolve import "@lezer/markdown/dist/markdown" so it's not easy for me to import it from the file directly.
  2. BlockResult should be exported: this type appears in a public interface (BlockParser.parse), so it would be nice to name it directly.
  3. InlineContext constructor should be exported: similar to the problems described in #6, I had issues dealing with relative vs. global indexes inside block parsers. My solution was to just construct an InlineContext for a given line, use absolute indexes everywhere, and then let the context methods handle the translation to relative indexes for me. But this strategy requires access to the InlineContext constructor, which isn't exported.

If you're ok with any of these, let me know and I will put up a pull request.
