commonmark / cmark

CommonMark parsing and rendering library and program in C

License: Other

CMake 2.23% C++ 1.43% C 84.81% Python 7.62% Makefile 0.93% Ruby 0.05% Racket 1.35% Batchfile 0.04% XSLT 1.53% Nix 0.02%

cmark's Introduction

CommonMark

CommonMark is a rationalized version of Markdown syntax, with a spec and BSD-licensed reference implementations in C and JavaScript.

Try it now!

For more details, see https://commonmark.org.

This repository contains the spec itself, along with tools for running tests against the spec, and for creating HTML and PDF versions of the spec.

The reference implementations live in separate repositories: cmark (C) and commonmark.js (JavaScript).

There is a list of third-party libraries in a dozen different languages here.

Running tests against the spec

The spec contains over 500 embedded examples which serve as conformance tests. To run the tests using an executable $PROG:

python3 test/spec_tests.py --program $PROG

If you want to extract the raw test data from the spec without actually running the tests, you can do:

python3 test/spec_tests.py --dump-tests

and you'll get all the tests in JSON format.

JavaScript developers may find it more convenient to use the commonmark-spec npm package, which is published from this repository. It exports an array, tests, of JSON objects with the following format:

{
  "markdown": "Foo\nBar\n---\n",
  "html": "<h2>Foo\nBar</h2>\n",
  "section": "Setext headings",
  "number": 65
}

The spec

The source of the spec is spec.txt. This is basically a Markdown file, with code examples written in a shorthand form:

```````````````````````````````` example
Markdown source
.
expected HTML output
````````````````````````````````

To build an HTML version of the spec, do make spec.html. To build a PDF version, do make spec.pdf. For both versions, you must have the lua rock lcmark installed: after installing lua and luarocks, run luarocks install lcmark. For the PDF you must also have xelatex installed.

The spec is written from the point of view of the human writer, not the computer reader. It is not an algorithm---an English translation of a computer program---but a declarative description of what counts as a block quote, a code block, and each of the other structural elements that can make up a Markdown document.

Because John Gruber's canonical syntax description leaves many aspects of the syntax undetermined, writing a precise spec requires making a large number of decisions, many of them somewhat arbitrary. In making them, we have appealed to existing conventions and considerations of simplicity, readability, expressive power, and consistency. We have tried to ensure that "normal" documents in the many incompatible existing implementations of Markdown will render, as far as possible, as their authors intended. And we have tried to make the rules for different elements work together harmoniously. In places where different decisions could have been made (for example, the rules governing list indentation), we have explained the rationale for our choices. In a few cases, we have departed slightly from the canonical syntax description, in ways that we think further the goals of Markdown as stated in that description.

For the most part, we have limited ourselves to the basic elements described in Gruber's canonical syntax description, eschewing extensions like footnotes and definition lists. It is important to get the core right before considering such things. However, we have included a visible syntax for line breaks and fenced code blocks.

Differences from original Markdown

There are only a few places where this spec says things that contradict the canonical syntax description:

  • It allows all punctuation symbols to be backslash-escaped, not just the symbols with special meanings in Markdown. We found that it was just too hard to remember which symbols could be escaped.

  • It introduces an alternative syntax for hard line breaks, a backslash at the end of the line, supplementing the two-spaces-at-the-end-of-line rule. This is motivated by persistent complaints about the “invisible” nature of the two-space rule.

  • Link syntax has been made a bit more predictable (in a backwards-compatible way). For example, Markdown.pl allows single quotes around a title in inline links, but not in reference links. This kind of difference is really hard for users to remember, so the spec allows single quotes in both contexts.

  • The rule for HTML blocks differs, though in most real cases it shouldn't make a difference. (See the section on HTML Blocks for details.) The spec's proposal makes it easy to include Markdown inside HTML block-level tags, if you want to, but also allows you to exclude this. It also makes parsing much easier, avoiding expensive backtracking.

  • It does not collapse adjacent bird-track blocks into a single blockquote:

    > these are two
    
    > blockquotes
    
    > this is a single
    >
    > blockquote with two paragraphs
    
  • Rules for content in lists differ in a few respects, though (as with HTML blocks), most lists in existing documents should render as intended. There is some discussion of the choice points and differences in the subsection of List Items entitled Motivation. We think that the spec's proposal does better than any existing implementation in rendering lists the way a human writer or reader would intuitively understand them. (We could give numerous examples of perfectly natural looking lists that nearly every existing implementation flubs up.)

  • Changing bullet characters, or changing from bullets to numbers or vice versa, starts a new list. We think that is almost always going to be the writer's intent.

  • The number that begins an ordered list item may be followed by either . or ). Changing the delimiter style starts a new list.

  • The start number of an ordered list is significant.

  • Fenced code blocks are supported, delimited by either backticks (```) or tildes (~~~).

Contributing

There is a forum for discussing CommonMark; you should use it instead of github issues for questions and possibly open-ended discussions. Use the github issue tracker only for simple, clear, actionable issues.

Authors

The spec was written by John MacFarlane, drawing on

  • his experience writing and maintaining Markdown implementations in several languages, including the first Markdown parser not based on regular expression substitutions (pandoc) and the first markdown parsers based on PEG grammars (peg-markdown, lunamark)
  • a detailed examination of the differences between existing Markdown implementations using BabelMark 2, and
  • extensive discussions with David Greenspan, Jeff Atwood, Vicent Marti, Neil Williams, and Benjamin Dumke-von der Ehe.

Since the first announcement, many people have contributed ideas. Kārlis Gaņģis was especially helpful in refining the rules for emphasis, strong emphasis, links, and images.

cmark's People

Contributors

ajanuary, asgh, btrask, cebe, cirosantilli, compnerd, data-man, dependabot[bot], elibarzilay, ericson2314, foonathan, jeroen, jgm, jordanmilne, kainjow, kapyshin, kivikakk, knagis, kyle-ye, mathieuduponchelle, nschonni, nwellnhof, pclouds, petere, philipturnbull, preaction, robinst, thiht, vmg, zmwangx


cmark's Issues

Memory leak in `cmark_consolidate_text_nodes`

AFAICT, the call it makes to cmark_node_set_literal copies the string from buf, but the original string is never freed. It's easy to verify the leak by running it in a loop and watching the memory (which is how I caught it); after adding a free for the detached string, the leak is gone.

[I'd submit this as a PR, but it seems silly to put the text in an allocated string, then copy it and free the original. A more proper solution would be some way to signal that the text should not be copied (which cmark_node_set_literal could take as a new argument and pass on to cmark_chunk_set_cstr), but that's an API change, so I'm just reporting it...]

Get rid of cmake dependency?

I think it'd be really nice to get rid of the cmake dependency (from a user's perspective, at least). Are there big implications from the maintainer's perspective? I'm willing to help out and do the work (but of course, only if we agree this is a good idea).

Latex output

Two questions about possible Latex output.

  1. Is this a planned feature?
  2. If one were to implement it, is there a chance it would be merged?

Thanks.

cmark_parser_new declaration versus use

I decided to try to build this library using Visual Studio 2013 CE for the first time. I do not know much about the internal design decisions ... I just thought it would be a good idea to try it out.

My build failed with:

[ 32%] Building C object src/CMakeFiles/cmark.dir/main.c.obj
main.c
C:\...\main.c(119) : error C2660: 'cmark_parser_new' : function does not take 1 arguments

Lo and behold, cmark.h has:

CMARK_EXPORT
cmark_parser *cmark_parser_new();

Knowing enough C, it is not shocking that some compilers may forgive this, but I am not sure which compiler let blocks.c compile.

C:\...> cl /?
Microsoft (R) C/C++ Optimizing Compiler Version 18.00.31101 for x64
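
For context, an empty parameter list in a C declaration leaves the arguments unchecked, which is why some compilers accept the one-argument call while MSVC rejects it. A minimal illustration of the distinction, using generic names rather than cmark's actual API at any particular version:

#include <cmark.h>

/* Empty parentheses: arguments are unspecified, so a one-argument call
   compiles on lenient compilers even if the definition takes none. */
cmark_parser *parser_new_unchecked();

/* Explicit (void): a call such as parser_new_checked(1) is a compile error. */
cmark_parser *parser_new_checked(void);

/* Explicit parameter: the compiler checks that exactly one int is passed. */
cmark_parser *parser_new_with_options(int options);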

man page output formats

The list of output formats in the man page lists ast as one of the options, but the actual option is xml.

Build script error on Windows

I couldn't build with .\nmake.bat on Windows with a probably proper build environment. I asked for help at http://talk.commonmark.org/t/help-building-cmark-on-windows/1881 and @jgm suggested that I report an issue instead.

I was able to build it successfully with the following steps:

  1. I made some edits to "Makefile.nmake".

    diff --git a/Makefile.nmake b/Makefile.nmake
    index b0556e2..e4124a6 100644
    --- a/Makefile.nmake
    +++ b/Makefile.nmake
    @@ -7,7 +7,7 @@ PROG=$(BUILDDIR)\src\cmark.exe
     GENERATOR=NMake Makefiles
    
     all: $(BUILDDIR)
    -   @cd $(BUILDDIR) && $(MAKE) /nologo && cd ..
    +   @cd $(BUILDDIR) && $(MAKE) /nologo /f ..\Makefile.nmake && cd ..
    
     $(BUILDDIR):
        @-mkdir $(BUILDDIR) 2> nul
    @@ -20,7 +20,7 @@ $(BUILDDIR):
        cd ..
    
     install: all
    -   @cd $(BUILDDIR) && $(MAKE) /nologo install && cd ..
    +   @cd $(BUILDDIR) && $(MAKE) /f ..\Makefile.nmake /nologo install && cd ..
    
     clean:
        -rmdir /s /q $(BUILDDIR) $(MINGW_INSTALLDIR) 2> nul
    @@ -29,7 +29,7 @@ $(SRCDIR)\case_fold_switch.inc: $(DATADIR)\CaseFolding-3.2.0.txt
        perl mkcasefold.pl < $? > $@
    
     test: $(SPEC) all
    -   @cd $(BUILDDIR) && $(MAKE) /nologo test ARGS="-V" && cd ..
    +   @cd $(BUILDDIR) && $(MAKE) /f ..\Makefile.nmake /nologo test ARGS="-V" && cd ..
    
     distclean: clean
        del /q src\scanners.c 2> nul
  2. I ran .\nmake.bat and got:

    ...
    cd build &&  cmake  -G "NMake Makefiles"  -D CMAKE_BUILD_TYPE=  -D CMAKE_INSTALL_PREFIX=windows  .. &&  cd ..
    -- Configuring done
    -- Generating done
    -- Build files have been written to: D:/Workspaces/cmark/build
    NMAKE : fatal error U1052: file '..\Makefile.nmake' not found
    Stop.
    NMAKE : fatal error U1077: 'cd' : return code '0x2'
    Stop.
    NMAKE : fatal error U1077: 'cd' : return code '0x2'
    Stop.
    
  3. I assumed that at least the build files had been generated successfully.
    I then reverted the change to "Makefile.nmake" and ran .\nmake.bat again. This time it succeeded.

So it seems the build script is either not robust or not correct.

HTML table-of-contents renderer

I'm working on replacing libmarkdown (Discount) with libcmark in one of my projects, but progress is currently blocked by the lack of HTML table-of-contents generation in libcmark. I initially thought I could just add a new cmark_render_html_toc() function to use alongside the existing cmark_render_html(). However, hyperlinking from the ToC items to the headings also requires a link anchor for each generated heading, which cmark_render_html() currently doesn't include.

Does it sound reasonable to implement both of these changes here? If so, any thoughts on the best way to do so? Should the link anchors always be included, or should they be configurable?
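
For what it's worth, the heading information needed for a ToC can already be collected from outside the library by walking the parsed tree with the iterator API; only the link anchors would still require changes to cmark_render_html(). A rough sketch, assuming a cmark version where the heading node type is CMARK_NODE_HEADING (older releases use CMARK_NODE_HEADER and cmark_node_get_header_level):

#include <stdio.h>
#include <cmark.h>

/* Print an indented, plain-text outline of the document's headings. */
static void print_toc(cmark_node *document) {
  cmark_iter *iter = cmark_iter_new(document);
  cmark_event_type ev;
  while ((ev = cmark_iter_next(iter)) != CMARK_EVENT_DONE) {
    cmark_node *node = cmark_iter_get_node(iter);
    if (ev == CMARK_EVENT_ENTER &&
        cmark_node_get_type(node) == CMARK_NODE_HEADING) {
      int level = cmark_node_get_heading_level(node);
      /* The first child is typically a text node holding the heading's literal text. */
      cmark_node *child = cmark_node_first_child(node);
      const char *text = child ? cmark_node_get_literal(child) : NULL;
      printf("%*s- %s\n", 2 * (level - 1), "", text ? text : "");
    }
  }
  cmark_iter_free(iter);
}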

compile fail, missing dependency

are you maybe forgetting to declare a dependency?

 cabal install cmark
Resolving dependencies...
Configuring cmark-0.3.2...
Building cmark-0.3.2...
Failed to install cmark-0.3.2
Build log ( /home/fommil/.cabal/logs/cmark-0.3.2.log ):
Configuring cmark-0.3.2...
Building cmark-0.3.2...
Preprocessing library cmark-0.3.2...
[1 of 1] Compiling CMark            ( dist/build/CMark.hs, dist/build/CMark.o )

CMark.hsc:61:11: Not in scope: `TF.withCStringLen'

CMark.hsc:72:5:
    Not in scope: `TF.peekCStringLen'
    Perhaps you meant `B.packCStringLen' (imported from Data.ByteString)

CMark.hsc:79:3: Not in scope: `TF.withCStringLen'

CMark.hsc:85:12:
    Not in scope: `TF.peekCStringLen'
    Perhaps you meant `B.packCStringLen' (imported from Data.ByteString)

CMark.hsc:324:22:
    Not in scope: `TF.peekCStringLen'
    Perhaps you meant `B.packCStringLen' (imported from Data.ByteString)
cabal: Error: some packages failed to install:
cmark-0.3.2 failed during the building phase. The exception was:
ExitFailure 1

Expose reference links

It seems the parser automatically looks up references and turns them into links. It would be nice to have a way to see the actual references and their definitions, so we can use this library to programmatically manipulate Markdown documents without having all the references collapsed.

example:

> ./cmark --to commonmark <<EOF
[foo] [bar]

[bar]: /url "title"
EOF
[foo](/url "title")

Static library?

Cmark doesn't appear to build a static library by default. Am I missing an easy way to do this? If not, could it be added?

Thanks. Great project!

Warning on Solaris 11

When building with SolarisStudio 12.3

root@solaris11:~/Desktop/cmark/build# make
Scanning dependencies of target cmark
[  1%] Building C object src/CMakeFiles/cmark.dir/cmark.c.o
"/root/Desktop/cmark/src/buffer.h", line 96: warning:  attribute "format" is unknown, ignored
[  3%] Building C object src/CMakeFiles/cmark.dir/node.c.o
"/root/Desktop/cmark/src/buffer.h", line 96: warning:  attribute "format" is unknown, ignored
[  5%] Building C object src/CMakeFiles/cmark.dir/iterator.c.o
"/root/Desktop/cmark/src/buffer.h", line 96: warning:  attribute "format" is unknown, ignored
[  7%] Building C object src/CMakeFiles/cmark.dir/blocks.c.o
"/root/Desktop/cmark/src/buffer.h", line 96: warning:  attribute "format" is unknown, ignored
...

smart_punct.txt

We have already discussed this for commonmark.js, but I spotted a test for smart punctuation here as well, and I have the same question: why should a Markdown transformer be responsible for doing anything with typography?

freeing cmark_render_* buffers

I am writing bindings for C# and I saw that there are four functions which allocate a buffer and do some rendering:

/** Render a 'node' tree as XML.
 */
CMARK_EXPORT
char *cmark_render_xml(cmark_node *root, int options);

/** Render a 'node' tree as an HTML fragment.  It is up to the user
 * to add an appropriate header and footer.
 */
CMARK_EXPORT
char *cmark_render_html(cmark_node *root, int options);

/** Render a 'node' tree as a groff man page, without the header.
 */
CMARK_EXPORT
char *cmark_render_man(cmark_node *root, int options);

/** Render a 'node' tree as a commonmark document.
 */
CMARK_EXPORT
char *cmark_render_commonmark(cmark_node *root, int options, int width);

However, I have no access to free from C# (well, one can code it, but it is considered bad practice).

Maybe we could expose the free function (cmark_render_free?) so that a user of FFI bindings could just call that?
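
From plain C, the returned buffers are simply released with free(); a minimal sketch of the ownership rules (the cmark_render_free function mentioned above is only a proposal and does not exist):

#include <stdlib.h>
#include <string.h>
#include <cmark.h>

int main(void) {
  const char *md = "Hello *world*\n";
  cmark_node *doc = cmark_parse_document(md, strlen(md), CMARK_OPT_DEFAULT);

  char *html = cmark_render_html(doc, CMARK_OPT_DEFAULT);
  /* ... use html ... */
  free(html);           /* the caller owns the returned buffer */

  cmark_node_free(doc); /* the node tree is freed separately */
  return 0;
}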

Hang with links and emphasis

**x [a*b**c*](d)

found by afl. Oddly, this doesn't affect commonmark.js, so it's likely a programming error rather than a flaw in the parsing algorithm.

Failed tests on CPython 3.5.0

Some of the tests are not compatible with CPython 3.5.0. With the latest release, 0.22.0 (and probably also master, since there's only one unrelated commit since the latest release), four out of the six make test tests fail:

      2 - html_normalization (Failed)
      3 - spectest_library (Failed)
      5 - spectest_executable (Failed)
      6 - smartpuncttest_executable (Failed)

A sample test log is here: https://gist.github.com/a13cfb20c900e16a8bf0.

Most of the failures are due to the removed HTMLParseError in Python 3.5, and that one is easy to fix. See my proposed changes in #84 . After applying my patch, tests 3, 5, and 6 all pass, but there's still a non-trivial failure in html_normalization's doctest (log is here: https://gist.github.com/12b71e2ef22ef4b9d635):

Command: "/usr/local/bin/python3" "-m" "doctest" "/tmp/commonmark20150920-37576-vpuo15/cmark-0.22.0/test/normalize.py"
Directory: /tmp/commonmark20150920-37576-vpuo15/cmark-0.22.0/build/testdir
"html_normalization" start time: Sep 20 16:36 PDT
Output:
----------------------------------------------------------
**********************************************************************
File "/tmp/commonmark20150920-37576-vpuo15/cmark-0.22.0/test/normalize.py", line 170, in normalize.normalize_html
Failed example:
    normalize_html("&forall;&amp;&gt;&lt;&quot;")
Expected:
    '\u2200&amp;&gt;&lt;&quot;'
Got:
    '∀&><"'
**********************************************************************
1 items had failures:
   1 of  10 in normalize.normalize_html
***Test Failed*** 1 failures.

Looks like the HTML entities have somehow been decoded. I think it is related to the Python 3.5 change to HTMLParser: the convert_charrefs argument to __init__, introduced in Python 3.4, now defaults to True. But I've tried passing convert_charrefs=False to __init__; that doesn't immediately fix the problem. I haven't got time to read through normalize.py, so perhaps someone else will have to look at what's wrong with it.


P.S. Quoting from https://docs.python.org/3.5/whatsnew/3.5.html:

The deprecated “strict” mode and argument of HTMLParser, HTMLParser.error(), and the HTMLParserError exception have been removed. (Contributed by Ezio Melotti in issue 15114.) The convert_charrefs argument of HTMLParser is now True by default. (Contributed by Berker Peksag in issue 21047.)

Building cmark for iOS

I followed the steps to create libcmark.0.18.1.dylib, but when it is added to the project I get:

building for iOS Simulator, but linking against dylib built for MacOSX file

How can I build the dylib for iOS?

UBSAN note

cmark/chunk.h:69:3: runtime error: null pointer passed as argument 2, which is declared to never be null

FFI bindings

Hey guys, I saw that the version string and number are exported as external variables.

Some programming languages' FFI can only call functions; external variables are not accessible even when they are exported, for example in C#.

Could we get additional functions to retrieve the version string and the integer version number?
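
Accessor functions along these lines would be enough for FFI consumers. This is a sketch only, with illustrative names; check cmark.h for what your version actually exports (recent releases define CMARK_VERSION and CMARK_VERSION_STRING):

#include <cmark.h>

/* Hypothetical wrappers an FFI binding could call instead of reading
   exported variables or macros directly; names are illustrative only. */
int cmark_version_number(void) { return CMARK_VERSION; }
const char *cmark_version_text(void) { return CMARK_VERSION_STRING; }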

Escaping and --smart

The recently added --smart option operates at the level of the renderer, hence it can't distinguish a " character that appeared in the CommonMark source as \" from one that appeared as " or &quot;. Hence there is no way to indicate in the source that a certain " character should not be converted into a curly quote. This is bad, since there are arguably some legitimate uses for straight quotes (e.g. He is 6'2").

[Side note: Technically, one should use slightly slanted "prime" characters for things like 6'2". However, as Butterick's Practical Typography points out, although this is correct, fonts often don't have these characters, so this is probably not a good all-purpose solution.]

Interestingly, most "smart" converters operate at the level of the renderer, not the parser (e.g. php-smartypants). However, they are able to handle escaping because their Markdown parsers don't let ' and " be escaped.

The only solution I see (unless we change CommonMark's escaping rules, which I think would be the tail wagging the dog) would be to move "smart punctuation" processing into the parser, as is done in pandoc. The parser can see the escapes and could treat escaped straight quotes as literal straight quotes.

[EDIT:] Another advantage of putting smart quote parsing in the parser is accuracy. We could put these in the delimiter stack, which would allow us to interpret things as left quotes only when we find a right match; we could also ensure that matched quotes don't cross link and emph boundaries. Consider this:

'tis the season to be 'jolly'.

The ' in front of 'tis should not be a left single quote. Some other cases where we might not want smartification:

*'hello*'
['link](url)'

Compilation Fails in Xcode

In Xcode, I'm getting a compile error:

cmark/src/buffer.c:187:10: Implicit declaration of function '_vscprintf' is invalid in C99

Is line 184 of buffer.c supposed to be:

#ifdef HAVE_C99_SNPRINTF instead of #ifndef HAVE_C99_SNPRINTF?

CRLF regression

I just updated the version of cmark I am using and found that CRLF support was broken in bd3b245. CR and LF endings independently work fine, but the logic that treated CRLF as one line break was over-simplified. Now CRLF gets handled as two line breaks, one CR and one LF.

The main example of this can be seen at bd3b245#diff-f4eff1b97e63a76ec31f3bdea11f10deL568 blocks.c line 568 where the logic that would eat up to two characters (one CR and one LF) was replaced by a single check that would only eat one character (either CR or LF but not both).

I think this is an argument for running the full test suite with all three line ending modes.

Thanks.

Edit: forgot to provide a test case.

line
line

Converted with cmark --hardbreaks crlf.md. Output is:

<p>line<br />
<br />
line</p>

I don't believe this problem is specific to the hardbreaks option, that's just where I happened to notice it first.

PDB of cmark.exe is overwritten by DLL's PDB on Windows

If you build a debug version on Windows (nmake BUILD_TYPE=Debug), the PDB file for cmark.exe (cmark.pdb) is overwritten by the PDB file of cmark.dll with the same name. This makes debugging a bit painful. For now, I'm resorting to temporarily renaming the DLL in CMakeLists.txt. It would be nice if we could somehow build the executable and library in separate directories.

Clarify parser and render options in docs

An interesting issue came up on the Ruby wrapper for libcmark.

It looks like, in order to render a string into a document, and that document into HTML, you must pass the rendering options twice, like this:

doc = CommonMarker.render_doc(markdown, [:smart, :hardbreaks])
doc.to_html([:smart, :hardbreaks])

As far as I can see, there's no way for render_doc to store the render options on the cmark_node document itself. And there's no way for that same node to call to_html—it must be handled by a different cmark_render_html method.

Is that correct? And if so, is that design intentional? I could piece something together for the Ruby wrapper to work around this, but I thought it might make sense to apply the change at the C level.
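
For reference, a minimal sketch of the two-step pattern in the underlying C API, which the Ruby wrapper mirrors; the options are passed once at parse time and again at render time, and the document node does not store them:

#include <stdlib.h>
#include <string.h>
#include <cmark.h>

int main(void) {
  const char *md = "some \"smart\" text\n";
  int options = CMARK_OPT_SMART | CMARK_OPT_HARDBREAKS;

  /* Options are passed at parse time... */
  cmark_node *doc = cmark_parse_document(md, strlen(md), options);
  /* ...and again at render time; the node does not remember them. */
  char *html = cmark_render_html(doc, options);

  /* ... use html ... */
  free(html);
  cmark_node_free(doc);
  return 0;
}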

Heap buffer overflow

Build cmark with -fsanitize=address, then

$ echo -e "# 000000[0\x00\x0000000000000000000\x000\x00000000000](p0000\\" | ./cmark

Output:

==7307== ERROR: AddressSanitizer: heap-buffer-overflow on address 0x600c0000bf38 at pc 0x4cb8dd bp 0x7ffc3308d950 sp 0x7ffc3308d948
READ of size 1 at 0x600c0000bf38 thread T0
    #0 0x4cb8dc (cmark/build/src/cmark+0x4cb8dc)
    #1 0x4537eb (cmark/build/src/cmark+0x4537eb)
    #2 0x446079 (cmark/build/src/cmark+0x446079)
    #3 0x42e25c (cmark/build/src/cmark+0x42e25c)
    #4 0x403057 (cmark/build/src/cmark+0x403057)
    #5 0x7f515758eaa4 (/lib64/libc-2.20.so+0x21aa4)
    #6 0x404369 (cmark/build/src/cmark+0x404369)
0x600c0000bf38 is located 0 bytes to the right of 56-byte region [0x600c0000bf00,0x600c0000bf38)
allocated by thread T0 here:
    #0 0x7f515791a5df (/usr/lib64/gcc/x86_64-pc-linux-gnu/4.8.4/libasan.so.0.0.0+0x155df)
    #1 0x5514bd (cmark/build/src/cmark+0x5514bd)
Shadow bytes around the buggy address:
  0x0c01ffff9790: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c01ffff97a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c01ffff97b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c01ffff97c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c01ffff97d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c01ffff97e0: 00 00 00 00 00 00 00[fa]fa fa fa fa fd fd fd fd
  0x0c01ffff97f0: fd fd fd fa fa fa fa fa 00 00 00 00 00 00 00 fa
  0x0c01ffff9800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c01ffff9810: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c01ffff9820: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c01ffff9830: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:     fa
  Heap righ redzone:     fb
  Freed Heap region:     fd
  Stack left redzone:    f1
  Stack mid redzone:     f2
  Stack right redzone:   f3
  Stack partial redzone: f4
  Stack after return:    f5
  Stack use after scope: f8
  Global redzone:        f9
  Global init order:     f6
  Poisoned by user:      f7
  ASan internal:         fe
==7307== ABORTING

afl

cmark_markdown_to_html returns something wrong on the 2nd call

I'm calling cmark_markdown_to_html() twice in a row, and it seems to return something wrong the second time (see test.c and the Makefile). The program segfaults on line 23.

Backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff786cc4a in strlen () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff786cc4a in strlen () from /lib64/libc.so.6
#1  0x0000000000400a95 in main (argc=1, argv=0x7fffffffdd98) at test.c:23
(gdb) 

gcc warnings:

$ make clean all
rm -f test
gcc -g -ggdb -o test test.c -I. -L. -lcmark
test.c: In function 'main':
test.c:17:9: warning: assignment makes pointer from integer without a cast [enabled by default]
  result = cmark_markdown_to_html(text, sz);
         ^
test.c:21:9: warning: assignment makes pointer from integer without a cast [enabled by default]
  result = cmark_markdown_to_html(text, sz);
         ^
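
Those warnings suggest cmark_markdown_to_html is not declared at the call site, so the compiler assumes it returns int, which truncates the pointer on 64-bit and could explain the crash. A minimal sketch of the double call with the header included (recent cmark versions take an options argument; older ones take only the text and its length):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cmark.h>

int main(void) {
  const char *text = "Hello *world*\n";

  char *first = cmark_markdown_to_html(text, strlen(text), CMARK_OPT_DEFAULT);
  printf("%s", first);
  free(first);

  char *second = cmark_markdown_to_html(text, strlen(text), CMARK_OPT_DEFAULT);
  printf("%s", second);
  free(second);

  return 0;
}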

Odd list case

- hi                                                                        

  * * * *                                                                   

  there                                                                     

seems to be handled wrong by cmark: "there" turns into a paragraph outside the list item. This violates the principle of uniformity.

make mingw fails with undefined reference to _strnlen

  • Ubuntu 14.04
  • sudo apt-cache show mingw32 : 4.2.1.dfsg-2ubuntu1
...
Linking C executable cmark.exe
CMakeFiles/cmark.dir/objects.a(houdini_html_u.c.obj):houdini_html_u.c:(.text+0x38f): undefined reference to `_strnlen'
collect2: ld returned 1 exit status
...

ABI stability policy

The addition of the options parameter to cmark_markdown_to_html in version 0.18.2 breaks the API and ABI. Since the soname of libcmark is currently based on major and minor version (0.18), this can cause problems. If someone updates the libcmark binary from 0.18.1 to 0.18.2, any code compiled against the old version will call cmark_markdown_to_html without an options argument. There won't be a linker error because the soname matches.

We should make sure to always bump the library's soname if the ABI changes. I can see the following solutions:

  1. Derive soname from major and minor version like we do now. Only allow ABI changes if we release a new minor version.
  2. Derive soname from major, minor and patchlevel. This means all dependent binaries must be recompiled whenever a new libcmark version is installed, even if it is ABI-compatible.
  3. Manage soname manually. Check if there's an incompatible ABI change when cutting a release. If yes, bump soname.

The second approach puts the least burden on the release manager and, at least for now, shouldn't be too annoying for users.

cmark_markdown_to_html len parameter should be size_t

The cmark_markdown_to_html function currently takes an int for the len parameter. Wouldn't it make more sense to use a size_t? I'm currently having to do something like:

#include <assert.h>
#include <limits.h>
#include <cmark.h>

char *example(const char *input, size_t length) {
    assert(length < INT_MAX); // avoid int overflow
    return cmark_markdown_to_html(input, (int)length, CMARK_OPT_DEFAULT);
}

...which seems a little contorted.

segfault with --normalize

% build/src/cmark --normalize -t xml
hi[there
^D
zsh: segmentation fault (core dumped)  build/src/cmark --normalize -t xml

Extract common rendering code

There's some code duplication in commonmark.c and latex.c. The render_state and most of the functions handling output and line breaking could be factored out. This would also allow us to use them in the man and HTML renderers.

CRLF regression with fenced code

Thank you for making quick work of #68, but I'm afraid that's not the end of it.

Example input:

```
test
```

Output:

<pre><code>test
```</code></pre>

This happens with or without the --hardbreaks option.

More line ending issues

Modifying the test suite to use CR line endings instead of LF, we get two failures. (With CRLF endings, we only get the second of these.)

Example 56 (lines 955-970) Setext headers
Foo
Bar

---

Foo
Bar
===

--- expected HTML
+++ actual HTML
@@ -1,6 +1,4 @@
-<p>Foo
-Bar</p>
-<hr />
-<p>Foo
-Bar
-===</p>
+<h2>Foo
+Bar</h2>
+<h1>Foo
+Bar</h1>

Example 493 (lines 6924-6931) Links
[Foo
  bar]: /url

[Baz][Foo bar]

--- expected HTML
+++ actual HTML
@@ -1 +1 @@
-<p><a href="/url">Baz</a></p>
+<p>[Baz][Foo bar]</p>

596 passed, 2 failed, 0 errored, 0 skipped

NULL pointer dereference in safe mode

$ echo '[]()' | ./cmark --safe
Segmentation fault (core dumped)
$ gdb --args ./cmark --safe input.md
(gdb) r
Program received signal SIGSEGV, Segmentation fault.
0x0000000000407a38 in _scan_at (scanner=0x427ef6 <_scan_dangerous_url>, c=0x64f448, offset=0)
    at cmark/src/scanners.c:10
10        unsigned char lim = ptr[c->len];
(gdb) p ptr
$1 = (unsigned char *) 0x0
(gdb) bt
#0  0x0000000000407a38 in _scan_at (scanner=0x427ef6 <_scan_dangerous_url>, c=0x64f448, offset=0)
    at cmark/src/scanners.c:10
#1  0x0000000000434baf in S_render_node (node=0x64f3f0, ev_type=CMARK_EVENT_ENTER, state=0x7fffffffcaa0, options=32)
    at cmark/src/html.c:256
#2  0x0000000000434dfc in cmark_render_html (root=0x64f090, options=32)
    at cmark/src/html.c:309
#3  0x0000000000437ee4 in print_document (document=0x64f090, writer=FORMAT_HTML, options=32, width=0)
    at cmark/src/main.c:44
#4  0x0000000000438812 in main (argc=3, argv=0x7fffffffdc68) at cmark/src/main.c:184

afl

Rename the CommonMark DTD `html` element ***pleeeeease***!

I have done so in my clone of cmark, and the parser now uses block_html as the name for this kind of node, in analogy to the existing inline_html generic identifier.

Not that I like the new name better than the old (IMO, they are too specific anyway), but it solves an obvious problem: every HTML and XHTML document on the planet (and in orbit on the ISS ;-) has an element of this name; and therefore you can't bring "native" CommonMark elements even into close proximity to regular HTML or XHTML documents (or content, or parse trees, ASTs, whatever you call it, for that matter).

Not to mention that this use of <html> is pretty confusing for humans too, for example for everyone who has ever clicked the "show document source" button in their browser and taken a look.

I assume that the CommonMark DTD is not set in stone (yet), or are there important consumers of the cmark -t xml output (other than regression tests)? While there are other things I'd like to see changed in it, this one is---at least for me---absolutely crucial.

[Using the cmark -t xml output for verification and testing purposes has its own problems; but that's a topic for another conversation I'd be glad to have.]

Best regards
tin-pot

More sourcepos?

It would be nice if there is more source position in the output, to the point where it is possible to track the source of every bit of text.

It seems to me that currently this tool is overall quite limited (e.g., it doesn't come with a huge number of extensions and command-line arguments to switch them on and off). This looks like a good design for something that is supposed to serve as a basis for systems that will extend it. In my case, I played with the idea of embedding it within a different markup system, which would give the best of both (the convenience of cmark, plus the flexibility of another markup language when needed). In any case, one way to get something like that going without implementing yet another Markdown (or CommonMark) parser is to use an existing tool like cmark. But unfortunately this requires knowing exactly where each bit of text came from, not just the current per-line positions. This is because my target system is a proper language, so source tracking is important for syntax errors etc.

Ideally, this could be done for --smart replacement text too...

CRLF support

Thanks for your quick work on issue #11. I've continued integrating Cmark with my project.

I've found that when parsing files that use CRLF line endings with Cmark, block quotes extend into subsequent paragraphs, text after lists stays indented, and there are problems with fenced code.

I think the biggest problem is that the test cases don't seem to exercise alternate line endings at all. I'd suggest that all of the test cases should be automatically converted to each type of line ending (LF, CR, and CRLF) during make test and Cmark should be expected to pass all of them.

Perhaps there should also be some test cases for a single file brokenly mixing different line endings. This should be possible to handle reasonably intelligently.

I got CRLF support mostly working with some quick patches to blocks.c, which I can post if it would help. I didn't try digging into the tests.

Use-after-free in blocks.c:224

Found with American Fuzzy Lop + ASAN. Seems to be an issue with any block-level thing with four spaces following a reference.

For repro, link cmark against ASAN, compile with -fsanitize=address and pass in:

[foo]: /bar
*     anything so long as it's preceded by four spaces, a quote works too.

Looks like the node gets freed at https://github.com/jgm/cmark/blob/17e4a8203dc24ecee990ba3e8880092a1864e12e/src/blocks.c#L224 but is still referenced by parser->current when S_process_line() gets called again?

ASAN trace:

==9298== ERROR: AddressSanitizer: heap-use-after-free on address 0x60180000bd80 at pc 0x40c881 bp 0x7fffffffc690 sp 0x7fffffffc688
READ of size 4 at 0x60180000bd80 thread T0
    #0 0x40c880 in S_process_line /home/user/source/cmark-working/src/blocks.c:662
    #1 0x40dbd9 in S_parser_feed /home/user/source/cmark-working/src/blocks.c:490
    #2 0x40dbd9 in cmark_parser_feed /home/user/source/cmark-working/src/blocks.c:460
    #3 0x401bb7 in main /home/user/source/cmark-working/src/main.c:130
    #4 0x7ffff4aa6ec4 in __libc_start_main /build/buildd/eglibc-2.19/csu/libc-start.c:287
    #5 0x402399 in _start (/home/user/source/cmark-working/build/src/cmark+0x402399)
0x60180000bd80 is located 64 bytes inside of 128-byte region [0x60180000bd40,0x60180000bdc0)
freed by thread T0 here:
    #0 0x7ffff4e6033a in __interceptor_free (/usr/lib/x86_64-linux-gnu/libasan.so.0+0x1533a)
    #1 0x402a5f in S_free_nodes /home/user/source/cmark-working/src/node.c:141
    #2 0x402a5f in cmark_node_free /home/user/source/cmark-working/src/node.c:151
previously allocated by thread T0 here:
    #0 0x7ffff4e604e5 in calloc (/usr/lib/x86_64-linux-gnu/libasan.so.0+0x154e5)
    #1 0x40b1a9 in make_block /home/user/source/cmark-working/src/blocks.c:33
    #2 0x40b1a9 in add_child /home/user/source/cmark-working/src/blocks.c:303
    #3 0x40b1a9 in S_process_line /home/user/source/cmark-working/src/blocks.c:847
SUMMARY: AddressSanitizer: heap-use-after-free /home/user/source/cmark-working/src/blocks.c:561 S_process_line
Shadow bytes around the buggy address:
[...]
==9298== ABORTING

UTF8 on windows can lead to Segmentation fault

My R bindings are causing crashes on Windows for certain files containing non-ASCII UTF-8 characters when rendering to XML. It is very difficult to narrow this down because it happens quite randomly and only on Windows. R uses mingw-w64 with gcc-4.6.3.

One input file that consistently causes the crash is:

https://raw.githubusercontent.com/yihui/knitr/1df6eee4ac9387a881db60316c9b334fe21d5133/NEWS.md

The strange thing is that there is no particular line that causes the problem. Here it chokes on line 888 but modifying a random line elsewhere in the document can sometimes also fix the problem. Moreover I noticed that enabling CMARK_OPT_NORMALIZE will prevent the problem from appearing as well, at least for this particular file.

Multiple issues with numeric entities

Single digit decimal entities are sometimes recognized, sometimes not. I believe the issue here is the size > 3 test at https://github.com/jgm/cmark/blob/master/src/houdini_html_u.c#L15. When, for example, &#9; appears at the end of a line, size == 3 and the test fails.

Handling of &#0; fails to recognize it as an entity. This seems to be out of compliance with the current state of the spec, which asks for all 1-8 digit sequences to be recognized. For this issue, perhaps the spec should be changed, and a separate issue commonmark/commonmark-spec#323 opened about handling of NULL.

Invalid Unicode characters are passed through to the final render, without replacement. For example, &#xd800; is rendered as b'<p>\xed\xa0\x80</p>\n'. These should be replaced with U+FFFD at parse time.

Entities with more than 8 digits are interpreted as numeric entities. According to the spec, they should be treated as literal text.

Currently, during parsing of entities, the int codepoint is subject to integer overflow, which is undefined behavior in C (yes, I know this is insane, but when you lie down with C, you get up with UB). A sufficiently smart compiler could optimize away the if (cp < codepoint) test because negative values are impossible. This issue would be mitigated somewhat by using a maximum of 8 digits, but &#x80000000 would still provoke it. My recommendation is to use uint32_t and bail when the number of digits exceeds 8.
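
One possible shape for the recommended fix (a sketch only, not cmark's actual houdini code): accumulate into an unsigned 32-bit value, bail out after 8 digits, and replace invalid codepoints with U+FFFD.

#include <stdint.h>
#include <stddef.h>

/* Returns the number of digits consumed, or 0 if the run is not acceptable
   as a numeric entity (no digits, or more than 8). *out gets the codepoint. */
static size_t parse_decimal_entity(const unsigned char *p, size_t size,
                                   uint32_t *out) {
  uint32_t cp = 0;
  size_t i = 0;
  while (i < size && p[i] >= '0' && p[i] <= '9') {
    if (i >= 8)
      return 0;                  /* more than 8 digits: treat as literal text */
    cp = cp * 10 + (uint32_t)(p[i] - '0');
    i++;
  }
  if (i == 0)
    return 0;
  if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    cp = 0xFFFD;                 /* invalid codepoints are replaced with U+FFFD */
  *out = cp;
  return i;
}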

Discern `html` and `xhtml` output

The html output option currently produces XML-style empty elements (like <br />) - these are not valid in HTML, but required in XHTML. Implementing a separate xhtml output option is trivial: duplicate and modify the html.c source file into xhtml.c (see tin-pot/cmark@06ee949).
