thephpleague / html-to-markdown


1.7K 45.0 204.0 419 KB

Convert HTML to Markdown with PHP

License: MIT License

PHP 99.92% HTML 0.08%

php markdown commonmark html converter phpleague hacktoberfest

html-to-markdown's Issues

Bold in italic isn't converted properly

Input:

<p>Het zoeken naar <em>een spel<strong>t</strong></em> (zaadje) in een hooiberg is moeilijk.

Expected:

Het zoeken naar _een spel**t**_ (zaadje) in een hooiberg is moeilijk.

Actual:

Het zoeken naar *een spel**t***(zaadje) in een hooiberg (zaadje) in een hooiberg is moeilijk.

With the latter getting converted to this html:

<p>Het zoeken naar <em>een spel</em><em>t</em>**(zaadje) in een hooiberg (zaadje) in een hooiberg is moeilijk.</p>

Non-autolinks should get escaped

Input:

<p>&lt;league/commonmark&gt;, &lt;github.com/thephpleague&gt;, &lt;league@commonmark&gt;, &lt;https://github.com/thephpleague&gt;</p>

Actual:

<league/commonmark>, <github.com/thephpleague>,
 <league@commonmark>, <https://github.com/thephpleague>

Expected:

<league/commonmark>, <github.com/thephpleague>,
 \<league@commonmark>, \<https://github.com/thephpleague>

Content after << is stripped in pre tags

From this report in the Ghost WordPress plugin support forum:

<pre>touch ~/.profile cat >> ~/.profile <<EOF export PATH="$(brew --prefix homebrew/php/php56)/bin:$PATH" EOF </pre>

becomes:

touch ~/.profile cat >> ~/.profile

And:

<pre>args << "--with-z=/usr/local/Cellar/zlib/1.2.8"</pre>

becomes:

args

Preserve links and other tags with class attributes

HTML To Markdown should probably not convert tags with class attributes to Markdown by default, so that the conversion is not destructive.

The option to convert all tags regardless (stripping class attributes) should be available, but probably not on by default.

This will likely help WordPress plugins such as Ghost to convert HTML to Markdown more faithfully without stripping essential class names.

This HTML:

<span class="span-class">Label: </span><a class="link-class" href="http://example.com">Link</a>

...should probably result in this (identical code):

<span class="span-class">Label: </span><a class="link-class" href="http://example.com">Link</a>

...instead of this:

<span class="span-class">Label: </span>[Link](http://example.com)

For an anchor, when href equals innerHTML, use simpler link syntax

Given the HTML:

<a href="https://github.com/">https://github.com/</a>

Should probably be converted to:

<https://github.com/>

Instead of the current version:

[https://github.com/](https://github.com/)

My use case is converting HTML to text for plain text emails.

Reading the URL twice is a bit jarring. The two Markdown versions above are equivalent, but the shorter syntax is easier for a human to read.
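A minimal sketch of the proposed behaviour, assuming a standalone helper (renderLink is a hypothetical name, not part of the library): the shorthand is used only when the link text is identical to the href and the href looks like an absolute URI with a scheme and no spaces or angle brackets.

```php
<?php
// Hypothetical helper for illustration; not the library's actual code.
function renderLink(string $href, string $text): string
{
    // The <...> shorthand needs an absolute URI (scheme of 2 to 32 characters)
    // without whitespace or angle brackets, and the text must equal the href.
    $isAutolinkable = $href === $text
        && preg_match('~^[a-z][a-z0-9+.-]{1,31}:[^\s<>]*$~i', $href) === 1;

    return $isAutolinkable
        ? '<' . $href . '>'
        : '[' . $text . '](' . $href . ')';
}

echo renderLink('https://github.com/', 'https://github.com/'), "\n"; // <https://github.com/>
echo renderLink('https://github.com/', 'GitHub'), "\n";              // [GitHub](https://github.com/)
```

Anything that fails the check falls back to the regular [text](url) form, so the change would only affect links whose text already is the URL.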

Adding new converters

As of now, the list of converters supported by the environment is hardcoded and cannot be extended. If one wants to add new converters (say to support HTML files with a specific template), it would require extending HtmlConverter (to use a custom Environment class) and Environment (to add new converters).

Thus, it appears it would be useful to have the following:

Environment available through $htmlConverter->getEnvironment()
$environment->addConverter() public

This would allow someone to do the following:

$converter = new HtmlConverter();
$converter->getEnvironment()->addConverter($myConverter);

$converter->convert($content);
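To make the request concrete, here is a rough, self-contained sketch of what an extensible registry could look like. Every name in it (SketchConverter, SketchEnvironment, MarkTagConverter) is an assumption made up for illustration and does not describe the library's real classes.

```php
<?php
// Illustrative only: hypothetical names, not the library's actual API.
interface SketchConverter
{
    /** @return string[] Tag names this converter handles. */
    public function getSupportedTags(): array;

    public function convert(string $tag, string $innerText): string;
}

final class SketchEnvironment
{
    /** @var array<string, SketchConverter> */
    private array $converters = [];

    public function addConverter(SketchConverter $converter): void
    {
        foreach ($converter->getSupportedTags() as $tag) {
            // A later registration overrides an earlier one for the same tag.
            $this->converters[$tag] = $converter;
        }
    }

    public function getConverterByTag(string $tag): ?SketchConverter
    {
        return $this->converters[$tag] ?? null;
    }
}

// Example of a custom converter a user might register.
final class MarkTagConverter implements SketchConverter
{
    public function getSupportedTags(): array
    {
        return ['mark'];
    }

    public function convert(string $tag, string $innerText): string
    {
        return '==' . $innerText . '==';
    }
}

$environment = new SketchEnvironment();
$environment->addConverter(new MarkTagConverter());
```

Keying the registry by tag name keeps lookup trivial, at the cost of allowing only one converter per tag.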

Code blocks not being parsed correctly

I am having an issue converting code blocks that have been converted to <code><pre>...</pre></code> blocks into their appropriate triple-backtick Markdown syntax. Is this a known issue or a mistake on my part?

Code with specified language should render as fenced codeblocks

Input:

<pre><code class="language-ruby">def foo(x)
  return 3
end
</code></pre>

Actual:

    <code class="language-ruby">def foo(x)
      return 3
    end

Expected:

```ruby
def foo(x)
  return 3
end
``` # comment after end fence because GitHub is weird

Also, care should be taken that fences are longer than any candidate-fences in the block
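One possible way to pick a safe fence, sketched under the assumption that the code is available as a plain string (fencedCodeBlock is a hypothetical helper): count the longest run of backticks in the content and make the fence one longer, with a minimum of three. Counting runs anywhere in the text is stricter than necessary, but it is simple and never produces a fence that is too short.

```php
<?php
// Hypothetical helper for illustration; not the library's actual code.
function fencedCodeBlock(string $code, string $language = ''): string
{
    // Find the longest run of backticks anywhere in the code.
    preg_match_all('/`+/', $code, $matches);
    $longest = 0;
    foreach ($matches[0] as $run) {
        $longest = max($longest, strlen($run));
    }

    // At least three backticks, and always longer than any run inside the block.
    $fence = str_repeat('`', max(3, $longest + 1));

    return $fence . $language . "\n" . rtrim($code, "\n") . "\n" . $fence;
}
```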

Namespace for test(s)

Final rendering may strip content

Since the converter uses DOMDocument internally, output needs to be sanitised. This happens in HtmlConverter::sanitize. Unfortunately, this step may also strip content, as shown in the following example.

Input:

<pre><code>...
&lt;script type = "text/javascript"&gt;
function startTimer() {
   var tim = window.setTimeout("hideMessage()", 5000)
}
</head>
<body>
...</pre></code>

Actual

    ...
    <script type = "text/javascript">
    function startTimer() {
       var tim = window.setTimeout("hideMessage()", 5000)
    }


    ...

Expected

    ...
    <script type = "text/javascript">
    function startTimer() {
       var tim = window.setTimeout("hideMessage()", 5000)
    }
    </head>
    </body>
    ...

CommonMark support

Would be interesting to see this be in line with CommonMark, as it's a fairly solid new standard by a bunch of smart people. I've had a few issues with Markdown differing between Jekyll + Kramdown, Jekyll + Redcarpet, Leanpub, etc, one of which being lists in blockquotes and using CommonMark seems to fix them. Having output come out as a standard would be awesome I think.

Potentially driver based. Call it "Markdown" which is no-frills Gruber-compliant, and CommonMark which outputs the new standard.

Tests should match this.

Tag-like items should be escaped

Input:

<p>You forgot the &lt;!--more--&gt; tag!</p>

Actual:

You forgot the <!--more--> tag!

Expected:

You forgot the \<!--more--> tag!

Angle brackets in inner nested tags are converted to HTML entities

As reported in #5, inner tags incorrectly become HTML entities:

<span><span>Test</span></span> => <span>&lt;span&gt;Test&lt;/span&gt;</span>

<div><div>Test</div></div> => <div>&lt;div&gt;Test&lt;/div&gt;</div>

&nbsp; before/after <a> tags are lost

If you have, for some reason

    blah blah blah&nbsp;<a href="http://www.google.com">google</a>&nbsp;blah blah blah

the resultant output is in the format:

    blah blah blah[google](http://www.google.com)blah blah blah

the spaces either side of the link go missing.

This does not seem to happen when the space is a space character (" ") instead of &nbsp;.

Don't convert anchor tags with no href

Links within a page are currently destroyed. Need to add an option to either remove tags with no href or include them as html in the markdown.

e.g. (up the top of the page) -
<a href="#step1">Step 1</a>

(down the page)
<a id="step1"></a>

... fix for getting around this below ...

private function convert_anchor($node)
{
    .. snip ...

    if ( $href == "" ) {
        return html_entity_decode($node->C14N());
    }

    .. snip ...
}

Improve test-suite

It might be a good idea to improve the test-suite by looking at test data in projects like Parsedown.

Some of the results from Parsedown might conflict with the suggestion in #13 to use CommonMark, but that's because they aren't necessarily compliant with it themselves.

Spaces are stripped between consecutive tags

This test currently fails:

$this->html_gives_markdown("<b>Bold</b> <i>Italic</i>", "**Bold** *Italic*");

Output:

Expected :**Bold** *Italic*
Actual   :**Bold***Italic*

Need a better way to preserve spaces that exist between consecutive span tags.

Non-codefences should be escaped

Input:

<p>~~~ Marijn</p>

Actual

~~~ Marijn

Expected

\~~~ Marijn

Disable wiki

Script snippet causes fatal error

Passing this snippet as the content to convert causes a fatal error

<script type="text/javascript">document.write(unescape("%3Cscript src='http" +  (document.location.protocol == 'https:' ? 's' : '') + "://www.coffeecup.com/api/sdrive/forms/form.js?name=PGCERT%26slug=204562%26width=600%26height=780%26crossdomains=true' type='text/javascript'%3E%3C/script%3E"));</script>

An initial notice ('Trying to get property of non-object') appears in is_code_sample when trying to access $node->parentNode, and then a fatal error ('Call to a member function hasChildNodes() on a non-object') in convert_children when trying to call $node->hasChildNodes().

I guess it's quite likely that this markup collapses to nothing since it contains only a script tag, and so you end up with an empty DOM, however, it shouldn't crash!

Support a chain-of-responsibility/visitors on the elements

The current implementation expects only a single converter per HTML tag. However, it makes it hard to do things such as having multiple processing passes on the same tag.

For example, I have the following:

<span class="c1 c5">Some text</span>

With a chain of responsibility or visitors, we could have multiple converters go through the element and manipulate it. One of the converters could, for instance, see that the span has the class c5, modify the value of the element, and make it bold.

<span class="c1 c5">**Some text**</span>

I've already implemented something like this @ https://github.com/TomzxForks/html-to-markdown/tree/features/chain-of-responsibility. It builds on my other change where I add a preProcess/postProcess step when we're walking through the HTML tree.

When I process Chinese text, I always get a strange character like �

I parse the HTML with php-simple-html-dom, but when I try to convert it to Markdown, I always get �.

What can I do?

Ampersands in entity-like text aren't being escaped

Input:

<p>&amp;euro;</p>

Expected:

\&euro;

Actual:

€

Also, entities seemingly get parsed twice. Note that an ampersand before non-entities doesn't need to be escaped: <p>R&D</p> is perfectly translated as R&D.
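A possible approach, sketched on the assumption that the text node has already been decoded once (so the string in hand is literally "&euro;"): escape an ampersand only when the characters after it read as a named or numeric entity. The function name is hypothetical.

```php
<?php
// Hypothetical helper for illustration; not the library's actual code.
function escapeEntityLikeAmpersands(string $text): string
{
    // Escape "&" only when it starts something that reads as an entity:
    // &name;  &#123;  or  &#x1F;
    return preg_replace(
        '/&(?=(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);)/i',
        '\\\\&',
        $text
    );
}
```

The lookahead leaves "R&D" alone because no semicolon-terminated name follows the ampersand.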

Strong/em should not always be converted to asterisks/underscores

The converter should check whether the opening and closing delimiters would actually be parsed as such.

Input:

<p>Did you mean 200<strong>.</strong>000 instead of 200<strong>,</strong>000?</p>

Expected:

Did you mean 200<strong>.</strong>000 instead of 200<strong>,</strong>000?

Actual:

Did you mean 200**.**000 instead of 200**,**000?

Stop using PHP_EOL

Surely we want the output to be identical between different systems, not dependent on the system in use?

Incorrect comment?

I'm puzzled by lines 273-274 in the current master:

            // If strip_tags is false (the default), preserve tags that don't have Markdown equivalents,
            // such as <span> and #text nodes on their own. C14N() canonicalizes the node to a string.

There's a

        case "#text":

a few lines above (line 262), so isn't the comment wrong to mention #text nodes?

Spaces are always added between consecutive anchor tags

This test fails:

$this->html_gives_markdown("<a href='#'>Link 1</a><a href='#'>Link 2</a>", "[Link 1](#)[Link 2](#)");

Output:

Expected: [Link 1](#)[Link 2](#)
Actual:   [Link 1](#) [Link 2](#)

Ideally, anchors that do not have spaces between them in the HTML should not see spaces between them in the resulting Markdown.

This is caused by the workaround in convert_anchor(), and is related to issue #9.

Code blocks require 4 spaces instead of 1

So line 453 should be:

$markdown .= "    " . $line;

not

$markdown .= " " . $line;

...btw nice work

convert link to Jira markdown

Hi,

I am about to convert HTML code to JIRA markdown.

Is it possible to use a custom converter function instead of the default one?

<a href="http://example.com">link title</a>  ==> [link title|http://example.com]
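For reference, the target format itself is easy to produce once the href and the link text are known. The sketch below assumes a standalone function (toJiraLink is a hypothetical name) and does not handle titles that contain "|" or "]".

```php
<?php
// Hypothetical helper for illustration; how it would be plugged into the
// converter is exactly the open question of this issue.
function toJiraLink(string $href, string $title): string
{
    // Jira wiki markup: [title|url], or a bare [url] when there is no separate title.
    if ($title === '' || $title === $href) {
        return '[' . $href . ']';
    }

    return '[' . $title . '|' . $href . ']';
}

echo toJiraLink('http://example.com', 'link title'), "\n"; // [link title|http://example.com]
```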

Paragraphs with empty-ish values are removed

This happens because empty() considers "0" as being empty.

Input

<p>0</p>

Actual:

(empty output)

Expected:

0
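The pitfall is easy to reproduce in isolation. The sketch below uses a hypothetical isBlankParagraph helper to show a stricter check that treats only empty or whitespace-only strings as blank.

```php
<?php
// Hypothetical helper for illustration; not the library's actual code.
function isBlankParagraph(string $value): bool
{
    // empty('0') is true in PHP, so compare against the empty string instead.
    return trim($value) === '';
}

var_dump(empty('0'));            // bool(true)
var_dump(isBlankParagraph('0')); // bool(false)
```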

Possible horizontal rules aren't escaped

Input

<p>---</p>
<p>- - - </p>
<p>--</p>

Actual

---

- - -

--

Expected

\---

\- - -

--
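One way this check could look, as a sketch with a hypothetical function name: a line is treated as a possible horizontal rule when it consists of three or more identical "-", "_" or "*" characters, optionally separated by spaces or tabs, and only then is the first marker escaped.

```php
<?php
// Hypothetical helper for illustration; not the library's actual code.
function escapeThematicBreak(string $line): string
{
    // Three or more of the same marker, optionally separated by spaces or tabs.
    if (preg_match('/^ {0,3}([-_*])(?:[ \t]*\1){2,}[ \t]*$/', $line) === 1) {
        // Escaping the first marker is enough to break the pattern.
        return preg_replace('/[-_*]/', '\\\\$0', $line, 1);
    }

    return $line;
}
```

A line such as "--" has too few markers to match and is returned unchanged. Whether it needs escaping as a heading underline depends on the preceding line, which this sketch does not look at.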

Small improvement

In line 254 of the current release, instead of
$markdown = preg_replace('~\s+~', ' ', $value);
I write
$markdown = preg_replace('~\s+~', ' ', preg_replace('/^\s+/', '', $value));
because most browsers ignore whitespace at the beginning of a line.

Ordered list-like lines should be escaped

Input:

<p>123456789) Foo and 1234567890) Bar!</p>
<p>1. Platz in 'Das große Backen'</p>

Actual:

123456789) Foo and 1234567890) Bar!

1. Platz in 'Das große Backen'

Expected:

123456789\) Foo and 1234567890) Bar!

1\. Platz in 'Das große Backen'
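The expected output above follows from the CommonMark rule that an ordered list number has at most nine digits. A sketch of an escape that respects that limit (hypothetical function name, operating on a single line):

```php
<?php
// Hypothetical helper for illustration; not the library's actual code.
function escapeOrderedListMarker(string $line): string
{
    // 1 to 9 digits, then "." or ")", then whitespace or end of line.
    // Ten or more digits cannot start a list item, so they are left alone.
    return preg_replace('/^(\d{1,9})([.)])(?=\s|$)/', '$1\\\\$2', $line);
}
```

Only a marker at the very start of the line is escaped, which is why "1234567890) Bar!" in the middle of the first paragraph stays untouched.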

List-like lines should be escaped

Input:

<p>
+ Siri works well for TV and movies<br>
+ Really fast<br>
+ Games are a fun addition<br>
- Siri is extremely limited<br>
- Have to log in to every app individually<br>
- No 4K support
</p>

Actual:

+ Siri works well for TV and movies  
+ Really fast  
+ Games are a fun addition  
- Siri is extremely limited  
- Have to log in to every app individually  
- No 4K support

Expected:

\+ Siri works well for TV and movies  
\+ Really fast  
\+ Games are a fun addition  
\- Siri is extremely limited  
\- Have to log in to every app individually  
\- No 4K support
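A sketch of how such lines could be escaped after the paragraph text has been assembled (the function name is hypothetical): with the multiline flag, every line that starts with "-", "+" or "*" followed by whitespace gets a backslash in front of the marker.

```php
<?php
// Hypothetical helper for illustration; not the library's actual code.
function escapeBulletLikeLines(string $markdown): string
{
    // A "-", "+" or "*" at the start of a line, followed by a space or tab,
    // would otherwise be read as a list item marker.
    return preg_replace('/^( {0,3})([-+*])(?=[ \t])/m', '$1\\\\$2', $markdown);
}
```

Markers in the middle of a line are not affected, so "a - b" stays as written.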

Links do not render when href is same as the text

$test = '<a href="http://test.com/">http://test.com/</a>';

echo $markdown->convert($test);

produces: <http: test.com="">"<br><br><br><br></http:>

However after changing the text it works.

$test = '<a href="http://test.com/">test</a>';

echo $markdown->convert($test);

produces: [test](http://test.com/)

It actually even works if you change one character (I took off the / in the text):

$test = '<a href="http://test.com/">http://test.com</a>';

echo $markdown->convert($test);

produces: [http://test.com](http://test.com/)

Paragraph in list element is not converted properly

HTML:

<li>
  <h3>Header</h3>
  <p>Description</p>
</li>

Expected:

- ### Header
  Description

Actual:

- ### Header
Description

Which converted back to HTML results in this:

<ul>
  <li>
    <h3>Header</h3>
  </li>
</ul>
<p>Description</p>

Integrate with StyleCI

Would be nice to have automated code style checks. Use the same settings as thephpleague/commonmark.

Versions in files are wrong

Some files say 2.2.2 in them for example.

League Checklist

One of the League people made this checklist:

http://phppackagechecklist.com/

It's a solid list of good advice that packages should try to follow, based off of requirements we thought up for League packages.

So, take a look at that, but the minimum is:

Use League as the PSR-4 autoloader namespace. Shove code in a src folder.
Adhere to PSR-2 as the coding style guide.
List on Packagist with league as the vendor namespace.
Write unit tests. Aim for at least 80% coverage in version 1.
DocBlock all the things.
Use Semantic Versioning to manage version numbers.
Keep a Changelog.
Use Travis-CI or Circle-CI to automatically check coding standards and run tests.
Have an extensive README.

Unbalanced square brackets in links should be escaped

Input:

<p><a href="http://www.php7book.com/">Your guide to [[...] more.</a></p>
<p><a href="http://www.php7book.com/">Your guide to [...] more.</a></p>
<p><a href="http://www.php7book.com/">Your guide to [...]] more.</a></p>

Actual:

[Your guide to [[...] more.](http://www.php7book.com/)

[Your guide to [...] more.](http://www.php7book.com/)

[Your guide to [...]] more.](http://www.php7book.com/)

Expected:

[Your guide to \[[...] more.](http://www.php7book.com/)

[Your guide to [...] more.](http://www.php7book.com/)

[Your guide to [...]\] more.](http://www.php7book.com/)
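The expected output can be produced with a simple bracket-matching pass over the link text. The sketch below (hypothetical function name) pairs each "]" with the most recent unmatched "[" and escapes whatever is left over. It assumes the text contains no brackets that are already escaped.

```php
<?php
// Hypothetical helper for illustration; not the library's actual code.
function escapeUnbalancedBrackets(string $text): string
{
    $openStack = []; // byte offsets of "[" still waiting for a "]"
    $unmatched = []; // byte offsets that need a backslash

    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        if ($text[$i] === '[') {
            $openStack[] = $i;
        } elseif ($text[$i] === ']') {
            if ($openStack === []) {
                $unmatched[] = $i;       // closing bracket with no opener
            } else {
                array_pop($openStack);   // pairs with the most recent opener
            }
        }
    }

    // Openers that never found a partner are unbalanced too.
    $unmatched = array_merge($unmatched, $openStack);
    rsort($unmatched); // insert from the right so earlier offsets stay valid

    foreach ($unmatched as $offset) {
        $text = substr_replace($text, '\\', $offset, 0);
    }

    return $text;
}
```

Scanning byte by byte is safe for UTF-8 input here, because "[" and "]" never occur inside a multibyte sequence.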

In-word emphasis uses underscores as delimiter

Input:

<p>Did you check use the Test<em>Case</em>?</p>

Actual:

Did you check use the Test_Case_?

Expected:

Did you check use the Test*Case*?
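A sketch of one way to choose the delimiter, assuming the converter can see the text immediately before and after the <em> element (the function and parameter names are hypothetical): keep "_" by default and switch to "*" when a letter or digit touches the emphasis on either side.

```php
<?php
// Hypothetical helper for illustration; not the library's actual code.
function wrapEmphasis(string $content, string $before, string $after): string
{
    // "_" does not open or close emphasis inside a word, so fall back to "*"
    // when a letter or digit is directly adjacent.
    $touchesWord = preg_match('/[\p{L}\p{N}]$/u', $before) === 1
        || preg_match('/^[\p{L}\p{N}]/u', $after) === 1;

    $delimiter = $touchesWord ? '*' : '_';

    return $delimiter . $content . $delimiter;
}
```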

List required extensions as dependencies in composer.json

Related to #4

<span> tags are not cleaned up

It would be nice to have at least an option to simply discard all tags that aren't "translatable" into MD syntax.

In particular, here's the code I tried to convert to MD, with any identifying info removed:

<p class="MsoNormal"><span style="font-family: Cambria, serif; background-position: initial initial; background-repeat: initial initial;">We love Microsoft Office Word <strong>so much</strong>! Especially for the HTML code it generates!</span></p>
<p class="MsoNormal"><span>That's some text: </span></p>
<p class="MsoNormal"><span>That's some more text <br /> <!--[if !supportLineBreakNewLine]--><br /> <!--[endif]--></span></p>
<p class="MsoNormal"><span style="font-family: Cambria, serif; background-position: initial initial; background-repeat: initial initial;"><span style="text-decoration: underline;">List heading:</span></span></p>
<p> </p>
<p class="MsoNormal"><span><br /> Subhead one : <br /> - <strong>blah blah <br /> - bleh bleh <br /> - List item #3 <br /> - List item #4 </strong><br /> - OK, this was a lame way to make lists. But this is a real life example nevertheless... <br /> <br /> Subhead 2 : <br /> - Item 1 <br /> - Item 2 <br /> - Item 3</span></p>

You may try it yourself, and you'll see that the result is unsatisfactory, because all <span> tags remain in the code. Furthermore, what's even worse (and is perhaps worth filing a separate bug) is that for some strange reason, a nested <span> tag is not recognised and its angle brackets are being escaped, so it becomes &lt;span&gt; instead!

Header-like structures in content should be escaped

Input:

<p>Foo<br>--<br>Bar</p>
<p>Foo<br>Bar<br>--</p>

Actual:

Foo  
--  
Bar

Foo  
Bar  
--

Expected:

Foo  
\--  
Bar

Foo  
Bar  
\--

Indented list not correctly converted

The list

<ul>
    <li>normal list item
        <ul>
            <li>indented item</li>
            <li>another one</li>
        </ul>
    </li>
    <li>normal item again</li>
</ul>

should be converted to:

- normal list item
 - indented item
 - another one
- normal item again

But it is being converted to:

- normal list item- indented item
- another one
- normal item again
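To illustrate the depth tracking this needs, here is a sketch that renders a nested list from a plain array rather than from the DOM (the array shape and the function name are assumptions for the example). It indents by two spaces per level, which lines each nested marker up with the text of its parent item. This is the indentation CommonMark requires for a "- " marker, so it is slightly more than the one space shown in the expected output above.

```php
<?php
// Hypothetical helper for illustration; not the library's actual code.
/**
 * @param array<int, array{text: string, children?: array}> $items
 */
function renderNestedList(array $items, int $depth = 0): string
{
    $lines = [];
    foreach ($items as $item) {
        // Each item goes on its own line, indented according to its depth.
        $lines[] = str_repeat('  ', $depth) . '- ' . $item['text'];
        if (!empty($item['children'])) {
            $lines[] = renderNestedList($item['children'], $depth + 1);
        }
    }

    return implode("\n", $lines);
}
```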

Blockquote before a link isn't converted correctly

See this code:

<br><a href="http://www.heise.de/newsticker/meldung/NSA-Skandal-Snowden-Doku-Citizenfour-mit-Oscar-ausgezeichnet-2557279.html" target="_blank">NSA-Skandal: Snowden-Doku "Citizenfour" mit Oscar ausgezeichnet</a><blockquote>23.02.2015 07:07 Bei der Oscar-Verleihung ist der Dokumentarfilm "Citizenfour" &uuml;ber die NSA-Enth&uuml;llungen des Edward Snowden als bester Dokumentarfilm ausgezeichnet worden.

This converted to this code:

[NSA-Skandal: Snowden-Doku "Citizenfour" mit Oscar ausgezeichnet](http://www.heise.de/newsticker/meldung/NSA-Skandal-Snowden-Doku-Citizenfour-mit-Oscar-ausgezeichnet-2557279.html)> 23.02.2015 07:07 Bei der Oscar-Verleihung ist der Dokumentarfilm "Citizenfour" über die NSA-Enthüllungen des Edward Snowden als bester Dokumentarfilm ausgezeichnet worden.

The problem is that the Markdown viewers I tested don't recognize the ">" character as the start of a quote, because there is no newline before it.

`<code>` with not text-only content shouldn't use backticks

Input:

<p>Did you check use the <code>PHPUnit_Framework_Test<em>Case</em></code>?</p>

Actual:

Did you check use the `PHPUnit_Framework_Test<em>Case</em>`?

Expected:

Did you check use the <code>PHPUnit_Framework_Test*Case*</code>?

Use an intermediate AST

Some of the issues being reported are caused by (or related to) one of the following:

Converting HTML directly to a string
Not doing a great job of tracking element depth
Storing the resulting MD in the DOM tree

We'll need a major revamp of the codebase to address the issues that these things cause. I therefore propose implementing an AST as an intermediate conversion step, similar to how league/commonmark works.

So basically, as we traverse the DOM tree, we simultaneously build an AST which mirrors it (using the same Nodes and sub-classes from league/commonmark). This part should be fairly straightforward. Once the AST is built, we pass it along to renderers which convert that AST representation into the final Markdown.

In the long term, perhaps we could eventually merge the two codebases into a single library, which would be awesome! (This wouldn't be done for at least two major versions though)

Dependency on php-xml should be mentioned in the readme

"No dependencies except for PHP 5.2" is not really true. html2markdown doesn't work without the php-xml package installed and appropriate php extensions enabled.

missing setOption

To begin, thank you for this great tool.
The setOption method is missing on the converter. Either add it, or replace
"$converter->setOption('strip_tags', true);" with "$converter->getConfig()->setOption('strip_tags', true);" at line 74 of README.md, and likewise at lines 93 and 94.
