thephpleague / html-to-markdown

Convert HTML to Markdown with PHP

License: MIT License

Languages: PHP 99.92%, HTML 0.08%
Topics: php, markdown, commonmark, html, converter, phpleague, hacktoberfest

html-to-markdown's Issues

Bold in italic isn't converted properly

Input:

<p>Het zoeken naar <em>een spel<strong>t</strong></em> (zaadje) in een hooiberg is moeilijk.

Expected:

Het zoeken naar _een spel**t**_ (zaadje) in een hooiberg is moeilijk.

Actual:

Het zoeken naar *een spel**t***(zaadje) in een hooiberg (zaadje) in een hooiberg is moeilijk.

With the latter getting converted to this html:

<p>Het zoeken naar <em>een spel</em><em>t</em>**(zaadje) in een hooiberg (zaadje) in een hooiberg is moeilijk.</p>

Non-autolinks should get escaped

Input:

<p>&lt;league/commonmark&gt;, &lt;github.com/thephpleague&gt;, &lt;league@commonmark&gt;, &lt;https://github.com/thephpleague&gt;</p>

Actual:

<league/commonmark>, <github.com/thephpleague>,
 <league@commonmark>, <https://github.com/thephpleague>

Expected:

<league/commonmark>, <github.com/thephpleague>,
 \<league@commonmark>, \<https://github.com/thephpleague>

Content after << is stripped in pre tags

From this report in the Ghost WordPress plugin support forum:

<pre>touch ~/.profile cat >> ~/.profile <<EOF export PATH="$(brew --prefix homebrew/php/php56)/bin:$PATH" EOF </pre>

becomes:

touch ~/.profile cat >> ~/.profile

And:

<pre>args << "--with-z=/usr/local/Cellar/zlib/1.2.8"</pre>

becomes:

args

Preserve links and other tags with class attributes

HTML To Markdown should probably not convert tags with class attributes to Markdown by default, so that the conversion is not destructive.

The option to convert all tags regardless (stripping class attributes) should be available, but probably not on by default.

This will likely help WordPress plugins such as Ghost to convert HTML to Markdown more faithfully without stripping essential class names.

This HTML:

<span class="span-class">Label: </span><a class="link-class" href="http://example.com">Link</a>

...should probably result in this (identical code):

<span class="span-class">Label: </span><a class="link-class" href="http://example.com">Link</a>

...instead of this:

<span class="span-class">Label: </span>[Link](http://example.com)

For an anchor, when href equals innerHTML, use simpler link syntax

Given the HTML:

<a href="https://github.com/">https://github.com/</a>

Should probably be converted to:

<https://github.com/>

Instead of the current version:

[https://github.com/](https://github.com/)

My use case is converting HTML to text for plain text emails.

Reading the URL twice is a bit jarring. The two Markdown versions above are equivalent, but the shorter syntax is easier for a human to read.
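A minimal sketch of the proposed rule, assuming a hypothetical helper named linkToMarkdown (it is not part of the library). It emits the autolink form only when the link text is exactly the href and the href looks like an absolute URI, and falls back to the regular inline syntax otherwise:

```php
<?php
// Hypothetical helper for illustration; not part of league/html-to-markdown.
function linkToMarkdown(string $href, string $text): string
{
    // Autolinks need a scheme (2 to 32 characters) followed by ":" and
    // must not contain whitespace or angle brackets.
    $isAbsolute = preg_match('/^[A-Za-z][A-Za-z0-9+.\-]{1,31}:[^\s<>]*$/', $href) === 1;

    if ($text === $href && $isAbsolute) {
        return '<' . $href . '>';
    }

    return '[' . $text . '](' . $href . ')';
}

echo linkToMarkdown('https://github.com/', 'https://github.com/'), "\n"; // <https://github.com/>
echo linkToMarkdown('https://github.com/', 'GitHub'), "\n";              // [GitHub](https://github.com/)
```

Relative URLs keep the long form because the angle-bracket syntax only works for absolute URIs.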

Adding new converters

As of now, the list of converters supported by the environment is hardcoded and cannot be extended. If one wants to add new converters (say to support HTML files with a specific template), it would require extending HtmlConverter (to use a custom Environment class) and Environment (to add new converters).

Thus, it appears it would be useful to have the following:

  • Environment available through $htmlConverter->getEnvironment()
  • $environment->addConverter() public

This would allow someone to do the following:

$converter = new HtmlConverter();
$converter->getEnvironment()->addConverter($myConverter);

$converter->convert($content);

Code blocks not being parsed correctly

I am having an issue converting code blocks that have been wrapped in <code><pre>...</pre></code> into the appropriate triple-backtick Markdown syntax. Is this a known issue or a mistake on my part?

Code with specified language should render as fenced codeblocks

Input:

<pre><code class="language-ruby">def foo(x)
  return 3
end
</code></pre>

Actual:

    <code class="language-ruby">def foo(x)
      return 3
    end

Expected:

```ruby
def foo(x)
  return 3
end
``` # comment after end fence because GitHub is weird

Also, care should be taken that fences are longer than any candidate-fences in the block
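One way to pick a safe fence length, sketched with a hypothetical helper named fencedCodeBlock (an assumption, not the library's code): find the longest run of backticks anywhere in the code and make the fence one character longer, with a minimum of three. This is stricter than necessary, since only runs at the start of a line can close a fence, but it is simple and always safe.

```php
<?php
// Hypothetical helper for illustration; not part of league/html-to-markdown.
function fencedCodeBlock(string $code, string $language = ''): string
{
    // Length of the longest backtick run found in the code (0 if none).
    $longestRun = 0;
    if (preg_match_all('/`+/', $code, $matches)) {
        $longestRun = max(array_map('strlen', $matches[0]));
    }

    // The fence must be at least 3 backticks and longer than any run inside.
    $fence = str_repeat('`', max(3, $longestRun + 1));

    return $fence . $language . "\n" . rtrim($code, "\n") . "\n" . $fence;
}

echo fencedCodeBlock("def foo(x)\n  return 3\nend\n", 'ruby'), "\n";
```

For the Ruby input above the fence stays at three backticks; code that itself contains a three-backtick line would get a four-backtick fence.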

Final rendering may strip content

Since the converter uses DOMDocument internally, output needs to be sanitised. This happens in HtmlConverter::sanitize. Unfortunately, this step may also strip content, as shown in the following example.

Input:

<pre><code>...
&lt;script type = "text/javascript"&gt;
function startTimer() {
   var tim = window.setTimeout("hideMessage()", 5000)
}
</head>
<body>
...</pre></code>

Actual

    ...
    <script type = "text/javascript">
    function startTimer() {
       var tim = window.setTimeout("hideMessage()", 5000)
    }


    ...

Expected

    ...
    <script type = "text/javascript">
    function startTimer() {
       var tim = window.setTimeout("hideMessage()", 5000)
    }
    </head>
    </body>
    ...

CommonMark support

It would be interesting to bring this in line with CommonMark, as it's a fairly solid new standard by a bunch of smart people. I've had a few issues with Markdown differing between Jekyll + Kramdown, Jekyll + Redcarpet, Leanpub, etc. One of them involves lists in blockquotes, and using CommonMark seems to fix it. Having the output conform to a standard would be awesome, I think.

Potentially driver based. Call it "Markdown" which is no-frills Gruber-compliant, and CommonMark which outputs the new standard.

Tests should match this.

Tag-like items should be escaped

Input:

<p>You forgot the &lt;!--more--&gt; tag!</p>

Actual:

You forgot the <!--more--> tag!

Expected:

You forgot the \<!--more--> tag!
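A rough sketch of such escaping, assuming a hypothetical helper named escapeTagLikeText that runs on decoded text nodes only. It is deliberately conservative: it escapes every "<" that could begin a tag, comment, declaration or processing instruction, so it may add more backslashes than strictly required.

```php
<?php
// Hypothetical helper for illustration; not part of league/html-to-markdown.
function escapeTagLikeText(string $text): string
{
    // "<" followed by a letter, "/", "!" or "?" could start raw HTML.
    // The replacement '\\\\<' is the PHP string \\< which preg_replace
    // writes out as a literal backslash followed by "<".
    return preg_replace('/<(?=[A-Za-z\/!?])/', '\\\\<', $text);
}

echo escapeTagLikeText('You forgot the <!--more--> tag!'), "\n"; // You forgot the \<!--more--> tag!
echo escapeTagLikeText('1 < 2'), "\n";                           // 1 < 2
```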

&nbsp; before/after <a> tags are lost

If you have, for some reason

    blah blah blah&nbsp;<a href="http://www.google.com">google</a>&nbsp;blah blah blah

the resultant output is in the format:

    blah blah blah[google](http://www.google.com)blah blah blah

the spaces either side of the link go missing.

This does not seem to happen when the space is a space character (" ") instead of &nbsp;.

Don't convert anchor tags with no href

Links within a page are currently destroyed. Need to add an option to either remove tags with no href or include them as html in the markdown.

e.g. (at the top of the page):
<a href="#step1">Step 1</a>

(down the page)
<a id="step1"></a>

... fix for getting around this below ...

private function convert_anchor($node)
{
    .. snip ...

    if ( $href == "" ) {
        return html_entity_decode($node->C14N());
    }

    .. snip ...
}

Improve test-suite

It might be a good idea to improve the test-suite by looking at test data in projects like Parsedown.

Some of the results from Parsedown might conflict with suggestions in #13 to use CommonMark, but that's because they aren't necessarily compliant with it themselves.

Spaces are stripped between consecutive tags

This test currently fails:

$this->html_gives_markdown("<b>Bold</b> <i>Italic</i>", "**Bold** *Italic*");

Output:

Expected :**Bold** *Italic*
Actual   :**Bold***Italic*

Need a better way to preserve spaces that exist between consecutive span tags.

Script snippet causes fatal error

Passing this snippet as the content to convert causes a fatal error

<script type="text/javascript">document.write(unescape("%3Cscript src='http" +  (document.location.protocol == 'https:' ? 's' : '') + "://www.coffeecup.com/api/sdrive/forms/form.js?name=PGCERT%26slug=204562%26width=600%26height=780%26crossdomains=true' type='text/javascript'%3E%3C/script%3E"));</script>

An initial notice ('Trying to get property of non-object') appears in is_code_sample when trying to access $node->parentNode, and then a fatal error ('Call to a member function hasChildNodes() on a non-object') in convert_children when trying to call $node->hasChildNodes().

I guess it's quite likely that this markup collapses to nothing since it contains only a script tag, and so you end up with an empty DOM, however, it shouldn't crash!

Support a chain-of-responsibility/visitors on the elements

The current implementation expects only a single converter per HTML tag. However, it makes it hard to do things such as having multiple processing passes on the same tag.

For example, I have the following:

<span class="c1 c5">Some text</span>

With a chain-of-responsibility/visitors, we could have multiple parsers going through the element and manipulating it. One of the converters could, for instance, see that the span has a class c5, modify the value of the element, and make it bold.

<span class="c1 c5">**Some text**</span>

I've already implemented something like this @ https://github.com/TomzxForks/html-to-markdown/tree/features/chain-of-responsibility. It builds on my other change where I add a preProcess/postProcess step when we're walking through the HTML tree.

Ampersands in entity-like text aren't being escaped

Input:

<p>&amp;euro;</p>

Expected:

\&euro;

Actual:

Also, entities seemingly get parsed twice. Note that an ampersand before non-entities doesn't need to be escaped: <p>R&amp;D</p> is perfectly translated as R&D.
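A small sketch of the escaping rule described here, assuming a hypothetical helper named escapeEntityLikeAmpersands that receives already-decoded text. It escapes "&" only when what follows has the shape of a named, decimal or hexadecimal character reference, so "R&D" is left alone. It does not check whether a named reference actually exists, which means it can over-escape.

```php
<?php
// Hypothetical helper for illustration; not part of league/html-to-markdown.
function escapeEntityLikeAmpersands(string $text): string
{
    // Shapes matched: &name;  &#1234;  &#x20AC;
    $pattern = '/&(?=(?:[A-Za-z][A-Za-z0-9]*|#[0-9]{1,7}|#[xX][0-9A-Fa-f]{1,6});)/';

    // '\\\\&' is the PHP string \\& which yields a literal backslash plus "&".
    return preg_replace($pattern, '\\\\&', $text);
}

echo escapeEntityLikeAmpersands('&euro;'), "\n"; // \&euro;
echo escapeEntityLikeAmpersands('R&D'), "\n";    // R&D
```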

Strong/em should not always be converted to asterisks/underscores

Should check whether the opening and closing delimiters would actually be parsed as such.

Input:

<p>Did you mean 200<strong>.</strong>000 instead of 200<strong>,</strong>000?</p>

Expected:

Did you mean 200<strong>.</strong>000 instead of 200<strong>,</strong>000?

Actual:

Did you mean 200**.**000 instead of 200**,**000?

Stop using PHP_EOL

Surely we want the output to be identical between different systems, not dependent on the system in use?
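For illustration, a sketch of platform-independent output, assuming a hypothetical helper named normalizeLineEndings: blocks are joined with a literal "\n" and any "\r\n" or "\r" that slipped in from the input is normalised.

```php
<?php
// Hypothetical helper for illustration; not part of league/html-to-markdown.
function normalizeLineEndings(string $markdown): string
{
    // "\r\n" is replaced first so that its "\r" is not converted twice.
    return str_replace(["\r\n", "\r"], "\n", $markdown);
}

// Join blocks with "\n" instead of PHP_EOL so every system produces the same bytes.
$blocks = ['# Title', "First line\r\nSecond line"];
echo normalizeLineEndings(implode("\n\n", $blocks)), "\n";
```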

Incorrect comment?

I'm puzzled by lines 273-274 in the current master:

            // If strip_tags is false (the default), preserve tags that don't have Markdown equivalents,
            // such as <span> and #text nodes on their own. C14N() canonicalizes the node to a string.

There's a

        case "#text":

a few lines above (line 262), so isn't the comment wrong to mention #text nodes?

Spaces are always added between consecutive anchor tags

This test fails:

$this->html_gives_markdown("<a href='#'>Link 1</a><a href='#'>Link 2</a>", "[Link 1](#)[Link 2](#)");

Output:

Expected: [Link 1](#)[Link 2](#)
Actual:   [Link 1](#) [Link 2](#)

Ideally, anchors that do not have spaces between them in the HTML should not see spaces between them in the resulting Markdown.

This is caused by the workaround in convert_anchor(), and is related to issue #9.

convert link to Jira markdown

Hi,

I am about to convert HTML code to JIRA markdown.

Is it possible to use a custom converter function instead of the default one?

<a href="http://example.com">link title</a>  ==> [link title|http://example.com]

Small improvement

In line 254 of the current release, instead of
$markdown = preg_replace('/\s+/', ' ', $value);
I write
$markdown = preg_replace('/\s+/', ' ', preg_replace('/^\s+/', '', $value));
because most browsers ignore spaces at the beginning of a line.

Ordered list-like lines should be escaped

Input:

<p>123456789) Foo and 1234567890) Bar!</p>
<p>1. Platz in 'Das große Backen'</p>

Actual:

123456789) Foo and 1234567890) Bar!

1. Platz in 'Das große Backen'

Expected:

123456789\) Foo and 1234567890) Bar!

1\. Platz in 'Das große Backen'
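The expected output above follows from the CommonMark rule that an ordered list marker is 1 to 9 digits followed by "." or ")". A sketch under that assumption, with a hypothetical helper named escapeOrderedListMarker that expects paragraph text without leading indentation:

```php
<?php
// Hypothetical helper for illustration; not part of league/html-to-markdown.
function escapeOrderedListMarker(string $markdown): string
{
    // Match 1-9 digits plus "." or ")" at the start of a line, but only when
    // followed by a space, a tab or the end of the line.
    return preg_replace_callback(
        '/^(\d{1,9})([.)])(?=[ \t]|$)/m',
        function (array $m): string {
            return $m[1] . '\\' . $m[2]; // put a backslash before the delimiter
        },
        $markdown
    );
}

echo escapeOrderedListMarker('123456789) Foo and 1234567890) Bar!'), "\n"; // 123456789\) Foo and 1234567890) Bar!
echo escapeOrderedListMarker("1. Platz in 'Das große Backen'"), "\n";       // 1\. Platz in 'Das große Backen'
```

A ten-digit number at the start of a line is left untouched because it cannot be a list marker.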

List-like lines should be escaped

Input:

<p>
+ Siri works well for TV and movies<br>
+ Really fast<br>
+ Games are a fun addition<br>
- Siri is extremely limited<br>
- Have to log in to every app individually<br>
- No 4K support
</p>

Actual:

+ Siri works well for TV and movies  
+ Really fast  
+ Games are a fun addition  
- Siri is extremely limited  
- Have to log in to every app individually  
- No 4K support

Expected:

\+ Siri works well for TV and movies  
\+ Really fast  
\+ Games are a fun addition  
\- Siri is extremely limited  
\- Have to log in to every app individually  
\- No 4K support
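A sketch of the same idea for bullet markers, assuming a hypothetical helper named escapeBulletMarker applied to paragraph text whose lines carry no leading indentation:

```php
<?php
// Hypothetical helper for illustration; not part of league/html-to-markdown.
function escapeBulletMarker(string $markdown): string
{
    // A "+", "-" or "*" at the start of a line, followed by a space, a tab or
    // the end of the line, would open a bullet list.
    // '\\\\$1' is the PHP string \\$1: a literal backslash plus the marker.
    return preg_replace('/^([+*-])(?=[ \t]|$)/m', '\\\\$1', $markdown);
}

echo escapeBulletMarker("+ Really fast  \n- No 4K support"), "\n";
```

Text such as "-5 degrees" is not touched, because the marker is not followed by whitespace.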

Links do not render when href is same as the text

$test = '<a href="http://test.com/">http://test.com/</a>';

echo $markdown->convert($test);

produces: <http: test.com="">"<br><br><br><br></http:>

However after changing the text it works.

$test = '<a href="http://test.com/">test</a>';

echo $markdown->convert($test);

produces: [test](http://test.com/)

It actually even works if you change one character (I took off the / in the text):

$test = '<a href="http://test.com/">http://test.com</a>';

echo $markdown->convert($test);

produces: [http://test.com](http://test.com/)

Paragraph in list element is not converted properly

HTML:

<li>
  <h3>Header</h3>
  <p>Description</p>
</li>

Expected:

- ### Header
  Description

Actual:

- ### Header
Description

Which converted back to HTML results in this:

<ul>
  <li>
    <h3>Header</h3>
  </li>
</ul>
<p>Description</p>
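The expected output requires every line after the first to be indented to the width of the list marker. A minimal sketch, assuming a hypothetical helper named renderListItem that receives the already-converted Markdown of the item's children:

```php
<?php
// Hypothetical helper for illustration; not part of league/html-to-markdown.
function renderListItem(string $content, string $marker = '-'): string
{
    // Continuation lines are indented by the marker width plus one space.
    $indent = str_repeat(' ', strlen($marker) + 1);
    $lines = explode("\n", trim($content));

    foreach ($lines as $i => $line) {
        if ($i === 0) {
            $lines[$i] = $marker . ' ' . $line;
        } elseif ($line !== '') {
            $lines[$i] = $indent . $line; // blank lines stay empty
        }
    }

    return implode("\n", $lines);
}

echo renderListItem("### Header\nDescription"), "\n";
// - ### Header
//   Description
```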

Integrate with StyleCI

Would be nice to have automated code style checks. Use the same settings as thephpleague/commonmark.

League Checklist

One of the League people made this checklist:

http://phppackagechecklist.com/

It's a solid list of good advice that packages should try to follow, based off of requirements we thought up for League packages.

So, take a look at that, but the minimum is:

  • Use League as the PSR-4 autoloader namespace. Shove code in a src folder.
  • Adhere to PSR-2 as the coding style guide.
  • List on Packagist with league as the vendor namespace.
  • Write unit tests. Aim for at least 80% coverage in version 1.
  • DocBlock all the things.
  • Use Semantic Versioning to manage version numbers.
  • Keep a Changelog.
  • Use Travis-CI or Circle-CI to automatically check coding standards and run tests.
  • Have an extensive README.

Unbalanced square brackets in links should be escaped

Input:

<p><a href="http://www.php7book.com/">Your guide to [[...] more.</a></p>
<p><a href="http://www.php7book.com/">Your guide to [...] more.</a></p>
<p><a href="http://www.php7book.com/">Your guide to [...]] more.</a></p>

Actual:

[Your guide to [[...] more.](http://www.php7book.com/)

[Your guide to [...] more.](http://www.php7book.com/)

[Your guide to [...]] more.](http://www.php7book.com/)

Expected:

[Your guide to \[[...] more.](http://www.php7book.com/)

[Your guide to [...] more.](http://www.php7book.com/)

[Your guide to [...]\] more.](http://www.php7book.com/)
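One possible way to get the expected output, sketched with a hypothetical helper named escapeUnbalancedBrackets: pair brackets with a stack, then backslash-escape only the ones left without a partner. Balanced pairs stay untouched.

```php
<?php
// Hypothetical helper for illustration; not part of league/html-to-markdown.
function escapeUnbalancedBrackets(string $text): string
{
    $open = [];      // offsets of "[" still waiting for a matching "]"
    $unmatched = []; // offsets of "]" that had no "[" to close

    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        if ($text[$i] === '\\') {
            $i++; // skip the character that is already escaped
        } elseif ($text[$i] === '[') {
            $open[] = $i;
        } elseif ($text[$i] === ']') {
            if ($open) {
                array_pop($open);
            } else {
                $unmatched[] = $i;
            }
        }
    }

    // Any "[" left on the stack is unmatched as well.
    $unmatched = array_merge($unmatched, $open);
    rsort($unmatched); // insert from the end so earlier offsets stay valid

    foreach ($unmatched as $offset) {
        $text = substr_replace($text, '\\', $offset, 0);
    }

    return $text;
}

echo escapeUnbalancedBrackets('Your guide to [[...] more.'), "\n"; // Your guide to \[[...] more.
echo escapeUnbalancedBrackets('Your guide to [...]] more.'), "\n"; // Your guide to [...]\] more.
```

Byte offsets are safe here because the brackets and the backslash are single-byte characters in UTF-8.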

<span> tags are not cleaned up

It would be nice to have at least an option to simply discard all tags that aren't "translatable" into MD syntax.

In particular, here's the code I tried to convert to MD, with any identifying info removed:

<p class="MsoNormal"><span style="font-family: Cambria, serif; background-position: initial initial; background-repeat: initial initial;">We love Microsoft Office Word <strong>so much</strong>! Especially for the HTML code it generates!</span></p>
<p class="MsoNormal"><span>That's some text: </span></p>
<p class="MsoNormal"><span>That's some more text <br /> <!--[if !supportLineBreakNewLine]--><br /> <!--[endif]--></span></p>
<p class="MsoNormal"><span style="font-family: Cambria, serif; background-position: initial initial; background-repeat: initial initial;"><span style="text-decoration: underline;">List heading:</span></span></p>
<p> </p>
<p class="MsoNormal"><span><br /> Subhead one : <br /> - <strong>blah blah <br /> - bleh bleh <br /> - List item #3 <br /> - List item #4 </strong><br /> - OK, this was a lame way to make lists. But this is a real life example nevertheless... <br /> <br /> Subhead 2 : <br /> - Item 1 <br /> - Item 2 <br /> - Item 3</span></p>

You may try it yourself, and you'll see that the result is unsatisfactory, because all <span> tags remain in the output. Furthermore, what's even worse (and perhaps worth filing as a separate bug) is that, for some strange reason, a nested <span> tag is not recognised and its angle brackets are escaped, so it becomes &lt;span&gt; instead!

Indented list not correctly converted

The list

<ul>
    <li>normal list item
        <ul>
            <li>indented item</li>
            <li>another one</li>
        </ul>
    </li>
    <li>normal item again</li>
</ul>

should be converted to:

- normal list item
 - indented item
 - another one
- normal item again

But it is being converted to:

- normal list item- indented item
- another one
- normal item again

Blockquote before a link isn't converted correctly

See this code:

<br><a href="http://www.heise.de/newsticker/meldung/NSA-Skandal-Snowden-Doku-Citizenfour-mit-Oscar-ausgezeichnet-2557279.html" target="_blank">NSA-Skandal: Snowden-Doku "Citizenfour" mit Oscar ausgezeichnet</a><blockquote>23.02.2015 07:07 Bei der Oscar-Verleihung ist der Dokumentarfilm "Citizenfour" &uuml;ber die NSA-Enth&uuml;llungen des Edward Snowden als bester Dokumentarfilm ausgezeichnet worden.

This converted to this code:

[NSA-Skandal: Snowden-Doku "Citizenfour" mit Oscar ausgezeichnet](http://www.heise.de/newsticker/meldung/NSA-Skandal-Snowden-Doku-Citizenfour-mit-Oscar-ausgezeichnet-2557279.html)> 23.02.2015 07:07 Bei der Oscar-Verleihung ist der Dokumentarfilm "Citizenfour" über die NSA-Enthüllungen des Edward Snowden als bester Dokumentarfilm ausgezeichnet worden.

The problem is that the Markdown viewers I tested don't recognize the ">" character as the start of a quote because there is no newline before it.

Use an intermediate AST

Some of the issues being reported are caused by (or related to) one of the following:

  1. Converting HTML directly to a string
  2. Not doing a great job of tracking element depth
  3. Storing the resulting MD in the DOM tree

We'll need a major revamp of the codebase to address the issues that these things cause. I therefore propose implementing an AST as an intermediate conversion step, similar to how league/commonmark works.

So basically, as we traverse the DOM tree, we simultaneously build an AST which mirrors it (using the same Nodes and sub-classes from league/commonmark). This part should be fairly straightforward. Once the AST is built, we pass it along to renderers which convert that AST representation into the final Markdown.

In the long term, perhaps we could eventually merge the two codebases into a single library, which would be awesome! (This wouldn't be done for at least two major versions though)
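To make the proposal a little more concrete, here is a toy sketch of the rendering half. All class and function names (AstNode, AstText, AstEmphasis, AstStrong, AstParagraph, renderMarkdown) are made up for this example and are much simpler than the real league/commonmark node classes. The DOM traversal is omitted; the tree is built by hand.

```php
<?php
// Hypothetical node types for illustration only.
abstract class AstNode
{
    /** @var AstNode[] */
    public $children = [];

    public function __construct(array $children = [])
    {
        $this->children = $children;
    }
}

final class AstText extends AstNode
{
    public $literal;

    public function __construct(string $literal)
    {
        parent::__construct();
        $this->literal = $literal;
    }
}

final class AstEmphasis extends AstNode {}
final class AstStrong extends AstNode {}
final class AstParagraph extends AstNode {}

// Renders a node by rendering its children first, then wrapping the result.
function renderMarkdown(AstNode $node): string
{
    if ($node instanceof AstText) {
        return $node->literal;
    }

    $inner = implode('', array_map('renderMarkdown', $node->children));

    if ($node instanceof AstEmphasis) {
        return '_' . $inner . '_';
    }
    if ($node instanceof AstStrong) {
        return '**' . $inner . '**';
    }
    if ($node instanceof AstParagraph) {
        return $inner . "\n\n";
    }

    return $inner;
}

$ast = new AstParagraph([
    new AstText('Het zoeken naar '),
    new AstEmphasis([new AstText('een spel'), new AstStrong([new AstText('t')])]),
    new AstText(' (zaadje)'),
]);

echo renderMarkdown($ast); // Het zoeken naar _een spel**t**_ (zaadje)
```

Because the renderer sees the whole tree, decisions such as delimiter choice or indentation depth can depend on a node's ancestors instead of on string output that has already been produced.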

missing setOption

To begin, thank you for this great tool.
The method setOption is missing on the converter. Either add it, or
replace "$converter->setOption('strip_tags', true);" with "$converter->getConfig()->setOption('strip_tags', true);" at line 74 of README.md, and likewise at lines 93 and 94.
