thephpleague / html-to-markdown
Convert HTML to Markdown with PHP
License: MIT License
Input:
<p>Het zoeken naar <em>een spel<strong>t</strong></em> (zaadje) in een hooiberg is moeilijk.
Expected:
Het zoeken naar _een spel**t**_ (zaadje) in een hooiberg is moeilijk.
Actual:
Het zoeken naar *een spel**t***(zaadje) in een hooiberg (zaadje) in een hooiberg is moeilijk.
With the latter getting converted to this html:
<p>Het zoeken naar <em>een spel</em><em>t</em>**(zaadje) in een hooiberg (zaadje) in een hooiberg is moeilijk.</p>
Input:
<p><league/commonmark>, <github.com/thephpleague>, <league@commonmark>, <https://github.com/thephpleague></p>
Actual:
<league/commonmark>, <github.com/thephpleague>,
<league@commonmark>, <https://github.com/thephpleague>
Expected:
<league/commonmark>, <github.com/thephpleague>,
\<league@commonmark>, \<https://github.com/thephpleague>
From this report in the Ghost WordPress plugin support forum:
<pre>touch ~/.profile cat >> ~/.profile <<EOF export PATH="$(brew --prefix homebrew/php/php56)/bin:$PATH" EOF </pre>
becomes:
touch ~/.profile cat >> ~/.profile
And:
<pre>args << "--with-z=/usr/local/Cellar/zlib/1.2.8"</pre>
becomes:
args
HTML To Markdown should probably not convert tags with class attributes to Markdown by default, so that the conversion is not destructive.
The option to convert all tags regardless (stripping class attributes) should be available, but probably not on by default.
This will likely help WordPress plugins such as Ghost to convert HTML to Markdown more faithfully without stripping essential class names.
This HTML:
<span class="span-class">Label: </span><a class="link-class" href="http://example.com">Link</a>
...should probably result in this (identical code):
<span class="span-class">Label: </span><a class="link-class" href="http://example.com">Link</a>
...instead of this:
<span class="span-class">Label: </span>[Link](http://example.com)
Given the HTML:
<a href="https://github.com/">https://github.com/</a>
Should probably be converted to:
<https://github.com/>
Instead of the current version:
[https://github.com/](https://github.com/)
My use case is converting HTML to text for plain text emails.
Reading the URL twice is a bit jarring. The two Markdown versions above are equivalent, but the shorter syntax is easier for a human to read.
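A minimal sketch of the rule this request describes, assuming the link text and the href are available as plain strings. The function name renderLink and the URL check are illustrative only and are not part of the library:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: emit the short autolink form when the visible text
// is exactly the same as the target URL, otherwise the regular inline link.
function renderLink(string $text, string $href): string
{
    // Conservative check (an assumption of this sketch): a scheme followed
    // by "://", with no whitespace or angle brackets anywhere in the URL.
    $isPlainUrl = preg_match('~^[a-z][a-z0-9+.-]*://[^\s<>]+$~i', $href) === 1;

    if ($text === $href && $isPlainUrl) {
        return '<' . $href . '>';
    }

    return '[' . $text . '](' . $href . ')';
}

echo renderLink('https://github.com/', 'https://github.com/'), "\n";
echo renderLink('Link', 'http://example.com'), "\n";
```

Comparing the text and the href for exact equality keeps the sketch predictable: a link whose text differs by even a trailing slash still gets the long form.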
As of now, the list of converters supported by the environment is hardcoded and cannot be extended. If one wants to add new converters (say, to support HTML files with a specific template), it would require extending HtmlConverter (to use a custom Environment class) and Environment (to add new converters).
Thus, it appears it would be useful to have the following:
- Environment available through $htmlConverter->getEnvironment()
- $environment->addConverter() public
This would allow someone to do the following:
$converter = new HtmlConverter();
$converter->getEnvironment()->addConverter($myConverter);
$converter->convert($content);
I am having an issue converting code blocks that have been wrapped in <pre><code>...</code></pre> into their appropriate triple-backtick Markdown syntax. Is this a known issue or a mistake on my part?
Input:
<pre><code class="language-ruby">def foo(x)
return 3
end
</code></pre>
Actual:
<code class="language-ruby">def foo(x)
return 3
end
Expected:
```ruby
def foo(x)
return 3
end
``` # comment after end fence because GitHub is weird
Also, care should be taken that fences are longer than any candidate fences inside the block.
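One way to pick a safe fence length, as a sketch only: count the longest run of backticks in the code and use one more than that, with a minimum of three. The names chooseFence and fencedBlock are hypothetical and do not exist in the library. Checking every backtick run is stricter than necessary, but it is simple and always safe:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: return a backtick fence that is longer than any
// run of backticks found inside $code (and at least three characters).
function chooseFence(string $code): string
{
    preg_match_all('/`+/', $code, $matches);

    $longest = 0;
    foreach ($matches[0] as $run) {
        $longest = max($longest, strlen($run));
    }

    return str_repeat('`', max(3, $longest + 1));
}

// Hypothetical helper: wrap $code in a fence, with an optional language tag.
function fencedBlock(string $code, string $language = ''): string
{
    $fence = chooseFence($code);

    return $fence . $language . "\n" . rtrim($code, "\n") . "\n" . $fence;
}

echo fencedBlock("def foo(x)\n  return 3\nend\n", 'ruby'), "\n";
```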
Since the converter uses DOMDocument internally, output needs to be sanitised. This happens in HtmlConverter::sanitize. Unfortunately, this step may also strip content, as shown in the following example.
Input:
<pre><code>...
<script type = "text/javascript">
function startTimer() {
var tim = window.setTimeout("hideMessage()", 5000)
}
</head>
<body>
...</pre></code>
Actual
...
<script type = "text/javascript">
function startTimer() {
var tim = window.setTimeout("hideMessage()", 5000)
}
...
Expected
...
<script type = "text/javascript">
function startTimer() {
var tim = window.setTimeout("hideMessage()", 5000)
}
</head>
</body>
...
Would be interesting to see this be in line with CommonMark, as it's a fairly solid new standard by a bunch of smart people. I've had a few issues with Markdown differing between Jekyll + Kramdown, Jekyll + Redcarpet, Leanpub, etc, one of which being lists in blockquotes and using CommonMark seems to fix them. Having output come out as a standard would be awesome I think.
Potentially driver based. Call it "Markdown" which is no-frills Gruber-compliant, and CommonMark which outputs the new standard.
Tests should match this.
Input:
<p>You forgot the <!--more--> tag!</p>
Actual:
You forgot the <!--more--> tag!
Expected:
You forgot the \<!--more--> tag!
As reported in #5, inner tags incorrectly become HTML entities:
<span><span>Test</span></span> => <span>&lt;span&gt;Test&lt;/span&gt;</span>
<div><div>Test</div></div> => <div>&lt;div&gt;Test&lt;/div&gt;</div>
If you have, for some reason,
blah blah blah&nbsp;<a href="http://www.google.com">google</a>&nbsp;blah blah blah
the resultant output is in the format:
blah blah blah[google](http://www.google.com)blah blah blah
The spaces on either side of the link go missing.
This does not seem to happen when the space is a regular space character (" ") instead of &nbsp;.
Links within a page are currently destroyed. Need to add an option to either remove tags with no href or include them as html in the markdown.
e.g. (up the top of the page) -
<a href="#step1">Step 1</a>
(down the page)
<a id="step1"></a>
... fix for getting around this below ...
private function convert_anchor($node)
{
.. snip ...
if ( $href == "" ) {
return html_entity_decode($node->C14N());
}
.. snip ...
}
This test currently fails:
$this->html_gives_markdown("<b>Bold</b> <i>Italic</i>", "**Bold** *Italic*");
Output:
Expected: **Bold** *Italic*
Actual: **Bold***Italic*
Need a better way to preserve spaces that exist between consecutive span tags.
Input:
<p>~~~ Marijn</p>
Actual
~~~ Marijn
Expected
\~~~ Marijn
Passing this snippet as the content to convert causes a fatal error
<script type="text/javascript">document.write(unescape("%3Cscript src='http" + (document.location.protocol == 'https:' ? 's' : '') + "://www.coffeecup.com/api/sdrive/forms/form.js?name=PGCERT%26slug=204562%26width=600%26height=780%26crossdomains=true' type='text/javascript'%3E%3C/script%3E"));</script>
An initial notice ('Trying to get property of non-object') appears in is_code_sample when trying to access $node->parentNode, and then a fatal error ('Call to a member function hasChildNodes() on a non-object') occurs in convert_children when trying to call $node->hasChildNodes().
I guess it's quite likely that this markup collapses to nothing since it contains only a script tag, so you end up with an empty DOM. However, it shouldn't crash!
The current implementation expects only a single converter per HTML tag. This makes it hard to do things such as running multiple processing passes on the same tag.
For example, I have the following:
<span class="c1 c5">Some text</span>
With a chain-of-responsibility or visitor approach, we could have multiple parsers going through the element and manipulating it. One of the converters could, for instance, see that the span has a class c5, modify the value of the element, and make it bold.
<span class="c1 c5">**Some text**</span>
I've already implemented something like this @ https://github.com/TomzxForks/html-to-markdown/tree/features/chain-of-responsibility. It builds on my other change where I add a preProcess/postProcess step when we're walking through the HTML tree.
I parse the HTML with php-simple-html-dom, but when I try to convert it to Markdown, I always get �.
What can I do?
Input:
<p>&euro;</p>
Expected:
\€
Actual:
€
Also, entities seemingly get parsed twice. Note that an ampersand before non-entities doesn't need to be escaped: <p>R&D</p> is perfectly translated as R&D.
Should check whether the opening and closing delimiters would actually be parsed as such.
Input:
<p>Did you mean 200<strong>.</strong>000 instead of 200<strong>,</strong>000?</p>
Expected:
Did you mean 200<strong>.</strong>000 instead of 200<strong>,</strong>000?
Actual:
Did you mean 200**.**000 instead of 200**,**000?
Surely we want the output to be identical between different systems, not dependent on the system in use?
I'm puzzled by lines 273-274 in the current master:
// If strip_tags is false (the default), preserve tags that don't have Markdown equivalents,
// such as <span> and #text nodes on their own. C14N() canonicalizes the node to a string.
There's a case "#text": a few lines above (line 262), so isn't the comment wrong to mention #text nodes?
This test fails:
$this->html_gives_markdown("<a href='#'>Link 1</a><a href='#'>Link 2</a>", "[Link 1](#)[Link 2](#)");
Output:
Expected: [Link 1](#)[Link 2](#)
Actual: [Link 1](#) [Link 2](#)
Ideally, anchors that do not have spaces between them in the HTML should not see spaces between them in the resulting Markdown.
This is caused by the workaround in convert_anchor(), and is related to issue #9.
Code blocks require 4 spaces instead of 1.
So line 453 should be:
$markdown .= "    " . $line;
not
$markdown .= " " . $line;
...btw nice work
Hi,
I am about to convert HTML code to JIRA markup.
Is it possible to use a custom converter function instead of the default one?
<a href="http://example.com">link title</a> ==> [link title|http://example.com]
This happens because empty() considers "0" as being empty.
Input
<p>0</p>
Actual:
Expected:
0
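The behaviour of empty() described here can be reproduced in isolation. The sketch below contrasts it with an explicit comparison against the empty string; hasContent is a hypothetical helper name, not a function from the library:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: treat text as present unless it is empty or
// consists only of whitespace. The string "0" counts as content.
function hasContent(string $text): bool
{
    return trim($text) !== '';
}

var_dump(empty('0'));      // empty() treats the string "0" as empty
var_dump(hasContent('0')); // the explicit check keeps it
```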
Input
<p>---</p>
<p>- - - </p>
<p>--</p>
Actual
---
- - -
--
Expected
\---
\- - -
--
In line 254 of the current release, instead of
$markdown = preg_replace('~\s+~', ' ', $value);
I write
$markdown = preg_replace('~\s+~', ' ', preg_replace('/^\s+/', '', $value));
because most browsers ignore whitespace at the beginning of a line.
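The suggested change can be tried as a standalone function. This is a sketch under the assumption that the text node value is a plain string; collapseWhitespace is a hypothetical name. Note that PCRE patterns in PHP need delimiters around the expression:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: drop leading whitespace, then collapse every
// remaining run of whitespace into a single space.
function collapseWhitespace(string $value): string
{
    // preg_replace() returns null on a regex error, hence the casts.
    $withoutLeading = (string) preg_replace('/^\s+/', '', $value);

    return (string) preg_replace('/\s+/', ' ', $withoutLeading);
}

echo collapseWhitespace("   Het zoeken \n  naar een speld"), "\n";
```

Trailing whitespace is collapsed to one space rather than removed, which mirrors the two-step replacement proposed above.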
Input:
<p>123456789) Foo and 1234567890) Bar!</p>
<p>1. Platz in 'Das große Backen'</p>
Actual:
123456789) Foo and 1234567890) Bar!
1. Platz in 'Das große Backen'
Expected:
123456789\) Foo and 1234567890) Bar!
1\. Platz in 'Das große Backen'
Input:
<p>
+ Siri works well for TV and movies<br>
+ Really fast<br>
+ Games are a fun addition<br>
- Siri is extremely limited<br>
- Have to log in to every app individually<br>
- No 4K support
</p>
Actual:
+ Siri works well for TV and movies
+ Really fast
+ Games are a fun addition
- Siri is extremely limited
- Have to log in to every app individually
- No 4K support
Expected:
\+ Siri works well for TV and movies
\+ Really fast
\+ Games are a fun addition
\- Siri is extremely limited
\- Have to log in to every app individually
\- No 4K support
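The expected outputs in this report and in the ordered-list report before it can be approximated with a per-line escape. The sketch below assumes the text has already been split into lines; escapeLineStart is a hypothetical name. It only handles bullet markers followed by whitespace and ordered-list markers of up to nine digits (the CommonMark limit), not thematic breaks such as ---:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: escape a leading list marker so that a line of
// plain text is not re-parsed as a list item.
function escapeLineStart(string $line): string
{
    // Bullet markers (-, +, *) only start a list when followed by whitespace.
    if (preg_match('/^[-+*]\s/', $line) === 1) {
        return '\\' . $line;
    }

    // Ordered-list markers: one to nine digits, then "." or ")",
    // then whitespace or the end of the line.
    if (preg_match('/^(\d{1,9})([.)])(\s|$)/', $line, $m) === 1) {
        return $m[1] . '\\' . substr($line, strlen($m[1]));
    }

    return $line;
}

foreach (['+ Really fast', '- No 4K support', "1. Platz in 'Das große Backen'"] as $line) {
    echo escapeLineStart($line), "\n";
}
```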
$test = '<a href="http://test.com/">http://test.com/</a>';
echo $markdown->convert($test);
produces: <http: test.com="">"<br><br><br><br></http:>
However after changing the text it works.
$test = '<a href="http://test.com/">test</a>';
echo $markdown->convert($test);
produces: [test](http://test.com/)
It actually even works if you change one character (I took off the / in the text):
$test = '<a href="http://test.com/">http://test.com</a>';
echo $markdown->convert($test);
produces: [http://test.com](http://test.com/)
HTML:
<li>
<h3>Header</h3>
<p>Description</p>
</li>
Expected:
- ### Header
  Description
Actual:
- ### Header
Description
Which converted back to HTML results in this:
<ul>
<li>
<h3>Header</h3>
</li>
</ul>
<p>Description</p>
Would be nice to have automated code style checks. Use the same settings as thephpleague/commonmark.
Some files say 2.2.2 in them for example.
One of the League people made this checklist:
http://phppackagechecklist.com/
It's a solid list of good advice that packages should try to follow, based off of requirements we thought up for League packages.
So, take a look at that, but the minimum is:
- League as the PSR-4 autoloader namespace. Shove code in a src folder.
- league as the vendor namespace.
Input:
<p><a href="http://www.php7book.com/">Your guide to [[...] more.</a></p>
<p><a href="http://www.php7book.com/">Your guide to [...] more.</a></p>
<p><a href="http://www.php7book.com/">Your guide to [...]] more.</a></p>
Actual:
[Your guide to [[...] more.](http://www.php7book.com/)
[Your guide to [...] more.](http://www.php7book.com/)
[Your guide to [...]] more.](http://www.php7book.com/)
Expected:
[Your guide to \[[...] more.](http://www.php7book.com/)
[Your guide to [...] more.](http://www.php7book.com/)
[Your guide to [...]\] more.](http://www.php7book.com/)
Input:
<p>Did you check use the Test<em>Case</em>?</p>
Actual:
Did you check use the Test_Case_?
Expected:
Did you check use the Test*Case*?
Related to #4
It would be nice to have at least an option to simply discard all tags that aren't "translatable" into MD syntax.
In particular, here's the code I tried to convert to MD, with any identifying info removed:
<p class="MsoNormal"><span style="font-family: Cambria, serif; background-position: initial initial; background-repeat: initial initial;">We love Microsoft Office Word <strong>so much</strong>! Especially for the HTML code it generates!</span></p>
<p class="MsoNormal"><span>That's some text: </span></p>
<p class="MsoNormal"><span>That's some more text <br /> <!--[if !supportLineBreakNewLine]--><br /> <!--[endif]--></span></p>
<p class="MsoNormal"><span style="font-family: Cambria, serif; background-position: initial initial; background-repeat: initial initial;"><span style="text-decoration: underline;">List heading:</span></span></p>
<p> </p>
<p class="MsoNormal"><span><br /> Subhead one : <br /> - <strong>blah blah <br /> - bleh bleh <br /> - List item #3 <br /> - List item #4 </strong><br /> - OK, this was a lame way to make lists. But this is a real life example nevertheless... <br /> <br /> Subhead 2 : <br /> - Item 1 <br /> - Item 2 <br /> - Item 3</span></p>
You may try it yourself, and you'll see that the result is unsatisfactory, because all <span> tags remain in the code. Furthermore, what's even worse (and is perhaps worth filing a separate bug) is that for some strange reason, a nested <span> tag is not recognised and its angle brackets are being escaped, so it becomes &lt;span&gt; instead!
Input:
<p>Foo<br>--<br>Bar</p>
<p>Foo<br>Bar<br>--</p>
Actual:
Foo
--
Bar
Foo
Bar
--
Expected:
Foo
\--
Bar
Foo
Bar
\--
The list
<ul>
<li>normal list item
<ul>
<li>indented item</li>
<li>another one</li>
</ul>
</li>
<li>normal item again</li>
</ul>
should be converted to:
- normal list item
    - indented item
    - another one
- normal item again
But it is being converted to:
- normal list item- indented item
- another one
- normal item again
See this code:
<br><a href="http://www.heise.de/newsticker/meldung/NSA-Skandal-Snowden-Doku-Citizenfour-mit-Oscar-ausgezeichnet-2557279.html" target="_blank">NSA-Skandal: Snowden-Doku "Citizenfour" mit Oscar ausgezeichnet</a><blockquote>23.02.2015 07:07 Bei der Oscar-Verleihung ist der Dokumentarfilm "Citizenfour" über die NSA-Enthüllungen des Edward Snowden als bester Dokumentarfilm ausgezeichnet worden.
This is converted to the following:
[NSA-Skandal: Snowden-Doku "Citizenfour" mit Oscar ausgezeichnet](http://www.heise.de/newsticker/meldung/NSA-Skandal-Snowden-Doku-Citizenfour-mit-Oscar-ausgezeichnet-2557279.html)> 23.02.2015 07:07 Bei der Oscar-Verleihung ist der Dokumentarfilm "Citizenfour" über die NSA-Enthüllungen des Edward Snowden als bester Dokumentarfilm ausgezeichnet worden.
The problem is that the Markdown renderers I tested don't recognize the ">" character as the start of a quote, because there is no newline before it.
Input:
<p>Did you check use the <code>PHPUnit_Framework_Test<em>Case</em></code>?</p>
Actual:
Did you check use the `PHPUnit_Framework_Test<em>Case</em>`?
Expected:
Did you check use the <code>PHPUnit_Framework_Test*Case*</code>?
Some of the issues being reported are caused by (or related to) one of the following:
We'll need a major revamp of the codebase to address the issues that these things cause. I therefore propose implementing an AST as an intermediate conversion step, similar to how league/commonmark works.
So basically, as we traverse the DOM tree, we simultaneously build an AST which mirrors it (using the same Nodes and sub-classes from league/commonmark). This part should be fairly straightforward. Once the AST is built, we pass it along to renderers, which convert that AST representation into the final Markdown.
In the long term, perhaps we could eventually merge the two codebases into a single library, which would be awesome! (This wouldn't be done for at least two major versions though)
"No dependencies except for PHP 5.2" is not really true. html2markdown doesn't work without the php-xml package installed and appropriate php extensions enabled.
To begin, thank you for this great tool.
The method setOption is missing on the converter. You either have to add it, or replace "$converter->setOption('strip_tags', true);" with "$converter->getConfig()->setOption('strip_tags', true);" at line 74 in README.md, and likewise at lines 93 and 94.