tautologistics / node-htmlparser

Forgiving HTML/XML/RSS Parser in JS for *both* Node and Browsers

License: MIT License

JavaScript 100.00%

node-htmlparser's Introduction

#NodeHtmlParser

A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS (yes, despite the name it works just fine in any modern browser). The parser can handle streams (chunked data) and supports custom handlers for writing custom DOMs/output.

##Installing

npm install htmlparser

##Running Tests

###Run tests under node:

node runtests.js

###Run tests in browser:

View runtests.html in any browser

##Usage In Node

var htmlparser = require("htmlparser");
var util = require("util");
var rawHtml = "Xyz <script language= javascript>var foo = '<<bar>>';< /  script><!--<!-- Waah! -- -->";
var handler = new htmlparser.DefaultHandler(function (error, dom) {
	if (error) {
		// Parsing failed: report or handle the error here
		console.error(error);
	} else {
		// Parsing done: dom holds the parsed node array
	}
});
var parser = new htmlparser.Parser(handler);
parser.parseComplete(rawHtml);
// Print the full DOM (depth null means no nesting limit)
console.log(util.inspect(handler.dom, false, null));

##Usage In Browser

var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error, dom) {
	if (error) {
		// Parsing failed: report or handle the error here
		console.error(error);
	} else {
		// Parsing done: dom holds the parsed node array
	}
});
var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
parser.parseComplete(document.body.innerHTML);
alert(JSON.stringify(handler.dom, null, 2));

##Example output

[ { raw: 'Xyz ', data: 'Xyz ', type: 'text' }
, { raw: 'script language= javascript'
  , data: 'script language= javascript'
  , type: 'script'
  , name: 'script'
  , attribs: { language: 'javascript' }
  , children: 
     [ { raw: 'var foo = \'<bar>\';<'
       , data: 'var foo = \'<bar>\';<'
       , type: 'text'
       }
     ]
  }
, { raw: '<!-- Waah! -- '
  , data: '<!-- Waah! -- '
  , type: 'comment'
  }
]

##Streaming To Parser

while (...) {
	...
	parser.parseChunk(chunk);
}
parser.done();
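
As a rough illustration of the loop above, the sketch below feeds an array of string chunks to any object that exposes `parseChunk` and `done`, the two methods used in the snippet. `feedChunks` and the recording `stubParser` are hypothetical stand-ins so that the example runs without the library; in real use you would presumably pass an `htmlparser.Parser` instance instead.

```javascript
// Hypothetical helper: pushes each chunk to the parser, then signals the end of input.
function feedChunks(parser, chunks) {
	for (var i = 0; i < chunks.length; i++) {
		parser.parseChunk(chunks[i]);
	}
	parser.done();
}

// Stand-in for htmlparser.Parser that only records what it receives.
// This is an assumption for illustration, not the real parser.
var stubParser = {
	received: [],
	finished: false,
	parseChunk: function (chunk) { this.received.push(chunk); },
	done: function () { this.finished = true; }
};

// A tag may be split across chunk boundaries; the parser sees the pieces in order.
feedChunks(stubParser, ["<p>Hel", "lo</p>"]);
console.log(stubParser.received.join("")); // <p>Hello</p>
console.log(stubParser.finished); // true
```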

##Streaming To Parser in Node

fs.createReadStream('./path_to_file.html').pipe(parser);

##Parsing RSS/Atom Feeds

new htmlparser.RssHandler(function (error, dom) {
	...
});

##DefaultHandler Options

###Usage

var handler = new htmlparser.DefaultHandler(
	  function (error) { ... }
	, { verbose: false, ignoreWhitespace: true }
	);

###Option: ignoreWhitespace

Indicates whether the DOM should exclude text nodes that consist solely of whitespace. The default value is "false".

####Example: true

The following HTML:

<font>
	<br>this is the text
<font>

becomes:

[ { raw: 'font'
  , data: 'font'
  , type: 'tag'
  , name: 'font'
  , children: 
     [ { raw: 'br', data: 'br', type: 'tag', name: 'br' }
     , { raw: 'this is the text\n'
       , data: 'this is the text\n'
       , type: 'text'
       }
     , { raw: 'font', data: 'font', type: 'tag', name: 'font' }
     ]
  }
]

####Example: false

The following HTML:

<font>
	<br>this is the text
<font>

becomes:

[ { raw: 'font'
  , data: 'font'
  , type: 'tag'
  , name: 'font'
  , children: 
     [ { raw: '\n\t', data: '\n\t', type: 'text' }
     , { raw: 'br', data: 'br', type: 'tag', name: 'br' }
     , { raw: 'this is the text\n'
       , data: 'this is the text\n'
       , type: 'text'
       }
     , { raw: 'font', data: 'font', type: 'tag', name: 'font' }
     ]
  }
]
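
The difference between the two outputs can be reproduced after the fact on a parsed DOM. The sketch below is only an illustration of the effect, not the library's implementation; `dropWhitespaceText` is a hypothetical helper that works on the node shape shown above.

```javascript
// Illustrative only: removes whitespace-only text nodes from an already-parsed DOM array.
// It mimics the effect of ignoreWhitespace: true and modifies children arrays in place.
function dropWhitespaceText(nodes) {
	var kept = [];
	for (var i = 0; i < nodes.length; i++) {
		var node = nodes[i];
		if (node.type === "text" && /^\s*$/.test(node.data)) {
			continue; // skip text nodes made up solely of whitespace
		}
		if (node.children) {
			node.children = dropWhitespaceText(node.children);
		}
		kept.push(node);
	}
	return kept;
}

// Part of the "false" output from above, used as input.
var dom = [ { raw: "font", data: "font", type: "tag", name: "font", children: [
	{ raw: "\n\t", data: "\n\t", type: "text" },
	{ raw: "br", data: "br", type: "tag", name: "br" },
	{ raw: "this is the text\n", data: "this is the text\n", type: "text" }
] } ];

var cleaned = dropWhitespaceText(dom);
console.log(cleaned[0].children.length); // 2
```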

###Option: verbose

Indicates whether to include extra information on each node in the DOM. This information consists of the "raw" attribute (original, unparsed text found between "<" and ">") and the "data" attribute on "tag", "script", and "comment" nodes. The default value is "true".

####Example: true

The following HTML:

<a href="test.html">xxx</a>

becomes:

[ { raw: 'a href="test.html"'
  , data: 'a href="test.html"'
  , type: 'tag'
  , name: 'a'
  , attribs: { href: 'test.html' }
  , children: [ { raw: 'xxx', data: 'xxx', type: 'text' } ]
  }
]

####Example: false

The following HTML:

<a href="test.html">xxx</a>

becomes:

[ { type: 'tag'
  , name: 'a'
  , attribs: { href: 'test.html' }
  , children: [ { data: 'xxx', type: 'text' } ]
  }
]

###Option: enforceEmptyTags

Indicates whether the DOM should prevent children on tags marked as empty in the HTML spec. Typically this should be set to "true" for HTML parsing and "false" for XML parsing. The default value is "true".

####Example: true

The following HTML:

<link>text</link>

becomes:

[ { raw: 'link', data: 'link', type: 'tag', name: 'link' }
, { raw: 'text', data: 'text', type: 'text' }
]

####Example: false

The following HTML:

<link>text</link>

becomes:

[ { raw: 'link'
  , data: 'link'
  , type: 'tag'
  , name: 'link'
  , children: [ { raw: 'text', data: 'text', type: 'text' } ]
  }
]

##DomUtils

###TBD (see utils_example.js for now)

##Related Projects

Looking for CSS selectors to search the DOM? Try Node-SoupSelect, a port of SoupSelect to NodeJS: http://github.com/harryf/node-soupselect

There's also a port of hpricot to NodeJS that uses HtmlParser for HTML parsing: http://github.com/silentrob/Apricot

node-htmlparser's Issues

Add ./index.js

or change lib/node-htmlparser.js to lib/htmlparser.js so I can localize / expose via require.paths using require('htmlparser')

No DefaultHandler in master branch?

I don't know whether I'm missing something, so please tell me if I am, and excuse me if this is too obvious, but there is no DefaultHandler method on the htmlparser object in the master branch (version 2.0.0).
I've tried using this library in my browser, but when I inspect the htmlparser (Tautologistics.NodeHtmlParser) object, it doesn't have such a method. However, it works like a charm in the 1.x version!
Is something missing from this branch, or am I missing something?
Thanks in advance.

meta value is not parsed correctly

Hi there,

If you look at a live journal entry like this one: http://cananian.livejournal.com/60624.html
You can see the

When doing: description = htmlparser.DomUtils.getElements( { tag_name: "meta", name: "description" }, dom);
Instead of having a result like this:

    [ { raw: 'meta name="description" value="the whole post in there"/',  
        data: 'meta name="description" value="the whole post in there"/',  
        type: 'tag',  
        name: 'meta',  
        attribs:   
         { name: 'description',  
           value: 'the whole post in there' } } ]

I have this:

    [ { raw: 'meta name="description" value="the whole post in there"/',
        data: 'meta name="description" value="the whole post in there"/',
        type: 'tag',
        name: 'meta',
        attribs: 
         { name: 'description',
           value: 'the whole post in there' ,
           the: 'the',
           whole: 'whole',
           post: 'post',
           in: 'in',
           there: 'there' } } ]

Hope it helps!

tagStack.last() fails all over

It fails all over for me. However, when I change it to tagStack['-1']() it's fine. I have no clue what you are doing here, but here is my trace. This happens when htmlparser is used with jsdom to parse even a simple string like <html><body><p>foo</p></body></html>.

    register.test.js test GET /signup: TypeError: Property 'last' of object [object Object] is not a function
     at DefaultHandler.DefaultHandler$handleElement [as handleElement] (/Users/tj/Projects/LearnBoost/tests/functional/support/htmlparser/lib/htmlparser.js:694:26)
     at DefaultHandler.DefaultHandler$writeTag [as writeTag] (/Users/tj/Projects/LearnBoost/tests/functional/support/htmlparser/lib/htmlparser.js:612:8)
     at Parser.Parser$writeHandler [as writeHandler] (/Users/tj/Projects/LearnBoost/tests/functional/support/htmlparser/lib/htmlparser.js:443:20)
     at Parser.Parser$parseTags [as parseTags] (/Users/tj/Projects/LearnBoost/tests/functional/support/htmlparser/lib/htmlparser.js:383:8)
     at Parser.Parser$parseChunk [as parseChunk] (/Users/tj/Projects/LearnBoost/tests/functional/support/htmlparser/lib/htmlparser.js:95:8)
     at Parser.Parser$parseComplete [as parseComplete] (/Users/tj/Projects/LearnBoost/tests/functional/support/htmlparser/lib/htmlparser.js:86:8)
     at Object.ParseHtml (/Users/tj/Projects/LearnBoost/tests/functional/support/jsdom/lib/jsdom/browser/htmltodom.js:62:16)
     at HtmlToDom.appendHtmlToElement (/Users/tj/Projects/LearnBoost/tests/functional/support/jsdom/lib/jsdom/browser/htmltodom.js:73:27)
     at Object.innerHTML (/Users/tj/Projects/LearnBoost/tests/functional/support/jsdom/lib/jsdom/browser/index.js:295:27)
     at Object.write (/Users/tj/Projects/LearnBoost/tests/functional/support/jsdom/lib/jsdom/browser/index.js:202:22)
     at Object.jsdom (/Users/tj/Projects/LearnBoost/tests/functional/support/jsdom/lib/jsdom.js:30:9)
     at /Users/tj/Projects/LearnBoost/tests/functional/register.test.js:26:23
     at next (/Users/tj/Projects/LearnBoost/tests/integration/support/expresso/bin/expresso:769:25)
     at runSuite (/Users/tj/Projects/LearnBoost/tests/integration/support/expresso/bin/expresso:787:6)
     at check (/Users/tj/Projects/LearnBoost/tests/integration/support/expresso/bin/expresso:648:16)
     at runFile (/Users/tj/Projects/LearnBoost/tests/integration/support/expresso/bin/expresso:652:10)
     at Array.forEach (native)
     at runFiles (/Users/tj/Projects/LearnBoost/tests/integration/support/expresso/bin/expresso:629:13)
     at run (/Users/tj/Projects/LearnBoost/tests/integration/support/expresso/bin/expresso:598:5)
     at Object.<anonymous> (/Users/tj/Projects/LearnBoost/tests/integration/support/expresso/bin/expresso:851:13)
     at Module._compile (node.js:462:23)
     at Module._loadScriptSync (node.js:469:10)
     at Module.loadSync (node.js:338:12)
     at Object.runMain (node.js:522:24)
     at Array.0 (node.js:756:12)
     at EventEmitter._tickCallback (node.js:55:22)
     at node.js:773:9

act like a real stream

Node has a streams API. Why do you have a custom stream API?

bug when parsing <script> tag using some template system

var htmlparser = require('htmlparser'),
    util = require('util'),
    handler = new htmlparser.DefaultHandler(function(err, dom){}),
    parser = new htmlparser.Parser(handler),
    rawHtml = '<script type="text/template"><h1>Heading1</h1></script>';

parser.parseComplete(rawHtml);
console.log(util.inspect(handler.dom, false, null));

This piece of code discards the "<" of <h1> and outputs:

[ { raw: 'script type="text/template"',
    data: 'script type="text/template"',
    type: 'script',
    name: 'script',
    attribs: { type: 'text/template' },
    children: 
     [ { raw: 'h1>Heading1</h1>',  // discard <
         data: 'h1>Heading1</h1>',
         type: 'text' } ] } ]

sync parse

Hi! Is it possible to synchronously parse fully collected HTML, so that the parser signature is the same as that of JSON.parse, require('qs').parse, and require('querystring').parse?

TIA,
--Vladimir

form detection in js

How can I detect a form that is created in JavaScript? I am unable to detect it.

For example:
var frm = document.createElement("form")

Is this project dead? If so is there an alternative?

It seems like there are a number of high value pull requests and this lib hasn't been touched in a year.

handle '<' and '>' characters in attribute values

I have HTML that looks like the following:

<span title="first line<br>second line"></span>

It would be great if the 'br' were treated as part of the attribute value of 'title' instead of being treated as a new tag element.

Streaming To Parser in Node - Example Request

Can you share an example of a working script using Streaming To Parser in Node?
Using your README text generates an error.

fs.createReadStream('./path_to_file.html').pipe(parser);

I get this error in console:

_stream_readable.js:476
  dest.on('unpipe', onunpipe);
       ^
TypeError: Object #<Parser> has no method 'on'
    at ReadStream.Readable.pipe (_stream_readable.js:476:8)
    at Object.<anonymous> (C:\dev\autorelease\script.js:54:8)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:902:3

Here is my simple script:

// get a file stream reader 
var reader = fs.createReadStream(process.argv[2]);

// get a file stream writer pointing to the json file to write to
var writer = fs.createWriteStream(input_json);

var htmlparser = require("htmlparser");
//var rawHtml = "Xyz <script language= javascript>var foo = '<<bar>>';< /  script><!--<!-- Waah! -- -->";
//var sys = require("sys");

var handler = new htmlparser.DefaultHandler(function (error, dom) {
    if (error)
        logger.log('error', 'handler error in parser', {error: error});
    else
        logger.log('info', '', {dom: JSON.stringify(dom)});
});
var parser = new htmlparser.Parser(handler);
//parser.parseComplete(reader);
//sys.puts(sys.inspect(handler.dom, false, null));

// pipe everything to do the conversion
reader.pipe(parser).pipe(writer);

Suspect code via npm (mal/adware???)

Any ideas why/where this is showing up?

I'm seeing a folder called "testdata" with dozens of links to torrents. I don't see it in source here on git for v 1.7.6, but npm is serving it up and my IDE is listing external dependencies for "show_ads.js", etc. We're also discussing this at https://groups.google.com/forum/#!topic/nodejs/0ZOIIHMp2_o

'>' char used inside attribute value is matched by parser incorrectly as closing whole tag section

Example: <memo if="i>1">...</memo>

"<" symbol not parsed as text when common browsers do it

If text in the HTML contains a "<", the text after it is not parsed.
Example: <title>We <3cupcakes</title>

The "<3cupcakes" is interpreted as a tag, whereas common browsers parse it as text.

BODY becomes child of HEAD when closing tag is missing

Some old W3C pages were written in old style HTML. e.g. http://www.w3.org/TR/css3-2d-transforms/

node-htmlparser should be more forgiving.
version: 1.7.2

var request = require('request');
var jsdom = require('jsdom');

var url = 'http://www.w3.org/TR/css3-2d-transforms/';
request({uri:url}, function (error, response, body) {
    var html = body;
    var doc = jsdom.jsdom(html, null, {url: url});
    console.log(doc.head+''); //[ HEAD ]
    console.log(doc.body === null); //true
    console.log(doc.head.childNodes[9].tagName); //BODY
});

1.x: Less thans and greater thans in attributes break the parser

Based on the example in http://www.whatwg.org/specs/web-apps/current-work/#attr-iframe-srcdoc :

var rawHtml = '<iframe srcdoc="<p>Yeah, you can see it <a href=&quot;/gallery?mode=cover&amp;amp;page=1&quot;>in my gallery</a>."></iframe>',
    htmlparser = require('./lib/htmlparser'),
    handler = new htmlparser.DefaultHandler(),
    parser = new htmlparser.Parser(handler);
parser.parseComplete(rawHtml);
console.warn(require('util').inspect(handler.dom, false, null));

Output:

[ { raw: 'iframe srcdoc="',
    data: 'iframe srcdoc="',
    type: 'tag',
    name: 'iframe',
    attribs: { srcdoc: 'srcdoc' },
    children: 
     [ { raw: 'p',
         data: 'p',
         type: 'tag',
         name: 'p',
         children: 
          [ { raw: 'Yeah, you can see it ',
              data: 'Yeah, you can see it ',
              type: 'text' },
            { raw: 'a href=&quot;/gallery?mode=cover&amp;amp;page=1&quot;',
              data: 'a href=&quot;/gallery?mode=cover&amp;amp;page=1&quot;',
              type: 'tag',
              name: 'a',
              attribs: { href: '&quot;/gallery?mode=cover&amp;amp;page=1&quot;' },
              children: [ { raw: 'in my gallery', data: 'in my gallery', type: 'text' } ] },
            { raw: '."', data: '."', type: 'text' } ] } ] } ]

Expected output:

[ { raw: 'iframe srcdoc="<p>Yeah, you can see it <a href=&quot;/gallery?mode=cover&amp;amp;page=1&quot;>in my gallery</a>."',
    data: 'iframe srcdoc="<p>Yeah, you can see it <a href=&quot;/gallery?mode=cover&amp;amp;page=1&quot;>in my gallery</a>."',
    type: 'tag',
    name: 'iframe',
   attribs: { srcdoc: '<p>Yeah, you can see it <a href=&quot;/gallery?mode=cover&amp;amp;page=1&quot;>in my gallery</a>.' } } ]

It works if I entitify the less thans and greater thans.

handle improperly escaped attributes

The wild internet sometimes contains weird stuff that makes this parser behave funny.

A tag such as this:
<a href="#" onclick="moveAddCommentBelow("div-comment-579747", 579747, true); return false;" />

Has attributes parsed like so:
{ href: '#'
, onclick: 'moveAddCommentBelow('
, 'div-comment-579747': 'div-comment-579747'
, ',': ','
, '579747,': '579747,'
, 'true);': 'true);'
, return: 'return'
, 'false;': 'false;'
}

<source> tags are not parsed properly

I might be doing the readout wrong, but this is the second time I've picked this up. It seems that <source> isn't identified as a void tag, so they become children of one another when listed inside a <video>:

var htmlparser = require('htmlparser');

var htmlContent = "<html><head></head><body><video><source src=\"foo.ogv\"><source src=\"lol.smaz\"></video><div></div></body></html>";

var handler = new htmlparser.DefaultHandler(function (error, dom) {
  function parse(dom, spacing){
    console.log(spacing, dom.name);
    if(dom.children){
      for(var i=0; i<dom.children.length; ++i){
        parse(dom.children[i], spacing + ' ');
      }
    }
  }
  parse(dom[0], '');
});

new htmlparser.Parser(handler).parseComplete(htmlContent);

cruft in npm package

Looks like you were testing with libxmljs.node in your local folder before running npm package? I noticed this because I wrote a script to scan for C++ node modules and it bumped into this libxmljs.node, which seemed odd because htmlparser does not list it as a dependency in package.json. Just thought you might want to know.

Parser hangs on some input

The parser hangs with 100% cpu usage when parsing this file.

This can be reproduced by cloning https://gist.github.com/ed8a5b157d8644d68ca7.git and running bug.js.

Ignore html tags inside of <SCRIPT> tags

node-htmlparser fails when the following snippet is present in the html document.

http://gist.github.com/498560

Clearly since the offending tag <SCR"+"IPT is inside of another <SCRIPT> tag node-htmlparser should not be trying to parse it.

1.x: HTML comment delimiters inside <script> ends the tag

var rawHtml = '<script>document.write("<!--hello-->");</script>',
    htmlparser = require('./lib/htmlparser'),
    handler = new htmlparser.DefaultHandler(),
    parser = new htmlparser.Parser(handler);
parser.parseComplete(rawHtml);
console.warn(require('util').inspect(handler.dom, false, null));

Output:

[ { raw: 'script',
    data: 'script',
    type: 'script',
    name: 'script',
    children: 
     [ { raw: 'document.write("',
         data: 'document.write("',
         type: 'text' },
       { raw: 'hello', data: 'hello', type: 'comment' },
       { raw: '");', data: '");', type: 'text' } ] } ]

Expected output:

[ { raw: 'script',
    data: 'script',
    type: 'script',
    name: 'script',
    children: 
     [ { raw: 'document.write("<!--hello-->");',
         data: 'document.write("<!--hello-->");',
         type: 'text' } ] } ]

Problem when a <script> tag is not closed.

If the parsed document has a <script> tag which is not closed, the parser acts as if the document were empty.
Example:

var
    htmlparser = require('htmlparser'),
    parser,
    pHandler,
    data = '<html><body><h1>Bla</h1><script src="somewhere"></body></html>';

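pHandler = new htmlparser.DefaultHandler(function(err,doc){
    if ( err )
        throw err;
    console.log("doc: ",doc);
});

// Run the parser on the document (this step was missing from the snippet)
parser = new htmlparser.Parser(pHandler);
parser.parseComplete(data);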

The result is:
doc: []

Bug in parser

node-htmlparser parses these two snippets in the same way!

<span class="copyright link">Copyright content</span>

<span class="copyright link">Copyright content</spane>

Can you give me a quick rundown on how to use this?

I'm trying to integrate this into jsDOM to fix issues with parsing '<' or '>' in attributes.
After I successfully do that, I'll update the directions in the readme based on my experience.

EDIT: Also, if it's ready, can you publish to npm?

[Suggestion] Add support for reserialisation

As a suggestion to the DomUtils "submodule", it'd be neat if it included a way to re-serialize a DOM node back to HTML.
Basically, I'm using this library as a way to pre-process HTML before it's served to the user (an XML-flavoured templating engine, if you wish).

(I didn't find any obvious way to mark an issue as a suggestion/bug/..., hence why I added [suggestion] to the title. Did I miss anything? Github newbie here...)

Convert JSON back to HTML

I would like to convert the DOM to JSON, modify the JSON, and convert it back to a DOM string.

Please suggest
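
Both this request and the reserialisation suggestion above come down to walking the node array and writing tags back out. A minimal sketch follows, assuming the node shape shown in the README output (`type`, `name`, `attribs`, `children`, `data`). `toHtml`, `escapeQuotes` and the short `VOID_TAGS` list are hypothetical names, not part of the library, and node types other than text, comment, tag and script are skipped.

```javascript
// Partial list of tags that take no children or closing tag (assumption: extend as needed).
var VOID_TAGS = { br: true, hr: true, img: true, input: true, link: true, meta: true };

// Attribute values are written back as-is apart from double quotes, on the
// assumption that the parser left any character references undecoded.
function escapeQuotes(value) {
	return String(value).replace(/"/g, "&quot;");
}

function toHtml(nodes) {
	var out = "";
	for (var i = 0; i < nodes.length; i++) {
		var node = nodes[i];
		if (node.type === "text") {
			out += node.data;
		} else if (node.type === "comment") {
			out += "<!--" + node.data + "-->";
		} else if (node.type === "tag" || node.type === "script") {
			out += "<" + node.name;
			var attribs = node.attribs || {};
			for (var key in attribs) {
				if (Object.prototype.hasOwnProperty.call(attribs, key)) {
					out += " " + key + '="' + escapeQuotes(attribs[key]) + '"';
				}
			}
			out += ">";
			if (!VOID_TAGS[node.name]) {
				out += toHtml(node.children || []) + "</" + node.name + ">";
			}
		}
		// Any other node type is ignored in this sketch.
	}
	return out;
}

var dom = [ { type: "tag", name: "a", attribs: { href: "test.html" },
	children: [ { data: "xxx", type: "text" } ] } ];
console.log(toHtml(dom)); // <a href="test.html">xxx</a>
```

Because the DOM is plain objects and arrays, `JSON.stringify` and `JSON.parse` can be used for the JSON round trip before calling a function like this.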

Character References

Node-htmlparser doesn't decode character references, e.g. &amp; for "&" or a reference for "å".

Example: "<h1>A&amp;W</h1>" parses into:

H1

#text( A&amp;W )

The correct parse tree is:

H1

#text( A&W )

http://www.w3.org/TR/html4/charset.html#h-5.3

http://www.w3.org/TR/html4/sgml/entities.html
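
To make the requested behaviour concrete, here is a small sketch of decoding character references in a text node's data after parsing. `decodeCharacterReferences` and the deliberately tiny `NAMED_REFERENCES` table are hypothetical; a full implementation would need the complete entity list from the pages linked above.

```javascript
// Small, deliberately incomplete table of named references (assumption: extend as needed).
var NAMED_REFERENCES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", aring: "\u00e5" };

function decodeCharacterReferences(text) {
	return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g, function (match, body) {
		if (body.charAt(0) === "#") {
			// Numeric reference: decimal (&#229;) or hexadecimal (&#xE5;)
			var isHex = body.charAt(1).toLowerCase() === "x";
			var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
			if (code > 0x10FFFF) {
				return match; // outside the Unicode range: leave untouched
			}
			return String.fromCodePoint(code);
		}
		// Named reference: decode only the names in the table, leave others untouched
		return Object.prototype.hasOwnProperty.call(NAMED_REFERENCES, body)
			? NAMED_REFERENCES[body]
			: match;
	});
}

console.log(decodeCharacterReferences("A&amp;W")); // A&W
```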

TypeError: Tautologistics.NodeHtmlParser.DefaultHandler is not a constructor

Hi I copied and pasted your example usage code for browsers and I get this error: TypeError: Tautologistics.NodeHtmlParser.DefaultHandler is not a constructor.

https://jsfiddle.net/skibulk/Lhnmjjs8/

maybe you should update your documents~

/Users/owen/Documents/workspace/node-htmlparser/snippet.js:8
var handler = new htmlparser.DefaultHandler(function(err, dom) {
^
TypeError: undefined is not a function
at Object.<anonymous> (/Users/owen/Documents/workspace/node-htmlparser/snippet.js:8:15)
at Module._compile (module.js:449:26)
at Object.Module._extensions..js (module.js:467:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Module.runMain (module.js:492:10)
at process.startup.processNextTick.process._tickCallback (node.js:244:9)

why can't I npm install v2.0.0

npm install htmlparser installs v1.7.3 rather than v2.0.0. How can I get the newest version?

crazy IE-generated HTML is not normalized

IE8 (at least) will take the following HTML:

<!DOCTYPE html>
<html>
    <head></head>
    <body></body>
</html>

And convert it to:

<!DOCTYPE HTML>
<HTML>
    <HEAD></HEAD>
    <BODY></BODY>
</HTML>

The neat part: node-htmlparser handles this just fine!

The bad: libraries like soupselect (https://github.com/harryf/node-soupselect) and the DomUtils included with node-htmlparser will fail when trying to select 'body'. The DomUtils will select 'BODY' properly, but it's a pain to have to try to select BOTH 'body' and 'BODY'.

Should this be handled on the parser-side of things or the selector side of things? I'm not sure. Part of me thinks that the parser should normalize the HTML to an extent, such as make all the tags lowercase. At the same time, a selector engine could do this normalization when searching.

Thoughts?
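
As a sketch of the selector-side option, the lookup below compares tag names case-insensitively, so 'body' also finds 'BODY'. `findByTagName` is a hypothetical helper written against the node shape from the README, not an existing DomUtils function.

```javascript
// Hypothetical helper: collects elements whose tag name matches regardless of case.
function findByTagName(nodes, tagName, found) {
	found = found || [];
	var wanted = tagName.toLowerCase();
	for (var i = 0; i < nodes.length; i++) {
		var node = nodes[i];
		if (node.name && node.name.toLowerCase() === wanted) {
			found.push(node);
		}
		if (node.children) {
			findByTagName(node.children, tagName, found);
		}
	}
	return found;
}

// DOM shaped like the IE8-style uppercase markup above.
var dom = [ { type: "tag", name: "HTML", children: [
	{ type: "tag", name: "HEAD" },
	{ type: "tag", name: "BODY" }
] } ];

console.log(findByTagName(dom, "body").length); // 1
```

The parser-side option would instead lowercase `name` once while building the DOM, which keeps selectors simple but loses the original casing.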

the test file seem to fail in v1.2

Can you verify that the test file works in v1.2?

npm errors

Hey there,

I tried installing htmlparser with npm, and it had some issues:

http://gist.github.com/476153

Thanks!

does not support attribute names without values

example:

<div by-zero name="something">

If a tag like this appears anywhere in the HTML, parsing stops shortly after it. I saw this tag in the reddit source code, where my code bombed.

blog.kickstarter.com is not correctly parsed. cdata tags?

http://blog.kickstarter.com/post/5770516169/new-projects-are-rolling-dice

I can't tell what's causing it, but the html gets parsed into approximately 16 root elements, instead of 2 (doctype, html). It could be CDATA, or the combination of crazy tumblr injected content near the bottom.

Any way to get textContent or innerHTML?

I need to get the text contents inside an element, and before reinventing the wheel I want to be sure that nothing similar already exists.
Would it be useful if I created a few utility methods that could be merged into DomUtils?
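
For reference, a textContent-style helper can be written in a few lines against the node shape from the README. This is a sketch only: `getText` is a hypothetical name, and like `textContent` in browsers it also includes the text inside script elements.

```javascript
// Hypothetical helper: concatenates the data of all text nodes, depth first.
function getText(nodes) {
	var text = "";
	for (var i = 0; i < nodes.length; i++) {
		var node = nodes[i];
		if (node.type === "text") {
			text += node.data;
		} else if (node.children) {
			text += getText(node.children);
		}
	}
	return text;
}

var dom = [ { type: "tag", name: "p", children: [
	{ type: "text", data: "see " },
	{ type: "tag", name: "a", attribs: { href: "test.html" },
		children: [ { type: "text", data: "xxx" } ] }
] } ];

console.log(getText(dom)); // see xxx
```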

Vulnerable Regular Expression

The following regular expression used in parsing the HTML documents is vulnerable to ReDoS:

/(^\s+|\s+$)/g

The slowdown is moderately low: for 50,000 characters, around 2.5 seconds of matching time. However, I would still suggest one of the following:

remove the regex,
anchor the regex,
limit the number of characters that can be matched by the repetition,
limit the input size.

If needed, I can provide an actual example showing the slowdown.
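
One way to follow the first suggestion (removing the regex) is to trim with two index scans, which take time linear in the input length. The sketch below is not the library's code; `trimLinear` and `isWhitespace` are hypothetical names, and the whitespace set is limited to ASCII characters, which is narrower than `\s`.

```javascript
// ASCII whitespace only: space, tab, LF, VT, FF, CR (narrower than the regex \s class).
function isWhitespace(code) {
	return code === 32 || (code >= 9 && code <= 13);
}

// Scans inward from both ends once; no backtracking is possible.
function trimLinear(text) {
	var start = 0;
	var end = text.length;
	while (start < end && isWhitespace(text.charCodeAt(start))) {
		start++;
	}
	while (end > start && isWhitespace(text.charCodeAt(end - 1))) {
		end--;
	}
	return text.slice(start, end);
}

console.log(JSON.stringify(trimLinear("  a b \n"))); // "a b"
```

Where it is available, the built-in `String.prototype.trim` is another option.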

TypeError: Cannot read property '_ownerDocument' of undefined

var jsdom = require('jsdom').jsdom;
var doc = jsdom('<html><body><input value="<"></body></html>');

issue discovered originally in JSDOM which uses html parser:

jsdom/jsdom#266

should htmlparser handle doctype definitions?

In the older version of htmlparser, it was output as a directive; in the current trunk, however, it doesn't do anything.

example:
