cheeriojs / cheerio

The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

Home Page: https://cheerio.js.org

License: MIT License

HTML 23.87% TypeScript 74.24% JavaScript 1.37% CSS 0.43% MDX 0.09%
cheerio jquery htmlparser2 dom htmlparser selector scraper parser html hacktoberfest

cheerio's People

Contributors

0xbadc0ffee, 2020steve, 5saviahv, alexbardas, alexindigo, andineck, arb, bensheldon, coderaiser, cvrebert, davidchambers, dependabot-preview[bot], dependabot[bot], dianelooney, fb55, finspin, github-actions[bot], greenkeeper[bot], inikulin, jugglinmike, kpdecker, maciek416, matthewmueller, nleush, robashton, rwaldin, stevenvachon, twolfson, xhmikosr, yields

cheerio's Issues

Copy the functionality of outerHTML

I'm using cheerio on my node.js server, and using the method html() I can copy the functionality of innerHTML, but I haven't found a way to copy how outerHTML works. How could I do this?

Thanks :)
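To make the innerHTML/outerHTML distinction concrete, here is a minimal sketch over plain node objects. The node shape (`type`, `name`, `attribs`, `children`, `data`) and the helper names `innerHtml` and `outerHtml` are assumptions for illustration, not cheerio's API, and the serializer ignores void elements and escaping.

```javascript
// Serialize an attribute map such as { id: 'one' } to ' id="one"'.
// Values are emitted as-is; a real serializer would also escape them.
function renderAttribs(attribs) {
  return Object.keys(attribs || {}).map(function (name) {
    return ' ' + name + '="' + attribs[name] + '"';
  }).join('');
}

// innerHTML: only the serialized children of the node.
function innerHtml(node) {
  return (node.children || []).map(function (child) {
    return outerHtml(child);
  }).join('');
}

// outerHTML: the node's own tag wrapped around its inner HTML.
function outerHtml(node) {
  if (node.type === 'text') return node.data;
  return '<' + node.name + renderAttribs(node.attribs) + '>' +
    innerHtml(node) + '</' + node.name + '>';
}

var div = {
  type: 'tag', name: 'div', attribs: { id: 'one' },
  children: [{ type: 'text', data: 'Hello' }]
};

console.log(innerHtml(div)); // Hello
console.log(outerHtml(div)); // <div id="one">Hello</div>
```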

Context ignored on selection

Forgive me if I've misread the docs, but I can't understand how this kind of result would be useful / intentional:

    var myHTML = '<tr><a>myLink1</a></tr><tr><a>myLink2</a></tr>';
    var $ = cheerio.load(myHTML);
    var rows = $('tr');
    var links = $('a', rows.get(0));
    console.log("There are " + links.length + " links in the first row")

gives me

There are 2 links in the first row.

I'm using the first row as context so I should only get 1 link for that first row. I also tried rows[0] instead of rows.get(0) to no avail.

Is this a bug?
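The expected behaviour can be sketched without any library: a search scoped to a context node should only walk that node's descendants. The node objects and the `findByName` helper are hypothetical stand-ins, not cheerio internals.

```javascript
// Collect descendants of `context` whose tag name matches `name`.
// Nodes outside the context subtree are never visited.
function findByName(context, name) {
  var matches = [];
  (context.children || []).forEach(function walk(node) {
    if (node.type === 'tag' && node.name === name) matches.push(node);
    (node.children || []).forEach(function (child) { walk(child); });
  });
  return matches;
}

function link(text) {
  return { type: 'tag', name: 'a', children: [{ type: 'text', data: text }] };
}

var rows = [
  { type: 'tag', name: 'tr', children: [link('myLink1')] },
  { type: 'tag', name: 'tr', children: [link('myLink2')] }
];
var root = { type: 'root', children: rows };

console.log(findByName(root, 'a').length);    // 2: whole document
console.log(findByName(rows[0], 'a').length); // 1: first row only
```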

Unnecessary writing to stdout

Lots of unnecessary Here!s, There!s, and random console.logs from debugging. Not a huge priority, but gets kind of annoying.

.html() returns outerHTML, while .html(html) sets innerHTML

I can't believe this one has gone unnoticed. I'm not really sure what to do about it actually. I think it's more useful how it is right now, but I don't like the asymmetry. I also don't like straying away from the jQuery API.

What do you guys think?

Add support for $("script, link, img")

Being able to select multiple types of tags at once is immensely useful.

A few ways to implement this:

  • Monkey-patch soup-select for now, submit patch to author
  • Implement jQuery's .add() function and use that
  • Take a look at Sizzle engine, and see how difficult the port would be

attribute style includes extra space

It's much more common to have no space between the equal sign in attributes: for instance <a href="http://github.com/">GitHub</a> is much more common than <a href = "http://github.com/">GitHub</a> which cheerio is producing. Is there anything I'm missing here, or would it be better to remove the extra whitespace?

$ = require('cheerio').load('<a>GitHub</a>');
$('a').attr('href', 'http://github.com/');
console.log($.html()) // => '<a href = "http://github.com/">GitHub</a>'

Missing properties on cheerio objects.

Under some circumstances when cheerio objects are generated from html, parent/child relationships aren't created.

I've created the following test case that fails by returning null.

$("<div></div>").append("<div><div></div></div>").children().children().parent();

Issue with selecting nested elements

I was attempting to select elements based on DOM structure, e.g. "li.a div.b span.c" (because span.c exists in other structures as well, but I only want to access those that match this structure), but instead of the 40 elements that match on the page, the code returned 840 elements, with multiple duplicates, so I assume it gets all the possible combinations and just returns them all.

E.g.

$('li.a div.b span.c', html).each(function(i, el) {
// Has 840 items
});

vs.

$('span.c', html).each(function(i, el) {
  if ($(this).parent('div').attr('class') == 'b' && $(this).parent('div').parent('li').attr('class') == 'a') {
    // Has 40 items
  }
});

Is this a (typical) user error or an issue with the selector?
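If the extra results really are the same elements reported several times, one possible workaround is to de-duplicate by object identity. This is only a sketch: `uniqueNodes` is a hypothetical helper, and it hides the symptom rather than fixing the selector engine.

```javascript
// Keep the first occurrence of each node, comparing by identity,
// and preserve the original order.
function uniqueNodes(nodes) {
  var seen = new Set();
  return nodes.filter(function (node) {
    if (seen.has(node)) return false;
    seen.add(node);
    return true;
  });
}

var spanA = { name: 'span' };
var spanB = { name: 'span' };
// The same two nodes reported several times, as a stand-in for the bug.
var matched = [spanA, spanB, spanA, spanA, spanB];

console.log(uniqueNodes(matched).length); // 2
```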

cheerio-soupselect doesn't work now

Hello, Matthew,

I have just downloaded cheerio and found that it doesn't work! I installed it with npm install cheerio.
The problem is in the cheerio-soupselect module: it doesn't work with htmlparser2. I don't know exactly what is wrong, but when I downloaded harryf/node-soupselect, placed it into the cheerio-soupselect folder, and installed htmlparser1, the library came alive.

Please check the current distribution; I think the cheerio-soupselect module has errors.

Regards, Dmitry

Test not running

Seems like a little typo - the test suite points to tests instead of test


JosProDesk:cheerio jos$ npm install

JosProDesk:cheerio jos$ npm test

> [email protected] test /Users/jos/Sites/cheerio
> coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec


node.js:201
        throw e; // process.nextTick error, or 'error' event on first tick
              ^
Error: Cannot find module '/Users/jos/Sites/cheerio/tests/test.cheerio'
    at Function._resolveFilename (module.js:334:11)
    at Function._load (module.js:279:25)
    at Module.require (module.js:357:17)
    at require (module.js:368:17)
    at /usr/local/lib/node_modules/vows/bin/vows:496:19
    at Array.reduce (native)
    at importSuites (/usr/local/lib/node_modules/vows/bin/vows:491:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/vows/bin/vows:247:15)
    at Module._compile (module.js:432:26)
    at Object..js (module.js:450:10)
npm ERR! [email protected] test: `coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec`
npm ERR! `sh "-c" "coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec"` failed with 1
npm ERR! 
npm ERR! Failed at the [email protected] test script.
npm ERR! This is most likely a problem with the cheerio package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec
npm ERR! You can get their info via:
npm ERR!     npm owner ls cheerio
npm ERR! There is likely additional logging output above.
npm ERR! 
npm ERR! System Darwin 10.8.0
npm ERR! command "node" "/usr/local/bin/npm" "test"
npm ERR! cwd /Users/jos/Sites/cheerio
npm ERR! node -v v0.6.7
npm ERR! npm -v 1.1.0-beta-10
npm ERR! code ELIFECYCLE
npm ERR! message [email protected] test: `coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec`
npm ERR! message `sh "-c" "coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec"` failed with 1
npm ERR! 
npm ERR! Additional logging details can be found in:
npm ERR!     /Users/jos/Sites/cheerio/npm-debug.log
npm not ok

val() and option:selected

val() for inputs, textareas, and selects would be handy, rather than having to use these forms:

$('#select option[selected="selected"]').attr('value')
$('#textarea').html()
$('#text_input').attr('value');

In trying to get val(), I also realized that you have to write option[selected="selected"], which is a bit obtuse compared to option:selected, which in jQuery knows all the different ways an option can be marked as selected.
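A rough sketch of what such a val() might do, written against plain node objects rather than cheerio itself. The node shape and the helper names (`textOf`, `isSelected`, `optionValue`, `val`) are assumptions, and the sketch ignores optgroup, multiple selects, and checkboxes.

```javascript
// Concatenate the text of a node and its descendants.
function textOf(node) {
  if (node.type === 'text') return node.data;
  return (node.children || []).map(function (child) {
    return textOf(child);
  }).join('');
}

// An option counts as selected when the attribute is present at all:
// selected, selected="" and selected="selected" are treated alike.
function isSelected(option) {
  return Object.prototype.hasOwnProperty.call(option.attribs || {}, 'selected');
}

// An option's value falls back to its text when no value attribute is set.
function optionValue(option) {
  var attribs = option.attribs || {};
  return attribs.value !== undefined ? attribs.value : textOf(option);
}

function val(node) {
  if (node.name === 'textarea') return textOf(node);
  if (node.name === 'select') {
    var options = (node.children || []).filter(function (child) {
      return child.name === 'option';
    });
    // Assumption: with nothing marked selected, use the first option.
    var chosen = options.filter(isSelected)[0] || options[0];
    return chosen ? optionValue(chosen) : undefined;
  }
  return (node.attribs || {}).value;
}

var select = {
  type: 'tag', name: 'select', attribs: { id: 'select' },
  children: [
    { type: 'tag', name: 'option', attribs: { value: 'a' }, children: [] },
    { type: 'tag', name: 'option', attribs: { value: 'b', selected: '' }, children: [] }
  ]
};

console.log(val(select)); // b
```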

error: Unmatched selector: &#39;s

First off, thanks for a brilliant project. I'm using cheerio 0.8.0:

My code looks like this:

$($(cols.get(3)).html())

The html returned from the $(cols.get(3)).html() looks like this:

Kid&#39;s Ride&nbsp;<br/>Foo&nbsp;<br/>&nbsp;<br/><font color="#000000">Permitted</font><br/>Category - D&nbsp;&nbsp;<br/><br/>\n

When I try to wrap that output back into the outer $() so that I can do more selects on it, this is the stack trace I see:

    at parse (node_modules/cheerio/node_modules/cheerio-select/node_modules/CSSselect/node_modules/CSSwhat/index.js:109:11)\n    
    at parse (node_modules/cheerio/node_modules/cheerio-select/node_modules/CSSselect/index.js:646:18)\n    
    at Function.iterate (node_modules/cheerio/node_modules/cheerio-select/node_modules/CSSselect/index.js:687:42)\n    
    at node_modules/cheerio/node_modules/cheerio-select/lib/select.js:13:20    
    at [object Object].find (node_modules/cheerio/lib/api/traversing.js:7:14)
    at [object Object].init (node_modules/cheerio/lib/cheerio.js:67:44)
    at node_modules/cheerio/lib/cheerio.js:11:12
    at fn (node_modules/cheerio/lib/api/utils.js:246:12)

I suspect it has something to do with the entity (&#39;) in there, as other similar HTML without the single quote in it parses just fine.
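The stack trace shows the string being handed to the selector parser, which suggests it was not recognised as markup, possibly because it starts with text rather than a tag. Assuming that is the deciding rule (the issue does not confirm it), a workaround is to wrap the fragment in an element before passing it back in. Both `looksLikeHtml` and `wrapForParsing` are hypothetical helpers.

```javascript
// Assumed heuristic: a string is treated as markup only when it
// starts with a tag (optionally preceded by whitespace).
function looksLikeHtml(input) {
  return /^\s*<[\s\S]*>/.test(input);
}

// Wrap a fragment so that it always starts with a tag.
function wrapForParsing(markup) {
  return '<div>' + markup + '</div>';
}

var fragment = 'Kid&#39;s Ride&nbsp;<br/>Foo&nbsp;<br/>';

console.log(looksLikeHtml(fragment));                 // false
console.log(looksLikeHtml(wrapForParsing(fragment))); // true
```

With a wrapper in place, the outer call would become something like `$(wrapForParsing(html))`, at the cost of one extra container element in the result.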

How to parse XML with meta, link etc. tags?

I am trying to parse XML with cheerio. The XML contains <link> tags, and the following script returns an empty string in that case:

$('foo').find('link').text()

How can I parse XML with <meta>, <link>, and similar tags?
Thanks in advance.

unescaped attributes (XSS)

jQuery protects from XSS when assigning attributes; cheerio doesn't currently.

function xss() {
  var $ = require('cheerio').load('<a>GitHub</a>');
  $('a').attr('href', 'http://github.com/"><script>alert("XSS!")</script><br');
  return $.html();
}

xss() returns:

<a href = "http://github.com/"><script>alert("XSS!")</script><br">GitHub</a>
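One way to close this hole is to escape attribute values at serialization time. The sketch below is a standalone illustration, not cheerio's implementation; `escapeAttribute` and `renderAttribute` are hypothetical names.

```javascript
// Escape the characters that could end a double-quoted attribute
// value or start new markup. The ampersand must be replaced first.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Render name="value" with no spaces around the equals sign.
function renderAttribute(name, value) {
  return name + '="' + escapeAttribute(value) + '"';
}

var payload = 'http://github.com/"><script>alert("XSS!")</script><br';
console.log('<a ' + renderAttribute('href', payload) + '>GitHub</a>');
```

The payload's quotes and angle brackets end up as entities inside the href value, so no script element is created.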

Cannot handle a full HTML page

I just started using Cheerio and it seems like a great tool. Trying some of the examples on the Cheerio page worked as advertised but when I pass Cheerio a full HTML page, it cannot seem to process it.

This is the HTML I am using:

<!doctype html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
<head>
    <meta charset="utf-8">
    <!-- Browser Compatibility: -->
    <meta http-equiv="X-UA-Compatible" content="IE=9,chrome=1">
    <!-- SEO and social media description: -->
    <meta name="description" content="">
    <!-- Mobile optimization: -->
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <!-- Disable IE6 image menu: -->
    <meta http-equiv="imagetoolbar" content="false">

    <title></title>

    <!-- CSS: -->
    <link rel="stylesheet" href="master.css">

    <!-- (Some) JS: -->
    <!--<script src="modernizr-2.0.6.min.js"></script>-->
</head>

<body>
    <header>
        <div id="logo"></div>
        <nav>

        </nav>
    </header>

    <div id="main">

    </div>

    <footer>
        <div id="small-logo"></div>
        <small>Copyright © <span class="year"></span> Moo cows. All rights reserved.</small>
    </footer>


    <!-- (Rest of) JS: -->

    <!-- Grab Google CDN's jQuery, with a protocol relative URL; fall back to local if offline -->
    <!--<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
    <script>window.jQuery || document.write('<script src="js/libs/jquery-1.7.1.min.js"><\/script>')</script>-->

    <!-- IE6 Chrome Frame install prompt -->
    <!--[if lt IE 7 ]>
    <script src="//ajax.googleapis.com/ajax/libs/chrome-frame/1.0.3/CFInstall.min.js"></script>
    <script>window.attachEvent('onload',function(){CFInstall.check({mode:'overlay'})})</script>
    <![endif]-->
</body>
</html>

If I try $('title').html('moo') I get "TypeError: Cannot read property '0' of undefined". If I try $('#logo').after('moo') I get "TypeError: Cannot call method 'find' of undefined". When I try $('.year').html('moo') I don't get any error but the change isn't made either. If I don't perform any manipulations and just output $.html(), no error is thrown. (However, it does bork my conditional statements—should I open a separate Issue?)

I don't know if this is in any way related to Issue #12 but the error messages are similar. I tried to narrow down the problem by stripping out some of the script tags, conditional statements, etc from the HTML but that didn't work either.

I am using the latest version of Node.js (0.6.4).

<br> are getting children...

Not sure if this is an issue for here, htmlparser2, or I am just clueless, but whenever I pull in a doc with cheerio and there is code like

<br> <span> blah</span> <br>

that first br ends up with 2 children: the span and the second br.

I thought this wasn't supposed to happen for nodes like br, which can't have children, when the xmlMode option is false. Or am I just confused?

UG

Traversing ignores <script> & <style> tags

Cheerio (& jQuery) traversing ignores elements that aren't tags. Node-htmlparser treats script and style tags differently (doesn't parse inner content) so it has different types - elem.type is "script" and "style" instead of "tag".

Need to include those tags in the traversing.
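A minimal sketch of the proposed check, assuming nodes carry a `type` of "tag", "script", "style", or "text" as described above. `isElement` and `elementChildren` are hypothetical helper names.

```javascript
// Node types that should all count as elements during traversal.
var ELEMENT_TYPES = { tag: true, script: true, style: true };

function isElement(node) {
  return ELEMENT_TYPES[node.type] === true;
}

// Children of a node, skipping text and other non-element nodes.
function elementChildren(node) {
  return (node.children || []).filter(function (child) {
    return isElement(child);
  });
}

var head = {
  type: 'tag', name: 'head',
  children: [
    { type: 'tag', name: 'title', children: [] },
    { type: 'text', data: '\n' },
    { type: 'script', name: 'script', children: [] },
    { type: 'style', name: 'style', children: [] }
  ]
};

var names = elementChildren(head).map(function (node) { return node.name; });
console.log(names.join(', ')); // title, script, style
```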

Too many differences from jQuery.

Even though it is faster than the jsdom + jQuery implementation, there are so many familiar selectors that cannot be used with Cheerio. For instance, I can't pass HTML code into the function, and I can't use eq() with the selector. The only advantage of your implementation is the performance gain.

Add css()

Already missing them from jQuery...

Rewrite tests in js

Not really a priority, but it would be nice. This way we can drop coffee as a dep completely.

Whitespace not being rendered

Any whitespace between tags is NOT being rendered with .html()

For example: Link HyperLink

The output of that when rendered with .html() is: LinkHyperLink

It removes the space between the 2 tags. This only happens when 2 tags come one after the other. To get around it I have to do the following to the HTML string: .replace('> <','>&nbsp;<');

.text() not working for empty elements

There seems to be a problem with setting text into an empty div like so:

var cheerio = require('cheerio');
var $ = cheerio.load('<div id="one"></div>');
$('#one').text('Hello there!');
$.html();

The problem appears to be that line 118 of api/manipulation.coffee checks that this.children exists before inserting the element.

I'm not sure if this is an issue in cheerio or an issue with htmlparser leaving the children property unassigned, but I thought you might want to know.

cheers,
simon

whitespace nodes

The latest version is removing all my nodes only containing whitespace. It's introducing subtle rendering bugs where spaces are missing e.g.

<div><span class="firstname">Jos</span> <span class="lastname">Shepherd</span></div>

is getting output as:

<div><span class="firstname">Jos</span><span class="lastname">Shepherd</span></div>

Is the whitespace stripping by design or is it a bug?

version export keeps opening package.json on cheerio.load

I am running into the following error when I call cheerio.load many times in quick succession:

Error: EMFILE, too many open files '/Users/ryanshaw/Code/noflo/node_modules/cheerio/package.json'
    at Object.openSync (fs.js:238:18)
    at Object.readFileSync (fs.js:128:15)
    at Function.version (/Users/ryanshaw/Code/noflo/node_modules/cheerio/index.js:14:27)
    at /Users/ryanshaw/Code/noflo/node_modules/cheerio/node_modules/underscore/underscore.js:638:27
    at Array.forEach (native)
    at /Users/ryanshaw/Code/noflo/node_modules/cheerio/node_modules/underscore/underscore.js:76:11
    at Function.extend (/Users/ryanshaw/Code/noflo/node_modules/cheerio/node_modules/underscore/underscore.js:636:5)
    at [object Object].extend (/Users/ryanshaw/Code/noflo/node_modules/cheerio/node_modules/underscore/underscore.js:961:26)
    at Function.load (/Users/ryanshaw/Code/noflo/node_modules/cheerio/lib/api/utils.js:243:16)
    at ScrapeHtml.scrapeHtml (/Users/ryanshaw/Code/noflo/components/ScrapeHTML.js:96:19)

Commenting out the following lines in index.js solves the problem:

var version = function() {
  var pkg = require('fs').readFileSync(__dirname + '/package.json', 'utf8');
  return JSON.parse(pkg).version;
};

exports.__defineGetter__('version', version);
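Rather than removing the getter, the value could be computed once and reused. The sketch below shows that idea with a generic memoizing helper; `once`, `readVersion`, and the inline JSON string are stand-ins for illustration, where the real getter would read package.json.

```javascript
// Wrap a loader so that it runs at most once; later calls
// return the cached result.
function once(load) {
  var loaded = false;
  var value;
  return function () {
    if (!loaded) {
      value = load();
      loaded = true;
    }
    return value;
  };
}

var reads = 0;

// Stand-in for the file read: counts how often it actually runs.
var readVersion = once(function () {
  reads += 1;
  return JSON.parse('{"version": "0.0.0-example"}').version;
});

var api = {};
Object.defineProperty(api, 'version', { get: readVersion, enumerable: true });

api.version;
api.version;
console.log(api.version, reads); // 0.0.0-example 1
```

In Node, loading the file with `require` instead of `readFileSync` is another option worth considering, since `require` can load JSON files and caches what it loads.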

Cheerio versions > 0.3.0 fail when trying to get access to selectors nested two or more levels deep.

Hi Matthew,

First off, really like Cheerio. Great work! Fast and easy to use. We are using it heavily and have been able to develop quickly with it.

However, since 0.3.0 we are no longer able to get to nested divs that are two or more levels deep and render additional data in those elements. Here is an example.

I have a template that looks like the following:

https://gist.github.com/1500031

When I try to do something like the following it used to work and now it fails.

template = get('games.html');
$ = gamefly.cheerio.load(template);
$('.ttl').text("Call of Duty");

This appears to fail starting with 0.3.1. Any ideas?

Thanks
Christian

doesn't install

(mbp) ~ $ cd ~/scratch/
(mbp) ~/scratch $ mkdir cheerio-install
(mbp) ~/scratch $ cd cheerio-install/
(mbp) ~/scratch/cheerio-install $ npm install cheerio
npm ERR! error installing [email protected] Error: No compatible version found: entities@'>=1.0.0- <2.0.0-'
npm ERR! error installing [email protected] Valid install targets:
npm ERR! error installing [email protected] ["0.1.0","0.1.1"]
npm ERR! error installing [email protected]     at installTargetsError (/usr/local/lib/node_modules/npm/lib/cache.js:424:10)
npm ERR! error installing [email protected]     at /usr/local/lib/node_modules/npm/lib/cache.js:406:17
npm ERR! error installing [email protected]     at saved (/usr/local/lib/node_modules/npm/lib/utils/npm-registry-client/get.js:136:7)
npm ERR! error installing [email protected]     at Object.cb [as oncomplete] (/usr/local/lib/node_modules/npm/node_modules/graceful-fs/graceful-fs.js:36:9)
npm ERR! Error: No compatible version found: entities@'>=1.0.0- <2.0.0-'
npm ERR! Valid install targets:
npm ERR! ["0.1.0","0.1.1"]
npm ERR!     at installTargetsError (/usr/local/lib/node_modules/npm/lib/cache.js:424:10)
npm ERR!     at /usr/local/lib/node_modules/npm/lib/cache.js:406:17
npm ERR!     at saved (/usr/local/lib/node_modules/npm/lib/utils/npm-registry-client/get.js:136:7)
npm ERR!     at Object.cb [as oncomplete] (/usr/local/lib/node_modules/npm/node_modules/graceful-fs/graceful-fs.js:36:9)
npm ERR! Report this *entire* log at:
npm ERR!     <http://github.com/isaacs/npm/issues>
npm ERR! or email it to:
npm ERR!     <[email protected]>
npm ERR! 
npm ERR! System Darwin 11.4.0
npm ERR! command "node" "/usr/local/bin/npm" "install" "cheerio"
npm ERR! cwd /Users/bat/scratch/cheerio-install
npm ERR! node -v v0.6.19
npm ERR! npm -v 1.0.106
npm ERR! 
npm ERR! Additional logging details can be found in:
npm ERR!     /Users/bat/scratch/cheerio-install/npm-debug.log
npm not ok
(mbp) ~/scratch/cheerio-install $

people are apparently using this; how about running npm publish?

.text() should decode HTML entities

jQuery .text() decodes HTML entities:

> $('<p>M&amp;M</p>').text()
"M&M"

cheerio's does not:

> cheerio.load("<p>M&amp;M</p>")("p").text()
'M&amp;M'
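For illustration, a small decoder that handles numeric references and a handful of named ones is sketched below. It is not what cheerio or jQuery use internally; `decodeEntities` and the `NAMED` table are hypothetical, and a complete decoder would need the full HTML entity table.

```javascript
// A deliberately tiny table of named entities.
var NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' };

// Decode &name;, &#NNN; and &#xHHH; references in a single pass.
// Unknown or out-of-range references are left untouched.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1).toLowerCase() === 'x';
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

console.log(decodeEntities('M&amp;M'));        // M&M
console.log(decodeEntities('Kid&#39;s Ride')); // Kid's Ride
```

Because the replacement runs in one pass, a doubly escaped input such as `&amp;lt;` decodes to `&lt;` and not to `<`.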

is there any reason why "node" : ">=0.4.11" in package.json?

Hi Matthew,

I ran the test suite on node v0.4.7 and it passes. Is there any reason why cheerio restricts usage to node >= v0.4.11?

I will need cheerio with this version of node. Can you modify the version of node in package.json?

Cannot parse text alone

htmlparser 2 will return nothing if you give it something like "hello world"

Therefore, if you run cheerio.load('hello world'), it will return the cryptic error: ".find is undefined"

Full text should be treated as a single text node.
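The suggested behaviour could be sketched as a fallback applied to whatever the parser returns. `normalizeParseResult` is a hypothetical helper, and the text-node shape (`type` and `data`) is an assumption.

```javascript
// If the parser produced nothing for a non-empty input, treat the
// whole input as a single text node; otherwise pass the result through.
function normalizeParseResult(source, nodes) {
  if ((!nodes || nodes.length === 0) && source.length > 0) {
    return [{ type: 'text', data: source }];
  }
  return nodes || [];
}

var result = normalizeParseResult('hello world', undefined);
console.log(result.length, result[0].type, result[0].data); // 1 text hello world
```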
