gahabeen / jsonframe-cheerio

simple multi-level scraper json input/output for Cheerio

License: MIT License

HTML 8.34% JavaScript 91.66%
selector parsed-data frame json scraping scraper

jsonframe-cheerio's People

Contributors

gahabeen, moeahmed

jsonframe-cheerio's Issues

How to get text without nested children's texts

How can I get just "This is some text", and not "This is some textFirst span textSecond span text"?

<li id="listItem">
    This is some text
    <span id="firstSpan">First span text</span>
    <span id="secondSpan">Second span text</span>
</li>

Example:

let cheerio = require('cheerio');
let $ = cheerio.load(`
<li id="listItem">
    This is some text
    <span id="firstSpan">First span text</span>
    <span id="secondSpan">Second span text</span>
</li>`)

let jsonframe = require('jsonframe-cheerio')
jsonframe($)

let frame = {"text": "li#listItem"}
console.log( $('body').scrape(frame, { string: true } ))
// {
//   "text": "This is some text First span text Second span text"
// }
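
One common approach, outside jsonframe itself, is to read only the element's direct text nodes and skip anything nested in child elements. The sketch below assumes the node objects cheerio exposes (text nodes with type "text" and a data property, elements with a children array). ownText is a hypothetical helper name, and the hand-built listItem object is only a stand-in for $('#listItem')[0] so the example runs without any dependency.

```javascript
// Hypothetical helper: concatenate only the direct text-node children of an
// element, ignoring text inside nested elements such as <span>.
// Assumes cheerio-style nodes: text nodes have type "text" and carry their
// content in `data`; elements list their children in `children`.
function ownText(el) {
  return (el.children || [])
    .filter((child) => child.type === 'text')
    .map((child) => child.data)
    .join('')
    .replace(/\s+/g, ' ') // collapse the indentation and newlines between nodes
    .trim();
}

// Hand-built stand-in for the <li id="listItem"> node from the question.
// With cheerio loaded, the call would be ownText($('#listItem')[0]).
const listItem = {
  type: 'tag',
  name: 'li',
  children: [
    { type: 'text', data: '\n    This is some text\n    ' },
    { type: 'tag', name: 'span', children: [{ type: 'text', data: 'First span text' }] },
    { type: 'text', data: '\n    ' },
    { type: 'tag', name: 'span', children: [{ type: 'text', data: 'Second span text' }] },
    { type: 'text', data: '\n' },
  ],
};

console.log(ownText(listItem)); // This is some text
```

This works on the raw nodes rather than through a frame, so it would have to run alongside scrape() rather than inside it.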

Compatibility with browser and jQuery

An error is thrown when jsonframe-cheerio is run in the browser with the latest version of jQuery (v3.3.1): Uncaught TypeError: Cannot read property 'toLowerCase' of undefined.

The solution I found was to wrap this line in a try/catch block:

  if (!res.extractor && !res.attribute && $(node).find(res.selector)['0'] && $(node).find(res.selector)['0'].name.toLowerCase() === 'img') {
    res.attribute = 'src';
  }

Specifically, $(node).find(res.selector)['0'] was defined, but its name property was not, so calling toLowerCase() on it caused the error.
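
As an alternative to the try/catch, the tag name could be read defensively. This is a minimal sketch, not the library's code: tagNameOf is a hypothetical helper, and it assumes that browser DOM elements expose nodeName/tagName while cheerio nodes expose name. It has not been tested against jsonframe-cheerio itself.

```javascript
// Hypothetical helper: return a lower-cased tag name for either a browser DOM
// element (nodeName/tagName) or a cheerio node (name).
// nodeName is checked first because, in the browser, `name` on some elements
// reflects the name="" attribute rather than the tag.
function tagNameOf(el) {
  if (!el) return '';
  const raw = el.nodeName || el.tagName || el.name || '';
  return typeof raw === 'string' ? raw.toLowerCase() : '';
}

// Stand-ins for the two kinds of element object.
const cheerioImg = { type: 'tag', name: 'img' };
const browserImg = { nodeName: 'IMG', tagName: 'IMG' };

console.log(tagNameOf(cheerioImg)); // img
console.log(tagNameOf(browserImg)); // img
console.log(tagNameOf(undefined) === ''); // true
```

The condition quoted above could then compare tagNameOf($(node).find(res.selector)['0']) with 'img' without ever calling toLowerCase() on an undefined value.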

I also went through all the samples to check that they work. The fix in #2 works for that case. In addition, these changes needed to be made:

https://github.com/gahabeen/jsonframe-cheerio#extractor
- "email": "[itemprop=email] < phone",
+ "email": "[itemprop=email] < mail",

https://github.com/gahabeen/jsonframe-cheerio#filter
-	"email1": "[itemprop=email] < phone | uppercase",
-	"email2": "[itemprop=email] < phone | capitalize"
+	"email1": "[itemprop=email] < mail | uppercase",
+	"email2": "[itemprop=email] < mail | capitalize"

Lastly, the timestats option did not work. The timestats variable was not passed in the third (object) argument of getDataFromNodes(…), so it defaulted to false.
gTime, although defined at the beginning, was undefined by the time it was used:

    if (result['_value']) {
      result['_timestat'] = timeSpent(gTime); // gTime = undefined
    }

I did not find a fix for timestats.

Issues retrieving list

I am having trouble retrieving a list from a set of a tags: I get an array of empty objects.

  • Here is the result in a log:
    episodes: { episodes: [ {}, {} ] }

  • Here is the code to get the list:

          let $ = cheerio.load(
            '<div class="row"> <div class="col s12 m12 l8 content-left"> <div class="content-list z-depth-1"> <h5> CATEGORY: King of Mask Singer <div class="sort input-field" data-link="http://kshowonline.com/category/141/king-of-mask-singer/1" data-sort="1"> <div class="select-wrapper"><span class="caret">▼</span><input type="text" class="select-dropdown" readonly="true" data-activates="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" value="Date added (newest)" data-cip-id="cIPJQ342845639"><ul id="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" class="dropdown-content select-dropdown "><li class=""><span>Date added (newest)</span></li><li class=""><span>Date added (oldest)</span></li><li class=""><span>Name (A-Z)</span></li><li class=""><span>Name (Z-A)</span></li></ul><select class="initialized"> <option value="1" selected="selected">Date added (newest)</option> <option value="2">Date added (oldest)</option> <option value="3">Name (A-Z)</option> <option value="4">Name (Z-A)</option> </select></div></div></h5> <a href="http://kshowonline.com/kshow/8052-[engsub]-king-of-mask-singer-ep.139" title="King of Mask Singer Ep.139"> <div class="thumbnail"> <div class="video-container center-align"> <div class="img-cover"> <img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.139"> </div></div><div class="caption"> King of Mask Singer Ep.139 </div></div></a> <a href="http://kshowonline.com/kshow/8021-[engsub]-king-of-mask-singer-ep.138" title="King of Mask Singer Ep.138"> <div class="thumbnail"> <div class="video-container center-align"> <div class="img-cover"> <img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.138"> </div></div><div class="caption"> King of Mask Singer Ep.138 </div></div></a> </div></div></div>'
          );
          let frame = {
            episodes: {
              _s: ".content-list.z-depth-1 a",
              _d: [
                {
                  url: "a @ href",
                  title: "a @ title"
                }
              ]
            }
          };
          jsonframe($);

          let result = $(".col.s12.m12.l8.content-left").scrape(frame, {
            string: true
          });
          console.log("episode results: ", result);
  • Here is the formatted sample html:
<div class="row">
	<div class="col s12 m12 l8 content-left">
    	<div class="content-list z-depth-1">
        	<h5> CATEGORY: King of Mask Singer 
            	<div class="sort input-field" data-link="http://kshowonline.com/category/141/king-of-mask-singer/1" data-sort="1">
                	<div class="select-wrapper">
                    	<span class="caret"></span>
                        <input type="text" class="select-dropdown" readonly="true" data-activates="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" value="Date added (newest)" data-cip-id="cIPJQ342845639">
                        <ul id="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" class="dropdown-content select-dropdown ">
                          <li class=""><span>Date added (newest)</span></li>
                          <li class=""><span>Date added (oldest)</span></li>
                          <li class=""><span>Name (A-Z)</span></li>
                          <li class=""><span>Name (Z-A)</span></li>
                        </ul>
                        <select class="initialized">
                          <option value="1" selected="selected">Date added (newest)</option>
                          <option value="2">Date added (oldest)</option>
                          <option value="3">Name (A-Z)</option>
                          <option value="4">Name (Z-A)</option>
                        </select>
                    </div>
                </div>
            </h5>
           	<a href="http://kshowonline.com/kshow/8052-[engsub]-king-of-mask-singer-ep.139" title="King of Mask Singer Ep.139"> 
            	<div class="thumbnail">
                	<div class="video-container center-align">
                    	<div class="img-cover">
                        	<img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.139">
                        </div>
                    </div>
                 	<div class="caption"> King of Mask Singer Ep.139 </div>
                </div>
            </a>
            <a href="http://kshowonline.com/kshow/8021-[engsub]-king-of-mask-singer-ep.138" title="King of Mask Singer Ep.138">
            	<div class="thumbnail">
                	<div class="video-container center-align">
                    	<div class="img-cover">
                        	<img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.138">
                        </div>
                    </div>
                    <div class="caption"> King of Mask Singer Ep.138 </div>
                </div>
            </a>
        </div>
    </div>
</div>
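
A likely explanation, judging from the $(node).find(res.selector) call quoted in the jQuery compatibility issue, is that the selectors inside _d are resolved with .find() relative to each element matched by _s. In cheerio and jQuery, .find() only searches descendants, so looking for "a" inside an a element that contains no further links matches nothing and each entry stays empty. The sketch below models that descendant-only behaviour on a simplified, hand-built node; findDescendants is a hypothetical stand-in for .find() with a bare tag selector.

```javascript
// Minimal model of a descendant-only search: the starting node itself is
// never a candidate, only the nodes below it.
function findDescendants(node, tagName) {
  const matches = [];
  for (const child of node.children || []) {
    if (child.type === 'tag' && child.name === tagName) matches.push(child);
    matches.push(...findDescendants(child, tagName));
  }
  return matches;
}

// Simplified stand-in for one episode link from the sample HTML.
const episodeLink = {
  type: 'tag',
  name: 'a',
  attribs: {
    href: 'http://kshowonline.com/kshow/8052-[engsub]-king-of-mask-singer-ep.139',
    title: 'King of Mask Singer Ep.139',
  },
  children: [
    {
      type: 'tag',
      name: 'div',
      attribs: { class: 'thumbnail' },
      children: [
        { type: 'tag', name: 'img', attribs: { alt: 'King of Mask Singer Ep.139' }, children: [] },
      ],
    },
  ],
};

console.log(findDescendants(episodeLink, 'a').length);   // 0
console.log(findDescendants(episodeLink, 'img').length); // 1
```

If that reading is right, the selectors inside _d would need to address the matched a element itself rather than a descendant; how jsonframe spells that is not confirmed here.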

Code typo in docs

In this section in the docs, the example for "link": "a @href" should have a space like the other examples: "link": "a @ href". It seems small, but it screwed me up for a few minutes.

Best,
Ryan

How to get an attribute of a parent item and fetch its children content data too.

Hi,
Jsonframe is really a good idea, thank you for sharing it.
I'm currently trying to scrape some content but I'm facing a problem: I can't manage to get an attribute of an item together with its inner data.
Sample HTML:

<div class="parents-container">
    <div class="parent" data-foo="a">
        <span class="child">...</span>
        <span class="child">...</span>
        ...
    </div>
    <div class="parent" data-foo="b">
        <span class="child">...</span>
        <span class="child">...</span>
         ...
    </div>
    <div class="parent" data-foo="c">
         <span class="child">...</span>
         <span class="child">...</span>
          ...
   </div>
</div>

This is the model I would use for this (I did not test it):

var data = {
     parents: {
        _s: ".parent",
        _d: [{
               foo: ?????? //how to get data-foo attribute value of the current ".parent" element?
               children: {
                   _s: ".child",
                   _d: [{ ... }] //child data
               }
              }]
     }
}

This should return an object with a property named "parents" that holds an array of objects. Each object in the array should represent a parent item.
Like this:

{ parents: [ { foo: "...", children: [ { child data }, { child data }, ... ] }, { foo: "...", children: [ { ... } ] }, { foo: "", children: [ {} ] }, ... ] }

How should I write my model in order to include the data-foo attribute?

Thanks
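
Until the frame syntax for this is clear, the desired shape can be built by hand from the nodes cheerio exposes. The sketch below is not a jsonframe model: parentsToJson, hasClass and textOf are hypothetical helpers, the nodes are assumed to carry attribs and children as cheerio elements do, and the child texts ("a1", "b1", ...) are invented placeholders for the elided child content.

```javascript
// Hypothetical helpers working on cheerio-style nodes.
function hasClass(el, cls) {
  const classes = ((el.attribs && el.attribs.class) || '').split(/\s+/);
  return classes.includes(cls);
}

function textOf(node) {
  if (node.type === 'text') return node.data;
  return (node.children || []).map(textOf).join('');
}

// Build { parents: [{ foo, children: [...] }] } from the container element:
// the attribute is read from the parent node itself, the children from below it.
function parentsToJson(container) {
  const parents = (container.children || [])
    .filter((el) => el.type === 'tag' && hasClass(el, 'parent'))
    .map((parent) => ({
      foo: parent.attribs['data-foo'],
      children: (parent.children || [])
        .filter((el) => el.type === 'tag' && hasClass(el, 'child'))
        .map((child) => ({ text: textOf(child).trim() })),
    }));
  return { parents };
}

// Hand-built stand-in for $('.parents-container')[0]; child texts are placeholders.
const span = (text) => ({
  type: 'tag', name: 'span', attribs: { class: 'child' },
  children: [{ type: 'text', data: text }],
});
const div = (foo, kids) => ({
  type: 'tag', name: 'div', attribs: { class: 'parent', 'data-foo': foo }, children: kids,
});
const container = {
  type: 'tag', name: 'div', attribs: { class: 'parents-container' },
  children: [div('a', [span('a1'), span('a2')]), div('b', [span('b1')])],
};

console.log(JSON.stringify(parentsToJson(container)));
// {"parents":[{"foo":"a","children":[{"text":"a1"},{"text":"a2"}]},{"foo":"b","children":[{"text":"b1"}]}]}
```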

Pls Enrich docs

Please enrich the documentation: there are many filters, extractors and other features that are in the changelog but not in the documentation sections. We have to read through the whole changelog to find amazing filters like between(string&&string) and others.
I could help with the docs if you want.

Thanks for this amazing repo
