gahabeen / jsonframe-cheerio
simple multi-level scraper json input/output for Cheerio
License: MIT License
How can I get just "This is some text" and not "This is some textFirst span textSecond span text"?
<li id="listItem">
This is some text
<span id="firstSpan">First span text</span>
<span id="secondSpan">Second span text</span>
</li>
Example:
let cheerio = require('cheerio');
let $ = cheerio.load(`
<li id="listItem">
This is some text
<span id="firstSpan">First span text</span>
<span id="secondSpan">Second span text</span>
</li>`)
let jsonframe = require('jsonframe-cheerio')
jsonframe($)
let frame = {"text": "li#listItem"}
console.log( $('body').scrape(frame, { string: true } ))
// {
// "text": "This is some text First span text Second span text"
// }
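One possible approach, sketched here without jsonframe: keep only the element's direct text-node children and skip the nested spans. The plain objects below stand in for the nodes Cheerio's parser produces (text nodes carry `type: 'text'` and `data`), and `ownText` is a hypothetical helper name, not part of jsonframe-cheerio or Cheerio.

```javascript
// Hypothetical helper: concatenate only the direct text-node children of a node.
// Assumes the node shape used by Cheerio's parser: text nodes look like
// { type: 'text', data: '...' } and elements like { type: 'tag', name, children }.
function ownText(node) {
  return (node.children || [])
    .filter((child) => child.type === 'text') // drop <span> and other elements
    .map((child) => child.data)
    .join('')
    .replace(/\s+/g, ' ') // collapse the newlines left between the spans
    .trim();
}

// Stand-in for the parsed <li id="listItem"> from the question.
const listItem = {
  type: 'tag',
  name: 'li',
  children: [
    { type: 'text', data: '\nThis is some text\n' },
    { type: 'tag', name: 'span', children: [{ type: 'text', data: 'First span text' }] },
    { type: 'text', data: '\n' },
    { type: 'tag', name: 'span', children: [{ type: 'text', data: 'Second span text' }] },
    { type: 'text', data: '\n' },
  ],
};

console.log(ownText(listItem)); // "This is some text"
```

With a real Cheerio instance, the same helper should work on the underlying node, for example `ownText($('#listItem')[0])`.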
An error is thrown when jsonframe-cheerio is run in the browser with the latest version of jQuery (v3.3.1): Uncaught TypeError: Cannot read property 'toLowerCase' of undefined.
The solution I found was to wrap this line in a try/catch block:
if (!res.extractor && !res.attribute && $(node).find(res.selector)['0'] && $(node).find(res.selector)['0'].name.toLowerCase() === 'img') {
res.attribute = 'src';
}
Specifically, $(node).find(res.selector)['0'] was defined, but the property name in $(node).find(res.selector)['0'].name was not defined, which caused the error.
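As an alternative to try/catch, the check could guard against a missing name before calling toLowerCase. This is only a sketch: it assumes the matched element exposes its tag either as `tagName` (browser DOM elements under jQuery) or as `name` (Cheerio nodes), and `tagNameOf` and `isImage` are hypothetical helpers, not part of jsonframe-cheerio.

```javascript
// Hypothetical helper: read an element's tag name without throwing.
// tagName is checked first because, in the browser, `name` on a form
// control is the form field name rather than the tag.
function tagNameOf(el) {
  if (!el) return '';
  const raw = el.tagName || el.name || '';
  return typeof raw === 'string' ? raw.toLowerCase() : '';
}

// Hypothetical helper: true only when the element is an <img>.
function isImage(el) {
  return tagNameOf(el) === 'img';
}

console.log(isImage({ name: 'img' }));    // Cheerio-style node: true
console.log(isImage({ tagName: 'IMG' })); // DOM-style element: true
console.log(isImage({}));                 // no name at all: false, no TypeError
console.log(isImage(undefined));          // nothing matched: false
```

The original condition could then set `res.attribute = 'src'` when `isImage($(node).find(res.selector)['0'])` is true, which avoids swallowing unrelated errors the way a broad try/catch might.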
I also went through all the samples to make sure they work. The fix in #2 works for that case. In addition, these changes needed to be made:
https://github.com/gahabeen/jsonframe-cheerio#extractor
- "email": "[itemprop=email] < phone",
+ "email": "[itemprop=email] < mail",
https://github.com/gahabeen/jsonframe-cheerio#filter
- "email1": "[itemprop=email] < phone | uppercase",
- "email2": "[itemprop=email] < phone | capitalize"
+ "email1": "[itemprop=email] < mail | uppercase",
+ "email2": "[itemprop=email] < mail | capitalize"
Lastly, the timestats option was inoperable. The timestats variable was not passed in the third object argument in getDataFromNodes(…), and was defaulted to false.
gTime, while defined at the beginning, was undefined when it was used:
if (result['_value']) {
result['_timestat'] = timeSpent(gTime); // gTime = undefined
}
I did not find a fix for timestats.
I am having issues retrieving a list from a set of <a> tags: I get an array of empty objects.
Here is the result in a log:
episodes: { episodes: [ {}, {} ] }
Here is the code to get the list:
let $ = cheerio.load(
'<div class="row"> <div class="col s12 m12 l8 content-left"> <div class="content-list z-depth-1"> <h5> CATEGORY: King of Mask Singer <div class="sort input-field" data-link="http://kshowonline.com/category/141/king-of-mask-singer/1" data-sort="1"> <div class="select-wrapper"><span class="caret">▼</span><input type="text" class="select-dropdown" readonly="true" data-activates="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" value="Date added (newest)" data-cip-id="cIPJQ342845639"><ul id="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" class="dropdown-content select-dropdown "><li class=""><span>Date added (newest)</span></li><li class=""><span>Date added (oldest)</span></li><li class=""><span>Name (A-Z)</span></li><li class=""><span>Name (Z-A)</span></li></ul><select class="initialized"> <option value="1" selected="selected">Date added (newest)</option> <option value="2">Date added (oldest)</option> <option value="3">Name (A-Z)</option> <option value="4">Name (Z-A)</option> </select></div></div></h5> <a href="http://kshowonline.com/kshow/8052-[engsub]-king-of-mask-singer-ep.139" title="King of Mask Singer Ep.139"> <div class="thumbnail"> <div class="video-container center-align"> <div class="img-cover"> <img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.139"> </div></div><div class="caption"> King of Mask Singer Ep.139 </div></div></a> <a href="http://kshowonline.com/kshow/8021-[engsub]-king-of-mask-singer-ep.138" title="King of Mask Singer Ep.138"> <div class="thumbnail"> <div class="video-container center-align"> <div class="img-cover"> <img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.138"> </div></div><div class="caption"> King of Mask Singer Ep.138 </div></div></a> </div></div></div>'
);
let frame = {
episodes: {
_s: ".content-list.z-depth-1 a",
_d: [
{
url: "a @ href",
title: "a @ title"
}
]
}
};
jsonframe($);
let result = $(".col.s12.m12.l8.content-left").scrape(frame, {
string: true
});
console.log("episode results: ", result);
The same HTML, formatted for readability:
<div class="row">
<div class="col s12 m12 l8 content-left">
<div class="content-list z-depth-1">
<h5> CATEGORY: King of Mask Singer
<div class="sort input-field" data-link="http://kshowonline.com/category/141/king-of-mask-singer/1" data-sort="1">
<div class="select-wrapper">
<span class="caret">▼</span>
<input type="text" class="select-dropdown" readonly="true" data-activates="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" value="Date added (newest)" data-cip-id="cIPJQ342845639">
<ul id="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" class="dropdown-content select-dropdown ">
<li class=""><span>Date added (newest)</span></li>
<li class=""><span>Date added (oldest)</span></li>
<li class=""><span>Name (A-Z)</span></li>
<li class=""><span>Name (Z-A)</span></li>
</ul>
<select class="initialized">
<option value="1" selected="selected">Date added (newest)</option>
<option value="2">Date added (oldest)</option>
<option value="3">Name (A-Z)</option>
<option value="4">Name (Z-A)</option>
</select>
</div>
</div>
</h5>
<a href="http://kshowonline.com/kshow/8052-[engsub]-king-of-mask-singer-ep.139" title="King of Mask Singer Ep.139">
<div class="thumbnail">
<div class="video-container center-align">
<div class="img-cover">
<img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.139">
</div>
</div>
<div class="caption"> King of Mask Singer Ep.139 </div>
</div>
</a>
<a href="http://kshowonline.com/kshow/8021-[engsub]-king-of-mask-singer-ep.138" title="King of Mask Singer Ep.138">
<div class="thumbnail">
<div class="video-container center-align">
<div class="img-cover">
<img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.138">
</div>
</div>
<div class="caption"> King of Mask Singer Ep.138 </div>
</div>
</a>
</div>
</div>
</div>
In this section in the docs, the example for "link": "a @href" should have a space like the other examples: "link": "a @ href". It seems small, but it screwed me up for a few minutes.
Best,
Ryan
Hi,
Jsonframe is a really good idea, thank you for sharing it.
I'm currently trying to scrape some content but I'm facing a problem: I can't manage to get an attribute of an item together with its inner data.
Sample html:
<div class="parents-container">
<div class="parent" data-foo="a">
<span class="child">...</span>
<span class="child">...</span>
...
</div>
<div class="parent" data-foo="b">
<span class="child">...</span>
<span class="child">...</span>
...
</div>
<div class="parent" data-foo="c">
<span class="child">...</span>
<span class="child">...</span>
...
</div>
</div>
This is the model I would use for this (I did not test it):
var data = {
parents: {
_s: ".parent",
_d: [{
foo: ?????? //how to get data-foo attribute value of the current ".parent" element?
children: {
_s: ".child",
_d: [{ ... }] //child data
}
}]
}
}
This should return an object with a property named "parents" that holds an array of objects. Each object in the array should represent a parent item.
Like this:
{ parents: [{ foo: "...", children: [{ child data }, { child data }, ... ] }, { foo: "...", children: [{ ... }] }, { foo: "...", children: [{ ... }] }, ... ] }
How should I write my model in order to include the data-foo attribute?
Thanks
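Whatever the right frame syntax turns out to be (jsonframe's own way of addressing the current element is not confirmed here), the target shape can also be built by walking the parsed nodes directly. The sketch below is a fallback, not jsonframe API: the plain objects stand in for Cheerio nodes (which keep attributes in `attribs`), the helper names are hypothetical, and since the child data is elided in the question, `{ text }` is used as a placeholder for it.

```javascript
// Hypothetical helpers operating on Cheerio-style nodes:
// elements look like { type: 'tag', name, attribs, children }.
function hasClass(node, cls) {
  const classes = ((node.attribs && node.attribs.class) || '').split(/\s+/);
  return classes.includes(cls);
}

function childElements(node, cls) {
  return (node.children || []).filter((c) => c.type === 'tag' && hasClass(c, cls));
}

function textOf(node) {
  if (node.type === 'text') return node.data;
  return (node.children || []).map(textOf).join('');
}

// Build { parents: [{ foo, children: [...] }] }, reading data-foo from each
// ".parent" element itself and collecting its ".child" elements.
function scrapeParents(container) {
  return {
    parents: childElements(container, 'parent').map((parent) => ({
      foo: parent.attribs['data-foo'],
      // Placeholder child data: only the trimmed text of each child.
      children: childElements(parent, 'child').map((child) => ({
        text: textOf(child).trim(),
      })),
    })),
  };
}

// Stand-in tree for the sample HTML (two parents shown for brevity).
const span = (text) => ({
  type: 'tag', name: 'span', attribs: { class: 'child' },
  children: [{ type: 'text', data: text }],
});
const container = {
  type: 'tag', name: 'div', attribs: { class: 'parents-container' },
  children: [
    { type: 'tag', name: 'div', attribs: { class: 'parent', 'data-foo': 'a' }, children: [span('one'), span('two')] },
    { type: 'tag', name: 'div', attribs: { class: 'parent', 'data-foo': 'b' }, children: [span('three')] },
  ],
};

console.log(JSON.stringify(scrapeParents(container)));
```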
Please enrich the documentation. There are many filters, extractors and other features that are in the changelog but not in the documentation sections, so we have to read through the whole changelog to find amazing filters like between(string&&string) and others.
I could help with the docs if you'd like.
Thanks for this amazing repo