Comments (11)
@saltyshiomix
I have tried out the new version of the nest-crawler package and it works exactly as I expected.
I highly appreciate your work. Thank you sooo much! :)
from nest-crawler.
Oh, those are the best words for me!
Big thanks for your kindness. I'll implement it later today 👍
from nest-crawler.
Now the new version of nest-crawler is out!
It internally uses scrape-it, and you can use its options like how: 'html' 👍
from nest-crawler.
Glad to hear that!
And thank you for your PR #2, which I just merged 👍
from nest-crawler.
Hi, @junnstudio !
Could you give me a specific example of what you want to get as raw HTML?
I'll implement and update nest-crawler for you or tell you the best library :)
from nest-crawler.
Hi, @saltyshiomix !
I am developing a crawler that scrapes the content of some pieces of news.
Instead of getting back the results with all the tags removed, I'd like to get the raw HTML code so that I can easily embed it into my pages and use some extra css to re-style the HTML tags inside it in my own way.
Of course I can use other libraries for this problem, but I really love your nest-crawler package and hope this issue will be tackled soon :)
from nest-crawler.
Let me know what you think of an API like the one below:
interface ExampleCom {
  title: string;
  info: string;
}
const data: ExampleCom = await this.crawler.fetchRawHtml({
  target: 'http://example.com',
  fetch: {
    title: 'h1',
    info: {
      selector: 'p > a',
      attr: 'href',
    },
  },
});
console.log(data);
// {
//   title: '<h1>Example Domain</h1>',
//   info: '<a href="http://www.iana.org/domains/example">More information...</a>'
// }
from nest-crawler.
@saltyshiomix
Sometimes I don't need the raw HTML for every attribute inside fetch, just for some of them.
In my opinion, it'd be great if your package simply provided something like rawHtml: true:
interface ExampleCom {
  title: string;
  info: string;
}
const data: ExampleCom = await this.crawler.fetch({
  target: 'http://example.com',
  fetch: {
    title: 'h1',
    info: {
      selector: 'p > a',
      rawHtml: true
    },
  },
});
// {
//   title: 'Example Domain',
//   info: '<a href="http://www.iana.org/domains/example">More information...</a>'
// }
Another solution you can reference is the how: 'html' attribute of the scrape-it package at https://github.com/IonicaBizau/scrape-it.
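To make the idea concrete, here is a rough sketch of how a per-field rawHtml flag could be translated into scrape-it's how: 'html' option. It is only an illustration under assumptions: rawHtml is the flag proposed above and not an existing option, and the FieldConfig and ScrapeItField types and the toScrapeItOptions helper are hypothetical names, not part of nest-crawler or scrape-it.

```typescript
// Hypothetical field config: `rawHtml` is the flag proposed in this thread,
// not an existing nest-crawler option.
type FieldConfig = string | { selector: string; attr?: string; rawHtml?: boolean };

// Subset of scrape-it's per-field options used in this sketch.
interface ScrapeItField {
  selector: string;
  attr?: string;
  how?: 'html' | 'text';
}

// Translate the proposed config into scrape-it style options:
// fields flagged with `rawHtml: true` get `how: 'html'`, others are passed through.
function toScrapeItOptions(
  fetch: Record<string, FieldConfig>,
): Record<string, ScrapeItField> {
  const out: Record<string, ScrapeItField> = {};
  for (const [key, field] of Object.entries(fetch)) {
    if (typeof field === 'string') {
      // A bare string is treated as a selector.
      out[key] = { selector: field };
      continue;
    }
    const { rawHtml, ...rest } = field;
    out[key] = rawHtml ? { ...rest, how: 'html' } : rest;
  }
  return out;
}

const options = toScrapeItOptions({
  title: 'h1',
  info: { selector: 'p > a', rawHtml: true },
});
console.log(options);
```

With this approach only the flagged fields would be returned as HTML, while the rest keep the default text extraction.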
from nest-crawler.
Yes, I formerly used scrape-it, but it couldn't resolve cheerio types at the time.
Now that my PR has been merged, I'll gladly accept your advice :)
from nest-crawler.
By default, nest-crawler fetches data using scrape-it, but sometimes we need to wait until the DOM is loaded, as with single-page applications.
If you specify the option waitFor, nest-crawler fetches data using puppeteer instead.
In this way we can retrieve any web data from the crawler server :)
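The switch described above can be pictured with a small sketch. This is not nest-crawler's actual source: the CrawlConfig shape, the Engine type and the pickEngine helper are hypothetical, and the type of waitFor (a delay in milliseconds or a selector) is an assumption.

```typescript
// Hypothetical request shape; only `target`, `fetch` and `waitFor`
// are names that appear in this thread.
interface CrawlConfig {
  target: string;
  fetch: Record<string, unknown>;
  // Assumed type: a delay in milliseconds or a selector to wait for.
  waitFor?: number | string;
}

type Engine = 'scrape-it' | 'puppeteer';

// Use the headless browser only when the caller asks to wait for the page.
function pickEngine(config: CrawlConfig): Engine {
  return config.waitFor === undefined ? 'scrape-it' : 'puppeteer';
}

const staticPage = pickEngine({
  target: 'http://example.com',
  fetch: { title: 'h1' },
});

const spaPage = pickEngine({
  target: 'http://example.com',
  fetch: { title: 'h1' },
  waitFor: 3000,
});

console.log(staticPage, spaPage);
```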
from nest-crawler.
By the way, if you want to use puppeteer on Ubuntu, you must install some additional dependencies like this:
$ sudo apt install gconf-service libasound2 libatk1.0-0 libatk-bridge2.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
If you use Ubuntu, please remember this :)
from nest-crawler.