Is there any way to get the raw HTML code (keeping all tags) inside an element? I hav

Unable to get raw HTML code inside an element about nest-crawler HOT 11 CLOSED

junnstudio commented on June 14, 2024 2

Unable to get raw HTML code inside an element

from nest-crawler.

Comments (11)

junnstudio commented on June 14, 2024 2

@saltyshiomix
I have tried out the new version of the nest-crawler package and it works exactly as I expected.
I highly appreciate your work. Thank you sooo much! :)

saltyshiomix commented on June 14, 2024 1

@junnstudio

Oh, the best words for me!
Big thanks for your kindness! I'll implement it later today 👍

saltyshiomix commented on June 14, 2024 1

@junnstudio

Now the new version of nest-crawler is out!

It internally uses scrape-it, and you can use its options like how: 'html' 👍
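
A minimal sketch of how that option might look in a fetch call, based on the fetch shape used elsewhere in this thread. The FieldSpec and FetchOptions types and the rawHtmlFields helper are hypothetical, declared locally so the snippet compiles without nest-crawler installed; they are not the package's real typings. As far as I know, how names a cheerio method in scrape-it, and cheerio's html() returns the inner HTML of the first matched element, so the result may not include the matched tag itself.

```typescript
// Hypothetical local types: an assumption about the option shape,
// not nest-crawler's real typings.
type FieldSpec = string | { selector: string; attr?: string; how?: string };

interface FetchOptions {
  target: string;
  fetch: Record<string, FieldSpec>;
}

const options: FetchOptions = {
  target: 'http://example.com',
  fetch: {
    // A plain selector string yields the element's text.
    title: 'h1',
    // how: 'html' asks for the element's HTML instead of its text.
    info: { selector: 'p > a', how: 'html' },
  },
};

// Hypothetical helper: lists the fields that request HTML rather than text.
function rawHtmlFields(opts: FetchOptions): string[] {
  return Object.keys(opts.fetch).filter((name) => {
    const spec = opts.fetch[name];
    return typeof spec === 'object' && spec.how === 'html';
  });
}

// Logs the names of the fields that use how: 'html'.
console.log(rawHtmlFields(options));
```

Inside a NestJS service, an object like this would be passed as await this.crawler.fetch(options).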

saltyshiomix commented on June 14, 2024 1

@junnstudio

Glad to hear that!
And thank you for your PR #2, which I just merged 👍

saltyshiomix commented on June 14, 2024

Hi, @junnstudio !

Could you give me a specific example of what you want to get as raw HTML?

I'll either implement it in nest-crawler for you or point you to the best library :)

junnstudio commented on June 14, 2024

Hi, @saltyshiomix !
I am developing a crawler that scrapes the content of some pieces of news.
Instead of getting back results with all the tags removed, I'd like to get the raw HTML code so that I can easily embed it into my pages and use some extra CSS to restyle the HTML tags inside it in my own way.
Of course I can use other libraries for this problem but I really love your nest-crawler package and hope this issue will be tackled soon :)

saltyshiomix commented on June 14, 2024

@junnstudio

What do you think of an API like the one below?

interface ExampleCom {
  title: string;
  info: string;
}

const data: ExampleCom = await this.crawler.fetchRawHtml({
  target: 'http://example.com',
  fetch: {
    title: 'h1',
    info: {
      selector: 'p > a',
      attr: 'href',
    },
  },
});

console.log(data);
// {
//   title: '<h1>Example Domain</h1>',
//   info: '<a href="http://www.iana.org/domains/example">More information...</a>'
// }

junnstudio commented on June 14, 2024

@saltyshiomix
Sometimes I don't need the raw HTML for every field inside fetch, only for some of them.
In my opinion, it'd be great if your package simply provided an option like rawHtml: true:

interface ExampleCom {
  title: string;
  info: string;
}

const data: ExampleCom = await this.crawler.fetch({
  target: 'http://example.com',
  fetch: {
    title: 'h1',
    info: {
      selector: 'p > a',
      rawHtml: true
    },
  },
});

// {
//   title: 'Example Domain',
//   info: '<a href="http://www.iana.org/domains/example">More information...</a>'
// }

Another approach you could use as a reference is the how: 'html' option of the scrape-it package at https://github.com/IonicaBizau/scrape-it.

saltyshiomix commented on June 14, 2024

@junnstudio

Yes, I formerly used scrape-it, but it couldn't resolve cheerio types at the time.

Now that my PR has been merged, I'll gladly take your advice :)

saltyshiomix commented on June 14, 2024

nest-crawler fetches data using scrape-it by default, but sometimes we need to wait until the DOM has loaded, as with single-page applications.

If you specify the waitFor option, nest-crawler fetches data using puppeteer instead.
This way we can retrieve any web data from the crawler server :)
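
The rule described above can be sketched as a small decision function. This is an illustration only, not nest-crawler's actual source: the CrawlOptions type and the chooseBackend function are hypothetical names, and treating waitFor as a delay in milliseconds is an assumption.

```typescript
// Hypothetical option shape; waitFor as a millisecond delay is an assumption.
interface CrawlOptions {
  target: string;
  waitFor?: number;
  fetch: Record<string, unknown>;
}

type Backend = 'scrape-it' | 'puppeteer';

// Mirrors the rule stated above: waitFor present -> puppeteer, otherwise scrape-it.
function chooseBackend(options: CrawlOptions): Backend {
  return options.waitFor !== undefined ? 'puppeteer' : 'scrape-it';
}

const staticPage: CrawlOptions = {
  target: 'http://example.com',
  fetch: { title: 'h1' },
};

// Same request, but asking the crawler to wait before scraping.
const singlePageApp: CrawlOptions = { ...staticPage, waitFor: 3 * 1000 };

console.log(chooseBackend(staticPage));    // scrape-it
console.log(chooseBackend(singlePageApp)); // puppeteer
```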

saltyshiomix commented on June 14, 2024

By the way, if you want to use puppeteer on Ubuntu, you must install additional dependencies like this:

$ sudo apt install gconf-service libasound2 libatk1.0-0 libatk-bridge2.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget

If you use Ubuntu, please remember this :)
