Comments (11)

junnstudio commented on June 14, 2024 2

@saltyshiomix
I have tried out the new version of the nest-crawler package and it works exactly as I expected.
I highly appreciate your work. Thank you sooo much! :)

from nest-crawler.

saltyshiomix commented on June 14, 2024 1

@junnstudio

Oh, the best words for me!
Many thanks for your kindness, and I'll implement it later today 👍

saltyshiomix commented on June 14, 2024 1

@junnstudio

Now [email protected] is out!

It uses scrape-it internally, so you can use scrape-it options such as how: 'html' 👍

saltyshiomix commented on June 14, 2024 1

@junnstudio

Glad to hear that!
And thank you for your PR #2, which I just merged 👍

saltyshiomix commented on June 14, 2024

Hi, @junnstudio !

Could you give me a specific example of what you want to get as raw HTML?

I'll either implement it and update nest-crawler for you, or point you to the best library :)

junnstudio commented on June 14, 2024

Hi, @saltyshiomix !
I am developing a crawler that scrapes the content of news articles.
Instead of getting the results back with all the tags removed, I'd like to get the raw HTML so that I can embed it in my pages and use some extra CSS to restyle the HTML tags inside it in my own way.
Of course I could use other libraries for this, but I really love your nest-crawler package and hope this issue will be tackled soon :)

saltyshiomix commented on June 14, 2024

@junnstudio

Let me know what you think of an API like the one below:

interface ExampleCom {
  title: string;
  info: string;
}

const data: ExampleCom = await this.crawler.fetchRawHtml({
  target: 'http://example.com',
  fetch: {
    title: 'h1',
    info: {
      selector: 'p > a',
      attr: 'href',
    },
  },
});

console.log(data);
// {
//   title: '<h1>Example Domain</h1>',
//   info: '<a href="http://www.iana.org/domains/example">More information...</a>'
// }

junnstudio commented on June 14, 2024

@saltyshiomix
Sometimes I don't need the raw HTML for every field inside fetch, just for some of them.
In my opinion, it would be great if your package simply provided something like rawHtml: true:

interface ExampleCom {
  title: string;
  info: string;
}

const data: ExampleCom = await this.crawler.fetch({
  target: 'http://example.com',
  fetch: {
    title: 'h1',
    info: {
      selector: 'p > a',
      rawHtml: true
    },
  },
});

// {
//   title: 'Example Domain',
//   info: '<a href="http://www.iana.org/domains/example">More information...</a>'
// }

Another solution you could refer to is the how: 'html' option of the scrape-it package at https://github.com/IonicaBizau/scrape-it.
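To make the connection between the two ideas concrete, here is a minimal sketch of how a per-field rawHtml: true flag could be translated into scrape-it's how: 'html' option. The rawHtml flag, the interfaces, and the toScrapeItFetch helper are all hypothetical names used for illustration; they are not part of nest-crawler or scrape-it.

```typescript
// Hypothetical per-field config from the proposal above.
// `rawHtml` is an illustrative flag, not a confirmed nest-crawler option.
interface ProposedField {
  selector: string;
  attr?: string;
  rawHtml?: boolean;
}

// Simplified shape of a scrape-it field; `how` names the method
// that scrape-it calls on the selected element.
interface ScrapeItField {
  selector: string;
  attr?: string;
  how?: 'html' | 'text';
}

type ProposedFetch = Record<string, string | ProposedField>;
type ScrapeItFetch = Record<string, string | ScrapeItField>;

// Hypothetical helper: rewrites `rawHtml: true` as `how: 'html'`
// and passes every other field through unchanged.
function toScrapeItFetch(fetch: ProposedFetch): ScrapeItFetch {
  const result: ScrapeItFetch = {};
  for (const [key, field] of Object.entries(fetch)) {
    if (typeof field === 'string') {
      result[key] = field; // plain selector, text extraction as usual
      continue;
    }
    const { rawHtml, ...rest } = field;
    result[key] = rawHtml ? { ...rest, how: 'html' as const } : rest;
  }
  return result;
}

const converted = toScrapeItFetch({
  title: 'h1',
  info: { selector: 'p > a', rawHtml: true },
});

console.log(converted);
```

One caveat with this mapping: how: 'html' ends up calling cheerio's html() on the selection, which returns the element's inner HTML, so the result may not include the outer tag shown in the example output above.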

saltyshiomix commented on June 14, 2024

@junnstudio

Yes, I formerly used scrape-it, but it can't resolve cheerio types at the moment.

Now that my PR has been merged, I'll gladly accept your advice :)

saltyshiomix commented on June 14, 2024

nest-crawler fetches data using scrape-it by default, but sometimes we need to wait until the DOM has loaded, as with single-page applications.

If you specify the waitFor option, nest-crawler fetches data using puppeteer instead.
This way we can retrieve any web data from the crawler server :)
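The backend choice described above can be sketched roughly as follows. This is only an illustration of the rule "no waitFor means scrape-it, waitFor means puppeteer"; the CrawlOptions type, the chooseBackend function, and the assumption that waitFor is a selector string or a delay in milliseconds are hypothetical and not taken from the nest-crawler source.

```typescript
type Backend = 'scrape-it' | 'puppeteer';

// Hypothetical options shape; the real nest-crawler types may differ.
interface CrawlOptions {
  target: string;
  // Assumed to be a CSS selector or a delay in milliseconds.
  waitFor?: string | number;
}

// Picks the fetching backend: static pages go through scrape-it,
// pages that must finish rendering first go through puppeteer.
function chooseBackend(options: CrawlOptions): Backend {
  return options.waitFor === undefined ? 'scrape-it' : 'puppeteer';
}

console.log(chooseBackend({ target: 'http://example.com' }));
// prints "scrape-it"
console.log(chooseBackend({ target: 'http://example.com', waitFor: 3000 }));
// prints "puppeteer"
```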

saltyshiomix commented on June 14, 2024

By the way, if you want to use puppeteer on Ubuntu, you must install additional dependencies like this:

$ sudo apt install gconf-service libasound2 libatk1.0-0 libatk-bridge2.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget

If you use Ubuntu, please remember this :)
