eeerrrttty / website-cloner

This project forked from website-scraper/website-scraper-puppeteer

Most advanced plugin to clone pages

License: MIT License

JavaScript 84.98% HTML 15.02%

website-scraper-puppeteer

A plugin for website-scraper that returns the rendered HTML of dynamic websites using Puppeteer.

This module is open source software maintained by one developer in their free time. If you want to thank the author, you can use GitHub Sponsors or Patreon.

Requirements

  • Node.js version >= 14.14
  • website-scraper version >= 5

Installation

npm install website-scraper website-scraper-puppeteer

Usage

import scrape from 'website-scraper';
import PuppeteerPlugin from 'website-scraper-puppeteer';

await scrape({
    urls: ['https://www.instagram.com/gopro/'],
    directory: '/path/to/save',
    plugins: [ 
      new PuppeteerPlugin({
        launchOptions: { headless: false }, /* optional */
        scrollToBottom: { timeout: 10000, viewportN: 10 }, /* optional */
        blockNavigation: true, /* optional */
      })
    ]
});

The Puppeteer plugin constructor accepts the following parameters:

  • launchOptions - (optional) - Puppeteer launch options, as described in the Puppeteer docs
  • scrollToBottom - (optional) - in some cases, a page must be scrolled down before it renders its assets (lazy loading). Because some pages are effectively endless, scrolling stops before reaching the bottom as soon as either of the limits below is reached:
    • timeout - maximum scrolling time, in milliseconds
    • viewportN - maximum scroll distance, as a multiple of the viewport height
  • blockNavigation - (optional) - defines whether navigation away from the page is permitted. If set to true, the page is locked to the current URL and redirects such as location.replace(anotherPage) are blocked. Defaults to false
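
To make the two scrollToBottom limits concrete, here is a minimal simulation of the stop condition. It is a sketch under assumptions, not the plugin's actual source: the helper names createScrollLimiter and simulateScroll, the one-viewport scroll step, and the fixed time per step are all hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper (not part of the plugin's API): builds a predicate
// that reports when scrolling should stop under the two documented limits.
function createScrollLimiter({ timeout, viewportN }, viewportHeight, startTime) {
  // Assumption: viewportN caps the scroll distance at N viewport heights.
  const maxScroll = viewportHeight * viewportN;
  return function shouldStop(scrolledPx, now) {
    const timedOut = now - startTime >= timeout; // time limit reached
    const farEnough = scrolledPx >= maxScroll;   // distance limit reached
    return timedOut || farEnough;
  };
}

// Simulated scroll loop: moves one viewport per step and advances a fake
// clock by stepMs, so the example runs without a browser.
function simulateScroll(options, viewportHeight, pageHeight, stepMs) {
  const shouldStop = createScrollLimiter(options, viewportHeight, 0);
  let scrolled = 0;
  let now = 0;
  while (scrolled < pageHeight && !shouldStop(scrolled, now)) {
    scrolled = Math.min(scrolled + viewportHeight, pageHeight);
    now += stepMs;
  }
  return { scrolled, elapsed: now };
}

// A very long page with the options from the usage example:
// the distance limit (10 viewports of 800px) is hit first.
console.log(simulateScroll({ timeout: 10000, viewportN: 10 }, 800, 100000, 100));
```

With these assumed numbers the loop stops after ten viewports, well before the 10-second timeout. Lowering timeout or slowing the steps makes the time limit trigger first.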

How it works

The plugin starts Chromium in headless mode, opens the page, and waits until it has loaded. This is far from ideal, because you may need to wait for a specific resource, click a button, or log in first. This module does not currently support such functionality.
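
The open-and-wait flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the plugin's real implementation: renderHtml is a hypothetical name, the `load` wait condition is an assumed choice, and `browser` is expected to be a Puppeteer Browser instance (for example, the result of `puppeteer.launch(launchOptions)`).

```javascript
// Hypothetical sketch: open a page, wait for it to load, and return the
// rendered HTML. `browser` is assumed to be a Puppeteer Browser instance.
async function renderHtml(browser, url) {
  const page = await browser.newPage();
  try {
    // Wait only for the page's load event. Nothing here waits for a
    // specific resource, clicks a button, or logs in.
    await page.goto(url, { waitUntil: 'load' });
    // page.content() returns the serialized DOM after scripts have run.
    return await page.content();
  } finally {
    // Always release the tab, even if navigation fails.
    await page.close();
  }
}
```

Because the browser is passed in as an argument, the function can be exercised with a stub object that exposes newPage, goto, content, and close, which keeps the sketch testable without launching Chromium.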

Contributors

s0ph1e, dependabot[bot], aivus, dvdtsr, jpaulomotta, snyk-bot
