themaximalist / scrape.js

Web Scraping Library for Node.js

Home Page: https://scrapejs.themaximalist.com/

License: MIT License

Scrape.js

Scrape.js: Web Scraping Library for Node.js

Scrape.js is an easy-to-use web scraping library for Node.js.

const data = await scrape("https://example.com");
// { url, html }

Features

  • Fast
  • Scrape nearly any website
  • Headless JavaScript scraping
  • Auto proxy rotation
  • ...it just works
  • MIT License

Install

Install Scrape.js from NPM:

npm install @themaximalist/scrape.js

Config

Scrape.js uses ZenRows for proxy rotation. To use it, acquire a ZenRows API key and set it in the ZENROWS_API_KEY environment variable.

ZENROWS_API_KEY=abcxyz123

Scrape.js can be used without proxies, but is less effective.
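
Because the key is optional, a caller may want to turn proxying on only when a key is actually configured. The following is a minimal sketch under that assumption; proxyOptionsFromEnv is a hypothetical helper and not part of Scrape.js, and it only builds the proxy option described in the Options section.

```javascript
// Hypothetical helper (not part of Scrape.js): enable proxy rotation only
// when a ZenRows API key is present in the environment.
function proxyOptionsFromEnv(env = process.env) {
    const key = (env.ZENROWS_API_KEY || "").trim();
    return { proxy: key.length > 0 };
}

// Usage, assuming `scrape` has been required as shown in the Usage section:
// const data = await scrape("https://example.com", proxyOptionsFromEnv());
console.log(proxyOptionsFromEnv({ ZENROWS_API_KEY: "abcxyz123" })); // { proxy: true }
console.log(proxyOptionsFromEnv({})); // { proxy: false }
```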

Usage

Using Scrape.js is as simple as calling a function with a website URL.

const scrape = require("@themaximalist/scrape.js");
(async () => {
    const data = await scrape("http://example.com"); // { url, html }
})();

You can specify additional options to scrape() for more control:

const data = await scrape("https://example.com", {
    headless: true,
    proxy: true
});
// { url, html }

API

The Scrape.js API is a single function that takes a URL and an optional config object.

await scrape(
    url, // URL to scrape
    {
        headless: true, // Use JavaScript headless scraping
        proxy: true, // Use proxy rotation
        method: "GET", // HTTP Request method
        timeout: 3000, // Scrape timeout in ms
        userAgent: "Mozilla/5.0...", // User Agent
    }
);

URL (required)

  • url <string>: URL to scrape

Options

  • headless <bool>: Enable JavaScript. Default is true.
  • proxy <bool>: Use proxy with request. Default is true.
  • method <string>: HTTP request method, usually GET or POST. Default is GET.
  • timeout <int>: Max request time in ms. Default is 3500.
  • userAgent <string>: User agent for request. Default is Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36.
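
To see which settings are in effect for a given call, the documented defaults can be written out and merged with any overrides. This is an illustrative sketch only: DOCUMENTED_DEFAULTS and withDocumentedDefaults are hypothetical names that copy the values from the list above, and they do not reflect how Scrape.js handles defaults internally.

```javascript
// Defaults copied from the Options list above (not read from the library).
const DOCUMENTED_DEFAULTS = Object.freeze({
    headless: true,
    proxy: true,
    method: "GET",
    timeout: 3500,
    userAgent:
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
});

// Hypothetical helper: overrides win, everything else keeps the documented default.
function withDocumentedDefaults(overrides = {}) {
    return { ...DOCUMENTED_DEFAULTS, ...overrides };
}

// Example: a plain HTTP request with a longer timeout and no headless browser.
const options = withDocumentedDefaults({ headless: false, timeout: 10000 });
console.log(options.headless, options.proxy, options.method, options.timeout);
// false true GET 10000
```

The resulting object could then be passed as the second argument to scrape().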

Response

Scrape.js returns an object containing the final url and html content.

const { url, html } = await scrape("https://example.com");
console.log(url); // https://example.com/
console.log(html); // <html...
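
Since html is returned as a plain string, any post-processing is up to the caller. As a small sketch, the hypothetical extractTitle helper below (not part of Scrape.js) pulls the page title out with a regular expression. The sample object stands in for a real response so the snippet runs without network access; for anything beyond a quick check, a proper HTML parser is the safer choice.

```javascript
// Hypothetical helper: return the trimmed <title> text, or null if none is found.
function extractTitle(html) {
    const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
    return match ? match[1].trim() : null;
}

// Stand-in for the { url, html } object that scrape() resolves to.
const sample = {
    url: "https://example.com/",
    html: "<html><head><title> Example Domain </title></head><body></body></html>",
};

console.log(extractTitle(sample.html)); // Example Domain
```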

The Scrape.js API is a simple way to scrape the HTML from most websites.

Debug

Scrape.js uses the debug npm module with the scrape.js namespace.

View debug logs by setting the DEBUG environment variable.

> DEBUG=scrape.js* node src/get_website_html.js
# debug logs

Examples

View the tests for examples of how to use Scrape.js.

Projects

Scrape.js is currently used in the following projects:

  • News Score: score the news, rewrite the headlines

License

MIT

Author

Created by The Maximalist. See our open-source projects.
