imclab / cheers

This project forked from fallanic/cheers

Scrape a website efficiently, block by block, page by page. Based on cheerio and curl.

License: MIT License

Cheers

Scrape a website efficiently, block by block, page by page.

Motivations

Cheers is a Cheerio-based scraper for extracting data from a website using CSS selectors.
The motivation behind this package is to provide a simple scraping tool that divides a website into blocks and transforms each block into a JSON object using CSS selectors.

Built on top of the excellent:

https://github.com/cheeriojs/cheerio
https://github.com/chriso/curlrequest
https://github.com/kriskowal/q

CSS mapping syntax inspired by:

https://github.com/dharmafly/noodle

Getting Started

Install the module with: npm install cheers

Usage

Configuration options:

  • config.url: the URL to scrape.
  • config.blockSelector: the CSS selector used to divide the page into scraping blocks. This field is optional and defaults to "body".
  • config.scrape: the definition of what to extract from each block. Each key has two mandatory attributes: selector (a CSS selector, or "." to stay on the current node) and extract. The possible values for extract are "text", "html", "outerHTML", a RegExp, or the name of an attribute of the HTML element (e.g. "href").

var cheers = require('cheers');

// Let's scrape this excellent JS news website
var config = {
    url: "http://www.echojs.com/",
    blockSelector: "article",
    scrape: {
        title: {
            selector: "h2 a",
            extract: "text"
        },
        link: {
            selector: "h2 a",
            extract: "href"
        },
        articleInnerHtml: {
            selector: ".",
            extract: "html"
        },
        articleOuterHtml: {
            selector: ".",
            extract: "outerHTML"
        },
        articlePublishedTime: {
            selector: 'p',
            extract: /\d* (?:hour[s]?|day[s]?) ago/
        }
    }
};

cheers.scrape(config).then(function (results) {
    console.log(JSON.stringify(results));
}).catch(function (error) {
    console.error(error);
});
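
The configuration rules listed under Usage can be restated as a small pre-flight check. The sketch below is only an illustration: normalizeConfig and extractKind are hypothetical helpers, not part of the cheers API, and treating config.url and config.scrape as required is an assumption (the README only marks blockSelector as optional). The block does not show how cheers processes a config internally.

```javascript
// Hypothetical helpers, not part of the cheers API: they only restate the
// configuration rules from this README as plain JavaScript checks.
var BUILT_IN_EXTRACTS = ['text', 'html', 'outerHTML'];

// Classify an `extract` value: a RegExp, one of the built-in keywords,
// or (for any other non-empty string) the name of an HTML attribute.
function extractKind(extract) {
    if (extract instanceof RegExp) {
        return 'regexp';
    }
    if (typeof extract === 'string' && extract.length > 0) {
        return BUILT_IN_EXTRACTS.indexOf(extract) !== -1 ? extract : 'attribute';
    }
    throw new TypeError('extract must be a non-empty string or a RegExp');
}

// Return a copy of the config with the default blockSelector applied.
// Throws when url or scrape is missing (assumed required here), or when a
// scrape key lacks one of its two mandatory attributes.
function normalizeConfig(config) {
    if (!config || typeof config.url !== 'string' || config.url === '') {
        throw new TypeError('config.url is required');
    }
    if (!config.scrape || typeof config.scrape !== 'object') {
        throw new TypeError('config.scrape is required');
    }
    Object.keys(config.scrape).forEach(function (key) {
        var field = config.scrape[key];
        if (!field || typeof field.selector !== 'string' || field.selector === '') {
            throw new TypeError('scrape.' + key + '.selector is required');
        }
        extractKind(field.extract); // throws on a missing or unsupported value
    });
    return {
        url: config.url,
        blockSelector: config.blockSelector || 'body', // documented default
        scrape: config.scrape
    };
}

// Usage: blockSelector is omitted, so the documented default applies.
var normalized = normalizeConfig({
    url: 'http://www.echojs.com/',
    scrape: {
        link: { selector: 'h2 a', extract: 'href' }
    }
});
console.log(normalized.blockSelector);
```

Running such a check before calling cheers.scrape would surface a missing selector or extract early, rather than partway through a scrape.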

Roadmap

  • Option to use request instead of curl
  • Option to change the user agent
  • Command line tool
  • Website pagination
  • Option to use a headless browser
  • Unit tests

Contributors

Cheers!

License

Copyright (c) 2014 Fabien Allanic
Licensed under the MIT license.
