rlugojr / scrape-it

This project forked from ionicabizau/scrape-it

🔮 A Node.js scraper for humans.

Home Page: http://ionicabizau.net/blog/30-how-to-write-a-web-scraper-in-node-js

License: MIT License

JavaScript 90.27% HTML 4.99% Shell 4.74%

scrape-it

A Node.js scraper for humans.

☁️ Installation

$ npm i --save scrape-it

📋 Example

const scrapeIt = require("scrape-it");

// Promise interface
scrapeIt("http://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(page => {
    console.log(page);
});

// Callback interface
scrapeIt("http://ionicabizau.net", {
    // Fetch the articles
    articles: {
        listItem: ".article"
      , data: {

            // Get the article date and convert it into a Date object
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }

            // Get the title
          , title: "a.article-title"

            // Nested list
          , tags: {
                listItem: ".tags > span"
            }

            // Get the content
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }

    // Fetch the blog pages
  , pages: {
        listItem: "li.page"
      , name: "pages"
      , data: {
            title: "a"
          , url: {
                selector: "a"
              , attr: "href"
            }
        }
    }

    // Fetch some other data from the page
  , title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}, (err, page) => {
    console.log(err || page);
});
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: '<p>Playing computer games is a lot of fun. ...' },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: '<p>I love and ...' } ],
//   pages:
//    [ { title: 'Blog', url: '/' },
//      { title: 'About', url: '/about' },
//      { title: 'FAQ', url: '/faq' },
//      { title: 'Training', url: '/training' },
//      { title: 'Contact', url: '/contact' } ],
//   title: 'Ionică Bizău',
//   desc: 'Web Developer,  Linux geek and  Musician',
//   avatar: '/images/logo.png' }
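
Both calls above repeat the `title`, `desc`, and `avatar` fields. Because the options are plain objects, one possible way to avoid the repetition is to assemble them from shared pieces before passing them to `scrapeIt`. This is only a sketch: `attrField`, `headerFields`, and `blogSchema` are hypothetical names introduced here, not part of scrape-it.

```javascript
// Hypothetical helper (not part of scrape-it): builds a field descriptor
// that reads an attribute instead of the element text.
const attrField = (selector, attr) => ({ selector, attr });

// Fields shared by both examples above.
const headerFields = {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: attrField(".header img", "src")
};

// Full schema: the shared fields plus the list of blog pages.
const blogSchema = Object.assign({}, headerFields, {
    pages: {
        listItem: "li.page"
      , data: {
            title: "a"
          , url: attrField("a", "href")
        }
    }
});

console.log(JSON.stringify(blogSchema, null, 2));
```

The resulting `blogSchema` object could then be passed as the second argument of `scrapeIt`, exactly like the inline objects in the example.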

📝 Documentation

scrapeIt(url, opts, cb)

A scraping module for humans.

Params

  • String|Object url: The page URL or request options.
  • Object opts: The options passed to the scrapeHTML method.
  • Function cb: The callback function.

Return

  • Promise A promise object.
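
The signature above takes a callback and also returns a promise. The sketch below shows one common way a single function can offer both interfaces. It is not scrape-it's actual implementation: `fetchTitle` and its canned result are hypothetical stand-ins, and no network request is made.

```javascript
// Hypothetical stand-in for a scraper call; resolves with canned data
// instead of fetching a page.
function fetchTitle(url, cb) {
    const promise = new Promise((resolve, reject) => {
        if (typeof url !== "string" || url === "") {
            reject(new TypeError("url must be a non-empty string"));
            return;
        }
        resolve({ url, title: "Example title" });
    });

    // If a callback was supplied, feed it from the same promise.
    if (typeof cb === "function") {
        promise.then(page => cb(null, page), err => cb(err));
    }

    return promise;
}

// Promise interface
fetchTitle("http://example.com").then(page => console.log(page.title));

// Callback interface
fetchTitle("http://example.com", (err, page) => console.log(err || page.title));
```

With this pattern the callback follows the usual Node.js `(err, result)` convention, while callers who prefer promises can ignore the second argument.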

scrapeIt.scrapeHTML($, opts)

Scrapes the data in the provided element.

Params

  • Cheerio $: The input element.

  • Object opts: An object containing the scraping information. If you want to scrape a list, you have to use the listItem selector:

    • listItem (String): The list item selector.
    • data (Object): The fields to include in the list objects:
      • <fieldName> (Object|String): The selector or an object containing:
        • selector (String): The selector.
        • convert (Function): An optional function to change the value.
        • how (Function|String): A function or function name to access the value.
        • attr (String): If provided, the value will be taken based on the attribute name.
        • trim (Boolean): If false, the value will not be trimmed (default: true).
        • eq (Number): If provided, it will select the nth element.
        • listItem (Object): An object, keeping the recursive schema of the listItem object. This can be used to create nested lists.

    Example:

    {
       articles: {
           listItem: ".article"
         , data: {
               createdAt: {
                   selector: ".date"
                 , convert: x => new Date(x)
               }
             , title: "a.article-title"
             , tags: {
                   listItem: ".tags > span"
               }
             , content: {
                   selector: ".article-content"
                 , how: "html"
               }
           }
       }
    }

    If you want to collect specific data from the page, just use the same schema used for the data field.

    Example:

    {
         title: ".header h1"
       , desc: ".header h2"
       , avatar: {
             selector: ".header img"
           , attr: "src"
         }
    }
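
The `convert` option described above receives the extracted value and returns whatever should be stored in its place. As a sketch, converters can be written as small reusable functions that fall back to `null` when the text cannot be parsed. The helpers `toDate` and `toInt` and the `.comment-count` selector are hypothetical and not part of scrape-it.

```javascript
// Hypothetical convert helpers (not part of scrape-it).

// Parse a date string; return null instead of an "Invalid Date" object.
const toDate = text => {
    const date = new Date(text);
    return Number.isNaN(date.getTime()) ? null : date;
};

// Pull the first integer out of a string such as "42 comments".
const toInt = text => {
    const match = /-?\d+/.exec(String(text));
    return match ? parseInt(match[0], 10) : null;
};

// The ".comment-count" selector is an assumption made for illustration.
const schema = {
    createdAt: { selector: ".date", convert: toDate }
  , comments: { selector: ".comment-count", convert: toInt }
};

console.log(schema.comments.convert("42 comments"));
```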

Return

  • Object The scraped data.
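
To make the recursive schema more concrete, here is a minimal sketch of how such a schema could be walked. It is not scrape-it's implementation: instead of Cheerio elements it uses a hypothetical in-memory node shape, `{ text, attrs, find }`, where `find` maps a selector string to its matching nodes, and it only models `selector`, `attr`, `convert`, `trim`, `listItem`, and `data`.

```javascript
// Hypothetical node shape: { text, attrs, find: { "<selector>": [nodes] } }.
// A real scraper would query a parsed DOM instead.
const select = (node, selector) => (node.find && node.find[selector]) || [];

// Read the attribute or text of a node, then trim and convert it.
function readValue(node, field) {
    let value = field.attr ? (node.attrs || {})[field.attr] : node.text;
    if (typeof value === "string" && field.trim !== false) {
        value = value.trim();
    }
    return field.convert ? field.convert(value) : value;
}

function scrapeField(node, field) {
    // A bare string is shorthand for { selector: "<string>" }.
    if (typeof field === "string") {
        field = { selector: field };
    }

    // List fields: apply the `data` schema to each item, or read the
    // item's own value when no `data` is given.
    if (field.listItem) {
        return select(node, field.listItem).map(item =>
            field.data ? scrapeNode(item, field.data) : readValue(item, field)
        );
    }

    const target = field.selector ? select(node, field.selector)[0] : node;
    return target ? readValue(target, field) : undefined;
}

// Apply every field of the schema to the node.
function scrapeNode(node, schema) {
    const result = {};
    for (const name of Object.keys(schema)) {
        result[name] = scrapeField(node, schema[name]);
    }
    return result;
}

// Tiny hand-built "page" standing in for parsed HTML.
const page = {
    find: {
        ".header h1": [{ text: "  My blog  " }]
      , ".article": [{
            find: {
                "a.article-title": [{ text: "First post" }]
              , ".tags > span": [{ text: "js" }, { text: "node" }]
            }
        }]
    }
};

const data = scrapeNode(page, {
    title: ".header h1"
  , articles: {
        listItem: ".article"
      , data: {
            title: "a.article-title"
          , tags: { listItem: ".tags > span" }
        }
    }
});

console.log(JSON.stringify(data));
// {"title":"My blog","articles":[{"title":"First post","tags":["js","node"]}]}
```

The nested `tags` list is handled by the same `scrapeField` function as the outer `articles` list, which is what allows `listItem` schemas to nest to any depth.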

😋 How to contribute

Have an idea? Found a bug? See how to contribute.

💰 Donations

Another way to support the development of my open-source modules is to set up a recurring donation, via Patreon. 🚀

PayPal donations are appreciated too! Each dollar helps.

Thanks! ❤️

💫 Where is this library used?

If you are using this library in one of your projects, add it to this list.

  • 3abn—A 3ABN radio client in the terminal.
  • bandcamp-scraper (by Simon Thiboutôt)—A scraper for https://bandcamp.com
  • cevo-lookup (by Zack Boehm)—Searches the CEVO Suspension List for bans by SteamID
  • codementor—A scraper for codementor.io.
  • proxylist (by self_refactor)—Get free proxy list
  • sahibinden (by Cagatay Cali)—Simple sahibinden.com bot
  • sahibindenServer (by Cagatay Cali)—Simple sahibinden.com bot server side
  • sgdq-collector (by Benjamin Congdon)—Collects Twitch / Donation information and pushes data to Firebase
  • ubersetzung (by self_refactor)—translate words with examples from German to English
  • ui-studentsearch (by Rakha Kanz Kautsar)—API for majapahit.cs.ui.ac.id/studentsearch

📜 License

MIT © Ionică Bizău
