miguelramosfdz / scrape-it

This project forked from ionicabizau/scrape-it

🔮 A Node.js scraper for humans.

Home Page: http://ionicabizau.net/blog/30-how-to-write-a-web-scraper-in-node-js

License: MIT License

JavaScript 86.21% Shell 4.06% HTML 9.73%

scrape-it

A Node.js scraper for humans.

☁️ Installation

$ npm i --save scrape-it

📋 Example

const scrapeIt = require("scrape-it");

// Promise interface
scrapeIt("http://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(page => {
    console.log(page);
});

// Callback interface
scrapeIt("http://ionicabizau.net", {
    // Fetch the articles
    articles: {
        listItem: ".article"
      , data: {

            // Get the article date and convert it into a Date object
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }

            // Get the title
          , title: "a.article-title"

            // Nested list
          , tags: {
                listItem: ".tags > span"
            }

            // Get the content
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }

    // Fetch the blog pages
  , pages: {
        listItem: "li.page"
      , name: "pages"
      , data: {
            title: "a"
          , url: {
                selector: "a"
              , attr: "href"
            }
        }
    }

    // Fetch some other data from the page
  , title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}, (err, page) => {
    console.log(err || page);
});
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: '<p>Playing computer games is a lot of fun. ...' },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: '<p>I love and ...' } ],
//   pages:
//    [ { title: 'Blog', url: '/' },
//      { title: 'About', url: '/about' },
//      { title: 'FAQ', url: '/faq' },
//      { title: 'Training', url: '/training' },
//      { title: 'Contact', url: '/contact' } ],
//   title: 'Ionică Bizău',
//   desc: 'Web Developer,  Linux geek and  Musician',
//   avatar: '/images/logo.png' }
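
The sample output above contains relative URLs (/about, /images/logo.png). One way to normalise such values could be to pass a convert function in the schema. The following is a minimal sketch, not part of scrape-it itself: BASE_URL, toAbsoluteUrl and toDate are hypothetical names, and the sketch assumes that convert is also applied to values read through attr.

```javascript
// Hypothetical helpers meant to be used as `convert` functions in a schema.
// BASE_URL is an assumption: the root of the site being scraped.
const BASE_URL = "http://ionicabizau.net";

// Resolve a relative href (e.g. "/about") against the site root,
// using Node's built-in WHATWG URL class.
const toAbsoluteUrl = href => new URL(href, BASE_URL).href;

// Parse a date string; return null instead of an Invalid Date object.
const toDate = text => {
    const date = new Date(text);
    return Number.isNaN(date.getTime()) ? null : date;
};

// The schema is a plain object, so the helpers plug in directly.
const schema = {
    pages: {
        listItem: "li.page"
      , data: {
            title: "a"
          , url: {
                selector: "a"
              , attr: "href"
              , convert: toAbsoluteUrl
            }
        }
    }
  , avatar: {
        selector: ".header img"
      , attr: "src"
      , convert: toAbsoluteUrl
    }
};

console.log(toAbsoluteUrl("/about")); // http://ionicabizau.net/about
console.log(toDate("not a date"));    // null
```

The schema object would then be passed as the second argument to scrapeIt, exactly like the schemas in the example above.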

❓ Get Help

There are a few ways to get help:

  1. Please post questions on Stack Overflow. You can open issues with questions, as long as you add a link to your Stack Overflow question.
  2. For bug reports and feature requests, open issues. 🐛
  3. For direct and quick help from me, you can use Codementor. 🚀

📝 Documentation

scrapeIt(url, opts, cb)

A scraping module for humans.

Params

  • String|Object url: The page url or request options.
  • Object opts: The options passed to the scrapeHTML method.
  • Function cb: The callback function.

Return

  • Promise A promise object.

scrapeIt.scrapeHTML($, opts)

Scrapes the data in the provided element.

Params

  • Cheerio $: The input element.

  • Object opts: An object containing the scraping information. If you want to scrape a list, you have to use the listItem selector:

    • listItem (String): The list item selector.
    • data (Object): The fields to include in the list objects:
      • <fieldName> (Object|String): The selector or an object containing:
        • selector (String): The selector.
        • convert (Function): An optional function to change the value.
        • how (Function|String): A function or function name to access the value.
        • attr (String): If provided, the value will be taken based on the attribute name.
        • trim (Boolean): If false, the value will not be trimmed (default: true).
        • closest (String): If provided, returns the first ancestor of the selected element that matches this selector.
        • eq (Number): If provided, it will select the nth element.
        • listItem (Object): An object, keeping the recursive schema of the listItem object. This can be used to create nested lists.

    Example:

    {
       articles: {
           listItem: ".article"
         , data: {
               createdAt: {
                   selector: ".date"
                 , convert: x => new Date(x)
               }
             , title: "a.article-title"
             , tags: {
                   listItem: ".tags > span"
               }
             , content: {
                   selector: ".article-content"
                 , how: "html"
               }
             , traverseOtherNode: {
                   selector: ".upperNode"
                 , closest: "div"
                 , convert: x => x.length
               }
           }
       }
    }

    If you want to collect specific data from the page, just use the same schema used for the data field.

    Example:

    {
         title: ".header h1"
       , desc: ".header h2"
       , avatar: {
             selector: ".header img"
           , attr: "src"
         }
    }

Return

  • Object The scraped data.
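
To make the field options above more concrete, here is an illustrative sketch of how how, attr, trim and convert could combine once an element has been selected. It is an approximation written for this document, not scrape-it's actual source: readValue and fakeElm are hypothetical names, the order of the steps is an assumption, and fakeElm only stands in for a Cheerio selection so that the sketch runs without dependencies.

```javascript
// Illustrative sketch only: approximates how a field's options could be
// applied to an already-selected element. Not scrape-it's real implementation.
function readValue($elm, field) {
    // Assumption: when `how` is omitted, the element's text is used.
    const how = field.how || "text";

    // `attr` reads an attribute; otherwise `how` is either a function that
    // receives the element or the name of a method to call on it.
    let value = field.attr
        ? $elm.attr(field.attr)
        : typeof how === "function" ? how($elm) : $elm[how]();

    // Trimming is on unless `trim` is explicitly set to false.
    if (field.trim !== false && typeof value === "string") {
        value = value.trim();
    }

    // `convert` runs last, on the extracted value.
    return field.convert ? field.convert(value) : value;
}

// A stand-in for a Cheerio selection (hypothetical data).
const fakeElm = {
    text: () => "  42 comments  "
  , html: () => "<b>42</b> comments"
  , attr: name => (name === "href" ? "/post/1" : undefined)
};

console.log(readValue(fakeElm, {}));                                // "42 comments"
console.log(readValue(fakeElm, { how: "html" }));                   // "<b>42</b> comments"
console.log(readValue(fakeElm, { attr: "href" }));                  // "/post/1"
console.log(readValue(fakeElm, { trim: false }));                   // "  42 comments  "
console.log(readValue(fakeElm, { convert: x => parseInt(x, 10) })); // 42
```

In a real schema the same options would sit next to selector, as in the examples above, and scrape-it would supply the element.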

😋 How to contribute

Have an idea? Found a bug? See how to contribute.

💖 Support my projects

I open-source almost everything I can, and I try to reply to everyone needing help using these projects. Obviously, this takes time. You can integrate and use these projects in your applications for free! You can even change the source code and redistribute it (or even resell it).

However, if you get some profit from this or just want to encourage me to continue creating stuff, there are a few ways you can do it:

  • Starring and sharing the projects you like 🚀

  • PayPal: You can make one-time donations via PayPal. I'll probably buy a tea. 🍵

  • Support me on Patreon: Set up a recurring monthly donation and you will get interesting news about what I'm doing (things that I don't share with everyone).

  • Bitcoin: You can send me bitcoins at this address: 1P9BRsmazNQcuyTxEqveUsnf5CERdq35V6

Thanks! ❤️

💫 Where is this library used?

If you are using this library in one of your projects, add it to this list. ✨

  • 3abn: A 3ABN radio client in the terminal.
  • bandcamp-scraper (by Simon Thiboutôt): A scraper for https://bandcamp.com
  • cevo-lookup (by Zack Boehm): Searches the CEVO Suspension List for bans by SteamID
  • codementor: A scraper for codementor.io.
  • degusta-scrapper (by yohendry hurtado): degusta scraper for an Alexa skill
  • proxylist (by self_refactor): Get a free proxy list
  • rs-api (by Alex Kempf): Simple wrapper for RuneScape APIs written in Node.
  • sahibinden (by Cagatay Cali): Simple sahibinden.com bot
  • sahibindenServer (by Cagatay Cali): Simple sahibinden.com bot server side
  • sgdq-collector (by Benjamin Congdon): Collects Twitch / Donation information and pushes data to Firebase
  • trump-cabinet-picks (by Linda Haviv): NYT cabinet predictions for Trump admin.
  • ubersetzung (by self_refactor): Translate words with examples from German to English
  • ui-studentsearch (by Rakha Kanz Kautsar): API for majapahit.cs.ui.ac.id/studentsearch

📜 License

MIT © Ionică Bizău
