GithubHelp home page GithubHelp logo

ineed's Introduction

logo

Build Status

Web scraping and HTML-reprocessing. The easy way.

ineed allows you collect useful data from web pages using simple and nice API. Let's collect images, hyperlinks, scripts and stylesheets from http://google.com:

var ineed = require('ineed');

ineed.collect.images.hyperlinks.scripts.stylesheets.from('http://google.com',
    function (err, response, result) {
        console.log(result);
    });

Also, it can be used to build HTML-reprocessing pipelines (like jch/html-pipeline but for Node) with elegance. E.g. we have the following html:

<!DOCTYPE html>
<html>
<head>
    <title></title>
    <style type="text/css">
        border-radius()
          -webkit-border-radius: arguments
          -moz-border-radius: arguments
          border-radius: arguments

        ul
          margin: 5px
          border: 1px solid
          border-radius: 5px
    </style>
</head>
<body>
This is an H1
=============
*   Red
*   Green
*   Blue
</body>
</html>

Let's render it's <style> with stylus and convert text from Markdown to HTML using marked then assemble results back to HTML:

var ineed = require('ineed'),
    stylus = require('stylus'),
    marked = require('marked');

function renderStylus(code) {
    var css = null;

    stylus.render(code, null, function (err, result) {
        css = result;
    });

    return css;
}

var resultHtml = ineed.reprocess.cssCode(renderStylus).texts(marked).fromHtml(html);

And the resultHtml will be:

<!DOCTYPE html>
<html>
<head>
    <title></title>
    <style type="text/css">
        .ul {
            margin: 5px;
            border: 1px solid;
            -webkit-border-radius: 5px;
            -moz-border-radius: 5px;
            border-radius: 5px;
        }
    </style>
</head>
<body>
<h1 id="this-is-an-h1">This is an H1</h1>
<ul>
    <li>Red</li>
    <li>Green</li>
    <li>Blue</li>
</ul>
</body>
</html>

ineed doesn't build and traverse DOM-tree, it operates on sequence of HTML tokens instead. Whole processing is done in one-pass, therefore, it's blazing fast! The token stream is produced by parse5 which parses HTML exactly the same way modern browsers do.

ineed provides built-in collectors and reprocessors that covers a wide variety of common use cases. However, if you feel that something is missing, then you can easily extend ineed with custom plugins.

Install

$ npm install ineed

Usage

The general form:

ineed.<action>[.<plugin>...].<from*>

from* methods

.fromHtml(html)

Accepts html string as an argument and synchronously returns result of the action.

Example:

var result = ineed.collect.texts.fromHtml('<div>Hey ya</div>');
.from(options, callback)

Asynchronously loads specified page and invokes callback(err, response, result). It passes response object to callback in case if you need response headers or status codes. The first argument can be either a url or an options object. The set of options is the same as in @mikeal's request, so you can use POST method, set request headers or do any other advanced setup for the page request.

Examples:

ineed.collect.jsCode.from('https://github.com', function (err, response, result) {
    console.log(response.statusCode);
    console.log(response.headers);    
});
ineed.collect.title.from({
    url: 'https://test.domain/',
    method: 'POST',
    form: 'input=test'
}, function (err, response, result) {
    console.log(result);
});

.collect action

Collects information specified by plugin set. The result of the action is an object that contains individual plugin outputs as properties.

Example:

var result = ineed.collect.texts.images.scripts.fromHtml(html);

will produce result:

{
    "texts" : <collected texts>,
    "images" : <collected images>,
    "scripts" : <collected scripts>
}

Built-in plugins:

Plugin Description Output
.comments Collects HTML comments Array of comment string
.cssCode Collects CSS code enclosed in <style> tags Array of CSS code strings
.hyperlinks Collects URL (see remark below) and text of the hyperlinks Array of {href:[String], text:[String]} objects
.images Collects absolute URL (see remark below) and alt attribute of the images Array of {src:[String], alt:[String]} objects
.jsCode Collects JavaScript code enclosed in <script> tags Array of JavaScript code strings
.scripts Collects URL (see remark below) of the external .js-files, specified via <script> tags with src attribute Array of script URLs
.stylesheets Collects URL (see remark below) of the external *.css-files, specified via <link> tag Array of stylesheet URLs
.texts Collects all text nodes in <body> of the document with except for <script> and <style> tag's content. Speaking clearly: all end-user visible text. Array of text strings
.title Collects document title Document title string

Remark: All URLs are collected in respect to <base> tag. The resulting URL will be an absolute URL if .from() method was used, <base> tag constains absolute URL or raw collected URL is already absolute.

.reprocess action

Applies plugins' replacing functions to the source HTML-string. The result of the action is the reprocessed HTML-string.

Example:

//Delete all HTML comments and render emoji
var result = ineed.reprocess
    .comments(function() {
        return null;
    })
    .texts(function(text, escapeHtml) {
        return escapeHtml(text).replace(/:beer:/g, '<img src="emoji/unicode/1f37a.png" alt=":beer:">');
    })
    .fromHtml(html);

Built-in plugins:

Plugin replacer arguments Description
.comments(replacer) replacer(commentString) Replaces HTML commentString with the value returned by replacer. Comment will be deleted from markup if null is returned.
.cssCode(replacer) replacer(cssCodeString) Replaces cssCodeString enclosed in <style> tag with the value returned by replacer.
.hyperlinks(replacer) replacer(pageBaseUrl, hrefAttrValue) Replaces <a> tag href attribute value with the value returned by replacer.
.images(replacer) replacer(pageBaseUrl, srcAttrValue) Replaces <img> tag src attribute value with the value returned by replacer.
.jsCode(replacer) replacer(jsCodeString) Replaces jsCodeString enclosed in the <script> tag with the value returned by replacer.
.scripts(replacer) replacer(pageBaseUrl, srcAttrValue) Replaces <script> tag src attribute value with the value returned by replacer.
.stylesheets(replacer) replacer(pageBaseUrl, hrefAttrValue) Replaces <link rel="stylesheet"> tag href attribute value with the value returned by replacer.
.texts(replacer) replacer(text, escapeHtmlFunc) Replaces text with the value returned by replacer. Returned value will be not HTML escaped, so HTML code can be used as replacer result. You can manually apply escapeHtmlFunc(str) to force HTML escaping of the result.
.title(replacer) replacer(title) Replaces page title with the value returned by replacer.

Custom plugins

There are two kinds of plugins: those that extends .collect action and those that extends .reprocess action. To enable plugin use .using() function, which will return new instance of ineed with enabled plugin.

Example:

ineed
    .using(myPlugin1)
    .using(myPlugin2)
    .collect
    ...

Common structure

Plugins of both kinds are objects and in addition to the kind-specific properties they should have the following properties:

.extends

Indicates which action will be extended by plugin. Can be 'collect' or 'reprocess'. Required property.

.name

Name of the plugin. Required property. It should reflect the target of the plugin action. Plugin will be accessable under the name in the .collect or .reprocess objects and will be used as result property name for .collect action. E.g. plugin has name='tagNames'. If it extends .collect:

//Enable plugin and use it
var result = ineed.using(plugin).collect.tagNames.fromHtml(html);

//Access plugin results
var pluginResults = result.tagNames;

If it extends .reprocess:

 var reprocessedHtml = ineed
        .using(plugin)
        .reprocess
        .tagNames(function (tagName) {
            if (tagName === 'applet')
                return 'object';
        })
        .fromHtml(html);

.init(ctx, ...)

Function that initializes plugin. Required field. It always receives ctx object as it first argument. ctx contains some useful common information regarding current pipeline state:

ctx.baseUrl

Base URL of all resources on the page with respect to <base> tag.

ctx.leadingStartTag

Leading non-self-closing start tag for the current HTML token. Can be used to determine parent of the text nodes. Will be null if the leading start tag was self-closing or any leading end tag was met.

ctx.inBody

Indicates if emitted tokens are in <body> tag.

Token handlers

Plugin can have one or more HTML token handlers:

  • .onDoctype(doctype)

Where doctype:

{
    name: [String],
    publicId: [String],
    systemId: [String]
}
  • .onStartTag(startTag)

Where startTag:

{
    tagName: [String],
    attrs: [Array],
    selfClosing: [Boolean]
}
  • .onEndTag(tagName)
  • .onText(text)
  • .onComment(commentText)

Collecting plugins

Collecting plugins in addition to common properties should have .getCollection() method that should return items collected by plugin.

Example of the collecting plugin:

//Collects tagNames in <body>
var plugin = {
    extends: 'collect',
    name: 'tagNamesInBody',

    init: function (ctx) {
        this.ctx = ctx;
        this.tagNames = [];
    },

    addTagName: function (tagName) {
        if (this.ctx.inBody && this.tagNames.indexOf(tagName) < 0)
            this.tagNames.push(tagName);
    },

    onStartTag: function (startTag) {
        this.addTagName(startTag.tagName);
    },

    onEndTag: function (tagName) {
        this.addTagName(tagName);
    },

    getCollection: function () {
        return this.tagNames;
    }
};

//Let's use it
var results = ineed.using(plugin).collect.tagNamesInBody.fromHtml(html);
console.log(results.tagNamesInBody);

Reprocessing plugins

Reprocessing plugin's init method in addition to ctx object receives all arguments passed to plugin in pipeline (e.g. replacer() function). If token handler returns null then token will be deleted from the pipeline and no farther processing by other plugins will be performed on it and it will not appear in the resulting HTML. If handler returns modified token then it will be passed to the farther plugins and will appear in the resulting HTML. If handler doesn't returns value or returns undefined then token will be passed unmodified to the farther plugins.

Example of the reprocessing plugin:

//Replaces or deletes tagNames in <body>
var plugin = {
    extends: 'reprocess',
    name: 'tagNamesInBody',

    init: function (ctx, replacer) {
        this.ctx = ctx;
        this.replacer = replacer;
    },

    onStartTag: function (startTag) {
        if (this.ctx.inBody) {
            startTag.tagName = this.replacer(startTag.tagName);

            //Delete token if tagName is null
            return startTag.tagName === null ? null : startTag;
        }
    },

    onEndTag: function (tagName) {
        if (this.ctx.inBody)
            return this.replacer(tagName);
    }
};

//Let's use it
var reprocessedHtml = ineed
    .using(plugin)
    .reprocess
    .tagNamesInBody(function (tagName) {
        //Delete <noscript> start and end tags
        if (tagName === 'noscript')
            return null;
        
        //Replace <div> with <p>
        if(tagName === 'div')
            return 'p';
            
        return tagName;
    }).fromHtml(html);

Testing

$ npm test

Questions or suggestions?

If you have any questions, please feel free to create an issue here on github.

Author

Ivan Nikulin ([email protected])

ineed's People

Contributors

inikulin avatar strugee avatar

Stargazers

Jesse Johnson avatar GunnerY avatar  avatar  avatar Martin Amm avatar Yann Moussaba avatar Cat  avatar Silvan Büdenbender avatar Yaroslav Gaponov avatar Tal Vegvizer avatar  avatar Nick Bretz avatar  avatar Anurag Jain avatar Oleg avatar Mark Graham avatar lleohao avatar Charlike Mike Reagent avatar Parker avatar Brian Barnett avatar Brendan Sparrow avatar Taieb avatar Christian Ehrhardt avatar Brian Curliss avatar Lan Qingyong avatar Moose avatar PaulZeng avatar Egor Rudinsky avatar Warren Sadler avatar Puddin' avatar  avatar Habib MAALEM avatar teddav avatar roll avatar Apoorva Saxena avatar Parsa Fatehi avatar Mohamed Abd Elhalim avatar Ketzelle avatar  avatar Fábio Ávila avatar  avatar Eduardo Peixe avatar shikakun avatar AMAGI / Jun Yuri avatar Lajpat Shah avatar Derek Parry avatar Humanismusic avatar bhoga avatar Hakkotsu avatar Richard Ivan avatar Hernán Veiras avatar Denis Nazarov avatar Meka Claude avatar Ma Yuewen avatar Seb avatar  avatar Sylvain Chamberland avatar Gerald LePage avatar Quinton Pike avatar Mike Mitchell avatar Hasit Mistry avatar Mhd Sami Al Mouhtaseb avatar Jim avatar Anatolii avatar swh avatar Chun-Hao Lien avatar  avatar Jesper Olesen avatar Francois-Guillaume Ribreau avatar Seokje Park avatar flyeven avatar Marko Jankovic avatar Hieu Do avatar Alexey Markov avatar  avatar Sandrino Di Mattia avatar RYeah Sh avatar Eddie Zaneski avatar Patrick avatar Pana avatar Damian Płaza avatar Ray Patterson avatar Michael Chiche avatar  avatar Vladimir Rodkin avatar Lochlan Bunn avatar Joshua Bemenderfer avatar Piyush avatar Ira McMahon avatar Stav Geffen avatar Leonard C. Modoran avatar Mike Vegeto avatar Amr Eldib avatar Dan avatar C Dorn avatar Tamas Fodor avatar Andres Vidal avatar  avatar Luiz Bills avatar Anibal avatar

Watchers

azu avatar  avatar  avatar Arnstein Henriksen avatar  avatar James Cloos avatar Tom Byrer avatar Denis Stoyanov avatar Michael Anthony avatar Richard Ivan avatar Menjanahary avatar Jānis avatar balamutih avatar  avatar

ineed's Issues

Phantomjs or Casperjs with Ineed

Hello, I am new to webscraping and I am trying to crawl and collect links on a site to learn more about it. I really like how easy your package is to use but was curious if/how you could use it with phantom or casper. Do you have any examples you could share with me or tutorials I could look at? I am also looking at using simplecrawler with ineed as well.

Uses DOS-style newlines.

dmn gen is nice, but now I've got Ctrl-M littering my .npmignore.

Could you please fix it to use platform newlines?

Ineed question

Hi, just started using ineed. Maybe i'm doing something wrong, but the same code gives me different results :

  • sometimes, it works as expected, and retrieves the data i'm looking for.
  • sometimes, without raising an error, i get not result from the collect() method. It looks as if the wasn't even a connection to the target site.

Is there a limitation in ineed that prevents connecting too often to a site?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.