bndr / node-read

Get Readable Content from any page. Based on Arc90's readability project using cheerio engine.

License: Apache License 2.0

Makefile 0.37% JavaScript 99.63%

node-read's Introduction

Node-read

Get Readable Content from any page. Based on Arc90's readability project.

Features

  1. Blazingly Fast. This project is based on the Cheerio engine, which is about 8x faster than JSDOM.

Why not Node-readability

Before starting this project I used Node-readability, but the dependencies of that project plus the slowness of JSDOM made it very frustrating to work with. Compiling the contextify module (a dependency of JSDOM) failed 9 times out of 10. And if you wanted to use node-readability with node-webkit, you had to manually rebuild contextify with nw-gyp, which is not an optimal solution.

So I decided to write my own version of Arc90's Readability using the fast Cheerio engine with the least number of dependencies.

The usage of this module is similar to node-readability, so it's easy to switch.

Install

npm install node-read

Usage

read(html [, options], callback)

Where

  • html is a URL or HTML code
  • options is an optional options object
  • callback is the callback to run - callback(error, article, meta)

Example

var read = require('node-read');

read('http://howtonode.org/really-simple-file-uploads', function(err, article, res) {

  // Main Article.
  console.log(article.content);
  
  // Title
  console.log(article.title);

  // HTML 
  console.log(article.html);
  
  // DOM
  console.log(article.dom);
  
});
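For promise-based code, the callback signature documented above can be wrapped. The following is a minimal sketch, not part of node-read: `readAsync` is a hypothetical helper, and it takes the `read` function as an argument so that it only assumes the `read(html [, options], callback)` shape described in Usage.

```javascript
// Hypothetical helper: `readAsync` is NOT part of node-read.
// `read` is passed in, so the sketch works with any function that follows
// the documented read(html [, options], callback) signature.
function readAsync(read, html, options) {
  return new Promise(function (resolve, reject) {
    var done = function (err, article, meta) {
      if (err) return reject(err);
      resolve({ article: article, meta: meta });
    };
    // options is optional, so only pass it when the caller supplied one.
    if (options === undefined) {
      read(html, done);
    } else {
      read(html, options, done);
    }
  });
}

// Possible usage (requires node-read to be installed):
// var read = require('node-read');
// readAsync(read, 'http://howtonode.org/really-simple-file-uploads')
//   .then(function (result) { console.log(result.article.title); });
```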

TODO

  • Examples, Docs
  • Get Comments with articles
  • Get the Author of the article
  • Better removal of unnecessary nodes
  • Better scoring of content:
    • Based on siblings
    • Based on content length, common words
    • Link density, Image density, other common elements density

node-read's People

Contributors

abeltramo, bndr, bryant1410, scheeser, tjatse

node-read's Issues

Links inside article content removed

First off, node-read is an awesome project and has been working great! Thanks a ton.
This issue has come up a few times -- it seems like links inside article content get removed, which can be jarring when the link is wrapping a sentence mid-paragraph.
An example is this Vox article: http://www.vox.com/2015/7/27/9044485/rush-limbaugh-donald-trump
The original issue included two screenshots: the article in-browser and the same article in reader-mode.

I think the issue is this comparison not using the length of the link's inner-text: https://github.com/bndr/node-read/blob/master/lib/utils.js#L100

I'll make a PR shortly.
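To make the suggestion concrete: in the classic Readability approach, link density is the share of a node's text that sits inside links. The sketch below is an illustration under that assumption, not the code in lib/utils.js. `linkDensity` and its parameters are hypothetical names, and it works on plain strings rather than cheerio nodes.

```javascript
// Hypothetical helper, not node-read's getLinkDensity.
// totalText: all text inside the node.
// linkTexts: the inner text of each <a> element inside the node.
function linkDensity(totalText, linkTexts) {
  if (totalText.length === 0) return 0;
  // Sum the lengths of the link inner texts, not the number of links.
  var linkLength = linkTexts.reduce(function (sum, text) {
    return sum + text.length;
  }, 0);
  return linkLength / totalText.length;
}

var paragraph = 'A long paragraph with one linked phrase in the middle of it.';
console.log(linkDensity(paragraph, ['linked phrase']));
```

Measured this way, a short link wrapped around a few words of a long paragraph yields a low density, so the paragraph would pass a threshold such as the 0.25 used elsewhere in the project.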

Check the type of node, and get its Weight

/**
 * Check the type of node, and get its Weight
 **/
function initializeNode(node) {
  // An empty or missing selection contributes no weight.
  if (!node || node.length == 0) return 0;
  var tag = node.get(0).name;
  // indexOf returns -1 when the tag is absent, so compare explicitly.
  if (nodeTypes['mostPositive'].indexOf(tag) != -1) return 5 + getClassWeight(node);
  if (nodeTypes['positive'].indexOf(tag) != -1) return 3 + getClassWeight(node);
  if (nodeTypes['negative'].indexOf(tag) != -1) return -3 + getClassWeight(node);
  if (nodeTypes['mostNegative'].indexOf(tag) != -1) return -5 + getClassWeight(node);
  return -1;
}
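The explicit comparison against -1 matters because `indexOf` returns a position rather than a boolean. The sketch below shows this general JavaScript behavior. The tag list and both function names are hypothetical and are not node-read's `nodeTypes`.

```javascript
// Hypothetical tag list for illustration only; not node-read's nodeTypes.
var tags = ['div', 'article'];

// Misleading: index 0 is falsy and -1 (not found) is truthy.
function inListTruthy(tag) {
  return Boolean(tags.indexOf(tag));
}

// Correct: compare the returned index against -1.
function inListExplicit(tag) {
  return tags.indexOf(tag) != -1;
}

console.log(inListTruthy('div'), inListExplicit('div'));   // false true
console.log(inListTruthy('span'), inListExplicit('span')); // true false
```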

Why not len=80 & What is the use of linkDensity

  var linkDensity = getLinkDensity(node, $);
  var len = node.text().length;
  if (len < 3) return;

  if (len > 80 && linkDensity < 0.25) {
    append = true;
  } else if (len < 80 && linkDensity == 0 && node.text().replace(regexps.trimRe, "").length > 0) {
    append = true;
  }
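The boundary the title asks about can be checked in isolation. This is a minimal standalone sketch of the condition quoted above: `shouldAppend` is a hypothetical name, `hasText` stands in for the `trimRe` check, and it assumes `append` starts out false in the surrounding code, which the snippet does not show.

```javascript
// Hypothetical standalone version of the quoted condition.
// hasText stands in for node.text().replace(regexps.trimRe, "").length > 0.
function shouldAppend(len, linkDensity, hasText) {
  if (len < 3) return false;
  if (len > 80 && linkDensity < 0.25) return true;
  if (len < 80 && linkDensity == 0 && hasText) return true;
  // len === 80 matches neither branch above.
  return false;
}

console.log(shouldAppend(79, 0, true)); // true
console.log(shouldAppend(80, 0, true)); // false
console.log(shouldAppend(81, 0, true)); // true
```

Under these assumptions, a node whose text is exactly 80 characters long is never appended, whatever its link density.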

Content not extracted correctly

When parsing this page, I get unrelated content from another block, one that lists related articles.

read('http://www.tvnet.lv/zinas/arvalstis/507357-krievija_pie_ukrainas_robezas_sakoncentrejusi_200_tanku', function(err, article, res) {
  console.log(article.content);
});

Output (formatted to fit):

<div id="article" class="article">
<a href="http://www.tvnet.lv/zinas/arvalstis/507296-krievija_draud_ukraina_ievest_miera_uzturetajus" 
class="thumb330_4-3"><img src="http://itvnet.lv/article/zinas/507296_330x248.jpg" alt=""></a>
<p>Ukrainā attīstoties sliktākajam scenārijam, Maskava «atcerēsies» par Krievijas parlamenta 
augšpalātas doto atļauju ievest kaimiņvalstī armiju, paziņojis Krievijas vēstnieks ANO Vitālijs
Čurkins. Viņš piebildis, ka gadījumā, ja vardarbība Ukrainas dienvidaustrumos nerimsies, tad 
Krievija sasaukšot ANO Drošības padomes ārkārtas sēdi.</p> </div>

Error when doing require('node-read')

I'm getting:

Uncaught TypeError: Cannot convert undefined or null to object

Seems to happen here:

function getInverseObj(obj){
    return Object.keys(obj).sort().reduce(function(inverse, name){
        inverse[obj[name]] = "&" + name + ";";
        return inverse;
    }, {});
}

Without the require my project runs without an issue...
Am I missing some dependency or anything?
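The message matches what `Object.keys` throws when its argument is `undefined` or `null`, which suggests `getInverseObj` was called before its entity map was available. The sketch below reproduces the error and shows a guarded variant. `getInverseObjSafe` is a hypothetical name, and the guard only hides the symptom: why `obj` is missing in this environment is not established by the report.

```javascript
// Reproduce the reported error: Object.keys rejects undefined and null.
try {
  Object.keys(undefined);
} catch (e) {
  console.log(e instanceof TypeError); // true
}

// Hypothetical guarded variant of the function quoted above.
function getInverseObjSafe(obj) {
  // Assumption: treat a missing map as empty instead of throwing.
  if (obj === undefined || obj === null) return {};
  return Object.keys(obj).sort().reduce(function (inverse, name) {
    inverse[obj[name]] = "&" + name + ";";
    return inverse;
  }, {});
}

console.log(getInverseObjSafe({ amp: "&", lt: "<" }));
```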
