GithubHelp home page GithubHelp logo

simdrouin / npm-pdfreader Goto Github PK

View Code? Open in Web Editor NEW

This project forked from adrienjoly/npm-pdfreader

0.0 2.0 0.0 44 KB

Read text and parse tables from PDF files. Includes automatic column detection, and rule-based parsing.

Home Page: https://www.npmjs.com/package/pdfreader

License: MIT License

JavaScript 97.59% HTML 2.41%

npm-pdfreader's Introduction

pdfreader

Read text and parse tables from PDF files.

Supports tabular data with automatic column detection, and rule-based parsing.

This module is meant to be run using Node.js only. It does not work from a web browser.

Installation, tests and CLI usage

npm install pdfreader
cd node_modules/pdfreader
npm test
node parse.js test/sample.pdf

Raw PDF reading

The PdfReader class reads a PDF file, and calls a function on each item found while parsing that file.

An item object can match one of the following objects:

  • null, when the parsing is over, or an error occured.
  • {file:{path:string}}, when a PDF file is being opened.
  • {page:integer}, when a new page is being parsed, provides the page number, starting at 1.
  • {text:string, x:float, y:float, w:float, h:float...}, represents each text with its position.

Example:

new PdfReader().parseFileItems("sample.pdf", function(err, item){
  if (err)
    callback(err);
  else if (!item)
    callback();
  else if (item.text)
    console.log(item.text);
});

Example: parsing lines of text from a PDF file

example cv resume parse convert pdf to text

Here is the code required to convert this PDF file into text:

var pdfreader = require('pdfreader');

var rows = {}; // indexed by y-position

function printRows() {
  Object.keys(rows) // => array of y-positions (type: float)
    .sort((y1, y2) => parseFloat(y1) - parseFloat(y2)) // sort float positions
    .forEach((y) => console.log((rows[y] || []).join('')));
}

new pdfreader.PdfReader().parseFileItems('CV_ErhanYasar.pdf', function(err, item){
  if (!item || item.page) {
    // end of file, or page
    printRows();
    console.log('PAGE:', item.page);
    rows = {}; // clear rows for next page
  }
  else if (item.text) {
    // accumulate text items into rows object, per line
    (rows[item.y] = rows[item.y] || []).push(item.text);
  }
});

Fork this example from parsing a CV/résumé.

Example: parsing a table from a PDF file

example cv resume parse convert pdf table to text

Here is the code required to convert this PDF file into a textual table:

var pdfreader = require('pdfreader');

const nbCols = 2;
const cellPadding = 40; // each cell is padded to fit 40 characters
const columnQuantitizer = (item) => parseFloat(item.x) >= 20;

const padColumns = (array, nb) =>
  Array.apply(null, {length: nb}).map((val, i) => array[i] || []);
  // .. because map() skips undefined elements

const mergeCells = (cells) => (cells || [])
  .map((cell) => cell.text).join('') // merge cells
  .substr(0, cellPadding).padEnd(cellPadding, ' '); // padding

const renderMatrix = (matrix) => (matrix || [])
  .map((row, y) => padColumns(row, nbCols)
    .map(mergeCells)
    .join(' | ')
  ).join('\n');

var table = new pdfreader.TableParser();

new pdfreader.PdfReader().parseFileItems(filename, function(err, item){
  if (!item || item.page) {
    // end of file, or page
    console.log(renderMatrix(table.getMatrix()));
    console.log('PAGE:', item.page);
    table = new pdfreader.TableParser(); // new/clear table for next page
  } else if (item.text) {
    // accumulate text items into rows object, per line
    table.processItem(item, columnQuantitizer(item));
  }
});

Fork this example from parsing a CV/résumé.

Rule-based data extraction

The Rule class can be used to define and process data extraction rules, while parsing a PDF document.

Rule instances expose "accumulators": methods that defines the data extraction strategy to be used for each rule.

Example:

var processItem = Rule.makeItemProcessor([
  Rule.on(/^Hello \"(.*)\"$/).extractRegexpValues().then(displayValue),
  Rule.on(/^Value\:/).parseNextItemValue().then(displayValue),
  Rule.on(/^c1$/).parseTable(3).then(displayTable),
  Rule.on(/^Values\:/).accumulateAfterHeading().then(displayValue),
]);
new PdfReader().parseFileItems("sample.pdf", function(err, item){
  processItem(item);
});

npm-pdfreader's People

Contributors

adrienjoly avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.