adrienjoly / npm-pdfreader

🚜 Parse text and tables from PDF files.

Home Page: https://www.npmjs.com/package/pdfreader

License: MIT License

JavaScript 5.45% HTML 93.82% Rich Text Format 0.73%
data-extraction pdf-converter parsing javascript tabular-data pdf-reader parse-tables rule-based-parsing

npm-pdfreader's Issues

hi

code:
new pdfreader.PdfReader().parseFileItems(
  fileAllname,
  function (err, item) {
    if (item && item.page) {
      item5.allpage = item.page;
    } else if (item && item.text) {
      initlist.push(item.text);
    }
  }
);
console.log(initlist);

I assign values to my own variables inside the PDF-parsing callback and then print them, but the output still shows the initial values. How should I access the data once it has been processed?
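
The callback runs asynchronously, so the `console.log(initlist)` call executes before any item has arrived. Below is a minimal sketch of one way to wait, assuming (as another report on this page describes) that the callback is invoked with an undefined item once parsing is finished. `makeTextCollector` and `done` are hypothetical names, not part of pdfreader.

```javascript
// Hypothetical helper: builds a callback suitable for parseFileItems/parseBuffer.
// Assumption: the reader calls it with (err) on failure, with an item for each
// page marker or text element, and with no item once the file is fully parsed.
function makeTextCollector(done) {
  const texts = [];
  let lastPage = 0;
  let finished = false;

  return function onItem(err, item) {
    if (finished) return;            // ignore anything after completion
    if (err) {
      finished = true;
      done(err);
    } else if (!item) {
      finished = true;               // end of file: the data is now complete
      done(null, { texts, lastPage });
    } else if (item.page) {
      lastPage = item.page;          // page markers carry the page number
    } else if (item.text) {
      texts.push(item.text);
    }
  };
}
```

With the real library this could be passed as `new pdfreader.PdfReader().parseFileItems(fileAllname, makeTextCollector((err, result) => console.log(result.texts)))`, so the log only runs after the last item.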

Remote code execution

Lodash is vulnerable to remote code execution (RCE) due to the potential to modify the properties of objects in memory. A remote attacker could run arbitrary commands on a vulnerable server, or cause the server to crash, by maliciously crafting an object via the zip functionality of Lodash.

embarrassingly slow

Describe the bug
Parsing a pdf file containing 235 pages takes up to 8 seconds (just doing nothing with the received tokens - apparently the lexer alone takes up that much time) :-p

To Reproduce

const parseStart = process.hrtime();
new PdfReader().parseBuffer(result.data, (err, item) => {
  // the pdf reader signals the end of the parsing process
  // by calling this function with the item set as undefined
  if (!item) {
    const parseEnd = process.hrtime(parseStart);
    this.logger.log(`parse pdf completed in ${parseEnd[0]}.${Math.floor(parseEnd[1] / 10e6)}s`);

    observer.next(table);
    observer.complete();
    return;
  }
});

Expected behavior
I would expect a PDF file of this size to take no longer than 2 seconds to parse.

Screenshots, outputs or logs
parse pdf completed in 7.42s SkybriefingDaylightAdapter

Desktop (please complete the following information):

  • OS: Windows 10, but it doesn't matter; it's the same on a Linux virtual machine.

Additional context
I have attached a sample file.
pdf.pdf
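
A side note on the timing in the snippet above: `10e6` equals 10,000,000, so `parseEnd[1] / 10e6` yields hundredths of a second without zero padding (50 ms would be printed as `.5` rather than `.05`). A small sketch of a padded formatter follows; `formatHrtime` is a hypothetical helper name.

```javascript
// Formats a process.hrtime() tuple [seconds, nanoseconds] as "S.CC"
// (seconds with two zero-padded decimals, truncated rather than rounded).
function formatHrtime([seconds, nanoseconds]) {
  const hundredths = Math.floor(nanoseconds / 1e7); // 1e7 ns = 0.01 s
  return `${seconds}.${String(hundredths).padStart(2, "0")}`;
}

const start = process.hrtime();
// ... the work to be measured would go here ...
console.log(`elapsed: ${formatHrtime(process.hrtime(start))}s`);
```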

loadMetaData error: TypeError: Cannot read property 'metadata' of null

Hello!

I have been using pdfreader for some time, both locally on MacOS and in Docker, Node 15.14.0, and everything worked flawlessly. After updating project dependencies, I am getting the following output for any PDF files:

Warning: Setting up fake worker.
loadMetaData error: TypeError: Cannot read property 'metadata' of null
{
  parserError: "loadMetaData error: TypeError: Cannot read property 'metadata' of null"
}

I have created a clean project with the only dependency "pdfreader": "^1.2.12" and an example code from the documentation:

const { PdfReader } = require("pdfreader");
const fs = require("fs");

fs.readFile("./sample.pdf", (err, pdfBuffer) => {
  new PdfReader().parseBuffer(pdfBuffer, function (err, item) {
    if (err) console.log(err);
    else if (!item) console.log("no item");
    else if (item.text) console.log(item.text);
  });
});

...and I am still getting the same message. I played with versions of pdfkit and pdf2json, but that did not solve the problem.

Update: this error appears on 1.2.12. With 1.2.11, it works properly.

pdf2json dependency is broken

pdf2json update from 1.2.5 to 1.3.0 removed formImage, which pdfreader requires, without incrementing the major version number as they should with incompatible changes. NPM thinks that 1.2.5 and 1.3.* are compatible. As a temporary patch, you could set the dependency in package.json to something like <=1.2.5

Pdfreader cannot be used with Electron (because of `async` variable in pdf2json)

We wanted to test the library in Electron, but it doesn't import the async library properly; see pdfparser.js, line 6:

async = require("async");

As you may know, async is a reserved word in Chrome, and the code imports the async library by assigning to async without var or let in front of it, which overwrites the reserved word.

pdfreader.parseFileItems throws an error that cannot be caught?

pdf url : http://www.hkexnews.hk/listedco/listconews/sehk/2018/0228/LTN20180228058_C.pdf

Error: Required "glyf" or "loca" tables are not found
at error (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :193:7)
at Font_checkAndRepair [as checkAndRepair] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :12213:11)
at new Font (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :10756:21)
at PartialEvaluator_translateFont [as translateFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :8161:14)
at PartialEvaluator_loadFont [as loadFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7311:29)
at PartialEvaluator_handleSetFont [as handleSetFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7154:23)
at PartialEvaluator_getOperatorList [as getOperatorList] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7470:37)
at Object.eval [as onResolve] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :4345:26)
at Object.runHandlers (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :864:35)

trouble parsing files created by wkhtmltopdf

hi adrien

created a pdf file from your sample.html by wkhtmltopdf.
unfortunately i can not parse it properly with your test.js.
just log item.text.

problem is, i have to parse generated files.
generated file is attached.

do you have any clue?

thanks in advance.

greets
zorla

sample3.pdf

having trouble parsing data that extends between pages

so far, usage of this library has been really good, but I've run into an issue. basically I have a table I'm parsing data from (not using pdfreader.TableParser right now) that is split between two pages.

when I parse through each page, I use logic that finds the heading of the title to determine where it begins, and the heading of the next table to determine where it ends.

if I cannot parse across both pages, I cannot get all the data from the table.

from my understanding, I am looping through each page in my below code. I would love any suggestions as I've sort of hit a roadblock here.

please note that I'm using the Serverless framework and invoking it that way; sorry if it's very unrepeatable for anybody.

code:

function getNextIndexItem(rows, num, currentItem) {
  /*
  get the item that appears next in the array passed into the rows param.
  param rows: array of rows parsed on the page
  param num: number of indexes past the current index (currentItem)
  param currentItem: the index of the current item in the array
  */ 
  let keys = Object.keys(rows);
  let nextIndex = keys.indexOf(currentItem) + num;
  let nextItem = keys[nextIndex];
  let nextField = (rows[nextItem] || []).join(' ');
  let finalStr = nextField.split(':')[nextField.split(':').length-1];

  return finalStr;
}

function pdfReader(pdfFilePath, parsedData, callback) {
  const pdfreader = require("pdfreader");

  let rows = {}; // indexed by y-position
  let tableIndexes = 0;

  new pdfreader.PdfReader().parseFileItems(pdfFilePath, (err, item) => {
    if (err) return callback(err);

    if (item) {
      if (item.page) {
        // end of file, or page
        Object.keys(rows) // => array of y-positions (type: float)
          .sort((y1, y2) => parseFloat(y1) - parseFloat(y2)) // sort float positions
          .forEach(yValue => {
            // rows[y] is an array of text for a line.
            let line = (rows[yValue] || []).join('');  // construct line of text

            if (line.includes('Table Name')) {
              tableIndexes = 0;
              for (let i = 0; i < 500; i++) {
                if (getNextIndexItem(rows, i, yValue).includes('Next Table Name')) {
                  tableIndexes = i;  // get index of last table row
                  break;
                }
              }
              for (let i = 2; i < tableIndexes; i++) {  // start at 2 to avoid the heading row of the table
                console.log(`List row #${i}: ${getNextIndexItem(rows, i, yValue)}`);
              }
            }
          });
        rows = {}; // clear rows for next page
      } else if (item.text) {
        if (!rows[item.y]) {
          rows[item.y] = [];
        }
        rows[item.y].push(item.text);
      }
    } else {
      // we're done here
      callback(null, parsedData);
    }
  });
};

module.exports.test = function() {
  pdfReader('doc.pdf', {}, (err, data) => {
    if (err) console.log(err);
  });
}

here's the table in the PDF:
[screenshot: Capture]
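
Because `rows` is cleared at every page marker, a table that continues on the next page is cut in two. One possible approach is to keep every line of the document in a single list ordered by page and then by y, and to look for the table boundaries in that combined list. The sketch below assumes items arrive as `{ page }` markers followed by `{ text, y }` items with numeric `y`, the shape used in the code above; `collectLines` and `sliceBetween` are hypothetical helpers.

```javascript
// Builds one ordered list of text lines for the whole document.
function collectLines(items) {
  const byKey = new Map(); // "page|y" -> { page, y, parts }
  let page = 0;
  for (const item of items) {
    if (item.page) {
      page = item.page;
    } else if (item.text) {
      const key = `${page}|${item.y}`;
      if (!byKey.has(key)) byKey.set(key, { page, y: item.y, parts: [] });
      byKey.get(key).parts.push(item.text);
    }
  }
  return [...byKey.values()]
    .sort((a, b) => a.page - b.page || a.y - b.y)
    .map((line) => line.parts.join(""));
}

// Returns the lines strictly between the first line containing `startMarker`
// and the next line containing `endMarker` (or the end of the document).
function sliceBetween(lines, startMarker, endMarker) {
  const start = lines.findIndex((l) => l.includes(startMarker));
  if (start === -1) return [];
  const end = lines.findIndex((l, i) => i > start && l.includes(endMarker));
  return lines.slice(start + 1, end === -1 ? lines.length : end);
}
```

In the parsing callback, each item would be pushed onto an array, and both helpers would run once the callback receives no item.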

Fails when uploading file that contains comments within PDF

When uploading and processing a PDF that contains comments, pdfreader is unable to handle the request, and my backend Node service fails. I can call PdfReader().parseBuffer(file, function(err, item) { ... }) on the buffered file, and it reads the file and the first item, but it fails after that.

Is this a known bug? If so, is there any way to handle it, or to detect that the file has comments and return an error? I've tried some workarounds, but the service fails every time.

Update pdf2json to version 1.1.9

The pdf2json package uses an old version of lodash (4.15) that has known vulnerabilities. Please update pdf2json to version 1.1.9, which fixes the issue.

Unable to catch Parse error

Hi @adrienjoly ,

I am using pdfreader to parse ### pdf documents. However, if my application runs into a runtime error while parsing a PDF, I want to apply specific handling. Below are the code and the exception trace from reading a PDF document. The issue is that the error is not caught by the if (err) condition. Am I missing anything in catching the exception shown below?

Thanks,
Ji

Code snippet and exception trace:

function readPDFPages(buffer, reader = (new PdfReader())) {

  console.log('reading pdf pages: ');
  console.log(buffer);

  return new Promise((resolve, reject) => {
    let pages = [];
    reader.parseBuffer(buffer, (err, item) => {

      if (err) {
        console.log("err in parsed buffer");
        console.log(err);
        reject(err)
      }
      else if (!item)
        resolve(pages);

      else if (item.page)
        resolve(pages);
    });
  });

}

Exception trace:
Error: Illegal character: 41
    at error (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:195:9)
    at Lexer_getObj [as getObj] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24616:11)
    at Parser_shift [as shift] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24038:32)
    at Parser_makeStream [as makeStream] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24195:12)
    at Parser_getObj [as getObj] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24079:18)
    at XRef_fetch [as fetch] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5753:22)
    at XRef_fetchIfRef [as fetchIfRef] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5699:19)
    at Dict_get [as get] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4759:28)
    at Page_getPageProp [as getPageProp] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4213:28)
    at Page.get content [as content] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4227:19)

Missing text height

Hello, I am using the code snippet from the documentation to parse lines of text from pdf page.
The lines of text are getting parsed, however for some reason the height / h property is missing from item.
I need the height in order to detect the text getting out of bound of a certain box on the pdf page.

Here is the code snippet that I used:
let rows = {};
let addressRows = [];

const printRows = () => {
  Object.keys(rows)
    .sort((y1, y2) => parseFloat(y1) - parseFloat(y2))
    .forEach((y) => {
      addressRows.push((rows[y] || []).join(''));
    });
}

new pdfreader.PdfReader().parseFileItems(tempPath, function (err, item) {
  if (!item || item.page) {
    printRows();
    if (!item) {
      console.log('addressRows: ', addressRows)
    }
  } else if (item.text) {
    console.log(item);

    // accumulate text items into rows object, per line
    (rows[item.y] = rows[item.y] || []).push(item.text);
  }

});

Here is the log I received on terminal.

[screenshot: Screen Shot 2020-12-12 at 5 34 23 PM]

Any idea why the height/h property is missing?
Thank you.

parseTable is not working

When I run "node parse.js test/sample.pdf" I see that parseTable is not working, and table from sample file is not parsed.

Check that error handling still works after upgrade to pdf2json v1.1.7 (PR #25)

In PR #25 (merged today without publishing a new version on npm), @noshadil upgraded pdf2json to v1.1.7.

This upgrade changes the parameters of the two top-level events triggered by pdf2json: pdfParser_dataError and pdfParser_dataReady, both handled by pdfreader.

In his PR, @noshadil did update the handler for pdfParser_dataReady accordingly, but not the one for pdfParser_dataError. => It may mean that top-level error handling is broken in the current build, but I don't have enough time to check this at that point.

Next steps:

  • create an automated test to check that top-level errors can be caught. It should pass with pdf2json v1.1.2 (used in the previous build of pdfreader)
  • if that test does not pass with pdf2json v1.1.7, fix error handling and propose a PR
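
The first step could look roughly like the following. This is only a sketch: `bridgeParserEvents` is a hypothetical stand-in for pdfreader's internal wiring, the event names are the two mentioned above, and the idea that newer pdf2json versions wrap the error as `{ parserError }` is an assumption based on the error output quoted in another report on this page, not on the pdf2json documentation. A plain `EventEmitter` plays the role of the parser, so the check runs without a PDF.

```javascript
const { EventEmitter } = require("events");

// Hypothetical bridge: forwards parser events to a Node-style callback.
function bridgeParserEvents(parser, callback) {
  parser.on("pdfParser_dataError", (payload) => {
    // Accept both a bare error and a { parserError } wrapper (assumed shapes).
    const err = payload && payload.parserError ? payload.parserError : payload;
    callback(err || new Error("unknown parser error"));
  });
  parser.on("pdfParser_dataReady", (data) => callback(null, data));
}

// Minimal check that a top-level error reaches the callback.
const fakeParser = new EventEmitter();
const received = [];
bridgeParserEvents(fakeParser, (err, data) => received.push({ err, data }));
fakeParser.emit("pdfParser_dataError", { parserError: "boom" });
fakeParser.emit("pdfParser_dataError", new Error("bare"));
console.log(received.map((r) => String(r.err)));
```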

Pdfreader from Electron 15 not working

const fs = require('fs');
const pdfreader = require("pdfreader");
fs.readFile('./test.pdf', function (err, buffer) {
  if (err) return console.log(err);
  new pdfreader.PdfReader().parseBuffer(buffer, function (err, item) {
    if (err) console.log(err);
    else if (!item) console.log("done");
    else if (item.text) console.log(item.text);
  });
});

VM1448:195 Uncaught Error: No PDFJS.workerSrc specified
at error (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :195:9)
at new WorkerTransport (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :42961:9)
at Object.getDocument (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :42559:15)
at PDFJSClass.parsePDFData (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:224)
at PDFParser.#startParsingPDF (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\pdfparser.js:85)
at PDFParser.parseBuffer (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\pdfparser.js:142)
at PdfReader.parseBuffer (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\PdfReader.js:72)
at C:\Users\dell\agspdftoexcel\app.js:11
at FSReqCallback.readFileAfterClose [as oncomplete] (node:internal/fs/read_file_context:68)

modifying properties of Obj

Minimist is a parse argument options module. Affected versions of this package are vulnerable to Prototype Pollution. The library could be tricked into adding or modifying properties of Object

Xmldom is used as an XML parser and serializer. The library could be tricked into adding or modifying the XML.

[QUESTION] How to get raw text from PDF

I am trying to parse a PDF and categorize information based on text formatting/decoration. How do you suggest I do that?
For example, I have a PDF in which this structure is repeated:
S.No. BOLD+UNDERLINED TITLE para

How do I categorize this data into an array of objects:
[ { sno: "", title: "", desc: "" }, ... ]
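
pdfreader items include the pdf2json text runs in `item.R` (visible in the item dumps elsewhere on this page). Assuming each run carries a `TS` style array of the form `[fontFaceId, fontSize, bold, italic]`, which is how the pdf2json README describes it as far as I know, bold titles can be told apart from body text. Underlining is not part of that array, so this sketch relies on bold only. `isBold` and `toRecords` are hypothetical helpers, and the serial-number pattern (digits followed by a dot) is a guess at the document's layout.

```javascript
// True when the first text run of an item is flagged bold (TS[2] === 1).
function isBold(item) {
  const run = item.R && item.R[0];
  return Boolean(run && Array.isArray(run.TS) && run.TS[2] === 1);
}

// Groups a flat list of text items into { sno, title, desc } records.
function toRecords(items) {
  const records = [];
  let current = null;
  for (const item of items) {
    const text = (item.text || "").trim();
    if (!text) continue;
    if (/^\d+\.$/.test(text)) {
      // A serial number such as "1." starts a new record.
      current = { sno: text, title: "", desc: "" };
      records.push(current);
    } else if (current && isBold(item)) {
      current.title = (current.title + " " + text).trim();
    } else if (current) {
      current.desc = (current.desc + " " + text).trim();
    }
  }
  return records;
}
```

Text that appears before the first serial number is ignored in this sketch.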

Does not work in node v16.10.0

Describe the bug
When running with node v16.10.0, the methods parseBuffer and parseFileItems do not work as expected.

To Reproduce
Use node version 16.10.0 and try to read the text of a pdf using parseBuffer or parseFileItems

Expected behavior
The callback passed to parseBuffer or parseFileItems should be called with each pdf item found.

Current behavior
The callback passed to parseBuffer or parseFileItems only gets called once with the file: { path } or file: { buffer } data, and never gets called with the pdf items or pages.

Screenshots, outputs or logs

Desktop (please complete the following information):

  • OS: macOS Big Sur 11.4
  • Browser: node.js
  • Version: 16.10.0

Additional Info

The same code works correctly in at least node 16.3.0

Doesn't work on PDFs converted online

Hi,

This npm module is awesome, I really love it :)

However I do have one thing of note. See here, this module works on pdfs exported via Word, Excel, PowerPoint but not on PDFs that were generated from online sources (e.g. online2pdf). Is there a reason for this?

Thanks.

Async reading

I am using a Node.js Express server. On a GET request I want to respond with the parsed content of a PDF file, but parsing is asynchronous.
new pdfreader.PdfReader().parseFileItems('CHK.pdf', function (err, item) {
  if (!item || item.page) {
    res.send(printRows());
  }
});

Is there any way to wait, until all the pages will be read and only then send response back to client?
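
One way to wait is to wrap the parse in a Promise that resolves only when the callback receives no item, which another report on this page describes as the end-of-parsing signal. The sketch below works under that assumption; `readAllText` is a hypothetical helper that takes the reader as a parameter, so any object with a `parseFileItems(path, callback)` method works.

```javascript
// Resolves with the text items of each page once the whole file is parsed.
function readAllText(reader, filePath) {
  return new Promise((resolve, reject) => {
    const pages = [];
    reader.parseFileItems(filePath, (err, item) => {
      if (err) reject(err);
      else if (!item) resolve(pages);             // end of file
      else if (item.page) pages.push([]);         // start a new page
      else if (item.text) {
        if (pages.length === 0) pages.push([]);   // text before any page marker
        pages[pages.length - 1].push(item.text);
      }
    });
  });
}
```

In an Express handler this could be used as `readAllText(new pdfreader.PdfReader(), 'CHK.pdf').then((pages) => res.send(pages)).catch(next)`, so the response is sent exactly once, after the last page.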

How to get the metadata(author, name) from the file

How can I get metadata from the file, for example author, name, etc.?
I'm trying something like:

const { PdfReader } = require("pdfreader");
const fs = require("fs");

fs.readFile("example.pdf", (err, pdfBuffer) => {
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, function (err, item) {
    if (err) {
      console.log("Error", err);
      return false;
    }

    if (item && item.file) { // <-- item.file not exists
    }

    if (item) {
      console.log(Object.keys(item)); // <-- has nothing to do with metadata
    }

  });
});

Outdated version at NPM

Would it be possible to push latest version to NPM? The latest available there seems to be 1.0.7 while current one on github is 1.1.3.

Doesn't work with latest updates

I don't know why neither 'pdf-to-text' nor 'pdfreader' works, even though the requirements are met as far as I know.

Briefly, the "require" function is missing. Even after installing "require", "requirejs" and "require.js" with npm, and adding a script tag for require.js to my HTML, I get the error below. Here is the CodePen, or a more explanatory Stack Overflow question.

[screenshot: ekran resmi 2017-03-14 14 32 09]

PS: I tried adding /, ', and combinations of them at both the beginning and end of the paths inside the require calls, but nothing has worked yet.

Some characters are missing / corrupt (e.g. ligatures)

First, I just want to thank you for creating this package. It's really helped us.

Describe the bug
While most of the text is there, a few characters are missing from my PDF.

Here's the PDF. It was produced by using a headless Chrome 67.0.3396.87 on Ubuntu to print the screen to PDF.
Scenario-4.1-RiskTables-FQA.pdf

To Reproduce
Here's a minimalist test:

const PdfReader = require("pdfreader").PdfReader;
const fs = require("fs");
const path = require("path");

const filename = path.join("c:", "temp", "Scenario-4.1-RiskTables-FQA.pdf");
console.log("Reading " + filename + "...");

new Promise((resolve, reject) => {
    let pdfText = "";
    fs.readFile(filename, (err, pdfBuffer) => {
        console.log("Found buffer with " + pdfBuffer.length + " bytes.");
        new PdfReader().parseBuffer(pdfBuffer, function(err, item){
            if (err) {
                reject(err);
            } else if (!item) {
                resolve(pdfText);
            } else if (item.text) {
                //console.log("Found item: " + JSON.stringify(item));
                pdfText += item.text;
            }
        });
    });
}).then((pdfText) => {
    console.log("Found PDF Text: " + pdfText);
}).catch(e => {
    console.log("ERROR", e);
});

Expected behavior
I would expect to see all of the characters. Open the PDF and you'll notice the sentence "Effective RMP:" on the first page just above "Default 5x5 RMP V1.0". In the text that gets exported from the file, it says "E ective RMP".

Screenshots, outputs or logs
Here's the log of what this program produces for me:

Reading c:\temp\Scenario-4.1-RiskTables-FQA.pdf...
Found buffer with 69972 bytes.
Found PDF Text: 8/29/2018QbDVisionRiskTablesabout:blank1/3QbDVisionExportedBy:RyanRocketExportDate:Aug29,2018at1:44pm G MT C ompany:RocketsRUSProject:PRJ-6-PrintTestProjectReportDate:Aug29,2018RiskTablesReportī‚†FQARiskTableAsofAug29,2018at11:59pm G MTRiskTable:FQARiskTableDate:Aug29,2018E ectiveRMP:Default5x5RMPV1.08/29/2018QbDVisionRiskTablesabout:blank2/3FQA-32-Appearance[NOTAPPROVED]1(1%) C olor,shapeandappearancearenotdirectlylinkedtosafetyande cacy.Therefore,theyarenotcritical.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-40-Assay[NOTAPPROVED]100(100%)Processvariablesmaya ecttheassayofthedrugproduct.1000(100%)10000(100%)IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-95-Overdosage[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-52- C ontainer C losureSystem[NOTAPPROVED]100(100%)Packagingoptionshavenotbeenidenti ed1000(100%)10000(100%)SuitablepackagingoptionswillbeinvestigatedduringdevelopmentprocessNone C M-78-NA[NOTAPPROVED]TPP-101-HowSupplied/StorageandHandling[NOTAPPROVED]FQA-45- C ontentUniformity[NOTAPPROVED]100(100%)Variabilityincontentuniformitywilla ectsafetyande cacy.1000(100%)10000(100%)Bothformulationandprocessvariablesimpactcontentuniformity,sothis C QAwillbeevaluatedthroughoutproductandprocessdevelopment.ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-95-Overdosage[NOTAPPROVED]FQA-42-DegradationProducts[NOTAPPROVED]100(100%)Formulationandprocessvariablescanimpactdegradationproducts.1000(100%)10000(100%)Degradationproductswillbeassessedduringproductandprocessdevelopment.IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]TPP-101-HowSupplied/StorageandHandling[NOTAPPROVED]FQA-47-Dissolution[NOTAPPROVED]100(100%)Bothformulationandprocessvariablesa ectthedissolutionpro le.1000(100%)10000(100%)This C 
QAwillbeinvestigatedthroughoutformulationandprocessdevelopment.ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-97- C linicalPharmacology[NOTAPPROVED]FQA-37-Friability[NOTAPPROVED]25(25%)AtargetofNMT1.0%w/wofmeanweightlossassuresalowimpactonpatientsafetyande cacyandminimizescustomercomplaints.250(25%)2500(25%)ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-97- C linicalPharmacology[NOTAPPROVED]FQA-38-Identi cation[NOTAPPROVED]100(100%)Identi cationiscriticalforsafetyande cacy.1000(100%)10000(100%)IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]FQA-50-MicrobialLimits[NOTAPPROVED]10(10%)Non-compliancewithmicrobiallimitswillimpactpatientsafety.However,inthiscase,theriskofmicrobialgrowthisverylowbecauserollercompaction(drygranulation)isutilizedforthisproduct.Therefore,this C QAwillnotbediscussedindetailduringformulationandprocessdevelopment.100(10%)1000(10%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-33-Odor[NOTAPPROVED]1(1%)Ingeneral,anoticeableodorisnotdirectlylinkedtosafetyande cacy,butodorcana ectpatientacceptability.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-49-ResidualSolvents[NOTAPPROVED]5(5%)Residualsolventscanimpactsafety.However,nosolventisusedinthedrugproductmanufacturingprocessandthedrugproductcomplieswithUSP<467>Option1.Therefore,formulationandprocessvariablesareunlikelytoimpactthis C QA.50(5%)500(5%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-35-Score C on guration[NOTAPPROVED]1(1%)Scorecon gurationisnotcriticalfortheacetriptantablet.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-34-Size[NOTAPPROVED]1(1%)SeeTargetJusti cation10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQAī…• C riticalityī… C riticalityJusti cationī…ProcessRiskī…RPNī…RecommendedActionsī… C ontrolStrategyī… C 
ontrolMethodsī…TPPLinksī…8/29/2018QbDVisionRiskTablesabout:blank3/3Ā©2018 C herry C ircleSoftware,Inc.FQA-43-Water C ontent[NOTAPPROVED]25(25%)However,inthiscase,acetriptanisnotsensitivetohydrolysisandmoisturewillnotimpactstability.250(25%)2500(25%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]FQAī…• C riticalityī… C riticalityJusti cationī…ProcessRiskī…RPNī…RecommendedActionsī… C ontrolStrategyī… C ontrolMethodsī…TPPLinksī…

Desktop (please complete the following information):

  • OS: Ubuntu 14, running in a docker container
  • Browser: Chrome 67.0.3396.87
  • Version: pdfreader v0.2.5

Additional context
Thank you again for creating this package.

Invalid XRef stream header

Describe the bug
Unable to process PDF

To Reproduce

  1. Feed in the PDF as a buffer
  2. Attempt to extract text from the PDF

Expected behavior
Extract text from PDF

Screenshots, outputs or logs

    (while reading XRef): Error: Invalid XRef stream header

      at XRef_readXRef [as readXRef] (eval at Object.<anonymous> (node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5682:9)

  console.log
    XRefParseException: 
        at XRefParseExceptionClosure (eval at Object.<anonymous> (/Users/tsopic/telegram_bot/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:379:34)
        at eval (eval at Object.<anonymous> (/Users/tsopic/repo/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:384:3)

Desktop (please complete the following information):

NODE - 14

tested on both mac and linux

locate absolute position of item

I'm trying to find a string in a PDF and get its x and y location on the page.
It seems item.x and item.y are relative to the item "above", and I can't work out which x to add to get the absolute position of an item.
Is there any way?

Having some trouble with parseTable

Hi there,

Think you got a good idea here, but I'm trying to figure out how to correctly parse a table. I don't think the displayTable() you have in your test file is logging.. I'm just having trouble figuring out the pattern. Anyways, do you have any advice for me?

Thanks in advance and I hope you have good day :)

Troy

var _ = require('lodash');
var PdfReader = require('pdfreader').PdfReader;
var Rule = require('pdfreader').Rule;

function displayTable(table){
    console.log('Object.keys(table)',Object.keys(table));
    _.map(table.rows, function(row){
        console.log('row',row);
    });
}
var sampleRules = [
    Rule.on(/^c1$/).parseTable(3).then(displayTable)
  ];
var processItemSample = Rule.makeItemProcessor(sampleRules);

var samplePathToPdf = __dirname + '/sample.pdf';
new PdfReader().parseFileItems(samplePathToPdf, function(err, item){
    if (err){
        console.log(err);
    }
    else {
        processItemSample(item);
    }
});

Here is my output

Object.keys(table) [ 'items', 'rows', 'matrix' ]
row [ { x: 20.408,
    y: 10.501,
    w: 0.9436,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'c2' },
  { x: 28.299,
    y: 10.501,
    w: 0.9436,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'c3' },
  { x: 14.979,
    y: 11.447,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '1' },
  { x: 29.249,
    y: 11.447,
    w: 1.25,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '2.3' } ]
row [ { x: 19.513,
    y: 12.363,
    w: 2,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'hello' },
  { x: 27.068,
    y: 12.363,
    w: 2.333,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'world' },
  { x: 12.964,
    y: 13.248,
    w: 3.055,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'Values:' } ]
row [ { x: 12.964,
    y: 14.835,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '1' },
  { x: 12.964,
    y: 16.423,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '2' } ]
row [ { x: 12.964,
    y: 18.01,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '3' } ]

The automated release is failing 🚨

🚨 The automated release from the master branch failed. 🚨

I recommend you give this issue a high priority, so other packages depending on you could benefit from your bug fixes and new features.

You can find below the list of errors reported by semantic-release. Each one of them has to be resolved in order to automatically publish your package. I’m sure you can resolve this 💪.

Errors are usually caused by a misconfiguration or an authentication problem. With each error reported below you will find explanation and guidance to help you to resolve it.

Once all the errors are resolved, semantic-release will release your package the next time you push a commit to the master branch. You can also manually restart the failed CI job that runs semantic-release.

If you are not sure how to resolve this, here is some links that can help you:

If those don’t help, or if this issue is reporting something you think isn’t right, you can always ask the humans behind semantic-release.


No npm token specified.

An npm token must be created and set in the NPM_TOKEN environment variable on your CI environment.

Please make sure to create an npm token and to set it in the NPM_TOKEN environment variable on your CI environment. The token must allow to publish to the registry https://registry.npmjs.org/.


Good luck with your project ✨

Your semantic-release bot 📦🚀

Wrong file content reading files with same path

I'm writing an application that reads the content of some files in a directory. Files are meant to be replaced (same filename but different content).

If I use parseFileItems two times with the same path but different files the result is always the content of the old file.

I solved reading the file content with fs.readFile and passing the buffer to parseBuffer.

Your source code looks fine to me, maybe it'is a problem with pdf2json/pdfparser but I'm not sure so I'm reporting to you.
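
The workaround described above can be sketched as follows. `parseFreshFile` is a hypothetical wrapper; it only assumes a reader object exposing `parseBuffer(buffer, callback)`, the method the report mentions.

```javascript
const fs = require("fs");

// Reads the file from disk on every call, then hands the bytes to the reader,
// so the result never depends on anything remembered for that path.
function parseFreshFile(reader, filePath, onItem) {
  fs.readFile(filePath, (err, buffer) => {
    if (err) return onItem(err);
    reader.parseBuffer(buffer, onItem);
  });
}
```

With the real library the first argument would be `new PdfReader()`, and `onItem` would be the usual `(err, item)` callback.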
