adrienjoly / npm-pdfreader
Parse text and tables from PDF files.
Home Page: https://www.npmjs.com/package/pdfreader
License: MIT License
I assign a value to my custom data inside the PDF parsing callback and then print it, but it still holds its initial value. How should I access the processed data?
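One likely cause, assuming the value is printed right after the parser is called: the callback runs asynchronously, so the data is only complete once the callback fires without an item, which other snippets on this page treat as the end-of-file signal. A minimal sketch that wraps this in a Promise. collectText is a hypothetical helper, not part of pdfreader, and reader is assumed to be a PdfReader instance exposing parseBuffer(buffer, callback):

```javascript
// Hypothetical helper: resolves with all text items once parsing has finished.
// `reader` is assumed to expose parseBuffer(buffer, callback), like PdfReader.
function collectText(reader, buffer) {
  return new Promise((resolve, reject) => {
    const texts = [];
    reader.parseBuffer(buffer, (err, item) => {
      if (err) reject(err);
      else if (!item) resolve(texts); // no item: parsing is complete
      else if (item.text) texts.push(item.text);
    });
  });
}

// Possible usage (assumes pdfreader is installed and pdfBuffer holds the file):
// const { PdfReader } = require("pdfreader");
// collectText(new PdfReader(), pdfBuffer).then((texts) => console.log(texts));
```

Any code that needs the processed data then goes inside the then handler (or after an await), not directly after the call.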
Lodash is vulnerable to remote code execution (RCE) due to the potential to modify the properties of objects in memory. A remote attacker could run arbitrary commands on a vulnerable server, or cause the server to crash, by maliciously crafting an object via the zip functionality of Lodash.
Describe the bug
Parsing a pdf file containing 235 pages takes up to 8 seconds (just doing nothing with the received tokens - apparently the lexer alone takes up that much time) :-p
To Reproduce
const parseStart = process.hrtime();
new PdfReader().parseBuffer(result.data, (err, item) => {
// the pdf reader signals the end of the parsing process
// by calling this function with the item set as undefined
if (!item) {
const parseEnd = process.hrtime(parseStart);
this.logger.log(`parse pdf completed in ${parseEnd[0]}.${String(Math.floor(parseEnd[1] / 1e7)).padStart(2, "0")}s`);
observer.next(table);
observer.complete();
return
}
});
Expected behavior
I would expect to have a pdf file of this size to have no longer than 2 seconds to parse.
Screenshots, outputs or logs
parse pdf completed in 7.42s SkybriefingDaylightAdapter
Desktop (please complete the following information):
Additional context
I have attached a sample file.
pdf.pdf
Hello!
I have been using pdfreader for some time, both locally on macOS and in Docker (Node 15.14.0), and everything worked flawlessly. After updating project dependencies, I am getting the following output for any PDF file:
Warning: Setting up fake worker.
loadMetaData error: TypeError: Cannot read property 'metadata' of null
{
parserError: "loadMetaData error: TypeError: Cannot read property 'metadata' of null"
}
I have created a clean project with the only dependency "pdfreader": "^1.2.12" and the example code from the documentation:
const { PdfReader } = require("pdfreader");
const fs = require("fs");
fs.readFile("./sample.pdf", (err, pdfBuffer) => {
new PdfReader().parseBuffer(pdfBuffer, function (err, item) {
if (err) console.log(err);
else if (!item) console.log("no item");
else if (item.text) console.log(item.text);
});
});
...and I am still getting the same message. I played with the versions of pdfkit and pdf2json, but that did not solve the problem.
Update: this error appears on 1.2.12. With 1.2.11, it works properly.
I am not able to find out when a file has finished being read. I need to call a function once reading is complete.
pdf2json update from 1.2.5 to 1.3.0 removed formImage, which pdfreader requires, without incrementing the major version number as they should with incompatible changes. NPM thinks that 1.2.5 and 1.3.* are compatible. As a temporary patch, you could set the dependency in package.json to something like <=1.2.5
We wanted to test the library in Electron but it doesn't import properly the async library, pdfparser.js line 6:
async = require("async");
As you may know async is a reserved word in Chrome and you guys are trying to import async library by overwriting the async reserved word without var or let in front of the async var.
pdf url : http://www.hkexnews.hk/listedco/listconews/sehk/2018/0228/LTN20180228058_C.pdf
Error: Required "glyf" or "loca" tables are not found
at error (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :193:7)
at Font_checkAndRepair [as checkAndRepair] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :12213:11)
at new Font (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :10756:21)
at PartialEvaluator_translateFont [as translateFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :8161:14)
at PartialEvaluator_loadFont [as loadFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7311:29)
at PartialEvaluator_handleSetFont [as handleSetFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7154:23)
at PartialEvaluator_getOperatorList [as getOperatorList] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7470:37)
at Object.eval [as onResolve] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :4345:26)
at Object.runHandlers (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :864:35)
Hi, I have a PDF file which can be opened and copied.
But it cannot be read. Please help me, thank you!
4443.pdf
hi adrien
created a pdf file from your sample.html by wkhtmltopdf.
unfortunately i can not parse it properly with your test.js.
just log item.text.
problem is, i have to parse generated files.
generated file is attached.
do you have any clue?
thanks in advance.
greets
zorla
I have a file that cannot be read. Can you take a look?
cv_81_vietnamworks_11121.pdf
so far, usage of this library has been really good, but I've run into an issue. basically I have a table I'm parsing data from (not using pdfreader.TableParser right now) that is split between two pages.
when I parse through each page, I use logic that finds the heading of the title to determine where it begins, and the heading of the next table to determine where it ends.
if I cannot parse across both pages, I cannot get all the data from the table.
from my understanding, I am looping through each page in my below code. I would love any suggestions as I've sort of hit a roadblock here.
please note that I'm using the Serverless framework and invoking it that way; sorry if it's very unrepeatable for anybody.
code:
function getNextIndexItem(rows, num, currentItem) {
/*
get the item that appears next in the array passed into the rows param.
param rows: array of rows parsed on the page
param num: number of indexes past the current index (currentItem)
param currentItem: the index of the current item in the array
*/
let keys = Object.keys(rows);
let nextIndex = keys.indexOf(currentItem) + num;
let nextItem = keys[nextIndex];
let nextField = (rows[nextItem] || []).join(' ');
let finalStr = nextField.split(':')[nextField.split(':').length-1];
return finalStr;
}
function pdfReader(pdfFilePath, parsedData, callback) {
const pdfreader = require("pdfreader");
let rows = {}; // indexed by y-position
let tableIndexes = 0;
new pdfreader.PdfReader().parseFileItems(pdfFilePath, (err, item) => {
if (err) callback(err);
if (item) {
if (item.page) {
// end of file, or page
Object.keys(rows) // => array of y-positions (type: float)
.sort((y1, y2) => parseFloat(y1) - parseFloat(y2)) // sort float positions
.forEach(yValue => {
// rows[y] is an array of text for a line.
let line = (rows[yValue] || []).join(''); // construct line of text
if (line.includes('Table Name')) {
tableIndexes = 0;
for (let i = 0; i < 500; i++) {
if (getNextIndexItem(rows, i, yValue).includes('Next Table Name')) {
tableIndexes = i; // get index of last table row
break;
}
}
for (let i =2; i < tableIndexes; i++) { // start at 2 to avoid the heading row of the table
console.log(`List row #${i}: ${getNextIndexItem(rows, i, yValue)}`);
}
} // closes the `if (line.includes('Table Name'))` block
});
rows = {}; // clear rows for next page
} else if (item.text) {
if (!rows[item.y]) {
rows[item.y] = [];
}
rows[item.y].push(item.text);
}
} else {
// we're done here
callback(null, parsedData); // error-first callback: no error, then the data
}
});
};
module.exports.test = function() {
pdfReader('doc.pdf', {}, (err, data) => { // second argument is parsedData
if (err) console.log(err);
});
}
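One way to approach the page-split problem, sketched under assumptions: instead of clearing the rows on every page marker, keep every row keyed by page and y-position, then search for the two headings in the combined list of lines. createLineCollector and linesBetween are hypothetical helpers, not part of pdfreader; the sketch assumes item.page is a page number and item.y is numeric, as in the outputs shown elsewhere on this page.

```javascript
// Hypothetical helper: collects text items into lines that stay in reading
// order across page breaks, instead of resetting on every page.
function createLineCollector() {
  let page = 0;
  const rows = new Map(); // key "page:y" -> { page, y, texts }
  return {
    add(item) {
      if (item.page) {
        page = item.page; // page marker: remember which page we are on
      } else if (item.text) {
        const key = `${page}:${item.y}`;
        if (!rows.has(key)) rows.set(key, { page, y: item.y, texts: [] });
        rows.get(key).texts.push(item.text);
      }
    },
    lines() {
      return [...rows.values()]
        .sort((a, b) => a.page - b.page || a.y - b.y) // page first, then y
        .map((row) => row.texts.join(" "));
    },
  };
}

// Returns the lines strictly between two headings, wherever the page break falls.
function linesBetween(lines, startHeading, endHeading) {
  const start = lines.findIndex((line) => line.includes(startHeading));
  if (start === -1) return [];
  const end = lines.findIndex((line, i) => i > start && line.includes(endHeading));
  return lines.slice(start + 1, end === -1 ? undefined : end);
}
```

In the parsing callback, each item would be passed to add(), and linesBetween(collector.lines(), 'Table Name', 'Next Table Name') would be called once the callback fires without an item.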
When uploading and processing a PDF that contains comments, pdfreader is unable to handle the request, and my backend Node service fails. I'm able to use PdfReader().parseBuffer(file, function(err, item) to process the buffered file, and it reads the file and the first item, but it fails after that.
Is this a known bug? If so, is there any way I can handle this, or a way to detect that the file has comments and return an error? I've tried some workarounds, but the service just fails every time.
Nhan_Thien_CV.pdf
When I parse this CV, the data on each row is parsed incorrectly.
Actual :
Emailthien.nhan2310@[email protected]
Expected :
Email [email protected]
Pdf2json package is using old version of lodash 4.15 and it has some vulnerabilities. Please update the version of pdf2json to 1.1.9 and it will fix the issue.
Hi,
I am unable to find a done/complete event for new pdfreader.PdfReader().parseFileItems.
Any pointers on this, or on how to get the total number of pages in a PDF?
Hi @adrienjoly ,
I am using pdfreader to parse ### pdf documents. However in my application if I bump into runtime error while parsing pdf I want to use a particular logic. Below is code and trace of exception while reading pdf document. Issue is that the error is not getting caught in if(err) condition. Am I missing anything in catching the exception shown below?
Thanks,
Ji
Code snippet and exception trace:
function readPDFPages(buffer, reader = (new PdfReader())) {
console.log('reading pdf pages: ');
console.log(buffer);
return new Promise((resolve, reject) => {
let pages = [];
reader.parseBuffer(buffer, (err, item) => {
if (err) {
console.log("err in parsed buffer");
console.log(err);
reject(err)
}
else if (!item)
resolve(pages);
else if (item.page)
resolve(pages);
});
});
}
Exception trace:
Error: Illegal character: 41
at error (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:195:9)
at Lexer_getObj [as getObj] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24616:11)
at Parser_shift [as shift] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24038:32)
at Parser_makeStream [as makeStream] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24195:12)
at Parser_getObj [as getObj] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24079:18)
at XRef_fetch [as fetch] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5753:22)
at XRef_fetchIfRef [as fetchIfRef] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5699:19)
at Dict_get [as get] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4759:28)
at Page_getPageProp [as getPageProp] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4213:28)
at Page.get content [as content] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4227:19)
Hello, I am using the code snippet from the documentation to parse lines of text from pdf page.
The lines of text are getting parsed, however for some reason the height / h property is missing from item.
I need the height in order to detect the text getting out of bound of a certain box on the pdf page.
Here is the code snippet, that I used:
`
let rows = {};
let addressRows = [];
const printRows = () => {
Object.keys(rows)
.sort((y1, y2) => parseFloat(y1) - parseFloat(y2))
.forEach((y) => {
addressRows.push((rows[y] || []).join(''));
});
}
new pdfreader.PdfReader().parseFileItems(tempPath, function (err, item) {
if (!item || item.page) {
printRows();
if (!item) {
console.log('addressRows: ', addressRows)
}
} else if (item.text) {
console.log(item);
// accumulate text items into rows object, per line
(rows[item.y] = rows[item.y] || []).push(item.text);
}
});`
Here is the log I received on terminal.
Any idea why the height/h property is missing?
Thank you.
When I run "node parse.js test/sample.pdf" I see that parseTable is not working, and table from sample file is not parsed.
In PR #25 (merged today without publishing a new version on npm), @noshadil upgraded pdf2json to v1.1.7.
This upgrade changes the parameters of the two top-level events triggered by pdf2json: pdfParser_dataError and pdfParser_dataReady, both handled by pdfreader.
In his PR, @noshadil did update the handler for pdfParser_dataReady accordingly, but not the one for pdfParser_dataError. This may mean that top-level error handling is broken in the current build, but I don't have enough time to check this at this point.
Next steps:
const fs = require('fs');
const pdfreader = require("pdfreader");
fs.readFile('./test.pdf', function (err, buffer) {
if (err) return console.log(err);
new pdfreader.PdfReader().parseBuffer(buffer, function (err, item) {
if (err) callback(err);
else if (!item) callback();
else if (item.text) console.log(item.text);
});
});
VM1448:195 Uncaught Error: No PDFJS.workerSrc specified
at error (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :195:9)
at new WorkerTransport (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :42961:9)
at Object.getDocument (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :42559:15)
at PDFJSClass.parsePDFData (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:224)
at PDFParser.#startParsingPDF (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\pdfparser.js:85)
at PDFParser.parseBuffer (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\pdfparser.js:142)
at PdfReader.parseBuffer (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\PdfReader.js:72)
at C:\Users\dell\agspdftoexcel\app.js:11
at FSReqCallback.readFileAfterClose [as oncomplete] (node:internal/fs/read_file_context:68)
Minimist is a parse argument options module. Affected versions of this package are vulnerable to Prototype Pollution. The library could be tricked into adding or modifying properties of Object
Xmldom is used as a parser and XML serializer. The library could be tricked into adding or modifying the xml
I am trying to parse a pdf and categorize information based on text formatting/decoration. How do you suggest I do that?
For example, I have a pdf in which the structure is repeated:
S.No. BOLD+UNDERLINED TITLE para
How do I categorize this data into an array of objects:
[ { sno: "", title: "", desc: "" }, ... ]
Describe the bug
When running with node v16.10.0, the methods parseBuffer and parseFileItems do not work as expected.
To Reproduce
Use node version 16.10.0 and try to read the text of a pdf using parseBuffer or parseFileItems.
Expected behavior
The callback passed to parseBuffer or parseFileItems should be called with each pdf item found.
Current behavior
The callback passed to parseBuffer or parseFileItems only gets called once with the file: { path } or file: { buffer } data, and never gets called with the pdf items or pages.
Screenshots, outputs or logs
Desktop (please complete the following information):
Additional Info
The same code works correctly in at least node 16.3.0
Hi,
This npm module is awesome, I really love it :)
However I do have one thing of note. See here, this module works on pdfs exported via Word, Excel, PowerPoint but not on PDFs that were generated from online sources (e.g. online2pdf). Is there a reason for this?
Thanks.
NPM is showing me 4 vulnerabilities, (3 low, 1 high) for pdf2json. Patched after 0.5.0. I recommend updating dependencies list.
Fazenda Santo Antônio – Gleba B1-Memorial.pdf
Can you help?
I am trying to read this file.
I am using a Node.js Express server. On a GET request I expect to return the content read from the PDF file, but reading is an asynchronous process.
`new pdfreader.PdfReader().parseFileItems('CHK.pdf', function (err, item) {
if (!item || item.page) {
res.send(printRows());
}
});`
Is there any way to wait until all the pages have been read, and only then send the response back to the client?
How can I get metadata from a file, for example author, name, etc.?
I'm trying something like:
const { PdfReader } = require("pdfreader");
const fs = require("fs");
fs.readFile("example.pdf", (err, pdfBuffer) => {
// pdfBuffer contains the file content
new PdfReader().parseBuffer(pdfBuffer, function (err, item) {
if (err) {
console.log("Error", err);
return false;
}
if (item && item.file) { // <-- item.file not exists
}
if (item) {
console.log(Object.keys(item)); // <-- has nothing to do with metadata
}
});
});
Hi, I am using the npm pdfreader. The code is not responding for a file of 150 MB which has 10000 pages.
Below is my code.
var pdfreader = require('pdfreader');
new pdfreader.PdfReader().parseFileItems('demo.pdf', function(err, item){
if (err) console.error(err);
else if (item && item.text) console.log(item.text); // item is undefined at end of file
});
Thank you for the help in advance.
Would it be possible to push latest version to NPM? The latest available there seems to be 1.0.7 while current one on github is 1.1.3.
How can I get the content of a given page number? Is there a function for this?
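None of the reports on this page mention a dedicated function for this. One possible approach, assuming the callback receives a { page: n } marker before the text items of each page (as the snippets elsewhere on this page suggest), is to track the current page and keep only the text of the wanted page. createPageFilter is a hypothetical helper, not part of pdfreader:

```javascript
// Hypothetical helper: keeps only the text items that belong to one page.
// Assumes a { page: n } marker arrives before the text items of page n.
function createPageFilter(wantedPage) {
  let currentPage = 0;
  const texts = [];
  return {
    onItem(item) {
      if (!item) return; // end of file: nothing more to collect
      if (item.page) currentPage = item.page;
      else if (item.text && currentPage === wantedPage) texts.push(item.text);
    },
    texts,
  };
}

// Possible usage (assumes pdfreader is installed):
// const filter = createPageFilter(2);
// new PdfReader().parseFileItems("demo.pdf", (err, item) => {
//   if (err) console.error(err);
//   else if (!item) console.log(filter.texts);
//   else filter.onItem(item);
// });
```

The parser still reads the whole file in this sketch; only the collection step is restricted to the requested page.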
can you please release a new version to npm?
I don't know why neither 'pdf-to-text' nor 'pdfreader' works, even though, as far as I know, the requirements are met.
In short, the "require" function is missing. Even after I added "require, requirejs, and require.js" with npm and added a require.js script line to my HTML, it produces the error below. Here is the CodePen, or a more detailed explanation on Stack Overflow.
PS: I tried adding /, ', and combinations of both at the beginning and end of the arguments inside the require calls, but nothing has worked yet.
First, I just want to thank you for creating this package. It's really helped us.
Describe the bug
While most of the text is there, a few characters are missing from my PDF.
Here's the PDF. It was produced by using a headless Chrome 67.0.3396.87 on Ubuntu to print the screen to PDF.
Scenario-4.1-RiskTables-FQA.pdf
To Reproduce
Here's a minimalist test:
const PdfReader = require("pdfreader").PdfReader;
const fs = require("fs");
const path = require("path");
const filename = path.join("c:", "temp", "Scenario-4.1-RiskTables-FQA.pdf");
console.log("Reading " + filename + "...");
new Promise((resolve, reject) => {
let pdfText = "";
fs.readFile(filename, (err, pdfBuffer) => {
console.log("Found buffer with " + pdfBuffer.length + " bytes.");
new PdfReader().parseBuffer(pdfBuffer, function(err, item){
if (err) {
reject(err);
} else if (!item) {
resolve(pdfText);
} else if (item.text) {
//console.log("Found item: " + JSON.stringify(item));
pdfText += item.text;
}
});
});
}).then((pdfText) => {
console.log("Found PDF Text: " + pdfText);
}).catch(e => {
console.log("ERROR", e);
});
Expected behavior
I would expect to see all of the characters. Open the PDF and you'll notice the sentence "Effective RMP:" on the first page just above "Default 5x5 RMP V1.0". In the text that gets exported from the file, it says "E ective RMP".
Screenshots, outputs or logs
Here's the log of what this program produces for me:
Reading c:\temp\Scenario-4.1-RiskTables-FQA.pdf...
Found buffer with 69972 bytes.
Found PDF Text: 8/29/2018QbDVisionRiskTablesabout:blank1/3QbDVisionExportedBy:RyanRocketExportDate:Aug29,2018at1:44pm G MT C ompany:RocketsRUSProject:PRJ-6-PrintTestProjectReportDate:Aug29,2018RiskTablesReportīFQARiskTableAsofAug29,2018at11:59pm G MTRiskTable:FQARiskTableDate:Aug29,2018E ectiveRMP:Default5x5RMPV1.08/29/2018QbDVisionRiskTablesabout:blank2/3FQA-32-Appearance[NOTAPPROVED]1(1%) C olor,shapeandappearancearenotdirectlylinkedtosafetyande cacy.Therefore,theyarenotcritical.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-40-Assay[NOTAPPROVED]100(100%)Processvariablesmaya ecttheassayofthedrugproduct.1000(100%)10000(100%)IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-95-Overdosage[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-52- C ontainer C losureSystem[NOTAPPROVED]100(100%)Packagingoptionshavenotbeenidenti ed1000(100%)10000(100%)SuitablepackagingoptionswillbeinvestigatedduringdevelopmentprocessNone C M-78-NA[NOTAPPROVED]TPP-101-HowSupplied/StorageandHandling[NOTAPPROVED]FQA-45- C ontentUniformity[NOTAPPROVED]100(100%)Variabilityincontentuniformitywilla ectsafetyande cacy.1000(100%)10000(100%)Bothformulationandprocessvariablesimpactcontentuniformity,sothis C QAwillbeevaluatedthroughoutproductandprocessdevelopment.ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-95-Overdosage[NOTAPPROVED]FQA-42-DegradationProducts[NOTAPPROVED]100(100%)Formulationandprocessvariablescanimpactdegradationproducts.1000(100%)10000(100%)Degradationproductswillbeassessedduringproductandprocessdevelopment.IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]TPP-101-HowSupplied/StorageandHandling[NOTAPPROVED]FQA-47-Dissolution[NOTAPPROVED]100(100%)Bothformulationandprocessvariablesa ectthedissolutionpro le.1000(100%)10000(100%)This C 
QAwillbeinvestigatedthroughoutformulationandprocessdevelopment.ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-97- C linicalPharmacology[NOTAPPROVED]FQA-37-Friability[NOTAPPROVED]25(25%)AtargetofNMT1.0%w/wofmeanweightlossassuresalowimpactonpatientsafetyande cacyandminimizescustomercomplaints.250(25%)2500(25%)ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-97- C linicalPharmacology[NOTAPPROVED]FQA-38-Identi cation[NOTAPPROVED]100(100%)Identi cationiscriticalforsafetyande cacy.1000(100%)10000(100%)IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]FQA-50-MicrobialLimits[NOTAPPROVED]10(10%)Non-compliancewithmicrobiallimitswillimpactpatientsafety.However,inthiscase,theriskofmicrobialgrowthisverylowbecauserollercompaction(drygranulation)isutilizedforthisproduct.Therefore,this C QAwillnotbediscussedindetailduringformulationandprocessdevelopment.100(10%)1000(10%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-33-Odor[NOTAPPROVED]1(1%)Ingeneral,anoticeableodorisnotdirectlylinkedtosafetyande cacy,butodorcana ectpatientacceptability.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-49-ResidualSolvents[NOTAPPROVED]5(5%)Residualsolventscanimpactsafety.However,nosolventisusedinthedrugproductmanufacturingprocessandthedrugproductcomplieswithUSP<467>Option1.Therefore,formulationandprocessvariablesareunlikelytoimpactthis C QA.50(5%)500(5%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-35-Score C on guration[NOTAPPROVED]1(1%)Scorecon gurationisnotcriticalfortheacetriptantablet.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-34-Size[NOTAPPROVED]1(1%)SeeTargetJusti cation10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQAī
C riticalityī
C riticalityJusti cationī
ProcessRiskī
RPNī
RecommendedActionsī
C ontrolStrategyī
C ontrolMethodsī
TPPLinksī
8/29/2018QbDVisionRiskTablesabout:blank3/3Ā©2018 C herry C ircleSoftware,Inc.FQA-43-Water C ontent[NOTAPPROVED]25(25%)However,inthiscase,acetriptanisnotsensitivetohydrolysisandmoisturewillnotimpactstability.250(25%)2500(25%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]FQAī
C riticalityī
C riticalityJusti cationī
ProcessRiskī
RPNī
RecommendedActionsī
C ontrolStrategyī
C ontrolMethodsī
TPPLinksī
Desktop (please complete the following information):
Additional context
Thank you again for creating this package.
Describe the bug
Unable to process PDF
To Reproduce
Expected behavior
Extract text from PDF
Screenshots, outputs or logs
(while reading XRef): Error: Invalid XRef stream header
at XRef_readXRef [as readXRef] (eval at Object.<anonymous> (node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5682:9)
console.log
XRefParseException:
at XRefParseExceptionClosure (eval at Object.<anonymous> (/Users/tsopic/telegram_bot/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:379:34)
at eval (eval at Object.<anonymous> (/Users/tsopic/repo/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:384:3)
Desktop (please complete the following information):
NODE - 14
tested on both mac and linux
Additional context
I'm trying to find a string in a PDF and want to get its x and y location on the page.
It seems item.x and item.y are relative to the item "above", and it seems impossible to me to find out which x to add to get the absolute position of an item.
Is there any way?
Hi,
How can I extract images from pdf ?
Can this be used to extract form data so I can have the field name and value?
Hello, I would like to make some edits to the JSON output provided by the library and then convert it back to PDF. Any help would be greatly appreciated.
Hi there,
Think you got a good idea here, but I'm trying to figure out how to correctly parse a table. I don't think the displayTable() you have in your test file is logging. I'm just having trouble figuring out the pattern. Anyway, do you have any advice for me?
Thanks in advance and I hope you have good day :)
Troy
var _ = require('lodash');
var PdfReader = require('pdfreader').PdfReader;
var Rule = require('pdfreader').Rule;
function displayTable(table){
console.log('Object.keys(table)',Object.keys(table));
_.map(table.rows, function(row){
console.log('row',row);
});
}
var sampleRules = [
Rule.on(/^c1$/).parseTable(3).then(displayTable)
];
var processItemSample = Rule.makeItemProcessor(sampleRules);
var samplePathToPdf = __dirname + '/sample.pdf';
new PdfReader().parseFileItems(samplePathToPdf, function(err, item){
if (err){
console.log(err);
}
else {
processItemSample(item);
}
});
Here is my output
Object.keys(table) [ 'items', 'rows', 'matrix' ]
row [ { x: 20.408,
y: 10.501,
w: 0.9436,
clr: 0,
A: 'left',
R: [ [Object] ],
text: 'c2' },
{ x: 28.299,
y: 10.501,
w: 0.9436,
clr: 0,
A: 'left',
R: [ [Object] ],
text: 'c3' },
{ x: 14.979,
y: 11.447,
w: 0.5,
clr: 0,
A: 'left',
R: [ [Object] ],
text: '1' },
{ x: 29.249,
y: 11.447,
w: 1.25,
clr: 0,
A: 'left',
R: [ [Object] ],
text: '2.3' } ]
row [ { x: 19.513,
y: 12.363,
w: 2,
clr: 0,
A: 'left',
R: [ [Object] ],
text: 'hello' },
{ x: 27.068,
y: 12.363,
w: 2.333,
clr: 0,
A: 'left',
R: [ [Object] ],
text: 'world' },
{ x: 12.964,
y: 13.248,
w: 3.055,
clr: 0,
A: 'left',
R: [ [Object] ],
text: 'Values:' } ]
row [ { x: 12.964,
y: 14.835,
w: 0.5,
clr: 0,
A: 'left',
R: [ [Object] ],
text: '1' },
{ x: 12.964,
y: 16.423,
w: 0.5,
clr: 0,
A: 'left',
R: [ [Object] ],
text: '2' } ]
row [ { x: 12.964,
y: 18.01,
w: 0.5,
clr: 0,
A: 'left',
R: [ [Object] ],
text: '3' } ]
The automated release from the master branch failed.
I recommend you give this issue a high priority, so other packages depending on you can benefit from your bug fixes and new features.
You can find below the list of errors reported by semantic-release. Each one of them has to be resolved in order to automatically publish your package. I'm sure you can resolve this.
Errors are usually caused by a misconfiguration or an authentication problem. With each error reported below you will find explanation and guidance to help you resolve it.
Once all the errors are resolved, semantic-release will release your package the next time you push a commit to the master branch. You can also manually restart the failed CI job that runs semantic-release.
If you are not sure how to resolve this, here are some links that can help you:
If those don't help, or if this issue is reporting something you think isn't right, you can always ask the humans behind semantic-release.
An npm token must be created and set in the NPM_TOKEN environment variable on your CI environment.
Please make sure to create an npm token and to set it in the NPM_TOKEN environment variable on your CI environment. The token must allow publishing to the registry https://registry.npmjs.org/.
Good luck with your project
Your semantic-release bot
I'm writing an application that reads the content of some files in a directory. Files are meant to be replaced (same filename but different content).
If I use parseFileItems twice with the same path but different files, the result is always the content of the old file.
I worked around it by reading the file content with fs.readFile and passing the buffer to parseBuffer.
Your source code looks fine to me, so maybe it is a problem with pdf2json/pdfparser, but I'm not sure, so I'm reporting it to you.
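The workaround described above could look roughly like this. It is a sketch, not the reporter's actual code: parseFileFresh is a hypothetical helper, reader is assumed to be a PdfReader instance, and the optional readFile parameter exists only so the helper can be exercised without touching the disk.

```javascript
const fs = require("fs");

// Hypothetical helper: reads the file from disk on every call and hands the
// fresh buffer to the reader, instead of passing the path to parseFileItems.
function parseFileFresh(reader, path, callback, readFile = fs.readFile) {
  readFile(path, (err, buffer) => {
    if (err) return callback(err); // report read errors through the same callback
    reader.parseBuffer(buffer, callback);
  });
}

// Possible usage (assumes pdfreader is installed):
// const { PdfReader } = require("pdfreader");
// parseFileFresh(new PdfReader(), "./report.pdf", (err, item) => {
//   if (err) console.error(err);
//   else if (item && item.text) console.log(item.text);
// });
```

Because the buffer is read again on each call, a replaced file with the same name is picked up the next time the helper runs.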