Comments (10)
Today I learned about ligatures. Apparently, when turning text into PDF, it's common to replace ff and fi with a ligature so the text looks better.
I guess this is a feature request instead: support the standard ligatures that Chrome uses when printing to PDF.
I figured out a workaround, from this Stack Overflow post:
body {
/* This turns off chrome merging characters in fonts when printing to PDF */
font-variant-ligatures: none;
font-feature-settings: "liga" 0;
}
This seems to remove the ligatures so Chrome doesn't print them.
from npm-pdfreader.
- Patch parse.js to include charcodes in output
diff --git a/parse.js b/parse.js
index d8901cd..83565d3 100644
--- a/parse.js
+++ b/parse.js
@@ -12,7 +12,7 @@ function printRawItems(filename, callback){
else if (item.page)
console.log("page =", item.page);
else if (item.x)
- console.log([item.x, item.y, item.oc, item.A, Math.floor(item.w), item.text].join("\t"));
+ console.log([item.x, item.y, item.text, item.text.charCodeAt(0)].join(" | "));
else
console.warn(item);
});
- Parse the pdf file
$ node parse.js "Scenario-4.1-RiskTables-FQA.pdf" >pdf.log
[...]
6.883 | 10.509 | E | 69
7.502 | 10.509 | � | 0
8.276 | 10.509 | e | 101
8.895 | 10.509 | c | 99
9.436 | 10.509 | t | 116
9.823 | 10.509 | i | 105
10.133 | 10.509 | v | 118
10.674 | 10.509 | e | 101
11.603 | 10.509 | R | 82
12.299 | 10.509 | M | 77
13.305 | 10.509 | P | 80
[...]
- Check charcodes (on tabular version of output, for higher readability)
x | y | char | charcode |
---|---|---|---|
6.883 | 10.509 | E | 69 |
7.502 | 10.509 | � | 0 |
8.276 | 10.509 | e | 101 |
8.895 | 10.509 | c | 99 |
9.436 | 10.509 | t | 116 |
9.823 | 10.509 | i | 105 |
10.133 | 10.509 | v | 118 |
10.674 | 10.509 | e | 101 |
11.603 | 10.509 | R | 82 |
12.299 | 10.509 | M | 77 |
13.305 | 10.509 | P | 80 |
=> observations:
- the ff ligature is indeed stored as one text item of length 1 (i.e. just one character)
- the charcode of that character is 0 => this text item carries no information about the ligature. It was probably added just as a placeholder when exporting the PDF file.
- at that point, by reading just the text items parsed from the PDF file, I don't see any trace of how the ligature is actually displayed...
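The patch above only prints the first char code of each item. As a rough sketch, the helper below lists every code point of an item's text and flags the two cases of interest here: a NUL placeholder (char code 0) and the Unicode Latin ligature code points U+FB00 to U+FB06. The name `inspectText` and the shape of its return value are hypothetical; they are not part of pdfreader.

```javascript
// Hypothetical helper, not part of pdfreader: inspect the text of one parsed item.
function inspectText(text) {
  // Array.from iterates by code point, so characters outside the BMP are not split.
  const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
  return {
    codePoints,
    // Char code 0 is what the "ff" item carried in the log above.
    hasNul: codePoints.includes(0),
    // U+FB00..U+FB06 are the Latin ligatures (ff, fi, fl, ffi, ffl, long s-t, st).
    hasLigature: codePoints.some((cp) => cp >= 0xfb00 && cp <= 0xfb06),
  };
}

console.log(inspectText("\u0000"));
console.log(inspectText("e\uFB00ective"));
```

Calling it from the `else if (item.x)` branch of parse.js, e.g. `inspectText(item.text)`, would show whether an item is an empty placeholder or carries a real ligature character.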
Hi, and thanks for this very complete and well written issue!
From what I see, I can think of 3 hypotheses:
- The "FF" part of "effective" was replaced by some extended Unicode character that is not necessarily visible when logging the string produced by the text extraction.
- That FF string was treated separately by the PDF writer => instead of being included in the same text entry as the rest of the line, it was added as a different text entry, anywhere in the list of entries (aka "items") defined in the PDF file. => In that case, we would need to implement a way to automatically re-insert it into the initial text item by relying on the coordinates of those text entries. (Not trivial to do)
- Worst case scenario: the FF part was written as a graphic entry instead of a text entry. If that's the case, I don't think that pdfreader will be able to help with this, as it was designed to parse text entries only...
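The first two hypotheses can be probed from the parsed items alone. The sketch below groups items by their `y` coordinate, sorts each group by `x`, joins the text, expands Unicode ligatures with NFKC normalization (`String.prototype.normalize`), and replaces NUL characters with a visible marker so that a missing ligature stands out. The helper name `rebuildLines`, the `yTolerance` parameter and the `"[?]"` marker are assumptions made for illustration; items are assumed to carry the `x`, `y` and `text` fields seen in the log above.

```javascript
// Hypothetical helper: rebuild text lines from items shaped like { x, y, text }.
function rebuildLines(items, yTolerance = 0.05, marker = "[?]") {
  const lines = [];
  // Sort top-to-bottom first, so items of the same line end up adjacent.
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - item.y) <= yTolerance) {
      last.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x) // left-to-right within the line
      .map((item) => item.text)
      .join("")
      // NFKC turns e.g. U+FB00 into "ff"; note it also folds other compatibility characters.
      .normalize("NFKC")
      // Make NUL placeholders visible instead of silently dropping them.
      .replace(/\u0000/g, marker)
  );
}

// Items taken from the log above, deliberately out of order.
const sampleItems = [
  { x: 8.276, y: 10.509, text: "e" },
  { x: 6.883, y: 10.509, text: "E" },
  { x: 7.502, y: 10.509, text: "\u0000" },
];
console.log(rebuildLines(sampleItems)); // one line: "E[?]e"
```

If the ligature were stored as a real Unicode character (hypothesis 1), the normalization step would recover "ff". A marker in the output instead means the item is only a placeholder, which matches the charcode 0 observed in this PDF.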
Does that help? Please let us know of your findings and decisions! :-)
Thanks for sharing learnings, that's interesting!
Do you know how ligatures are stored in the PDF file? Are they still considered text? Or graphics, or yet another type of item?
(Knowing that would be handy in cases where users of pdfreader have no control over the generation of the PDF file.)
It looks like there's some strange character in the PDF in its place, but I don't know anything more than that. Unfortunately, I'm not enough of a PDF expert.
The PDF was attached to the issue when I reported it. It looks like it's a strange Unicode character.
Ha. I have no idea either. Magic? :-)
I don't know if it may help, but a colleague using Apache Tika to extract text from PDF files told me that he was getting better results by installing the fonts used by the PDF file:
Me: During the talk (and our discussions), you mentioned that installing the fonts used in the PDF file would improve Tika's precision while extracting the text. Does your npm package take care of that?
Tim: The package does not handle that part. But you can get the list of all fonts in a PDF file using pdffonts (part of Xpdf), and get the list of fonts installed on your system with fc-list (at least, that's how it works on Ubuntu). The output from fc-list can be cumbersome to parse, so I made my own script, font-exists, to check whether a specific font (from the output of pdffonts) exists on my system. Finding and installing the missing fonts is then a manual process that can't be automated.
No news => closing.