First, I just want to thank you for creating this package. It's really helped us.

Some characters are missing / corrupt (e.g. ligatures) about npm-pdfreader HOT 10 CLOSED

adrienjoly commented on May 13, 2024

Some characters are missing / corrupt (e.g. ligatures)

from npm-pdfreader.

Comments (10)

rsshilli commented on May 13, 2024

Today I learned about ligatures. Apparently it's common, when turning text into PDF, to replace "ff" and "fi" with a ligature so that it looks better.

I guess this is really a feature request: support the standard ligatures that Chrome uses when printing to PDF.

I figured out a workaround, from this Stack Overflow post:

body {
  /* This stops Chrome from merging characters into ligatures when printing to PDF */
  font-variant-ligatures: none;
  font-feature-settings: "liga" 0;
}

This seems to remove the ligatures so Chrome doesn't print them.

adrienjoly commented on May 13, 2024

Patch parse.js to include charcodes in output

diff --git a/parse.js b/parse.js
index d8901cd..83565d3 100644
--- a/parse.js
+++ b/parse.js
@@ -12,7 +12,7 @@ function printRawItems(filename, callback){
     else if (item.page)
       console.log("page =", item.page);
     else if (item.x)
-      console.log([item.x, item.y, item.oc, item.A, Math.floor(item.w), item.text].join("\t"));
+      console.log([item.x, item.y, item.text, item.text.charCodeAt(0)].join(" | "));
     else
       console.warn(item);
   });

Parse the pdf file

$ node parse.js "Scenario-4.1-RiskTables-FQA.pdf" >pdf.log

[...]
6.883 | 10.509 | E | 69
7.502 | 10.509 | � | 0
8.276 | 10.509 | e | 101
8.895 | 10.509 | c | 99
9.436 | 10.509 | t | 116
9.823 | 10.509 | i | 105
10.133 | 10.509 | v | 118
10.674 | 10.509 | e | 101
11.603 | 10.509 | R | 82
12.299 | 10.509 | M | 77
13.305 | 10.509 | P | 80
[...]

Check charcodes (on tabular version of output, for higher readability)

x	y	char	charcode
6.883	10.509	E	69
7.502	10.509	�	0
8.276	10.509	e	101
8.895	10.509	c	99
9.436	10.509	t	116
9.823	10.509	i	105
10.133	10.509	v	118
10.674	10.509	e	101
11.603	10.509	R	82
12.299	10.509	M	77
13.305	10.509	P	80

=> observations:

- The ff ligature is indeed stored as one text item of length 1 (i.e. just one character).
- The charcode of that character is 0, so this text item carries no information about the ligature. It was probably added just as a placeholder when exporting the PDF file.
- At that point, reading just the text items parsed from the PDF file, I don't see any trace of how the ligature is actually displayed...
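
To make that check easier to repeat, here is a minimal sketch that prints every code point of a text item and flags the items that look like placeholders. The helper names (`codePoints`, `findSuspiciousItems`) are hypothetical and not part of pdfreader; the `{x, y, text}` item shape and the sample values are taken from the log above. Treating control characters, U+FFFD and private-use code points as "suspicious" is an assumption, not a rule from the PDF format.

```javascript
// Hypothetical helpers, not part of pdfreader.

// Lists every code point of a string as U+XXXX, whereas charCodeAt(0)
// only reports the first UTF-16 code unit.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Flags items whose text contains a control character (including NUL),
// the replacement character U+FFFD, or a private-use code point.
// Assumption: these hint at a glyph with no usable Unicode mapping.
function findSuspiciousItems(items) {
  const suspicious = /[\u0000-\u001f\ufffd\ue000-\uf8ff]/;
  return items.filter(
    (item) => typeof item.text === "string" && suspicious.test(item.text)
  );
}

// Three items copied from the log above (the second one has charcode 0).
const sample = [
  { x: 6.883, y: 10.509, text: "E" },
  { x: 7.502, y: 10.509, text: "\u0000" },
  { x: 8.276, y: 10.509, text: "e" },
];

for (const item of findSuspiciousItems(sample)) {
  console.log(item.x, item.y, codePoints(item.text).join(" "));
}
```

Running this over a whole document would at least show where such placeholders occur, even though it cannot tell which ligature was intended.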

adrienjoly commented on May 13, 2024

Hi, and thanks for this very complete and well written issue!

From what I see, I can think of 3 hypotheses:

1. The "FF" part of "effective" was replaced by some extended Unicode character that is not necessarily visible when you log the string resulting from the text extraction.
2. That "FF" string was treated separately by the PDF writer: instead of being included in the same text entry as the rest of the line, it was added as a separate text entry, anywhere in the list of entries (aka "items") defined in the PDF file. In that case, we would need to implement a way to automatically re-insert it into the initial text item by relying on the coordinates of those text entries. (Not trivial to do.)
3. Worst-case scenario: the "FF" part was written as a graphic entry instead of a text entry. If that's the case, I don't think that pdfreader will be able to help, as it was designed to parse text entries only...
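
For the first two hypotheses, a minimal sketch of what the post-processing could look like, assuming items shaped like `{x, y, text}`. `expandLigatures` and `rebuildLines` are hypothetical names, not pdfreader functions. Grouping on an exact `y` value is a simplifying assumption; a real document would probably need a tolerance. The first helper only helps if the ligature arrives as one of the Unicode presentation forms U+FB00 to U+FB06.

```javascript
// Hypothetical helpers, not part of pdfreader.

// Hypothesis 1: expand Unicode ligature code points (U+FB00..U+FB06)
// into their plain letters using compatibility decomposition.
function expandLigatures(text) {
  return text.replace(/[\uFB00-\uFB06]/g, (ch) => ch.normalize("NFKD"));
}

// Hypothesis 2: rebuild lines from items that may arrive in any order,
// by grouping on y and sorting each group on x.
// Assumption: items on the same line share exactly the same y value.
function rebuildLines(items) {
  const rows = new Map();
  for (const item of items) {
    const key = item.y.toFixed(3);
    if (!rows.has(key)) rows.set(key, []);
    rows.get(key).push(item);
  }
  return [...rows.entries()]
    .sort(([a], [b]) => parseFloat(a) - parseFloat(b))
    .map(([, row]) =>
      row
        .sort((p, q) => p.x - q.x)
        .map((item) => item.text)
        .join("")
    );
}

// Illustrative input: the ligature item is listed last, out of order.
const shuffled = [
  { x: 8.276, y: 10.509, text: "e" },
  { x: 6.883, y: 10.509, text: "E" },
  { x: 7.502, y: 10.509, text: "\uFB00" },
];

console.log(rebuildLines(shuffled).map(expandLigatures));
```

Neither helper covers the third hypothesis: if the ligature is drawn as a graphic, there is no text item to reorder or expand.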

Does that help? Please let us know of your findings and decisions! :-)

adrienjoly commented on May 13, 2024

Thanks for sharing learnings, that's interesting!

Do you know how ligatures are stored in the PDF file? Are they still considered text? Or graphics, or yet another type of item?

(Knowing that would be handy in cases where users of pdfreader have no control over the generation of the PDF file.)

rsshilli commented on May 13, 2024

It looks like there's some strange character in the PDF in its place, but I don't know anything more than that. Unfortunately, I'm not enough of a PDF expert.

adrienjoly commented on May 13, 2024

I can have a look if you don't mind sharing the PDF file with me.

rsshilli commented on May 13, 2024

The PDF was attached to the issue when I reported it. It looks like it's a strange Unicode character.

rsshilli commented on May 13, 2024

Ha. I have no idea either. Magic? :-)

adrienjoly commented on May 13, 2024

I don't know if it may help, but a colleague using Apache Tika to extract text from PDF files told me that he was getting better results by installing the fonts used by the PDF file:

Me: During the talk (and our discussions), you mentioned that installing the fonts used in the PDF file would improve Tika's precision while extracting the text. Does your npm package take care of that?

Tim: The package does not handle that part. But you can get the list of all fonts in a PDF file using pdffonts (part of Xpdf), and get the list of fonts installed on your system with fc-list (at least, that's how it works on Ubuntu). The output from fc-list can be cumbersome to parse, so I made my own script, font-exists, to see if a specific font (from the output of pdffonts) exists on my system. Finding and installing the missing fonts is then a manual process that can't be automated.
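
The check Tim describes could be sketched roughly as follows. This is not his font-exists script. It assumes that the captured `pdffonts` output starts with a header line followed by a row of dashes whose first run marks the width of the name column, and that `fc-list` prints lines of the form `path: Family:style=...`. The name-matching rule is a crude heuristic, all function names are hypothetical, and the sample strings are made up and abbreviated for illustration.

```javascript
// Hypothetical sketch. It works on text already captured from
// `pdffonts file.pdf` and `fc-list`, and never runs those tools itself.

// Lowercase and drop everything but letters and digits, so that
// "DejaVu Sans" and "DejaVuSans-Bold" become comparable.
const normalize = (name) => name.toLowerCase().replace(/[^a-z0-9]/g, "");

// Assumed layout: header line, then a line of dashes, then one font per line.
function parsePdffontsNames(output) {
  const lines = output.split("\n");
  const sep = lines.findIndex((line) => /^-+ -/.test(line));
  if (sep === -1) return [];
  const width = lines[sep].indexOf(" "); // width of the name column
  return lines
    .slice(sep + 1)
    .map((line) => line.slice(0, width).trim())
    .filter(Boolean)
    // Subset fonts carry a six-letter tag such as "ABCDEF+"; drop it.
    .map((name) => name.replace(/^[A-Z]{6}\+/, ""));
}

// Assumed layout: "path: Family[,Alias]:style=...".
function parseFcListFamilies(output) {
  return output
    .split("\n")
    .map((line) => line.split(":")[1])
    .filter(Boolean)
    .flatMap((families) => families.split(","))
    .map((family) => family.trim());
}

// Heuristic: a PDF font counts as installed when its normalized name
// starts with the normalized name of some installed family.
function missingFonts(pdffontsOutput, fcListOutput) {
  const installed = parseFcListFamilies(fcListOutput).map(normalize);
  return parsePdffontsNames(pdffontsOutput).filter((font) => {
    const n = normalize(font);
    return !installed.some((family) => family && n.startsWith(family));
  });
}

// Made-up, abbreviated samples (only the first two pdffonts columns).
const pdffontsSample = [
  "name".padEnd(37) + "type",
  "-".repeat(36) + " " + "-".repeat(17),
  "ABCDEF+DejaVuSans-Bold".padEnd(37) + "TrueType",
  "Helvetica".padEnd(37) + "Type 1",
].join("\n");
const fcListSample =
  "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf: DejaVu Sans:style=Book";

console.log(missingFonts(pdffontsSample, fcListSample));
```

The prefix match is deliberately loose, so it can report a font as installed when only a similarly named family is present; installing whatever it lists as missing remains a manual step.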

adrienjoly commented on May 13, 2024

No news => closing.
