Comments (10)

rsshilli commented on May 13, 2024

Today I learned about ligatures. Apparently it's common, when turning text into PDF, to replace character pairs such as ff and fi with a single ligature glyph so the text looks better.

I guess this is a feature request instead: support the standard ligatures that Chrome uses when printing to PDF.

I figured out a workaround, from this Stack Overflow post:

body {
  /* This turns off chrome merging characters in fonts when printing to PDF */
  font-variant-ligatures: none;
  font-feature-settings: "liga" 0;
}

This seems to remove the ligatures so Chrome doesn't print them.

from npm-pdfreader.

adrienjoly commented on May 13, 2024
  1. Patch parse.js to include charcodes in output
diff --git a/parse.js b/parse.js
index d8901cd..83565d3 100644
--- a/parse.js
+++ b/parse.js
@@ -12,7 +12,7 @@ function printRawItems(filename, callback){
     else if (item.page)
       console.log("page =", item.page);
     else if (item.x)
-      console.log([item.x, item.y, item.oc, item.A, Math.floor(item.w), item.text].join("\t"));
+      console.log([item.x, item.y, item.text, item.text.charCodeAt(0)].join(" | "));
     else
       console.warn(item);
   });
  2. Parse the PDF file
$ node parse.js "Scenario-4.1-RiskTables-FQA.pdf" >pdf.log

[...]
6.883 | 10.509 | E | 69
7.502 | 10.509 || 0
8.276 | 10.509 | e | 101
8.895 | 10.509 | c | 99
9.436 | 10.509 | t | 116
9.823 | 10.509 | i | 105
10.133 | 10.509 | v | 118
10.674 | 10.509 | e | 101
11.603 | 10.509 | R | 82
12.299 | 10.509 | M | 77
13.305 | 10.509 | P | 80
[...]
  3. Check charcodes (on tabular version of output, for higher readability)
x y char charcode
6.883 10.509 E 69
7.502 10.509 0
8.276 10.509 e 101
8.895 10.509 c 99
9.436 10.509 t 116
9.823 10.509 i 105
10.133 10.509 v 118
10.674 10.509 e 101
11.603 10.509 R 82
12.299 10.509 M 77
13.305 10.509 P 80

=> observations:

  • the ff ligature is indeed stored as one text item of length 1 (i.e. just one character)
  • the charcode of that character is 0 => this text item carries no information about the ligature. It was probably added just as a placeholder when the PDF file was exported.
  • at that point, by reading just the text items parsed from the PDF file, I don't see any trace of how the ligature is actually displayed...
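A minimal sketch of how such placeholder items could be spotted automatically, assuming items shaped like the ones logged above (`x`, `y`, `text`). The helper name `findSuspectItems` is hypothetical and not part of pdfreader. It flags single-character items whose charcode is a control code (such as the 0 seen here) or falls in the Unicode Latin ligature range U+FB00 to U+FB06:

```javascript
// Hypothetical helper (not part of pdfreader): flags single-character text
// items that may stand in for a ligature.
function findSuspectItems(items) {
  return items.filter((item) => {
    if (typeof item.text !== "string" || item.text.length !== 1) return false;
    const code = item.text.charCodeAt(0);
    const isControl = code < 32; // e.g. the charcode 0 observed above
    const isLatinLigature = code >= 0xfb00 && code <= 0xfb06; // ff, fi, fl, ...
    return isControl || isLatinLigature;
  });
}

// Usage with items shaped like the log above:
const sample = [
  { x: 6.883, y: 10.509, text: "E" },
  { x: 7.502, y: 10.509, text: "\u0000" },
  { x: 8.276, y: 10.509, text: "e" },
];
console.log(findSuspectItems(sample).map((item) => item.x));
```

This only reports where a ligature was probably lost; it cannot tell which letters the placeholder replaced.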

adrienjoly commented on May 13, 2024

Hi, and thanks for this very complete and well written issue!

From what I see, I can think of 3 hypotheses:

    1. The "FF" part of "effective" was replaced by some extended Unicode character that is not necessarily visible when you log the string resulting from the text extraction.
    2. That FF string was treated separately by the PDF writer => instead of being included in the same text entry as the rest of the line, it was added as a different text entry, anywhere in the list of entries (aka "items") defined in the PDF file. => In that case, we would need to implement a way to automatically re-insert it in the initial text item by relying on the coordinates of those text entries. (Not trivial to do)
    3. Worst case scenario: the FF part was written as a graphic entry instead of a text entry. If that's the case, I don't think that pdfreader will be able to help on this, as it was designed to parse text entries only...
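The first two hypotheses can be sketched in code, under stated assumptions. Both helper names are hypothetical and neither is part of pdfreader. `expandLigatures` only helps if the ligature really is stored as a Unicode ligature character (U+FB00 to U+FB06). `rebuildLines` assumes text items carry `x`, `y` and `text`, and that items on the same line share roughly the same `y`; the tolerance value is a guess.

```javascript
// Hypothesis 1: expand Unicode Latin ligature characters into plain letters.
// NFKD applies the compatibility decomposition, e.g. U+FB00 becomes "ff".
function expandLigatures(text) {
  return text.replace(/[\uFB00-\uFB06]/g, (ch) => ch.normalize("NFKD"));
}

// Hypothesis 2: re-insert stray items by coordinates. Items are grouped into
// lines by similar y, then ordered by x and concatenated.
function rebuildLines(items, yTolerance = 0.01) {
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - item.y) <= yTolerance) {
      last.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map((item) => item.text)
      .join("")
  );
}

// Usage: the ligature item arrives out of order, as hypothesis 2 describes.
const items = [
  { x: 6.883, y: 10.509, text: "E" },
  { x: 8.276, y: 10.509, text: "e" },
  { x: 7.502, y: 10.509, text: "\uFB00" },
];
console.log(rebuildLines(items).map(expandLigatures));
```

Neither function helps with the third hypothesis: a ligature drawn as a graphic never shows up as a text item.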

Does that help? Please let us know of your findings and decisions! :-)

adrienjoly commented on May 13, 2024

Thanks for sharing learnings, that's interesting!

Do you know how ligatures are stored in the PDF file? Are they still considered as text? Or graphics, or yet another type of item?

(Knowing that would be handy in cases when users of pdfreader have no control over the generation of the PDF file)

rsshilli commented on May 13, 2024

It looks like there's some strange character in the PDF in its place, but I don't know anything more than that. Unfortunately I'm not enough of a PDF expert.

adrienjoly commented on May 13, 2024

rsshilli commented on May 13, 2024

The PDF was attached to the issue when I reported it. It looks like it's a strange Unicode character.

rsshilli commented on May 13, 2024

Ha. I have no idea either. Magic? :-)

adrienjoly commented on May 13, 2024

I don't know if it may help, but a colleague using Apache Tika to extract text from PDF files told me that he was getting better results by installing the fonts used by the PDF file:

Me: During the talk (and our discussions), you mentioned that installing the fonts used in the PDF file would improve Tika's precision while extracting the text. Does your npm package take care of that?

Tim: The package does not handle that part. But you can get the list of all fonts in a PDF file using pdffonts (part of Xpdf), and get the list of fonts installed on your system with fc-list (at least, that's how it works on Ubuntu). The output from fc-list can be cumbersome to parse, so I made my own script, font-exists to see if a specific font (from the output of pdffonts) exists on my system. Then finding and installing the missing fonts is a manual process that can't be automated.
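As a rough illustration of the check Tim describes (this is not his font-exists script, which isn't shown here), the sketch below looks up a font family in text captured from `fc-list`. It assumes each `fc-list` line looks like `path: Family Name,Alt Name:style=...`; the function names are hypothetical, and matching the names reported by pdffonts (which may carry subset prefixes or PostScript-style names) against family names is left out.

```javascript
// Assumption: fcListOutput is the raw text printed by `fc-list`, one font per
// line, in the form "path: Family Name,Alt Name:style=Style".
function installedFamilies(fcListOutput) {
  const families = new Set();
  for (const line of fcListOutput.split("\n")) {
    const fields = line.split(":");
    if (fields.length < 2) continue; // skip blank or unexpected lines
    for (const family of fields[1].split(",")) {
      const name = family.trim().toLowerCase();
      if (name) families.add(name);
    }
  }
  return families;
}

// Case-insensitive check of whether a family name appears in that output.
function fontExists(fcListOutput, familyName) {
  return installedFamilies(fcListOutput).has(familyName.trim().toLowerCase());
}

// Usage with a captured line (the path and font are only an example):
const fcList =
  "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf: DejaVu Sans:style=Book\n";
console.log(fontExists(fcList, "DejaVu Sans"));
```

In practice the output could be captured with Node's `child_process.execSync("fc-list")`, and installing whatever is missing remains manual, as noted above.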

adrienjoly commented on May 13, 2024

No news => closing.
