
Text Extraction, Rendering and Converting of PDF Documents

Home Page: https://docs.ropensci.org/pdftools

License: Other


pdftools's Introduction

pdftools

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Introduction

Scientific articles are typically locked away in PDF, a format designed primarily for printing and not well suited to searching or indexing. The pdftools package makes it possible to extract text and metadata from PDF files in R. From the extracted plain text one can, for example, find articles discussing a particular drug or species name without relying on publishers to provide metadata or on pay-walled search engines.

The pdftools package slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on macOS and Windows. The pdftools package uses the Poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.

Installation

On Windows and macOS, the binary packages can be installed directly from CRAN:

install.packages("pdftools")

Installation on Linux requires the poppler development library. For Ubuntu 18.04 (Bionic) and Ubuntu 20.04 (Focal) we provide backports of poppler version 22.02 to support the latest functionality:

sudo add-apt-repository -y ppa:cran/poppler
sudo apt-get update
sudo apt-get install -y libpoppler-cpp-dev

On other versions of Debian or Ubuntu, simply use:

sudo apt-get install libpoppler-cpp-dev

If you want to install the package from source on macOS, you need Homebrew:

brew install poppler

On Fedora:

sudo yum install poppler-cpp-devel

Getting started

The ?pdftools manual page gives a brief overview of the main utilities. The most important function is pdf_text, which returns a character vector of length equal to the number of pages in the PDF. Each string in the vector contains a plain-text version of the corresponding page.

library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])

In addition, the package has some utilities to extract other data from the PDF file. The pdf_toc function shows the table of contents, i.e. the section headers which pdf readers usually display in a menu on the left. It looks pretty in JSON:

# Table of contents
toc <- pdf_toc("1403.2805.pdf")

# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)

Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.

# Author, version, etc
info <- pdf_info("1403.2805.pdf")

# Table with fonts
fonts <- pdf_fonts("1403.2805.pdf")

Rendering pdf files

Another feature of pdftools is rendering of PDF files to bitmap arrays (images). The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use pdf_render_page to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.

# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)

# save bitmap image
png::writePNG(bitmap, "page.png")
webp::write_webp(bitmap, "page.webp")

Limitations and related packages

Tables

Data scientists are often interested in data from tables. Unfortunately the PDF format is pretty dumb and does not have a notion of a table (unlike, for example, HTML). Tabular data in a PDF file is nothing more than strategically positioned lines and text, which makes it hard to extract the raw data with pdftools.

txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")

# some tables
cat(txt[18])
cat(txt[19])

The tabulizer package is dedicated to extracting tables from PDF, and includes interactive tools for selecting tables. However, tabulizer depends on rJava and therefore requires additional setup steps or may be impossible to use on systems where Java cannot be installed.

With some creativity it is nevertheless possible to use pdftools to parse tables from PDF documents, which does not require Java to be installed; a sketch follows.
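
For instance, a minimal sketch of that idea (assuming a recent pdftools that provides the pdf_data() function, and a hypothetical single-page file table.pdf): cluster words by their y coordinate and order them by x within each cluster.

library(pdftools)

# pdf_data() gives one data frame per page with x/y coordinates for every word
words <- pdf_data("table.pdf")[[1]]

# Group words that share the same vertical position into rows,
# then paste each row together in left-to-right order
rows <- split(words, words$y)   # real-world tables may need rounding of y
lines <- vapply(rows, function(w) paste(w$text[order(w$x)], collapse = " "), character(1))

# Print the reconstructed rows from top to bottom
cat(lines[order(as.numeric(names(lines)))], sep = "\n")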

Scanned text

If you want to extract text from scanned pages embedded in a PDF, you'll need to use OCR (optical character recognition). Please refer to the rOpenSci tesseract package, which provides bindings to the Tesseract OCR engine. In particular, read the section of its vignette about reading from PDF files using pdftools and tesseract.
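
A minimal sketch of that workflow (assuming the tesseract package is installed, and using a hypothetical scanned file scan.pdf): render each page to a high-resolution image with pdftools, then run OCR on the rendered images.

library(pdftools)
library(tesseract)

# Render the scanned pages to PNG files; a higher dpi usually improves OCR accuracy
pngs <- pdf_convert("scan.pdf", format = "png", dpi = 300)

# Run the Tesseract OCR engine on each rendered page
text <- ocr(pngs)
cat(text[1])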

pdftools's People

Contributors

charliejhadley, etienne-s, gadenbuie, jeroen, maelle, petermeissner, talgalili


pdftools's Issues

Missing data on PDFs with different page sizes

After calling pdf_text I get the text, but some pages are clipped and some data is missing.

It is similar to the landscape problem, but not the same, since not all pages are the same size, and some data is also missing.

I am calling the function directly on the file, with no other configuration.

reference issue: #7

Script

library(pdftools)
library(stringr)  # for str_replace()

# after downloading the file and saving it as 0003_PDF198_206_articulo.pdf
current_pdf <- '0003_PDF198_206_articulo.pdf'
texto_extraido <- pdf_text(current_pdf)

# write one page per row to a UTF-8 text file next to the PDF
pdf_output_file <- str_replace(current_pdf, ".pdf", ".txt")
write.table(x = texto_extraido, file = pdf_output_file, row.names = FALSE,
            col.names = FALSE, quote = FALSE, fileEncoding = 'UTF-8')
pdf_output_file

Data

The example PDF: https://revistas.unlp.edu.ar/raab/article/view/198/206
The output of pdf_text: 0003_PDF198_206_articulo.txt

Some clipped lines:

  • page 2 (numbered 6): has clipped lines (line 36 of the txt is line 7 of page 2 in the PDF)
  • page 4 (numbered 8): has clipped lines (2nd paragraph, 2nd line)

Some missing lines:

  • page 2 (numbered 6): is missing lines (the 1st paragraph, and some words in the 2nd paragraph)
  • page 4 (numbered 8): is missing lines (the 1st paragraph)

Thanks in advance! Also, great work with pdftools, love it :D!

Feature Request: extract images from pdfs

I would like to move my work extracting images from PDFs from calling pdfimages to your package. Would you be interested in exposing this command as an R function?

I feel this would open up more opportunities for analysis of images that are within PDFs. My goal is to get better analysis of tables that are scanned copies embedded in PDFs.
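
For context, the kind of call this refers to might look roughly like the following (a sketch, assuming the poppler-utils pdfimages command-line tool is installed and on the PATH, and a hypothetical input file report.pdf):

# Dump all embedded images as PNG files with the prefix "img" using poppler-utils
system2("pdfimages", args = c("-png", "report.pdf", "img"))

# List the extracted image files
list.files(pattern = "^img-.*\\.png$")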

What went wrong parsing this pdf?

It may well be the case that the .pdf is mal-formed, but I am not equipped to figure that out myself.

The URL for the pdf is here:

http://www.mmb.pa.gov/Pricing%20Information/Wholesale%20or%20Retail/Documents/2011-07-Wholesale_Retail_Pricing.pdf

Code to get to the part that doesn't seem to be parsing properly:

URL = paste0("http://www.mmb.pa.gov/Pricing%20Information/",
             "Wholesale%20or%20Retail/Documents/",
             "2011-07-Wholesale_Retail_Pricing.pdf")

x = strsplit(pdf_text(URL)[1L], split = "\n")[[1L]]

#Troubled result:
x[56L:64L]
# <suppressing output since it's ugly>

#Expected output is just a single element looking roughly like the other lines near there:
#  "               Gallon                $3.9732              $3.7146          $3.5182        $3.4043              $4.11         $3.88         $3.70            $3.60" 

Basically it seems like too many newlines (\n) were found. It could well be that these are actually in the .pdf...


It seems like there is indeed a problem with the pdf itself, as using the command line tool pdftotext also resulted in all of the numbers in that row being duplicated in the output.

Is there any way for the parser to have recognized these duplicates as being "phantom" and removed them appropriately?

pdf_render_page bitmap triggers error with writeTIFF

Per tesseract's example:

Render pdf to jpeg/tiff image

bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")

Error in tiff::writeTIFF(bitmap, "page.tiff") :
INTEGER() can only be applied to a 'integer', not a 'raw'

sessionInfo

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

[1] tesseract_1.6 magick_1.5 tiff_0.1-5 pdftools_1.5
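
A possible workaround (a sketch, not verified against this exact setup, using a hypothetical file page.pdf): tiff::writeTIFF() expects a numeric array, while pdf_render_page() returns a raw bitmap by default, so requesting a numeric array may avoid the error.

library(pdftools)

# Request a numeric array (values in [0, 1]) instead of the default raw bitmap
bitmap <- pdf_render_page("page.pdf", page = 1, dpi = 300, numeric = TRUE)
tiff::writeTIFF(bitmap, "page.tiff")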

Package zip extraction failure windows version 1.8 on CRAN

Version 1.8 of this package hosted on CRAN for Windows, which is downloaded by default by install.packages("pdftools"), cannot be extracted, per the error below:

Warning in install.packages :
downloaded length 9216309 != reported length 10127221

Attempting to install from the downloaded zip file of the archive also fails as does a manual extraction of the zip file using 7zip.

The non-development version 1.8 of the package hosted on CRAN can be extracted and installed without issue.

PDF Read is getting truncated

I am trying to read a PDF, but pdftools is not reading the complete lines and is truncating some text in the right-hand columns. Could you suggest an alternative method to solve this?

factor to character in pdf_fonts

Minor point, but I get

str(pdf_fonts("R-exts.pdf"))
#> 'data.frame':    22 obs. of  4 variables:
#>  $ name    : Factor w/ 22 levels "AYLODX+CMTI10",..: 17 12 9 5 16 21 19 1 6 3 ...
#>  $ type    : Factor w/ 1 level "type1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ embedded: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
#>  $ file    : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...

Would you be willing to change the factor columns to character? It seems safer and less likely to cause problems downstream if users don't realize they're dealing with factors.

Unable to load on Mac

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0    yaml_2.1.19   
> install.packages("pdftools")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/pdftools_1.6.tgz'
Content type 'application/x-gzip' length 6297304 bytes (6.0 MB)
==================================================
downloaded 6.0 MB


The downloaded binary packages are in
	/var/folders/wb/9jk064jn1g70btj2nsrg864h0000gn/T//RtmpFGo7ke/downloaded_packages
> library(pdftools)
Error: package or namespace load failed for ‘pdftools’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/pdftools/libs/pdftools.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.5/Resources/library/pdftools/libs/pdftools.so, 6): Library not loaded: /opt/X11/lib/libxcb-shm.0.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/pdftools/libs/pdftools.so
  Reason: image not found

See also this SO question https://stackoverflow.com/questions/46717048/error-installing-pdftools-package-in-r

unable to read pdf

I am trying to read a PDF file and convert it into an image because it contains a scanned copy of the data, but I am not able to read it.

I am trying the basic code:

library(pdftools)
pdf <- "pdfdata/CMB0760301.PDF"
bitmap <- pdf_render_page(pdf)

PDF error: Invalid page count 0
Error in poppler_render_page(loadfile(pdf), page, dpi, opw, upw, antialiasing,  : 
  Invalid page.

Please tell me why this is happening and how I should avoid it.

Can pdftools provide lines' info?

Because I want to extract table data, information about the lines (position, width, height) would be useful. For example, see the part of the PDF below.

[screenshot of the PDF table omitted]

error reporting in pdf_text

Errors are a little hard to parse; they are all combined into one string. Though maybe this is good enough?

An example:

download.file("https://github.com/sckott/scott/raw/gh-pages/pdfs/Chamberlain%26Rudgers2011EvolEcol.pdf", 
              "paper.pdf")
pdftools::pdf_text('paper.pdf')

#> poppler/error: Invalid shared object hint table offsetpoppler/error: Failed to get object num from 
#> hint tables for page 1poppler/error: Failed parsing page 1 using hint tablespoppler/error: Failed to 
#> get object num from hint tables for page 1poppler/error: Failed parsing page 1 using 
#> hint tablespoppler/error: Failed to get object num from hint tables for page 1poppler/error: 
#> Failed parsing page 1 using hint tables
#> ... cutoff

Use R.home("doc") instead of Sys.getenv("R_DOC_DIR")

Some of the examples in pdftools use Sys.getenv("R_DOC_DIR") instead of R.home("doc") to find the documentation directory. It looks like R_DOC_DIR is not defined by default in R-3.2.4 on Windows so the examples fail.

> version$platform
[1] "x86_64-w64-mingw32"
> example(pdf_info)

pdf_nf> # Just a random pdf file
pdf_nf> file.copy(file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf"), "news.pdf")
[1] FALSE

pdf_nf> info <- pdf_info("news.pdf")
Error in normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="news.pdf": The system cannot find the file specified

Also, the file.copy() is not desirable, as it leaves a file behind in the current directory.
I think the example should be
pdf_info(file.path(R.home("doc"), "NEWS.pdf"))

Bill Dunlap

pdf_text returns empty strings for specific files V2

Hi,
I'm trying to read ~1700 pdfs from urls and most are working but ~150 are not. For example this one: pdftools::pdf_text("http://www.dt.tesoro.it/export/sites/sitodt/modules/documenti_en/debito_pubblico/risultati_aste/risultati_aste_btp_10_anni/10-Years-BTP-Auction-Results-30.12.2002.pdf") gives an empty string.

This is notified as a bug here: #24 but downloading the dev version didn't fix the issue for me.

However, this PDF has quite different metadata from the ones that read properly: it is not linearised, has many "\n" in its metadata, and has a one-column layout. pdftools::pdf_info("http://www.dt.tesoro.it/export/sites/sitodt/modules/documenti_en/debito_pubblico/risultati_aste/risultati_aste_btp_10_anni/10-Years-BTP-Auction-Results-30.12.2002.pdf")

Is this a new bug or the same bug, or is this known functionality because this pdf is bad?

Thanks for any help; this package has already saved so many hours of work!

Cheers,
Grace

An example that reads as expected: pdftools::pdf_text("http://www.dt.tesoro.it/export/sites/sitodt/modules/documenti_en/debito_pubblico/risultati_aste/risultati_aste_btp_10_anni/10-Year-BTP--Auction-Results-27.02.2.pdf")

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] pdftools_1.8

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1 Rcpp_0.12.18

Suggestion of table to data.table (or data.frame) toolbox

I love everything this package wants to be. I have a suggestion. I read that turning PDF tables into R objects is not easy since there are many ways one can set up a table in a PDF. But surely there are several common ways, or several variations on a theme. Would it be possible to include (in the package or outside of it) a few tools or suggestions to get the naïve or very part-time coder started? Or how about a blog post tutorial? I am sure that if I saw three worked examples it would help me tremendously.

Getting a poppler version error when trying to use pdf_data()

Can't use the pdf_data() function due to version error:

Error in poppler_pdf_data(loadfile(pdf), opw, upw) : 
  This feature requires poppler >= 0.63. You have 0.61.0
Session info ---------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.0 (2018-04-23)
 system   i386, mingw32               
 ui       RStudio (1.1.453)           
 language (EN)                        
 collate  English_United States.1252  
 tz       America/New_York            
 date     2018-08-13                  

Packages -------------------------------------------------------------------------------------------
 package   * version date       source                            
 base      * 3.5.0   2018-04-23 local                             
 compiler    3.5.0   2018-04-23 local                             
 datasets  * 3.5.0   2018-04-23 local                             
 devtools  * 1.13.6  2018-06-27 CRAN (R 3.5.1)                    
 digest      0.6.15  2018-01-28 CRAN (R 3.5.1)                    
 graphics  * 3.5.0   2018-04-23 local                             
 grDevices * 3.5.0   2018-04-23 local                             
 memoise     1.1.0   2017-04-21 CRAN (R 3.5.1)                    
 methods   * 3.5.0   2018-04-23 local                             
 pdftools  * 1.8     2018-08-13 Github (ropensci/pdftools@30f1f4c)
 Rcpp        0.12.18 2018-07-23 CRAN (R 3.5.1)                    
 stats     * 3.5.0   2018-04-23 local                             
 tools       3.5.0   2018-04-23 local                             
 utils     * 3.5.0   2018-04-23 local                             
 withr       2.1.2   2018-03-15 CRAN (R 3.5.1)                    
 yaml        2.2.0   2018-07-25 CRAN (R 3.5.1)  

Reference Tabula / tabulizer in README

You might want to reference Tabula and tabulizer in the final section of the README. They work quite well for extracting tables and are a better suggestion than "with a little creativity you might be able to parse the table data."

Inserts empty lines and spaces between letters of random words

Inserts empty lines and spaces b e t w e e n letters of r a n d o m words.

This text:

[image of the original scanned text]

Becomes this text:

"Numéro 476. Japon f i n i t mal l'année. Peuple privé de tout sans

i l l u s i o n s quoique la presse ramène l e s revers successifs aux

Philippines et en Birmanie aux proportions escarmouches entre

avions. Situation économique toujours lamentable et prix a t -

teignent un niveau i n o u i . Conditions en Mandchourie meilleures

mais faute de transports et organisation, vivres de c e t t e c o l o -

nie arrivent d i f f i c i l e m e n t e t peuple s o u f f r e de l a faim et du

froid. Malaise accru par perturbations du t r a f i c f e r r o v i a i r e"

The OCR text already present in the PDF has no such spaces. A direct copy-paste from Acrobat Pro DC looks like this:

"Numéro 476. Japon finit mal l'année. Peuple privé de tout sans
illusions quoique la presse ramène les revers successifs aux
Philippines et en Birmanie aux proportions escarmouches entre
avions. Situation économique toujours lamentable et prix at¬
teignent un niveau inoui. Conditions en Mandchourie meilleures
mais faute de transports et organisation, vivres de cette colo¬
nie arrivent difficilement et peuple souffre de la faim et du
froid. Malaise accru par perturbations du trafic ferroviaire"

R version 3.2.5 (2016-04-14)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=French_Switzerland.1252 LC_CTYPE=French_Switzerland.1252
[3] LC_MONETARY=French_Switzerland.1252 LC_NUMERIC=C
[5] LC_TIME=French_Switzerland.1252

Install issue on Ubuntu

Hi - I get the following error while trying to install on Ubuntu. I installed libpoppler-cpp-dev prior to installing pdftools. Error:

* installing to library ‘/home/dev/R/x86_64-pc-linux-gnu-library/3.2’
* installing *source* package ‘pdftools’ ...
** package ‘pdftools’ successfully unpacked and MD5 sums checked
Found pkg-config cflags and libs!
Using PKG_CFLAGS=-I/usr/local/include/poppler/cpp -I/usr/local/include/poppler
Using PKG_LIBS=-L/usr/local/lib -lpoppler-cpp
** libs
g++ -I/usr/share/R/include -DNDEBUG -I/usr/local/include/poppler/cpp -I/usr/local/include/poppler    -I"/usr/local/lib/R/site-library/Rcpp/include"   -fpic  -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g  -c RcppExports.cpp -o RcppExports.o
g++ -I/usr/share/R/include -DNDEBUG -I/usr/local/include/poppler/cpp -I/usr/local/include/poppler    -I"/usr/local/lib/R/site-library/Rcpp/include"   -fpic  -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g  -c bindings.cpp -o bindings.o
g++ -I/usr/share/R/include -DNDEBUG -I/usr/local/include/poppler/cpp -I/usr/local/include/poppler    -I"/usr/local/lib/R/site-library/Rcpp/include"   -fpic  -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g  -c init.cpp -o init.o
g++ -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o pdftools.so RcppExports.o bindings.o init.o -L/usr/local/lib -lpoppler-cpp -L/usr/lib/R/lib -lR
installing to /home/dev/R/x86_64-pc-linux-gnu-library/3.2/pdftools/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error in dyn.load(file, DLLpath = DLLpath, ...) :
  unable to load shared object '/home/dev/R/x86_64-pc-linux-gnu-library/3.2/pdftools/libs/pdftools.so':
  /home/dev/R/x86_64-pc-linux-gnu-library/3.2/pdftools/libs/pdftools.so: undefined symbol: _ZN7poppler24set_debug_error_functionEPFvRKSsPvES2_
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/home/dev/R/x86_64-pc-linux-gnu-library/3.2/pdftools’

Poppler-data and CJK

Thank you for creating the package. It is excellent when working with English PDF files.

When I try to use it on a PDF with Chinese text (an example is attached)
example.pdf

I have received the following warning:

Warning: error: Missing language pack for 'Adobe-CNS1' mapping

I have tried downloading the standalone poppler pdftotext and adding the library. It seems to work at the command line. How can I add the poppler-data to the pdftools installation so that I can work in R?

Thanks.

Submitting a table parser?

Hi @jeroen,

I have started writing a function that does a decent job at inferring rows and columns from the output of pdf_data(). I find it works well in a surprising number of table cases, though I fully admit, not all of them. We're using it internally and I wonder if you'd have interest in including it in pdftools? I could submit a PR and work on it with you.

PDF error: Couldn't find a font for 'Helvetica'

Hello, I'm trying to convert a PDF created via knitr to PNG.

When using pdf_convert I get this error: "PDF error: Couldn't find a font for 'Helvetica'".

Helvetica is undoubtedly installed on my machine, so I don't understand where this message comes from.

Thanks

circle symbols don't appear on the converted png

The circle symbols (pch = 1, 16, 19, 20, 21) are blank or do not appear in the PNG when converted from a PDF.
For example:

pdf("baseplot.pdf")
plot(0:25, rnorm(26), type = "b", pch = 0:25, col = "red")
dev.off()
pdftools::pdf_convert("baseplot.pdf", dpi = 300, filenames = "baseplot.png")

[converted PNG showing the missing circle symbols]

Plots generated by ggplot2 are also affected.

Versions: pdftools 1.5, R 3.3.3, Windows 10.

Thank you

text to pdf?

I have the following function:

FunctionFun <- function(.csv)
{
  # redirect printed output to a text file
  sink('Dimension.txt')
  csv <- read.csv(.csv)
  dimValue <- dim(csv)
  print("The dimension of the dataset is:")
  print(dimValue)
  sink()  # close the sink before returning
  return(dimValue)
}

I want to write another function that would convert the Dimension.txt to Dimension.pdf. Can I achieve this using pdftools?
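
As a side note, pdftools reads and renders existing PDFs but does not create them. One minimal sketch for turning a short text file into a PDF uses base R's pdf() graphics device; the file names Dimension.txt and Dimension.pdf are taken from the question above.

# Read the text file and draw each line onto a single PDF page with base graphics
txt_lines <- readLines("Dimension.txt")

pdf("Dimension.pdf", width = 8.5, height = 11)
plot.new()
text(x = 0, y = seq(0.95, by = -0.05, length.out = length(txt_lines)),
     labels = txt_lines, adj = 0, family = "mono")
dev.off()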

pdf_info failing with embedded NUL

image: rstudio/rocker
packages: devtools install from github
crossposted in case not a bug: https://stackoverflow.com/questions/53293124/pdftools-embeded-nul-in-string

I'm trying to download a file and read its info automatically, from the following link:

http://www.leyes.congreso.gob.pe/Documentos/2016_2021/Proyectos_de_Ley_y_de_Resoluciones_Legislativas/PL0361420181108.pdf

The problem is that when I try to read the information in the PDF, I get an error. It seems to happen on and off, and I can't see a good reason why. The error appears to be Linux-only.

    library(pdftools)
    link = "http://www.leyes.congreso.gob.pe/Documentos/2016_2021/Proyectos_de_Ley_y_de_Resoluciones_Legislativas/PL0361420181108.pdf"
    download.file(link, "somefile.pdf")
    pdf_info("somefile.pdf")
    Error in poppler_pdf_info(loadfile(pdf), opw, upw) : 
      Embedded NUL in string.

What else I've tried:

  • Tried downloading using mode = "wb"
  • Tried downloading with httr using the write_disk method
  • Tried downloading manually on windows and it works! :(
  • Tried downloading using curl:download() and it still failed.

My suspicion was/is that it has to do with the way I'm downloading the file. But, I don't know what alternatives I should be trying.

Page orientation

First, I want to just say that this is fantastic package and has been extremely helpful, thank you.

I'm writing a parser to extract data from unstructured pdfs, and sometimes the pages are rotated 90 degrees. I'm aware that the mediabox stores properties like page width and page height, and with a few exceptions, I can back out the page orientation using that.

My question is whether accessing the mediabox is possible using the pdftools package, or whether you know of any other means of doing this within my R program? Any solution will be much appreciated!
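
One possible route (a sketch; it assumes a newer pdftools release that provides the pdf_pagesize() helper, and a hypothetical file doc.pdf):

library(pdftools)

# pdf_pagesize() returns one row per page with the page box dimensions in points
sizes <- pdf_pagesize("doc.pdf")

# Pages that are wider than they are tall are presumably rotated to landscape
landscape <- sizes$width > sizes$height
which(landscape)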

Determining Local Font Name?

As of now pdftools can extract global font information -- that is, all fonts used in a single document. I'd like to get local font information. Ideally, I'd like to determine the font used for each word in a page of text. Are there existing methods for this in the pdftools package? Or am I better off using a different package?

For instance, the sample pdf (http://arxiv.org/pdf/1403.2805.pdf) has several different fonts for the first page. I'd like to know which words use a given font.

In my case, I want to tag bolded words extracted from my pdf of interest. I used PDF-Xchange viewer to manually determine that the bolded words are using a different, bolded version of the original font. Since there are thousands of bolded words, I'd like a procedure to automatically identify and tag them during the text extraction process.

I've attached the pdf of interest. See page 195 for an example.
Flavor-Bible-epub.pdf

Peer certificate cannot be authenticated

What's the workaround for an online PDF that is accessible in a browser (Firefox, but not Chrome)?

My error:

my.url <- "https://www.asafm.army.mil/documents/BudgetMaterial/fy2005/oma-v1.pdf"
my.pdf <- pdf_text(my.url)

Error in open.connection(con, "rb") : 
  cannot open the connection to 'https://www.asafm.army.mil/documents/BudgetMaterial/fy2005/oma-v1.pdf'
In addition: Warning message:
In open.connection(con, "rb") :
  URL 'https://www.asafm.army.mil/documents/BudgetMaterial/fy2005/oma-v1.pdf': status was 'Peer certificate cannot be authenticated with given CA certificates'
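
One possible workaround (a sketch; note that relaxing certificate verification has security implications): download the file first with verification disabled, then read the local copy. The URL is the one from the question.

library(httr)
library(pdftools)

my.url <- "https://www.asafm.army.mil/documents/BudgetMaterial/fy2005/oma-v1.pdf"

# Download without verifying the peer certificate, then parse the local file
GET(my.url, config(ssl_verifypeer = FALSE), write_disk("oma-v1.pdf", overwrite = TRUE))
my.pdf <- pdf_text("oma-v1.pdf")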

Missing ZapfDingbats on Windows

I tried to convert a PDF image to PNG. Everything seems to go well, except for some dots that are no longer visible in the PNG.

netherlands_pdf.pdf

I posted this problem on SO.

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252    

Can't read PDF containing parenthesis

Hi, I'm using pdftools with pdf_text to read the contents into a variable. Some PDFs cannot be read. This is the error I'm getting.

PDF error (38384): Illegal character ')'

Any help is appreciated

How to include in CRAN package

Great work with pdftools. Thank you.

I recently put a package textreadr on CRAN and I get an error with solaris machines as pdftools is not available. https://cran.r-project.org/web/checks/check_results_textreadr.html

In pdftools' checks https://cran.r-project.org/web/checks/check_results_pdftools.html I don't see solaris listed.

  • How are you able to make CRAN not check for solaris?
  • How can I proceed with importing pdftools and drop the errors?

Maybe these are the same question.

How can I get rid of two Headers which are changing every other page?

Hi everyone here!

I have a problem that I cannot solve myself. I have a PDF file that I am working on. So far, I have been able to read it and I have it on my laptop.

The file has two headers which alternate every other page. I would like to read the file into my laptop, transforming it into txt format, but get rid of the headers as I load it into my computer.
Would you mind giving me some help in doing this?

Is it possible to do so using the "pdftools" R package?
If it is not possible, do you know a way to do this once I have it loaded in txt format on my laptop? Of course I mean using R.

If you want, you can get the pdf file from here: http://www.arqmain.net/Applets/Einstein-Albert-Mi-Credo-Humanista.pdf
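
A minimal sketch of one approach, under the assumption that the headers occupy the first couple of lines of every page (the URL is the one from the question; the output file name is hypothetical): split each page into lines, drop the leading lines, and re-join.

library(pdftools)

url <- "http://www.arqmain.net/Applets/Einstein-Albert-Mi-Credo-Humanista.pdf"
pages <- pdf_text(url)

# Drop the first n lines of every page (assumed to be the running headers)
strip_header <- function(page, n = 2) {
  page_lines <- strsplit(page, "\n")[[1]]
  paste(page_lines[-seq_len(n)], collapse = "\n")
}
body <- vapply(pages, strip_header, character(1))

writeLines(body, "documento-sin-encabezados.txt")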

Misreading of fi and fl

fi and fl are being read as <U+FB01> and <U+FB02>.

library(tidytext)
library(tidyverse)
library(rvest)
library(pdftools)

report <- 
  "http://www.bhp.com/~/media/bhp/documents/investors/reports/2011/bhpbillitonannualreport2011.pdf?la=en" %>%
  pdf_text %>%
  paste(collapse = " ") %>%
  tibble(text = .) %>%
  unnest_tokens(word, text) %>% 
  arrange(word) %>%
  rowid_to_column("ID") %>%
  dplyr::filter(between(ID, 23, 673))

report %>% purrr::pluck("word") %>% str_detect(pattern = "fi") %>% sum()     # 0
report %>% purrr::pluck("word") %>% str_detect(pattern = "\uFB01") %>% sum() # 572
report %>% purrr::pluck("word") %>% str_detect(pattern = "fl") %>% sum()     # 0
report %>% purrr::pluck("word") %>% str_detect(pattern = "\uFB02") %>% sum() # 79
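
A minimal post-processing sketch (not a pdftools fix): map the Unicode ligature code points back to plain letters after extraction, for example on the word column built above.

# Replace the fi (U+FB01) and fl (U+FB02) ligatures with their ASCII equivalents
fix_ligatures <- function(x) {
  x <- gsub("\uFB01", "fi", x, fixed = TRUE)
  gsub("\uFB02", "fl", x, fixed = TRUE)
}

report$word <- fix_ligatures(report$word)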

Could not parse ligature component

Hi, thanks for the great package. But I had a problem parsing this file.

Here is the reprex:

library(pdftools)
library(tidyverse)
pdf_path <- "myfile.pdf"
pdf_info(pdf_path)
#> $version
#> [1] "1.6"
#> 
#> $pages
#> [1] 16
#> 
#> $encrypted
#> [1] FALSE
#> 
#> $linearized
#> [1] FALSE
#> 
#> $keys
#> $keys$Type
#> [1] ""
#> 
#> $keys$Producer
#> [1] "Oracle BI Publisher 12.2.1.3.0"
#> 
#> 
#> $created
#> [1] "2106-02-07 09:28:15 +03"
#> 
#> $modified
#> [1] "2106-02-07 09:28:15 +03"
#> 
#> $metadata
#> [1] ""
#> 
#> $locked
#> [1] FALSE
#> 
#> $attachments
#> [1] FALSE
#> 
#> $layout
#> [1] "no_layout"
the_text <- pdftools::pdf_text(pdf_path)
#> PDF error: Could not parse ligature component "missing" of "missing_glyph" in parseCharName
#> PDF error: Could not parse ligature component "glyph" of "missing_glyph" in parseCharName
#> PDF error: Could not parse ligature component "missing" of "missing_glyph" in parseCharName
#> PDF error: Could not parse ligature component "glyph" of "missing_glyph" in parseCharName
#> PDF error: Could not parse ligature component "missing" of "missing_glyph" in parseCharName

There were more of these "PDF error" messages, but I truncated them.

There is a similar issue raised here on SE, perhaps it is relevant.

pdf_text returns empty strings for specific files

Hi Jeroen,

I'm having an issue in which calling pdftools::pdf_text on a specific set of files returns nothing but a single empty string ("") per page. I'm able to read in other PDF documents without issue. What makes this especially weird is that as of a week ago there were no issues while working with these files on my Mac, and then at some point about a week ago this issue started and has persisted on both Mac and PC. There have been no updates to R, pdftools, or any other software I can think of that would cause this on both machines.

The files are all similar, they're public data sets of Idaho payroll expenses. Here they are:
https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf
http://mediad.publicbroadcasting.net/p/kisu/files/workforce.pdf
https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf

For example, the first file contains 1012 pages:

identical(
  pdftools::pdf_text("https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"), 
  rep("", 1012)
)
#> TRUE

Downloading first and then reading in the local file gives the same result.

Using pdftotext at the command line works great (works using either flag -layout or -table).

Here's the output of pdf_info:

pdftools::pdf_info("https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf")
$version
[1] "1.4"

$pages
[1] 1012

$encrypted
[1] FALSE

$linearized
[1] TRUE

$keys
$keys$Producer
[1] "PDF Engine win32 - (10.1)"


$created
[1] "2013-11-19 04:39:33 CST"

$modified
[1] "2013-11-19 20:45:26 CST"

$metadata
[1] "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n<x:xmpmeta xmlns:x=\"adobe:ns:meta/\" x:xmptk=\"Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04        \">\n   <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n      <rdf:Description rdf:about=\"\"\n            xmlns:xmp=\"http://ns.adobe.com/xap/1.0/\">\n         <xmp:CreateDate>2013-11-19T03:39:33-07:00</xmp:CreateDate>\n         <xmp:ModifyDate>2013-11-19T18:45:26-08:00</xmp:ModifyDate>\n         <xmp:MetadataDate>2013-11-19T18:45:26-08:00</xmp:MetadataDate>\n      </rdf:Description>\n      <rdf:Description rdf:about=\"\"\n            xmlns:pdf=\"http://ns.adobe.com/pdf/1.3/\">\n         <pdf:Producer>PDF Engine win32 - (10.1)</pdf:Producer>\n      </rdf:Description>\n      <rdf:Description rdf:about=\"\"\n            xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n         <dc:format>application/pdf</dc:format>\n      </rdf:Description>\n      <rdf:Description rdf:about=\"\"\n            xmlns:xmpMM=\"http://ns.adobe.com/xap/1.0/mm/\">\n         <xmpMM:DocumentID>uuid:b05f6f5f-4a32-4606-8cc0-4de378ad1853</xmpMM:DocumentID>\n         <xmpMM:InstanceID>uuid:b06b16b2-4fbc-4c99-8435-0fc05830c526</xmpMM:InstanceID>\n      </rdf:Description>\n   </rdf:RDF>\n</x:xmpmeta>\n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                           \n<?xpacket end=\"w\"?>"

$locked
[1] FALSE

$attachments
[1] FALSE

$layout
[1] "no_layout"

And here's session info from my PC:

devtools::session_info()
Session info ----------------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 system   x86_64, mingw32             
 ui       RStudio (1.1.383)           
 language (EN)                        
 collate  English_United States.1252  
 tz       America/Chicago             
 date     2018-01-18                  

Packages --------------------------------------------------------------------------------------------------------------------------------------------------
 package   * version date       source        
 base      * 3.4.3   2017-12-06 local         
 compiler    3.4.3   2017-12-06 local         
 datasets  * 3.4.3   2017-12-06 local         
 devtools    1.13.4  2017-11-09 CRAN (R 3.4.2)
 digest      0.6.14  2018-01-14 CRAN (R 3.4.3)
 graphics  * 3.4.3   2017-12-06 local         
 grDevices * 3.4.3   2017-12-06 local         
 memoise     1.1.0   2017-04-21 CRAN (R 3.4.3)
 methods   * 3.4.3   2017-12-06 local         
 pdftools    1.5     2017-11-05 CRAN (R 3.4.2)
 Rcpp        0.12.14 2017-11-23 CRAN (R 3.4.3)
 stats     * 3.4.3   2017-12-06 local         
 tools       3.4.3   2017-12-06 local         
 utils     * 3.4.3   2017-12-06 local         
 withr       2.1.1   2017-12-19 CRAN (R 3.4.3)
 yaml        2.1.16  2017-12-12 CRAN (R 3.4.3)

Let me know if you have questions or need more info from me. Also, thanks for all your work on this package and other tools!

poppler_pdf_text not enough space

I need to read a lot of pdfs. I loop through each pdf and for about 50 pdfs it works fine. Then it says poppler_pdf_text doesn't have enough space. I have deleted objects created during the loop and did gc() afterwards but it still crashes the program. Is there some other way this package is building up memory usage going through multiple pdfs?

Question: page dimensions

Thanks for a useful package @jeroen! 👌

I was wondering whether it's a limitation of the underlying library or of the PDF format to not get the page dimensions from pdf_info? See example at the bottom. Is there any other way for me to get the dimensions of pages in a PDF?

The reason why I'd like to get page dimensions is that I'm developing a function creating a PDF, with an argument defining the size of the paper, and I'd like to add an unit test for that aspect.

library("pdftools")
#> Warning: package 'pdftools' was built under R version 3.5.1
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
pdf_info(pdf_file)
#> $version
#> [1] "1.5"
#> 
#> $pages
#> [1] 89
#> 
#> $encrypted
#> [1] FALSE
#> 
#> $linearized
#> [1] FALSE
#> 
#> $keys
#> $keys$Author
#> [1] ""
#> 
#> $keys$Title
#> [1] ""
#> 
#> $keys$Subject
#> [1] ""
#> 
#> $keys$Creator
#> [1] "LaTeX with hyperref package"
#> 
#> $keys$Producer
#> [1] "pdfTeX-1.40.19"
#> 
#> $keys$Keywords
#> [1] ""
#> 
#> $keys$Trapped
#> [1] ""
#> 
#> $keys$PTEX.Fullbanner
#> [1] "This is MiKTeX-pdfTeX 2.9.6642 (1.40.19)"
#> 
#> 
#> $created
#> [1] "2018-04-23 12:33:33 CEST"
#> 
#> $modified
#> [1] "2018-04-23 12:33:33 CEST"
#> 
#> $metadata
#> [1] ""
#> 
#> $locked
#> [1] FALSE
#> 
#> $attachments
#> [1] FALSE
#> 
#> $layout
#> [1] "no_layout"

Created on 2018-07-21 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  tz       Europe/Paris                
#>  date     2018-07-21
#> Packages -----------------------------------------------------------------
#>  package   * version    date       source                            
#>  backports   1.1.2      2017-12-13 CRAN (R 3.5.0)                    
#>  base      * 3.5.0      2018-04-23 local                             
#>  compiler    3.5.0      2018-04-23 local                             
#>  datasets  * 3.5.0      2018-04-23 local                             
#>  devtools    1.13.5     2018-02-18 CRAN (R 3.5.0)                    
#>  digest      0.6.15     2018-01-28 CRAN (R 3.5.0)                    
#>  evaluate    0.10.1     2017-06-24 CRAN (R 3.5.0)                    
#>  graphics  * 3.5.0      2018-04-23 local                             
#>  grDevices * 3.5.0      2018-04-23 local                             
#>  htmltools   0.3.6.9001 2018-06-16 Github (rstudio/htmltools@3aee819)
#>  knitr       1.20       2018-02-20 CRAN (R 3.5.0)                    
#>  magrittr    1.5        2014-11-22 CRAN (R 3.5.0)                    
#>  memoise     1.1.0      2017-04-21 CRAN (R 3.5.0)                    
#>  methods   * 3.5.0      2018-04-23 local                             
#>  pdftools  * 1.8        2018-05-27 CRAN (R 3.5.1)                    
#>  Rcpp        0.12.17    2018-05-18 CRAN (R 3.5.0)                    
#>  rmarkdown   1.10       2018-06-11 CRAN (R 3.5.0)                    
#>  rprojroot   1.3-2      2018-01-03 CRAN (R 3.4.3)                    
#>  stats     * 3.5.0      2018-04-23 local                             
#>  stringi     1.2.3      2018-06-12 CRAN (R 3.5.0)                    
#>  stringr     1.3.1      2018-05-10 CRAN (R 3.5.0)                    
#>  tools       3.5.0      2018-04-23 local                             
#>  utils     * 3.5.0      2018-04-23 local                             
#>  withr       2.1.2      2018-03-15 CRAN (R 3.4.4)                    
#>  yaml        2.1.19     2018-05-01 CRAN (R 3.5.0)

can pdf_data provide page's height & width?

pdf_data's output is below:

# A tibble: 430 x 6
   width height     x     y space text  
   <int>  <int> <int> <int> <lgl> <chr> 
 1    29      8   154   139 TRUE  Mazda 
 2    19      8   187   139 FALSE RX4   
 3    29      8   154   151 TRUE  Mazda 
 4    19      8   187   151 TRUE  RX4   
 5    19      8   210   151 FALSE Wag   
 6    31      8   154   163 TRUE  Datsun
 7    14      8   189   163 FALSE 710   
 8    30      8   154   176 TRUE  Hornet
 9     4      8   188   176 TRUE  4     
10    23      8   196   176 FALSE Drive 

However, I don't know the page's width and height. Hence, I can't transform the coordinates into percentages of the page. I considered using pdf_convert, but the dpi is uncertain and converting PNGs is slow. Can you give me some suggestions?

`PDF_data` not working

I am getting an error with the poppler version though it is up to date:

library(pdftools)
pdf_data("YG-Archive-DatingSocialMediaInternal-090818.pdf")
#> Error in poppler_pdf_data(loadfile(pdf), opw, upw): This feature requires poppler >= 0.63. You have 0.71.0
poppler_config()
#> $version
#> [1] "0.71.0"
#> 
#> $can_render
#> [1] TRUE
#> 
#> $supported_image_formats
#> [1] "png"  "jpeg" "jpg"  "tiff" "pnm"
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS Sierra 10.12.6        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2018-11-30                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                         
#>  assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.0)                 
#>  backports     1.1.2      2017-12-13 [1] CRAN (R 3.5.0)                 
#>  base64enc     0.1-3      2015-07-28 [1] CRAN (R 3.5.0)                 
#>  callr         3.0.0      2018-08-24 [1] CRAN (R 3.5.0)                 
#>  cli           1.0.1      2018-09-25 [1] CRAN (R 3.5.0)                 
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                 
#>  debugme       1.1.0      2017-10-22 [1] CRAN (R 3.5.0)                 
#>  desc          1.2.0      2018-10-06 [1] local                          
#>  devtools      2.0.1      2018-10-26 [1] CRAN (R 3.5.1)                 
#>  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.0)                 
#>  evaluate      0.12       2018-10-09 [1] CRAN (R 3.5.0)                 
#>  fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.0)                 
#>  glue          1.3.0      2018-07-17 [1] CRAN (R 3.5.0)                 
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)                 
#>  knitr         1.20       2018-09-21 [1] Github (yihui/knitr@0da648b)   
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)                 
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)                 
#>  pdftools    * 1.8        2018-05-27 [1] CRAN (R 3.5.1)                 
#>  pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.0)                 
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.1)                 
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)                 
#>  processx      3.2.0.9000 2018-11-13 [1] Github (r-lib/processx@8374340)
#>  ps            1.2.1      2018-11-06 [1] CRAN (R 3.5.0)                 
#>  R6            2.3.0      2018-10-04 [1] CRAN (R 3.5.0)                 
#>  Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.5.0)                 
#>  remotes       2.0.2      2018-10-30 [1] CRAN (R 3.5.0)                 
#>  rlang         0.3.0.1    2018-10-25 [1] CRAN (R 3.5.0)                 
#>  rmarkdown     1.10       2018-06-11 [1] CRAN (R 3.5.0)                 
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                 
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.0)                 
#>  stringi       1.2.4      2018-07-20 [1] CRAN (R 3.5.0)                 
#>  stringr       1.3.1      2018-05-10 [1] CRAN (R 3.5.0)                 
#>  testthat      2.0.1      2018-10-13 [1] CRAN (R 3.5.0)                 
#>  usethis       1.4.0.9000 2018-11-13 [1] local                          
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                 
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.0)                 
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

Created on 2018-11-30 by the reprex package (v0.2.1)
