
sckott / pdfimager


Extract images from pdfs using poppler <https://poppler.freedesktop.org/>

Home Page: https://sckott.github.io/pdfimager/

License: Other

Makefile 6.62% R 93.38%
r rstats poppler pdfimages pdf


pdfimager's Issues

Error: pdfimages version 3.04

Following the vignette, I get an error:

library("pdfimager")
x <- system.file("examples/BachmanEtal2020.pdf", package="pdfimager")
pdimg_meta(x)

Results in:

"Error: pdfimages version 3.04"

I'm on a Mac (Darwin 20.5.0).
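
For anyone hitting the same error, it may help to check which pdfimages binary R actually finds. Both Poppler and Xpdf ship a tool with that name, and a 3.x version string may point to the Xpdf one rather than Poppler's. The sketch below is only a diagnostic under that assumption; `pdfimages_version()` is a hypothetical helper, not part of pdfimager.

```r
# Hypothetical helper (not part of pdfimager): report the version banner of
# whichever `pdfimages` executable is first on the PATH.
pdfimages_version <- function() {
  # Sys.which() returns "" when the executable cannot be found
  if (!nzchar(Sys.which("pdfimages"))) return(NA_character_)

  # the version banner may go to stderr, so capture both streams
  out <- suppressWarnings(
    system2("pdfimages", "-v", stdout = TRUE, stderr = TRUE)
  )
  if (length(out) == 0) return(NA_character_)

  # the first line carries the version string
  out[[1]]
}

Sys.which("pdfimages")   # path of the binary that would be used
pdfimages_version()      # its version banner, or NA if none is installed
```

If the path points to an Xpdf installation, installing Poppler and making sure its `pdfimages` comes first on the PATH is one thing to try.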

gsub pattern in pdimg_image

Hey @sckott, the current pdimg_image implementation passes '.pdf' as the pattern to gsub:

if (is.null(dir)) {
dir <- file.path(tempdir(), gsub(".pdf", "", basename(path)), "img")
} else {
# expand path in case there's a tilda or similar
dir <- path.expand(dir)
dir <- file.path(dir, gsub(".pdf", "", basename(path)), "img")
}

which is treated as a regular expression, so the '.' matches any character. This means that every occurrence of any character followed by 'pdf' will be replaced with ''.

Would you consider adding the fixed = TRUE parameter to gsub, or at least rewriting the pattern to .pdf$, so it's clear we only want to substitute the file extension?

With the current implementation, some file names get more parts substituted than intended:

> gsub('.pdf', '', 'file_name_pdf_1.pdf')
[1] "file_name_1"
> gsub('.pdf', '', 'file_namepdf_1.pdf')
[1] "file_nam_1"
> gsub('.pdf', '', 'epdf_1.pdf')
[1] "_1"

compared to

> gsub('.pdf$', '', 'epdf_1.pdf')
[1] "epdf_1"
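
A minimal sketch of two stricter alternatives, using the file names from the examples above. Note that `.pdf$` still leaves the dot unescaped, and `fixed = TRUE` on its own would still remove a literal ".pdf" in the middle of a name, so the sketch escapes the dot and anchors the pattern; `tools::file_path_sans_ext()` is a base R helper that strips the final extension. Whether either fits the package's internals is an assumption.

```r
# File names taken from the examples in this issue
files <- c("file_name_pdf_1.pdf", "file_namepdf_1.pdf", "epdf_1.pdf")

# Current pattern: "." matches any character and the match is unanchored
current <- gsub(".pdf", "", files)

# Escaped dot plus end-of-string anchor: only a trailing ".pdf" is removed
anchored <- sub("\\.pdf$", "", files)

# Base R helper that strips the final extension, whatever it is
sans_ext <- tools::file_path_sans_ext(files)

print(current)   # "file_name_1" "file_nam_1" "_1"
print(anchored)  # "file_name_pdf_1" "file_namepdf_1" "epdf_1"
print(sans_ext)  # same as `anchored` for these names
```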

Session info


> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_Europe.1252  LC_CTYPE=English_Europe.1252    LC_MONETARY=English_Europe.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Europe.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] cvparser_0.6.2

loaded via a namespace (and not attached):
 [1] NLP_0.2-1           Rcpp_1.0.9          dbplyr_2.1.1        pillar_1.7.0       
 [5] compiler_4.1.0      hocr_0.0.0.9002     pdfimager_0.0.2.91  tools_4.1.0        
 [9] sys_3.4             pdftools_3.1.1      digest_0.6.29       jsonlite_1.8.0     
[13] lifecycle_1.0.1     tibble_3.1.6        png_0.1-7           pkgconfig_2.0.3    
[17] rlang_1.0.6         DBI_1.1.2           cli_3.5.0           rstudioapi_0.13    
[21] parallel_4.1.0      rJava_1.0-6         stringr_1.4.0       dplyr_1.0.8        
[25] httr_1.4.2          xml2_1.3.3          rappdirs_0.3.3      hms_1.1.1          
[29] generics_0.1.3      vctrs_0.3.8         fs_1.5.2            askpass_1.1        
[33] tidyselect_1.1.2    tau_0.0-24          glue_1.6.2          data.table_1.14.2  
[37] qpdf_1.1            R6_2.5.1            fansi_1.0.2         tabulizer_0.2.2    
[41] tzdb_0.2.0          readr_2.1.2         genderizeR_2.1.1    purrr_0.3.4        
[45] tidyr_1.2.1         magrittr_2.0.3      tabulizerjars_1.0.1 ellipsis_0.3.2     
[49] tesseract_5.1.0     textcat_1.0-7       assertthat_0.2.1    rengine_0.19.0     
[53] renv_0.15.5-59      utf8_1.2.2          stringi_1.7.8       tm_0.7-8           
[57] slam_0.1-50         crayon_1.5.1 

Alternative?

I can't use pdfimager on shinyapps.io (and it was a bit slow, although it worked well!), so I've looked for alternatives, but nothing is really great.

Here's a painful, mostly-R hack for parsing PDFs for images. It only works sometimes, and it would be great if someone could improve it. Posting here in case anyone cares!

library(stringr)
library(hexView)
# RCurl must also be installed; it is called below as RCurl::base64Decode()

.PROBLEM <- function(type, aMessage) {

  newMessage <- paste0("problem",
                       type,
                       " in ",
                       as.list(sys.call(-1))[[1]],
                       "(): ",
                       aMessage,
                       ".")
  if(type == "error") stop(newMessage, call. = FALSE)
  message(newMessage)

}


isPDF <- function(aFileName,
                  verbose = TRUE) {

  fileContents <- suppressWarnings(try(
    readLines(aFileName, n = 5, ok = TRUE, warn = FALSE),
    silent = TRUE))

  # error catch when downloading PDFs online
  if(inherits(fileContents, "try-error")) {
    if(verbose == TRUE) {
      message(paste0("Failed validation, with error: ", fileContents[1]))
    } else {
      if(verbose ==  "quiet") return(FALSE)
      # when verbose = FALSE, secret message intended for PDF_downloader success
      message(" poor internet connection,", appendLF = FALSE)
    }
    return(FALSE)
  }

  return(any(!is.na(str_extract(fileContents, ".*%PDF-1.*"))))
}

#' Attempts to extract all images from a PDF
#'
#' Tries to extract images within a PDF file.  Currently does not support
#' decoding of images in CCITT compression formats. However, will still save
#' these images to file; as a record of the number of images detected in the PDF.
#'
#' @param file The file name and location of a PDF file.  Prompts
#'    for file name if none is explicitly called.
#'
#' @return A vector of file names saved as images.
#'
#' @export PDF_extractImages

PDF_extractImages <- function(file = file.choose()) {

  # check if file is a PDF
  if(isPDF(file, verbose = "quiet") != TRUE) {
    .PROBLEM("error", "file not PDF format")
  }

  # read the file as raw bytes with an ASCII character view
  rawFile <- hexView::readRaw(file, human = "char")

  # check that the number of characters read matches the file size
  if (length(hexView::blockValue(rawFile)) != file.info(file)$size) {
    .PROBLEM("error", "possible size reading error of PDF")
  }

  # extract images embedded as PDF objects
  createdFiles_bin <- scanPDFobjects(rawFile, file)
  #if(quiet != TRUE) message(paste0(createdFiles_bin), " ")

  # extract images embedded in XML
  createdFiles_XML <- scanPDFXML(rawFile, file)
  #if(quiet != TRUE) message(paste0(createdFiles_XML), " ")

  theSavedFileNames <- c(createdFiles_bin, createdFiles_XML)

  #print(round(7/3) + 7 %% 3)
  #if(ignore != TRUE) {
  #
  #  par(mfrow = c(2,3), las = 1)
  #  for(i in 1:6) {
  #    figure_display(theSavedFileNames[i])
  #    mtext(theSavedFileNames[i], col = "red", cex = 1.2)
  #  }
  #}

  return(theSavedFileNames)
}

scanPDFobjects <- function (rawFile, file) {

  # collapse ASCII to a single string
  theStringFile <- paste(hexView::blockValue(rawFile), collapse = '')

  # split string by PDF objects and keep delimiter
  theObjects <- paste(unlist(strsplit(theStringFile, "endobj")), "endobj", sep="")

  # identify and screen candidate objects with images
  candidateObjects <- c(which(str_extract(theObjects, "XObject/Width") == "XObject/Width"),
                        which(str_extract(theObjects, "Image") == "Image"))

  removeObjects <- c(which(str_extract(theObjects, "PDF/Text") == "PDF/Text"),
                     which(str_extract(theObjects, "PDF /Text") == "PDF /Text"))


  candidateObjects <- unique(candidateObjects[! candidateObjects %in% removeObjects])

  if(length(candidateObjects) == 0) {
    return("No PDF image objects detected.")
  }

  # generate file names for candidate images
  fileNames <- paste(rep(tools::file_path_sans_ext(file), length(candidateObjects)),
                     "_bin_", 1:length(candidateObjects), ".jpg", sep="")

  # extract and save all image binaries found in PDF
  theNewFiles <- sapply(1:length(candidateObjects),
                        function(x, y, z) PDFobjectToImageFile(y[x],
                                                               theObjects,
                                                               file,
                                                               z[x]),
                        y = candidateObjects, z = fileNames)

  return(theNewFiles)
}

PDFobjectToImageFile <- function (objectLocation,
                                  theObjects,
                                  theFile,
                                  imageFileName) {

  # parse object by stream & endstream
  parsedImageObject <-  unlist(strsplit(theObjects[objectLocation], "stream"))

  # extract key char locations of image in PDF with trailingChars as a correction
  # for "stream" being followed by 2 return characters
  trailingChars <- "  "
  startImageLocation <- nchar(paste(parsedImageObject[1],
                                    "stream", trailingChars, sep = ""))
  endImageLocation <- startImageLocation +
    nchar(substr(parsedImageObject[2],
                 1,
                 nchar(parsedImageObject[2]) - nchar("end")))
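  # seq_len() yields an empty index when objectLocation is 1, so the offset is 0
  # (1:(objectLocation - 1) would evaluate to c(1, 0) and select the first object)
  PDFLocation <- nchar(paste(theObjects[seq_len(objectLocation - 1)], collapse = ''))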

  # extract binary of image from PDF
  PDFImageBlock <- hexView::readRaw(theFile,
                                    offset = PDFLocation + startImageLocation,
                                    nbytes = endImageLocation, machine = "binary")

  # sometimes the leading 0xff byte of the JPEG marker is missing; prepend it
  # so the data starts with ff d8 (this helps for jpgs at least)
  if((PDFImageBlock$fileRaw[1] == "d8") && (PDFImageBlock$fileRaw[2] == "ff"))
    PDFImageBlock$fileRaw <- c(as.raw(0xff), PDFImageBlock$fileRaw)

  # save binary of image to new file
  detectedImageFile <- file(imageFileName, "wb")
  writeBin(PDFImageBlock$fileRaw, detectedImageFile)
  close(detectedImageFile)

  # TO DO RETURN INFO ABOUT SUCCESSFUL FILE SAVE
  return(imageFileName)
}

scanPDFXML <- function (rawFile, file) {

  # collapse ASCII to a single string
  theStringFile <- paste(hexView::blockValue(rawFile), collapse = '')

  # split by XML tags with images and keep delimiter
  theObjects <- paste(unlist(strsplit(theStringFile, "xmpGImg:image>")),
                      "xmpGImg:image>", sep="")

  # identify objects with images
  candidateObjects <- which(str_extract(theObjects, "</xmpGImg:image>") == "</xmpGImg:image>")

  if(length(candidateObjects) == 0) {
    return("No XML image objects detected.")
  }

  # generate file names for candidate images
  fileNames <- paste(rep(tools::file_path_sans_ext(file), length(candidateObjects)),
                     "_XML_", 1:length(candidateObjects), ".jpg", sep="")

  # extract and save all image binaries found in PDF
  theNewFiles <- sapply(1:length(candidateObjects),
                        function(x, y, z) PDFXMLToImageFile(y[x],
                                                            theObjects,
                                                            file,
                                                            z[x]),
                        y = candidateObjects, z = fileNames)

  return(theNewFiles)
}

PDFXMLToImageFile <- function (objectLocation,
                               theObjects,
                               theFile,
                               imageFileName) {

  # parse encoded XML image and clean
  parsedImage <- unlist(strsplit(theObjects[objectLocation], "</xmpGImg:image>"))
  parsedImage <- gsub("&#xA;", "", parsedImage[1])

  # decode the base64-encoded image to raw bytes
  decodedImage <- RCurl::base64Decode(parsedImage, "raw")

  # save binary of image to new file
  detectedImageFile <- file(imageFileName, "wb")
  writeBin(decodedImage, detectedImageFile)
  close(detectedImageFile)

  # TO DO RETURN INFO ABOUT SUCCESSFUL FILE SAVE
  return(imageFileName)
}

PDF_extractImages()
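
One part that might be easy to harden is the PDF check: isPDF() reads the first lines as text with readLines(), which can be fragile on binary files. A rough sketch of a byte-level alternative follows, assuming it is enough to find the "%PDF-" marker within the first kilobyte. `is_pdf_magic()` and its `scan_bytes` parameter are hypothetical names, and unlike the original it also accepts headers other than "%PDF-1".

```r
# Hypothetical helper: look for the "%PDF-" marker in the first bytes of a file,
# reading in binary mode so embedded NULs or invalid text cannot break the check.
is_pdf_magic <- function(path, scan_bytes = 1024L) {
  if (!file.exists(path)) return(FALSE)

  con <- file(path, "rb")
  on.exit(close(con))
  head_bytes <- readBin(con, what = "raw", n = scan_bytes)

  marker <- charToRaw("%PDF-")
  last_start <- length(head_bytes) - length(marker) + 1L
  if (last_start < 1L) return(FALSE)

  # slide over the header bytes and compare each window to the marker
  for (i in seq_len(last_start)) {
    window <- head_bytes[i:(i + length(marker) - 1L)]
    if (identical(window, marker)) return(TRUE)
  }
  FALSE
}

# Usage with two throwaway files
pdf_like <- tempfile(fileext = ".pdf")
writeBin(c(charToRaw("%PDF-1.4\n"), as.raw(c(0x00, 0xff))), pdf_like)
not_pdf <- tempfile(fileext = ".txt")
writeBin(charToRaw("hello world"), not_pdf)

is_pdf_magic(pdf_like)  # TRUE
is_pdf_magic(not_pdf)   # FALSE
```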
