I'm a software engineer at Fred Hutch in the Fred Hutch Data Science Lab
- website
- blog
- mastodon - @[email protected]
- bluesky - sckott
- Fun fact: ~90% of bee species are solitary (not social)
Extract images from pdfs using poppler <https://poppler.freedesktop.org/>
Home Page: https://sckott.github.io/pdfimager/
License: Other
Following the vignette, I get an error:
library("pdfimager")
x <- system.file("examples/BachmanEtal2020.pdf", package="pdfimager")
pdimg_meta(x)
Results in:
"Error: pdfimages version 3.04"
I'm on a Mac (Darwin 20.5.0).
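A quick way to see which `pdfimages` binary R is picking up is sketched below. Version 3.04 looks like an Xpdf release number rather than a Poppler one, so one possibility (an assumption, not something confirmed in this thread) is that the `pdfimages` found on the PATH comes from Xpdf. The helper name `pdfimages_info` is hypothetical and is not part of pdfimager.

```r
# Hypothetical helper (not part of pdfimager): report which `pdfimages`
# executable is found on the PATH and what it prints for `-v`.
pdfimages_info <- function(cmd = "pdfimages") {
  path <- unname(Sys.which(cmd))
  if (is.na(path) || !nzchar(path)) {
    # Nothing with that name is on the PATH
    return(list(path = "", version_output = character(0)))
  }
  # The version banner may go to stderr, so capture both streams.
  out <- tryCatch(
    suppressWarnings(system2(path, "-v", stdout = TRUE, stderr = TRUE)),
    error = function(e) character(0)
  )
  list(path = path, version_output = as.character(out))
}

info <- pdfimages_info()
print(info)
```

If the reported path or banner points at Xpdf, installing Poppler and making sure its `pdfimages` comes first on the PATH would be the next thing to try.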
Hey @sckott, the current pdimg_image implementation passes the pattern '.pdf' to gsub
(lines 72 to 78 in 19004ec),
where it is treated as a regular expression. The dot matches any character, so every occurrence of any character followed by 'pdf' is replaced with '', not just the extension.
Would you consider adding the fixed = TRUE parameter to gsub,
or at least rewriting the pattern to .pdf$
so it's clear we only want to substitute the extension of the file?
With the current implementation, some file names get more parts substituted than intended:
> gsub('.pdf', '', 'file_name_pdf_1.pdf')
[1] "file_name_1"
> gsub('.pdf', '', 'file_namepdf_1.pdf')
[1] "file_nam_1"
> gsub('.pdf', '', 'epdf_1.pdf')
[1] "_1"
compared to
> gsub('.pdf$', '', 'epdf_1.pdf')
[1] "epdf_1"
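For comparison, here is a minimal sketch of stricter ways to drop only a trailing ".pdf". It is not the package's code: `strip_pdf_ext` is a hypothetical helper name. It also shows two caveats: an unescaped `.pdf$` still lets the dot match any character, and `fixed = TRUE` on its own still removes a literal ".pdf" anywhere in the name.

```r
# Hypothetical helper (not from pdfimager): remove only a trailing ".pdf".
strip_pdf_ext <- function(path) {
  # "\\." matches a literal dot; "$" anchors the match at the end of the string
  sub("\\.pdf$", "", path, ignore.case = TRUE)
}

files <- c("file_name_pdf_1.pdf", "file_namepdf_1.pdf", "epdf_1.pdf")

strip_pdf_ext(files)
# [1] "file_name_pdf_1" "file_namepdf_1"  "epdf_1"

# Base R also ships a helper that strips the last extension, whatever it is
tools::file_path_sans_ext(files)
# [1] "file_name_pdf_1" "file_namepdf_1"  "epdf_1"

# Caveat 1: without escaping, the dot in '.pdf$' still matches any character
sub(".pdf$", "", "report_pdf")
# [1] "report"

# Caveat 2: fixed = TRUE alone still replaces ".pdf" in the middle of a name
gsub(".pdf", "", "my.pdf_notes.pdf", fixed = TRUE)
# [1] "my_notes"
```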
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=English_Europe.1252 LC_CTYPE=English_Europe.1252 LC_MONETARY=English_Europe.1252
[4] LC_NUMERIC=C LC_TIME=English_Europe.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] cvparser_0.6.2
loaded via a namespace (and not attached):
[1] NLP_0.2-1 Rcpp_1.0.9 dbplyr_2.1.1 pillar_1.7.0
[5] compiler_4.1.0 hocr_0.0.0.9002 pdfimager_0.0.2.91 tools_4.1.0
[9] sys_3.4 pdftools_3.1.1 digest_0.6.29 jsonlite_1.8.0
[13] lifecycle_1.0.1 tibble_3.1.6 png_0.1-7 pkgconfig_2.0.3
[17] rlang_1.0.6 DBI_1.1.2 cli_3.5.0 rstudioapi_0.13
[21] parallel_4.1.0 rJava_1.0-6 stringr_1.4.0 dplyr_1.0.8
[25] httr_1.4.2 xml2_1.3.3 rappdirs_0.3.3 hms_1.1.1
[29] generics_0.1.3 vctrs_0.3.8 fs_1.5.2 askpass_1.1
[33] tidyselect_1.1.2 tau_0.0-24 glue_1.6.2 data.table_1.14.2
[37] qpdf_1.1 R6_2.5.1 fansi_1.0.2 tabulizer_0.2.2
[41] tzdb_0.2.0 readr_2.1.2 genderizeR_2.1.1 purrr_0.3.4
[45] tidyr_1.2.1 magrittr_2.0.3 tabulizerjars_1.0.1 ellipsis_0.3.2
[49] tesseract_5.1.0 textcat_1.0-7 assertthat_0.2.1 rengine_0.19.0
[53] renv_0.15.5-59 utf8_1.2.2 stringi_1.7.8 tm_0.7-8
[57] slam_0.1-50 crayon_1.5.1
I can't use pdfimager on shinyapps.io (and it was a bit slow, although it worked well!), so I've looked for alternatives, but nothing is really great.
Here's a mostly-R painful hack at parsing PDFs for images. It only works sometimes, but it would be great if someone could improve this. Posting here in case anyone cares!
library(stringr)
library(hexView)
# build a "problem <type> in <caller>(): <message>." string; stop on errors
.PROBLEM <- function(type, aMessage) {
  newMessage <- paste0("problem ",
                       type,
                       " in ",
                       as.list(sys.call(-1))[[1]],
                       "(): ",
                       aMessage,
                       ".")
  if(type == "error") stop(newMessage, call. = FALSE)
  message(newMessage)
}
isPDF <- function(aFileName,
                  verbose = TRUE) {
  fileContents <- suppressWarnings(try(
    readLines(aFileName, n = 5, ok = TRUE, warn = FALSE),
    silent = TRUE))
  # error catch when downloading PDFs online
  if(inherits(fileContents, "try-error")) {
    if(verbose == TRUE) {
      message(paste0("Failed validation, with error: ", fileContents[1]))
    } else {
      if(verbose == "quiet") return(FALSE)
      # when verbose = FALSE, secret message intended for PDF_downloader success
      message(" poor internet connection,", appendLF = FALSE)
    }
    return(FALSE)
  }
  return(any(!is.na(str_extract(fileContents, ".*%PDF-1.*"))))
}
#' Attempts to extract all images from a PDF
#'
#' Tries to extract images within a PDF file. Currently does not support
#' decoding of images in CCITT compression formats. However, will still save
#' these images to file; as a record of the number of images detected in the PDF.
#'
#' @param file The file name and location of a PDF file. Prompts
#' for file name if none is explicitly called.
#'
#' @return A vector of file names saved as images.
#'
#' @export PDF_extractImages
PDF_extractImages <- function(file = file.choose()) {
  # check if file is a PDF
  if(isPDF(file, verbose = "quiet") != TRUE) {
    .PROBLEM("error",
             "file not PDF format")
  }
  # read file in HEX also ASCII
  rawFile <- hexView::readRaw(file, human = "char")
  # test if read file characters is same as file size
  if (length(hexView::blockValue(rawFile)) != file.info(file)$size) {
    .PROBLEM("error",
             "possible size reading error of PDF")
  }
  # extract images embedded as PDF objects
  createdFiles_bin <- scanPDFobjects(rawFile, file)
  #if(quiet != TRUE) message(paste0(createdFiles_bin), " ")
  # extract images embedded in XML
  createdFiles_XML <- scanPDFXML(rawFile, file)
  #if(quiet != TRUE) message(paste0(createdFiles_XML), " ")
  theSavedFileNames <- c(createdFiles_bin, createdFiles_XML)
  #print(round(7/3) + 7 %% 3)
  #if(ignore != TRUE) {
  #
  #  par(mfrow = c(2,3), las = 1)
  #  for(i in 1:6) {
  #    figure_display(theSavedFileNames[i])
  #    mtext(theSavedFileNames[i], col = "red", cex = 1.2)
  #  }
  #}
  return(theSavedFileNames)
}
scanPDFobjects <- function (rawFile, file) {
  # collapse ASCII to a single string
  theStringFile <- paste(hexView::blockValue(rawFile), collapse = '')
  # split string by PDF objects and keep delimiter
  theObjects <- paste(unlist(strsplit(theStringFile, "endobj")), "endobj", sep="")
  # identify and screen candidate objects with images
  candidateObjects <- c(which(str_extract(theObjects, "XObject/Width") == "XObject/Width"),
                        which(str_extract(theObjects, "Image") == "Image"))
  removeObjects <- c(which(str_extract(theObjects, "PDF/Text") == "PDF/Text"),
                     which(str_extract(theObjects, "PDF /Text") == "PDF /Text"))
  candidateObjects <- unique(candidateObjects[! candidateObjects %in% removeObjects])
  if(length(candidateObjects) == 0) {
    return("No PDF image objects detected.")
  }
  # generate file names for candidate images
  fileNames <- paste(rep(tools::file_path_sans_ext(file), length(candidateObjects)),
                     "_bin_", 1:length(candidateObjects), ".jpg", sep="")
  # extract and save all image binaries found in PDF
  theNewFiles <- sapply(1:length(candidateObjects),
                        function(x, y, z) PDFobjectToImageFile(y[x],
                                                               theObjects,
                                                               file,
                                                               z[x]),
                        y = candidateObjects, z = fileNames)
  return(theNewFiles)
}
PDFobjectToImageFile <- function (objectLocation,
                                  theObjects,
                                  theFile,
                                  imageFileName) {
  # parse object by stream & endstream
  parsedImageObject <- unlist(strsplit(theObjects[objectLocation], "stream"))
  # extract key char locations of image in PDF with trailingChars as a correction
  # for "stream" being followed by 2 return characters (two placeholder
  # characters, assumed from this comment)
  trailingChars <- "  "
  startImageLocation <- nchar(paste(parsedImageObject[1],
                                    "stream", trailingChars, sep = ""))
  endImageLocation <- startImageLocation +
    nchar(substr(parsedImageObject[2],
                 1,
                 nchar(parsedImageObject[2]) - nchar("end")))
  # seq_len() avoids the 1:0 trap when the image is the first object
  PDFLocation <- nchar(paste(theObjects[seq_len(objectLocation - 1)], collapse = ''))
  # extract binary of image from PDF
  PDFImageBlock <- hexView::readRaw(theFile,
                                    offset = PDFLocation + startImageLocation,
                                    nbytes = endImageLocation, machine = "binary")
  # sometimes the leading byte of the original file signature is missing;
  # this restores it for jpgs at least (a JPEG starts with ff d8 ff)
  if((PDFImageBlock$fileRaw[1] == "d8") && (PDFImageBlock$fileRaw[2] == "ff"))
    PDFImageBlock$fileRaw <- c(as.raw(0xff), PDFImageBlock$fileRaw)
  # save binary of image to new file
  detectedImageFile <- file(imageFileName, "wb")
  writeBin(PDFImageBlock$fileRaw, detectedImageFile)
  close(detectedImageFile)
  # TO DO RETURN INFO ABOUT SUCCESSFUL FILE SAVE
  return(imageFileName)
}
scanPDFXML <- function (rawFile, file) {
  # collapse ASCII to a single string
  theStringFile <- paste(hexView::blockValue(rawFile), collapse = '')
  # split by XML tags with images and keep delimiter
  theObjects <- paste(unlist(strsplit(theStringFile, "xmpGImg:image>")),
                      "xmpGImg:image>", sep="")
  # identify objects with images
  candidateObjects <- which(str_extract(theObjects, "</xmpGImg:image>") == "</xmpGImg:image>")
  if(length(candidateObjects) == 0) {
    return("No XML image objects detected.")
  }
  # generate file names for candidate images
  fileNames <- paste(rep(tools::file_path_sans_ext(file), length(candidateObjects)),
                     "_XML_", 1:length(candidateObjects), ".jpg", sep="")
  # extract and save all image binaries found in PDF
  theNewFiles <- sapply(1:length(candidateObjects),
                        function(x, y, z) PDFXMLToImageFile(y[x],
                                                            theObjects,
                                                            file,
                                                            z[x]),
                        y = candidateObjects, z = fileNames)
  return(theNewFiles)
}
PDFXMLToImageFile <- function (objectLocation,
                               theObjects,
                               theFile,
                               imageFileName) {
  # parse encoded XML image and clean
  parsedImage <- unlist(strsplit(theObjects[objectLocation], "</xmpGImg:image>"))
  # strip line breaks from the base64 text: literal newlines and the XML
  # character reference &#xA; (assumed to be how the XMP data escapes them)
  parsedImage <- gsub("&#xA;|\n", "", parsedImage[1])
  # decode the base64 text to raw bytes
  decodedImage <- RCurl::base64Decode(parsedImage, "raw")
  # save binary of image to new file
  detectedImageFile <- file(imageFileName, "wb")
  writeBin(decodedImage, detectedImageFile)
  close(detectedImageFile)
  # TO DO RETURN INFO ABOUT SUCCESSFUL FILE SAVE
  return(imageFileName)
}
PDF_extractImages()
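One small piece that could be tightened is the PDF check. `isPDF()` above reads the first lines as text and only accepts headers starting with `%PDF-1`, so a file whose header reads `%PDF-2.0` would be rejected. Below is a minimal sketch of a byte-level alternative; `has_pdf_header` is a hypothetical name, and scanning the first 1024 bytes is an assumption about how lenient the check should be, not a requirement taken from this thread.

```r
# Hypothetical alternative to isPDF(): look for the "%PDF-" signature in the
# first bytes of the file instead of parsing lines of text.
has_pdf_header <- function(path, n = 1024L) {
  if (!file.exists(path)) return(FALSE)
  bytes <- readBin(path, what = "raw", n = n)
  magic <- charToRaw("%PDF-")
  if (length(bytes) < length(magic)) return(FALSE)
  # grepRaw() returns integer(0) when the byte pattern is not found
  length(grepRaw(magic, bytes, fixed = TRUE)) > 0
}

# Usage on a throwaway file that only contains a PDF-style header
tmp <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-2.0\n"), tmp)
has_pdf_header(tmp)
```

Reading raw bytes also avoids the warnings that `readLines()` can raise on binary content, which the original suppresses.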