ropensci / tabulizer

Bindings for Tabula PDF Table Extractor Library

Home Page: https://docs.ropensci.org/tabulizer

License: Other

tabula tabular-data pdf java pdf-document r r-package ropensci rstats peer-reviewed

tabulizer's Introduction

tabulapdf: Extract tables from PDF documents

tabulapdf provides R bindings to the Tabula Java library, which can be used to computationally extract tables from PDF documents.

Note: tabulapdf is released under the MIT license, as is Tabula itself.

Installation

tabulapdf depends on rJava, which implies a system requirement for Java. This can be frustrating, especially on Windows. The preferred Windows workflow is to use Chocolatey to obtain, configure, and update Java. You need to do this before installing rJava or attempting to use tabulapdf. More on this, and on troubleshooting, below.

tabulapdf is not available on CRAN, but it can be installed from rOpenSci’s R-Universe:

install.packages("tabulapdf", repos = c("https://ropensci.r-universe.dev", "https://cloud.r-project.org"))

To install the latest development version:

if (!require(remotes)) install.packages("remotes")

# on 64-bit Windows
remotes::install_github(c("ropensci/tabulapdf"), INSTALL_opts = "--no-multiarch")

# elsewhere
remotes::install_github(c("ropensci/tabulapdf"))

Code Examples

The main function, extract_tables(), provides an R clone of the Tabula command line application:

library("tabulapdf")
f <- system.file("examples", "data.pdf", package = "tabulapdf")
out1 <- extract_tables(f)
str(out1)
## List of 4
##  $ : chr [1:32, 1:10] "mpg" "21.0" "21.0" "22.8" ...
##  $ : chr [1:7, 1:5] "Sepal.Length " "5.1 " "4.9 " "4.7 " ...
##  $ : chr [1:7, 1:6] "" "145 " "146 " "147 " ...
##  $ : chr [1:15, 1] "supp" "VC" "VC" "VC" ...

By default, it returns the most table-like R structure available: a matrix. It can also write the tables to disk or attempt to coerce them to data.frames using the output argument. It is also possible to select tables from only specified pages using the pages argument.

out2 <- extract_tables(f, pages = 1, guess = FALSE, output = "data.frame")
str(out2)
## List of 1
##  $ :'data.frame':       33 obs. of  13 variables:
##   ..$ X   : chr [1:33] "Mazda RX4 " "Mazda RX4 Wag " "Datsun 710 " "Hornet 4 Drive " ...
##   ..$ mpg : num [1:33] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##   ..$ cyl : num [1:33] 6 6 4 6 8 6 8 4 4 6 ...
##   ..$ X.1 : int [1:33] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ disp: num [1:33] 160 160 108 258 360 ...
##   ..$ hp  : num [1:33] 110 110 93 110 175 105 245 62 95 123 ...
##   ..$ drat: num [1:33] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##   ..$ wt  : num [1:33] 2.62 2.88 2.32 3.21 3.44 ...
##   ..$ qsec: num [1:33] 16.5 17 18.6 19.4 17 ...
##   ..$ vs  : num [1:33] 0 0 1 1 0 1 0 1 1 1 ...
##   ..$ am  : num [1:33] 1 1 1 0 0 0 0 0 0 0 ...
##   ..$ gear: num [1:33] 4 4 4 3 3 3 3 4 4 4 ...
##   ..$ carb: int [1:33] 4 4 1 1 2 1 4 2 2 4 ...

It is also possible to manually specify smaller areas within pages to look for tables using the area and columns arguments to extract_tables(). This facilitates extraction from smaller portions of a page, such as when a table is embedded in a larger section of text or graphics.
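As a minimal sketch of the area argument, the call below restricts extraction to one rectangle on page 2 of the bundled example file. The coordinates are illustrative assumptions, not measured values; each area is given as c(top, left, bottom, right) in points, and guess = FALSE turns off autodetection so only the given rectangle is searched.

```r
# Hypothetical bounding box: c(top, left, bottom, right), in points.
region <- list(c(126, 149, 212, 462))

if (requireNamespace("tabulapdf", quietly = TRUE)) {
  library("tabulapdf")
  f <- system.file("examples", "data.pdf", package = "tabulapdf")

  # One area per requested page; autodetection is disabled with guess = FALSE.
  out_area <- extract_tables(f, pages = 2, area = region, guess = FALSE)
  str(out_area)
}
```

get_page_dims() can help pick sensible coordinates, since it reports each page's width and height in the same unit.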

Another function, extract_areas(), implements this through an interactive style in which each page of the PDF is loaded as an R graphic and the user can use their mouse to specify upper-left and lower-right bounds of an area. Those areas are then extracted auto-magically (and the return value is the same as for extract_tables()).

locate_areas() handles the area identification process without performing the extraction, which may be useful as a debugger.

extract_text() simply returns text, possibly separately for each (specified) page:

out3 <- extract_text(f, pages = 3)
cat(out3, sep = "\n")
## len supp dose
## 4.2 VC 0.5
## 11.5 VC 0.5
## 7.3 VC 0.5
## 5.8 VC 0.5
## 6.4 VC 0.5
## 10.0 VC 0.5
## 11.2 VC 0.5
## 11.2 VC 0.5
## 5.2 VC 0.5
## 7.0 VC 0.5
## 16.5 VC 1.0
## 16.5 VC 1.0
## 15.2 VC 1.0
## 17.3 VC 1.0
## 22.5 VC 1.0
## 3

Note that for large PDF files, it is possible to run up against Java memory constraints, leading to a java.lang.OutOfMemoryError: Java heap space error message. Memory can be increased using options(java.parameters = "-Xmx16000m") set to some reasonable amount of memory.
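A short sketch of where that option has to go: the JVM reads its heap size only once, when rJava starts it, so the option must be set in a fresh R session before tabulapdf (and therefore rJava) is loaded. The 4 GB value is an arbitrary example, not a recommendation.

```r
# Set the JVM heap limit BEFORE loading rJava/tabulapdf.
# "-Xmx4g" (4 GB) is an example value; pick one that fits your machine.
options(java.parameters = "-Xmx4g")

if (requireNamespace("tabulapdf", quietly = TRUE)) {
  library("tabulapdf")  # the JVM starts here and picks up the option above
  f <- system.file("examples", "data.pdf", package = "tabulapdf")
  tables <- extract_tables(f)
}
```

If the package was already loaded in the session, changing the option has no effect until R is restarted.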

Some other utility functions are also provided (and made possible by the Java Apache PDFBox library):

extract_text() converts the text of an entire file or specified pages into an R character vector.
split_pdf() and merge_pdfs() split and merge PDF documents, respectively.
extract_metadata() extracts PDF metadata as a list.
get_n_pages() determines the number of pages in a document.
get_page_dims() determines the width and height of each page in pt (the unit used by area and columns arguments).
make_thumbnails() converts specified pages of a PDF file to image files.
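The utilities listed above can be combined as in the following sketch, which uses the bundled example file. Arguments are passed positionally, and the output location for the merged file is a temporary path chosen for illustration.

```r
info <- NULL

if (requireNamespace("tabulapdf", quietly = TRUE)) {
  library("tabulapdf")
  f <- system.file("examples", "data.pdf", package = "tabulapdf")

  info <- list(
    n_pages = get_n_pages(f),      # number of pages in the document
    dims    = get_page_dims(f),    # width and height of each page, in pt
    meta    = extract_metadata(f)  # PDF metadata as a list
  )
  str(info)

  # Split into one file per page, then merge the pieces back together.
  parts  <- split_pdf(f)
  merged <- merge_pdfs(parts, tempfile(fileext = ".pdf"))
}
```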

Installing Java on Windows with Chocolatey

In command prompt, install Chocolatey if you don’t already have it:

@powershell -NoProfile -ExecutionPolicy Bypass -Command "iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))" && SET PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin

Then, install Java using the following command:

choco install openjdk11

You may also need to then set the JAVA_HOME environment variable to the path to your Java installation (e.g., C:\Program Files\Java\jdk-11\bin). This can be done:

within R using Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk-11/bin") (note slashes), or
from command prompt using the setx command (quote the path, since it contains spaces): setx JAVA_HOME "C:\Program Files\Java\jdk-11\bin", or
from PowerShell, using the .NET framework: [Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Program Files\Java\jdk-11\bin", "User"), or
from the Start Menu, via Control Panel » System » Advanced » Environment Variables.

You should now be able to safely open R, and use rJava and tabulapdf. Note, however, that some users report that rather than setting this variable, they instead need to delete it (e.g., with Sys.setenv(JAVA_HOME = "")), so if the above instructions fail, that is the next step in troubleshooting.

From PowerShell, you should see something like this after running java -version:

OpenJDK Runtime Environment (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1)
OpenJDK 64-Bit Server VM (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1, mixed mode, sharing)
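The same checks can be made from inside R before installing rJava. This is a small diagnostic sketch using only base R; it reports what R sees and changes nothing.

```r
# What does R see for JAVA_HOME?
java_home <- Sys.getenv("JAVA_HOME", unset = NA)
if (is.na(java_home) || !nzchar(java_home)) {
  message("JAVA_HOME is not set")
} else {
  message("JAVA_HOME = ", java_home,
          " (directory exists: ", dir.exists(java_home), ")")
}

# Is a java executable on the PATH, and which version is it?
java_bin <- Sys.which("java")
if (nzchar(java_bin)) {
  system2(java_bin, "-version")
} else {
  message("No java executable found on PATH")
}
```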

Troubleshooting

Some notes for troubleshooting common installation problems:

On Mac OS and Linux, we tested with OpenJDK version 11. The package is configured to ask for that version of Java. If you have a different version of Java installed, you may need to change the JAVA_HOME environment variable to point to the correct version. You need to ensure that R has been installed with Java support. This can often be fixed by running R CMD javareconf on the command line (possibly with sudo, etc. depending on your system setup).
On Windows, make sure you have permission to write to and install packages to your R directory before trying to install the package. This can be changed from “Properties” on the right-click context menu. Alternatively, you can ensure write permission by choosing “Run as administrator” when launching R (again, from the right-click context menu).

tabulizer's Issues

Consider adding tabularizerjars to remotes until it's on CRAN?

Multiple table in 1 page

Migrated from ropensci/tabulizerjars#1 (@khun84)

Is there a parameter that I can pass in to extract more than 1 table per page?

I have a pdf page with 2 tables:

table 1 has 2 columns and multiple rows
table 2 has 2 columns and multiple rows (but some of the cells are merged).

I use the extract_tables() function with default parameters and the output only has 1 table (table 1).

What I can think of is to set method = 'asis', but I do not know how to proceed with the output Java object. Is there any documentation I can refer to?
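One approach that may help here, sketched under the assumption that extract_tables() accepts a repeated page number with one area per repetition: list the same page twice and give each table its own rectangle. The coordinates are placeholders, and the package name follows the README above (tabulapdf, formerly tabulizer).

```r
# Two hypothetical rectangles on the same page: c(top, left, bottom, right).
areas <- list(c(126, 149, 212, 462), c(126, 284, 174, 417))
pages <- c(2, 2)  # page 2 listed once per area

# The two vectors must line up one-to-one.
stopifnot(length(areas) == length(pages))

if (requireNamespace("tabulapdf", quietly = TRUE)) {
  f <- system.file("examples", "data.pdf", package = "tabulapdf")
  both <- tabulapdf::extract_tables(f, pages = pages, area = areas,
                                    guess = FALSE)
  str(both)
}
```

locate_areas() can be used first to find the real coordinates of each table.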

Handle .jar files for CRAN

The .jar files are currently 8MB and apparently too big for CRAN. Standard practice is to dump these to a separate, rarely updated package, which we can do, but Tabula is a relatively young library so that may not work quite yet.

Installation fails

Hi,

I followed your instructions to install tabulizer (Windows x64) but the installation always fails:

also installing the dependency ‘png’

leeper/tabulizerjars     leeper/tabulizer 
             "0.1.2"                   NA 
Warning messages:
1: running command '"C:/PROGRA~1/R/R-33~1.0/bin/x64/R" CMD INSTALL --no-multiarch -l "C:\Users\kasus\Documents\R\win-library\3.3" C:\Users\kasus\AppData\Local\Temp\RtmpqIMHaQ/downloaded_packages/png_0.1-7.tar.gz' had status 1 
2: In utils::install.packages(to_install, type = "source", contriburl = contrib,  :
  installation of package ‘png’ had non-zero exit status
3: running command '"C:/PROGRA~1/R/R-33~1.0/bin/x64/R" CMD INSTALL --no-multiarch -l "C:\Users\kasus\Documents\R\win-library\3.3" C:\Users\kasus\AppData\Local\Temp\RtmpqIMHaQ/ghitdrat/src/contrib/tabulizer_0.1.21.tar.gz' had status 1 
4: In utils::install.packages(to_install, type = "source", contriburl = contrib,  :
  installation of package ‘tabulizer’ had non-zero exit status

I tried several things like different paths, different java versions, etc., but all without success. Can you help me out?

blank images in locate_areas and make_thumbnails

Hi,

I'm having some issues with PDF files that I created from scans (TIFF -> adobe acrobat OCR -> PDF). extract_tables works fine on most of them, but occasionally misses the last row of a table. I therefore tried to get better results with locate_areas. However, the shiny app simply shows an empty image. The same happens when I use make_thumbnails on these files; a blank PNG is created.

Both functions work fine with the demo file that ships with the package, so I suspect that it has something to do with the PDF I created. Do I need to enable particular PDF features when I collate the scanned TIFs?

For completeness' sake, below is my sessionInfo. I also tried to attach a sample PDF, but that currently fails (I suspect because of the ongoing S3 issues). I'll try again later.

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] shiny_1.0.0      tabulizer_0.1.23

loaded via a namespace (and not attached):
 [1] tabulizerjars_0.1.2 R6_2.2.0            htmltools_0.3.5     tools_3.3.2         Rcpp_0.12.9         jsonlite_1.2        digest_0.6.12      
 [8] xtable_1.8-2        httpuv_1.3.3        miniUI_0.1.1        mime_0.5            rJava_0.9-8         png_0.1-7

locate areas / extract tables renders unexpected results

I have a small issue with the way locate_areas and extract_tables interact

I use something like:

areas_to_extract<-locate_areas(PDFfile)
extract_tables(PDFfile, area=areas_to_extract)

areas_to_extract is a list of length pages, with each position representing a page.
Positions representing pages that I have specified areas for contain coordinates,
while the pages that I have not indicated an area for, are left empty.

When passing the generated list to extract_tables, empty positions invoke the autodetection algorithm to try and find tables. This seems rather illogical to me, as I had previously reviewed these pages manually as to assure that these pages in fact do not contain tables.

A possible solution may be for extract_tables to skip a page when no area is indicated for it, so that the autodetection is not triggered. I think it would improve efficiency and consistency, and should be fairly easy to implement.
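Until something like that exists, a workaround can be sketched in user code: drop the empty entries and pass only the pages that have an area. The list below is a stand-in for the return value of locate_areas(), assuming skipped pages come back as NULL or zero-length entries.

```r
# Stand-in for locate_areas() output: page 2 had no area drawn.
areas_to_extract <- list(c(100, 50, 300, 500), NULL, c(80, 40, 200, 400))

# Keep only pages for which an area was actually specified.
has_area        <- lengths(areas_to_extract) > 0
pages_with_area <- which(has_area)
areas_only      <- areas_to_extract[has_area]

print(pages_with_area)

# The filtered vectors would then be passed on, for example:
# extract_tables(PDFfile, pages = pages_with_area, area = areas_only,
#                guess = FALSE)
```

With guess = FALSE and an explicit pages vector, the pages without an area are never visited, so autodetection does not run on them.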

extract_tables error message

I have four pdfs that, from what I can tell, were created by the same source. extract_tables works perfectly for three of them. For the fourth I get the following error message:

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.util.NoSuchElementException

extract_areas and extract_text appear to work fine with the pdf in question.

Do any watchers of this repo understand this error message? My googling has not been successful. Is there a specific pdf attribute that I should make sure exists in order for extract_tables to work successfully?

I apologize for not having a reproducible example but the pdfs I'm working with are confidential.

extract_text without trimming

I am not sure this is an issue per se but I think it would be very useful to preserve the spacing of the text without trimming. For example, if something appeared on screen as

" [whitespace....................]hello world gfdaggfdagfda [whitespace....................]"

right now I believe Tabulizer would yield

" hello world gfdaggfdagfda"

Another example would be

" hello world [whitespace....................] gfdaggfdagfda [whitespace....................]"

tabulizer might yield

" hello world gfdaggfdagfda "

Perhaps there is a way to do this now, but I missed it. Even trying something like extract_tables(guess=FALSE,columns...) won't do the trick because of the aforementioned trimming issue. The only thing I can think of doing is literally creating coordinate by coordinate columns. Like,

extract_tables(file=f,guess=FALSE,pages=1,columns=list(seq(1,900,by=1)))

Perhaps that is the recommended move? But it seems less than ideal, as it is incredibly computationally expensive for what it's doing.

Loading into R unsuccessful

I followed the instructions and installed Java 6. R keeps throwing the same warning message:
-> ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"))

ropenscilabs/tabulizerjars     ropenscilabs/tabulizer 
                        NA                         NA 
Warning messages:
1: In utils::install.packages(to_install, type = type, repos = repos,  :
  installation of package ‘tabulizerjars’ had non-zero exit status
2: In utils::install.packages(to_install, type = type, repos = repos,  :
  installation of package ‘tabulizer’ had non-zero exit status

Has anyone had the same issue? Any help would be appreciated.
I am using Unix, and R version is 3.3.2.

Handling issues related to upgrade to PDFBox 2.0

The tabula-java library is moving to PDFBox 2.0, which will have consequences not only for the tabula API but also for some of the utility functions that tabulizer implements by calling PDFBox classes directly. This is flagged as an issue at tabulizerjars and will likely have numerous consequences for tabulizer. Any help identifying and correcting these issues will be appreciated.

"Tabulizer not available for R 3.3.2"

Attempted to run through the tutorial given on DataSciencePlus.com and went to install the Tabulizer package and the R console throws up the following warning:

> install.packages("tabulizer")
Warning in install.packages : package ‘tabulizer’ is not available (for R version 3.3.2)

I did not see any open issues or discussion on the development page regarding this and wanted to bring it to the attention of the maintainers.

Please let me know if you all have any further questions
Javier - javier.ignacio.alonso (at) gmail dot com

How to install from local directory?

Hi I have downloaded the zip file to C:/Users/Public.
My operating system is windows7 64 bit. My version of R is 3.3.2
I have tried ...

> install.packages('C:/Users/Public/tabulizer-master.zip', repos = NULL, type="binary", INSTALL_opts = "--no-multiarch")
> library(`tabulizer-master`)
Error in library(`tabulizer-master`) : 
  there is no package called ‘tabulizer-master’
> library(`tabulizer`)
Error in library(tabulizer) : there is no package called ‘tabulizer’

I have also tried

 ghit::install_github(c("leeper/tabulizerjars", "leeper/tabulizer"), INSTALL_opts = "--no-multiarch", verbose = TRUE) 
Parsing reponame for 'leeper/tabulizerjars'...
Creating local git repository for tabulizerjars in C:\Users\MARSHL~1\AppData\Local\Temp\RtmpqkiGbZ\tabulizerjarsa0c2a8e4674...
Checking out package tabulizerjars to local git repository...
Error in git2r::fetch(gitrepo, name = "github", credentials = credentials) : 
  Error in 'git2r_remote_fetch': failed to send request: A connection with the server could not be established

> devtools::install_github("leeper/tabulizer")
Error in curl::curl_fetch_disk(url, x$path, handle = handle) : 
  Couldn't connect to server

Can you please give some instructions for installing tabulizer from a local zip file? Thank you.
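A possible route, sketched under the assumption that the remotes package is already installed (it cannot be fetched without a connection): a zip downloaded from GitHub is a source tree rather than a Windows binary, so type = "binary" will not work, whereas remotes::install_local() can build and install from a compressed source file. The path is the one from the report above.

```r
# Path to the source zip downloaded from GitHub (taken from the report above).
zip_path <- "C:/Users/Public/tabulizer-master.zip"

if (file.exists(zip_path) && requireNamespace("remotes", quietly = TRUE)) {
  # Builds and installs the package from the local source archive.
  remotes::install_local(zip_path, INSTALL_opts = "--no-multiarch")
}

# The installed package is then loaded by its package name, not the zip name:
# library("tabulizer")
```

The tabulizerjars dependency would need to be downloaded and installed the same way first, and a working Java plus rJava setup is still required.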

Add column and area tests

This is documented but not tested. It can be done using the same test PDF file to extract a subset of columns.

java.lang.IllegalArgumentException: Comparison method violates its general contract!

I had no issues installing the package and running the code example in the readme, so I know that my installation was successful.

The following attempt blew up:

location <- "http://usda.mannlib.cornell.edu/usda/nass/CropProd//2000s/2002/CropProd-11-12-2002.pdf"
  
out <- extract_tables(location)

The error was :
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.lang.IllegalArgumentException: Comparison method violates its general contract!

here is my sessionInfo():

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tabulizer_0.1.22

loaded via a namespace (and not attached):
[1] tabulizerjars_0.1.2 tools_3.3.2         rJava_0.9-8         png_0.1-7

Understanding the error messages

Can you help me understand the following warnings and steps that I need to take to avoid them:

extracted_data_all <- list_of_files %>% lapply(extract_tables, guess = TRUE, method = 'character')
# Fontconfig warning: "/etc/fonts/infinality/conf.d/41-repl-os-win.conf", line 148: Having multiple values in <test> isn't supported and may not work as expected
# Fontconfig warning: "/etc/fonts/infinality/conf.d/41-repl-os-win.conf", line 160: Having multiple values in <test> isn't supported and may not work as expected
# May 31, 2016 3:43:34 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:34 PM org.apache.fontbox.ttf.TrueTypeFont initializeTable
# SEVERE: An error occured when reading table name
# java.io.EOFException
#   at java.io.RandomAccessFile.readUnsignedShort(RandomAccessFile.java:769)
#   at org.apache.fontbox.ttf.RAFDataStream.readUnsignedShort(RAFDataStream.java:118)
#   at org.apache.fontbox.ttf.NamingTable.initData(NamingTable.java:53)
#   at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
#   at org.apache.fontbox.ttf.TrueTypeFont.getNaming(TrueTypeFont.java:114)
#   at org.apache.fontbox.util.FontManager.analyzeTTF(FontManager.java:112)
#   at org.apache.fontbox.util.FontManager.loadFonts(FontManager.java:75)
#   at org.apache.fontbox.util.FontManager.findTTFontname(FontManager.java:290)
#   at org.apache.fontbox.util.FontManager.findTTFont(FontManager.java:326)
#   at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getExternalFontFile2(PDTrueTypeFont.java:584)
#   at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getawtFont(PDTrueTypeFont.java:510)
#   at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:110)
#   at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:260)
#   at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:499)
#   at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
#   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
#   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
#   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
#   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
#   at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:139)
#   at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801)
#   at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:93)
#   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
#   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
#   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
#   at java.lang.reflect.Method.invoke(Method.java:606)
#   at RJavaTools.invokeMethod(RJavaTools.java:386)
# 
# May 31, 2016 3:43:35 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:35 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif,Italic
# May 31, 2016 3:43:35 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif,Italic
# May 31, 2016 3:43:36 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:36 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:37 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:37 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:37 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif,Italic
# May 31, 2016 3:43:37 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif,Italic
# May 31, 2016 3:43:38 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:38 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:41 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:41 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:41 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif,Italic
# May 31, 2016 3:43:41 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif,Italic
# May 31, 2016 3:43:42 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif,Italic
# May 31, 2016 3:43:42 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif,Italic
# May 31, 2016 3:43:42 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:42 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:42 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:42 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif

Here is the session info

sessionInfo()
# R version 3.3.0 (2016-05-03)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.4 LTS
# 
# locale:
#  [1] LC_CTYPE=en_IN.UTF-8          LC_NUMERIC=C                  LC_TIME=en_IN.UTF-8          
#  [4] LC_COLLATE=en_IN.UTF-8        LC_MONETARY=en_IN.UTF-8       LC_MESSAGES=en_IN.UTF-8      
#  [7] LC_PAPER=en_IN.UTF-8          LC_NAME=en_IN.UTF-8           LC_ADDRESS=en_IN.UTF-8       
# [10] LC_TELEPHONE=en_IN.UTF-8      LC_MEASUREMENT=en_IN.UTF-8    LC_IDENTIFICATION=en_IN.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.9.6 magrittr_1.5     rlist_0.4.6.1    stringr_1.0.0    dplyr_0.4.3      tabulizer_0.1.14
# 
# loaded via a namespace (and not attached):
#  [1] tabulizerjars_0.1.2 R6_2.1.2            assertthat_0.1      parallel_3.3.0      DBI_0.4-1          
#  [6] tools_3.3.0         Rcpp_0.12.5         stringi_1.1.1       chron_2.3-47        rJava_0.9-8        
# [11] png_0.1-7

extract_areas got wrong results for numerical values in pdf table?

I extracted a table from a pdf (text format, not a scanned image) using

extract_areas(file, encoding="UTF-8")
Interactive operation in Rstudio viewer (looks very blurred), see the screenshot:
https://goo.gl/OFvOLn

The resulting data.frame then contained wrong results, like this:
https://goo.gl/YRnU2d
The column number was right, and got right English characters, but the values stored were all wrong.

I cannot figure out what the possible issue is. If someone can help verify the problem, the pdf file can be downloaded from:
http://drp.mk/i/0qpZDLvxkW

My R environment:

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950
[2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.950

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] shiny_0.13.2 tabulizer_0.1.22 magrittr_1.5 data.table_1.9.6

loaded via a namespace (and not attached):
[1] Rcpp_0.12.7 png_0.1-7 digest_0.6.10 mime_0.5
[5] chron_2.3-47 R6_2.1.3 jsonlite_1.0 xtable_1.8-2
[9] git2r_0.15.0 ghit_0.2.12 miniUI_0.1.1 tabulizerjars_0.1.2
[13] Cairo_1.5-9 tools_3.3.1 rsconnect_0.4.3 httpuv_1.3.3
[17] rJava_0.9-8 htmltools_0.3.5

Better rstudio integration for extract_areas function

Hopefully someone from the RStudio team can help with this.

Not loading in R

Hi, I confess I'm not an expert R user but I seem to have some problems in installing Tabulizer in R.

I'm using R Studio and working in a 64bit Windows environment.

I tried loading the package using this line (as I had seen in another thread):

ghit::install_github(c("leeper/tabulizerjars", "leeper/tabulizer"), INSTALL_opts = "--no-multiarch", dependencies = c("Depends", "Imports"))

And that is what I got as an answer:

leeper/tabulizerjars leeper/tabulizer
NA NA
Warning messages:
1: running command '"C:/PROGRA~1/R/R-33~1.0/bin/x64/R" CMD INSTALL --no-multiarch -l "C:\Users...\Documents\R\win-library\3.3" C:\Users...\AppData\Local\Temp\RtmpeML0Qt/ghitdrat/src/contrib/tabulizerjars_0.9.2.tar.gz' had status 1
2: In utils::install.packages(to_install, type = type, repos = repos, :
installation of package ‘tabulizerjars’ had non-zero exit status
3: running command '"C:/PROGRA~1/R/R-33~1.0/bin/x64/R" CMD INSTALL --no-multiarch -l "C:\Users...\Documents\R\win-library\3.3" C:\Users...\AppData\Local\Temp\RtmpeML0Qt/ghitdrat/src/contrib/tabulizer_0.1.24.tar.gz' had status 1
4: In utils::install.packages(to_install, type = type, repos = repos, :
installation of package ‘tabulizer’ had non-zero exit status

Could you help me with that?

Thanks in advance.

Integrating extract_tables with Shiny-app - no reactivity

Thanks for this awesome package. It works well on all the .pdf-documents I have tried it on. I do however have a problem integrating the extract_tables / extract_text functions with my own Shiny-app.

More specifically the problem is that the fileInput-function to upload files doesn't seem to recognize that a new file has been uploaded. This works instantly with other R-functions like read.csv or pdf_text in the pdftools-library.

This works with pdftools :

library(pdftools)
shinyServer(function(input, output) {
    output$contents <- renderText({

        inFile <- input$file1

        if (is.null(inFile))
            return(NULL)
        pdf_text(inFile$datapath)
    })
})

This doesn't work with tabulizer :

library(shiny);library(tabulizer)
shinyServer(function(input, output) {
    output$contents <- renderText({#renderTable

        inFile <- input$file1

        if (is.null(inFile))
            return(NULL)
       extract_text(inFile$datapath)
        #extract_tables(inFile$datapath)[[1]]
        #read.csv(inFile$datapath, header=input$header, sep=input$sep, 
        #         quote=input$quote)
    })
})

ui.R is the same in both cases.

shinyUI(fluidPage(
    titlePanel("Uploading Files"),
    sidebarLayout(
        sidebarPanel(
            fileInput('file1', 'Choose PDF File',
                      accept=c('.pdf'))#,c("application/pdf","adobe-portable-document-format",".pdf"))
        ),
        mainPanel(
            tableOutput('contents')
        )
    )
))

I'm working on a MacPro with
OS X 10.11.4
R 3.2.3
RStudio Version 0.99.887

Installing tabulizer

I'm pretty well stuck trying to install the package as described in the instructions. After installing chocolatey, then Java, then the ghit package, after running the code as laid out in the instructions:
ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"), INSTALL_opts = "--no-multiarch")
I get the following error:
Error in build_and_insert(p$pkgname, d, vers, build_args, verbose = verbose) :
Package build for tabulizerjars failed!
In addition: Warning message:
running command '"C:/PROGRA~1/MIE74D~1/MRO-33~1.1/bin/x64/R" CMD build C:\Users\USERNAME\AppData\Local\Temp\Rtmp0mPJjx\tabulizerjars1cd83bee68cf ' had status 1

Any help troubleshooting would be very much appreciated.

Handle non-latin encodings

This seems really challenging given the quirkiness of the PDF format, but it is the big issue left to implement from rOpenSci onboarding.

Pdf ideas for examples

Scientific papers often have tables and one would surely like to use the area argument.
Bus timetables, e.g. http://www.apsrtc.gov.in/Airport%20Liner%20Timings.pdf or http://www.morbihan.fr/fileadmin/Les_services/Vos_deplacements/Transports_collectifs/Fiches_horaires_TIM/TIM7-Hiver-Printemps-2016.pdf p.3

Handle Encrypted PDFs

Tabulizer can handle encrypted PDFs through a password. I should expose this functionality for completeness' sake.

All this requires is optionally passing a password argument to the objectextractor constructor.

A parameter in extract_tables to assume row names and colnames from first row and column

If tables are coming into R as matrices, could the conversion to data.frames be made simpler with a parameter for assuming that the first row or column holds the column or row names?
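In the meantime, the conversion can be done in user code. The helper below is a hypothetical sketch in base R (matrix_to_df is not part of the package): it promotes the first row to column names and, optionally, the first column to row names, then lets R guess the column types.

```r
# Hypothetical helper: turn a character matrix into a typed data.frame.
matrix_to_df <- function(m, header = TRUE, row_names = FALSE) {
  stopifnot(is.matrix(m))
  m[] <- trimws(m)  # extracted cells often carry trailing spaces

  col_names <- NULL
  if (header) {
    col_names <- m[1, ]
    m <- m[-1, , drop = FALSE]
  }

  rn <- NULL
  if (row_names) {
    rn <- m[, 1]
    m <- m[, -1, drop = FALSE]
    if (!is.null(col_names)) col_names <- col_names[-1]
  }

  df <- as.data.frame(m, stringsAsFactors = FALSE)
  if (!is.null(col_names)) names(df) <- col_names
  if (!is.null(rn)) rownames(df) <- rn

  # Convert "21.0" and the like to numbers where possible.
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

# Small made-up matrix in the shape extract_tables() returns by default.
m <- rbind(c("", "mpg", "cyl"),
           c("Mazda RX4 ", "21.0", "6"),
           c("Datsun 710 ", "22.8", "4"))
df <- matrix_to_df(m, header = TRUE, row_names = TRUE)
str(df)
```

Row names must be unique for the assignment to succeed, so row_names = TRUE is only suitable when the first column is a true identifier.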

annoying warning

Hello,
I get plenty of messages:

There is already a file with this name in the temporary directory. It will be overwritten.

I saw that they come from localize_file(). It seems that the temp path (which is probably not needed for non-URL loaded files) is built using tempdir(), which explains the above message.
Why not use tempfile() to build a proper temporary path? Or skip copying the file if it is not needed?
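The difference can be illustrated in base R. This sketch does not reproduce localize_file(); it only shows that a path built from tempdir() plus a fixed file name collides on repeated calls, while tempfile() hands out a fresh name each time.

```r
# A dummy source file standing in for a PDF.
src <- tempfile(fileext = ".pdf")
writeLines("dummy", src)

# Fixed name inside tempdir(): the same path on every call, so a second
# copy would overwrite the first.
fixed_a <- file.path(tempdir(), "input.pdf")
fixed_b <- file.path(tempdir(), "input.pdf")

# tempfile(): a new, unused path on each call.
copy1 <- tempfile(pattern = "localized_", fileext = ".pdf")
file.copy(src, copy1)
copy2 <- tempfile(pattern = "localized_", fileext = ".pdf")
file.copy(src, copy2)

print(c(same_fixed_path = fixed_a == fixed_b, distinct_tempfiles = copy1 != copy2))
```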

P.S
tabulizer works great !

Extract Area around a matching string

Hello,
I have tables in the format

abc : 12345566
cde : 456782
gef : 45345435

where abc,def are the same and the other number vary. When I extract specific area, I get dataframe with 2 columns which is perfect. My problem however is , the tables sometimes split over two pages depending on the extra lines on number side and there is one value "xyz" which is present for some tables.

Is there a way to get the area around a search string? That way I would know at which value the table was split onto the second page, and, if "xyz" is present, I could change the area accordingly.

Hopefully I am making sense...

Password Protection error for PDF's without PW protection

Hi,

I'm having trouble with some PDFs (which work with the tabula browser). When trying to load them with `extract_areas`, the following error is raised:

f <- "path_to_pdf/0101.pdf"
out1 <- extract_areas(f, pages=c(1))

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.io.IOException
In addition: Warning messages:
1: In load_doc(file, password = password) :
  PDF appears to be password protected and no password was supplied.
2: In load_doc(file, password = password) :
  PDF appears to be password protected and no password was supplied.

An example PDF is available here:
http://www.insee.fr/fr/ppp/bases-de-donnees/donnees-detaillees/circo_leg/donnees/0101.pdf
It is not password protected.

is there a limit on the size of the extraction?

I tried to extract 816 pages using extract_tables from a PDF that has a size of 8.2MB. After 10 minutes of running, the following error message popped up:

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: Java heap space

Would really appreciate your help!

Thank you!
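One approach that may help with `Java heap space` errors is to raise the JVM heap limit before rJava is loaded and to process the document in page chunks. The sketch below rests on assumptions: the `4g` value is arbitrary, `chunk_pages` is a hypothetical helper, and the file name in the commented usage is a placeholder:

```r
# Must be set before rJava (and therefore tabulizer) is loaded in the session;
# the 4g figure is only an example value.
options(java.parameters = "-Xmx4g")

# Hypothetical helper: split page numbers into chunks of at most chunk_size.
chunk_pages <- function(n_pages, chunk_size = 50L) {
  pages <- seq_len(n_pages)
  unname(split(pages, ceiling(pages / chunk_size)))
}

chunks <- chunk_pages(816, 50)

# Usage sketch (not run here; requires tabulizer and a real file):
# tables <- unlist(
#   lapply(chunks, function(p) tabulizer::extract_tables("big.pdf", pages = p)),
#   recursive = FALSE
# )
```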

extract_tables() error: java.lang.NoSuchMethodError

I've just installed tabulizer from GitHub. I'm using macOS Sierra. I also installed the legacy Java from the link given in the install instructions: https://support.apple.com/kb/DL1572?locale=en_US.

When I use extract_tables(), I get the following error:

f <- system.file("examples", "data.pdf", package = "tabulizer")
extract_tables(f)
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.lang.NoSuchMethodError: java.lang.Integer.compare(II)I
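`Integer.compare(int, int)` was added in Java 7, so this error usually means the JVM that rJava is using is Java 6 or older. A small sketch for checking this; `java_major_version` is a hypothetical helper, and the rJava call is shown as a comment because it needs a working Java setup:

```r
# Hypothetical helper: extract the major version from a java.version string,
# handling both the old "1.x" scheme and the newer "x.y.z" scheme.
java_major_version <- function(version_string) {
  numeric_part <- sub("[^0-9.].*$", "", version_string)
  parts <- as.integer(strsplit(numeric_part, ".", fixed = TRUE)[[1]])
  if (parts[1] == 1L && length(parts) >= 2L) parts[2] else parts[1]
}

# With rJava installed, the running JVM's version string can be read with:
# rJava::.jinit()
# v <- rJava::.jcall("java/lang/System", "S", "getProperty", "java.version")
# java_major_version(v) >= 7

legacy <- java_major_version("1.6.0_65")
modern <- java_major_version("11.0.2")
```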

extract_tables() doesn't work with the new update

Hi,

All my packages were deleted due to some stupid mistake. Upon reinstalling (on Windows), extract_tables() doesn't work anymore and gives me this error.

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.lang.IllegalArgumentException: Comparison method violates its general contract!

I reinstalled Java, but it gives me the same error. I tried repeating the process on my Mac as well, and it shows the same error. Is there something I'm doing wrong?

Improve Shiny-based `extract_areas()` functionality

Both locate_areas() and extract_areas() optionally use a Shiny interface to identify areas. This could probably be improved, because I'm not much of a Shiny expert. Any advice on improvements and new functionality can be pitched here and/or submitted as PRs.

Comprehensive installation instructions

This is a help wanted! issue. There is a wide and somewhat recurring set of installation issues (I've tried to systematically label these issues on GitHub), mostly related to variation in Java and rJava across platforms. If anyone wants to contribute PRs to help document troubleshooting, please submit them around this issue. In particular, the README could benefit from even more detail about installation processes, based around OS:

Error using get_n_pages()

Hi, an error as follows occurred when I tried to get the number of pages of a PDF file. I am not sure if it's because of the size of the file, but increasing the JAVA memory didn't help.

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : org.apache.pdfbox.exceptions.WrappedIOException

Thank you!

Not loading into R like described

I've seemingly exhausted my limited knowledge but I'm unable to get the package to load into RStudio. I've installed rJava, and updated Java on my PC (Win7). I followed your instructions on installation, but I think I'm either missing a step or not downloading the correct versions of tabula. Any help would be greatly appreciated.

I'm using 3.3.1, if that matters.
Thanks!

Dependencies not available

When I tried to install the package, the following error message appeared:

Warning: dependencies ‘BiocInstaller’, ‘Rcompression’, ‘glmmADMB’, ‘lme4.0’, ‘cacheSweave’, ‘weaver’, ‘graph’, ‘Biobase’, ‘GenomicRanges’, ‘marray’, ‘affy’, ‘limma’, ‘Rcampdf’, ‘Rgraphviz’, ‘tm.lexicon.GeneralInquirer’, ‘ReportingTools’, ‘globaltest’, ‘R2wd’, ‘RDCOMClient’, ‘rhdf5’ are not available

and was unable to continue installation.

Would you please help with this issue? Thank you!

memory issues

Dear Tabulizer team,

When extracting hundreds of PDFs, is there a good way to clear memory? The memory use keeps growing and I assume this is due to unreleased objects floating around in the heap.
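One possible pattern is to process files in a loop and trigger garbage collection periodically. In the sketch below, `extract_many` and its arguments are hypothetical; the extractor is passed in as a function so the loop itself does not depend on tabulizer. Whether this actually stops the growth depends on where the memory is being held:

```r
# Hypothetical batch helper: run an extractor over many files and collect
# garbage every gc_every files. Errors are captured per file, not fatal.
extract_many <- function(files, extractor, gc_every = 10L) {
  results <- vector("list", length(files))
  names(results) <- files
  for (i in seq_along(files)) {
    # results[i] <- list(...) keeps the slot even if the extractor returns NULL
    results[i] <- list(tryCatch(extractor(files[[i]]), error = function(e) e))
    if (i %% gc_every == 0L) {
      gc()  # R-side collection
      # With rJava loaded, rJava::.jgc() can additionally request a JVM collection.
    }
  }
  results
}

# Demo with a stand-in extractor; in real use this could be
# function(f) tabulizer::extract_tables(f)
demo <- extract_many(c("a.pdf", "bb.pdf", "ccc.pdf"),
                     extractor = function(f) nchar(f),
                     gc_every = 2L)
```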

Add vignette

locate_areas() widget issue

I am getting the following error when using locate_areas() with the native or reduced widgets:

Graphics device does not support event handling...
Entering reduced functionality mode.
Click upper-left and then lower-right corners of area.
Error in try_area_reduced(file = file, dims = dims, area = area, warn = warn) : 
  Graphics device does not support rasterImage() plotting

When searching this error, I was led to this file: try_area_methods.R

Looking on line 74 there is this condition that shows the final line of my error:

if (grDevices::dev.capabilities()[["rasterImage"]] != "no") {
        stop("Graphics device does not support rasterImage() plotting")
    }

There is also this similar condition on line 103 (note the != vs == while producing the same error):

if (grDevices::dev.capabilities()[["rasterImage"]] == "no") {
        stop("Graphics device does not support rasterImage() plotting")
    }

Having checked my grDevices::dev.capabilities(), rasterImage is enabled and so I would think this error would not apply to me. Is the condition from line 74 flipped and causing this error?
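If the reading above is right, the intended check is "stop when the device reports no rasterImage support". A minimal sketch of that predicate; `device_supports_raster` is a hypothetical name, and the capability list is passed as an argument so it can be tried without opening a graphics device:

```r
# Hypothetical predicate: TRUE when the capability list reports any
# rasterImage support ("yes" or "non-missing"), FALSE for "no", NA or absent.
device_supports_raster <- function(caps = grDevices::dev.capabilities()) {
  support <- caps[["rasterImage"]]
  !is.null(support) && !is.na(support) && support != "no"
}

# Guard in the style of the quoted code, with the condition the right way round:
# if (!device_supports_raster()) {
#   stop("Graphics device does not support rasterImage() plotting")
# }

yes_case <- device_supports_raster(list(rasterImage = "yes"))
no_case  <- device_supports_raster(list(rasterImage = "no"))
```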

I attempted to clone the repo and make the change myself but couldn't figure out how to install locally, so I am pointing it out here.

Java Error

I am encountering a Java error when I try to use tabulizer. Here is the relevant code and the error:

The options command maxes out the memory available to Java.

I'm working on a MacBook Pro (Retina, 13-inch, Mid 2014), 2.8 GHz Intel Core i5, 16 GB 1600 MHz DDR3
R 3.3.2

It seems when scraping a repeated filename, the first file remains cached

I was scraping files with the same name in different folders, and when I run extract_tables() and/or extract_areas(), the result always comes from the first file with that name.

A workaround that worked for me: load the library again.

tabulizer is not available (for R version 3.3.2)

I tried installing this R package and it gave the error below.

Installing package into ‘C:/Users/hskir/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘tabulizer’ is not available (for R version 3.3.2)

May I know which version is supported. Thanks

PDF page in left and right format

Thanks for contributing this awesome package.
Most of my pages are separated by a blank area in the middle, so that the left and right paragraphs are independent. Interestingly, extract_tables sometimes works well, but sometimes it treats the left and right paragraphs as a single table. In the second case, some columns in the table are combined and it's hard to extract information. I've uploaded test.pdf.

I wonder whether the function could auto-detect this kind of format, or accept a parameter that specifies it. Thank you.
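Until something like that exists, one workaround is to extract the two halves separately by passing explicit areas. A sketch, assuming areas are given as (top, left, bottom, right) in points as in the `area` examples elsewhere on this page; `split_page_areas` and its `gutter` argument are hypothetical:

```r
# Hypothetical helper: build left and right half-page areas as
# c(top, left, bottom, right), leaving an optional gutter in the middle.
split_page_areas <- function(width, height, gutter = 0) {
  mid <- width / 2
  list(
    left  = c(top = 0, left = 0, bottom = height, right = mid - gutter / 2),
    right = c(top = 0, left = mid + gutter / 2, bottom = height, right = width)
  )
}

# US Letter page (612 x 792 pt) with a 20 pt gutter
areas <- split_page_areas(612, 792, gutter = 20)

# Usage sketch (not run here; requires tabulizer and a real file):
# left_tab  <- tabulizer::extract_tables("test.pdf", pages = 1, guess = FALSE,
#                                        area = list(unname(areas$left)))
# right_tab <- tabulizer::extract_tables("test.pdf", pages = 1, guess = FALSE,
#                                        area = list(unname(areas$right)))
```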

Area problems with multiple pages

I keep getting an error trying to use the area parameter with a specified page range:

e.g., using this file and the command:

extract_tables("Lap Analysis.pdf", guess = FALSE, pages = 2, area = list(c(178, 10, 800, 40)))

I can extract a header, but if I set pages=c(2,3) or remove the pages parameter I get an error:

May 2, 2016 10:55:48 AM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
   java.util.NoSuchElementException
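One thing worth trying is to supply one area per requested page rather than a single area. This is a sketch under the assumption that `area` is matched to `pages` element by element; it is not confirmed that this avoids the exception in the version reported here. `areas_for_pages` is a hypothetical helper:

```r
# Hypothetical helper: repeat a single c(top, left, bottom, right) area
# so that the list has one entry per page.
areas_for_pages <- function(pages, area) {
  stopifnot(is.numeric(area), length(area) == 4L)
  rep(list(area), length(pages))
}

pages <- c(2, 3)
areas <- areas_for_pages(pages, c(178, 10, 800, 40))

# Usage sketch (not run here; requires tabulizer and the PDF):
# tabulizer::extract_tables("Lap Analysis.pdf", guess = FALSE,
#                           pages = pages, area = areas)
```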

Example use cases, tutorials, and applications

This is a help wanted! issue for anyone to contribute example uses of tabulizer to the package wiki. The idea is to add links to existing blog posts or tutorials, as well as add new examples to the wiki itself that showcase various functionality. Anyone can add an example by editing the wiki directly.

Specifying area and using nospreadsheet=True doesn't work

I have a PDF where guessing or auto-detecting tables isn't able to find the table, and the table is a stream table (columns separated by white space). I used tabula-py, where the code goes like:

df=tabula.read_pdf("sample.pdf",nospreadsheet=True,area=(321.3,49.459,836.719,567.109))

I get an empty dataframe after executing it. But specifying the same area through Tabula on Windows produces the output I want.

Is there any way to make this work?

Non-western import test extracts two tables instead of one.

> f3 <- "https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf"
> tab3 <- tabulizer::extract_tables(f3, method = "asis")
> tab3
[1] "Java-Object{[[technology.tabula.TableWithRulingLines[x=0.0,y=72.0,w=612.0,h=720.0,bottom=792.000000,right=612.000000], technology.tabula.TableWithRulingLines[x=0.0,y=0.0,w=612.0,h=792.0,bottom=792.000000,right=612.000000]]]}"

Better options for output

Currently, the output is a character matrix, which kind of makes sense. But there are other options:

List of character matrices (current default)
List of character vectors
- Could be delimited for parsing via read.csv(), etc.
List of data.frames
- This would be nice, but shouldn't be default because some tables won't work well with it if they're not perfectly rectangular. This would also enable automatic variable typing, which would be nice.
Tabula's CSVWriter (implemented but not exposed)
Tabula's TSVWriter
Tabula's JSONWriter
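The delimited-text route mentioned above can be sketched in base R: write the character matrix as CSV text and parse it back with read.csv(), which also gives automatic variable typing. `matrix_to_typed_df` is a hypothetical name, and the sketch assumes a rectangular table whose first row holds the headers:

```r
# Hypothetical helper: round-trip a character matrix through CSV text so that
# read.csv() infers column types. Assumes the first row is the header.
matrix_to_typed_df <- function(m) {
  stopifnot(is.matrix(m))
  m[] <- trimws(m)  # drop padding around extracted cells
  con <- textConnection(NULL, open = "w")
  on.exit(close(con))
  utils::write.table(m, file = con, sep = ",", row.names = FALSE,
                     col.names = FALSE, qmethod = "double")
  utils::read.csv(text = textConnectionValue(con), header = TRUE,
                  stringsAsFactors = FALSE)
}

# Made-up matrix shaped like the output of extract_tables()
m <- rbind(c("mpg", "cyl", "name"),
           c("21.0 ", "6", "Mazda RX4"),
           c("22.8 ", "4", "Datsun 710"))
df <- matrix_to_typed_df(m)
```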

add split and merge functions

This might be useful for handling portions of a very large PDF document, or for combining many PDFs into one for use with extract_areas() or extract_tables(), or for some other purposes.

split_pdf()
https://pdfbox.apache.org/docs/1.8.12/javadocs/org/apache/pdfbox/util/Splitter.html#Splitter()

merge_pdfs()
https://pdfbox.apache.org/docs/1.8.12/javadocs/org/apache/pdfbox/util/PDFMergerUtility.html#PDFMergerUtility()

Subscript out of bounds error for much the same PDF

Thanks for this awesome package. It works well on all the PDF documents I have tried it on. However, I have a problem with extract_tables(), shown below. You can reproduce it in RStudio, too.

This works with this PDF from 2015:

library(tabulizer)
path2pdf <- "/Users/HidetakaKo/Desktop/2015-cookpad.pdf"
out <- extract_tables(path2pdf)
as.data.frame(out[[1]])

This doesn't work with this PDF from 2016:

library(tabulizer)
path2pdf <- "/Users/HidetakaKo/Desktop/2016-cookpad.pdf"
out <- extract_tables(path2pdf)
as.data.frame(out[[1]])

The format of this PDF document is much the same as that of the previous one.

I'm working on a MacAir with
OS X 10.11.6
R 3.3.1
Exploratory Desktop
RStudio Version 0.99.887
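A "subscript out of bounds" error on `out[[1]]` is what R reports when the list is empty, so one plausible (unconfirmed) explanation is that no table was detected in the second file. A defensive sketch; `first_table` is a hypothetical helper:

```r
# Hypothetical helper: return the first extracted table as a data.frame,
# or NULL (with a warning) when nothing was extracted.
first_table <- function(tables) {
  if (length(tables) == 0L) {
    warning("no tables were detected")
    return(NULL)
  }
  as.data.frame(tables[[1]], stringsAsFactors = FALSE)
}

# Indexing an empty list directly would fail: list()[[1]]
empty_result <- suppressWarnings(first_table(list()))
ok_result <- first_table(list(matrix(c("a", "b", "c", "d"), nrow = 2)))
```

If the result is NULL, trying a different extraction method or an explicit area would be the next step.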

Error installing

I have tried to install in RStudio both

ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"), INSTALL_opts = "--no-multiarch")

and

ghit::install_github(c("leeper/tabulizerjars", "leeper/tabulizer"), INSTALL_opts = "--no-multiarch", dependencies = c("Depends", "Imports"))

according to http://stackoverflow.com/questions/39132202/trouble-installing-tabulizer-package-for-r

but the result shows

'leeper/tabulizerjars leeper/tabulizer
"0.1.2" NA
There were 20 warnings (use warnings() to see them)'

There is tabulizerjars in the library but not tabulizer.

When I typed

'install_github("ropensci/tabulizer")'

it shows

Error in read.dcf(file = tmpf) : cannot open the connection
In addition: Warning message:
In read.dcf(file = tmpf) :
cannot open compressed file 'c:\temp\RtmpWQoyv5/ghitdrat/src/contrib/PACKAGES', probable reason 'No such file or directory'

When I typed

ghit::install_github("leeper/tabulizer", INSTALL_opts = "--no-multiarch")
it shows

leeper/tabulizer
NA

Anyone know how to solve this please? There is no tabulizer in the library at the moment.

Thank You.
