rostrovsky / pdf-table

Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV

License: MIT License

PDF-table

What is PDF-table?

PDF-table is a Java utility library for parsing tabular data in PDF documents.
Core processing of PDF documents is performed using Apache PDFBox and OpenCV.

Prerequisites

JDK

Java 8 is required.

External dependencies

pdf-table requires a compiled build of OpenCV 3.4.2 to work properly:

  1. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2

  2. Unpack it and add the directory that contains the OpenCV Java native library to your system PATH:

    • Windows: <opencv dir>\build\java\x64

    • Linux: TODO
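
Before running the parser, it can help to confirm that the JVM can see the OpenCV native library. The sketch below is not part of pdf-table. It assumes that the OpenCV 3.4.2 Java binding is named opencv_java342 (the value of Core.NATIVE_LIBRARY_NAME in that release) and searches the directories listed in java.library.path, which on Windows is typically derived from PATH. The class and method names are hypothetical.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of pdf-table.
class OpenCvNativeCheck {
    // Assumed native library name for OpenCV 3.4.2.
    static final String LIB_NAME = "opencv_java342";

    // Returns the first file in searchPath that matches the platform-specific library name.
    static Optional<Path> findLibrary(String libName, String searchPath) {
        // e.g. opencv_java342.dll on Windows, libopencv_java342.so on Linux
        String fileName = System.mapLibraryName(libName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, fileName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // skip malformed entries in the search path
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String searchPath = System.getProperty("java.library.path", "");
        Optional<Path> lib = findLibrary(LIB_NAME, searchPath);
        System.out.println(lib.map(p -> "Found " + p).orElse("OpenCV native library not found"));
    }
}
```

If the library is reported as missing, check that the directory from step 2 is on PATH for the process that starts the JVM.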

Installation

Add the following dependency to your Maven pom.xml:

<dependency>
  <groupId>com.github.rostrovsky</groupId>
  <artifactId>pdf-table</artifactId>
  <version>1.0.0</version>
</dependency>

Usage

Parsing PDFs

When a PDF document page is parsed, the following operations are performed:

  1. The page is converted to a grayscale image [OpenCV].

  2. A Binary Inverted Threshold (BIT) is applied to the grayscale image [OpenCV].

  3. Contours are detected on the BIT image and a contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].

  4. The contour mask is XORed with the BIT image [OpenCV].

  5. Contours are detected once again on the XORed image (additional Canny filtering can be turned on in this step) [OpenCV].

  6. The final contours are drawn [OpenCV].

  7. Bounding rectangles are detected from the final contours [OpenCV].

  8. The PDF is parsed region by region using the coordinates of the bounding rectangles [Apache PDFBox].

The above algorithm is mostly derived from http://stackoverflow.com/a/23106594.

For more information about the parsed output, refer to Output format.
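
Steps 1, 2 and 4 can be illustrated without OpenCV. The sketch below is plain Java over int arrays and is not the library's implementation, which delegates these steps to OpenCV. The class name, method names and the threshold value are hypothetical. The gray conversion uses the common luma weights 0.299, 0.587 and 0.114.

```java
import java.util.Arrays;

// Plain-Java illustration of steps 1, 2 and 4; not how pdf-table implements them.
class ThresholdSketch {
    // Step 1: convert a packed 0xRRGGBB pixel to a gray level in the range 0-255.
    static int toGray(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
    }

    // Step 2: binary inverted threshold. Pixels brighter than thresh become 0, all others 255,
    // so dark ink on a light page ends up as white foreground on black.
    static int[] binaryInvertedThreshold(int[] gray, int thresh) {
        int[] out = new int[gray.length];
        for (int i = 0; i < gray.length; i++) {
            out[i] = gray[i] > thresh ? 0 : 255;
        }
        return out;
    }

    // Step 4: pixel-wise XOR of two binary images of equal size.
    static int[] xor(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("images must have the same size");
        }
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] ^ b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // one row of pixels: white background, black ink, white background
        int[] gray = {toGray(0xFFFFFF), toGray(0x000000), toGray(0xFFFFFF)};
        int[] bit = binaryInvertedThreshold(gray, 127);
        System.out.println(Arrays.toString(bit)); // [0, 255, 0]
    }
}
```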

single-threaded example

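import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader and ParsedTablePage are provided by pdf-table

class SingleThreadParser {
    public static void main(String[] args) throws IOException {
        try (PDDocument pdfDoc = PDDocument.load(new File("some.pdf"))) {
            PdfTableReader reader = new PdfTableReader();
            List<ParsedTablePage> parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
        }
    }
}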

multi-threaded example

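import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader and ParsedTablePage are provided by pdf-table

class MultiThreadParser {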
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // parse pages simultaneously
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<ParsedTablePage>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<ParsedTablePage> callable = () -> {
                ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
                return page;
            };
            futures.add(executor.submit(callable));
        }

        // collect parsed pages; futures are read in submission order, i.e. page order
        List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
        try {
            for (Future<ParsedTablePage> f : futures) {
                unsortedParsedPages.add(f.get());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // stop the worker threads so the JVM can exit
            executor.shutdown();
        }

        // sort pages by pageNum (defensive; the list is already in page order)
        List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
                .sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum())).collect(Collectors.toList());
    }
}
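
As a possible alternative to submitting tasks one by one, ExecutorService.invokeAll returns its futures in the same order as the task list, so results come back in page order without a separate sort. The sketch below uses only the standard library. The pageTask function is a stand-in for a call such as reader.parsePdfTablePage(pdfDoc, pageNum), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Hypothetical helper, not part of pdf-table.
class OrderedFanOut {
    // Runs pageTask for pages 1..pageCount on a fixed pool and returns the results in page order.
    static <T> List<T> processPages(int pageCount, int threads, IntFunction<T> pageTask)
            throws Exception {
        List<Callable<T>> tasks = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            final int pageNum = page;
            tasks.add(() -> pageTask.apply(pageNum));
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>(pageCount);
            // invokeAll blocks until all tasks finish and keeps the order of the task list
            for (Future<T> f : executor.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            // stop the worker threads so the JVM can exit
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // the lambda stands in for a per-page parsing call
        List<String> pages = processPages(5, 3, n -> "page " + n);
        System.out.println(pages); // [page 1, page 2, page 3, page 4, page 5]
    }
}
```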

Saving PDF pages as PNG images

PDF-table provides methods for saving PDF pages as PNG images.
The rendering DPI can be changed in PdfTableSettings (see: Parsing settings).

single-threaded example

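import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class SingleThreadPNGDump {
    public static void main(String[] args) throws IOException {
        Path outputPath = Paths.get("C:", "some_directory");
        try (PDDocument pdfDoc = PDDocument.load(new File("some.pdf"))) {
            PdfTableReader reader = new PdfTableReader();
            reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
        }
    }
}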

multi-threaded example

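import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class MultiThreadPNGDump {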
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

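        try {
            // wait until every page has been written
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // stop the worker threads so the JVM can exit
            executor.shutdown();
        }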
    }
}

Saving debug PNG images

When tables in a PDF document cannot be parsed correctly with the default settings, you can save debug images that show the page at various stages of processing.
Using these images, you can adjust PdfTableSettings to achieve the desired results (see: Parsing settings).

single-threaded example

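import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class SingleThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        Path outputPath = Paths.get("C:", "some_directory");
        try (PDDocument pdfDoc = PDDocument.load(new File("some.pdf"))) {
            PdfTableReader reader = new PdfTableReader();
            reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
        }
    }
}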

multi-threaded example

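import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class MultiThreadDebugImgsDump {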
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

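        try {
            // wait until every debug image has been written
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // stop the worker threads so the JVM can exit
            executor.shutdown();
        }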
    }
}

Parsing settings

PDF rendering and OpenCV filtering settings are stored in a PdfTableSettings object.

A custom settings instance can be passed to the PdfTableReader constructor when non-default values are needed:

(...)

// build settings object
PdfTableSettings settings = PdfTableSettings.getBuilder()
                .setCannyFiltering(true)
                .setCannyApertureSize(5)
                .setCannyThreshold1(40)
                .setCannyThreshold2(190.5)
                .setPdfRenderingDpi(160)
                .build();

// pass settings to reader
PdfTableReader reader = new PdfTableReader(settings);
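
The rendering DPI controls how many pixels the page image has, which in turn affects how well thin table lines survive thresholding. PDF user space uses 72 points per inch by default, so a length in points scales by dpi / 72 when rendered. The sketch below shows that arithmetic only. It assumes that rectangles found in pixel space have to be scaled back to points before text is extracted from the PDF, which is a guess about the library's internals. The class and method names are hypothetical.

```java
// Hypothetical helper showing the points <-> pixels arithmetic; not part of pdf-table.
class DpiMath {
    // Default PDF user space unit: 1 point = 1/72 inch.
    static final double POINTS_PER_INCH = 72.0;

    // Size in pixels of a length given in points when rendered at the given DPI.
    static int pointsToPixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    // Inverse mapping: a pixel coordinate in the rendered image back to PDF points.
    static double pixelsToPoints(double pixels, int dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        int dpi = 160; // same value as setPdfRenderingDpi(160)
        // a US Letter page measures 612 x 792 points
        System.out.println(pointsToPixels(612, dpi) + " x " + pointsToPixels(792, dpi)); // 1360 x 1760
        // a rectangle edge at pixel 400 corresponds to 180 points
        System.out.println(pixelsToPoints(400, dpi)); // 180.0
    }
}
```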

Output format

Each parsed PDF page is returned as a ParsedTablePage object:

(...)

PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

// the first page in the document has index 1, not 0
ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

// getting page number
assert firstPage.getPageNum() == 1;

// rows and cells are zero-indexed just like elements of the List
// getting first row
ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

// getting third cell in second row
String thirdCellContent = firstPage.getRow(1).getCell(2);

// cell content usually contains <CR><LF> characters,
// so it is recommended to trim it before processing
double thirdCellNumericValue = Double.parseDouble(thirdCellContent.trim());
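
Not every cell holds a number (header cells, for example), so a direct conversion can throw NumberFormatException. The sketch below shows one way to trim and convert cell text defensively. It works on plain strings such as those returned by getCell, and the class and method names are hypothetical rather than part of pdf-table.

```java
import java.util.OptionalDouble;

// Hypothetical helper for converting cell text; not part of pdf-table.
class CellValues {
    // Trims surrounding whitespace (including CR/LF) and parses the cell as a double.
    // Returns an empty result for null, blank or non-numeric content.
    static OptionalDouble parseNumericCell(String cell) {
        if (cell == null) {
            return OptionalDouble.empty();
        }
        String trimmed = cell.trim();
        if (trimmed.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(trimmed));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseNumericCell("12.5\r\n").isPresent());  // true
        System.out.println(parseNumericCell("Total\r\n").isPresent()); // false
    }
}
```

Returning OptionalDouble rather than a sentinel such as 0 keeps missing values distinguishable from real zeros in the table.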
