jonathanlink / pdflayouttextstripper

Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).

Home Page: https://jonathanlink.ch/PDFLayoutTextStripper.html

License: Apache License 2.0

Java 100.00%
layout text java pdf extract data-extraction pdfbox

pdflayouttextstripper's Introduction

PDFLayoutTextStripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content from a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of PDFTextStripper class (from the Apache PDFBox library).

Use cases

Data extraction from a table in a PDF file example

Data extraction from a form in a PDF file example

How to install

Maven

<dependency>
  <groupId>io.github.jonathanlink</groupId>
  <artifactId>PDFLayoutTextStripper</artifactId>
  <version>2.2.3</version>
</dependency>

Manual

  1. Install Apache PDFBox manually (for example v2.0.6), together with its two dependencies, commons-logging and fontbox.

Warning: only PDFBox versions 2.0.0 and later are compatible with this version of PDFLayoutTextStripper.java.

How to use on Linux/Mac

cd PDFLayoutTextStripper
javac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java
java -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test

How to use on Windows

The same as for Linux/Mac (see above), but replace the classpath separator : with ;

Sample code

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class test {
    public static void main(String[] args) {
        String string = null;
        try {
            // Parse the PDF (PDFBox 2.0.x API: PDFParser takes a single RandomAccessRead source).
            PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("./samples/bus.pdf"), "r"));
            pdfParser.parse();
            // try-with-resources closes the document once the text has been extracted.
            try (PDDocument pdDocument = new PDDocument(pdfParser.getDocument())) {
                // PDFLayoutTextStripper keeps the original layout in the extracted text.
                PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                string = pdfTextStripper.getText(pdDocument);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(string);
    }
}

Contributors

Thanks to

  • Dmytro Zelinskyy for reporting an issue together with its fix (v2.2.3)
  • Ho Ting Cheng for reporting an issue (v2.1)
  • James Sullivan for having updated the code to make it work with the latest version of PDFBox (v2.0)

pdflayouttextstripper's Issues

Issue running master

Master returns the following error:

in:

$ javac -cp .:pdfbox.jar:commons-logging.jar:fontbox.jar *.java

out:

test.java:20: error: cannot find symbol
            PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                                                  ^
  symbol:   class PDFLayoutTextStripper
  location: class test
1 error

PR #33 fixes this.

Is it Android compatible?

Can I use this library in my Android app?

I tried and got the following error: java.lang.NoClassDefFoundError: Failed resolution of: Ljava/awt/color/ColorSpace;

Justified text with spaces

Why is the extracted text justified with spaces when the PDF content is not justified?
Here is what one sample line looks like:
This is a max 3-member assignment which is to be submitted CFU repository on EE.

Comparison method violates its general contract!

If I use JavaSE-1.7, I get the error java.lang.IllegalArgumentException: Comparison method violates its general contract! for some of the PDF documents I am trying to convert.

If I use earlier Java versions, it works fine. I read online that the merge sort implementation has changed.
This link may help you: https://stackoverflow.com/questions/11441666/java-error-comparison-method-violates-its-general-contract

I guess the problem is in this method:
private void sortTextPositionList(final List textList) {
    TextPositionComparator comparator = new TextPositionComparator();
    Collections.sort(textList, comparator);
}
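For context: since Java 7, Collections.sort uses TimSort, which can throw this exception when it detects that a comparator is inconsistent (for example, not transitive). The sketch below is not the project's TextPositionComparator, whose implementation is not shown in this report. It only illustrates an ordering that is a consistent total order, using a hypothetical Point class as a stand-in for TextPosition, and it assumes Java 8 for the Comparator helper methods.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PositionOrdering {
    /** Hypothetical stand-in for a text position: just the two coordinates. */
    static final class Point {
        final float x;
        final float y;

        Point(float x, float y) {
            this.x = x;
            this.y = y;
        }
    }

    // Compares y first, then x. Each key is compared exactly (no tolerance),
    // so the ordering is transitive and consistent.
    static final Comparator<Point> BY_Y_THEN_X =
            Comparator.comparingDouble((Point p) -> p.y)
                      .thenComparingDouble(p -> p.x);

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<Point> sorted(List<Point> points) {
        List<Point> copy = new ArrayList<>(points);
        copy.sort(BY_Y_THEN_X);
        return copy;
    }
}
```

A comparator that treats two values as equal when they are "close enough" is a common source of this exception, because closeness is not transitive; grouping positions into rows before sorting is one way to keep a tolerance without breaking the contract.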

getNumberOfNewLinesFromPreviousTextPosition

If height is 0 (which can happen in some documents), the variable int numberOfLines will be 2147483647 (Integer.MAX_VALUE). This results in far too many empty lines being added.

A quick and dirty fix (though it would be better to find out why height is sometimes 0) is to add
if (height == 0) { height = 1; }
before
int numberOfLines = (int) (Math.floor( textYPosition - previousTextYPosition) / height );
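The arithmetic behind this report can be reproduced in isolation. Dividing a positive double by a zero float gives Infinity, and Java's narrowing cast turns Infinity into Integer.MAX_VALUE. The sketch below copies the quoted expression into a hypothetical class (the class and method names are assumptions, not the library's API) and applies the guard suggested above.

```java
final class NewLineCount {
    /** The expression quoted in the report, without any guard. */
    static int unguarded(float textYPosition, float previousTextYPosition, float height) {
        return (int) (Math.floor(textYPosition - previousTextYPosition) / height);
    }

    /** Same expression with the reporter's workaround: a zero height is replaced by 1. */
    static int guarded(float textYPosition, float previousTextYPosition, float height) {
        float safeHeight = (height == 0f) ? 1f : height; // workaround taken from the report
        return unguarded(textYPosition, previousTextYPosition, safeHeight);
    }
}
```

With a vertical distance of 10 and a height of 0, unguarded returns Integer.MAX_VALUE while guarded returns 10.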

New line

While parsing tabular data a new line is invoked every time this condition is met:
if ( textYPosition > previousTextYPosition )

Now this is too sensitive if a row of a table contains two different font sizes.
It doesn't have to be a huge difference in font size.
One point in a font size is enough for the existing function
getNumberOfNewLinesFromPreviousTextPosition()
to call for a new line, which of course results in a bad text output.

I've modified this function to use a simple threshold when checking for a new line:
if ( textYPosition - previousTextYPosition > newLineHeightThreshold )
and now it works perfectly.
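A minimal sketch of the threshold check described above, written as a standalone class. The class name, the constructor, and the way the threshold is supplied are assumptions for illustration; only the comparison itself comes from the report.

```java
final class NewLineDetector {
    private final float newLineHeightThreshold;

    NewLineDetector(float newLineHeightThreshold) {
        this.newLineHeightThreshold = newLineHeightThreshold;
    }

    /** True only when the vertical jump exceeds the threshold. */
    boolean startsNewLine(float textYPosition, float previousTextYPosition) {
        return textYPosition - previousTextYPosition > newLineHeightThreshold;
    }
}
```

With a threshold of 2, a one-point offset caused by a slightly different font size (for example 101 after 100) stays on the same line, while a jump from 100 to 112 starts a new one. A suitable threshold depends on the document; a value tied to the font height may be more robust than a constant.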

BTW: great job Jonathan with this little class :)

PDFLayoutTextStripper StringIndexOutOfBoundsException

I am using PDFLayoutTextStripper with some PDFs and it works perfectly, but today I came across the attached PDF, which causes the following exception:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

at java.lang.String.charAt(String.java:658)
at com.xxx.pdfbox.TextLine.isSpaceCharacterAtIndex(TextLine.java:58)

tim.pdf

PDCIDFontType2 - Could not read embedded OTF for font TimesNewRoman0

Hi,

I am trying to extract data from the attached PDF as shown in your sample code, but I am getting a font error. Just FYI, I am using PDFBox 2.0.6.

Here below is the log:

[main] WARN org.apache.pdfbox.pdmodel.font.FileSystemFontProvider - New fonts found, font cache will be re-built
[main] WARN org.apache.pdfbox.pdmodel.font.FileSystemFontProvider - Building on-disk font cache, this may take a while
[main] WARN org.apache.pdfbox.pdmodel.font.FileSystemFontProvider - Finished building on-disk font cache, found 463 fonts
[main] WARN org.apache.pdfbox.pdmodel.font.PDCIDFontType2 - Could not read embedded OTF for font TimesNewRoman0
java.io.EOFException
at org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:138)
at org.apache.fontbox.ttf.OS2WindowsMetricsTable.read(OS2WindowsMetricsTable.java:827)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:335)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:174)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:117)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:69)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:125)
at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:129)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:83)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at PDFLayoutTextStripper.processPage(PDFLayoutTextStripper.java:81)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at Scan.readPDF(Scan.java:130)
at Scan.main(Scan.java:40)
[main] WARN org.apache.pdfbox.pdmodel.font.PDCIDFontType2 - Using fallback font LiberationSans for CID-keyed TrueType font TimesNewRoman0
[main] WARN org.apache.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+32 (32) in font TimesNewRoman0
[main] WARN org.apache.pdfbox.pdmodel.font.PDCIDFontType2 - Failed to find a character mapping for 32 in TimesNewRoman0
java.lang.IllegalArgumentException: Comparison method violates its general contract!
at java.util.TimSort.mergeHi(TimSort.java:895)
at java.util.TimSort.mergeAt(TimSort.java:512)
at java.util.TimSort.mergeCollapse(TimSort.java:437)
at java.util.TimSort.sort(TimSort.java:241)
at java.util.Arrays.sort(Arrays.java:1435)
at java.util.Collections.sort(Collections.java:230)
at PDFLayoutTextStripper.sortTextPositionList(PDFLayoutTextStripper.java:114)
at PDFLayoutTextStripper.writePage(PDFLayoutTextStripper.java:92)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
at PDFLayoutTextStripper.processPage(PDFLayoutTextStripper.java:81)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at Scan.readPDF(Scan.java:130)
at Scan.main(Scan.java:40)

CLMS#56589936#2.PDF

Imports Issue

The org.apache.pdfbox.io imports aren't working for me.
Where do I download them, and how do I get them to work?

Create LICENSE file

Hi, could you please provide a file with the licensing information in the root directory, so that the applicable license is easily apparent and is also correctly linked in the GitHub UI? Thanks.

Error String index out of range: -1 in PDFLayoutTextStripper

Hi,
I have this code, with an attached PDF to test.
public void doStrip() {
    String string = null;
    try {
        PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("D:/escaner/errorsPDFBOX/AN20-0149-0602201842.pdf"), "r"));
        pdfParser.parse();
        PDDocument pdDocument = new PDDocument(pdfParser.getDocument());
        PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
        string = pdfTextStripper.getText(pdDocument);
        BufferedWriter writer = Files.newBufferedWriter(FileSystems.getDefault().getPath("D:/escaner","fichero.txt"), Charset.forName("UTF-8"));
        writer.write(string);
        writer.flush();
        writer.close();
    } catch (InvalidPasswordException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

AN20-0149-0602201842.pdf
I get this exception:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.charAt(String.java:658)
at com.sagedillepasa.gestion.TextLine.isSpaceCharacterAtIndex(PDFLayoutTextStripper.java:269)
at com.sagedillepasa.gestion.TextLine.getNextValidIndex(PDFLayoutTextStripper.java:283)
at com.sagedillepasa.gestion.TextLine.computeIndexForCharacter(PDFLayoutTextStripper.java:263)
at com.sagedillepasa.gestion.TextLine.writeCharacterAtIndex(PDFLayoutTextStripper.java:229)
at com.sagedillepasa.gestion.PDFLayoutTextStripper.writeLine(PDFLayoutTextStripper.java:127)
at com.sagedillepasa.gestion.PDFLayoutTextStripper.writeTextPositionList(PDFLayoutTextStripper.java:157)
at com.sagedillepasa.gestion.PDFLayoutTextStripper.iterateThroughTextList(PDFLayoutTextStripper.java:152)
at com.sagedillepasa.gestion.PDFLayoutTextStripper.writePage(PDFLayoutTextStripper.java:96)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
at com.sagedillepasa.gestion.PDFLayoutTextStripper.processPage(PDFLayoutTextStripper.java:80)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at com.sagedillepasa.gestion.test.doStrip(test.java:44)
at com.sagedillepasa.gestion.test.main(test.java:61)

Deploy artifact to Maven Central

Hi, could you please provide an artifact with this class on Maven Central, so others can easily use it without having to build it from source? Thanks!

To print some character every time a table is encountered.

I am using your code to extract table data from my PDF, but the spacing between the columns is not equal and the data spans multiple lines. Is it possible to print a symbol or character every time the boundary of a cell or rectangle is encountered, so that it is easy to tell which column each piece of data belongs to?
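One possible workaround, sketched here as post-processing of the extracted text and not as a feature of the library: treat every run of two or more spaces in a line as a column gap and replace it with a visible separator. The class and method names are hypothetical, and the two-space rule is an assumption that will not hold for every document.

```java
final class ColumnMarker {
    /** Replaces each run of two or more whitespace characters with " | ". */
    static String markColumns(String line) {
        // Leading and trailing padding is dropped first so it does not produce empty cells.
        String[] cells = line.trim().split("\\s{2,}");
        return String.join(" | ", cells);
    }
}
```

This heuristic cannot see the cell borders drawn in the PDF: empty cells disappear, and columns separated by a single space are merged.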

isSpaceCharacterAtIndex

return this.line.charAt(index) != SPACE_CHARACTER;

It can happen that index is outside the bounds of the line (larger than its length).

Use try..catch, or test the index against line.length().

I lazily fixed it with try..catch and the result is OK for me. It would be better to find out why index ends up larger than line.length().
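The second option mentioned above (testing the index instead of catching the exception) could look like the sketch below. The class is a hypothetical stand-in built around the single quoted line; the method name and the choice to return false for out-of-range positions are assumptions, not the library's behavior.

```java
final class TextLineSketch {
    private static final char SPACE_CHARACTER = ' ';
    private final String line;

    TextLineSketch(String line) {
        this.line = line;
    }

    /** Bounds-checked variant of the quoted comparison. */
    boolean isNonSpaceAt(int index) {
        if (index < 0 || index >= line.length()) {
            return false; // assumption: positions outside the line count as empty
        }
        return line.charAt(index) != SPACE_CHARACTER;
    }
}
```

The check also covers negative indexes, which is the case reported in the StringIndexOutOfBoundsException traces (index -1).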

txt samples

It's really great you have a couple samples up, but do consider posting the pdf source and txt file result as samples.
Thanks!

Tail characters getting stripped off

I am working with a host of PDF reports and while I am able to maintain the layout using your class, sometimes the tail characters are getting stripped off, but the parent class i.e. PDFTextStripper works fine.

Does this have anything to do with this.setCurrentPageWidth(pageRectangle.getWidth()); ??

By the way great work with the class, made the process of extracting tables so easy.

Sample Code Doesn't Work

The sample code in the README indicates that the PDFParser constructor takes a RandomAccessFile and a String.

There is no constructor present with this signature however.

Zero height of TextPosition is not handled

Hi,

I ran into a case where PDFBox returns 0 as the height of a TextPosition in the method "getNumberOfNewLinesFromPreviousTextPosition". I am not sure why the height is 0, but it leads to a very long loop in the "createNewEmptyNewLines" method. If the height of the TextPosition is 0, we divide by 0 and numberOfLines becomes equal to Integer.MAX_VALUE.
I suggest adding a new condition (textPosition.getHeight() != 0) to that line to handle this case. Does that make sense? If so, I can create a pull request.

Regards,
Timur
