jonathanlink / pdflayouttextstripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful, for instance, to extract the content of a table in a PDF file. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).

Home Page: https://jonathanlink.ch/PDFLayoutTextStripper.html

License: Apache License 2.0

Java 100.00%

layout text java pdf extract data-extraction pdfbox

pdflayouttextstripper's Introduction

PDFLayoutTextStripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content from a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of the PDFTextStripper class (from the Apache PDFBox library).

Use cases

Data extraction from a table in a PDF file

Data extraction from a form in a PDF file

How to install

Maven

<dependency>
  <groupId>io.github.jonathanlink</groupId>
  <artifactId>PDFLayoutTextStripper</artifactId>
  <version>2.2.3</version>
</dependency>

Manual

Install Apache PDFBox manually (for example v2.0.6) together with its two dependencies, commons-logging and fontbox.

Warning: only PDFBox versions 2.0.0 and later are compatible with this version of PDFLayoutTextStripper.java.

How to use on Linux/Mac

cd PDFLayoutTextStripper
javac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java
java -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test

How to use on Windows

The same as for Linux (see above) but replace : with ;

Sample code

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class test {
    public static void main(String[] args) {
        String string = null;
        try {
            // Parse the PDF (PDFBox 2.0.x API, matching the jars listed above).
            PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("./samples/bus.pdf"), "r"));
            pdfParser.parse();
            // try-with-resources closes the document once the text has been extracted.
            try (PDDocument pdDocument = new PDDocument(pdfParser.getDocument())) {
                PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                string = pdfTextStripper.getText(pdDocument);
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException, which is a subclass of IOException.
            e.printStackTrace();
        }
        System.out.println(string);
    }
}

Contributors

Thanks to

Dmytro Zelinskyy for reporting an issue together with its fix (v2.2.3)
Ho Ting Cheng for reporting an issue (v2.1)
James Sullivan for having updated the code to make it work with the latest version of PDFBox (v2.0)

pdflayouttextstripper's Issues

will it provide a python wrapper

GitHub release

Could you release versioned tarballs here on GitHub?

Issue running master

Master returns the following error:

in:

$ javac -cp .:pdfbox.jar:commons-logging.jar:fontbox.jar *.java

out:

test.java:20: error: cannot find symbol
            PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                                                  ^
  symbol:   class PDFLayoutTextStripper
  location: class test
1 error

PR #33 fixes this.

Is it Android compatible?

Can I use this library in my Android app?

I tried and got the following error: java.lang.NoClassDefFoundError: Failed resolution of: Ljava/awt/color/ColorSpace;

Justified text with spaces

Why am I getting the text justified with spaces while the PDF content is not justified?
Here is what one sample line looks like:
This is a max 3-member assignment which is to be submitted CFU repository on EE.

Feature Request

Please make PDFLayoutTextStripper compatible with PdfBox-Android so that it can be used on mobile Android devices. Thanks.

difference with pdftotext -layout?

What is the difference between this library and the pdftotext command?

Comparison method violates its general contract!

If I use JavaSE-1.7, I get the error java.lang.IllegalArgumentException: Comparison method violates its general contract! for some of the PDF documents I am trying to convert.

With earlier Java versions it works fine. I read online that the merge sort implementation has changed.
This link may help you: https://stackoverflow.com/questions/11441666/java-error-comparison-method-violates-its-general-contract

I guess the problem is in this method:
private void sortTextPositionList(final List<TextPosition> textList) {
    TextPositionComparator comparator = new TextPositionComparator();
    Collections.sort(textList, comparator);
}
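
For context, this exception is raised by TimSort, the algorithm behind Collections.sort since Java 7, when it notices that a Comparator gives inconsistent answers. The sketch below illustrates one way that can happen with coordinate data: a tolerance-based comparison that is not transitive. The Pos class, both comparators and the 1.0 tolerance are hypothetical and chosen for illustration; the report does not confirm that the library's TextPositionComparator behaves like this.

```java
import java.util.Comparator;

final class ComparatorContractSketch {

    /** Hypothetical stand-in for a text position; not the PDFBox TextPosition class. */
    static final class Pos {
        final float x;
        final float y;

        Pos(float x, float y) {
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Tolerance-based ordering: positions whose y values differ by less than 1.0
     * count as the same line and are ordered by x. This relation is not
     * transitive, so it can violate the Comparator contract.
     */
    static final Comparator<Pos> FUZZY = (a, b) ->
            Math.abs(a.y - b.y) < 1.0f ? Float.compare(a.x, b.x) : Float.compare(a.y, b.y);

    /**
     * Contract-safe alternative: snap y to an integer "line bucket" first, then
     * order by bucket and by x. This is a plain lexicographic order.
     */
    static final Comparator<Pos> BUCKETED = (a, b) -> {
        int byLine = Integer.compare(Math.round(a.y), Math.round(b.y));
        return byLine != 0 ? byLine : Float.compare(a.x, b.x);
    };

    /** True if a < b and b < c, yet a is not < c under the given comparator. */
    static boolean breaksTransitivity(Comparator<Pos> cmp, Pos a, Pos b, Pos c) {
        return cmp.compare(a, b) < 0 && cmp.compare(b, c) < 0 && cmp.compare(a, c) >= 0;
    }

    public static void main(String[] args) {
        Pos a = new Pos(5f, 0.0f);
        Pos b = new Pos(1f, 1.2f);
        Pos c = new Pos(2f, 0.6f);
        System.out.println("fuzzy breaks transitivity: " + breaksTransitivity(FUZZY, a, b, c));
        System.out.println("bucketed breaks transitivity: " + breaksTransitivity(BUCKETED, a, b, c));
    }
}
```

Bucketing changes which glyphs count as sharing a line (values just either side of a rounding boundary land in different buckets), so it trades some tolerance for a consistent ordering.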

Comparison method violates its general contract!

Hi Jonathan,
I ran into the same problem with the latest version.
Please see closed issue #16 for more details and an example sheet.

getNumberOfNewLinesFromPreviousTextPosition

If height is 0 (which can happen in some documents), the variable "int numberOfLines" becomes 2147483647 (Integer.MAX_VALUE). This results in far too many empty lines being added.

A quick and dirty fix, although it would be better to find out why height is sometimes 0, is to add
if (height == 0) { height = 1; }
before
int numberOfLines = (int) (Math.floor( textYPosition - previousTextYPosition) / height );
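
The arithmetic behind this report can be reproduced in isolation. In Java, dividing a positive floating-point value by zero yields Infinity rather than throwing, and casting Infinity to int saturates at Integer.MAX_VALUE. The sketch below copies the quoted expression into a hypothetical helper class (the class and method names are illustrative, not part of the library) and applies the quick fix from the report.

```java
final class ZeroHeightSketch {

    /** The expression quoted in the report, with no guard on height. */
    static int unguarded(float textYPosition, float previousTextYPosition, float height) {
        return (int) (Math.floor(textYPosition - previousTextYPosition) / height);
    }

    /** Same expression with the quick fix from the report: a zero height is replaced by 1. */
    static int guarded(float textYPosition, float previousTextYPosition, float height) {
        if (height == 0) {
            height = 1;
        }
        return (int) (Math.floor(textYPosition - previousTextYPosition) / height);
    }

    public static void main(String[] args) {
        System.out.println(unguarded(20f, 10f, 0f)); // 10.0 / 0 is Infinity, which casts to Integer.MAX_VALUE
        System.out.println(guarded(20f, 10f, 0f));
        System.out.println(guarded(20f, 10f, 5f));
    }
}
```

Substituting 1 is only one possible fallback; it treats every unit of vertical distance as a line, which may still produce more blank lines than intended.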

New line

While parsing tabular data a new line is invoked every time this condition is met:
if ( textYPosition > previousTextYPosition )

Now this is too sensitive if a row of a table contains two different font sizes.
It doesn't have to be a huge difference in font size.
One point in a font size is enough for the existing function
getNumberOfNewLinesFromPreviousTextPosition()
to call for a new line, which of course results in a bad text output.

I've modified this function to use a simple threshold when checking for a new line:
if ( textYPosition - previousTextYPosition > newLineHeightThreshold )
and now it works perfectly.

BTW: great job Jonathan with this little class :)
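
The two conditions quoted in this report can be compared side by side. The sketch below is a standalone illustration: the class name and the threshold value of 2.0 are assumptions, since the report does not say which value was used for newLineHeightThreshold.

```java
final class NewLineThresholdSketch {

    /** Hypothetical tolerance in page units; the report does not state the value used. */
    static final float NEW_LINE_HEIGHT_THRESHOLD = 2.0f;

    /** Original condition: any increase in y starts a new line. */
    static boolean startsNewLineStrict(float textYPosition, float previousTextYPosition) {
        return textYPosition > previousTextYPosition;
    }

    /** Modified condition: only an increase larger than the threshold starts a new line. */
    static boolean startsNewLineWithThreshold(float textYPosition, float previousTextYPosition) {
        return textYPosition - previousTextYPosition > NEW_LINE_HEIGHT_THRESHOLD;
    }

    public static void main(String[] args) {
        // Two glyphs in the same table row whose y values differ slightly because of font size.
        System.out.println(startsNewLineStrict(100.8f, 100f));
        System.out.println(startsNewLineWithThreshold(100.8f, 100f));
        // A genuinely new row further down the page.
        System.out.println(startsNewLineWithThreshold(112f, 100f));
    }
}
```

A fixed threshold has to stay below the smallest real line spacing in the document, otherwise closely spaced rows would be merged.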

PDFLayoutTextStripper StringIndexOutOfBoundsException

I am using PDFLayoutTextStripper with some PDFs and it works perfectly, but today I came across the attached PDF, which produces the following exception:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

at java.lang.String.charAt(String.java:658)
at com.xxx.pdfbox.TextLine.isSpaceCharacterAtIndex(TextLine.java:58)

tim.pdf

PDCIDFontType2 - Could not read embedded OTF for font TimesNewRoman0

Hi,

I am trying to extract data from the attached PDF as shown in your sample code, but I am getting a font error. For reference, I am using PDFBox 2.0.6.

Here below is the log:

[main] WARN org.apache.pdfbox.pdmodel.font.FileSystemFontProvider - New fonts found, font cache will be re-built
[main] WARN org.apache.pdfbox.pdmodel.font.FileSystemFontProvider - Building on-disk font cache, this may take a while
[main] WARN org.apache.pdfbox.pdmodel.font.FileSystemFontProvider - Finished building on-disk font cache, found 463 fonts
[main] WARN org.apache.pdfbox.pdmodel.font.PDCIDFontType2 - Could not read embedded OTF for font TimesNewRoman0
java.io.EOFException
at org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:138)
at org.apache.fontbox.ttf.OS2WindowsMetricsTable.read(OS2WindowsMetricsTable.java:827)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:335)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:174)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:117)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:69)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:125)
at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:129)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:83)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at PDFLayoutTextStripper.processPage(PDFLayoutTextStripper.java:81)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at Scan.readPDF(Scan.java:130)
at Scan.main(Scan.java:40)
[main] WARN org.apache.pdfbox.pdmodel.font.PDCIDFontType2 - Using fallback font LiberationSans for CID-keyed TrueType font TimesNewRoman0
[main] WARN org.apache.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+32 (32) in font TimesNewRoman0
[main] WARN org.apache.pdfbox.pdmodel.font.PDCIDFontType2 - Failed to find a character mapping for 32 in TimesNewRoman0
java.lang.IllegalArgumentException: Comparison method violates its general contract!
at java.util.TimSort.mergeHi(TimSort.java:895)
at java.util.TimSort.mergeAt(TimSort.java:512)
at java.util.TimSort.mergeCollapse(TimSort.java:437)
at java.util.TimSort.sort(TimSort.java:241)
at java.util.Arrays.sort(Arrays.java:1435)
at java.util.Collections.sort(Collections.java:230)
at PDFLayoutTextStripper.sortTextPositionList(PDFLayoutTextStripper.java:114)
at PDFLayoutTextStripper.writePage(PDFLayoutTextStripper.java:92)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
at PDFLayoutTextStripper.processPage(PDFLayoutTextStripper.java:81)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at Scan.readPDF(Scan.java:130)
at Scan.main(Scan.java:40)

CLMS#56589936#2.PDF

can't generate text file

Imports Issue

The org.apache.pdfbox.io imports aren't working for me.
Where do I download them, and how do I get them to work?

Text extraction of PDF AcroForm Fields

It looks like the current git version does not support extracting the text of PDF AcroForm fields. Do you plan to support that?

I am attaching a PDF file for which the filled-in information does not get extracted.

Marjorie_DMV.pdf

Create LICENSE file

Hi, could you please provide a file with the licensing information in the root directory, so that the applicable license is easy to find and is correctly linked in the GitHub UI? Thanks.

Error String index out of range: -1 in PDFLayoutTextStripper

Hi,
I have this code, with the attached PDF to test.
public void doStrip() {
    String string = null;
    try {
        PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("D:/escaner/errorsPDFBOX/AN20-0149-0602201842.pdf"), "r"));
        pdfParser.parse();
        PDDocument pdDocument = new PDDocument(pdfParser.getDocument());
        PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
        string = pdfTextStripper.getText(pdDocument);
        BufferedWriter writer = Files.newBufferedWriter(FileSystems.getDefault().getPath("D:/escaner","fichero.txt"), Charset.forName("UTF-8"));
        writer.write(string);
        writer.flush();
        writer.close();
    } catch (InvalidPasswordException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

AN20-0149-0602201842.pdf
I get this exception:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.charAt(String.java:658)
at com.sagedillepasa.gestion.TextLine.isSpaceCharacterAtIndex(PDFLayoutTextStripper.java:269)
at com.sagedillepasa.gestion.TextLine.getNextValidIndex(PDFLayoutTextStripper.java:283)
at com.sagedillepasa.gestion.TextLine.computeIndexForCharacter(PDFLayoutTextStripper.java:263)
at com.sagedillepasa.gestion.TextLine.writeCharacterAtIndex(PDFLayoutTextStripper.java:229)
at com.sagedillepasa.gestion.PDFLayoutTextStripper.writeLine(PDFLayoutTextStripper.java:127)
at com.sagedillepasa.gestion.PDFLayoutTextStripper.writeTextPositionList(PDFLayoutTextStripper.java:157)
at com.sagedillepasa.gestion.PDFLayoutTextStripper.iterateThroughTextList(PDFLayoutTextStripper.java:152)
at com.sagedillepasa.gestion.PDFLayoutTextStripper.writePage(PDFLayoutTextStripper.java:96)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
at com.sagedillepasa.gestion.PDFLayoutTextStripper.processPage(PDFLayoutTextStripper.java:80)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at com.sagedillepasa.gestion.test.doStrip(test.java:44)
at com.sagedillepasa.gestion.test.main(test.java:61)

Deploy artifact to Maven Central

Hi, could you please provide an artifact with this class on Maven Central, so others can easily use it without having to build it from source? Thanks!

How to support Chinese characters from PDF to txt?

I have found some solutions that use OCR to identify Chinese characters. How can OCR be integrated with PDFLayoutTextStripper?
Any reply will be appreciated. Thanks.

Missing last row of every page in PDF

Hello, thank you for making this open source.

I tried the new class, but the result is missing the last row of every page in the PDF.

Print a character every time a table is encountered

I am using your code to extract table data from my PDF, but the spacing between the columns is not equal and the data is multiline. Is it possible to print a symbol or character every time the boundary of a cell or rectangle is encountered, so that it is easy to tell which column the data belongs to?

isSpaceCharacterAtIndex

return this.line.charAt(index) != SPACE_CHARACTER;

It can happen that index is larger than the length of the character array.

Possible fixes: a try..catch, or a test against line.length.

I applied a lazy fix with try..catch and the result is OK for me. It would be better to find out why index is larger than line.length.
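
The second option mentioned here, testing against the line length, can be sketched as follows. The class is a hypothetical stand-in for the library's TextLine; the comparison is copied as quoted in this report, and returning false for an out-of-range index (including the negative index seen in the stack traces of other reports) is an assumption about the desired behaviour, not the author's fix.

```java
final class TextLineBoundsSketch {

    private static final char SPACE_CHARACTER = ' ';

    private final String line;

    TextLineBoundsSketch(String line) {
        this.line = line;
    }

    /**
     * Same comparison as the quoted line, but indices outside the line return
     * false instead of throwing StringIndexOutOfBoundsException.
     * The choice of false as the default is an assumption.
     */
    boolean isSpaceCharacterAtIndex(int index) {
        if (index < 0 || index >= line.length()) {
            return false;
        }
        return line.charAt(index) != SPACE_CHARACTER;
    }

    public static void main(String[] args) {
        TextLineBoundsSketch textLine = new TextLineBoundsSketch("a b");
        System.out.println(textLine.isSpaceCharacterAtIndex(-1)); // out of range, no exception
        System.out.println(textLine.isSpaceCharacterAtIndex(3));  // out of range, no exception
        System.out.println(textLine.isSpaceCharacterAtIndex(0));
    }
}
```

An explicit range check avoids using exceptions for control flow, but like the try..catch it only hides the symptom; the reason the computed index leaves the line is still worth tracking down.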

txt samples

It's really great you have a couple samples up, but do consider posting the pdf source and txt file result as samples.
Thanks!

Are other languages such as Chinese supported?

Are other languages such as Chinese supported? What should I do in order to use this feature? With just Apache PDFBox, I can extract text from the PDF documents.

Tail characters getting stripped off

I am working with a host of PDF reports. I am able to maintain the layout using your class, but sometimes the tail characters get stripped off, whereas the parent class, PDFTextStripper, works fine.

Does this have anything to do with this.setCurrentPageWidth(pageRectangle.getWidth());?

By the way, great work with the class; it made extracting tables so easy.

Sample Code Doesn't Work

The sample code in the README indicates that the PDFParser constructor takes a RandomAccessFile and a string.

However, there is no constructor with this signature.

lol my team is building a very complex one of these. you should check it out.

C#

Any chance it could be implemented for the C# PdfBox? https://www.codeproject.com/Articles/538617/Working-with-PDF-files-in-Csharp-using-PdfBox-and

Multi-line cells

How well does this library handle wrapped text inside cells? Are the wrapped lines read as a different row?

There's another project that aims to solve this issue: https://github.com/tabulapdf/tabula

not found: type PDFLayoutTextStripper

Zero height of TextPosition is not handled

Hi,

I hit a case where PDFBox returns 0 as the height of a TextPosition in the method "getNumberOfNewLinesFromPreviousTextPosition". I am not sure why the height is 0, but it leads to a very large loop in the "createNewEmptyNewLines" method: if the height of the TextPosition is 0, we divide by 0 and numberOfLines becomes Integer.MAX_VALUE.
I suggest adding a new condition (textPosition.getHeight() != 0) to that line for this case. Does that make sense? If so, I can create a pull request.

Regards,
Timur
