dpapathanasiou / pdfminer-layout-scanner
A more complete example of programming with PDFMiner, which continues where the default documentation stops
License: MIT License
PDFMiner (http://www.unixuser.org/~euske/python/pdfminer/index.html) is a pdf parsing library written in Python by Yusuke Shinyama.

In addition to the pdf2txt.py and dumppdf.py command line tools, there is a way of analyzing the content tree of each page programmatically.

This is a more complete example of programming with PDFMiner, which continues where the default documentation (http://www.unixuser.org/~euske/python/pdfminer/programming.html#layout) stops.

This code is still a work-in-progress, with room for improvement.

Usage: import layout_scanner and call get_toc() for a list of the table of contents, and get_pages() for the full text.

Here are some examples using the Python shell:

    >>> import layout_scanner
    >>> toc=layout_scanner.get_toc('/path/to/your/pdf-file.pdf')
    >>> len(toc)
    ... should return the number of elements in the pdf document's table of contents (or 0 if there is no TOC)
    >>> toc[0]
    ... a tuple containing the ordinal sequence and the title string, for example: (1, u'Introduction')
    >>> pages=layout_scanner.get_pages('/path/to/your/pdf-file.pdf')
    >>> len(pages)
    ... should return the number of pages in the pdf document
    >>> pages[0]
    ... a string of all the text on the first page

Room for Improvement

* Column Merging - while the fuzzy heuristic I described works well for the pdf files I've parsed so far, I can imagine more complex documents where it would break down (perhaps this is where the analysis should be more sophisticated, and not ignore so many types of pdfminer.layout.LT* objects).
* Image Extraction - I'd like to be able to be at least as good as pdftoimages, and save every file in ppm or pnm default format, but I'm not sure what I could be doing differently.
* Title and Heading Capitalization - this seems to be an issue with PDFMiner, since I get similar results in using the command line tools, but it is annoying to have to go back and fix all the mis-capitalizations manually, particularly for larger documents.
* Title and Heading Fonts and Spacing - a related issue, though probably something in my own code, is that those same title and paragraph headings aren't distinguished from the rest of the text. In many cases, I have to go back and add vertical spacing and font attributes for those manually.
* Page Number Removal - originally, I thought I could just use a regex for an all-numeric value on a single physical line, but each document does page numbering slightly differently, and it's very difficult to get rid of these without manually proofreading each page.
* Footnotes - handling these where the note and the reference both appear on the same page is hard enough, but doing it when they span different (even consecutive) pages is worse.
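The naive page-number idea mentioned in the list above (a regex for an all-numeric value on a single physical line) can be sketched as follows. This is only an illustration of that heuristic, not code from layout_scanner; the names PAGE_NUMBER_LINE and strip_numeric_lines are hypothetical.

```python
import re

# Hypothetical pattern: a physical line holding nothing but digits
# (optionally surrounded by whitespace).
PAGE_NUMBER_LINE = re.compile(r"^\s*\d+\s*$")


def strip_numeric_lines(page_text):
    """Drop every line of page_text that consists only of digits.

    This is the simple heuristic only: it also removes legitimate
    numeric-only lines and misses styles such as 'Page 3 of 10'.
    """
    kept = [
        line
        for line in page_text.splitlines()
        if not PAGE_NUMBER_LINE.match(line)
    ]
    return "\n".join(kept)
```

As the README notes, documents number their pages differently, so a filter like this would probably need per-document tuning and proofreading.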
The latest pdfminer API has removed the PDFDocument.initialize() method. See http://euske.github.io/pdfminer/index.html#changes
If anyone hits the error "PDFDocument has no attribute initialize()", just comment out line 32: #doc.initialize(pdf_pwd)
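Instead of commenting the line out, one option that should keep the script working on both old and new PDFMiner releases is to call initialize() only when the method exists. A minimal sketch; the helper name initialize_if_supported is hypothetical and is not part of PDFMiner or layout_scanner:

```python
def initialize_if_supported(doc, password=""):
    """Call doc.initialize(password) only if this PDFMiner version has it.

    Returns True when the legacy method was called, False otherwise.
    """
    legacy_init = getattr(doc, "initialize", None)
    if callable(legacy_init):
        legacy_init(password)
        return True
    # Newer API: nothing to do here, the method was removed.
    return False
```

With this in place, line 32 would become something like initialize_if_supported(doc, pdf_pwd). In releases without initialize(), the password is, as far as I know, passed when the PDFDocument is constructed, so password-protected files may need that change as well.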
There's a typo in "parse_lt_objs" when the function calls itself on an "LTFigure" object--in that recursive call, the argument "lt_obj.objs" found in the original code at http://denis.papathanasiou.org/posts/2010.08.04.post.html has been replaced by "lt_obj", which starts an ugly endless loop.
Unfortunately, in the current version of PDFMiner, the "LTFigure" class has no ".objs" attribute. The proper fix would seem to be to call the "analyze" method on the "LTFigure" object, which constructs and returns the (internal) "_objs" attribute, but I'm not positive it's working correctly--my documents may not have "LTFigures" that actually contain child objects.
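A possible alternative that avoids both ".objs" and the internal attribute: the layout containers in recent PDFMiner versions appear to be iterable (an assumption worth checking against your installed version), so the recursion can walk the children by iterating the object itself. A sketch, using a hypothetical helper name walk_layout that is not part of layout_scanner:

```python
def walk_layout(lt_obj):
    """Yield lt_obj and then all of its descendants, depth-first.

    Assumes container objects (e.g. an LTFigure) are iterable over
    their children; objects that are not iterable are treated as leaves.
    """
    yield lt_obj
    # Strings are iterable but are never layout containers; skip them
    # so a one-character string cannot recurse into itself forever.
    if isinstance(lt_obj, (str, bytes)):
        return
    try:
        children = iter(lt_obj)
    except TypeError:
        return  # leaf object, nothing to descend into
    for child in children:
        yield from walk_layout(child)
```

Because each recursive call receives a child rather than the same object again, this form cannot fall into the endless loop described above.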
use pdf2txt.py samples/simple1.pdf
Hello,
I'm receiving "error saving image on page" for each image of a pdf file, but cannot see why.
See layout_scanner.py line 169
Hey!
I am getting images with bytes as hex == 789cec9d
Do you know what kind of image that might be, so that I can properly extract them?
thanks!
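For what it's worth, the bytes 78 9c are the usual zlib header, so this looks like zlib-compressed stream data rather than an image file format of its own. A rough sketch of how one might detect and inflate such data; the helper names looks_like_zlib and maybe_inflate are hypothetical and not part of PDFMiner or layout_scanner:

```python
import zlib


def looks_like_zlib(data):
    """Return True if data starts with a plausible zlib header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # Low nibble 8 means "deflate", the window-size nibble is at most 7,
    # and the two header bytes read as a 16-bit number divide by 31.
    return (cmf & 0x0F) == 8 and (cmf >> 4) <= 7 and ((cmf << 8) | flg) % 31 == 0


def maybe_inflate(data):
    """Decompress data if it looks like zlib, else return it unchanged."""
    if looks_like_zlib(data):
        try:
            return zlib.decompress(data)
        except zlib.error:
            return data  # header matched by chance or stream is truncated
    return data
```

Note that the inflated result is most likely raw pixel samples, not a ready-made image file; writing a usable image would still need the width, height and color space recorded alongside the stream in the pdf.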