This project forked from kermitt2/pdfalto

PDF to XML ALTO files

License: GNU General Public License v2.0

pdfalto

pdfalto is a command line executable for parsing PDF files and producing structured XML representations of the PDF content in ALTO format.

pdfalto is a fork of pdf2xml, developed at XRCE, modified for robustness, with additional features and an enhanced output format in ALTO (including, in particular, space information, which is useful for further machine learning processing). It is based on the Xpdf library.
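
As a rough illustration of how the ALTO output, including its space information, might be consumed downstream, here is a minimal Python sketch. The ALTO fragment is hand-written for this example (it is not actual pdfalto output), and the helper names `local_name` and `line_texts` are hypothetical. It relies only on the standard ALTO elements `TextLine`, `String` (with its `CONTENT` attribute) and `SP`.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO fragment for illustration only; not actual pdfalto output.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1" WIDTH="612" HEIGHT="792">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="PDF" HPOS="72.0" VPOS="80.0" WIDTH="24.0" HEIGHT="12.0"/>
            <SP HPOS="96.0" VPOS="80.0" WIDTH="3.0"/>
            <String CONTENT="parsing" HPOS="99.0" VPOS="80.0" WIDTH="40.0" HEIGHT="12.0"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def line_texts(alto_xml):
    """Return the text of each TextLine, turning SP elements into spaces."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        parts = []
        for child in element:
            name = local_name(child.tag)
            if name == "String":
                parts.append(child.get("CONTENT", ""))
            elif name == "SP":
                parts.append(" ")
        lines.append("".join(parts))
    return lines


print(line_texts(SAMPLE_ALTO))  # ['PDF parsing']
```

Matching on the local tag name avoids hard-coding a particular ALTO namespace version, which is an assumption made for convenience here.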

The latest (non-)stable version is 0.1.

Usage

General usage is as follows:

 pdfalto [options] <PDF-file> [<xml-file>]
  -f <int>               : first page to convert
  -l <int>               : last page to convert
  -verbose               : display pdf attributes
  -noText                : do not extract textual objects
  -noImage               : do not extract Images (Bitmap and Vectorial)
  -noImageInline         : do not include images inline in the stream
  -outline               : create an outline file xml (i.e. a table of content) as additional file
  -annotation            : create an annotations file xml as additional file
  -cutPages              : cut all pages into separate files
  -blocks                : add block information within the structure
  -readingOrder          : blocks follow the reading order
  -fullFontName          : fonts names are not normalized
  -nsURI <string>        : add the specified namespace URI
  -opw <string>          : owner password (for encrypted files)
  -upw <string>          : user password (for encrypted files)
  -q                     : don't print any messages or errors
  -v                     : print version info
  -h                     : print usage information
  -help                  : print usage information
  --help                 : print usage information
  -?                     : print usage information
  --saveconf <string>    : save all command line parameters in the specified XML <file>
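
For scripted use, the options above can be assembled into a command line programmatically. The following Python sketch is only an illustration: the helper names `build_command` and `run_pdfalto` and the file names `article.pdf` and `article.xml` are hypothetical, and only flags listed in the usage text are used.

```python
import shutil
import subprocess


def build_command(pdf_path, xml_path, first_page=None, last_page=None,
                  blocks=False, reading_order=False, no_image=False):
    """Assemble a pdfalto argument list from options in the usage text."""
    command = ["pdfalto"]
    if first_page is not None:
        command += ["-f", str(first_page)]
    if last_page is not None:
        command += ["-l", str(last_page)]
    if blocks:
        command.append("-blocks")
    if reading_order:
        command.append("-readingOrder")
    if no_image:
        command.append("-noImage")
    command += [pdf_path, xml_path]
    return command


def run_pdfalto(command):
    """Run the command if pdfalto is on the PATH; return None otherwise."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=False)


# Hypothetical file names, for illustration only.
command = build_command("article.pdf", "article.xml",
                        first_page=1, last_page=5,
                        blocks=True, reading_order=True)
print(" ".join(command))
# pdfalto -f 1 -l 5 -blocks -readingOrder article.pdf article.xml
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.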

In addition to the ALTO file describing the PDF content, the following files are generated:

  • _annot.xml file containing a description of the annotations in the PDF (e.g. GOTO, external http links, ...), obtained with the -annotation option

  • _outline.xml file containing the PDF-embedded table of contents (aka outline), if any, obtained with the -outline option

  • .xml_data/ subdirectory containing the vectorial (.vec) and bitmap (.png) images embedded in the PDF; it is generated by default unless the -noImage option is present
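
A script that post-processes these files needs to know where to look for them. The Python sketch below guesses the paths from the ALTO output path. The naming scheme is an assumption inferred from the suffixes listed above (side files sharing the base name of the ALTO file, and the image directory being the ALTO file name followed by `_data`), so check it against an actual run. The helper name `expected_outputs` is hypothetical.

```python
from pathlib import Path


def expected_outputs(xml_path, annotation=False, outline=False, no_image=False):
    """Guess the files pdfalto writes next to the ALTO file.

    Assumption: side files reuse the ALTO file's base name, e.g.
    article.xml -> article_annot.xml, article_outline.xml, article.xml_data/.
    """
    xml = Path(xml_path)
    base = xml.with_suffix("")  # drop the final ".xml"
    outputs = {"alto": xml}
    if annotation:
        outputs["annotations"] = Path(f"{base}_annot.xml")
    if outline:
        outputs["outline"] = Path(f"{base}_outline.xml")
    if not no_image:
        outputs["images"] = Path(f"{xml}_data")
    return outputs


# Hypothetical output path, for illustration only.
for kind, path in expected_outputs("out/article.xml",
                                   annotation=True, outline=True).items():
    print(kind, path.name)
```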

In a future release, the embedded PDF metadata will also be generated as a separate file.

Build

Linux and MacOS

  • Install libxml2 (development headers). See http://xmlsoft.org/

  • Install libmotif-dev (development headers)

  • Xpdf 4.00 is shipped as git submodule, to download it:

git submodule update --init --recursive

  • Build pdfalto:

cd pdfalto

cmake .

make

The pdfalto executable is generated in the root directory. The build also creates a static library for xpdf-4.00 at xpdf-4.00/build/xpdf/lib/libxpdf.a, along with the other libraries in their respective subdirectories.

Windows

To be reviewed!

NOTE: this version seems to have problems with certain PDFs; we recommend using the version built with Cygwin (same process as for Linux).

If you feel like tracking down the issue, we would much appreciate it ;-)

This guide compiles pdf2xml using the native Windows libraries, with the following directory layout:

drwxr-xr-x 1 lfoppiano 197121 0 lug 28 17:41 dirent/
drwxr-xr-x 1 lfoppiano 197121 0 ago  1 10:38 libiconv-1.9.1/
drwxr-xr-x 1 lfoppiano 197121 0 lug 30 20:02 libxml2-2.7.8.win32/
drwxr-xr-x 1 lfoppiano 197121 0 ago  1 10:44 pdf2xml/ (<- pdf2xml source)
drwxr-xr-x 1 lfoppiano 197121 0 lug 28 09:06 xpdf-3.04/

  • Build xpdf using the Windows ms_make.bat script.

  • Create the static library libxpdf.lib in xpdf-XX/xpdf/ with:

lib /out:libxpdf.lib *.obj

  • Compile the zlib and png libraries, located under the /images subdirectory of the pdf2xml source:

make.bat

Future work

  • generate metadata information in a separate XML file (as ALTO schema does not support that)

  • support Unicode composition of characters

  • map special characters in secondary fonts to their expected Unicode values

  • propagate unresolved character Unicode values (free Unicode range for embedded fonts) as encoded special characters in ALTO (the so-called "placeholder" approach)

  • generalize reading order to all blocks (it is currently limited to the blocks of the first page)

  • try OCR for unresolved character Unicode values, based on their associated glyph in the embedded font

  • try OCR for unresolved character Unicode values in context, based on their occurrences in the document

  • try to optimize speed and memory

Changes

  • use the latest version of xpdf, version 4.00.

  • add cmake

  • ALTO output replaces the custom Xerox XML format

  • encode URIs (using xmlURIEscape from libxml2) in the @href attribute content to avoid blocking XML well-formedness issues. In our experiments, this problem occurs on average in 2-3 scholarly PDFs out of one thousand.

  • output coordinate attributes for the BLOCK elements when the -blocks option is selected

  • add a -readingOrder parameter, which re-orders the blocks following the reading order when the -blocks option is selected. By default in pdf2xml, the elements follow the PDF content stream (the so-called raw order). In pdftotext from xpdf, several text flow orders are available, including the raw order and the reading order. Note that, with this modification and this new option, only the blocks are re-ordered.

From our experiments, the raw order can diverge quite significantly from the visual/reading layout order in 2-4% of scholarly PDFs (e.g. the title element is introduced at the end of the page element, while visually present at the top of the page), and minor changes can be present in up to 100% of PDFs for some scientific publishers (e.g. a headnote introduced at the end of the page content). This additional mode can thus be quite useful for information/structure extraction applications exploiting pdf2xml output.

  • use the latest version of xpdf, version 3.04.
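
To make the difference between raw order and reading order concrete, here is a deliberately naive Python sketch. It is not pdfalto's actual implementation: it simply sorts blocks top-to-bottom and then left-to-right, which is only plausible for a single-column page. The `Block` class, its fields, and the coordinates are hypothetical and chosen to mimic the "title at the end of the page element" case described above.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    hpos: float  # horizontal position of the left edge
    vpos: float  # vertical position of the top edge


def naive_reading_order(blocks):
    """Sort blocks top-to-bottom, then left-to-right.

    Illustrative only: assumes a single-column layout and is not the
    ordering algorithm used by pdfalto or xpdf.
    """
    return sorted(blocks, key=lambda block: (block.vpos, block.hpos))


# Raw order as it might appear in a content stream: the title comes last.
raw_order = [
    Block("Abstract", 72.0, 110.0),
    Block("Body", 72.0, 150.0),
    Block("Title", 72.0, 60.0),
]

print([block.text for block in naive_reading_order(raw_order)])
# ['Title', 'Abstract', 'Body']
```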

Contributors

xpdf is developed by Glyph & Cog, LLC (1996-2017) and distributed under the GPL2 or GPL3 license.

pdf2xml was originally written by Hervé Déjean, Sophie Andrieu, Jean-Yves Vion-Dury and Emmanuel Giguet (XRCE) under the GPL2 license.

pdf2xml has been modified and forked by Patrice Lopez and Achraf Azhar.

The Windows version was originally built by @pboumenot and ported to Windows 7 (64-bit), then to Windows (native and Cygwin) by @lfoppiano and @flydutch.

License

Like the original pdf2xml, pdfalto is distributed under the GPL2 license.
