haighton / create_searchable_pdf

Create a searchable PDF with ALTO-XML and JP2 files.

HTML 9.37% CSS 51.15% XSLT 11.22% Python 25.55% Dockerfile 2.71%
alto-xml pdf searchable-pdf

create_searchable_pdf's Introduction

This script automates the creation of a searchable PDF file that conforms to the BKT2/BKT3 specifications for digitized historical material of the National Library of the Netherlands | KB. It can be used when the PDF delivered for a digitized object is corrupt or missing.

The script is based around hocr-pdf from hocr-tools, which can create a searchable PDF from a directory of JPEG and hOCR files.

To use hocr-pdf, the ALTO-XML files first have to be converted into hOCR files. This is done with an XSL transformation using alto2hocr.xsl in Saxon-HE 9.7.0.21J. The JP2 scans (JP2 is the format the KB uses for digitized historical material) have to be converted to JPEG. The script also needs a Metadatadump XML file, which contains information about the object that has to be put in the Document Information of the final PDF.
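The conversion steps described above could be sketched roughly as follows. This is an illustration, not the script's actual code. It assumes that each ALTO-XML file and its JP2 scan share a base name, that the Saxon jar is called saxon9he.jar, and that Pillow with JPEG 2000 support is installed. The function names and directory layout are hypothetical, and the Metadatadump step is left out because its structure is not described here.

```python
from pathlib import Path
import subprocess


def saxon_command(alto_file, hocr_file, xsl="alto2hocr.xsl", jar="saxon9he.jar"):
    """Build the Saxon-HE command that transforms one ALTO-XML file into hOCR.

    The jar name is an assumption; adjust it to the local Saxon installation.
    """
    return [
        "java", "-jar", str(jar),
        f"-s:{alto_file}",    # source document
        f"-xsl:{xsl}",        # stylesheet
        f"-o:{hocr_file}",    # output file
    ]


def hocr_pdf_command(work_dir, pdf_file):
    """Build the hocr-pdf command for a directory of matching .jpg/.hocr files."""
    return ["hocr-pdf", "--savefile", str(pdf_file), str(work_dir)]


def jp2_to_jpeg(jp2_file, jpeg_file):
    """Convert one JP2 scan to JPEG (assumes Pillow was built with OpenJPEG)."""
    from PIL import Image  # imported lazily so the command helpers work without Pillow

    with Image.open(jp2_file) as img:
        img.convert("RGB").save(jpeg_file, "JPEG")


def build_pdf(alto_dir, jp2_dir, work_dir, pdf_file):
    """Run the whole pipeline: ALTO to hOCR, JP2 to JPEG, then hocr-pdf."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    for alto_file in sorted(Path(alto_dir).glob("*.xml")):
        stem = alto_file.stem  # assumption: the JP2 scan has the same base name
        subprocess.run(
            saxon_command(alto_file, work_dir / f"{stem}.hocr"), check=True
        )
        jp2_to_jpeg(Path(jp2_dir) / f"{stem}.jp2", work_dir / f"{stem}.jpg")
    subprocess.run(hocr_pdf_command(work_dir, pdf_file), check=True)
```

Keeping the command construction in small functions makes it possible to inspect the exact Saxon and hocr-pdf invocations before running them, for example with print(saxon_command("page1.xml", "page1.hocr")).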
