pdf-annotation-exporter's Introduction

Overview

This is a PDF annotation export system that uses PDF.js and headless Chrome to extract annotations from PDF files.

The hard requirement is high-fidelity exports that (a) preserve all formatting, (b) have 100% precision, (c) have 100% recall, (d) include all annotation types, (e) include screenshots of the originals, and (f) extract the original markup as HTML or SVG that other tools can use natively.

What works:

  • I can paste poc.js into a page where PDF.js has loaded a PDF, and it exports both the highlighted text and an image of it.

  • Exporting the entire page as a PNG

  • Finding the box coordinates around the highlight.

  • Extracting the text...

  • Using a standalone PDF.js that I call directly.

  • Reading input from, and writing output to, files from the command line.
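The box-coordinate step can be illustrated with a small sketch. This is not the code from poc.js. It assumes the highlight's quad points arrive as a flat array of 8 numbers per quad in PDF user space (points, origin at the lower left), that the page is unrotated, and that its lower-left corner sits at (0, 0). The helper names `quadBoundingBoxes` and `toViewportBox` are hypothetical.

```javascript
// Sketch only: bounding boxes for a highlight, converted to pixel coordinates.
// Assumptions: flat quadPoints array (8 numbers per quad), unrotated page,
// page origin at (0, 0). Helper names are made up for this example.

// Return one axis-aligned box per quad, still in PDF user space.
function quadBoundingBoxes(quadPoints) {
  const boxes = [];
  for (let i = 0; i + 8 <= quadPoints.length; i += 8) {
    const xs = [0, 2, 4, 6].map((k) => quadPoints[i + k]);
    const ys = [1, 3, 5, 7].map((k) => quadPoints[i + k]);
    boxes.push({
      left: Math.min(...xs),
      bottom: Math.min(...ys),
      right: Math.max(...xs),
      top: Math.max(...ys),
    });
  }
  return boxes;
}

// Flip the y axis (PDF y grows upward, canvas y grows downward) and apply
// the render scale to get a rectangle usable for cropping a page image.
function toViewportBox(box, pageHeightPt, scale) {
  return {
    x: box.left * scale,
    y: (pageHeightPt - box.top) * scale,
    width: (box.right - box.left) * scale,
    height: (box.top - box.bottom) * scale,
  };
}

// One quad covering a single line of text on a US Letter page (792 pt tall),
// rendered at scale 2.
const quads = [72, 700, 300, 700, 72, 688, 300, 688];
const [box] = quadBoundingBoxes(quads);
const px = toViewportBox(box, 792, 2);
console.log(px);
```

A multi-line highlight usually has one quad per line, so the crop region for a screenshot would be the union of the returned boxes.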

What remains:

  • Package it in a Docker container on Linux, as the dependencies are fairly demanding.

  • Clean up the npm dependencies.

  • Fix alignment issues on some highlights.

  • Fix the remaining FIXMEs in the code.

  • Generated screenshots of highlights don't look as crisp as they do on screen. Numerous people have reported that the resolution of these screenshots is fixed at 96 DPI, but I still need to see how this manifests, as I don't yet understand how it affects the output.

Text extracted

It's able to properly export the text:

  "linesOfText": [
    "This paper introduces a new family of leaderless Byzan-",
    "tine fault tolerance protocols, built on a metastable mech-",
    "anism.  These protocols provide a strong probabilistic",
    "safety guarantee in the presence of Byzantine adversaries,",
    "while their concurrent nature enables them to achieve"
  ],

Image

It's able to properly export an image too:

What's next

  • Other types of annotations.

  • Annotations with user-entered text.

  • Connect it to headless Chrome so it runs from the command line.

  • Use the PDF.js API ourselves instead of relying on the demo PDF.js.

  • Compute the scale ourselves.

  • Take screenshots at 400% so the font resolution is high.
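For the last two items, the arithmetic behind the scale can be sketched as follows. The sketch assumes the default PDF unit of 1/72 inch per point and a 96-per-inch CSS pixel, and it treats "400%" as four times the 96 DPI rendering. The function names are hypothetical and are not part of PDF.js.

```javascript
// Sketch only: relate a target DPI or zoom level to a render scale.
// Assumptions: 72 PDF points per inch, 96 CSS pixels per inch, and a
// renderer whose scale of 1 maps one PDF point to one pixel.

const PDF_POINTS_PER_INCH = 72;
const CSS_PIXELS_PER_INCH = 96;

// Scale needed so that one inch of page becomes `dpi` pixels.
function scaleForDpi(dpi) {
  return dpi / PDF_POINTS_PER_INCH;
}

// Scale for a zoom percentage, where 100% means 96 pixels per inch.
function scaleForZoom(zoomPercent) {
  return ((zoomPercent / 100) * CSS_PIXELS_PER_INCH) / PDF_POINTS_PER_INCH;
}

// Pixel size of the canvas needed to hold a page at the given scale.
function canvasSize(widthPt, heightPt, scale) {
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// US Letter page (612 x 792 pt) at 400% zoom.
const scale = scaleForZoom(400);
const size = canvasSize(612, 792, scale);
console.log(scale, size);
```

Under these assumptions a 400% screenshot corresponds to 384 pixels per inch, which is one way to reason about why a fixed 96 DPI capture looks soft next to the on-screen rendering.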

What I need help with

  • Is there a reliable way to extract the HTML with all the formatting for the PDF to be embedded elsewhere in an iframe? Some type of data URL of encoded HTML would be ideal.
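The extraction question is open, but the packaging half can be sketched: once an HTML string exists, encoding it as a data URL is straightforward. The sketch below assumes Node.js (it uses `Buffer`), and `htmlToDataUrl` is a hypothetical helper name. It does not address how to get faithful HTML out of the PDF in the first place.

```javascript
// Sketch only: wrap an HTML string in a base64 data URL.
// Assumption: runs under Node.js, where Buffer is available globally.

function htmlToDataUrl(html) {
  // Encode as UTF-8 first so non-ASCII characters survive the round trip.
  const base64 = Buffer.from(html, "utf8").toString("base64");
  return `data:text/html;charset=utf-8;base64,${base64}`;
}

// Example fragment standing in for exported highlight markup.
const fragment = "<p>These protocols provide a strong probabilistic safety guarantee.</p>";
const dataUrl = htmlToDataUrl(fragment);
console.log(dataUrl.slice(0, 40));
```

The resulting string could be used as the `src` of an iframe. Any fonts, images, or stylesheets the fragment depends on would need to be inlined as well, since a data URL has no base location to resolve relative paths against.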

Example:

Highlight of a normal paragraph

Highlight using mathematical notation
