pdf-association / deriving-html-from-pdf

Repository for members of the Deriving-HTML-from-PDF TWG under the PDF Association to track work on the algorithm for deriving HTML from well-tagged PDF 2.0.

Home Page: https://pdfa.org/community/deriving-html-from-pdf-twg/

License: Creative Commons Attribution 4.0 International

pdf algorithm html html-css-javascript reuse

deriving-html-from-pdf's People

Contributors

petervwyatt, romantoda

Forkers

arunkumarpanda

deriving-html-from-pdf's Issues

Link vs Reference

Reference - used for intra-document targets (defined via a structure destination or without a link annotation)
Link - a link annotation pointing outside of the document

Definition of Page

The goal is to preserve the pagination information in the derived HTML.

  • Look into what EPUB is coming up with
  • Another suggestion is to include "fake" elements with additional properties indicating a "page break"

use of AF - suggestion in Derivation contradicts example in 2.0

Associated files are processed based on the AFRelationship value.
Alternative - the content of the AF serves as a replacement. The substructure is ignored (not output).
Supplement - the AF is output and then the substructure is processed.

However, ISO 32000-2 is not aligned with the above statement and suggests using a MathML associated file with Supplement.
In contrast, PDF 2.0 AP Note 002 suggests using Alternative for MathML equations.
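
To make the two behaviours concrete, here is a minimal sketch of the rule as described in this issue. The `StructElem` and `AssociatedFile` classes and the `derive` function are hypothetical names made up for illustration; they are not taken from the derivation specification, and the real algorithm emits HTML elements rather than concatenated strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AssociatedFile:
    relationship: str  # value of AFRelationship, e.g. "Alternative" or "Supplement"
    content: str       # already-decoded payload, e.g. MathML markup


@dataclass
class StructElem:
    tag: str
    text: str = ""
    af: Optional[AssociatedFile] = None
    kids: List["StructElem"] = field(default_factory=list)


def derive(elem: StructElem) -> str:
    """Return the derived output for elem, honouring AFRelationship."""
    if elem.af is not None and elem.af.relationship == "Alternative":
        # The AF replaces the content; the substructure is not output.
        return elem.af.content
    parts = []
    if elem.af is not None and elem.af.relationship == "Supplement":
        # The AF is output first, then the substructure is processed.
        parts.append(elem.af.content)
    parts.append(elem.text)
    parts.extend(derive(kid) for kid in elem.kids)
    return "".join(parts)
```

With a Formula element whose only child carries the text `x`, an Alternative MathML file yields just the MathML, while Supplement yields the MathML followed by `x`. That duplication is the practical consequence of the mismatch raised here.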

Lang in catalog

4.2.4 Body suggests to derive Lang from Catalog dictionary to

Suggestion: Duplicate the language info into lang attribute

security/encryption implications

Encrypted files may contain permissions that do not allow text extraction or printing. The derived HTML does not carry this information, so an HTML viewer cannot decide whether such a function is allowed.

High level description of form derivation variations

At PDF week, I think Roman described three variations of derived HTML. If we were to give them names they could be: "HTML Equivalent", "Storage Format", "Logical Model". I'll try describing them here -- along with some commentary.

HTML Equivalent

Translate to the nearest HTML equivalent tags and attributes. Create <input>, <textarea>, <select> etc instead of custom components. Use HTML properties instead of data-* properties where possible.

For the most part, I think HTML-equivalent is difficult -- for reasons previously discussed. It is also the format that is the most effort for derivation to produce.

Storage Format

Field information would be described using data attributes mimicking the same properties as found in the source PDF. For example, field flags are sent as a single data-pdf-ff property.

The advantage to this style is that derivation is very easy. The disadvantage is that we need to do the storage-to-logical-model processing in JavaScript.
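
As a rough illustration of that trade-off, the sketch below writes the raw Ff integer into data-pdf-ff and then decodes it into named properties. The bit values for ReadOnly, Required and NoExport follow the PDF field-flag definitions; the function names and the shape of the decoded result are assumptions made for this example. It is written in Python for brevity, although the issue anticipates this step running in JavaScript on the client.

```python
# Common field flags from the Ff entry of a PDF field dictionary.
FIELD_FLAGS = {
    "readOnly": 1 << 0,  # bit position 1
    "required": 1 << 1,  # bit position 2
    "noExport": 1 << 2,  # bit position 3
}


def storage_attributes(ff: int) -> dict:
    """Derivation side: copy the flags through unchanged (hypothetical helper)."""
    return {"data-pdf-ff": str(ff)}


def decode_flags(attrs: dict) -> dict:
    """Client side: the storage-to-logical-model step (hypothetical helper)."""
    ff = int(attrs.get("data-pdf-ff", "0"))
    return {name: bool(ff & mask) for name, mask in FIELD_FLAGS.items()}
```

Derivation is a one-line copy, while every consumer has to repeat the decoding in `decode_flags`.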

Logical model

Translate to properties that are aligned with the logical model defined in Forms.Next.
The logical model objects/properties are very similar to the Acroforms Field Object and field properties.
Producing HTML that maps to the logical model means supporting data-* attributes for each of the roughly 50 field properties described in section 3.10 of the draft Forms.Next specification.

Some properties are dictionaries, and we have a choice to either send the entire dictionary as a JSON blob, or we can split the dictionary into individual properties.
e.g., the events dictionary could be one property: data-pdf-events, or could be multiple properties: data-pdf-events-change, data-pdf-events-click etc. The same applies to the constraints, formats and custom dictionaries.

Of course, we could go the other direction and encode the entire set of logical model properties as a JSON blob under a single data-pdf-field-properties attribute.
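
The two encodings for a dictionary-valued property can be compared with a small sketch. The attribute names follow the data-pdf-events examples in this issue; the helper functions and the sample event values are hypothetical.

```python
import json
from html import escape


def as_blob(name: str, dictionary: dict) -> dict:
    """One attribute holding the whole dictionary as JSON."""
    return {f"data-pdf-{name}": json.dumps(dictionary, sort_keys=True)}


def as_split(name: str, dictionary: dict) -> dict:
    """One attribute per dictionary key."""
    return {f"data-pdf-{name}-{key}": str(value) for key, value in dictionary.items()}


def render(attrs: dict) -> str:
    """Serialise attributes for an HTML start tag, escaping the values."""
    return " ".join(f'{k}="{escape(v, quote=True)}"' for k, v in attrs.items())


events = {"change": "onChange()", "click": "onClick()"}  # made-up sample values
print(render(as_blob("events", events)))
print(render(as_split("events", events)))
```

The blob form needs its quotes escaped inside the attribute value and a JSON parse on the client. The split form stays readable in markup, but only works cleanly for flat dictionaries whose values are simple strings.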

The advantage of the logical model is that it maps directly from what a host form processor has in memory and directly to what a responsive mode form processor will support, i.e., if field properties are encoded in any other manner, they have to be translated to the logical model by the responsive mode client.

Lbl structure elements in lists

According to section 4.3.7.6, a Layout:Placement attribute with value Block derives to a CSS property display:block.

However, in numerous example documents there are Lbl structure elements in lists that have this attribute with the value Block. Although the Lbl elements derive to an inline HTML element, the rule above turns them back into blocks again. I don't know if this is a problem with the authoring tool of the PDFs, but perhaps we should consider ignoring any Layout:Placement attributes for Lbl structure elements inside lists.
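
A minimal sketch of the proposed exception, assuming the only rule in play is the Block to display:block mapping quoted above. The function name, its parameters and the `ignore_lbl_in_lists` switch are hypothetical; the check relies on Lbl appearing as a child of LI in a list structure.

```python
from typing import Dict, Optional


def placement_css(tag: str, parent_tag: Optional[str], placement: Optional[str],
                  ignore_lbl_in_lists: bool = True) -> Dict[str, str]:
    """Map a Layout:Placement attribute to CSS, optionally skipping list labels."""
    if placement != "Block":
        # Other Placement values are out of scope for this sketch.
        return {}
    if ignore_lbl_in_lists and tag == "Lbl" and parent_tag == "LI":
        # Proposed exception: keep the label inline inside list items.
        return {}
    return {"display": "block"}
```

With the switch on, `placement_css("Lbl", "LI", "Block")` returns no CSS, while a Lbl outside a list, or any other element, still gets display:block.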

Actions

According to UA/2 (8.8): all destinations within the current document shall be structure destinations.
Other actions associated with a Link annotation should point outside of the document.
Actions can also be associated with a widget annotation through AA.

TOC

UA/2 redefines the use of TOC.
We need a more complex example showing how the TOC would be derived both into and outside of the main HTML flow.

FENote

The FENote definition has changed in 32005. The Note structure element shall not be used, and FENote can now be inline as well as block.
We need to clarify the use of FENote.
As an example: for a table footnote, I guess the … would be used

We derive Sub to , Span allows FENote as child, Sub doesn’t

Clarify Block and Inline usage of SE

Structure elements that can serve as both BLSE and ILSE may behave differently based on context. We are not clear about the situations in which a BLSE might be used as an ILSE.

Consider allowing stylesheets to be embedded with <link> as well as @import

Currently Associated Files that are text/css must be embedded in the final HTML using an @import statement.

Could we consider allowing these to be embedded with <link> as well? The end result is the same but <link> has two advantages:

  1. <link> can accept a title attribute which can be mapped to the description of the associated file, if specified. This could give stylesheets some context.
  2. <link> can take a charset attribute, which is deprecated, yes, but if the derivation code is saving the stylesheets to local storage and referencing them via a file:// URL, then it's the only way to set the encoding for a stylesheet that needs it - for example, if the original stylesheet has no BOM, no @charset and was served with Content-Type: text/css;charset=Shift-JIS. In that case, a stylesheet embedded with @import would need to be re-encoded to UTF-8 first.

I realise these are pretty weak arguments, but I'm raising them because the last one is a bit of an obscure edge case and might not have been considered when @import was chosen and not <link>. Perhaps allowing both might allow more flexibility?
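
For comparison, here is a small sketch of what a derivation tool might emit in each case. The function names and parameters are hypothetical, and the href is assumed to be an already valid URL without quotes or backslashes.

```python
from html import escape
from typing import Optional


def import_rule(href: str) -> str:
    """Current approach: reference the stylesheet with @import."""
    return f'<style>@import url("{href}");</style>'


def link_element(href: str, description: Optional[str] = None,
                 charset: Optional[str] = None) -> str:
    """Proposed alternative: <link>, with optional title and charset."""
    attrs = [("rel", "stylesheet"), ("href", href)]
    if description:
        attrs.append(("title", description))  # from the AF description, if any
    if charset:
        attrs.append(("charset", charset))    # deprecated attribute
    rendered = " ".join(f'{n}="{escape(v, quote=True)}"' for n, v in attrs)
    return f"<link {rendered}>"


print(import_rule("styles/main.css"))
print(link_element("styles/main.css", "Main styles", "Shift-JIS"))
```

One caveat worth checking before relying on the first advantage: in HTML, a title on a stylesheet link also names a style sheet set, so several links with different titles may not all be applied by default.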

fixing/updating Forms

It is not clearly defined which information is used in which situation. We might have information from

  • structure element and attributes
  • widget annotations
  • form field

Make clear what overrides what, and whether those data can be merged.

Html attributes (lang, id etc..) can overwrite properties from SE

Section 4.3.7.1 defines the order of processing attributes based on owner. For example, you may have both a list attribute and an HTML attribute (HTML owner) that map to the same CSS property. In that case the order guarantees that the property coming from the latter takes precedence. It's not clear what happens with structure element properties like lang, id etc.
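
The precedence rule for CSS properties can be sketched as a simple ordered merge. The owner order used in the example (List before HTML) and the property values are assumptions chosen to mirror the example above, not the order given in the specification.

```python
from typing import Dict, List, Tuple


def merge_css(by_owner: List[Tuple[str, Dict[str, str]]]) -> Dict[str, str]:
    """Merge CSS properties in processing order; later owners win."""
    merged: Dict[str, str] = {}
    for _owner, props in by_owner:
        merged.update(props)
    return merged


css = merge_css([
    ("List", {"list-style-type": "decimal"}),
    ("HTML", {"list-style-type": "lower-roman", "color": "red"}),
])
print(css)
```

The open question in this issue is whether structure element properties such as lang and id take part in the same merge, and if so at which position in the order.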

H vs. Hn in the same document

The current standard doesn't allow mixing H and Hn. However, it is expected that one document may contain two Sects, one with an H1, H2 structure and a second Sect with just H. Usually such a document is created by merging two tagged documents.

Ref entry

UA/2 is more prescriptive about the use of the Ref entry.

Let's come up with a suggestion for how the existence of a Ref entry could be expressed in HTML.

Document & DocumentFragment

Document containing multiple Documents - clarify what goes into head and what doesn't, and how they interact
DocumentFragment - more examples, allow more freedom
