pdf-association / deriving-html-from-pdf

Repository for members of the Deriving-HTML-from-PDF TWG under the PDF Association to track work on the algorithm for deriving HTML from well-tagged PDF 2.0.

Home Page: https://pdfa.org/community/deriving-html-from-pdf-twg/

License: Creative Commons Attribution 4.0 International

pdf algorithm html html-css-javascript reuse

deriving-html-from-pdf's Issues

Non interactive forms (PrintField attribute)

Form structure elements with the PrintField attribute are not handled correctly.

Link vs Reference

Reference - used for intra-document targets (defined via a structure destination or without a link annotation)
Link - a link annotation pointing outside of the document

address actions based on Named destinations

Definition of Page

The goal is to preserve pagination information in the derived HTML.

Look into what EPUB is coming up with
another suggestion is to include "fake" or
with additional properties indicating "page break"

use of AF - suggestions in Derivation contradict the example in 2.0

Associated files are processed based on the AFRelationship value.
Alternative - the content of the AF serves as a replacement. The substructure is ignored (not output).
Supplement - the AF is output and then the substructure is processed.

ISO 32000-2, however, is not aligned with the statement above and suggests using a MathML associated file with Supplement.
PDF 2.0 AP Note 002, on the contrary, suggests using Alternative for MathML equations.
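
The two behaviours described above can be sketched as follows. This is only an illustration of the stated rule, not the derivation algorithm itself: the StructElem and AssociatedFile classes and the derive function are hypothetical names, and the associated file content is assumed to be already decoded text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AssociatedFile:
    relationship: str  # value of AFRelationship, e.g. "Alternative" or "Supplement"
    content: str       # assumed: payload of the embedded file, already decoded


@dataclass
class StructElem:
    tag: str
    text: str = ""
    af: Optional[AssociatedFile] = None
    kids: List["StructElem"] = field(default_factory=list)


def derive(elem: StructElem) -> str:
    """Return the derived output for elem (hypothetical, simplified)."""
    af = elem.af
    if af is not None and af.relationship == "Alternative":
        # Alternative: the AF replaces the element, the substructure is not output.
        return af.content
    parts = []
    if af is not None and af.relationship == "Supplement":
        # Supplement: the AF is output first, then the substructure is processed.
        parts.append(af.content)
    parts.append(elem.text)
    parts.extend(derive(kid) for kid in elem.kids)
    return "".join(parts)
```

With a MathML associated file, the choice between the two values decides whether the tagged content of the formula appears in the output in addition to the MathML, which is why the disagreement between the two documents matters.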

Studio

Lang in catalog

4.2.4 Body suggests to derive Lang from Catalog dictionary to

Suggestion: Duplicate the language info into lang attribute

security/encryption implications

Encrypted files may contain permissions that don't allow text extraction or printing. Derived HTML doesn't carry this information, so an HTML viewer can't decide whether such a function is allowed or not.

High level description of form derivation variations

At PDF week, I think Roman described three variations of derived HTML. If we were to give them names they could be: "HTML Equivalent", "Storage Format", "Logical Model". I'll try describing them here -- along with some commentary.

HTML Equivalent

Translate to the nearest HTML equivalent tags and attributes. Create <input>, <textarea>, <select> etc instead of custom components. Use HTML properties instead of data-* properties where possible.

For the most part, I think HTML-equivalent is difficult -- for reasons previously discussed. It is also the format that is the most effort for derivation to produce.

Storage Format

Field information would be described using data attributes mimicking the same properties as found in the source PDF. For example, field flags are sent as a single data-pdf-ff property.

The advantage to this style is that derivation is very easy. The disadvantage is that we need to do the storage-to-logical-model processing in JavaScript.
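
As a rough illustration of the storage-to-logical-model step, the sketch below unpacks a data-pdf-ff value into named booleans. It is written in Python for brevity although the text above has JavaScript in mind. The bit positions follow the field flags of ISO 32000 (1-based); the function name, the dictionary, and the output property names are assumptions, and only a handful of flags are shown.

```python
# 1-based bit positions of a few Ff flags. ReadOnly, Required and NoExport
# apply to all fields; Multiline and Password apply to text fields only.
FIELD_FLAGS = {
    "readOnly": 1,
    "required": 2,
    "noExport": 3,
    "multiline": 13,
    "password": 14,
}


def decode_field_flags(ff: int) -> dict:
    """Expand the integer from data-pdf-ff into one boolean per known flag."""
    return {name: bool(ff & (1 << (bit - 1))) for name, bit in FIELD_FLAGS.items()}


# The attribute value arrives as a string and has to be parsed first.
flags = decode_field_flags(int("4098"))  # 4098 = bit 13 (4096) + bit 2 (2)
```

Here flags["required"] and flags["multiline"] are true and the other entries are false. A real client would also need the field type to interpret the type-specific bits.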

Logical model

Translate to properties that are aligned with the logical model defined in Forms.Next.
The logical model objects/properties are very similar to the Acroforms Field Object and field properties.
Producing HTML that maps to the logical model means data-* attributes support for each of the roughly 50 field properties described in section 3.10 of the draft Forms.Next specification.

Some properties are dictionaries, and we have a choice to either send the entire dictionary as a JSON blob, or we can split the dictionary into individual properties.
e.g., the events dictionary could be one property: data-pdf-events or could be multiple properties: data-pdf-events-change, data-pdf-events-click etc. Applies also to the constraints, formats and custom dictionaries.

Of course, we could go the other direction and encode the entire set of logical model properties as a JSON blob under a single data-pdf-field-properties attribute.
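
The encoding options can be compared side by side with a small sketch. The helper names and the sample event values are hypothetical; only the attribute names data-pdf-events, data-pdf-events-change and data-pdf-events-click come from the discussion above.

```python
import json
from html import escape


def as_blob(name: str, value: dict) -> dict:
    """One attribute carrying the whole dictionary as JSON."""
    return {f"data-pdf-{name}": json.dumps(value, sort_keys=True)}


def as_split(name: str, value: dict) -> dict:
    """One attribute per dictionary entry."""
    return {f"data-pdf-{name}-{key}": str(v) for key, v in value.items()}


def render_attrs(attrs: dict) -> str:
    """Serialize attributes, escaping quotes so JSON survives inside HTML."""
    return " ".join(f'{k}="{escape(v, quote=True)}"' for k, v in attrs.items())


events = {"change": "validate()", "click": "submit()"}  # sample values, assumed
blob_markup = render_attrs(as_blob("events", events))
split_markup = render_attrs(as_split("events", events))
```

The JSON form needs its quotes escaped in the attribute value and parsed again on the client, while the split form is readable directly from the element's dataset but multiplies the number of attributes.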

The advantage of the logical model is that it maps directly from what a host form processor has in memory and directly to what a responsive mode form processor will support. If field properties are encoded in any other manner, they have to be translated to the logical model by the responsive mode client.

Unify Caption derivation

Lbl structure elements in lists

According to section 4.3.7.6, a Layout:Placement attribute with value Block derives to a CSS property display:block.

However in numerous example documents there are Lbl structure elements in lists that have this attribute with the value Block. Despite the Lbl elements deriving to HTML element, the rule above turns them back into blocks again. I don't know if this is a problem with the authoring tool of the PDFs, but perhaps we should consider ignoring any Layout:Placement attributes for Lbl structure elements inside lists.
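
A minimal sketch of the proposed exception, assuming that a Lbl in a list is a direct child of LI. The function name and the flag are hypothetical, and only the Block value mentioned above is mapped.

```python
def css_for_placement(struct_type: str, parent_type: str, placement: str,
                      ignore_lbl_in_lists: bool = True) -> dict:
    """Return the CSS derived from a Layout:Placement attribute (sketch)."""
    if ignore_lbl_in_lists and struct_type == "Lbl" and parent_type == "LI":
        # Proposed exception: drop Placement for list labels.
        return {}
    if placement == "Block":
        # Rule from section 4.3.7.6 as quoted above.
        return {"display": "block"}
    return {}
```

With the flag off, a Lbl with Placement Block inside LI still derives display:block, which is the behaviour reported in the example documents.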

align use of namespaces with latest 2.0 proposals

we don’t have too many namespace samples, combination of standard SE and tags from namespaces, combination of attributes content

Actions

According to UA/2 (8.8), all destinations within the current document shall be structure destinations.
Other actions associated with a Link annotation should point outside of the document.
Actions can also be associated with a widget annotation through AA.

Attributes

Discussion: https://docs.google.com/document/d/1l448dxkW35oaF9ctzMLWYjW-_8moPd76Mz5-cX_ylOc/edit#heading=h.fsev8oc0rux3

FENote

FENote definition has changed in 32005. Note structure element shall not be used, FENote could now be inline as well as block (either

)
we need to clarify the use of FENote
as an example: For table footnote, I guess the … would be used

We derive Sub to , Span allows FENote as child, Sub doesn’t

Clarify Block and Inline usage of SE

Structure elements that can serve as both BLSE and ILSE may behave differently depending on context. It is not clear in which situations a BLSE might be used as an ILSE.

Consider allowing stylesheets to be embedded with <link> as well as @import

Currently Associated Files that are text/css must be embedded in the final HTML using an @import statement.

Could we consider allowing these to be embedded with <link> as well? The end result is the same but <link> has two advantages:

<link> can accept a title attribute which can be mapped to the description of the associated file, if specified. This could give stylesheets some context.
<link> can take a charset attribute. It is deprecated, yes, but if the derivation code is saving the stylesheets to local storage and referencing them via a file:// URL, then it's the only way to set the encoding for a stylesheet that needs it - for example, if the original stylesheet has no BOM, no @charset and was served with Content-Type: text/css;charset=Shift-JIS. In that case, with @import the stylesheet would need to be re-encoded to UTF-8 before embedding.

I realise these are pretty weak arguments, but I'm raising them because the last one is a bit of an obscure edge case and might not have been considered when @import was chosen and not <link>. Perhaps allowing both might allow more flexibility?
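
To make the two forms concrete, here is a small sketch that emits either one for a stylesheet associated file. The function names, the sample path and the sample description are assumptions, and the URL is assumed to contain no quote characters.

```python
from html import escape
from typing import Optional


def import_rule(url: str) -> str:
    """The current form: an @import rule placed inside a <style> element."""
    return f'@import url("{url}");'


def link_element(url: str, title: Optional[str] = None,
                 charset: Optional[str] = None) -> str:
    """The proposed form: a <link> element with optional title and charset."""
    attrs = [("rel", "stylesheet"), ("href", url)]
    if title is not None:
        attrs.append(("title", title))      # e.g. the AF description
    if charset is not None:
        attrs.append(("charset", charset))  # deprecated attribute, see above
    rendered = " ".join(f'{k}="{escape(v, quote=True)}"' for k, v in attrs)
    return f"<link {rendered}>"


current = import_rule("styles/a.css")
proposed = link_element("styles/a.css", title="Print layout", charset="Shift_JIS")
```

The @import form has nowhere to carry the description or the encoding, which is the gap the two arguments above point at.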

fixing/updating Forms

It is not clearly defined what information is used in which situation. We might have information from

structure element and attributes
widget annotations
form field

Make clear what overrides what, and whether we can merge those data

Html attributes (lang, id etc..) can overwrite properties from SE

Section 4.3.7.1 defines the order of processing attributes based on owner. For example, you may have both a List attribute and an HTML attribute (HTML owner) that map to the same CSS property. In that case, the order guarantees that the property coming from the latter takes precedence. It's not clear what happens with structure element properties like lang, id, etc.
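
The precedence rule for owners can be pictured as a merge in processing order where later owners win. This is only a sketch: the function name, the owner labels and the sample properties are assumptions, and it deliberately says nothing about lang or id, which is the open question.

```python
from typing import Dict, List, Tuple


def merge_css(attribute_sets: List[Tuple[str, Dict[str, str]]]) -> Dict[str, str]:
    """Merge CSS properties derived per owner, in processing order.

    A property set by a later owner overwrites the same property
    set by an earlier one.
    """
    merged: Dict[str, str] = {}
    for _owner, props in attribute_sets:
        merged.update(props)
    return merged


result = merge_css([
    ("List", {"list-style-type": "disc", "margin-left": "1em"}),
    ("HTML", {"list-style-type": "none"}),
])
```

In this sample the HTML owner's list-style-type replaces the one from the List owner, while margin-left is kept because no later owner sets it.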