pdf-association / deriving-html-from-pdf


Repository for members of the Deriving-HTML-from-PDF TWG under the PDF Association to track work on the algorithm for deriving HTML from well-tagged PDF 2.0.

Home Page: https://pdfa.org/community/deriving-html-from-pdf-twg/

License: Creative Commons Attribution 4.0 International

algorithm html html-css-javascript pdf reuse

deriving-html-from-pdf's Issues

Mixing Layout attributes and embedded CSS stylesheets

The spec currently requires that PDF Layout attributes are converted to CSS style attributes. From 4.3.7.3:

For each attribute derived to a CSS property, the processor shall create a CSS declaration
using the dictionary key as the property and the value of the key (converted into a string
using common methods) as the property value.
A style attribute for the HTML element shall be created and all CSS declarations in the
current PDF structure element shall be concatenated into a string, delimited by
semicolons as necessary, and the string shall be used as the value of the style attribute.
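
As a rough illustration of the rule quoted above, the concatenation step could look like the following Python sketch. The function name `build_style_attribute` and its input dictionary are hypothetical (the spec does not define them), and the conversion "using common methods" is reduced here to a plain `str()` call.

```python
def build_style_attribute(css_declarations):
    """Join {property: value} pairs into the value of an HTML style attribute.

    css_declarations is a hypothetical dict holding the CSS properties already
    derived from one PDF structure element, in processing order.
    """
    # One "property: value" declaration per entry, delimited by semicolons.
    return "; ".join(
        f"{prop}: {str(value)}" for prop, value in css_declarations.items()
    )


style = build_style_attribute({"text-align": "justify", "text-indent": "12pt"})
print(f'<p style="{style}">')
```

Because the result lands in a `style` attribute, it is exactly this string that ends up overriding any embedded stylesheet.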

The problem here is that

  1. this mapping is lossy - there is not a direct correlation between Layout attributes and CSS, so some concepts cannot be accurately represented (e.g. the CSS property text-indent allows "hanging" indent, but the PDF Layout property TextIndent does not)
  2. setting a "style" attribute on the element is the highest priority way to style an element; it will override any styles read from an embedded stylesheet.

It's also going to be very difficult for any tool deriving HTML from PDF to know when this is going to happen, as it would involve evaluating all the styles to see which apply.

Adding a note like...

A processor may optionally choose to ignore structure attributes if another source for this information is available, such as attributes with a CSS attribute owner, or an embedded stylesheet.

...might give processors a bit more flexibility when it comes to this situation.

Edit: the note above doesn't really need the "attributes with a CSS owner" as that's already covered by the sequence of owners in 4.3.7.

HTML attribute owner really needs to be HTML/MathML/SVG attribute owner

The spec says that we process attributes from the following ordered list of owners:

  • List attribute owner
  • Table attribute owner
  • Layout attribute owner
  • PrintField attribute owner
  • HTML attribute owner
  • CSS attribute owner
  • ARIA attribute owner

But there's no MathML or SVG in there - both are first-class citizens in HTML(*), and in pdf-association/pdf-issues#286 it's been clarified that attributes on a MathML element should have a MathML attribute owner. I'm presuming the same conclusion would apply for SVG.

I think that "HTML attribute owner" probably needs to be "MathML attribute owner if the element is in the MathML namespace, SVG attribute owner if the element is in the SVG namespace, otherwise the HTML attribute owner"
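
A minimal sketch of that selection rule, in Python. The function name and the returned labels are illustrative only; the actual owner names to look up in the PDF are not settled in this issue.

```python
HTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
SVG_NS = "http://www.w3.org/2000/svg"


def markup_attribute_owner(element_namespace):
    """Pick which attribute owner to read for the element being generated.

    The returned strings are descriptive labels (an assumption), not the
    owner names as they would appear in a PDF attribute dictionary.
    """
    if element_namespace == MATHML_NS:
        return "MathML attribute owner"
    if element_namespace == SVG_NS:
        return "SVG attribute owner"
    # Everything else, including unknown namespaces, falls back to HTML.
    return "HTML attribute owner"


print(markup_attribute_owner(MATHML_NS))
```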


(*) The gory details are: when reading or writing an HTML document as text/html, each element name is checked to see if it's on the list of defined SVG and MathML elements, and the namespace is set so, or defaults to HTML otherwise. So <body><math><p> in HTML is parsed as if it were this in XHTML

<body xmlns="http://www.w3.org/1999/xhtml">
<math xmlns="http://www.w3.org/1998/Math/MathML">
<p xmlns="http://www.w3.org/1999/xhtml">

Of course when reading or writing the same document as application/xhtml+xml, the namespaces are written as normal for XML.

Link vs Reference

Reference - used for intra-document targets (defined via structure destination or without link annotation)
Link - link annotation pointing outside of the document

TOC

UA/2 redefines the use of TOC.
We need a more complex example of how the TOC would be derived, both inside and outside of the main HTML flow.

Actions

According to UA/2 (8.8): all destinations within the current document shall be structure destinations
Other actions associated with Link annotation should point outside of the document
Actions can also be associated with widget annotation through AA

Clarify exact behaviour of associated files, alternative and supplement

The text on associated files in the 1.1 spec has caused a bit of a debate here while implementing

  • If the value of the AFRelationship key in the associated file’s file specification dictionary is Alternative then the associated file serves as a replacement and all children of the structure element shall be ignored.

  • If the value of the AFRelationship key in the associated file’s file specification dictionary is Supplement then the associated file serves as a supplemental and after processing the associated file the processor shall continue with processing children of the structure element.

In both cases it's not clear if the "parent element" - the one that had the associated file attached to it - is replaced or not. This is mostly due to the paragraphs that follow this section implying that it is replaced.

It could definitely do with clarifying, and as we've been able to make arguments here for replacing or not-replacing, I have a suggestion:

  • If the value of AFRelationship is Alternative, the entire element and its children are replaced with the contents of the file.
  • If the value of AFRelationship is Supplement, the contents of the file are inserted as the first child of the parent, after which the children are processed as normal.

This way the author can choose: if they want to replace an element including the parent, use alternative. If they want to keep the parent, use supplement and don't put any children in the source document.

The spec also says:

  • Multiple associated files shall be processed in the order in which they are stored in the array of the AF key.

That makes sense for multiple supplements - they're all inserted, in order. But if we have multiple alternatives, do we pick the first one that is in a supported format? That's how multipart/alternative is defined in the MIME spec, so when I see the word "alternative" that's what I presume. Is this correct? What if we have a mix of supplements and alternatives? Do we ignore the supplements, do we ignore the alternatives, or some other option?

Finally, over the next few paragraphs the spec gives specific details for how to process certain types of embedded file. These are problematic as they contradict what's come previously - for example:

  • For HTML and MathML we have the text "If the associated file is an Embedded File then the contents of the associated file’s embedded file stream shall be added directly to the output HTML stream, taking the place of the structure element that would normally have been generated." But what if it's AFRelationship=supplement? Do we still "take the place" of the element?

I'd strongly suggest this paragraph is deleted, for both MathML and HTML. The process is already adequately described above, and what's here contradicts the processing steps for AFRelationship=Supplement.

I also don't see any value at all in the recommendation for SVG that it generates an <img> with the source referring to the SVG. Inline SVG in HTML is a much better option than using an <img> with a separate file or (worse) a data: URL. Delete this paragraph entirely and inline SVG is no different to HTML or MathML, as it should be. Or, alternatively, we generalise the previous paragraph describing the processing for images to all image/* types, which includes SVG - if SVG is to be embedded inline, we simply mark it as text/html


That's quite a long issue so I'll sum up. My suggestions are that we specify:

  1. AFRelationship=Alternative means the parent element is replaced with the file, while AFRelationship=Supplement means the file is inserted as the first child of the parent element.
  2. If there is at least one file with AFRelationship=Alternative in a supported format, any other files are ignored.
  3. The paragraphs describing special handling of various types in the last half of section 4.6.4 are revisited and considered for deletion, unless the behaviour for a particular type is markedly different from the general rules at the start of 4.6.4 (as it is for image types, for example)
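
Suggestions 1 and 2 could be sketched in Python as follows. This is only an illustration of the proposal, not of the current spec text: the `AssociatedFile` class, the `SUPPORTED_TYPES` set and the choice to skip supplements in unsupported formats are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AssociatedFile:
    relationship: str  # value of AFRelationship: "Alternative" or "Supplement"
    media_type: str
    content: str       # decoded contents of the embedded file stream


# Assumption: the formats this hypothetical processor can inline.
SUPPORTED_TYPES = {"text/html", "application/mathml+xml", "image/svg+xml"}


def derive_element(tag, children_html, associated_files):
    """Apply the proposed rules to one structure element.

    associated_files must be in the order of the AF array.
    """
    # Suggestion 2: the first Alternative in a supported format wins and
    # replaces the element and its children; all other files are ignored.
    for af in associated_files:
        if af.relationship == "Alternative" and af.media_type in SUPPORTED_TYPES:
            return af.content
    # Suggestion 1: supplements become the first children, in AF order.
    supplements = "".join(
        af.content
        for af in associated_files
        if af.relationship == "Supplement" and af.media_type in SUPPORTED_TYPES
    )
    return f"<{tag}>{supplements}{children_html}</{tag}>"


files = [
    AssociatedFile("Supplement", "text/html", "<b>note</b>"),
    AssociatedFile("Alternative", "application/mathml+xml", "<math><mi>x</mi></math>"),
]
print(derive_element("p", "x", files))
```

With the mixed list above the Alternative wins and the Supplement is dropped, which is one possible answer to the "mix of supplements and alternatives" question raised earlier.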

Consider allowing stylesheets to be embedded with <link> as well as @import

Currently Associated Files that are text/css must be embedded in the final HTML using an @import statement.

Could we consider allowing these to be embedded with <link> as well? The end result is the same but <link> has two advantages:

  1. <link> can accept a title attribute which can be mapped to the description of the associated file, if specified. This could give stylesheets some context.
  2. <link> can take a charset attribute, which is deprecated, yes, but if the derivation code is saving the stylesheets to local storage and referencing them via a file:// URL, then it's the only way to set the encoding for a stylesheet that needs it - for example, if the original stylesheet has no BOM, no @charset and was served with Content-Type: text/css;charset=Shift-JIS. If this were the case, then if using @import the stylesheet would need to be re-encoded to UTF-8 before embedding.

I realise these are pretty weak arguments, but I'm raising them because the last one is a bit of an obscure edge case and might not have been considered when @import was chosen and not <link>. Perhaps allowing both might allow more flexibility?
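
To make the two options concrete, here is a small Python sketch of what a processor might emit for each. The function names, the example URL and the description text are hypothetical, and the `@import` variant assumes the URL contains no double quotes.

```python
from html import escape


def stylesheet_as_import(url):
    """Current behaviour: reference the stylesheet with @import."""
    return f'<style>@import url("{url}");</style>'


def stylesheet_as_link(url, description=None):
    """Proposed alternative: a <link>, with the associated file's
    description (if any) mapped to the title attribute."""
    title = f' title="{escape(description, quote=True)}"' if description else ""
    return f'<link rel="stylesheet" href="{escape(url, quote=True)}"{title}>'


print(stylesheet_as_import("styles/af1.css"))
print(stylesheet_as_link("styles/af1.css", "Corporate theme"))
```

One thing to weigh if the title mapping is adopted: in HTML a title on a stylesheet link also names a style sheet set, so it is not purely descriptive.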

Ref entry

UA/2 is more prescriptive about the use of the Ref entry.

Let's come up with a suggestion for how the existence of a Ref entry could be expressed in HTML.

H vs. Hn in the same document

The current standard doesn't allow mixing H and Hn. However, it is expected that one document may contain two Sects, one with an H1, H2 structure and a second with just H. Such a document is usually created by merging two tagged documents.

High level description of form derivation variations

At PDF week, I think Roman described three variations of derived HTML. If we were to give them names they could be: "HTML Equivalent", "Storage Format", "Logical Model". I'll try describing them here -- along with some commentary.

HTML Equivalent

Translate to the nearest HTML equivalent tags and attributes. Create <input>, <textarea>, <select> etc instead of custom components. Use HTML properties instead of data-* properties where possible.

For the most part, I think HTML-equivalent is difficult -- for reasons previously discussed. It is also the format that is the most effort for derivation to produce.

Storage Format

Field information would be described using data attributes mimicking the same properties as found in the source PDF. For example, field flags are sent as a single data-pdf-ff property

The advantage to this style is that derivation is very easy. The disadvantage is that we need to do the storage-to-logical-model processing in JavaScript.

Logical model

Translate to properties that are aligned with the logical model defined in Forms.Next.
The logical model objects/properties are very similar to the Acroforms Field Object and field properties.
Producing HTML that maps to the logical model means data-* attributes support for each of the roughly 50 field properties described in section 3.10 of the draft Forms.Next specification.

Some properties are dictionaries, and we have a choice to either send the entire dictionary as a JSON blob, or we can split the dictionary into individual properties.
e.g., the events dictionary could be one property: data-pdf-events or could be multiple properties: data-pdf-events-change, data-pdf-events-click etc. Applies also to the constraints, formats and custom dictionaries.

Of course, we could go the other direction and encode the entire set of logical model properties as a JSON blob under a single data-pdf-field-properties attribute.
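
The two encodings of a dictionary-valued property can be compared with a short Python sketch. The helper names and the sample `events` dictionary are made up for illustration; only the `data-pdf-events` and `data-pdf-events-change` style of attribute naming comes from the text above.

```python
import json


def as_split_attributes(name, dictionary):
    """One data-pdf-<name>-<key> attribute per dictionary entry."""
    return {f"data-pdf-{name}-{key}": str(value) for key, value in dictionary.items()}


def as_json_attribute(name, dictionary):
    """The whole dictionary as a JSON blob in a single attribute."""
    return {f"data-pdf-{name}": json.dumps(dictionary, sort_keys=True)}


# Hypothetical events dictionary for a single field.
events = {"change": "validate()", "click": "submit()"}

print(as_split_attributes("events", events))
print(as_json_attribute("events", events))
```

The split form is easier to read and select on in CSS or the DOM; the JSON form round-trips nested values without inventing a naming scheme for them.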

The advantage of the logical model is that it maps directly from what a host form processor has in memory and directly to what a responsive mode form processor will support, i.e., if field properties are encoded in any other manner, they have to be translated to the logical model by the responsive mode client.

Mixing namespaces

It's not clear how to deal with mixing namespaces (MathML and labels), PDF and HTML etc.

FENote

The FENote definition has changed in ISO 32005: the Note structure element shall not be used, and FENote can now be inline as well as block (either <div> or <p>).
We need to clarify the use of FENote.
As an example, for a table footnote I guess <TR> <TD colspan=??> <FENote> … </FENote> </TD> </TR> would be used.

We derive Sub to <span>; Span allows FENote as a child, Sub doesn’t.

Lang in catalog

4.2.4 Body suggests to derive Lang from Catalog dictionary to

Suggestion: Duplicate the language info into lang attribute

better explain Code

The consensus is to allow the inclusion of Sub in Code; the Sub elements will contain the code lines.
If this is addressed in 32005, we can add additional text.

Definition of Page

The goal is to preserve the information about pagination in the derived HTML.

  • Look into what EPUB is coming up with
  • another suggestion is to include "fake" or
    with additional properties indicating "page break"

Lbl structure elements in lists

According to section 4.3.7.6, a Layout:Placement attribute with value Block derives to a CSS property display:block.

However in numerous example documents there are Lbl structure elements in lists that have this attribute with the value Block. Despite the Lbl elements deriving to HTML element, the rule above turns them back into blocks again. I don't know if this is a problem with the authoring tool of the PDFs, but perhaps we should consider ignoring any Layout:Placement attributes for Lbl structure elements inside lists.

Html attributes (lang, id etc..) can overwrite properties from SE

Section 4.3.7.1 defines the order in which attributes are processed, based on owner. For example, an element may have both a List attribute and an HTML attribute (HTML owner) that map to the same CSS property; in that case the order guarantees that the property coming from the latter takes precedence. It's not clear what happens with structure element properties like lang, id etc.
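
The owner-order precedence can be pictured as a simple ordered merge. The following Python sketch uses the owner list from 4.3.7 as quoted in an earlier issue; the function name, the dictionary shape and the sample values are hypothetical.

```python
# Processing order of attribute owners; later owners take precedence.
OWNER_ORDER = ["List", "Table", "Layout", "PrintField", "HTML", "CSS", "ARIA"]


def merge_by_owner(derived):
    """Merge per-owner results into one mapping.

    derived is a hypothetical {owner: {target_name: value}} dict holding what
    each owner's attributes would set on the generated element.
    """
    result = {}
    for owner in OWNER_ORDER:
        # dict.update overwrites earlier values, so the last owner wins.
        result.update(derived.get(owner, {}))
    return result


merged = merge_by_owner({
    "HTML": {"list-style-type": "square"},
    "List": {"list-style-type": "decimal"},
})
print(merged)
```

Structure element entries such as Lang or ID have no owner, so they have no position in this list, which is precisely the gap this issue describes.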

Use of AF - suggestions in Derivation contradict the examples in PDF 2.0

Associated files are processed based on the AFRelationship value:
Alternative - the content of the AF serves as a replacement. The substructure is ignored (not output).
Supplement - the AF is output and then the substructure is processed.

However, ISO 32000-2 is not aligned with the above and suggests using a MathML associated file with Supplement.
In contrast, PDF 2.0 AP Note 002 suggests using Alternative for MathML equations.

Proposal: use cascade-layers

The Problem

Elements in the generated HTML can get their styles from a number of sources

  1. "Layout" attributes from the ClassMap in the PDF, which are translated in the same way to a CSS class selector in the document stylesheet
  2. "Layout" attributes on the StructElem in the PDF, eg Layout:Padding translates to a CSS padding property which is set directly on the element using style
  3. "CSS" attributes, which, like Layout attributes above, can be specified directly on the StructElem or in a ClassMap
  4. Stylesheets embedded in the PDF
  5. An HTML "style" attribute set on a StructElem

Listed in increasing order of priority (although 3 and 4 could be swapped; it's not clear). So we have several problems:

  • the priority doesn't match CSS priorities. For example, attributes set by an embedded stylesheet (item 4) would be overridden by attributes set by a Layout attribute (item 1). I referred to that in #33.
  • ClassMap can be used to set non-CSS attributes as well - for example, it's perfectly valid to set the Table:ColSpan attribute using a PDF attribute class. This requires a processing step not in the spec.
  • If an element belongs to two classes and they both set different values for the same attribute, the order the classes are defined on the Element determines which wins. In CSS it's the order the classes are defined in the stylesheet, which is a problem if one element is a member of class1 and class2, and another element is a member of class2 and class1.

Proposed Solution: CSS Cascade Layers

Cascade Layers were added to CSS in 2022, and were part of the 2022 interoperability project so are supported in all browsers. They give finer control over priorities and were designed to solve the problem of mixing style rules from different sources. Google for details, spec is https://drafts.csswg.org/css-cascade-5/#layering. The shortest example I can give:

<style>
@layer b, a;                 /* Lowest priority first */
@layer a { 
    div { color: green }     /* So this rule wins... */
}
@layer b {
    #mydiv { color: red }    /* ... whereas with no layers, this rule would win */
}
</style>
<div id="mydiv">This is green</div>

I'm proposing we use cascade layers for deriving HTML, because it will:

  • fix the priority issues listed above
  • give information on the origin of each CSS property written to the HTML, which would be useful for authors
  • allow a very quick way for anyone editing the HTML to reprioritise a category of attribute - for example, prefer embedded stylesheets over CSS attributes, or the other way around.
  • in general, move away from the "style" attribute on the generated elements, which is a pain when it comes to editing the generated HTML.

Suggested process

The most basic would be something like this:

  • Don't generate CSS classes for attributes specified via the C key in the structure element dictionary. Instead, those attributes are merged with the attributes specified directly on the StructElem, as defined in the PDF spec, to generate a final list of attributes for each element. Then for each attribute, if it is:
    • A Layout attribute that creates a CSS property, eg <</O/Layout/TextAlign/Justify>>? Assign that CSS property the layer "pdf"
    • An attribute with a CSS owner, eg <</O/CSS-3.00/text-align(justify)>>? Assign that CSS property the layer "css"
    • Non-CSS attributes are processed as normal, i.e. set directly on the generated element if appropriate.
  • If any CSS properties were assigned to a layer in the previous step, generate a new, unique classname (eg "auto-1") and add it to the list of classes on the generated element, then generate a rule in the appropriate layer's stylesheet, eg .auto-1 { text-align:justify }

Finally, for any embedded stylesheets, make sure they're embedded as part of an "embedded" layer - easily done by wrapping @layer embedded { ... } around the entire stylesheet (except any initial @import, @namespace or @charset rules).

Here's a (contrived, extreme) example - say we have a PDF with an embedded stylesheet #mypara { color: green} that contains this StructElem:

1 0 obj
<< 
  /Type /StructElem
  /ID (mypara)
  /S /P
  /A [
    << /O /CSS-3.00 /color (blue) >>
    << /O /Layout /Color [1 0 0] >>
    <</O /HTML-5.00 /style (font-weight: bold)>>
  ]
>>
endobj

We would turn this into the following HTML snippet

<style>
 @layer pdf, css, embedded;  /* lowest priority first */
 @layer pdf {
   .auto-1 { color: #ff0000; }
 }
 @layer css {
   .auto-1 { color: blue; }
 }
</style>
<style>
 @layer embedded {
   #mypara { color: green }
 }
</style>
<p id="mypara" class="auto-1" style="font-weight: bold">green bold text</p>

The source of all the styles is clear; the order can be changed by adjusting the initial @layer, we don't have to rewrite the "style" attribute and we no longer have any concerns about priorities. That's the basic idea, but we can go further - multiple elements could share a class, each owner could get its own name to distinguish "HTML-4.01" from "HTML-5", etc etc. Disabling a layer completely is also fairly easy now all the styles from a single source are grouped together.
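
For implementers, the suggested process could be sketched in Python roughly as below. Everything here is illustrative: `derive_layered_styles`, the per-element dictionaries and the `auto-N` class naming are assumptions, the attributes are taken as already merged and converted to CSS properties, and the sketch does not handle the `@import`, `@namespace` or `@charset` rules that must stay outside the `embedded` layer.

```python
from itertools import count

LAYER_ORDER = ["pdf", "css", "embedded"]  # lowest priority first


def derive_layered_styles(elements, embedded_css=""):
    """Return (stylesheet_text, class_names) for a list of elements.

    Each element is a dict with optional "pdf" and "css" keys mapping to
    {css_property: value}, i.e. properties from Layout and CSS owners.
    """
    counter = count(1)
    rules = {"pdf": [], "css": []}
    class_names = []
    for element in elements:
        layers = {name: element.get(name, {}) for name in rules}
        if not any(layers.values()):
            class_names.append(None)  # nothing to style, no class needed
            continue
        cls = f"auto-{next(counter)}"
        class_names.append(cls)
        for name, props in layers.items():
            if props:
                body = " ".join(f"{p}: {v};" for p, v in props.items())
                rules[name].append(f".{cls} {{ {body} }}")
    parts = [f"@layer {', '.join(LAYER_ORDER)};"]
    for name in ("pdf", "css"):
        if rules[name]:
            parts.append(f"@layer {name} {{\n  " + "\n  ".join(rules[name]) + "\n}")
    if embedded_css:
        parts.append(f"@layer embedded {{\n  {embedded_css}\n}}")
    return "\n".join(parts), class_names


css, classes = derive_layered_styles(
    [{"pdf": {"color": "#ff0000"}, "css": {"color": "blue"}}],
    embedded_css="#mypara { color: green }",
)
print(css)
```

Reprioritising a whole category of styles is then a one-line change to `LAYER_ORDER`.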

Only negative I can see is that current classes in the ClassMap are translated to HTML classes in the stylesheet, and in this proposal we would lose that. Although it feels like keeping these is a nice thing to do, I really want to stress that the inheritance model for PDF classes differs from CSS classes. Whether we use cascade layers or not, if we're going to rely on them to set styles then the differences really need to be properly considered.

Here are three attachments to show the difference in real files:

Document & DocumentFragment

A Document containing multiple Document elements: clarify what goes into head and what doesn’t, and how they interact.
DocumentFragment: more examples, allow more freedom.

Clarify Block and Inline usage of SE

Structure elements that can serve as both BLSE and ILSE may behave differently based on context. It is not clear in which situations a BLSE might be used as an ILSE.

Lbl within a LI requirements are too rigid

As specified in the Deriving-HTML-from-PDF spec 1.0, 4.3.5.3.1:

When we have a Lbl within an LI, we must set the enclosing <ul> or <ol> element to have list-style-type:none and then use the content of the Lbl to represent the list bullet. But this usually doesn't give great results.

  • In many cases the semantic meaning of the list can be better represented using the list's natural style - for example, an unordered list where the Lbl only contains bullet characters should be a simple <ul>, and an ordered list with Lbls starting at 1 and incrementing should be a simple <ol>.
  • The indenting behaviour of list bullets is lost when removing them and replacing them with a <span> containing the bullet
  • For lists with custom Lbls that don't match a normal HTML list numbering scheme, or that are formatted slightly differently, it's possible to use the @counter-style rule to define a custom list type (see https://www.w3.org/TR/css-counter-styles-3/). It's fully supported in all browsers as of today.

I think it would be useful to allow for a bit of latitude in the spec here - allowing for lists to be represented either as described now, where the first <span> child of the list represents the label, or alternatively making use of HTML list numbering if it can accurately represent the labels in the list.
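
One way a processor might decide whether HTML's own list numbering "can accurately represent the labels" is sketched below in Python. The `BULLETS` set, the accepted "1." label format and the function name are assumptions for the example; anything else returns `None`, meaning the processor would fall back to the current Lbl-as-content behaviour or to a custom `@counter-style`.

```python
import re

# Assumption: label strings this sketch treats as plain bullets.
BULLETS = {"•", "◦", "▪", "-", "*"}


def natural_list_element(labels):
    """Return "ul", "ol" or None for the Lbl texts of one list."""
    stripped = [label.strip() for label in labels]
    # All labels are the same bullet character: a plain unordered list.
    if stripped and len(set(stripped)) == 1 and stripped[0] in BULLETS:
        return "ul"
    numbers = []
    for text in stripped:
        match = re.fullmatch(r"(\d+)\.", text)
        if match is None:
            return None
        numbers.append(int(match.group(1)))
    # Labels "1.", "2.", "3.", ... match the default <ol> rendering.
    if numbers and numbers == list(range(1, len(numbers) + 1)):
        return "ol"
    return None


print(natural_list_element(["1.", "2.", "3."]))
```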

Figure with Link

The usual tagging for Image with link would be

<Figure>
    <Link>
        Content

In the case of an inline image, the current text isn't clear about whether the link becomes the parent of the image.

fixing/updating Forms

It is not clearly defined which information is used in which situation. We might have information from

  • structure element and attributes
  • widget annotations
  • form field

Make clear what overrides what, and whether we can merge this data.

complex TOC and List

We address a list within a list by requiring a new <li> element. However, we specify neither the properties (list-style-type:none) nor what to do with the content.

<TOC>
  <Caption>
  <TOC> 
    <Caption>
    <TOCI>
    <TOCI>

Or what if TOC contains "Title" ?

Associated Files would be better if they could be inserted anywhere in the document

An idea we had while implementing. In the Derivation Spec, the contents of associated files found on structure elements are copied to the <head> of the HTML document being generated. This is very useful for embedding styles - but it could be even more useful if we specified that the content of the file should be embedded inside the current element.

An example. Let's say I have a PDF generated from an SVG like this:

<svg xmlns="http://www.w3.org/2000/svg" width="600" height="180" viewBox="0 0 600 180">
  <defs>
    <linearGradient id="gradientDefault" gradientUnits="objectBoundingBox">
      <stop offset="0" stop-color="red"/>
      <stop offset="1" stop-color="blue"/>
    </linearGradient>
  </defs>
  <rect x="50" y="10" width="500" height="40" fill="url(#gradientDefault)"/>
</svg>

The PDF will contain a /Figure structure element representing the <svg>, containing a rectangle with a linear gradient. But converting that back from PDF to SVG is going to be hard, because it involves reconstructing the SVG <linearGradient>.

If we could attach an associated file to the /Figure with /AFRelationship /Supplement containing the SVG linear gradient definition, then the derivation algorithm could specify that the content of the file is inserted at that point in the document - inside the <svg>. For even more flexibility, /AFRelationship /Alternative could be used to replace the tag and its content in the generated output with the content of the file.

This usage would also support the "MathML as attachment" idea - the MathML could be attached at the right point in the document to augment (or replace) the inline tags in the PDF.

Finally, if we specified that any attachments on the /StructTreeRoot continue to be added to the <head>, that continues to support the current usage of adding stylesheets etc. to the document head.

When to use label

Currently we map PDF <Lbl> to HTML <label> only when it's inside a <Form>.

Why don't we do this all the time? I realise that in HTML it is normally used for labelling form elements, but there's nothing that I can see that says it can't be used elsewhere: here's the definition, it says:

The label element represents a caption in a user interface. The caption can be associated with a specific form control, known as the label element's labeled control, either using the for attribute, or by putting the form control inside the label element itself.

So semantically, <label> is the perfect fit for <Lbl>, and it "can be" used for labelling form elements - seems pretty flexible to me.

In terms of position in the HTML hierarchy, it can be used anywhere you might use a <span> and the only restriction on its descendants is that it can't contain another <label>, which is fine as <Lbl> can't contain an <Lbl> in ISO 32005.

fix TextPosition attributes

The TextPosition attribute can change the HTML tag, but this behavior is unwanted in certain scenarios (on a Link structure element).

security/encryption implications

Encrypted files may contain permissions that don't allow text extraction or printing. The derived HTML doesn't carry this information, so an HTML viewer can't decide whether such functions are allowed.
