The Problem
Elements in the generated HTML can get their styles from a number of sources:
1. "Layout" attributes from the ClassMap in the PDF, which are translated in the same way to a CSS class selector in the document stylesheet
2. "Layout" attributes on the StructElem in the PDF, eg Layout:Padding translates to a CSS padding property which is set directly on the element using style
3. "CSS" attributes, which, like Layout attributes above, can be specified directly on the StructElem or in a ClassMap
4. Stylesheets embedded in the PDF
5. An HTML "style" attribute set on a StructElem
These are listed in increasing order of priority (although 3 and 4 could be swapped; it's not clear). So we have several problems:
- the priority doesn't match CSS priorities. For example, attributes set by an embedded stylesheet (item 4) would be overridden by attributes set by a Layout attribute (item 1). I referred to that in #33.
- ClassMap can be used to set non-CSS attributes as well - for example, it's perfectly valid to set the Table:ColSpan attribute using a PDF attribute class. This requires a processing step not in the spec.
- If an element belongs to two classes and they both set different values for the same attribute, the order the classes are defined on the Element determines which wins. In CSS it's the order the classes are defined in the stylesheet, which is a problem if one element is a member of class1 and class2, and another element is a member of class2 and class1.
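The class-order mismatch in the last point can be sketched with a toy model. This is only an illustration: the function names, the class names and the "last class listed on the element wins" rule for PDF are assumptions made for the example, not taken from either spec.

```python
def resolve_pdf(element_classes, class_map, attribute):
    """PDF-style (assumed): walk the classes in the order they are listed
    on the element; the last one that sets the attribute wins."""
    value = None
    for name in element_classes:
        if attribute in class_map[name]:
            value = class_map[name][attribute]
    return value


def resolve_css(element_classes, stylesheet, prop):
    """CSS-style: for selectors of equal specificity, the rule that appears
    last in the stylesheet wins, whatever order the classes have on the element."""
    value = None
    for selector, declarations in stylesheet:
        if selector in element_classes and prop in declarations:
            value = declarations[prop]
    return value


# Hypothetical ClassMap and the stylesheet a naive translation would produce.
class_map = {
    "class1": {"TextAlign": "Start"},
    "class2": {"TextAlign": "Justify"},
}
stylesheet = [
    ("class1", {"text-align": "start"}),
    ("class2", {"text-align": "justify"}),
]

first = ["class1", "class2"]   # element A
second = ["class2", "class1"]  # element B, same classes in the opposite order

pdf_results = (resolve_pdf(first, class_map, "TextAlign"),
               resolve_pdf(second, class_map, "TextAlign"))
css_results = (resolve_css(first, stylesheet, "text-align"),
               resolve_css(second, stylesheet, "text-align"))
```

In this model the PDF side gives the two elements different values, while the CSS side gives both the same value, so no single stylesheet ordering can reproduce both elements.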
Proposed Solution: CSS Cascade Layers
Cascade Layers were added to CSS in 2022 and were part of the Interop 2022 project, so they are supported in all major browsers. They give finer control over priorities and were designed to solve the problem of mixing style rules from different sources. Google for details; the spec is https://drafts.csswg.org/css-cascade-5/#layering. The shortest example I can give:
<style>
@layer b, a; /* Lowest priority first */
@layer a {
div { color: green } /* So this rule wins... */
}
@layer b {
#mydiv { color: red } /* ... whereas with no layers, this rule would win */
}
</style>
<div id="mydiv">This is green</div>
I'm proposing we use cascade layers for deriving HTML, because it will:
- fix the priority issues listed above
- give information on the origin of each CSS property written to the HTML, which would be useful for authors
- allow a very quick way for anyone editing the HTML to reprioritise a category of attribute - for example, prefer embedded stylesheets over CSS attributes, or the other way around.
- in general, move away from the "style" attribute on the generated elements, which is a pain when it comes to editing the generated HTML.
Suggested process
The most basic would be something like this:
- Don't generate CSS classes for attributes specified via the C key in the structure element dictionary. Instead, those attributes are merged with the attributes specified directly on the StructElem, as defined in the PDF spec, to generate a final list of attributes for each element. Then for each attribute, if it is:
- A Layout attribute that creates a CSS property, eg <</O/Layout/TextAlign/Justify>>: assign that CSS property to the layer "pdf"
- An attribute with a CSS owner, eg <</O/CSS-3.00/text-align(justify)>>: assign that CSS property to the layer "css"
- Non-CSS attributes are processed as normal, i.e. set directly on the generated element if appropriate.
- If any CSS properties were assigned to a layer in the previous step, generate a new, unique classname (eg "auto-1") and add it to the list of classes on the generated element, then generate a rule in the appropriate layer's stylesheet, eg
.auto-1 { text-align:justify }
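A minimal sketch of the steps above, assuming the ClassMap and StructElem attributes have already been merged into a list of (owner, attributes) pairs. The names `derive_styles`, `layer_for` and `LAYOUT_TO_CSS` are hypothetical, and the Layout-to-CSS table covers only the one attribute used here.

```python
from itertools import count

# Hypothetical conversion table from Layout attributes to CSS declarations.
LAYOUT_TO_CSS = {
    "TextAlign": lambda value: ("text-align", value.lower()),
}


def layer_for(owner):
    """Map an attribute owner to a layer name, or None for non-CSS owners."""
    if owner == "Layout":
        return "pdf"
    if owner.startswith("CSS-"):
        return "css"
    return None


def derive_styles(attribute_objects, counter):
    """Return (class_name, {layer: rule_text}) for one element's merged attributes."""
    by_layer = {}
    for owner, attrs in attribute_objects:
        layer = layer_for(owner)
        if layer is None:
            continue  # non-CSS attributes are processed elsewhere, as normal
        for name, value in attrs.items():
            if layer == "pdf":
                convert = LAYOUT_TO_CSS.get(name)
                if convert is None:
                    continue  # Layout attribute with no CSS equivalent in this sketch
                prop, css_value = convert(value)
            else:
                prop, css_value = name, value
            by_layer.setdefault(layer, {})[prop] = css_value
    if not by_layer:
        return None, {}
    class_name = f"auto-{next(counter)}"
    rules = {
        layer: f".{class_name} {{ "
        + "; ".join(f"{p}: {v}" for p, v in props.items())
        + " }"
        for layer, props in by_layer.items()
    }
    return class_name, rules


counter = count(1)
name, rules = derive_styles(
    [("Layout", {"TextAlign": "Justify"}), ("CSS-3.00", {"text-align": "justify"})],
    counter,
)
```

Each entry in `rules` would then be written inside the matching `@layer` block, and `name` added to the element's class list.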
Finally, for any embedded stylesheets, make sure they're embedded as part of an "embedded" layer - easily done by wrapping @layer embedded { ... } around the entire stylesheet (except any initial @import, @namespace or @charset rules).
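The wrapping step could look roughly like this. It is a sketch, not a CSS parser: the function name `wrap_in_layer` is made up, and it assumes the leading @charset, @import and @namespace rules each end at the first semicolon, with no comments or semicolons inside them.

```python
import re

# Matches one leading @charset / @import / @namespace statement up to its ";".
LEADING_RULE = re.compile(r"\s*@(?:charset|import|namespace)\b[^;]*;", re.IGNORECASE)


def wrap_in_layer(css, layer="embedded"):
    """Wrap a stylesheet in @layer, leaving any initial
    @charset/@import/@namespace statements outside the block."""
    pos = 0
    while True:
        match = LEADING_RULE.match(css, pos)
        if match is None:
            break
        pos = match.end()
    head, body = css[:pos].strip(), css[pos:].strip()
    wrapped = f"@layer {layer} {{\n{body}\n}}"
    return f"{head}\n{wrapped}" if head else wrapped


simple = wrap_in_layer("#mypara { color: green }")
with_import = wrap_in_layer(
    '@charset "utf-8";\n@import url("base.css");\n#mypara { color: green }'
)
```

Note that in this sketch anything pulled in by a leading @import stays outside the layer; whether imported rules should also be layered is a separate decision.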
Here's a (contrived, extreme) example - say we have a PDF with an embedded stylesheet #mypara { color: green } that contains this StructElem:
1 0 obj
<<
/Type /StructElem
/ID (mypara)
/S /P
/A [
<< /O /CSS-3.00 /color (blue) >>
<< /O /Layout /Color [1 0 0] >>
<</O /HTML-5.00 /style (font-weight: bold)>>
]
>>
endobj
We would turn this into the following HTML snippet:
<style>
@layer pdf, css, embedded; /* lowest priority first */
@layer pdf {
.auto-1 { color: #ff0000; }
}
@layer css {
.auto-1 { color: blue; }
}
</style>
<style>
@layer embedded {
#mypara { color: green }
}
</style>
<p id="mypara" class="auto-1" style="font-weight: bold">green bold text</p>
The source of all the styles is clear, the order can be changed by adjusting the initial @layer statement, we don't have to rewrite the "style" attribute, and we no longer have any concerns about priorities. That's the basic idea, but we can go further - multiple elements could share a class, each owner could get its own layer name to distinguish "HTML-4.01" from "HTML-5", etc. Disabling a layer completely is also fairly easy now that all the styles from a single source are grouped together.
The only negative I can see is that classes in the ClassMap are currently translated to CSS classes in the stylesheet, and in this proposal we would lose that. Although it feels like keeping these is a nice thing to do, I really want to stress that the inheritance model for PDF classes differs from that of CSS classes. Whether we use cascade layers or not, if we're going to rely on them to set styles then the differences really need to be properly considered.
Here are three attachments to show the difference in real files: