Describing offsets (fathom, CLOSED)

yeslogic commented on April 28, 2024
Describing offsets

from fathom.

Comments (12)

mikeday commented on April 28, 2024

Actually, deref is a problem: it potentially allows a struct to dereference the offset to itself and access its own fields before they have been parsed. In practice this could lead to an infinite loop in the parser, or would require detecting such loops at runtime.

While some formats may contain self-referential structs, most will not, and it would be preferable to exclude them at compile time wherever possible.

If the deref function is unavailable then no pointers can be followed until parsing is complete. However, this makes it impossible for the parser to check conditions that cross pointer boundaries.

Perhaps instead of being so free with pointer values it would be better to explicitly denote the relationships between different structs, to make it easier to graph the network and detect cycles. Then we could allow deref in expressions, but require that possibly cyclic references are explicitly annotated as such.

mikeday commented on April 28, 2024

Consider a new "link" type which, like label, also matches the empty sequence, but is parameterised by a base address, a relative offset, and the type of the value being linked to. For example:

struct A {
    start: label
    offset: u16
    refB: link(start, offset, B)
}

struct B { ... }

Now it is obvious that A has a link to B, and if B can also have a link to A (or if B contains A!) then it can be established that the possibility for cyclic references exists, in which case deref must be checked.
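A minimal sketch of that check, assuming the link and containment relationships have already been collected into a plain dictionary from type name to the type names it refers to. The function name `may_be_cyclic` and the dictionary layout are hypothetical and not part of fathom:

```python
# Hypothetical model: each struct type maps to the set of types it links to
# or contains. None of these names come from fathom itself.
def may_be_cyclic(links, start):
    """Return True if `start` can reach itself by following links."""
    stack = list(links.get(start, ()))
    seen = set()
    while stack:
        node = stack.pop()
        if node == start:
            return True          # found a path leading back to `start`
        if node in seen:
            continue
        seen.add(node)
        stack.extend(links.get(node, ()))
    return False


acyclic = {"A": {"B"}, "B": set()}   # A links to B, B links to nothing
cyclic = {"A": {"B"}, "B": {"A"}}    # B also links (or contains) A
```

Only types for which this returns True would need a checked deref; the rest could be dereferenced without a runtime check.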

mikeday commented on April 28, 2024

Some notable uses of offset values in OpenType:

The OpenType OffsetTable:

tag: u32,
checksum: u32,
offset: u32,
length: u32,

Here the tag value gives the type we expect to find at the offset. We would also like to take the length field into account, but it occurs after the offset, suggesting that the link must be created after this field (unless expressions can make safe forward references).

A tricky aspect of this is that values in one table can depend on values in another, e.g. the hmtx table depends on the number_of_h_metrics field from the hhea table and the num_glyphs field from the maxp table. This suggests that the tables have to be processed in dependency order, as the hmtx table will be encountered after hhea but before maxp if read in streaming order.
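As a rough illustration of processing in dependency order, the sketch below feeds the hmtx/hhea/maxp dependencies into Python's standard `graphlib.TopologicalSorter`. The `deps` dictionary is an assumption made for this example; a real description would have to derive it from the format definition:

```python
from graphlib import TopologicalSorter

# Table name -> tables whose fields must be known before it can be parsed.
# This mapping is written by hand here purely for illustration.
deps = {
    "hhea": set(),
    "maxp": set(),
    "hmtx": {"hhea", "maxp"},
}

# static_order() yields every table after all of its dependencies.
# It raises graphlib.CycleError if the dependencies are cyclic.
order = list(TopologicalSorter(deps).static_order())
```

The relative order of hhea and maxp is unspecified, but hmtx always comes after both.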

There are many occurrences of arrays of offsets, each one pointing to a struct. Perhaps this could be handled by treating them as an array of structs, where each one contains the offset value and a phantom computed field constructing the link?

There are also many occurrences of "offset or 0", which require conditional types.

The GSUB and GPOS tables have offsets to a script list, feature list, and lookup list. The script list has indices (not offsets) into the feature list and the feature list has indices (not offsets) into the lookup list. Bounds checking these indices requires forward references / reading in dependency order as the script list cannot be validated until the feature list has been processed and so on.

CFFIndex is a CFF thing not an OpenType thing, but it's fiddly: an array of count+1 offsets representing count objects where the size of each object is determined by subtracting the adjacent offsets.

markbrown commented on April 28, 2024

> The GSUB and GPOS tables have offsets to a script list, feature list, and lookup list. The script list has indices (not offsets) into the feature list and the feature list has indices (not offsets) into the lookup list. Bounds checking these indices requires forward references / reading in dependency order as the script list cannot be validated until the feature list has been processed and so on.

The link type should allow the user to read in an appropriate order, if such an order exists. That is, if I understand correctly, you can have the offsets in one order, then the links in some other order. The parser is driven by the occurrence of the links, not the offsets.

So for GSUB, you'd specify the offsets in the order that the spec says, then the links in the order that allows you to use the size of the lookup list in the type of the feature list, etc.

If there are cyclic dependencies, see below (#94).

markbrown commented on April 28, 2024

> CFFIndex is a CFF thing not an OpenType thing, but it's fiddly: an array of count+1 offsets representing count objects where the size of each object is determined by subtracting the adjacent offsets.

Assuming use of the link type, this is an example where having the element type of the array depend on the index would be useful. You could have a count field, then an array of count+1 offsets, then an array with index i of count links, each with type link(start, offset[i], E(offset[i+1] - offset[i])). Here E is the element type, which is presumably some function of the size.
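The size-by-subtraction rule can be sketched in a few lines. This is a simplified model, not the real CFF INDEX encoding: it assumes the count+1 offsets have already been decoded into a list of integers relative to `start`, and the helper name `split_by_offsets` is made up for this example:

```python
def split_by_offsets(data, start, offsets):
    """Slice len(offsets) - 1 objects out of `data`.

    Object i spans offsets[i] .. offsets[i + 1], both relative to `start`,
    so its size is offsets[i + 1] - offsets[i].
    """
    pairs = list(zip(offsets, offsets[1:]))
    if any(end < begin for begin, end in pairs):
        raise ValueError("offsets must be non-decreasing")
    if offsets and start + offsets[-1] > len(data):
        raise ValueError("last offset points past the end of the data")
    return [data[start + begin : start + end] for begin, end in pairs]


# Two bytes of padding, then three objects of sizes 1, 2 and 3.
data = b"..abcdef"
objects = split_by_offsets(data, 2, [0, 1, 3, 6])
```

Here each slice plays the role of E(offset[i+1] - offset[i]) with E being "raw bytes of that size".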

markbrown commented on April 28, 2024

Say we want a structure that has offsets to two tables, where the type of each depends on the size of the other. Even though there is a cyclic dependency, we should be able to parse it by looking at the size fields first, then parsing the tables in either order. We can do this with link types by reusing the offset values:

struct H {
    start: label
    offsetA: u32
    offsetB: u32
    sizeA: link(start, offsetA, u32)
    sizeB: link(start, offsetB, u32)
    tableA: link(start, offsetA, Table(sizeB))
    tableB: link(start, offsetB, Table(sizeA))
}

// How to write a type function?
struct Table(othersize: int) {
    size: u32
    indices: (0..othersize-1)[size]
}

Although this requires a bit of effort from users to work out the dependencies themselves, it has the advantage of being easy to implement :)
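To make the parse order concrete, here is a hand-written sketch of what a parser for H might do: read both offsets, peek at both size fields, then parse each table using the other's size as the index bound. Big-endian u32 fields and one-byte indices are assumptions of this example, as are all the function names:

```python
import struct


def read_u32(data, pos):
    # Assumes big-endian; the thread does not specify endianness.
    return struct.unpack_from(">I", data, pos)[0]


def parse_table(data, pos, other_size):
    """Parse Table(other_size): a u32 size, then `size` one-byte indices."""
    size = read_u32(data, pos)
    indices = list(data[pos + 4 : pos + 4 + size])
    if len(indices) != size:
        raise ValueError("table is truncated")
    if any(i >= other_size for i in indices):
        raise ValueError("index out of range for the other table")
    return indices


def parse_h(data, start=0):
    offset_a = read_u32(data, start)
    offset_b = read_u32(data, start + 4)
    # Follow both links once as plain u32 to learn the sizes ...
    size_a = read_u32(data, start + offset_a)
    size_b = read_u32(data, start + offset_b)
    # ... then follow them again with the full, size-dependent types.
    table_a = parse_table(data, start + offset_a, size_b)
    table_b = parse_table(data, start + offset_b, size_a)
    return table_a, table_b


blob = (
    struct.pack(">II", 8, 14)                  # offsetA, offsetB
    + struct.pack(">I", 2) + bytes([0, 2])     # table A: 2 indices into B
    + struct.pack(">I", 3) + bytes([1, 0, 1])  # table B: 3 indices into A
)
tables = parse_h(blob)
```

Each size field is read twice, which is the price of reusing the offset values rather than threading the sizes through some other mechanism.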

markbrown commented on April 28, 2024

Even without considering cycles, you can still have multiple references to the same thing. You only want to parse it once; the second time you want to re-use the same host value you got previously. (Worst case: parsing more than once for recursive types can lead to exponential behaviour.)

It would seem the solution is to cache the parse results of links, for any type that we can't prove will not have multiple links. One way to detect cycles is to have the cache map links to option types: when you start to parse a new link you map it to None in the cache, and when you finish you map it to its final value. While parsing, if you hit a link that is cached as None, you know it must be one of your ancestors.
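A small sketch of that cache, using a sentinel object in place of the option type's None so that parsed values may themselves be anything. The names `parse_link`, `IN_PROGRESS` and `LinkCycleError` are invented for this example, and `parse_body` stands in for whatever actually parses the linked type:

```python
IN_PROGRESS = object()  # plays the role of None in the option-typed cache


class LinkCycleError(Exception):
    """Raised when a link points at something that is still being parsed."""


def parse_link(cache, key, parse_body):
    """Parse the link identified by `key` (e.g. (position, type)) at most once.

    `parse_body` is a hypothetical zero-argument callback that does the real
    parsing and may call parse_link recursively.
    """
    if key in cache:
        value = cache[key]
        if value is IN_PROGRESS:
            raise LinkCycleError(key)   # the target is one of our ancestors
        return value                    # second reference: reuse the result
    cache[key] = IN_PROGRESS
    value = parse_body()
    cache[key] = value
    return value


cache = {}
calls = []


def parse_b():
    calls.append("B")
    return {"kind": "B"}


first = parse_link(cache, (16, "B"), parse_b)
second = parse_link(cache, (16, "B"), parse_b)   # served from the cache
```

Whether a detected cycle should be an error, as here, or should hand back a reference to be filled in later is a separate design choice.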

markbrown commented on April 28, 2024

Some thoughts on different types of pointers.

Say the parser has the following additional state:

  • A map from input position and normalised type to host value or None. When we start parsing a type at a given position, we add a None entry for it.
    Link : position x type -> option(host value)
  • A map from input position and normalised type to a set of pointer locations (i.e., pointers to pointers). At the end of parsing, we fill in the pointers to host values in this map, for every position in the Link map.
    Fwd: position x type -> set of **host value
  • A set of input positions. It's an error if we ever see a link that refers to one of these positions.
    Reserved : set of position

We can think of a number of different types of link, each with its own operational semantics, which I'll give as pre- and post-conditions. These types could be supplied by the user as unchecked or checked assertions, or inferred as with the acyclic example above. There are two broad categories: passive links, which do not cause parsing to occur, and active links, which do. I'm not sure how many of these are useful in practice, but here goes:

  • passive forward link
    • pre: the position/type is not in the Link map
    • post: add location we should fill in to the Fwd map; return NULL
  • passive backward link
    • pre: Link(pos, typ) = Some v
    • post: return v
  • passive upward link
    • pre: Link(pos, typ) = None
    • post: add location to the Fwd map; return NULL
  • passive acyclic link
    • a passive backward or forward link, depending on which precondition succeeds
  • passive link
    • a passive backward, forward or upward link, depending on which precondition succeeds
  • active primary link
    • pre: the position/type is not in the Link map
    • post: Link(pos, typ) = v; return v (where v is the parsed value)
  • active unique link
    • pre: the position/type is not in either the Link or Fwd map
    • post: add position to Reserved; return parsed value
  • active shared link
    • an active primary link or a passive backward link, depending on which precondition succeeds
  • active link
    • an active primary link or a passive backward or upward link, depending on which precondition succeeds.

Assuming no errors were detected, we can safely dereference those types that always return a parsed value (passive backward, active primary, active unique, active shared), without needing a dynamic check.

The method described here avoids performing the same parsing job more than once, but it can still parse the same position more than once, if the types differ. It would be possible to enforce, perhaps only for some types, that the same position is never parsed as two different types. That might be a useful default.
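A partial sketch of this state, covering only the Link and Fwd maps with a general passive link and an active primary link; Reserved and the unique/shared variants are left out. It models "a pointer to fill in later" as a small `Ref` object, which is an assumption of this example, as are all the class and method names:

```python
class Ref:
    """Stands in for a host pointer that may be filled in after parsing."""

    def __init__(self, value=None):
        self.value = value


class LinkState:
    def __init__(self):
        self.link = {}  # (pos, typ) -> parsed value, or None while parsing
        self.fwd = {}   # (pos, typ) -> list of Refs still to be filled in

    def passive(self, key):
        """Passive link: never triggers parsing."""
        value = self.link.get(key)
        if value is not None:            # backward: already parsed
            return Ref(value)
        ref = Ref()                      # forward or upward: patch later
        self.fwd.setdefault(key, []).append(ref)
        return ref

    def active_primary(self, key, parse):
        """Active primary link: the position/type must not be in Link yet."""
        if key in self.link:
            raise ValueError("precondition failed: already in Link map")
        self.link[key] = None            # mark as in progress
        value = parse()
        self.link[key] = value
        return value

    def finish(self):
        """At the end of parsing, fill in every pending forward pointer."""
        for key, refs in self.fwd.items():
            value = self.link.get(key)
            if value is None:
                raise ValueError("passive link to something never parsed")
            for ref in refs:
                ref.value = value


state = LinkState()
early = state.passive((8, "B"))                    # forward: not parsed yet
state.active_primary((8, "B"), lambda: "parsed B")
late = state.passive((8, "B"))                     # backward: value is known
state.finish()
```

In this model only `late` could be dereferenced safely before `finish()` runs, which matches the observation above that backward links need no dynamic check.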

mikeday commented on April 28, 2024

So passive links are to things that will be parsed anyway because they are reachable in the "normal" stream, while active links require an explicit parse because they are otherwise unreachable?

markbrown commented on April 28, 2024

Yes, or they will be parsed anyway because they are reachable via some active pointer.

brendanzab commented on April 28, 2024

So, I think we have done most of this at this stage - the thing we are missing though (the hard thing) is dereferencing offsets. This would allow us to have values that depend on offsets, but could lead to all sorts of weirdness. I'm thinking we should close this in favor of a targeted issue, with some of the key ideas from this issue summarised there.

brendanzab commented on April 28, 2024

Closed by #109 and #115
