Comments (12)
Actually, `deref` is a problem, as it potentially allows a struct to dereference the offset to itself and access its own fields before they have been parsed, which in practice could lead to an infinite loop in the parser or would require runtime detection of the same.
While some formats may contain self-referential structs, most will not, and it would be preferable to exclude them at compile time wherever possible.
If the `deref` function is unavailable then no pointers can be followed until parsing is complete; however, this makes it impossible for the parser to check conditions that cross pointer boundaries.
Perhaps instead of being so free with pointer values it would be better to explicitly denote the relationships between different structs, to make it easier to graph the network and detect cycles. Then we could allow `deref` in expressions, but require that possibly cyclic references are explicitly annotated as such.
from fathom.
Consider a new "link" type which, like label, also matches the empty sequence but is parameterised by base address, relative offset, and the type of value being linked to. For example:
struct A {
    start: label
    offset: u16
    refB: link(start, offset, B)
}
struct B { ... }
Now it is obvious that A has a link to B, and if B can also have a link to A (or if B contains A!) then it can be established that the possibility for cyclic references exists, in which case `deref` must be checked.
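A minimal sketch of the compile-time check this implies, assuming the struct definitions have already been reduced to a graph of "links to or contains" edges. The function name `find_cyclic_structs` and the dictionary layout are hypothetical and not part of fathom:

```python
def find_cyclic_structs(edges):
    """Return the set of struct names that can reach themselves.

    `edges` maps a struct name to the structs it links to or contains.
    """
    def reachable(start):
        # Depth-first walk collecting everything reachable from `start`.
        seen, stack = set(), list(edges.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(edges.get(node, ()))
        return seen

    return {name for name in edges if name in reachable(name)}


# A links to B and B contains A, so both are possibly cyclic.
# C links to A but nothing leads back to C, so it is safe.
edges = {"A": ["B"], "B": ["A"], "C": ["A"]}
cyclic = find_cyclic_structs(edges)
```

Only links that touch a struct in `cyclic` would then need the explicit annotation or a runtime check.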
from fathom.
Some notable uses of offset values in OpenType:
The OpenType OffsetTable:
checksum: u32,
offset: u32,
length: u32,
Here the tag value gives the type we expect to find at the offset. We would also like to take the length field into account, but it occurs after the offset, suggesting that the link must be created after this field (unless expressions can make safe forward references).
A tricky aspect of this is that values in one table can depend on values in another, e.g. the `hmtx` table depends on the `number_of_h_metrics` field from the `hhea` table and the `num_glyphs` field from the `maxp` table. This suggests that the tables have to be processed in dependency order, as the `hmtx` table will be encountered after `hhea` but before `maxp` if read in streaming order.
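As a rough illustration of processing tables in dependency order, here is a sketch using the topological sorter from Python's standard library. The `dependencies` mapping is written by hand for this example; how a real tool would derive it from the format description is left open:

```python
from graphlib import TopologicalSorter

# Each table maps to the tables whose fields it needs before it can be parsed.
dependencies = {
    "hhea": set(),
    "maxp": set(),
    "hmtx": {"hhea", "maxp"},
}

# static_order() yields every table after all of its dependencies.
parse_order = list(TopologicalSorter(dependencies).static_order())
```

If the dependencies were cyclic, `static_order()` would raise `graphlib.CycleError`, which is one place a tool could report that no valid order exists.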
Many occurrences of arrays of offsets, each one pointing to a struct. Perhaps this could be handled by treating them as an array of structs, where each one contains the offset value and a phantom computed field constructing the link?
Many occurrences of "offset or 0", requiring conditional types.
The GSUB and GPOS tables have offsets to a script list, feature list, and lookup list. The script list has indices (not offsets) into the feature list and the feature list has indices (not offsets) into the lookup list. Bounds checking these indices requires forward references / reading in dependency order as the script list cannot be validated until the feature list has been processed and so on.
CFFIndex is a CFF thing not an OpenType thing, but it's fiddly: an array of `count+1` offsets representing `count` objects where the size of each object is determined by subtracting the adjacent offsets.
from fathom.
> The GSUB and GPOS tables have offsets to a script list, feature list, and lookup list. The script list has indices (not offsets) into the feature list and the feature list has indices (not offsets) into the lookup list. Bounds checking these indices requires forward references / reading in dependency order as the script list cannot be validated until the feature list has been processed and so on.
The `link` type should allow the user to read in an appropriate order, if such an order exists. That is, if I understand correctly, you can have the offsets in one order, then the links in some other order. The parser is driven by the occurrence of the links, not the offsets.
So for GSUB, you'd specify the offsets in the order that the spec says, then the links in the order that allows you to use the size of the lookup list in the type of the feature list, etc.
If there are cyclic dependencies, see #94 (see below).
from fathom.
> CFFIndex is a CFF thing not an OpenType thing, but it's fiddly: an array of count+1 offsets representing count objects where the size of each object is determined by subtracting the adjacent offsets.
Assuming use of the `link` type, this is an example where having the element type of the array depend on the index would be useful. You could have a `count` field, then an array of `count+1` offsets, then an array with index `i` of `count` links, each with type `link(start, offset[i], E(offset[i+1] - offset[i]))`. Here `E` is the element type, which is presumably some function of the size.
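The size arithmetic can be sketched as follows. This is a simplified illustration that treats each element as a raw byte slice and ignores details of the real CFF encoding; `split_by_offsets` is a hypothetical helper name:

```python
def split_by_offsets(data, start, offsets):
    """Slice `data` into len(offsets) - 1 objects.

    Object i spans offsets[i]..offsets[i + 1], both relative to `start`,
    so its size is offsets[i + 1] - offsets[i].
    """
    objects = []
    for lo, hi in zip(offsets, offsets[1:]):
        # Reject decreasing offsets and ranges that run past the input.
        if hi < lo or start + hi > len(data):
            raise ValueError(f"bad offset pair ({lo}, {hi})")
        objects.append(data[start + lo:start + hi])
    return objects


# count = 3 objects described by count + 1 = 4 offsets.
data = b"..abcdef"
objects = split_by_offsets(data, start=2, offsets=[0, 1, 3, 6])
```

Here `objects` holds three slices of sizes 1, 2 and 3, matching the differences between adjacent offsets.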
from fathom.
Say we want a structure that has offsets to two tables, where the type of each depends on the size of the other. Even though there is a cyclic dependency, we should be able to parse it by looking at the size fields first, then parsing the tables in either order. We can do this with `link` types by reusing the offset values:
struct H {
    start: label
    offsetA: u32
    offsetB: u32
    sizeA: link(start, offsetA, u32)
    sizeB: link(start, offsetB, u32)
    tableA: link(start, offsetA, Table(sizeB))
    tableB: link(start, offsetB, Table(sizeA))
}
// How to write a type function?
struct Table(othersize: int) {
    size: u32
    indices: (0..othersize-1)[size]
}
Although this requires a bit of effort from users to work out the dependencies themselves, it has the advantage of being easy to implement :)
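A hand-written sketch of what a parser for `H` might do, to show the two-pass order. It assumes big-endian `u32` fields and one byte per index, neither of which is stated above; the function names are made up for the example:

```python
import struct


def read_u32(data, pos):
    # Assumption: big-endian unsigned 32-bit integers.
    return struct.unpack_from(">I", data, pos)[0]


def parse_table(data, pos, other_size):
    """Parse a Table whose indices must all be below `other_size`."""
    size = read_u32(data, pos)
    indices = list(data[pos + 4:pos + 4 + size])  # assumption: one byte per index
    if len(indices) != size:
        raise ValueError("table runs past the end of the input")
    if any(i >= other_size for i in indices):
        raise ValueError("index out of range")
    return indices


def parse_h(data, start=0):
    offset_a = read_u32(data, start)
    offset_b = read_u32(data, start + 4)
    # First pass: follow both links only far enough to read the sizes.
    size_a = read_u32(data, start + offset_a)
    size_b = read_u32(data, start + offset_b)
    # Second pass: each table can now be checked against the other's size.
    table_a = parse_table(data, start + offset_a, size_b)
    table_b = parse_table(data, start + offset_b, size_a)
    return table_a, table_b


data = (
    struct.pack(">II", 8, 14)               # offsetA, offsetB
    + struct.pack(">I", 2) + bytes([0, 2])   # table A: 2 indices into B
    + struct.pack(">I", 3) + bytes([1, 0, 1])  # table B: 3 indices into A
)
table_a, table_b = parse_h(data)
```

The size field of each table is read twice, once through the `size` link and once as part of the table, which mirrors the reuse of offset values in the struct definition.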
from fathom.
Even without considering cycles, you can still have multiple references to the same thing. You only want to parse it once; the second time you want to re-use the same host value you got previously. (Worst case: parsing more than once for recursive types can lead to exponential behaviour.)
It would seem the solution is to cache the parse results of links, for any type that we can't prove will not have multiple links. One way to detect cycles is to have the cache map links to option types: when you start to parse a new link you map it to `None` in the cache, and when you finish you map it to its final value. While parsing, if you hit a link that is cached as `None`, you know it must be one of your ancestors.
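A small sketch of this caching scheme on a toy input, where each position holds a name and a list of positions it links to. The toy layout, the `CycleError` class and the choice to raise on a cycle are assumptions made for the example; the cache is keyed by position only, for brevity:

```python
class CycleError(Exception):
    pass


def parse_link(nodes, pos, cache):
    """Parse the node at `pos`, reusing cached results for shared links."""
    if pos in cache:
        if cache[pos] is None:
            # Still being parsed, so `pos` must be one of our ancestors.
            raise CycleError(f"cyclic link to position {pos}")
        return cache[pos]
    cache[pos] = None  # mark as in progress
    name, targets = nodes[pos]
    value = {"name": name,
             "children": [parse_link(nodes, t, cache) for t in targets]}
    cache[pos] = value  # record the final value
    return value


# Position 8 is linked from both 0 and 4, but is parsed only once.
nodes = {0: ("root", [4, 8]), 4: ("left", [8]), 8: ("shared", [])}
tree = parse_link(nodes, 0, {})
shared_once = tree["children"][1] is tree["children"][0]["children"][0]
```

Both links to position 8 end up pointing at the same host value, so `shared_once` is true.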
from fathom.
Some thoughts on different types of pointers.
Say the parser has the following additional state:
- A map from input position and normalised type to host value or `None`. When we start parsing a type at a given position, we add a `None` entry for it.
  `Link : position x type -> option(host value)`
- A map from input position and normalised type to a set of pointer locations (i.e., pointers to pointers). At the end of parsing, we fill in the pointers to host values in this map, for every position in the `Link` map.
  `Fwd : position x type -> set of **host value`
- A set of input positions. It's an error if we ever see a link that refers to one of these positions.
  `Reserved : set of position`
We can think of a number of different types of link, each with its own operational semantics, which I'll give as pre- and post-conditions. These types could be supplied by the user as unchecked or checked assertions, or inferred as with the acyclic example above. There are two broad categories: passive links that do not cause parsing to occur, and active links that do. Not sure how many of these are useful in practice, but here goes:
- passive forward link
  - pre: the position/type is not in the `Link` map
  - post: add location we should fill in to the `Fwd` map; return NULL
- passive backward link
  - pre: `Link(pos, typ) = Some v`
  - post: return `v`
- passive upward link
  - pre: `Link(pos, typ) = None`
  - post: add location to the `Fwd` map; return NULL
- passive acyclic link
  - a passive backward or forward link, depending on which precondition succeeds
- passive link
  - a passive backward, forward or upward link, depending on which precondition succeeds
- active primary link
  - pre: the position/type is not in the `Link` map
  - post: `Link(pos, typ) = v`; return `v` (where `v` is the parsed value)
- active unique link
  - pre: the position/type is not in either the `Link` or `Fwd` map
  - post: add position to `Reserved`; return parsed value
- active shared link
  - an active primary link or a passive backward link, depending on which precondition succeeds
- active link
  - an active primary link or a passive backward or upward link, depending on which precondition succeeds.
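To make a few of the link kinds in the list above concrete, here is a rough sketch of the `Link` and `Fwd` maps with three of the operations. `Slot` stands in for a pointer location, since Python has no pointers to pointers; the class names and the use of exceptions for failed preconditions are assumptions of this example:

```python
class Slot:
    """Stands in for a pointer location that is filled in later."""
    def __init__(self):
        self.value = None


class LinkState:
    def __init__(self):
        self.link = {}  # (pos, typ) -> parsed value, or None while in progress
        self.fwd = {}   # (pos, typ) -> Slots waiting for that value

    def passive_forward(self, pos, typ):
        # pre: the position/type is not in the Link map
        if (pos, typ) in self.link:
            raise ValueError("passive forward link to an already seen target")
        slot = Slot()
        self.fwd.setdefault((pos, typ), []).append(slot)
        return slot

    def passive_backward(self, pos, typ):
        # pre: Link(pos, typ) = Some v
        value = self.link.get((pos, typ))
        if value is None:
            raise ValueError("passive backward link to an unparsed target")
        return value

    def active_primary(self, pos, typ, parse):
        # pre: the position/type is not in the Link map
        if (pos, typ) in self.link:
            raise ValueError("active primary link to an already seen target")
        self.link[(pos, typ)] = None  # in progress
        value = parse(pos)
        self.link[(pos, typ)] = value
        return value

    def finish(self):
        # End of parsing: fill in recorded locations for every Link entry.
        for key, value in self.link.items():
            for slot in self.fwd.get(key, ()):
                slot.value = value


state = LinkState()
slot = state.passive_forward(16, "B")     # seen before B is parsed
value = state.active_primary(16, "B", lambda pos: {"parsed_at": pos})
same = state.passive_backward(16, "B")    # seen after B is parsed
state.finish()
```

After `finish()`, the forward slot and the backward link both refer to the single value produced by the active primary link.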
Assuming no errors were detected, we can safely dereference those types that always return a parsed value (passive backward, active primary, active unique, active shared), without needing a dynamic check.
The method described here avoids performing the same parsing job more than once, but it can still parse the same position more than once, if the types differ. It would be possible to enforce, perhaps only for some types, that the same position is never parsed as two different types. That might be a useful default.
from fathom.
So passive links are to things that will be parsed anyway because they are reachable in the "normal" stream, while active links require an explicit parse because they are otherwise unreachable?
from fathom.
Yes, or they will be parsed anyway because they are reachable via some active pointer.
from fathom.
So, I think we have done most of this at this stage - the thing we are missing though (the hard thing) is dereferencing offsets. This would allow us to have values that depend on offsets, but could lead to all sorts of weirdness. I'm thinking we should close this in favor of a targeted issue, with some of the key ideas from this issue summarised there.
from fathom.