
codec_generator PROSE format

The following is a guide to and specification for Prose, a custom output format. It is native to the codec_generator compiler project, meaning that it is not an established format with pre-existing documentation elsewhere; this document is intended to serve that purpose.

Background

codec_generator is a project within the broad Tezos ecosystem, whose main product is a compiler framework written in OCaml. This compiler takes, as input, a set of data-encoding schema values, corresponding to the protocol-specific types defined throughout Octez. As output, it produces codecs.

A codec, in the general case, is a source-code module that defines a particular data-type and associated serialization/deserialization methods, intended to have a similar structure to the original type, and with the same encoding and decoding semantics and nuances as the original schema. Currently, support has been implemented for outputting Rust codecs, with support for TypeScript under active development as well.

A codec may also be a semi-formal description of the structural layout of a schema type, as is the case with the Prose output target.

Motivation

At the time that Prose was originally implemented, the compiler framework itself was hard-coded for producing only Rust codecs. In light of this, it was important to add a new output target, to guard against over-specialization in the design of the compiler as it evolved further. As the process of implementing support for a completely new programming language would have been prohibitive, it was sufficient to instead invent a simple-enough target format, to serve as a proof-of-concept for the extensibility of the compiler beyond just Rust.

Beyond this, there are two practical purposes that Prose serves in its own right:

First, a Prose codec can be used as a prototype for developing hand-written code in languages that codec_generator does not yet support; in this way, it can partially obviate the need to consult the original schema, which can be difficult to decipher and requires familiarity with both OCaml and the data-encoding library.

Secondly, for the purposes of internal development of the codec_generator pipeline, Prose codec output can be used to debug the compiler itself: if an input schema triggers an exception, or produces unexpected output for a different target language, the generated Prose codec is instrumental in determining the exact combination of properties that caused the failure.

How to Use

For developers working outside of codec_generator, there are two primary ways in which Prose codecs can be useful.

The first is as a diagnostic aid, through the script whatchanged.sh, packaged with the codec_generator repository. This script can be used to discover the concrete changes visible in the binary encoding of a type across different protocol versions. If a field is added to a record, a variant is added to an enumerated type or discriminated union, or some other aspect of the serialization format differs substantively between two protocol versions, this script can report both which schemata changed and what those changes were.

Beyond this, if a new type is introduced, either to the protocol itself or to the scope of a Tezos client library, a Prose codec for that new type can be used as a model to implement both the type itself, and its implicit encoding/decoding semantics.

As a caveat, it is worth mentioning that not all properties of an Octez schema are preserved by the codec_generator tool, and certain high-level mappings from an underlying schema type to its intended Octez type cannot be introspected or represented in any target language, Prose or otherwise. Therefore, while Prose codecs can be used to implement the fundamental structure of a type, the Octez definition should still be consulted to fill in any major gaps.

For a more specific guide as to how to write code to decode, encode, and process a schema type (based on the structure revealed by a Prose codec) please see the codec_generator Target Implementation Guide. This document is primarily focused on informing future development of target-language modules for codec_generator, but can also be used as a guide to hand-writing codec libraries.

Specification

Prose is a custom target language, and its corresponding output files are given the .prose or .prose~ extensions.

It expresses a compact model of the data-encoding schema language, operating on values already simplified by the codec_generator compiler.

(Note that this document is written against the Prose model for pre-0.7 versions of data-encoding, and may need to be updated once the Prose model is restabilized on v0.7.)

Though it is potentially subject to reorganization, the definitions of the Prose generator and its AST are unified in a single module, hosted in the project repository under the path src/virtual/prose.ml.

The underlying AST used to model Prose is the following:

{%gist archaephyrryx/7ed664ccfbc096a157e09baf7f004851 %}

These structural elements are a transient model, and are compiled into human-readable descriptions for each given component.

Unit

A Unit element corresponds to almost all schemata of type unit Encoding.t. Such an element can either be thought of as being 'omitted' from the binary serialization layer, or as being encoded as a zero-width binary value.

It can arise when the source schema includes one of the following data-encoding combinators at some layer (Cf. encoding.ml):

  • null
  • empty
  • unit
  • constant _str (_str: descriptive string presented only in the JSON layer, often for giving names to tagged variants in a discriminated union)

In Prose output, Unit is represented by the following string:

"zero-width value (null or unit)"

Bool

A Bool element corresponds to a bool Encoding.t, which represents a boolean value as a single byte, whose bits are either all set (0xff := true) or all unset (0x00 := false).

Though similar flagging can be found at the binary layer for Dft or Opt (tagged) elements, these are treated separately in the Prose model.

Almost every instance of Bool will correspond to the Encoding.bool combinator occurring in the source schema.

In Prose output, Bool is represented by the following string:

"boolean value"

Num

A Num _ element is an aggregate over all num_t cases defined in simplified.ml:

{%gist archaephyrryx/acb04c69c2312ea2706d2634f63d0f97 %}

These correspond to the respective combinators:

  • int8
  • uint8
  • uint16
  • int16
  • int31
  • int32
  • int64
  • ranged_int
  • float
  • ranged_float

Note that Uint30 appears as a first-class value in simplified.ml, while it does not appear in data-encoding; although 30-bit unsigned integer values are commonly used for enum tag values, dynamic length prefixes, and other composite constructs, a raw integer value covering the same range as uint30 can only be modeled as a ranged_int from 0x0 to 0x3fffffff. However, a RangedInt value with exactly those bounds is re-associated as Uint30 during the Simplification phase of the compiler run.

Integer-valued Num _ are rendered as:

"<N>-bit (un)?signed integer ()"

for the appropriate value of N and the correct indicator of signedness.

For RangedInt values that are not equivalent to a NativeInt, an additional suffix of

" between <MIN> and <MAX>"

is appended.

For Double and RangedDouble, the following string is emitted:

"IEEE-754 double-precision float"

with RangedDouble adding the suffix

" between MIN and MAX"

as appropriate.

Int31/Uint30 vs Int32

There is a technical distinction that should be raised on the subject of the division of 4-byte integer types into Int32, Int31, and Uint30.

On 32-bit OCaml platforms, in-memory values are represented as 32-bit machine words whose lowest bit is a one-bit flag indicating whether the word holds an unboxed immediate in its remaining 31 bits (e.g. OCaml int, bool, unit, certain ADT discriminants, etc.) or a pointer to the address of a boxed value, which may take up more than one machine word to store (list, string, recursive types, records, tuples, etc.).

As a consequence, 32-bit OCaml platforms have only 31 bits to store an unboxed integer value, reserving the highest of the non-flag bits for sign and the remaining 30 bits for the value, using two's-complement representation as normal.

This means that in order to be compatible with OCaml builds on 32-bit systems, data-encoding treats int as a 31-bit signed value, hence Int31. Within this range, Uint30 is a virtual sub-type consisting of all non-negative Int31 values.

As an alternative, OCaml's standard library provides an Int32 module, which offers portable 32-bit integers on all platforms, represented as boxed values unlike the native int. This is the underlying type used for the int32 combinator. (Similarly, an Int64 module exists, corresponding to the int64 combinator.) Though guaranteed portable, the larger memory footprint of such values, and the worse performance of operations on them compared to the unboxed int, mean that Int32 is typically used only in contexts where the extra bit of precision cannot be sacrificed. Similarly, Int64 incurs a performance penalty compared to int even on 64-bit platforms, and is only used when the full 64-bit range is required (e.g. time-related values in Octez, as in lib_base/time.ml). Therefore, the majority of fixed-precision integer schemata use int as their underlying type, reserving Int32.t and Int64.t for the int32 and int64 combinators.

Due to the fact that there is no standard Uint32 module in use by data-encoding, all unsigned 4-byte integers can be assumed to be 30-bit maximal, meaning that any value 0x4000_0000 ($2^{30}$) or above is illegal for that component or sub-component.

In practice, since most usages of Uint30-valued binary elements are for unsigned values strictly less than $2^{30}$ (e.g. byte lengths of strings and arrays, or tags for enumerated types), this should not be a major issue; but to the extent that bounds-checking does not significantly reduce performance, it is advised to ensure that values greater than $2^{30} - 1$ are never serialized into a context that expects a Uint30 value. Similarly, for the case of Int31, values less than $-(2^{30})$ or greater than $2^{30} - 1$ should be filtered out.

As a final note, it is important to clarify that, while values outside of the range $[0, 2^{30} - 1]$ (for Uint30) or $[-(2^{30}), 2^{30} - 1]$ (for Int31) are invalid, the actual encoding of both types is equivalent to that of a full-range Int32 with the same numeric value. That is, negative Int31 values will still use the MSB as their sign-bit, rather than use the 32-bit OCaml in-memory representation. All this means is that Int31 and Uint30 should be encoded and decoded as if they were Int32 values, but that post-decode and pre-encode bounds-checking are necessary to filter out illegal values.
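To make that concrete, the following is a minimal sketch of such bounds-checked decoding, assuming data-encoding's big-endian layout for fixed-width integers and a 64-bit OCaml host (so that Int32.to_int is lossless); the function names and cursor convention are illustrative:

```ocaml
(* Sketch only: Int31 and Uint30 are serialized exactly like a big-endian
   Int32; only the post-decode range check differs. *)
let decode_int31 (buf : bytes) (pos : int) : int * int =
  let v = Int32.to_int (Bytes.get_int32_be buf pos) in
  if v < -0x4000_0000 || v > 0x3fff_ffff then
    invalid_arg (Printf.sprintf "decode_int31: %d is out of range" v)
  else (v, pos + 4)

let decode_uint30 (buf : bytes) (pos : int) : int * int =
  let v = Int32.to_int (Bytes.get_int32_be buf pos) in
  if v < 0 || v > 0x3fff_ffff then
    invalid_arg (Printf.sprintf "decode_uint30: %d is out of range" v)
  else (v, pos + 4)
```

The corresponding encoders follow the same shape, with the range check performed before the value is written.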

N/Z

An N or Z element represents an arbitrary-precision natural ($\mathbb{N}_0$) or integer ($\mathbb{Z}$). These are generally referred to as Zarith values, as the underlying type of the encoding is the Z.t type defined in the OCaml Zarith library.

N corresponds to the n combinator, and Z corresponds to the z combinator, in data-encoding.

In Prose output, N is emitted as

"arbitrary-precision natural (non-negative) integer"

and Z as

"arbitrary-precision integer"

Dynamic-Size Integer Encoding

A custom binary serialization/deserialization strategy is used for the data-encoding primitives N and Z.

As of data-encoding v0.7, this strategy may be used for bounded-precision integer values, via uint_like_n and int_like_z, and so it applies more broadly than just for Zarith-based arbitrary-precision integers.

Due to the nuance and complexity of the encoding strategy in question, it is sufficient to state here that the encoding is a self-terminating byte sequence in which the high bit of each byte is 1 whenever at least one more byte of the value follows, and 0 on the terminal byte. The remaining lower seven bits of each byte are interpreted in little-endian group order (least-significant group first) as the constituent bits of the value, with Z and int_like_z using the second-highest bit of the initial byte to signal the sign (leaving that byte only six payload bits), as an explicit sign rather than a two's-complement representation.

A detailed explanation of the Zarith encoding strategy can be found in the Target Implementation Guide (a more detailed explanation of several other schema/codec types can be found in the same document as well).
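For orientation, the following is a minimal sketch of the natural-number (N) case, using the zarith library; the signed Z case additionally takes its sign from the second-highest bit of the first byte and is omitted here. The function name and cursor convention are illustrative:

```ocaml
(* Sketch only: decode the self-terminating N encoding. Each byte carries
   7 payload bits, groups are ordered from least to most significant, and
   the high bit of a byte is 1 iff another byte follows. Z.t is the
   arbitrary-precision integer type from the zarith library. *)
let decode_n (buf : bytes) (pos : int) : Z.t * int =
  let rec go acc shift pos =
    let b = Bytes.get_uint8 buf pos in
    let acc = Z.logor acc (Z.shift_left (Z.of_int (b land 0x7f)) shift) in
    if b land 0x80 <> 0 then go acc (shift + 7) (pos + 1) else (acc, pos + 1)
  in
  go Z.zero 0 pos
```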

String and Bytes

The unified String _ element represents two intrinsically related, but technically distinct schema kinds in data-encoding:

  • String (Some n) is a transcription of Fixed.string n
  • String None is a transcription of Variable.string

Similarly, Bytes (Some n | None) models Fixed.bytes n and Variable.bytes, respectively.

String _ is emitted with the base string-value:

"character string"

with Bytes _ emitting a similar:

"byte sequence"

with the conditional suffix (only for (String | Bytes) (Some n)) of

" (fixed length: n)"

In both cases, the length stored in the AST constructor, and emitted in the codec output, is the number of bytes (even for String), rather than the number of characters, due to the nature of string in OCaml as outlined below:

OCaml string type

The OCaml string type is far more permissive in its binary contents than the string types of many other languages. OCaml strings are simply immutable, fixed-length sequences of arbitrary bytes, so there is no guarantee that the contents of a given string will be valid UTF-8, or valid as a string in other languages.

It is important, therefore, to understand that String and Bytes are virtually interchangeable, with the only real difference being that a value typed as string is thereby giving a hint that it contains text-oriented data, and might be useful to present as a sequence of characters rather than a raw binary blob.

In fact, since v0.6 of data-encoding, a representation hint is passed to the newly defined string' and bytes' combinators, as a value of the new type

type string_json_repr = Hex | Plain

(Cf. encoding.ml)

which indicates whether the JSON representation of such a value should be a plaintext string, or a hexadecimal encoding of its constituent bytes. Accordingly, this version redefines string and bytes:

let string = string' Plain

let bytes = bytes' Hex

(This refactoring applies both to the top-level definitions, as well as for the Fixed and Variable module definitions of string and bytes, each with their own corresponding prime-variant).
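Putting the above together, a hand-written binary decoder can treat String and Bytes elements uniformly as runs of raw bytes. The following is a minimal sketch; the remaining argument stands in for whatever frame bookkeeping the surrounding codec performs (a Variable string or byte sequence has no length of its own and simply consumes the rest of the enclosing frame), and the function name is illustrative:

```ocaml
(* Sketch only: both String and Bytes elements are runs of raw bytes.
   [len] mirrors the AST's optional fixed length; for the variable case,
   [remaining] is the number of bytes left in the enclosing frame and must
   be supplied by the caller. *)
let decode_string_like ~(len : int option) ~(remaining : int)
    (buf : bytes) (pos : int) : string * int =
  let n = match len with Some n -> n | None -> remaining in
  (Bytes.sub_string buf pos n, pos + n)
```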

Dyn

An element Dyn (lp, t) corresponds to the dynamic_size schema wrapper in data-encoding, with lp being of type

type lenpref = [`Uint30 | `Uint16 | `Uint8]

This value indicates the encoding used for the length prefix (which measures length in bytes, even for payload values with another natural notion of length).
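As a sketch of what this means for a hand-written decoder, the prefix is read first and then exactly that many bytes are handed to the payload decoder; the function names and the payload-decoder signature below are illustrative, and the `Uint30 case would also warrant the range check discussed earlier:

```ocaml
(* Sketch only: decode a Dyn (lp, t) element by reading the byte-length
   prefix indicated by [lp], then decoding the payload within a frame of
   exactly [len] bytes. Fixed-width prefix integers are big-endian, as
   elsewhere in the binary format. *)
type lenpref = [ `Uint30 | `Uint16 | `Uint8 ]

let decode_dyn (lp : lenpref)
    (decode_payload : bytes -> pos:int -> len:int -> 'a)
    (buf : bytes) (pos : int) : 'a * int =
  let len, pos =
    match lp with
    | `Uint8 -> (Bytes.get_uint8 buf pos, pos + 1)
    | `Uint16 -> (Bytes.get_uint16_be buf pos, pos + 2)
    | `Uint30 -> (Int32.to_int (Bytes.get_int32_be buf pos), pos + 4)
  in
  (decode_payload buf ~pos ~len, pos + len)
```

Nested dynamic frames compose in the obvious way: each inner payload decoder is simply handed a smaller len.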

:::warning In v0.7 of data-encoding, a new variant is added to this collection, namely `N.

This new variant specifies a dynamically sized length-prefix, using the encoding for the Zarith schema-type N; however, it is understood that such values ultimately model int and not Zarith.Z.t, and so they are still constrained to be expressible as uint30 values. The only difference, therefore, between the Uint30 and N length-prefix encodings is whether to optimize for smaller values, which can cut down the overhead of the byte prefix according to this table:

| Value Range | N bytes | vs Uint8 | vs Uint16 | vs Uint30 |
| --- | --- | --- | --- | --- |
| $0$--$127$ | 1 | +0% | -50% | -75% |
| $128$--$255$ | 2 | +100% | +0% | -50% |
| $256$--$(2^{14} - 1)$ | 2 | N/A | +0% | -50% |
| $2^{14}$--$(2^{16} - 1)$ | 3 | N/A | +50% | -25% |
| $2^{16}$--$(2^{21} - 1)$ | 3 | N/A | N/A | -25% |
| $2^{21}$--$(2^{28} - 1)$ | 4 | N/A | N/A | +0% |
| $2^{28}$--$(2^{30} - 1)$ | 5 | N/A | N/A | +25% |

:::

Seq

An element Seq (t, limit) corresponds to a unified notion of sequence type, which models both list- and array-based schema components. The constructor itself is typed as Seq of t * limit, where t is a self-reference to the Prose AST type, and limit is a mirrored definition from Data_encoding:

type limit = Data_encoding__Encoding.limit =
  | No_limit
  | At_most of int
  | Exactly of int

The encoding format does not change based on which limit is enforced, but the operational semantics of decoding each kind of sequence differ slightly, and so each limit variant will be explored in turn.
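Before looking at the individual variants, the following is a minimal sketch of the shared decoding shape, assuming the element decoder and the byte length of the enclosing frame are supplied by the surrounding context (for Exactly, an implementation could equally count elements instead of bytes); the names and signatures are illustrative:

```ocaml
(* Sketch only: decode a Seq (t, lim) element. Elements are decoded
   back-to-back until the enclosing frame of [len] bytes is exhausted,
   and the resulting count is then checked against the limit. The limit
   type mirrors the definition quoted above. *)
type limit = No_limit | At_most of int | Exactly of int

let decode_seq (decode_elt : bytes -> int -> 'a * int) (lim : limit)
    (buf : bytes) (pos : int) ~(len : int) : 'a list * int =
  let stop = pos + len in
  let rec go acc pos =
    if pos >= stop then (List.rev acc, pos)
    else
      let x, pos = decode_elt buf pos in
      go (x :: acc) pos
  in
  let xs, pos = go [] pos in
  let ok =
    match lim with
    | No_limit -> true
    | At_most n -> List.length xs <= n
    | Exactly n -> List.length xs = n
  in
  if ok then (xs, pos)
  else invalid_arg "decode_seq: element count violates the declared limit"
```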

Seq (t, No_limit)

An element Seq (t, No_limit) represents a sequence whose cardinality (number of elements) is fully variable.

The d

(Note that with the release of v0.7)

Tuple

An element Tuple ts represents an OCaml tuple whose in-order positional arguments are given by the list ts. Each element is, in turn, the translation of the corresponding positional argument in the source schema, into a Prose AST type.

For example,

Data_encoding.(tup3 bool Variable.string int64)

would be translated (approximately) to

Tuple [Bool; String None; Num (Int (NativeInt Int64))]

Tuples, as product-types, yield more complex output than simple elements.

As a rule, a tuple consisting of N positional arguments will be emitted with a header of

"N-tuple : "

followed by a sequence of items, indented by one layer compared to the header, each of which will be:

"i: <output for element at positional index i>"

For example, the above 3-tuple would yield

3-tuple : 
    0: boolean value
    1: character string
    2: 64-bit signed integer

Record

An element Record fs is similar to an element Tuple _, but models a record type rather than a tuple type. As such, it stores a list of named fields, as a (string * t) list.

Much like Tuple _, the emitted string consists of a header line:

"Record : "

followed by a sequence of items, indented one layer; each of these items follows the pattern

"`<field name>`: <output for corresponding field>"

For example, the following schema:

obj3 (req "first" string) (opt "middle" string) (req "last" string))

would be converted to Prose and emitted as

Record : 
    `first`: length-prefixed (prefix width: 4 bytes): character string
    `middle`: [tagged] nullable of: length-prefixed (prefix width: 4 bytes): character string
    `last`: length-prefixed (prefix width: 4 bytes): character string
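To summarize the Tuple and Record rendering rules, the following is a purely illustrative sketch of an emitter for a small fragment of the AST; it is not the actual code in src/virtual/prose.ml, and the constructors not covered above are elided:

```ocaml
(* Purely illustrative: render a fragment of the Prose AST following the
   Tuple/Record rules described above. Not the actual implementation in
   src/virtual/prose.ml. *)
type t =
  | Bool
  | String of int option
  | Tuple of t list
  | Record of (string * t) list
  (* ... remaining constructors elided ... *)

let rec render ?(indent = 0) (x : t) : string =
  let pad = String.make (4 * (indent + 1)) ' ' in
  match x with
  | Bool -> "boolean value"
  | String _ -> "character string"
  | Tuple ts ->
      Printf.sprintf "%d-tuple : " (List.length ts)
      ^ String.concat ""
          (List.mapi
             (fun i elt ->
               Printf.sprintf "\n%s%d: %s" pad i
                 (render ~indent:(indent + 1) elt))
             ts)
  | Record fs ->
      "Record : "
      ^ String.concat ""
          (List.map
             (fun (name, elt) ->
               Printf.sprintf "\n%s`%s`: %s" pad name
                 (render ~indent:(indent + 1) elt))
             fs)
```

Applied to Tuple [Bool; String None; Bool], for instance, this reproduces the header-plus-indented-items shape shown in the 3-tuple example above.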
