
celestiaorg / celestia-specs

Celestia Specifications

License: Creative Commons Zero v1.0 Universal

Topics: lazyledger, consensus, blockchain, celestia

celestia-specs's Introduction

Celestia Specifications

Community

Notice

THIS REPOSITORY IS NOT CURRENTLY ACTIVELY MAINTAINED, AND DOES NOT REFLECT THE CELESTIA PROTOCOL. THE REFERENCE IMPLEMENTATION OF THE CELESTIA PROTOCOL SHOULD BE CONSULTED AS THE COMPLETE SPECIFICATION.

  1. https://github.com/celestiaorg/celestia-core: Tendermint Core full node
  2. https://github.com/celestiaorg/celestia-node: Celestia-specific logic, attached to Celestia Core node
  3. https://github.com/celestiaorg/celestia-app: Celestia state machine (staking and fee payments) logic

The following core ideas in this repository inform the implementation:

  1. Erasure coding
  2. Namespace IDs
  3. MsgPayForData transaction
  4. State transition

Building From Source

To build book:

mdbook build

To serve locally:

mdbook serve
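If mdbook is not already installed, it can typically be installed with Cargo (assuming a Rust toolchain is available):

cargo install mdbook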

Contributing

Markdown files must conform to GitHub Flavored Markdown. Markdown must be formatted with:

celestia-specs's People

Contributors

adlerjohn, bvrooman, liamsi, musalbas, nusret1996, renaynay, rootulp, tac0turtle


celestia-specs's Issues

Optimize hashing for SMT

Currently the SMT has compact proofs (i.e. size on average O(log non_empty_leaves)), but verification still requires O(tree_height) hash operations. We'd like to reduce the number of hash operations to an average of O(log non_empty_leaves) as well.

The two most obvious options:

  1. Prefix a different bit for leaf nodes and internal nodes (e.g. 0 for leaf nodes and 1 for internal nodes). This gives second preimage resistance equal to the number of bits in the hash function (256).
  2. Prefix leaves with their key. This gives second preimage resistance equal to the number of bits in the key (160 for addresses, in the state tree).

Either of these approaches should provide sufficient protection against second preimage attacks. The former is simpler and more commonly used; the latter requires fewer operations and should therefore be cheaper to compute and verify, both in general and specifically in a smart contract.
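As a rough illustration of the first option, the sketch below (Python, with illustrative prefix values and hash function, not taken from the spec) domain-separates leaf and internal node hashes with a one-byte prefix:

```python
import hashlib

LEAF_PREFIX = b"\x00"      # illustrative prefix for leaf nodes
INTERNAL_PREFIX = b"\x01"  # illustrative prefix for internal nodes

def hash_leaf(key: bytes, value: bytes) -> bytes:
    # Domain separation: a leaf hash can never be reinterpreted as an
    # internal node hash (or vice versa), preventing second preimage attacks.
    return hashlib.sha256(LEAF_PREFIX + key + value).digest()

def hash_internal(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(INTERNAL_PREFIX + left + right).digest()
```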

Consider adding rewards to voting power

Currently, rewards are held as pending rewards but aren't added as voting power. This forces validators to unbond completely in order to fold their rewards into their voting power, which is poor UX. We should consider folding rewards into voting power automatically and immediately, and investigate whether this leads to any potential issues.

Fix empty leaf hash value

The empty leaf hash value is currently keccak256(), but should be changed to sha256() = 0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, in accordance with #103.
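For reference, the proposed default can be reproduced directly (a minimal check using Python's hashlib):

```python
import hashlib

# SHA-256 of the empty string, the proposed empty leaf hash value.
assert hashlib.sha256(b"").hexdigest() == \
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
```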

Define different node types

There are several different types of nodes supported by LazyLedger, each with different compute and networking requirements, along with different assumptions and security properties. These should be specified.

Question: Single state tree or several trees?

We previously tried to convince @adlerjohn to have only one tree for both account state and validator set(s).

That means a light client downloading the validator set of a particular height would also need to request an inclusion proof per validator in that set (inclusion in the state SMT, that is).
As we plan to use optimized compact SMTs, the size of an inclusion proof varies depending on the actual entries in that tree. Assuming that we have loads of actually used accounts in the tree, we quickly hit 500 KB and up to 1 MB of inclusion proof data alone (in addition to the valset data).

If we instead track the valset in a separate tree for validators only, the light client only needs the root of that tree (in the header) and its leaves (the validators in the valset) to verify that the valset matches the one that signed a header (or will sign a future header).

Tendermint basically already does this by including ValidatorsHash and NextValidatorsHash fields in the header: https://github.com/tendermint/tendermint/blob/18d333c3927f3dce0d8d5970ee40e5ffcc33448b/types/block.go#L351-L353
(note that in Tendermint, these result from txs included in the previous block, due to deferred execution)

Unlike Tendermint, we can't use exactly the same kind of simple tree, because we want to enable compact fraud proofs (e.g. a compact SMT instead of the simple Merkle tree currently used: https://github.com/tendermint/tendermint/blob/18d333c3927f3dce0d8d5970ee40e5ffcc33448b/types/validator_set.go#L345-L356), and, as far as I understand, the app needs to track the tree changes (intermediate state roots for fraud proofs).

In any case, we should re-discuss the state tree (or trees).

With a dedicated tree for the valset, the light client can now download the valset and recompute the tree without downloading an inclusion proof per validator.
(To verify that the root matches the state root in the header it actually needs the accounts root too, but that should be it.)
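A minimal sketch of that flow, assuming a plain binary Merkle tree over already-serialized validators (the real spec would likely use a different tree and hashing convention; names are illustrative):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    # Simple binary Merkle root with 0x00/0x01 domain separation; the
    # odd-node handling here (duplicate the last node) is just one possible
    # convention.
    if not leaves:
        return h(b"")
    nodes = [h(b"\x00" + leaf) for leaf in leaves]
    while len(nodes) > 1:
        if len(nodes) % 2 == 1:
            nodes.append(nodes[-1])
        nodes = [h(b"\x01" + nodes[i] + nodes[i + 1])
                 for i in range(0, len(nodes), 2)]
    return nodes[0]

def valset_matches_header(valset: list[bytes], header_validators_root: bytes) -> bool:
    # A light client that downloaded the full valset recomputes the root and
    # compares it against the root committed in the header; no per-validator
    # inclusion proofs are needed.
    return merkle_root(valset) == header_validators_root
```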

Investigate square sizes that aren't a power of 2

#79 made the size of the available data square variable per block (with a fixed maximum size). It (and the rest of the pre-existing rationale) assumes that the current square size is a power of 2; however, this is not strictly necessary.

We should investigate what complexities are introduced by allowing arbitrary square sizes (upper bounded by the maximum) and, if they are acceptable, propose an algorithm for determining the structure of the commitments to available data that is required for transactions.

erasure coding: square width should vary per block instead of being fixed

Currently, the availableDataOriginalSquareSize does not depend on the amount of data in the block if I understand correctly:
https://github.com/lazyledger/lazyledger-specs/blob/master/specs/data_structures.md#consensus-parameters

What would be pros/cons to change this to depend on the amount of data in each block (and hence to vary per block)?

According to @adlerjohn:

But it could be made to vary per-block I guess. It would then need to be in the block header.

I think arranging the data into a square whose size depends on the total number of shares per block makes sense. Otherwise we might create large blocks for no reason.
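As a rough sketch of how a per-block square size could be chosen (assuming the width must stay a power of two, as #80 later concluded; names and the maximum are illustrative):

```python
MAX_SQUARE_SIZE = 128  # illustrative fixed maximum width

def choose_square_size(num_shares: int) -> int:
    # Smallest power-of-two width k such that k * k >= num_shares,
    # capped at the fixed maximum.
    k = 1
    while k * k < num_shares:
        k *= 2
    return min(k, MAX_SQUARE_SIZE)

# e.g. choose_square_size(1000) == 32, since 32 * 32 = 1024 >= 1000.
```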

High-level Architecture diagrams

We should add architecture diagrams that clarify at least the following relations:

  1. LazyLedger layer-1 (including block producers/validators)
  2. application specific sidechain dealing with state (and maybe block producers of that chain)
  3. end-user client submitting Tx

The first version of this can be a very simple draft. Point 3) could also be skipped in a first diagram if that simplifies things. Additionally, more detailed diagrams explaining all touching points and interactions between all three would be great.

Merkle tree default value deviates conceptually from RFC-6962

In https://github.com/lazyledger/lazyledger-specs/blob/master/specs/data_structures.md#binary-merkle-tree
we state:

Binary Merkle trees are constructed in the same fashion as described in Certificate Transparency (RFC-6962). Leaves are hashed once to get leaf node values and internal node values are the hash of the concatenation of their children (either leaf nodes or other internal nodes).

but also

The base case (an empty tree) is defined as zero:
node.v = 0x0000000000000000000000000000000000000000000000000000000000000000

This is confusing as we actually deviate from RFC-6962. There it says:

The hash of an empty list is the hash of an empty string:
MTH({}) = SHA-256().

https://tools.ietf.org/html/rfc6962#section-2.1

My understanding is that we only borrow how to split / balance the tree in case the number of leaves is odd. The base hasher and the base case are different.
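For comparison, a direct transcription of the RFC 6962 Merkle Tree Hash into Python (a sketch; SHA-256 is the RFC's hash function, and k is the largest power of two smaller than n):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def mth(entries: list[bytes]) -> bytes:
    # RFC 6962, section 2.1.
    n = len(entries)
    if n == 0:
        return sha256(b"")                   # MTH({}) = SHA-256(), not 0x00...00
    if n == 1:
        return sha256(b"\x00" + entries[0])  # leaf hash
    k = 1
    while k * 2 < n:                         # largest power of two smaller than n
        k *= 2
    return sha256(b"\x01" + mth(entries[:k]) + mth(entries[k:]))
```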

Decide on data model for validator set

There are two viable data models that we need to choose between for the validator set and fee-paying account balances: UTXOs and accounts. Defining a transaction format is dependent on this.

Considerations

Transactions are generally larger in the UTXO data model, and reasoning about a single account receiving funds is more complex. However, it allows for easier parallel validation and, most importantly, it is more conducive to HTLCs (which can be used for cross-chain atomic swaps without execution, an indispensable feature).

Recommendation

Given that there will be no repeated interactions with contracts, the UTXO data model looks preferable.

Use 32-byte addresses

Use 32-byte instead of 20-byte addresses. Rationale:

  • 256 bits instead of 160 bits of security. 80 bits of collision resistance may not be sufficient.
  • The benefit of using a smaller address space is a smaller state tree, but this is moot given that 1) the state tree is keyed by the hash of the address and 2) the SMT is compact.
  • Only downside: slightly larger transactions, as the recipient field of a transfer is 12 bytes larger. This isn't much of an issue since transfers should be intentionally few.

Use a Merkle accumulator for history

Instead of a linear previous-block-hash chain, a Merkle accumulator (such as a Merkle mountain range) should be used. This allows proving inclusion of a block in the chain with a logarithmic number of hashes, which is especially useful in the context of cross-chain relays.

Specify cryptographic primitives

LazyLedger, like all blockchains, uses numerous cryptographic primitives (most notably, hash functions and signature schemes such as ECDSA or BLS).
We need to choose among these and document the choices in the spec (and ideally document elsewhere why these decisions were made).

Note: While these decisions are critical for a final spec, decisions about well-established building blocks can happen at a later point in the specification process. We can and should start the implementation with the choices made in Tendermint anyway, in order to prototype rapidly.

Investigate different hash functions

This issue is to track investigation into which hash function to use throughout the primary LazyLedger data structures (i.e. Merkle trees and simple hashes).

Two critical desiderata:

  1. Standard: Using a nonstandard cryptographic primitive can lead to long-term widespread confusion and should be avoided at all costs.
  2. Ethereum compatibility: Ethereum has native support for the following hash algorithms: keccak256, sha256, blake2b, ripemd160.

Note that performance is a secondary concern at best.

Of Ethereum's natively supported hash algorithms, only sha256 is standard (i.e. approved by a standards body). However, it is vulnerable to a length extension attack. The severity of this should be considered low both because it only applies to the use of this hash function in a MAC, and because the extensive use of sha256 in the cryptocurrency ecosystem serves as a "crowd immunity" layer.

SHA-3 would be a preferable alternative to avoid this issue altogether, but is not natively supported by Ethereum at this time.

Proofs shouldn't include the Merkle root or the value being proven, as this is part of the statement being proven, not the proof

A Merkle proof is a proof of the statement "key x has value y in the tree with root z". It doesn't make sense for the statement being proven to be part of the proof, because the prover could always generate a valid proof if they get to pick the statement. Furthermore, the verifier may already know what the statement being proven is, so including it as part of the proof may be redundant. At minimum, they should already know the root of the tree.

See: https://github.com/lazyledger/smt/pull/5/files#r460376202
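A sketch of the implied verifier interface for a plain binary Merkle tree (hashing conventions are illustrative): the root, position, and value are statement inputs supplied by the verifier, and the proof carries only the sibling hashes.

```python
import hashlib
from dataclasses import dataclass

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

@dataclass
class MerkleProof:
    # Only auxiliary data: sibling hashes from the leaf up to the root.
    siblings: list[bytes]

def verify(root: bytes, index: int, value: bytes, proof: MerkleProof) -> bool:
    # The statement "leaf `index` has `value` under `root`" is chosen by the
    # verifier; the proof only lets the verifier recompute a candidate root.
    node = h(b"\x00" + value)
    for sibling in proof.siblings:
        if index % 2 == 0:
            node = h(b"\x01" + node + sibling)
        else:
            node = h(b"\x01" + sibling + node)
        index //= 2
    return node == root
```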

Add header field for number of shares used

#80 explored the performance of different square sizes, and determined that the number of shares for each row and column of the available data square should always be a power of 2. This could result in a lot of wasted computation and sampling if only slightly more than a quarter of the shares are used for real data (indeed, this is exactly what prompted #68).

To avoid waste, we can do two things:

  1. Add the number of shares used for "real" data in the block header.
  2. Since any shares beyond the number of real data shares must be tail padding, rows containing only tail padding should be omitted from the row NMT commitments and don't have to be sampled. The exact number of rows to omit can be deterministically computed from the current available data square size and the number of real data shares (see the sketch after this list).
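A rough sketch of that computation, assuming real data fills the original square row by row (names and the 64x64 example are illustrative):

```python
import math

def rows_to_omit(square_size: int, real_data_shares: int) -> int:
    # Rows of the original data square that contain at least one real data share.
    used_rows = math.ceil(real_data_shares / square_size)
    # The remaining rows hold only tail padding and need not be
    # committed to in the row NMTs or sampled.
    return square_size - used_rows

# e.g. a 64x64 original square with 1100 real data shares uses
# ceil(1100 / 64) = 18 rows, so rows_to_omit(64, 1100) == 46.
```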

Get rid of compromises made to be closer to tendermint implementation (where it makes sense)

It often seems like a good trade-off to keep some details in the spec that originate from the way Tendermint (and ABCI) is designed. We've now decided where to draw the line: we want to avoid this at (almost) all costs wherever we would have to make trade-offs that are consensus-critical:

when it comes to things like data structures etc [the spec] should be designed according to the Tendermint implementation
But we shouldn't have to make consensus-critical trade-offs due to the Tendermint implementation, like adding extra fraud proofs

A potentially incomplete list of fields that are artefacts of the way Tendermint / ABCI is designed and that we would not include in an ideal LL spec: ValidatorsHash, NextValidatorsHash.

While this could mean that the implementation can incrementally be made safe as we remove implementation-specific details step by step, we have to make sure that the spec is sane and specifies a safe system.

ref: #75
ref: #65

Is removing validators and next validators' hashes absolutely necessary?

From a single abci app (in our case the state machine that tracks the minimal state necessary for PoS), it makes sense to remove these two hashes: https://github.com/tendermint/tendermint/blob/8571b2ee9a77ba62b24926324864677867e55124/types/block.go#L341-L342

They could be part of the appHash (state root) anyway. My concern with this is that it does not make sense from the PoV of Tendermint: these hashes are essential information for Tendermint consensus and are computed by Tendermint (not the app), which does not know or understand how the (ABCI) app deals with state. Also, the light client needs these two hashes (independent of how the app tracks state). This means we can't use most of the Tendermint light client logic (which might be OK, as the LL light client will be different enough anyway).

TODO: further discuss and clarify the implications of removing these two fields vs keeping them.

Add implicit and canonical data fields

E.g. transaction signatures should sign over the chain ID to prevent replay attacks and other exploits, but including the chain ID in each transaction is redundant; it can instead be implicit. The same applies to votes.
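A minimal sketch of the idea (illustrative hashing and field handling, not the spec's actual signing scheme): the chain ID is mixed into the signed bytes but never serialized on the wire.

```python
import hashlib

def sign_bytes(chain_id: bytes, tx_body: bytes) -> bytes:
    # The chain ID is implicit: it is part of what gets signed, but is not a
    # transaction field, so it costs no block space.
    return hashlib.sha256(chain_id + tx_body).digest()

# Signers and verifiers agree on chain_id out of band (e.g. from genesis),
# so a signature produced for one chain cannot be replayed on another.
```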

Storage node & protocol

We need to define the public API and the protocol to interact with a storage node.

This also plays into data availability (which may start being specced out in #2), as consensus participants (validators) that do not want to be storage nodes themselves will need to query them for data availability.

Quoting musalbas here:

So there should be an option to run:
a) validator that is also a storage node
b) storage node only
c) validator that just uses data availability proofs (update: not planned anymore)
d) light clients that use data availability proofs

Define main terminology

Define the main terminology (naming for key concepts specific to LazyLedger, but also some more general terms), e.g. clear terminology for the LazyLedger layer 1, sidechains, execution environments, Tx to layer 1, sidechain Tx, tbc ...


Add special state transition for end of block

After all transactions are processed, a special implicit state transition should be applied that e.g. properly changes the validator set based on the current state.

The state will also have to be changed to commit to some priority queue of queued validators, so that a state transition fraud proof can be created for this.

The core idea is to augment the queued validators with a singly-linked-list priority queue ordered by voting power. Every time the queue is updated, the links are updated, and this metadata is added by the block proposer as a wrapped transaction, allowing fraud proofs without knowledge of the whole queue.
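A rough sketch of such a queue as plain state entries (an illustrative structure, not the spec's actual state layout): each queued validator stores the ID of the next validator in descending voting-power order, so an insertion only touches a constant number of entries and can be checked with a handful of state inclusion proofs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueuedValidator:
    voting_power: int
    next_id: Optional[bytes]  # next validator in descending voting-power order

def insert(queue: dict[bytes, QueuedValidator], head: Optional[bytes],
           new_id: bytes, power: int) -> Optional[bytes]:
    """Insert into the singly-linked priority queue; returns the new head."""
    # The block proposer can supply the predecessor as metadata, so a verifier
    # only needs to check the predecessor and successor entries rather than
    # walking the whole queue as this naive version does.
    prev, cur = None, head
    while cur is not None and queue[cur].voting_power >= power:
        prev, cur = cur, queue[cur].next_id
    queue[new_id] = QueuedValidator(power, cur)
    if prev is None:
        return new_id
    queue[prev].next_id = new_id
    return head
```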

Merkle trees misc cleanup and fixes

https://github.com/lazyledger/lazyledger-specs/blob/master/specs/data_structures.md#sparse-merkle-tree

For leaf node of leaf message m with key k, its value v is:

Better to use a different word than message here (otherwise this might be conflated with the other use of the word messages)

Also, it would help with getting a quick understanding if there were 2-3 simple-to-follow examples of how an SMT will look (including trivial edge cases like the all-empty tree, a tree with only one entry, and one realistic example with a few entries).

Splitting into shares needs to know request length

Currently, shares for transactions, evidence, and intermediate state roots are arranged by first serializing an array of e.g. transactions, then splitting it into 255-byte shares (with the 256th byte reserved to indicate the starting location of the first transaction that starts in the share).

The issue here is that simply serializing an array of transactions doesn't return the length of each serialized transaction, so we can't properly compute the reserved byte value.

Splitting into shares should instead use the following procedure (a sketch in code follows the list):

  1. Serialize each transaction individually (which tells us the number of bytes for each transaction)
  2. Create the byte array serialize(length of serialize(tx1)) || serialize(tx1) || serialize(length of serialize(tx2)) || serialize(tx2) || ...
  3. Split the above byte array into 255-byte shares, and use the lengths from step (1) to compute reserved byte values
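A minimal sketch of this procedure, assuming a fixed 4-byte big-endian length prefix and 256-byte shares with one reserved byte (both illustrative, not the spec's actual encoding):

```python
SHARE_SIZE = 256                 # illustrative total share size in bytes
DATA_PER_SHARE = SHARE_SIZE - 1  # 255 bytes of data; 1 reserved byte

def split_into_shares(serialized_txs: list[bytes]) -> list[bytes]:
    # Step 1: transactions are serialized individually, so each length is known.
    stream = b""
    tx_starts = []
    for tx in serialized_txs:
        # Step 2: prefix each transaction with its serialized length.
        tx_starts.append(len(stream))
        stream += len(tx).to_bytes(4, "big") + tx
    # Step 3: split into 255-byte chunks; the reserved byte of each share is the
    # offset of the first transaction starting in that share (simplified here:
    # 0 if no transaction starts in the share).
    shares = []
    for i in range(0, len(stream), DATA_PER_SHARE):
        chunk = stream[i:i + DATA_PER_SHARE].ljust(DATA_PER_SHARE, b"\x00")
        first_start = next(
            (s - i for s in tx_starts if i <= s < i + DATA_PER_SHARE), 0)
        shares.append(bytes([first_start]) + chunk)
    return shares
```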

Inter-message padding shares

Currently, inter-message padding shares are paid for with a special transaction created by the block proposer. However, that may not be ideal. One alternative is to include this information as part of the block data in a way that is not a transaction, and thus does not need to be paid for. Another alternative is to set inter-message padding shares to the previous message's namespace ID and add some logic to prune padding shares from proofs, etc.

Related: #86.

Using Python syntax as pseudocode

I think Python syntax should be used as pseudocode, as it's almost identical to the current (fake) syntax, and since it's a real language it provides a point of reference.


Investigate canonical encoding scheme

Currently, a canonical variant of protobuf3 is used to define data structures and to do serialization/deserialization. This is great for portability, but isn't ideal for blockchain applications. For example, there are no fixed-size byte arrays (32-byte hashes are used all over the place in blockchains) or optional fields. Robust protobuf implementations for embedded devices (e.g. hardware wallets) or blockchain smart contracts (e.g. Solidity) are non-existent and would be prohibitive to develop.

The crux is that protobuf was designed for client-server communication, with different versions (i.e. both forwards and backwards compatibility are key features). This is unneeded for core blockchain data structures (e.g. blocks, votes, or transactions), but may be good for node-to-node communication (e.g. messages that wrap around block data, vote data, or transaction data).

We should investigate the feasibility of using a simpler serialization scheme for core data structures.

Desiderata:

  • fully deterministic (specifically, bijective)
  • binary, not text
  • native support for basic blockchain data types (esp. fixed-sized arrays)
  • typedefs / type aliases (i.e. zero-cost abstractions)
  • no requirements on backwards or forwards compatibility

Comparison of Different Schemes

Protobuf

https://developers.google.com/protocol-buffers/docs/overview

https://github.com/lazyledger/protobuf3-solidity-lib

https://github.com/cosmos/cosmos-sdk/blob/master/docs/architecture/adr-027-deterministic-protobuf-serialization.md

cosmos/cosmos-sdk#7488

protocolbuffers/protobuf#3521

XDR

https://tools.ietf.org/html/rfc4506

https://developers.stellar.org/docs/glossary/xdr

Veriform

https://github.com/iqlusioninc/veriform

SimpleSerialize (SSZ)

https://github.com/ethereum/eth2.0-specs/blob/dev/ssz/simple-serialize.md


SMT should hash leaf value

Exclusion proofs for SMTs are inclusion proofs of a different leaf (or of an empty leaf). Including the entire leaf data is redundant in such a case. We can optimize this by hashing the leaf data, i.e. v = h(k, h(leaf_data)).
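A one-line illustration of the proposed leaf value (the hash function choice is illustrative):

```python
import hashlib

def leaf_value(key: bytes, leaf_data: bytes) -> bytes:
    # v = h(k, h(leaf_data)): an exclusion proof only needs h(leaf_data),
    # not the full leaf data.
    h = hashlib.sha256
    return h(key + h(leaf_data).digest()).digest()
```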

Investigate signature aggregation

BLS can be used to aggregate signatures (i.e. represent an arbitrary number of unaggregated signatures as a fixed-length aggregated signature). This should allow for a drastic reduction in the size of Tendermint commit data.


Padding shares should be deterministic and consensus-enforced

Currently, padding between messages can have any namespace ID between the namespace IDs of the surrounding messages:

The non-interactive default rules may introduce empty shares that do not belong to any message (in the example above, the top-right share is empty). These must be explicitly disclaimed by the block producer using special transactions.

However, these padding shares aren't included when distributing the entire block all at once, since blocks aren't normally distributed laid out as shares. To fix this, the namespace ID of padding shares must be both deterministic and enforced by consensus.

Spec more Merkle proofs format

#47 cleaned up a number of issues around the Merkle tree specs. Additional proof formats still need to be defined:

  • Merkle multiproofs
  • Exclusion proof for Sparse Merkle tree (rather, inclusion proof of empty leaf)
  • Range proof for NMT

Can't prove messages aren't paid for

One issue with the current scheme of separating transactions and messages is that, while we can compactly prove that the message a transaction pays for has been included, we can't compactly prove that a message wasn't paid for.

The most obvious solutions are:

  1. Add some reserved bytes at the beginning of each message pointing to the transaction index that pays for them. This doesn't work without major changes, as it breaks the non-interactive default share layout.
  2. Accumulate <key: message start, value: <transaction index, message end (or message length)>> pairs in a Sparse Merkle Tree and commit to the root in the block header. This way we can prove exclusion directly.
  3. Require that transactions that pay for a message are ordered by message start (and implicitly namespace ID). This way we can prove exclusion with two inclusion proofs (sketched below).
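A sketch of the check implied by option 3, assuming consensus enforces that paying transactions are sorted by message start and that inclusion proofs for the two adjacent transactions are verified separately (boundary cases before the first and after the last transaction are omitted; names are illustrative):

```python
def proves_message_unpaid(msg_start: int,
                          tx_index_a: int, msg_start_a: int,
                          tx_index_b: int, msg_start_b: int) -> bool:
    # Given two transactions already proven to be included at adjacent indices
    # in a list ordered by message start, a message starting strictly between
    # their message starts cannot be paid for by any transaction.
    adjacent = tx_index_b == tx_index_a + 1
    ordered = msg_start_a < msg_start_b
    in_gap = msg_start_a < msg_start < msg_start_b
    return adjacent and ordered and in_gap
```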
