
celestiaorg / celestia-specs

Celestia Specifications

License: Creative Commons Zero v1.0 Universal

Topics: lazyledger, consensus, blockchain, celestia

celestia-specs's Introduction

Celestia Specifications

Community

Notice

THIS REPOSITORY IS NOT CURRENTLY ACTIVELY MAINTAINED, AND DOES NOT REFLECT THE CELESTIA PROTOCOL. THE REFERENCE IMPLEMENTATION OF THE CELESTIA PROTOCOL SHOULD BE CONSULTED AS THE COMPLETE SPECIFICATION.

  1. https://github.com/celestiaorg/celestia-core: Tendermint Core full node
  2. https://github.com/celestiaorg/celestia-node: Celestia-specific logic, attached to Celestia Core node
  3. https://github.com/celestiaorg/celestia-app: Celestia state machine (staking and fee payments) logic

The following core ideas in this repository inform the implementation:

  1. Erasure coding
  2. Namespace IDs
  3. MsgPayForData transaction
  4. State transition

Building From Source

To build book:

mdbook build

To serve locally:

mdbook serve
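If mdbook is not already installed, it can typically be installed with Cargo (assuming a Rust toolchain is available):

cargo install mdbook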

Contributing

Markdown files must conform to GitHub Flavored Markdown. Markdown must be formatted with:

celestia-specs's People

Contributors

adlerjohn, bvrooman, liamsi, musalbas, nusret1996, renaynay, rootulp, tac0turtle


celestia-specs's Issues

Optimize hashing for SMT

Currently the SMT has compact proofs (i.e. size on average O(log non_empty_leaves)), but verification still requires O(tree_height) hash operations. We'd like to reduce the number of hash operations to an average of O(log non_empty_leaves) as well.

The two most obvious options:

  1. Prefix a different bit for leaf nodes and internal nodes (e.g. 0 for leaf nodes and 1 for internal nodes). This gives second preimage resistance equal to the number of bits in the hash function (256).
  2. Prefix leaves with their key. This gives second preimage resistance equal to the number of bits in the key (160 for addresses, in the state tree).

Either of these approaches should provide sufficient protection against second preimage attacks. The former is simpler and more commonly used; the latter requires fewer operations and should therefore be cheaper to compute and verify, both in general and specifically in a smart contract.
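As a rough illustration of the first option, the sketch below (Python, with illustrative prefix values and hash function, not taken from the spec) domain-separates leaf and internal node hashes with a one-byte prefix:

```python
import hashlib

LEAF_PREFIX = b"\x00"      # illustrative prefix for leaf nodes
INTERNAL_PREFIX = b"\x01"  # illustrative prefix for internal nodes

def hash_leaf(key: bytes, value: bytes) -> bytes:
    # Domain separation: a leaf hash can never be reinterpreted as an
    # internal node hash (or vice versa), preventing second preimage attacks.
    return hashlib.sha256(LEAF_PREFIX + key + value).digest()

def hash_internal(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(INTERNAL_PREFIX + left + right).digest()
```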

Consider adding rewards to voting power

Currently, rewards are held as pending rewards but aren't added as voting power. This forces validators to unbond completely in order to fold their rewards into their voting power, which is poor UX. We should consider folding rewards into voting power automatically and immediately, and investigate whether this leads to any potential issues.

Fix empty leaf hash value

The empty leaf hash value is currently keccak256(), but should be changed to sha256() = 0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, in accordance with #103.
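For reference, the proposed default can be reproduced directly (a minimal check using Python's hashlib):

```python
import hashlib

# SHA-256 of the empty string, the proposed empty leaf hash value.
assert hashlib.sha256(b"").hexdigest() == \
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
```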

Define different node types

There are several different types of nodes supported by LazyLedger, each with different compute and networking requirements, along with different assumptions and security properties. These should be specified.

Question: Single state tree or several trees?

We previously tried to convince @adlerjohn to have only one tree for both account state and validator set(s).

That means a light client downloading the validator set of a particular height would also need to request an inclusion proof per validator in that set (inclusion in the state SMT, that is).
As we plan to use optimized compact SMTs, the size of an inclusion proof varies depending on the actual entries in that tree. Assuming that we have loads of actually used accounts in the tree, we quickly hit 500 KB and up to 1 MB of inclusion proof data alone (in addition to the valset data).

If we instead track the valset in a separate tree for validators only, the light client only needs the root of that tree (in the header) and its leaves (the validators in the valset) to verify that the valset matches the one that signed a header (or will sign a future header).

Tendermint basically already does this by including ValidatorsHash and NextValidatorsHash fields in the header: https://github.com/tendermint/tendermint/blob/18d333c3927f3dce0d8d5970ee40e5ffcc33448b/types/block.go#L351-L353
(note that in Tendermint, these result from txs included in the previous block, due to deferred execution)

Unlike Tendermint, we can't use exactly the same kind of simple tree, because we want to enable compact fraud proofs (e.g. a compact SMT instead of the simple Merkle tree currently used: https://github.com/tendermint/tendermint/blob/18d333c3927f3dce0d8d5970ee40e5ffcc33448b/types/validator_set.go#L345-L356), and, as far as I understand, the app needs to track the tree changes (intermediate state roots for fraud proofs).

In any case, we should re-discuss the state tree (or trees).

With a dedicated tree for the valset, the light client can now download the valset and recompute the tree without downloading an inclusion proof per validator.
(To verify that the root matches the state root in the header it actually needs the accounts root too, but that should be it.)
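A minimal sketch of that flow, assuming a plain binary Merkle tree over already-serialized validators (the real spec would likely use a different tree and hashing convention; names are illustrative):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    # Simple binary Merkle root with 0x00/0x01 domain separation; the
    # odd-node handling here (duplicate the last node) is just one possible
    # convention.
    if not leaves:
        return h(b"")
    nodes = [h(b"\x00" + leaf) for leaf in leaves]
    while len(nodes) > 1:
        if len(nodes) % 2 == 1:
            nodes.append(nodes[-1])
        nodes = [h(b"\x01" + nodes[i] + nodes[i + 1])
                 for i in range(0, len(nodes), 2)]
    return nodes[0]

def valset_matches_header(valset: list[bytes], header_validators_root: bytes) -> bool:
    # A light client that downloaded the full valset recomputes the root and
    # compares it against the root committed in the header; no per-validator
    # inclusion proofs are needed.
    return merkle_root(valset) == header_validators_root
```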

Investigate square sizes that aren't a power of 2

#79 made the size of the available data square variable per block (with a fixed maximum size). It (and the rest of the pre-existing rationale) assumes that the current square size is a power of 2; however, this is not strictly necessary.

We should investigate what complexities are introduced by allowing arbitrary square sizes (upper bounded by the maximum) and, if they are acceptable, propose an algorithm for determining the structure of the commitments to available data that is required for transactions.

erasure coding: square width should vary per block instead of being fixed

Currently, the availableDataOriginalSquareSize does not depend on the amount of data in the block if I understand correctly:
https://github.com/lazyledger/lazyledger-specs/blob/master/specs/data_structures.md#consensus-parameters

What would be pros/cons to change this to depend on the amount of data in each block (and hence to vary per block)?

According to @adlerjohn:

But it could be made to vary per-block I guess. It would then need to be in the block header.

I think arranging the data into a square whose size depends on the total number of shares per block makes sense. Otherwise we might create large blocks for no reason.
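As a rough sketch of how a per-block square size could be chosen (assuming the width must stay a power of two, as #80 later concluded; names and the maximum are illustrative):

```python
MAX_SQUARE_SIZE = 128  # illustrative fixed maximum width

def choose_square_size(num_shares: int) -> int:
    # Smallest power-of-two width k such that k * k >= num_shares,
    # capped at the fixed maximum.
    k = 1
    while k * k < num_shares:
        k *= 2
    return min(k, MAX_SQUARE_SIZE)

# e.g. choose_square_size(1000) == 32, since 32 * 32 = 1024 >= 1000.
```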

High-level Architecture diagrams

We should add architecture diagrams that clarify at least the following relations:

  1. LazyLedger layer-1 (including block producers/validators)
  2. application specific sidechain dealing with state (and maybe block producers of that chain)
  3. end-user client submitting Tx

The first version of this can be a very simple draft. Point 3) could also be skipped in a first diagram if that simplifies things. Additionally, more detailed diagrams explaining all touching points and interactions between all three would be great.

Merkle tree default value deviates conceptually from RFC-6962

In https://github.com/lazyledger/lazyledger-specs/blob/master/specs/data_structures.md#binary-merkle-tree
we state:

Binary Merkle trees are constructed in the same fashion as described in Certificate Transparency (RFC-6962). Leaves are hashed once to get leaf node values and internal node values are the hash of the concatenation of their children (either leaf nodes or other internal nodes).

but also

The base case (an empty tree) is defined as zero:
node.v = 0x0000000000000000000000000000000000000000000000000000000000000000

This is confusing as we actually deviate from RFC-6962. There it says:

The hash of an empty list is the hash of an empty string:
MTH({}) = SHA-256().

https://tools.ietf.org/html/rfc6962#section-2.1

My understanding is that we only borrow how to split / balance the tree in case the number of leaves is odd. The base hasher and the base case are different.
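For comparison, a direct transcription of the RFC 6962 Merkle Tree Hash into Python (a sketch; SHA-256 is the RFC's hash function, and k is the largest power of two smaller than n):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def mth(entries: list[bytes]) -> bytes:
    # RFC 6962, section 2.1.
    n = len(entries)
    if n == 0:
        return sha256(b"")                   # MTH({}) = SHA-256(), not 0x00...00
    if n == 1:
        return sha256(b"\x00" + entries[0])  # leaf hash
    k = 1
    while k * 2 < n:                         # largest power of two smaller than n
        k *= 2
    return sha256(b"\x01" + mth(entries[:k]) + mth(entries[k:]))
```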

Decide on data model for validator set

There are two viable data models that we need to choose between for the validator set and fee-paying account balances: UTXOs and accounts. Defining a transaction format is dependent on this.

Considerations

Transactions are generally larger in the UTXO data model, and reasoning about a single account receiving funds is more complex. However, it allows for easier parallel validation and, most importantly, it is more conducive to HTLCs (which can be used for cross-chain atomic swaps without execution, an indispensable feature).

Recommendation

Given that there will be no repeated interactions with contracts, the UTXO data model looks preferable.

Use 32-byte addresses

Use 32-byte instead of 20-byte addresses. Rationale:

  • 256 bits instead of 160 bits of security. 80 bits of collision resistance may not be sufficient.
  • The benefit of using a smaller address space is a smaller state tree, but this is moot given that 1) the state tree is keyed by the hash of the address and 2) the SMT is compact.
  • Only downside: slightly larger transactions, as the recipient field of a transfer is 12 bytes larger. This isn't much of an issue since transfers should be intentionally few.

Use a Merkle accumulator for history

Instead of a linear previous-block-hash chain, a Merkle accumulator (such as a Merkle mountain range) should be used. This allows proving inclusion of a block in the chain with a logarithmic number of hashes, which is especially useful in the context of cross-chain relays.

Specify cryptographic primitives

LazyLedger, like all blockchains, uses numerous cryptographic primitives (most notably, hash functions and signature schemes such as ECDSA or BLS).
We need to choose among these and document the choices in the spec (and ideally document elsewhere why these decisions were made).

Note: While these decisions are critical for a final spec, decisions about well-established building blocks can happen at a later point in the specification process. We can and should start the implementation with the choices made in Tendermint anyway, in order to prototype rapidly.

Investigate different hash functions

This issue is to track investigation into which hash function to use throughout the primary LazyLedger data structures (i.e. Merkle trees and simple hashes).

Two critical desiderata:

  1. Standard: Using a nonstandard cryptographic primitive can lead to long-term widespread confusion and should be avoided at all costs.
  2. Ethereum compatibility: Ethereum has native support for the following hash algorithms: keccak256, sha256, blake2b, ripemd160.

Note that performance is a secondary concern at best.

Of Ethereum's natively supported hash algorithms, only sha256 is standard (i.e. approved by a standards body). However, it is vulnerable to a length extension attack. The severity of this should be considered low both because it only applies to the use of this hash function in a MAC, and because the extensive use of sha256 in the cryptocurrency ecosystem serves as a "crowd immunity" layer.

SHA-3 would be a preferable alternative to avoid this issue altogether, but is not natively supported by Ethereum at this time.

Proofs shouldn't include the Merkle root or the value being proven, as this is part of the statement being proven, not the proof

A Merkle proof is a proof of the statement "key x has value y in the tree with root z". It doesn't make sense for the statement being proven to be part of the proof, because the prover could always generate a valid proof if they get to pick the statement. Furthermore, the verifier may already know what the statement being proven is, so including it as part of the proof may be redundant. At minimum, they should already know the root of the tree.

See: https://github.com/lazyledger/smt/pull/5/files#r460376202
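A sketch of the implied verifier interface for a plain binary Merkle tree (hashing conventions are illustrative): the root, position, and value are statement inputs supplied by the verifier, and the proof carries only the sibling hashes.

```python
import hashlib
from dataclasses import dataclass

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

@dataclass
class MerkleProof:
    # Only auxiliary data: sibling hashes from the leaf up to the root.
    siblings: list[bytes]

def verify(root: bytes, index: int, value: bytes, proof: MerkleProof) -> bool:
    # The statement "leaf `index` has `value` under `root`" is chosen by the
    # verifier; the proof only lets the verifier recompute a candidate root.
    node = h(b"\x00" + value)
    for sibling in proof.siblings:
        if index % 2 == 0:
            node = h(b"\x01" + node + sibling)
        else:
            node = h(b"\x01" + sibling + node)
        index //= 2
    return node == root
```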

Add header field for number of shares used

#80 explored the performance of different square sizes, and determined that the number of shares for each row and column of the available data square should always be a power of 2. This could result in a lot of wasted computation and sampling if only slightly more than a quarter of the shares are used for real data (indeed, this is exactly what prompted #68).

To avoid waste, we can do two things:

  1. Add the number of shares used for "real" data in the block header.
  2. Since any shares beyond the number of real data shares must be tail padding, rows containing only tail padding should be omitted from the row NMT commitments and don't have to be sampled. The exact number of rows to omit can be deterministically computed from the current available data square size and the number of real data shares (see the sketch after this list).
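A rough sketch of that computation, assuming real data fills the original square row by row (names and the 64x64 example are illustrative):

```python
import math

def rows_to_omit(square_size: int, real_data_shares: int) -> int:
    # Rows of the original data square that contain at least one real data share.
    used_rows = math.ceil(real_data_shares / square_size)
    # The remaining rows hold only tail padding and need not be
    # committed to in the row NMTs or sampled.
    return square_size - used_rows

# e.g. a 64x64 original square with 1100 real data shares uses
# ceil(1100 / 64) = 18 rows, so rows_to_omit(64, 1100) == 46.
```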

Get rid of compromises made to be closer to tendermint implementation (where it makes sense)

It often seems like a good trade-off to keep some details in the spec that originate from the way Tendermint (and ABCI) is designed. We've now decided where to draw the line: we want to avoid this at (almost) all costs wherever we would have to make trade-offs that are consensus-critical:

when it comes to things like data structures etc [the spec] should be designed according to the Tendermint implementation
But we shouldn't have to make consensus-critical trade-offs due to the Tendermint implementation, like adding extra fraud proofs

A potentially incomplete list of fields that are artefacts of the way Tendermint / ABCI is designed and that we would not include in an ideal LL spec: ValidatorsHash, NextValidatorsHash.

While this could mean that the implementation can incrementally be made safe as we remove implementation-specific details step by step, we have to make sure that the spec is sane and specifies a safe system.

ref: #75
ref: #65

Is removing validators and next validators' hashes absolutely necessary?

From a single abci app (in our case the state machine that tracks the minimal state necessary for PoS), it makes sense to remove these two hashes: https://github.com/tendermint/tendermint/blob/8571b2ee9a77ba62b24926324864677867e55124/types/block.go#L341-L342

They could be part of the appHash (state root) anyway. My concern with this is that it does not make sense from the PoV of Tendermint: these hashes are essential information for Tendermint consensus and are computed by Tendermint (not the app), which does not know or understand how the (ABCI) app deals with state. Also, the light client needs these two hashes (independent of how the app tracks state). This means we can't use most of the Tendermint light client logic (which might be OK, as the LL light client will be different enough anyway).

TODO: further discuss and clarify the implications of removing these two fields vs keeping them.

Add implicit and canonical data fields

E.g. transaction signatures should sign over the chain ID to prevent replay attacks and other exploits, but including the chain ID in each transaction is redundant; it can instead be implicit. The same applies to votes.
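A minimal sketch of the idea (illustrative hashing and field handling, not the spec's actual signing scheme): the chain ID is mixed into the signed bytes but never serialized on the wire.

```python
import hashlib

def sign_bytes(chain_id: bytes, tx_body: bytes) -> bytes:
    # The chain ID is implicit: it is part of what gets signed, but is not a
    # transaction field, so it costs no block space.
    return hashlib.sha256(chain_id + tx_body).digest()

# Signers and verifiers agree on chain_id out of band (e.g. from genesis),
# so a signature produced for one chain cannot be replayed on another.
```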

Storage node & protocol

We need to define the public API and the protocol to interact with a storage node.

This also plays into data availability (which may start being specced out in #2), as consensus participants (validators) that do not want to be storage nodes themselves will need to query them for data availability.

Quoting musalbas here:

So there should be an option to run:
a) validator that is also a storage node
b) storage node only
c) validator that just uses data availability proofs (update: not planned anymore)
d) light clients that use data availability proofs

Define main terminology

Define the main terminology (naming for key concepts specific to LazyLedger, but also some more general terms), e.g. clear terminology for the LazyLedger layer 1, sidechains, execution environments, Tx to layer 1, sidechain Tx, tbc ...


Add special state transition for end of block

After all transactions are processed, a special implicit state transition should be applied that e.g. properly changes the validator set based on the current state.

The state will also have to be changed to commit to some priority queue of queued validators, so that a state transition fraud proof can be created for this.

The core idea is to augment the queued validators with a singly-linked-list priority queue ordered by voting power. Every time the queue is updated, the links are updated, and this metadata is added by the block proposer as a wrapped transaction, allowing fraud proofs without knowledge of the whole queue.
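A rough sketch of such a queue as plain state entries (an illustrative structure, not the spec's actual state layout): each queued validator stores the ID of the next validator in descending voting-power order, so an insertion only touches a constant number of entries and can be checked with a handful of state inclusion proofs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueuedValidator:
    voting_power: int
    next_id: Optional[bytes]  # next validator in descending voting-power order

def insert(queue: dict[bytes, QueuedValidator], head: Optional[bytes],
           new_id: bytes, power: int) -> Optional[bytes]:
    """Insert into the singly-linked priority queue; returns the new head."""
    # The block proposer can supply the predecessor as metadata, so a verifier
    # only needs to check the predecessor and successor entries rather than
    # walking the whole queue as this naive version does.
    prev, cur = None, head
    while cur is not None and queue[cur].voting_power >= power:
        prev, cur = cur, queue[cur].next_id
    queue[new_id] = QueuedValidator(power, cur)
    if prev is None:
        return new_id
    queue[prev].next_id = new_id
    return head
```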

Merkle trees misc cleanup and fixes

https://github.com/lazyledger/lazyledger-specs/blob/master/specs/data_structures.md#sparse-merkle-tree

For leaf node of leaf message m with key k, its value v is:

Better to use a different word than message here (otherwise this might be conflated with the other use of the word messages)

Also, it would help with getting a quick understanding if there were 2-3 simple-to-follow examples of how an SMT will look (including trivial edge cases like the all-empty tree, a tree with only one entry, and one realistic example with a few entries).

Splitting into shares needs to know request length

Currently, shares for transactions, evidence, and intermediate state roots are arranged by first serializing an array of e.g. transactions, then splitting it into 255-byte shares (with the 256th byte reserved to indicate the starting location of the first transaction that starts in the share).

The issue here is that simply serializing an array of transactions doesn't return the length of each serialized transaction, so we can't properly compute the reserved byte value.

Splitting into shares should instead use the following procedure (a sketch in code follows the list):

  1. Serialize each transaction individually (which tells us the number of bytes for each transaction)
  2. Create the byte array serialize(length of serialize(tx1)) || serialize(tx1) || serialize(length of serialize(tx2)) || serialize(tx2) || ...
  3. Split the above byte array into 255-byte shares, and use the lengths from step (1) to compute reserved byte values
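A minimal sketch of this procedure, assuming a fixed 4-byte big-endian length prefix and 256-byte shares with one reserved byte (both illustrative, not the spec's actual encoding):

```python
SHARE_SIZE = 256                 # illustrative total share size in bytes
DATA_PER_SHARE = SHARE_SIZE - 1  # 255 bytes of data; 1 reserved byte

def split_into_shares(serialized_txs: list[bytes]) -> list[bytes]:
    # Step 1: transactions are serialized individually, so each length is known.
    stream = b""
    tx_starts = []
    for tx in serialized_txs:
        # Step 2: prefix each transaction with its serialized length.
        tx_starts.append(len(stream))
        stream += len(tx).to_bytes(4, "big") + tx
    # Step 3: split into 255-byte chunks; the reserved byte of each share is the
    # offset of the first transaction starting in that share (simplified here:
    # 0 if no transaction starts in the share).
    shares = []
    for i in range(0, len(stream), DATA_PER_SHARE):
        chunk = stream[i:i + DATA_PER_SHARE].ljust(DATA_PER_SHARE, b"\x00")
        first_start = next(
            (s - i for s in tx_starts if i <= s < i + DATA_PER_SHARE), 0)
        shares.append(bytes([first_start]) + chunk)
    return shares
```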

Inter-message padding shares

Currently, inter-message padding shares are paid for with a special transaction created by the block proposer. However, that may not be ideal. One alternative is to include this information as part of the block data in a way that is not a transaction, and thus does not need to be paid for. Another alternative is to set inter-message padding shares to the previous message's namespace ID and add some logic to prune padding shares from proofs, etc.

Related: #86.

Using Python syntax as pseudocode

I think Python syntax should be used as pseudocode, as it's almost identical to the current (fake) syntax, and since it's a real language it provides a point of reference.


Investigate canonical encoding scheme

Currently, a canonical variant of protobuf3 is used to define data structures and to do serialization/deserialization. This is great for portability, but isn't ideal for blockchain applications. For example, there are no fixed-size byte arrays (32-byte hashes are used all over the place in blockchains) or optional fields. Robust protobuf implementations for embedded devices (e.g. hardware wallets) or blockchain smart contracts (e.g. Solidity) are non-existent and would be prohibitive to develop.

The crux is that protobuf was designed for client-server communication, with different versions (i.e. both forwards and backwards compatibility are key features). This is unneeded for core blockchain data structures (e.g. blocks, votes, or transactions), but may be good for node-to-node communication (e.g. messages that wrap around block data, vote data, or transaction data).

We should investigate the feasibility of using a simpler serialization scheme for core data structures.

Desiderata:

  • fully deterministic (specifically, bijective)
  • binary, not text
  • native support for basic blockchain data types (esp. fixed-sized arrays)
  • typedefs / type aliases (i.e. zero-cost abstractions)
  • no requirements on backwards or forwards compatibility

Comparison of Different Schemes

Protobuf

https://developers.google.com/protocol-buffers/docs/overview

https://github.com/lazyledger/protobuf3-solidity-lib

https://github.com/cosmos/cosmos-sdk/blob/master/docs/architecture/adr-027-deterministic-protobuf-serialization.md

cosmos/cosmos-sdk#7488

protocolbuffers/protobuf#3521

XDR

https://tools.ietf.org/html/rfc4506

https://developers.stellar.org/docs/glossary/xdr

Veriform

https://github.com/iqlusioninc/veriform

SimpleSerialize (SSZ)

https://github.com/ethereum/eth2.0-specs/blob/dev/ssz/simple-serialize.md


SMT should hash leaf value

Exclusion proofs for SMTs are inclusion proofs of a different leaf (or of an empty leaf). Including the entire leaf data is redundant in such a case. We can optimize this by hashing the leaf data, i.e. v = h(k, h(leaf_data)).
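A one-line illustration of the proposed leaf value (the hash function choice is illustrative):

```python
import hashlib

def leaf_value(key: bytes, leaf_data: bytes) -> bytes:
    # v = h(k, h(leaf_data)): an exclusion proof only needs h(leaf_data),
    # not the full leaf data.
    h = hashlib.sha256
    return h(key + h(leaf_data).digest()).digest()
```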

Investigate signature aggregation

BLS can be used to aggregate signatures (i.e. represent an arbitrary number of unaggregated signatures as a fixed-length aggregated signature). This should allow for a drastic reduction in the size of Tendermint commit data.


Padding shares should be deterministic and consensus-enforced

Currently, padding between messages can have any namespace ID between the namespace IDs of the surrounding messages:

The non-interactive default rules may introduce empty shares that do not belong to any message (in the example above, the top-right share is empty). These must be explicitly disclaimed by the block producer using special transactions.

However, these padding shares aren't included when distributing the entire block all at once, since blocks aren't normally distributed laid out as shares. To fix this, the namespace ID of padding shares must be both deterministic and enforced by consensus.

Spec more Merkle proofs format

#47 cleaned up a number of issues around the Merkle tree specs. Additional proof formats still need to be defined:

  • Merkle multiproofs
  • Exclusion proof for Sparse Merkle tree (rather, inclusion proof of empty leaf)
  • Range proof for NMT

Can't prove messages aren't paid for

One issue with the current scheme of separating transactions and messages is that, while we can compactly prove that the message a transaction pays for has been included, we can't compactly prove that a message wasn't paid for.

The most obvious solutions are:

  1. Add some reserved bytes at the beginning of each message pointing to the transaction index that pays for them. This doesn't work without major changes, as it breaks the non-interactive default share layout.
  2. Accumulate <key: message start, value: <transaction index, message end (or message length)>> pairs in a Sparse Merkle Tree and commit to the root in the block header. This way we can prove exclusion directly.
  3. Require that transactions that pay for a message are ordered by message start (and implicitly namespace ID). This way we can prove exclusion with two inclusion proofs (sketched below).
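A sketch of the check implied by option 3, assuming consensus enforces that paying transactions are sorted by message start and that inclusion proofs for the two adjacent transactions are verified separately (boundary cases before the first and after the last transaction are omitted; names are illustrative):

```python
def proves_message_unpaid(msg_start: int,
                          tx_index_a: int, msg_start_a: int,
                          tx_index_b: int, msg_start_b: int) -> bool:
    # Given two transactions already proven to be included at adjacent indices
    # in a list ordered by message start, a message starting strictly between
    # their message starts cannot be paid for by any transaction.
    adjacent = tx_index_b == tx_index_a + 1
    ordered = msg_start_a < msg_start_b
    in_gap = msg_start_a < msg_start < msg_start_b
    return adjacent and ordered and in_gap
```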
