multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.

License: MIT License

Python 100.00%

multicodec's Introduction

multicodec

Canonical table of codecs used by various multiformats

Motivation
Description
Examples
Multicodec table
- Adding new multicodecs to the table
Implementations
Reserved Code Ranges
FAQ
Contribute
License

Motivation

Multicodec is an agreed-upon codec table. It is designed for use in binary representations, such as keys or identifiers (e.g. CIDs).

Description

The code of a multicodec is usually encoded as unsigned varint as defined by multiformats/unsigned-varint. It is then used as a prefix to identify the data that follows.
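As an illustration of the prefixing described above, here is a minimal Python sketch. The helper names encode_varint and add_prefix are made up for this example and are not part of any multicodec library; the code 0x55 is the raw entry quoted elsewhere on this page.

```python
def encode_varint(value):
    """Encode a non-negative integer as an unsigned varint.

    Seven bits are stored per byte, lowest group first; the high bit of a
    byte is set when more bytes follow.
    """
    if value < 0:
        raise ValueError("unsigned varints cannot encode negative numbers")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(group)         # last byte: continuation bit clear
            return bytes(out)


def add_prefix(code, data):
    """Prefix data with the varint-encoded multicodec code (hypothetical helper)."""
    return encode_varint(code) + data


RAW = 0x55  # "raw" as quoted from the table on this page
prefixed = add_prefix(RAW, b"hello")
print(prefixed.hex())
```

A reader of the resulting bytes decodes the leading varint first and then interprets the remaining bytes according to the codec it names.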

Examples

Multicodec is used in various Multiformats. In Multihash it is used to identify the hashes, in the machine-readable Multiaddr to identify components such as IP addresses, domain names, identities, etc.

Multicodec table

Find the canonical table of multicodecs at table.csv. There's also a sortable viewer.

Status

Each multicodec is marked with a status:

draft - this codec has been reserved but may be reassigned if it doesn't gain wide adoption.
permanent - this codec has been widely adopted and may not be reassigned.
deprecated - this codec has been deprecated.

NOTE: Just because a codec is marked draft, don't assume that it can be re-assigned. Check to see if it ever gained wide adoption and, if so, mark it as permanent.

Adding new multicodecs to the table

The process to add a new multicodec to the table is the following:

Fork this repo
Add your codecs to the table. Each newly proposed codec must have:
A unique code.
A unique name.
A category.
A status of "draft".
Submit a Pull Request

This "first come, first assign" policy is a way to assign codes as they are most needed, without increasing the size of the table (and therefore the size of the multicodecs) too rapidly.

Codes below 0x80 (the first 128 values) are encoded as a single-byte varint, hence they are reserved for the most widely used multicodecs. So if you are adding your own codec to the table, you will most likely want to ask for a code of 0x80 or greater.
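To see how the encoded size grows with the value of the code, the following sketch counts the bytes an unsigned varint needs. The function name varint_length is an assumption made for this example.

```python
def varint_length(code):
    """Return how many bytes the unsigned-varint encoding of code occupies."""
    if code < 0:
        raise ValueError("codes are non-negative")
    length = 1
    while code >= 0x80:  # each extra byte carries 7 more bits
        code >>= 7
        length += 1
    return length


for code in (0x00, 0x7F, 0x80, 0x3FFF, 0x4000):
    print(hex(code), varint_length(code))
```

Every code up to 0x7f fits in one byte, which is why that part of the table is scarce.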

Codec names should be easily convertible to constants in common programming languages using basic transformation rules (e.g. upper-case, conversion of - to _, etc.). Therefore they should contain only alphanumeric characters and delimiters, with the first character being alphabetic. The primary delimiter for multi-part names should be -, with _ reserved for cases where a secondary delimiter is required. For example: bls12_381-g1-pub contains 3 parts: bls12_381, g1 and pub, where bls12_381 is "BLS12 381", which is not commonly written as "BLS12381" and therefore requires a secondary separator.
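One possible reading of these naming rules, sketched in Python. The regular expression and the helper names to_constant and split_parts are assumptions for illustration; they are not taken from validate.py or any implementation.

```python
import re

# Assumed reading of the rules: starts with a letter, then alphanumeric
# groups separated by "-" (primary) or "_" (secondary) delimiters.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:[-_][A-Za-z0-9]+)*")


def to_constant(name):
    """Turn a codec name into an upper-case constant name."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"name does not follow the naming rules: {name!r}")
    return name.replace("-", "_").upper()


def split_parts(name):
    """Split a name into its parts on the primary delimiter only."""
    return name.split("-")


print(to_constant("bls12_381-g1-pub"))
print(split_parts("bls12_381-g1-pub"))
```

Splitting only on - keeps bls12_381 together as one part, which is the point of reserving _ as the secondary delimiter.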

The validate.py script can be used to validate the table once it's edited.
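The kind of check involved can be sketched as follows. This is not the actual validate.py; the column names (name, category, code, status) and the sample rows are assumptions invented for this example, with codes taken from the Private Use Area so they do not clash with real entries.

```python
import csv
import io

ALLOWED_STATUSES = {"draft", "permanent", "deprecated"}


def check_table(csv_text):
    """Return a list of problems found in a table given as CSV text."""
    problems = []
    names_seen = set()
    codes_seen = set()
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = row["name"].strip()
        if name in names_seen:
            problems.append(f"line {line_no}: duplicate name {name}")
        names_seen.add(name)

        try:
            code = int(row["code"].strip(), 16)  # accepts "0x55" style values
        except ValueError:
            problems.append(f"line {line_no}: code is not hexadecimal")
        else:
            if code in codes_seen:
                problems.append(f"line {line_no}: duplicate code {hex(code)}")
            codes_seen.add(code)

        if not row["category"].strip():
            problems.append(f"line {line_no}: missing category")
        if row["status"].strip() not in ALLOWED_STATUSES:
            problems.append(f"line {line_no}: unknown status")
    return problems


# Made-up rows; none of these names exist in the real table.
SAMPLE = "\n".join([
    "name,category,code,status",
    "alpha,example,0x300000,draft",
    "beta,example,0x300001,permanent",
    "beta,example,0x300002,draft",      # duplicate name
    "gamma,example,0x300000,draft",     # duplicate code
    "delta,example,0x300003,stable",    # unknown status
])

for problem in check_table(SAMPLE):
    print(problem)
```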

Implementations

go
JavaScript
Python
- py-multicodec
- multicodec sub-module of Python module multiformats
Haskell
Elixir
Scala
Ruby
Java
- java-multicodec
- copper-multicodec
Kotlin
- multicodec part of Kotlin project multiformat
Add yours today!

Reserved Code Ranges

The following code ranges have special meaning and may only have meanings assigned as specified in their description:

Private Use Area

Range: 0x300000 – 0x3FFFFF

Codes in this range are reserved for internal use by applications and will never be assigned any meaning as part of the Multicodec specification.
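An application that handles both registered and internal codes might guard the boundary with a check like the one below. The constant and function names are assumptions for this sketch; only the range itself comes from the text above.

```python
PRIVATE_USE_START = 0x300000
PRIVATE_USE_END = 0x3FFFFF  # inclusive upper bound, as stated above


def is_private_use(code):
    """Return True if code lies in the Private Use Area."""
    return PRIVATE_USE_START <= code <= PRIVATE_USE_END


print(is_private_use(0x300000), is_private_use(0x55))
```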

FAQ

Why varints?

So that we have no limitation on protocols.

What kind of varints?

A most-significant-bit (MSB) unsigned varint, as defined by multiformats/unsigned-varint.

Don't we have to agree on a table of protocols?

Yes, but we already have to agree on what protocols themselves are, so this is not so hard. The table even leaves some room for custom protocol paths, or you can use your own tables. The standard table is only for common things.

Where did multibase go?

For a period of time, the multibase prefixes lived in this table. However, multibase prefixes are symbols that may map to multiple underlying byte representations (that may overlap with byte sequences used for other multicodecs). Including them in a table for binary/byte identifiers led to more confusion than it solved.

You can still find the table in multibase.csv.

Can I use multicodec for my own purpose?

Sure, you can use multicodec whenever you have the need for self-identifiable data. Just prefix your own data with the corresponding varint-encoded multicodec.
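On the reading side, a consumer strips the prefix before handing the payload to the right decoder. A minimal sketch, with decode_varint and split_prefix as made-up helper names:

```python
def decode_varint(buf):
    """Decode an unsigned varint from the start of buf.

    Returns (value, number_of_bytes_consumed).
    """
    value = 0
    shift = 0
    for index, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # continuation bit clear: this was the last byte
            return value, index + 1
        shift += 7
    raise ValueError("truncated varint")


def split_prefix(buf):
    """Split prefixed bytes into (code, payload)."""
    code, consumed = decode_varint(buf)
    return code, buf[consumed:]


code, payload = split_prefix(b"\x55hello")  # 0x55 is "raw" on this page
print(hex(code), payload)
```

An application would then look the code up in its own copy of the table to decide how to interpret the payload.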

Contribute

Contributions welcome. Please check out the issues.

Check out our contributing document for more information on how we work, and about contributing in general. Please be aware that all interactions related to multiformats are subject to the IPFS Code of Conduct.

Small note: If editing the README, please conform to the standard-readme specification.

License

multicodec's Issues

comparison with cbor

Trying to understand the difference between this and cbor. Looks like both use some pre-defined values to represent their content. Wondering what is the difference, and if there are any cost/performance pros/cons.

Not to mention, this multicodec also supports cbor as one of the types in the tables. That sounds like a double / duplicate content description (first one telling the rest of the content is 'cbor', and then 'cbor' telling that the rest of the content is, say ip-address or url or binary etc.). Not sure why would one want to use such duplicated representation, unnecessarily increasing the content length.

Would cbor be not equal to something like multihash rather ? wondering which one is more preferred for future-proof and storing content into DB.

Multicodec trailing slash discrepancy

go-multicodec uses /json as an example, spec says /json/.

@dignifiedquire could you take a look at what js-multicodec is using?

Add safe-encoding (better base32/base64/base85)

See https://github.com/kstenerud/safe-encoding for more info. The standard is designed to be more web-friendly than common versions of base32 (RFC 4648, Z-base, Crockford), base64 (RFC 4648, RFC 4648 §5 or URL, Uuencoding, BinHex) and base85 (Btoa, Z85, Adobe).

Request: Media Container Formats

Examples: mkv, mk4

Add IPLD CIDs as a multicodec type

Add IPLD CIDs as a multicodec type (ref https://github.com/ensdomains/multicodec/issues/1)

Need a code for `dag-json`

I'm working on dag-json https://github.com/mikeal/dag-json and I need a code for creating new CIDs.

I was going to fork the repo and send a PR but I wasn't quite sure where it should go in the table :(

Java Multicodec Implementation

Hello!
I have started working on the Java Implementation of Multicodec: https://github.com/AliabbasMerchant/java-multicodec
Tagging @ianopolous

Request: perceptual and/or locality sensitive hash functions?

The udp code is wrong

According to the multiaddr repo, the code for udp is 0x011. However, in this repo, it's 0x0111. Unfortunately, sha1 is 0x011 here...

Also, some of these numbers appear to be varint encodes, others not?

Outdated README

The current README is heavily outdated. You can see that there are several leftovers from earlier versions of the multiformats stuff (e.g. references to multistreams). The current README defines multicodec as being a data format/encoding like the other multiformats are. It's specified as <codec-encoded-as-varint><data>.

I went over the other multiformats (multiaddr, multihash, multibase, multistream-select, cid). There the term "multicodec" is used (if used) as being an identifier encoded as varint, hence just as <codec-encoded-as-varint> without the <data> piece.

I can see two ways moving forward. We could expand the multicodec format and use it also in other specs, or we drop the multicodec format and use just the codec table.

Expanding the multicodec format

Currently the only case where we use <codec-encoded-as-varint><data> is within CIDs. To expand it to other specs it would need to change a bit to be useful. It could be <codec-encoded-as-varint><data-length-as-varint><data> with optionally just <codec-encoded-as-varint><data> if the codec specifies a fixed size (e.g. multiaddr when using ipv4).

That way multihash would just be a multicodec. Also the binary representation of multiaddr would just be a list of multicodecs.
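The length-prefixed variant proposed here could be sketched as below. This only illustrates the <codec-encoded-as-varint><data-length-as-varint><data> layout from the proposal; it is not an adopted format, and all helper names are invented for the example.

```python
def encode_varint(value):
    """Unsigned varint: 7 bits per byte, lowest group first."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def decode_varint(buf, offset=0):
    """Return (value, next_offset) for the varint starting at offset."""
    value = 0
    shift = 0
    while offset < len(buf):
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return value, offset
        shift += 7
    raise ValueError("truncated varint")


def frame(code, data):
    """Build <code><length><data> for one item."""
    return encode_varint(code) + encode_varint(len(data)) + data


def parse_frames(buf):
    """Parse a concatenation of frames into a list of (code, data)."""
    items = []
    offset = 0
    while offset < len(buf):
        code, offset = decode_varint(buf, offset)
        length, offset = decode_varint(buf, offset)
        if offset + length > len(buf):
            raise ValueError("frame is longer than the remaining input")
        items.append((code, buf[offset:offset + length]))
        offset += length
    return items


stream = frame(0x55, b"hi") + frame(0x80, b"")
print(parse_frames(stream))
```

With an explicit length, a reader can skip items whose code it does not recognise, which a bare <codec-encoded-as-varint><data> layout does not allow.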

Dropping multicodec as a format

We could also just remove the <codec-encoded-as-varint><data> part and define multicodec as just being a table of codecs which is used by the various multiformats.

If that's the plan I'd suggest renaming it to "multiformat codecs" and moving the table into the multiformats repo. Then removing the multicodec repository as it would be meaningless.

/cc @mikeal

Add ed25519-sha2 signature code

Since it is possible to encode ed25519 (sha2?) public key already with 0xed, it would be great to encode ed25519-sha2 signature in the same way.

The following related issues are not actively updated, but the introduction of "multisig"/multisign may greatly help us in https://github.com/hyperledger/iroha to migrate from one cryptography algorithm to another without losing backward compatibility.
multiformats/multiformats#23
#125

Ping @yusefnapora

Reserving ranges

Add the ability to reserve ranges in the multicodec table, à la Protobuf.

sha1 x11 conflict

sha1 0x11
x11 0x1100

Request: Archive Formats

Examples: tar, zip

Request: Serialization Formats

Missing:

Create initial public release

Please create a proper "1.0" or "0.1" release for this product.

Also adopt a policy of committing to backwards-compatible releases using SemVer.

As it stands, nobody should adopt multicodec in their applications because there is no guarantee that multicodec is a fixed target.

Unicode characters

It would be funny and useful to have a table mapping Unicode characters (e.g. 📠) to protocols.

I am not sure how we would decide on certain protocols etc., but I am honestly interested in hearing your feedback on that idea.

multicodec-packed

I just added a PR for multicodec-packed: #9 -- should this be a separate repo or this one? i'm inclined to keep it here (since it's so similar) until it demonstrates the need to move out. other thoughts?

define a codec table format

This is a first attempt at creating a format for defining codec tables. Codec tables map single bytes to multiple bytes, in this case ASCII strings which represent a multicodec prefix.

Table definition

Field	Type	Description
count	`varuint32`	count of type entries to follow
entries	`type_entry*`	repeated type entries as described below

Entry definition

Field	Type	Description
count	`varuint32`	the length of the string of a multicodec prefix
prefix	`bytes`	the prefix string
packed encoding	`byte`	The packed encoding of prefix

This is effectively layer 0 compression: a simple binary encoding of the prefix bytes and related data structures. The encoding is dense and trivial to interact with.

Add Swarm hashes as a multicodec type

Request to add Swarm hashes as a multicodec type (ref https://github.com/ensdomains/multicodec/issues/1#issuecomment-442640712)

Add BD_ADDR

Bluetooth devices have unique identifiers known as BD_ADDR. These should be added to multiformats and may be useful, especially if libp2p adds a Bluetooth-specific transport. Another thing to consider is the addition of a MAC address identifier.

Complete the table with remaining multiformat types

It would be cool to bring the remaining tables (multihash, multiaddr, etc) to the main multicodec table (https://github.com/multiformats/multicodec#prefix-examples), so that we can avoid overlapping codes if we can.

Note: Having overlapping codes is not necessarily a bad thing if the overlap comes from the table extensions, but if we can avoid it, less hassle to handle.

Missing dnsaddr, dns4, dns6 codecs

The dnsaddr, dns4, and dns6 codecs as used here are missing from the csv table of codecs.

Missing codec: raw string

I think it would be awesome to use this table as a general standard for self describing formats (outside of ipld and ipfs I mean). However a lot of basic things are missing. One I would like to use is just utf8 string. Sometimes, you might just want to send an error message encoded as a string.

I don't know if it would make sense to add all the other string encodings? Personally I try to use utf8 for everything.

It's not clear to me either how nested things can be expressed. gzipped xxx? Maybe compression is something that should be set on the connection rather than the package.

add eth-account

We need eth-account (0x92) to correctly traverse account data.

Table: Application specific ranges

We were going to keep some application specific ranges. I'm not finding them in the table. We should add them back.

Ideally we should have a function that keeps some numbers within every varint size. Meaning that:

a 1-byte varint should have a few values reserved for app-specific codes.
a 2-byte varint should have some too, many more than 1-byte.
etc.

Effectively: come up with a function appSpecific(code int) bool that:

assigns some values at every varint range
is simple to understand

For example, one such function could be:

func appSpecific(code int) bool {
    // keeps 8 codes every 128: the top 8 values of each 128-code block.
    return code % 128 >= (128 - 8)
}

json missing from table.csv

Hi, I'm trying to figure out why some values in the table.csv have codes and others don't, and why it includes bson but not json when json is clearly implemented (at least as far as I can see in the go-multicodec repo). Can someone enlighten me? Thanks!

Add the UNIX protocol

Constant currently used in go-multiaddr is 0x0190.

maybe pull the changes into here?

hey @diasdavid maybe pull the multistream changes into here?

This was a WIP i wrote way before the conversation we had on the train from Slovenia to Berlin. So it's outdated. Then-- i was considering switching the newline in favor of a / so that it didnt break things like ndjson.... but that may be a fraught endeavor. in any case, the last char is changeable, as it's included in the varint. maybe we could leave it up to the users to decide? (not sure). the niceness of the \n for streams is that it plays well with telnet.

Request: jpg/png

I see them in the spec table, but they're blank. Can we get some prefixes for those?

haskell multicodec

Transferred ownership of the multicodec implementation to the organisation. I didn't expect it to get integrated automatically, so tell me if I need to remove it and file a request before redoing the transfer (the implementation is working and complete). Also, if you have the time and courage :D, please take a look at it and double check that it matches the spec (I followed js-multicodec, but go-multicodec seemed different, so I was a bit confused).

Possibly deduplicate raw/identity

We currently have 0x55 for DagRAW and 0x0 for the "identity" multihash. These are really the same thing and could use the same codec. Thoughts? This would, unfortunately, have to change slowly over a long time but it still annoys me that we have both.

What's the difference between multicodec and multistreams

From the multicodec README:

Multistreams are self-describing protocol/encoding streams.

From the multistream README:

Multicodecs are self-describing protocol/encoding streams.

Each project has a readme that leads by calling the other project a thing, and they call the projects the same thing.

Define multicodec's prefix

Suppose I come across a multicodec-encoded stream in the wild, and have no context for it. After some investigating I might conclude everything after the first few bytes is a standard format. But what are those first few bytes?

If there were a registered prefix for multicodec itself, then all stored/transmitted multicodec streams could start with the same recognizable "magic number". Once the stream is recognized as a multicodec stream, people/machines will know to look at the prefix in the next few bytes to determine the nature of the payload.

Clean up table

The table is messy. we should:

remove, or separate out, all entries that do not yet have a code assigned
keep it sorted by code

Request: encryption schemes

I see a need for standardized encryption encoded outputs, comparable to hashes.

e.g. <nonce/iv>

URDNA2015 Support

Has support for RDF Dataset Normalization and URDNA2015 (Universal RDF Dataset Normalization Algorithm) been discussed here?

Not sure quite how it would fit in with multicodec and multihash, but this hashing algorithm is a bit different from others in that the data (possibly submitted in JSON-LD form) first goes through a canonicalization algorithm and then gets hashed. The idea is that a single RDF dataset gets a consistent hash regardless of how it's serialized (i.e. different field orderings and whitespace in JSON or a different format like XML will all produce the same hash for the same data).

add crypto key types

multikey is not implemented yet, but I need to encode a secp256k1 public key. Can the secp256k1 key type be added here?

Add an IPNS code

IPFS addresses are good enough to point to static resources, but the corresponding data cannot be updated (an update changes the resource hash, and thus gives a new IPFS address by design).

This issue is solved by IPNS names, which are like tags pointing to an IPFS file. IPNS names can be updated, which makes them a great anchor for a website (for example). Yet IPNS is not listed in the codecs.

Adding a code for ipns-ns (0xe5?) would allow new synergies for easily upgradeable decentralized UIs. The current alternative (frequently updating an ENS entry) is expensive.

Automate table sync between repos

There are csv tables spread over many repos. There should be one source of truth (multicodec/table.csv), and repos should pull it from there.

Text, raw varint, bytecode distinction

miscellaneous,,
raw, raw binary, 0x55

OK. This is a valid raw varint code.

bases encodings,,
identity, raw binary, NUL

Empty string or 0x00 or what?

base1, unary, "1"

Only when I open the raw table.csv do I see the " in there.

multiaddrs
...
udp 0x0111

This is not a valid raw varint code.

Can the same code appear in different sections?

In multiformats/multihash#55 (comment) I use a raw varint prefix:

0x84... - Varint Prefix for Merkle Hash Tree Root

Is there a conflict with sctp?

sctp, , 0x84

Mimetypes as codes

Naming every possible type is already done with mime types.
I think it would make sense to use them as codes, instead of defining your own.

"image/jpg" or "text/plain" make nice paths btw ;)

What are your thoughts on it, did I miss something?

Quic and CBOR have the same multicodec

Both use 0x51 (OT: this also happens to be 'Q', the base58btc encoding of 0x12, the SHA2-256 multicodec and the prefix of all of our v0 CIDs).

Make CID versions multicodecs

This way, we can trivially disambiguate CIDs from, e.g., raw multihashes.

(To do this, we'd need to steal 0x1 from unary for CIDv1).

Sorting in "Sortable Table" is broken.

Table: https://ipfs.io/ipfs/QmXec1jjwzxWJoNbxQF5KffL8q6hFXm9QwUGaa3wKGk6dT/#title=Multicodecs&src=https://raw.githubusercontent.com/multiformats/multicodec/master/table.csv

Source: https://github.com/Stebalien/static-table

Add application specific/development codecs

For dev purposes, it is good to have a range of valid codecs that can be used for developing support for new protocols. I think we had something like that in the past but I can't find it in the table right now.

Too many Code duplicates

Such as:

raw 0x55 - base64urlpad 'U', which is 0x55
bencode 0x63 - base32pad 'c', which is 0x63
dbl-sha2-256 0x56 - base32hex-upper 'V', which is 0x56

And,

multihash 0x31 - base1 '1', which is 0x31
multicodec 0x30 - base2 '0', which is 0x30
dns6 0x37 - base8 '7', which is 0x37
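The overlaps listed above can be reproduced mechanically by comparing each multibase prefix character's byte value with the assigned codes. The two dictionaries below contain only the entries quoted in this issue, not the full tables.

```python
# Entries copied from the list above; not the complete tables.
CODES = {
    "raw": 0x55,
    "bencode": 0x63,
    "dbl-sha2-256": 0x56,
    "multihash": 0x31,
    "multicodec": 0x30,
    "dns6": 0x37,
}

MULTIBASE_PREFIXES = {
    "base64urlpad": "U",
    "base32pad": "c",
    "base32hex-upper": "V",
    "base1": "1",
    "base2": "0",
    "base8": "7",
}


def find_collisions(codes, prefixes):
    """Return (codec_name, base_name, byte_value) for every shared value."""
    by_value = {value: name for name, value in codes.items()}
    collisions = []
    for base_name, char in prefixes.items():
        value = ord(char)  # byte value of the ASCII prefix character
        if value in by_value:
            collisions.append((by_value[value], base_name, value))
    return collisions


for codec_name, base_name, value in find_collisions(CODES, MULTIBASE_PREFIXES):
    print(f"{codec_name} and {base_name} both use {hex(value)}")
```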

What's the data format (or is it compatible) for "bmt multihash 0xd6 Binary Merkle Tree Hash" to hash binary forest pairs that define a function that returns byte[] id of another object (id of an id generator then its id)?

How would I give the context in the id to say what its hashing?

Something like... QmmultihashId...sdfsdf:bmt:occamsfuncerV1:contextSpecificId4234234 but that seems like too many names in an id, and there's the binary form.

It is only a forest of pairs and leaf is all 0s (undecided on how many 0s), of calls of a universal lambda function which would for example derive a preprocessing of small literal bitstrings to quote them else use a securehash and may have bit masks etc... Since I dont want to deal with the never ending complexity of updating software when new stuff is added, I just want to use multihash to (less efficiently) name every possible binary forest (that generates a function) and let the functions do the rest, further optimizing, etc.

multicodec-table module

It will make it easier to keep the latest multicodec table in sync across modules if the spec itself exports it as a module for implementations to consume. The modules for each repo can live inside this repo and get published with table updates and/or changes.

Create a CJS module once #16 is merged.

Add multicodec for libp2p key

We need a new multicodec to unlock PeerID representation in CIDv1 for subdomains.

Should I just PR a proposal with one? Which range should it be in?

Context

IPNS: should work with case-insensitive identifiers (Base32) ipfs/kubo#5287
@Stebalien described the need in comment:
- The current peer ID format is just base58btc(<sha2-256 multicodec><256><digest>)
  (just a multihash)
- The new format would be
  <base32 multibase> ++ base32(<cidv1><libp2p key multicodec><the multihash>)
That is, we'd use an entirely new multicodec.
Wider context:
- CID in subdomain: ipfs/in-web-browsers#89
- Migration to CIDv1 (default base32) (https://github.com/ipfs/ipfs/issues/337)