multicodec's Introduction

multicodec

Canonical table of codecs used by various multiformats

Motivation

Multicodec is an agreed-upon codec table. It is designed for use in binary representations, such as keys or identifiers (e.g. CIDs).

Description

The code of a multicodec is usually encoded as an unsigned varint, as defined by multiformats/unsigned-varint. It is then used as a prefix to identify the data that follows.
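As an illustration of the prefixing described above, here is a minimal sketch in Python. The function names `encode_varint` and `add_prefix` are made up for this example and are not part of any multiformats library; the sketch also does not enforce any size limits the unsigned-varint spec may impose.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint.

    Each output byte carries 7 payload bits, least significant group first.
    The top bit of a byte is set when more bytes follow.
    """
    if value < 0:
        raise ValueError("unsigned varints cannot encode negative numbers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def add_prefix(code: int, data: bytes) -> bytes:
    """Prepend the varint-encoded code (hypothetical helper) to the data."""
    return encode_varint(code) + data


# 0x55 is the code quoted for `raw` elsewhere on this page.
prefixed = add_prefix(0x55, b"hello")
```

A reader of `prefixed` decodes the leading varint first and then interprets the remaining bytes according to the code it found.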

Examples

Multicodec is used in various Multiformats. In Multihash it identifies the hash function; in the machine-readable Multiaddr it identifies components such as IP addresses, domain names, identities, etc.

Multicodec table

Find the canonical table of multicodecs at table.csv. There's also a sortable viewer.

Status

Each multicodec is marked with a status:

  • draft - this codec has been reserved but may be reassigned if it doesn't gain wide adoption.
  • permanent - this codec has been widely adopted and may not be reassigned.
  • deprecated - this codec has been deprecated.

NOTE: Just because a codec is marked draft, don't assume that it can be re-assigned. Check to see if it ever gained wide adoption and, if so, mark it as permanent.

Adding new multicodecs to the table

The process to add a new multicodec to the table is the following:

  1. Fork this repo
  2. Add your codecs to the table. Each newly proposed codec must have:
       • A unique code.
       • A unique name.
       • A category.
       • A status of "draft".
  3. Submit a Pull Request

This "first come, first assign" policy is a way to assign codes as they are most needed, without increasing the size of the table (and therefore the size of the multicodecs) too rapidly.

Codes below 0x80 (the first 128 values) are encoded as a single-byte varint, hence they are reserved for the most widely used multicodecs. So if you are adding your own codec to the table, you will most likely want to ask for a code of 0x80 or above.
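To see where the single-byte boundary falls, the encoded size of a code can be computed from its bit length, since each varint byte carries 7 payload bits. This is a small sketch; `varint_size` is a name chosen for this example only.

```python
def varint_size(code: int) -> int:
    """Number of bytes an unsigned varint needs for `code`.

    Every byte holds 7 payload bits, so the size is the bit length
    divided by 7, rounded up, with a minimum of one byte for zero.
    """
    if code < 0:
        raise ValueError("codes must be non-negative")
    return max(1, (code.bit_length() + 6) // 7)


# Sizes at the boundaries between one, two and three bytes.
sizes = {hex(c): varint_size(c) for c in (0x00, 0x7F, 0x80, 0x3FFF, 0x4000)}
```

Under this calculation, 0x7F is the largest code that still fits in one byte, and 0x80 is the first that needs two.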

Codec names should be easily convertible to constants in common programming languages using basic transformation rules (e.g. upper-case, conversion of - to _, etc.). Therefore they should contain only alphanumeric characters and delimiters, with the first character being alphabetic. The primary delimiter for multi-part names should be -, with _ reserved for cases where a secondary delimiter is required. For example: bls12_381-g1-pub contains 3 parts: bls12_381, g1 and pub, where bls12_381 is "BLS12-381", which is not commonly written as "BLS12381" and therefore requires a secondary separator.
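The naming rules above can be sketched as a pattern plus a conversion helper. Both `NAME_RE` and `to_constant` are hypothetical names for this example, and the restriction to lower-case letters is an assumption; the rules above only require alphanumeric parts and an alphabetic first character.

```python
import re

# Assumed pattern: a lower-case letter first, then alphanumeric parts
# joined by "-" (primary delimiter) or "_" (secondary delimiter).
NAME_RE = re.compile(r"[a-z][a-z0-9]*(?:[-_][a-z0-9]+)*")


def is_valid_name(name: str) -> bool:
    """Check a codec name against the assumed pattern."""
    return NAME_RE.fullmatch(name) is not None


def to_constant(name: str) -> str:
    """Apply the basic transformation: '-' becomes '_', then upper-case."""
    if not is_valid_name(name):
        raise ValueError(f"not a valid codec name: {name!r}")
    return name.replace("-", "_").upper()


constant = to_constant("bls12_381-g1-pub")  # "BLS12_381_G1_PUB"
```

Note that the conversion is lossy: once - and _ are both mapped to _, the distinction between primary and secondary delimiters cannot be recovered from the constant.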

The validate.py script can be used to validate the table once it's edited.
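For a rough idea of the kind of checks involved, the requirements listed above (unique code, unique name, a category, a known status) can be expressed in a few lines. This is not the repository's validate.py; it is an independent sketch, and the column names `name`, `category`, `code` and `status` are assumptions made for the example.

```python
import csv
import io

ALLOWED_STATUSES = {"draft", "permanent", "deprecated"}


def check_table(csv_text: str) -> list:
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    seen_names = set()
    seen_codes = set()
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        name = row["name"].strip()
        code = int(row["code"].strip(), 16)  # accepts values like "0x55"
        if name in seen_names:
            problems.append(f"line {line_no}: duplicate name {name!r}")
        if code in seen_codes:
            problems.append(f"line {line_no}: duplicate code {code:#x}")
        if not row["category"].strip():
            problems.append(f"line {line_no}: missing category")
        if row["status"].strip() not in ALLOWED_STATUSES:
            problems.append(f"line {line_no}: unknown status")
        seen_names.add(name)
        seen_codes.add(code)
    return problems


# Made-up rows for illustration; the last one breaks all four rules.
sample = (
    "name,category,code,status\n"
    "raw,ipld,0x55,permanent\n"
    "example-codec,ipld,0x3000,draft\n"
    "example-codec,,0x55,final\n"
)
problems = check_table(sample)
```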

Implementations

Reserved Code Ranges

The following code ranges have special meaning and may only have meanings assigned as specified in their description:

Private Use Area

Range: 0x300000 – 0x3FFFFF

Codes in this range are reserved for internal use by applications and will never be assigned any meaning as part of the Multicodec specification.
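An application that mixes private and registered codes may want to tell them apart. A minimal sketch, using the range stated above (the constant and function names are chosen for this example):

```python
PRIVATE_USE_START = 0x300000
PRIVATE_USE_END = 0x3FFFFF  # inclusive upper bound of the range above


def is_private_use(code: int) -> bool:
    """True if `code` lies in the Private Use Area."""
    return PRIVATE_USE_START <= code <= PRIVATE_USE_END
```

For example, `is_private_use(0x300001)` is true, while `is_private_use(0x55)` is false.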

FAQ

Why varints?

So that we have no limitation on protocols.

What kind of varints?

A Most Significant Bit (MSB) unsigned varint, as defined by multiformats/unsigned-varint.

Don't we have to agree on a table of protocols?

Yes, but we already have to agree on what protocols themselves are, so this is not so hard. The table even leaves some room for custom protocol paths, or you can use your own tables. The standard table is only for common things.

Where did multibase go?

For a period of time, the multibase prefixes lived in this table. However, multibase prefixes are symbols that may map to multiple underlying byte representations (that may overlap with byte sequences used for other multicodecs). Including them in a table for binary/byte identifiers led to more confusion than it resolved.

You can still find the table in multibase.csv.

Can I use multicodec for my own purpose?

Sure, you can use multicodec whenever you have the need for self-identifiable data. Just prefix your own data with the corresponding varint-encoded multicodec.
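On the reading side, a consumer strips the prefix before handling the payload. The following is a sketch only; `decode_varint` and `split_prefix` are illustrative names, not an existing API, and the sketch does not enforce any maximum varint length.

```python
def decode_varint(buf: bytes) -> tuple:
    """Decode an unsigned varint from the start of `buf`.

    Returns (value, number_of_bytes_consumed).
    """
    value = 0
    shift = 0
    for index, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift  # low 7 bits are payload
        if byte & 0x80 == 0:             # continuation bit clear: last byte
            return value, index + 1
        shift += 7
    raise ValueError("truncated varint")


def split_prefix(buf: bytes) -> tuple:
    """Split prefixed data into (code, payload)."""
    code, consumed = decode_varint(buf)
    return code, buf[consumed:]


code, payload = split_prefix(b"\x55hello")  # 0x55 is quoted above as `raw`
```

The application then looks up `code` in whichever table it uses and decides how to interpret `payload`.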

Contribute

Contributions welcome. Please check out the issues.

Check out our contributing document for more information on how we work, and about contributing in general. Please be aware that all interactions related to multiformats are subject to the IPFS Code of Conduct.

Small note: If editing the README, please conform to the standard-readme specification.

License

This repository is only for documents. All of these are licensed under the CC-BY-SA 3.0 license © 2016 Protocol Labs Inc. Any code is under an MIT license © 2016 Protocol Labs Inc.

multicodec's People

Contributors

acud, arachnid, daviddias, dmitrizagidulin, expede, filips123, gozala, hsanjuan, i-norden, jbenet, kenadia, kubuxu, kumavis, lidel, marcopolo, marten-seemann, melekes, molekilla, oed, pjkundert, richardlitt, rvagg, samli88, stebalien, tplooker, vmx, vyzo, warpfork, whyrusleeping, willscott

multicodec's Issues

comparison with cbor

Trying to understand the difference between this and cbor. Looks like both use some pre-defined values to represent their content. Wondering what the difference is, and whether there are any cost/performance pros/cons.

Not to mention, multicodec also supports cbor as one of the types in the table. That sounds like a duplicate content description (the first one telling that the rest of the content is 'cbor', and then 'cbor' telling that the rest of the content is, say, an ip-address or url or binary etc.). Not sure why one would want to use such a duplicated representation, unnecessarily increasing the content length.

Wouldn't cbor rather be equivalent to something like multihash? Wondering which one is preferred for future-proofing and storing content in a DB.

The udp code is wrong

According to the multiaddr repo, the code for udp is 0x011. However, in this repo, it's 0x0111. Unfortunately, sha1 is 0x011 here...

Also, some of these numbers appear to be varint encodes, others not?

Outdated README

The current README is heavily outdated. You can see that there are several leftovers from earlier versions of the multiformats stuff (e.g. references to multistreams). The current README defines multicodec as being a data format/encoding like the other multiformats are. It's specified as <codec-encoded-as-varint><data>.

I went over the other multiformats (multiaddr, multihash, multibase, multistream-select, cid). There the term "multicodec" is used (if used) as being an identifier encoded as a varint, hence just as <codec-encoded-as-varint> without the <data> piece.

I can see two ways moving forward. We could expand the multicodec format and use it also in other specs, or we drop the multicodec format and use just the codec table.

Expanding the multicodec format

Currently the only case where we use <codec-encoded-as-varint><data> is within CIDs. To expand it to other specs it would need to change a bit to be useful. It could be <codec-encoded-as-varint><data-length-as-varint><data> with optionally just <codec-encoded-as-varint><data> if the codec specifies a fixed size (e.g. multiaddr when using ipv4).

That way multihash would just be a multicodec. Also the binary representation of multiaddr would just be a list of multicodecs.
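To make the proposed layout concrete, here is a sketch of the framing with an optional fixed size. Everything here is illustrative: the `frame` function, the `FIXED_SIZES` mapping and the choice of 0x04 as a stand-in for a fixed-size codec (such as a 4-byte IPv4 address) are assumptions for the example, not part of any spec.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


# Placeholder mapping from code to fixed payload size in bytes.
FIXED_SIZES = {0x04: 4}


def frame(code: int, data: bytes) -> bytes:
    """Build <code><data> for fixed-size codecs, else <code><length><data>."""
    fixed = FIXED_SIZES.get(code)
    if fixed is not None:
        if len(data) != fixed:
            raise ValueError(f"code {code:#x} expects {fixed} bytes")
        return encode_varint(code) + data
    return encode_varint(code) + encode_varint(len(data)) + data


variable = frame(0x55, b"hi")               # code, length 2, payload
fixed = frame(0x04, bytes([127, 0, 0, 1]))  # code, payload, no length
```

A reader of such a stream would need the same size table as the writer, which is the cost of omitting the length for fixed-size codecs.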

Dropping multicodec as a format

We could also just remove the <codec-encoded-as-varint><data> part and define multicodec as just being a table of codecs which is used by the various multiformats.

If that's the plan I'd suggest renaming it to "multiformat codecs" and moving the table into the multiformats repo. Then removing the multicodec repository as it would be meaningless.

/cc @mikeal

Create initial public release

Please create a proper "1.0" or "0.1" release for this product.

Also adopt a policy of committing to backwards-compatible releases using a SemVer policy.

As it stands, nobody should adopt multicodec in their applications because there is no guarantee that multicodec is a fixed target.

Unicode characters

It would be funny and useful to have a table mapping Unicode characters (e.g. 📠) to protocols.

I am not sure how we would decide on certain protocols etc., but I am honestly interested in hearing your feedback on that idea.

multicodec-packed

I just added a PR for multicodec-packed: #9 -- should this be a separate repo or this one? i'm inclined to keep it here (since it's so similar) until it demonstrates the need to move out. other thoughts?

define a codec table format

This is a first attempt at creating a format for defining codec tables. Codec tables map single bytes to multiple bytes, in this case ASCII strings which represent a multicodec prefix.

Table definition

Field    Type         Description
count    varuint32    count of type entries to follow
entries  type_entry*  repeated type entries as described below

Entry definition

Field            Type       Description
count            varuint32  the length of the string of a multicodec prefix
prefix           bytes      the prefix string
packed encoding  byte       the packed encoding of prefix

This is effectively layer 0 compression: a simple binary encoding of the prefix bytes and related data structures. The encoding is dense and trivial to interact with.

Add BD_ADDR

Bluetooth devices have unique identifiers known as BD_ADDR. These should be added to multiformats and may be useful, especially if libp2p adds a Bluetooth-specific transport. Another thing to consider is the addition of a MAC address identifier.

Missing codec: raw string

I think it would be awesome to use this table as a general standard for self describing formats (outside of ipld and ipfs I mean). However a lot of basic things are missing. One I would like to use is just utf8 string. Sometimes, you might just want to send an error message encoded as a string.

I don't know if it would make sense to add all the other string encodings? Personally I try to use utf8 for everything.

It's not clear to me either how nested things can be expressed. gzipped xxx? Maybe compression is something that should be set on the connection rather than the package.

add eth-account

we need eth-account (0x92) to correctly traverse account data

Table: Application specific ranges

We were going to keep some application specific ranges. I'm not finding them in the table. We should add them back.

Ideally we should have a function that keeps some numbers within every varint size. Meaning that:

  • a 1-byte varint should have a few values reserved for app-specific codes.
  • a 2-byte varint should have some too, many more than 1-byte.
  • etc.

Effectively: come up with a function appSpecific(code int) bool that:

  • assigns some values at every varint range
  • is simple to understand

For example, one such function could be:

func appSpecific(code int) bool {
    // Keeps the last 8 codes of every block of 128 (e.g. 120..127, 248..255).
    return code%128 >= 128-8
}

json missing from table.csv

Hi, I'm trying to figure out why some values in the table.csv have codes and others don't, and why it includes bson but not json when json is clearly implemented (at least as far as I can see in the go-multicodec repo). Can someone enlighten me? Thanks!

maybe pull the changes into here?

hey @diasdavid maybe pull the multistream changes into here?

This was a WIP i wrote way before the conversation we had on the train from Slovenia to Berlin. So it's outdated. Then-- i was considering switching the newline in favor of a / so that it didnt break things like ndjson.... but that may be a fraught endeavor. in any case, the last char is changeable, as it's included in the varint. maybe we could leave it up to the users to decide? (not sure). the niceness of the \n for streams is that it plays well with telnet.

Request: jpg/png

I see them in the spec table, but they're blank. Can we get some prefixes for those?

haskell multicodec

Transferred ownership of the multicodec implementation to the organisation. I didn't expect it to get integrated automatically, so tell me if I need to remove it and file a request before redoing the transfer (the implementation is working and complete). Also, if you have the time and courage :D, please take a look at it and double check that it matches the spec (I followed js-multicodec, but go-multicodec seemed different, so I was a bit confused).

Possibly deduplicate raw/identity

We currently have 0x55 for DagRAW and 0x0 for the "identity" multihash. These are really the same thing and could use the same codec. Thoughts? This would, unfortunately, have to change slowly over a long time but it still annoys me that we have both.

What's the difference between multicodec and multistreams

From the multicodec README:

Multistreams are self-describing protocol/encoding streams.

From the multistream README:

Multicodecs are self-describing protocol/encoding streams.

Each project has a readme that leads by calling the other project a thing, and they call the projects the same thing.

Define multicodec's prefix

Suppose I come across a multicodec-encoded stream in the wild, and have no context for it. After some investigating I might conclude everything after the first few bytes is a standard format. But what are those first few bytes?

If there were a registered prefix for multicodec itself, then all stored/transmitted multicodec streams could start with the same recognizable "magic number". Once the stream is recognized as a multicodec stream, people/machines will know to look at the prefix in the next few bytes to determine the nature of the payload.

Clean up table

The table is messy. we should:

  • remove, or separate out, all entries that do not yet have a code assigned
  • keep it sorted by code

URDNA2015 Support

Has support for RDF Dataset Normalization and URDNA2015 (Universal RDF Dataset Normalization Algorithm) been discussed here?

Not sure quite how it would fit in with multicodec and multihash, but this hashing algorithm is a bit different from others in that first the data (possibly submitted in JSON-LD form) goes through a canonicalization algorithm and then gets hashed. The idea being that a single RDF dataset gets a consistent hash regardless of how it's serialized (i.e different field orderings and whitespace in JSON or a different format like XML will all produce the same hash for the same data).

add crypto key types

Since multikey is not implemented yet, I need to encode a secp256k1 pubkey. Can the secp256k1 key type be added here?

Add an IPNS code

IPFS addresses are good enough to point to static resources, but the corresponding data cannot be updated (the update changes the resource hash, and thus gives a new IPFS address by design).

This issue is solved by IPNS names, which are like tags pointing to an IPFS file. An IPNS name can be updated, which makes it a great anchor for a website (for example). Yet IPNS is not listed in the codecs.

Adding a code for ipns-ns (0xe5?) would allow new synergies for easily upgradeable decentralized UIs. The current alternative (frequently updating an ENS entry) is expensive.

Automate table sync between repos

There are csv tables spread over many repos. There should be one source of truth (multicodec/table.csv), and repos should pull it from there.

Text, raw varint, bytecode distinction

miscellaneous,,
raw, raw binary, 0x55

Ok. This is a valid raw varint code.

bases encodings,,
identity, raw binary, NUL

Empty string or 0x00 or what?

base1, unary, "1"

Only when I open the raw table.csv do I see the " in there.

multiaddrs
...
udp 0x0111

This is not a valid raw varint code.

Can the same codes appear in different sections?

In multiformats/multihash#55 (comment) I use a raw varint prefix

0x84... - Varint Prefix for Merkle Hash Tree Root

Is there a conflict with sctp?

sctp, , 0x84

Mimetypes as codes

Naming every possible type is already done with mime types.
I think it would make sense to use them as codes, instead of defining your own.

"image/jpg" or "text/plain" make nice paths btw ;)

What are your thoughts on it, did I miss something?

Make CID versions multicodecs

This way, we can trivially disambiguate CIDs from, e.g., raw multihashes.

(To do this, we'd need to steal 0x1 from unary for CIDv1).

Add application specific/development codecs

For dev purposes, it is good to have a range of valid codecs that can be used for developing support for new protocols. I think we had something like that in the past but I can't find it in the table right now.

Too many Code duplicates

Such as:

raw 0x55 - base64urlpad 'U', which is 0x55
bencode 0x63 - base32pad 'c', which is 0x63
dbl-sha2-256 0x56 - base32hex-upper 'V', which is 0x56

And,

multihash 0x31 - base1 '1', which is 0x31
multicodec 0x30 - base2 '0', which is 0x30
dns6 0x37 - base8 '7', which is 0x37
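The overlaps listed above can be checked mechanically by comparing each code with the ASCII value of the multibase prefix character. A small sketch, using only the values quoted in this issue (the `OVERLAPS` list and `collides` helper are names made up for the example):

```python
# (multicodec name, code, multibase name, multibase prefix character)
OVERLAPS = [
    ("raw", 0x55, "base64urlpad", "U"),
    ("bencode", 0x63, "base32pad", "c"),
    ("dbl-sha2-256", 0x56, "base32hex-upper", "V"),
    ("multihash", 0x31, "base1", "1"),
    ("multicodec", 0x30, "base2", "0"),
    ("dns6", 0x37, "base8", "7"),
]


def collides(code: int, prefix_char: str) -> bool:
    """True if the code equals the ASCII value of the prefix character."""
    return ord(prefix_char) == code


all_collide = all(collides(code, ch) for _, code, _, ch in OVERLAPS)
```

With the values as quoted, every pair in the list collides at the byte level.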

Whats the data format (or is it compatible) for "bmt multihash 0xd6 Binary Merkle Tree Hash" to hash binary forest pairs that define a function that returns byte[] id of another object (id of an id generator then its id)?

How would I give the context in the id to say what its hashing?

Something like... QmmultihashId...sdfsdf:bmt:occamsfuncerV1:contextSpecificId4234234 but that seems too many names in an id and theres binary form.

It is only a forest of pairs and leaf is all 0s (undecided on how many 0s), of calls of a universal lambda function which would for example derive a preprocessing of small literal bitstrings to quote them else use a securehash and may have bit masks etc... Since I dont want to deal with the never ending complexity of updating software when new stuff is added, I just want to use multihash to (less efficiently) name every possible binary forest (that generates a function) and let the functions do the rest, further optimizing, etc.

multicodec-table module

It will be easier to keep the latest multicodec table in sync across modules if the spec itself exports it as a module for implementations to consume. The modules for each repo can live inside this repo and get published with table updates and/or changes.

  • Create a CJS module once #16 is merged.

Add multicodec for libp2p key

We need a new multicodec to unlock PeerID representation in CIDv1 for subdomains.

Should I just PR a proposal with one? Which range should it be in?

Context

  • IPNS: should work with case-insensitive identifiers (Base32) ipfs/kubo#5287
    @Stebalien described the need in comment:
    • The current peer ID format is just base58btc(<sha2-256 multicodec><256><digest>)
      (just a multihash)

    • The new format would be
      <base32 multibase> ++ base32(<cidv1><libp2p key multicodec><the multihash>)

    That is, we'd use an entirely new multicodec.

  • Wider context:
