vshymanskyy / muon

µON - a compact and simple binary object notation

License: Apache License 2.0

Python 100.00%

binary-format cbor json msgpack

muon's Introduction

µON [muon]

A compact and simple object notation. µ[mju:] stands for "micro".

File extension:     .mu
MIME type:          application/muon
Endianness:         little-endian
Signature/Magic:    optional, 8F B5 30 31 ["�µ01"] @ 0x0
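
Because the signature is optional, a reader has to check for it before parsing. The following is a minimal Python sketch of that check; `strip_signature` is a hypothetical helper name and is not taken from `muon_py`.

```python
# The four signature bytes listed in the table above: 8F B5 30 31
MAGIC = bytes.fromhex("8FB53031")

def strip_signature(data: bytes) -> bytes:
    """Return the payload without the optional 4-byte signature at offset 0."""
    if data.startswith(MAGIC):
        return data[len(MAGIC):]
    return data

# Both inputs yield the same payload.
print(strip_signature(MAGIC + b"hello\x00"))
print(strip_signature(b"hello\x00"))
```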

Muon has some interesting properties (see presentation and docs):

Every null-terminated UTF8 string is at the same time a valid muon object
Gaps in the UTF8 encoding space (code units) are used to encode things like [ ] { } etc.
Muon is self-describing and schemaless, just like JSON (unlike Protobuf and FlatBuffers)
Compact (~10..50% smaller than JSON). On par with or better than CBOR, MsgPack, UBJSON
Unlimited size of objects and values
Data is ready to be used in-place without pre-processing
Supports raw binary data (values and TypedArrays)
Can optionally contain count of elements (and size in bytes) of all structures for efficient document processing, similar to BSON
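
The first two properties can be illustrated with a short Python sketch. It assumes only what is stated above plus the list markers `0x90` and `0x91` quoted in the issues further down; the function names are hypothetical. Bytes in the range `0x80..0xBF` are UTF-8 continuation bytes, so they can never begin a character, which is what leaves them free to act as markers.

```python
LIST_START, LIST_END = 0x90, 0x91  # marker values as quoted in the issues below

def encode_string(s: str) -> bytes:
    """Encode a string as null-terminated UTF-8 (assumes s has no embedded NUL)."""
    if "\x00" in s:
        raise ValueError("embedded NUL cannot be zero-terminated")
    return s.encode("utf-8") + b"\x00"

def is_continuation_byte(b: int) -> bool:
    """True for bytes that may only appear inside a multi-byte UTF-8 character."""
    return 0x80 <= b <= 0xBF

encoded = encode_string("µON")
print(encoded.hex(" "))                   # c2 b5 4f 4e 00
print(is_continuation_byte(encoded[0]))   # False: text never starts with such a byte
print(is_continuation_byte(LIST_START))   # True: so 0x90 is unambiguous as a marker
```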

Future goals:

Strict specification (little or no room for implementation-specific behavior / vendor-specific extensions)

Try it yourself

python3 muon_py/json2mu.py ./data/AirlineDelays.min.json ./AirlineDelays.mu
python3 muon_py/mu2json.py ./AirlineDelays.mu > ./AirlineDelays.json

Run benchmarks:

python3 muon_py/extra/json-analyze.py ./data/*.json ./data/small/*.json
python3 muon_py/mu-benchmark.py ./data/*.json ./data/small/*.json

Muon structure

See more in documentation.

Disclaimer: the notation is still Work In Progress. If you have any ideas or comments, please feel free to post them here.

muon's Issues

Are we using signed or unsigned LEB128?

If I read the Python code correctly, the LEB128 lengths in strings and typed arrays are unsigned LEB128, and for the 0xBB typed values we use signed LEB128, but it isn't specified anywhere else. Maybe that should be clarified in the images and presentation.
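
The distinction matters because the two variants produce different bytes for the same value. Below is a generic Python sketch of both encoders (standard LEB128, not copied from `muon_py`), which may help when checking which one a given field uses.

```python
def uleb128(n: int) -> bytes:
    """Unsigned LEB128: 7 bits per byte, high bit set on all but the last byte."""
    if n < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def sleb128(n: int) -> bytes:
    """Signed LEB128: stops once the remaining bits are pure sign extension."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7  # arithmetic shift, keeps the sign
        sign_bit = byte & 0x40
        if (n == 0 and not sign_bit) or (n == -1 and sign_bit):
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)

print(uleb128(64).hex(" "))  # 40
print(sleb128(64).hex(" "))  # c0 00
```

The value 64 shows the ambiguity: a decoder that reads `40` as signed LEB128 gets -64.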

Standard text representation

For debugging it is sometimes handy to just print a document to the console. It would be useful to have a cheap toString method that works on the fly, without fully decoding into a separate object, so that it could run on a microcontroller, for example.

JSON is an option, but I am not sure whether it can be produced cheaply in terms of memory.
Also, as far as I understand, not every muon document can be represented as JSON.

Explanation of what tags actually are

The encoding diagram has sections for both attr and tags but doesn't explain what they are or how they are encoded. Am I missing something really obvious?

I don't understand what kv is supposed to be. Is it the same as the dict kv? What values are valid for tag and for val? Several tag encodings are mentioned in the table of values, but they appear to be allowed only outside of the main object?

It's not particularly clear that the encoding diagram is recursive, but does make sense once you get to the choices for lists and dicts.

Ideas and comments

This project is at an early stage of development and should generally be treated as an RFC (Request for Comments). If you have any ideas or comments, please feel free to post them here.

Could we chain muon on-the-wire data?

Could we plainly chain (concatenate as cat does) muon on-the-wire data and feed them into muon reader/parser?

(in the past I had fun thinking about a somewhat similar idea, but with a not so convincing result 😉)

Originally posted by @dumblob in #5 (comment)

Deterministic encoding

Thanks for this work. You mention a compact form in the presentation; it would be very good to define that further as a minimal deterministic encoding.

https://datatracker.ietf.org/doc/html/rfc8949#section-4.2 may be interesting if you haven't come across it.

Are 8C tagged strings length encoded?

Given #5 it is possible for strings to contain \u0000 values. With that in mind, should I assume all 8C-tagged strings are length-encoded instead of zero-terminated? Would it make sense to have separate tags for adding zero-terminated strings and length-encoded strings to the LRU?

Use `size` tag for fixed-length strings

... instead of the dedicated marker.
Fixed-size string will remain non-0-terminated.

Consider using `prefix code` and `zigzag` encoding for variable-length integers
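
For reference, the zigzag half of this suggestion can be sketched in a few lines of Python. This is the generic mapping (0, -1, 1, -2, ... to 0, 1, 2, 3, ...) written for arbitrary-size integers; it is not part of the current muon notation, and the prefix-code half is not shown.

```python
def zigzag_encode(n: int) -> int:
    """Map a signed integer to an unsigned one so small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)

print([zigzag_encode(n) for n in (0, -1, 1, -2, 2)])  # [0, 1, 2, 3, 4]
```

After this mapping, an unsigned variable-length integer encoding is enough for signed values.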

Handling of duplicate keys in dicts

I have several questions relating to dicts:

What is the expected handling for dicts where the same key exists multiple times?
Should later values replace earlier ones? Is it implementation defined?
Is there any expectation of what sort of data structure should represent a dict internally - such as an ordered key-value map, or a hash map?
Does/should/can the order of keys matter in an encoding?
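
As a point of comparison only, this is how the same question plays out for JSON in Python's standard `json` module: by default the last value wins, and a stricter policy has to be opted into. `reject_duplicates` is a hypothetical hook name; nothing here reflects a decision made for muon.

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that raises instead of silently keeping the last value."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result

doc = '{"a": 1, "a": 2}'
print(json.loads(doc))  # {'a': 2}

try:
    json.loads(doc, object_pairs_hook=reject_duplicates)
except ValueError as exc:
    print(exc)  # duplicate key: 'a'
```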

Request for clarification of how the 8C tag interacts with lists

Say I have a nested object that for some reason contains the strings "parallelepiped" and "therizinosaurus" many times, and I know this. Which of the two following ways of encoding an 8C tagged list is correct?

8C 90 <utf8 "parallelepiped"> 00 <utf8 "therizinosaurus"> 00 91

90 8C  <utf8 "parallelepiped"> 00 8C <utf8 "therizinosaurus"> 00 91

~~(I'm pretending #12 doesn't exist and that 8C-tagged strings are zero-terminated, will update example once that issue is resolved)~~

My guess is that they are both correct, but the former only adds the strings to the LRU without encoding a list into the object, while the latter encodes a list of strings that also get added to the LRU. Whatever the case, it might be good to be explicit about it in the docs somewhere.
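
For anyone who wants to feed both candidates to a decoder, the two byte sequences can be built as follows. This Python sketch only assembles the bytes exactly as written in this issue (zero-terminated strings, `8C` tag, `90`/`91` list markers); it makes no claim about which form is valid.

```python
TAG_8C, LIST_START, LIST_END = 0x8C, 0x90, 0x91

def zstr(s: str) -> bytes:
    """Zero-terminated UTF-8 string, as assumed in this issue."""
    return s.encode("utf-8") + b"\x00"

words = ["parallelepiped", "therizinosaurus"]

# Variant 1: a single tag in front of the whole list
tag_outside = (bytes([TAG_8C, LIST_START])
               + b"".join(zstr(w) for w in words)
               + bytes([LIST_END]))

# Variant 2: one tag in front of every string inside the list
tag_per_item = (bytes([LIST_START])
                + b"".join(bytes([TAG_8C]) + zstr(w) for w in words)
                + bytes([LIST_END]))

print(len(tag_outside), len(tag_per_item))  # 34 35
```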

Allow ints to be dictionary keys

Sometimes the key of a dictionary can be an integer (like an id or a timestamp), and it would be useful not to have to stringify it.

As far as I understood, it is enough to allow Ax, Bx for the keys to implement this ... does not require any major specification change.

Resynchronization

muon is awesome, but the one thing it is lacking versus line-by-line json is the ability to seek around randomly in the stream.

the jsonl separator \n is hard to beat, but support for raw data makes that impossible.

I have found the AVRO strategy of picking a random 16-byte delimiter is nice.

I think some byte (maybe 0xf0) should indicate a delimiter follows. the first time will define the delimiter (which is chosen randomly); subsequent times should just be validated.
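
A rough Python sketch of the random-delimiter idea follows. It models only the resynchronization step: records are opaque byte strings, the `0xf0` indicator byte is left out, and `frame` and `next_record_start` are hypothetical names.

```python
import os

SYNC_LEN = 16  # delimiter length, as in the strategy described above

def frame(records, sync: bytes) -> bytes:
    """Concatenate records, writing the sync delimiter before each one."""
    return b"".join(sync + record for record in records)

def next_record_start(buf: bytes, pos: int, sync: bytes) -> int:
    """From an arbitrary offset, return the offset just past the next delimiter, or -1."""
    hit = buf.find(sync, pos)
    return -1 if hit < 0 else hit + len(sync)

sync = os.urandom(SYNC_LEN)  # chosen randomly once per stream
stream = frame([b"alpha\x00", b"beta\x00", b"gamma\x00"], sync)

# Jump into the middle of the first record and recover at the second one.
start = next_record_start(stream, 20, sync)
print(stream[start:start + 5])
```

The scheme is probabilistic: raw binary payloads could contain the delimiter by chance, which is why a long random value is used.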

Should fixed-length string (type `0x82`) be `0`-terminated?

Add support for `unums/posits`, a new floating point format

Posit is a number format that is similar to IEEE 754 format (floating point numbers).
The Posit Standard (2022) has been ratified by the Posit Working Group: https://posithub.org

The idea is to add a new posit tag that can apply to u8, u16 and u32 typed values and arrays, which are then treated as the corresponding Posits.
Also, this could be a good opportunity to add support for arbitrary-precision floats, but is there a rationale?

Are `Attributes` useful enough?

... or should we remove them for the sake of simplification?

alternative: move Attributes to Muon document level, so they can only appear at the beginning of the file or stream.

PSON and IOTMP libraries

Hello @vshymanskyy

Thanks for developing the muon and TinyGSM libraries

I'm a Maker and I use Thinger.io in my projects.
They developed PSON (A Serialization Format for IoT Sensor Networks) and the IOTMP protocol.

I hope the information about PSON and IOTMP can help in muon development.

Thank you for your work

PSON Article: https://www.mdpi.com/1424-8220/21/13/4559/htm
PSON Github: https://github.com/thinger-io/Protoson

IOTMP Article: https://www.mdpi.com/1424-8220/19/5/1044/htm
IOTMP Github: https://github.com/thinger-io/IOTMP

Integration: PSON - IOTMP - ARDUINO and TINYGSM with Thinger.io
Arduino-Library: https://github.com/thinger-io/Arduino-Library

Question regarding numbers being passed between JS and Python versions and what it means for their types

Apologies for the wall of text, it's because I really like the idea of Muon :)

TL;DR: key questions are bolded in the text below

So bit of context: I'm trying to write a JavaScript implementation, since the format is so elegantly simple that I feel I can achieve a basic version of it. My first goal is to have "feature parity" with how JavaScript handles JSON. That is: being able to roundtrip any object that you could also send through JSON.stringify and get back from JSON.parse, and ignoring the things that it can't. After that I'll worry about the extra things Muon supports.

Having said that, the fact that Muon has more ways to encode numbers was too tempting to not play around with for size savings. The way I have handled numbers so far is assuming that everything is a double unless explicitly made a BigInt (so basically how JavaScript handles numbers), and reserving i64, u64 and LEB128 for those BigInt values. This lets me use all the other number types in the AX and BX rows to always pick the minimum number of bytes necessary to encode numbers, e.g.

[8, 16, 1/16, 0.1] => 90 A8 B4 10 B9 00 00 80 3D BA 9A 99 99 99 99 99 B9 3F 91
                       |  |  |     |              |                          |
                       |  |  |     |              |                          List end
                       |  |  |     |              f64 approximation of 0.1
                       |  |  |     f32 encoding of 1/16
                       |  |  u8 encoding of 16
                       |  direct encoding of 8
                       List start

This is not a problem when just roundtripping JS-to-JS, since all values just get promoted back to doubles in the end.
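
The selection logic behind the example can be sketched in Python. The sketch covers only the four encodings that appear in the diagram (direct digit, `B4` u8, `B9` f32, `BA` f64) and assumes that direct digits are `A0 + n`, inferred from `A8` meaning 8; `encode_number` and `encode_list` are hypothetical names.

```python
import struct

def encode_number(x) -> bytes:
    """Pick the shortest of the encodings used in the example above."""
    if float(x).is_integer():
        n = int(x)
        if 0 <= n <= 9:
            return bytes([0xA0 + n])      # direct encoding (assumed A0 + n)
        if 0 <= n <= 0xFF:
            return bytes([0xB4, n])       # u8
    try:
        f32 = struct.pack("<f", x)
        if struct.unpack("<f", f32)[0] == x:
            return b"\xB9" + f32          # f32, only when it round-trips exactly
    except OverflowError:
        pass                              # too large for f32
    return b"\xBA" + struct.pack("<d", x) # f64 fallback

def encode_list(values) -> bytes:
    """Wrap the encoded numbers in the list start/end markers."""
    return b"\x90" + b"".join(encode_number(v) for v in values) + b"\x91"

print(encode_list([8, 16, 1 / 16, 0.1]).hex(" ").upper())
# 90 A8 B4 10 B9 00 00 80 3D BA 9A 99 99 99 99 99 B9 3F 91
```

The f32 branch is taken only when packing and unpacking returns the identical double, which is why 1/16 fits in four bytes and 0.1 does not.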

But now imagine we're sending data between Python and JS code through Muon. Python uses variable sized numbers under the hood, right? One could say all integers are "BigInt" and all floats are doubles (I think), unless one is working with NumPy. The example Python encoder from the slides either directly encodes 0-9, or uses LEB128 for all other integers.

Imagine we have a list of integers between 0 and ~~1000~~ [some value bigger than Number.MAX_SAFE_INTEGER] in Python that we encode this way, then decode in my JS implementation. We would end up with an array of mixed doubles and BigInt values. So one number type gets converted to two different ones.

One way to handle this would be to say that a JS implementation of Muon has to convert LEB128 back to doubles if it safely fits in a double, but that also potentially leads to issues: say that I start in JavaScript with a list that contains BigInts, some of which could also be safely converted to doubles. First we serialize this list. Let's assume this will use LEB128 encoding because of the BigInt type, like I have so far. Now we deserialize this list in JS. Because of the rule we just established some of the BigInt values will turn into doubles - we change the number types again!

So we basically have two needs that are a little bit at odds:

1. serialization/deserialization within the same language should not result in type changes
2. serialization in one language and deserialization in another should result in predictable number types
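
The conflict can be reproduced in a few lines of Python, using `int` as a stand-in for a JS BigInt and `float` for a JS double. `decode_leb128_value` is a hypothetical name for the "convert back to double if it fits safely" rule discussed above.

```python
MAX_SAFE_INTEGER = 2**53 - 1  # same value as JavaScript's Number.MAX_SAFE_INTEGER

def decode_leb128_value(n: int):
    """Hypothetical rule: a decoded LEB128 integer becomes a double when it fits safely."""
    return float(n) if abs(n) <= MAX_SAFE_INTEGER else n

original = [2**60, 5]  # both start out as "BigInt" values
decoded = [decode_leb128_value(n) for n in original]

print([type(v).__name__ for v in original])  # ['int', 'int']
print([type(v).__name__ for v in decoded])   # ['int', 'float']
```

A same-language round trip has changed the type of the second element, which is exactly what the first need rules out.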

I think the best summary of this question is: how should Muon handle the different ways languages handle number types when transmitting data between these languages?

For now, for my own implementation I will prioritize 1 over 2 (because it's a toy implementation and I'm not planning to interact with Python in my own use-cases).

PS: I'm sure this question has come up with other encodings that have support for more than just doubles, so maybe it's worth looking up what the arguments + conclusions were in those situations?

Serialization/deserialization

I am not sure whether this feature is a goal, but something like what protobuf provides would be useful.

In the end, you build and parse messages from and to in-memory objects or structs. It would be useful not to have to write serialization/deserialization code by hand, but to generate it.

With possible backward-compatible schema changes in mind, a generator like protobuf's looks promising: you write a schema, and the generator builds your classes/structs consistently for, say, a microcontroller, a backend and a mobile app.

On the other hand, reflection-style serialization, as is usually done for JSON, is also a way to go.

Should LRU cache apply to arbitrary objects, not only strings?

Handling of strings with nul bytes in them

When I try and round-trip the following json, it doesn't work properly.
["stuff", "things", "zero\u0000things", "multi\u0000\u0000zero\u0000things"]
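
The failure mode can be reproduced without the converter. In the Python sketch below, a zero-terminated reader stops at the first embedded NUL, while a length-prefixed form survives. The length-prefixed layout (`0x82`, then an unsigned LEB128 byte count, then the raw bytes) is an assumption pieced together from the other issues on this page, and all function names are hypothetical.

```python
def uleb128(n: int) -> bytes:
    """Unsigned LEB128 encoding of a non-negative integer."""
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

def encode_zstr(s: str) -> bytes:
    return s.encode("utf-8") + b"\x00"

def decode_zstr(data: bytes) -> str:
    return data[:data.index(0)].decode("utf-8")  # stops at the FIRST zero byte

def encode_lstr(s: str) -> bytes:
    raw = s.encode("utf-8")
    return b"\x82" + uleb128(len(raw)) + raw     # assumed layout, see lead-in

def decode_lstr(data: bytes) -> str:
    if data[0] != 0x82:
        raise ValueError("not a length-prefixed string")
    length, shift, pos = 0, 0, 1
    while True:
        byte = data[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return data[pos:pos + length].decode("utf-8")

s = "zero\u0000things"
print(repr(decode_zstr(encode_zstr(s))))  # 'zero'
print(repr(decode_lstr(encode_lstr(s))))  # 'zero\x00things'
```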