
muon's Introduction

µON [muon]

A compact and simple object notation. µ[mju:] stands for "micro".

Muon on Twitter

File extension:     .mu
MIME type:          application/muon
Endianness:         little-endian
Signature/Magic:    optional, 8F B5 30 31 ["\x8Fµ01"] @ 0x0

Muon has some interesting properties (see presentation and docs):

  • Every null-terminated UTF8 string is at the same time a valid muon object
  • Gaps in the UTF8 encoding space (code units) are used to encode things like [ ] { } etc.
  • Muon is self-describing and schemaless, just like JSON (unlike Protobuf and FlatBuffers)
  • Compact (~10..50% smaller than JSON). On par with or outperforms CBOR, MsgPack, UBJSON
  • Unlimited size of objects and values
  • Data is ready to be used in-place without pre-processing
  • Supports raw binary data (values and TypedArrays)
  • Can optionally contain count of elements (and size in bytes) of all structures for efficient document processing, similar to BSON
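
The first two properties follow from how UTF-8 is laid out: some byte values can never begin a character, so a parser that sees one of them where a value is expected knows it is not looking at text. The sketch below is only an illustration in Python; the helper name `can_start_utf8_text` is made up for this example, and it says nothing about which of the free bytes Muon actually assigns to which tag.

```python
def can_start_utf8_text(byte: int) -> bool:
    """Return True if at least one valid UTF-8 string begins with `byte`.

    Hypothetical helper: it probes Python's strict UTF-8 decoder with a few
    candidate sequences (1 to 4 bytes long) that start with `byte`.
    """
    candidates = [bytes([byte])]
    for second in (0x80, 0x90, 0xA0):
        for tail_len in range(3):
            candidates.append(bytes([byte, second]) + b"\x80" * tail_len)
    for candidate in candidates:
        try:
            candidate.decode("utf-8")
            return True
        except UnicodeDecodeError:
            continue
    return False


# Bytes that can never be the first byte of a UTF-8 character.
free_bytes = [b for b in range(256) if not can_start_utf8_text(b)]

# Bytes used as tags elsewhere on this page (list start/end, 8C, typed value BB).
for tag in (0x90, 0x91, 0x8C, 0xBB):
    assert tag in free_bytes
```

The free set is the continuation range 0x80..0xBF plus the lead bytes that UTF-8 forbids outright (0xC0, 0xC1 and 0xF5..0xFF).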

Future goals:

  • Strict specification (little or no room for implementation-specific behavior / vendor-specific extensions)

Try it yourself

python3 muon_py/json2mu.py ./data/AirlineDelays.min.json ./AirlineDelays.mu
python3 muon_py/mu2json.py ./AirlineDelays.mu > ./AirlineDelays.json

Run benchmarks:

python3 muon_py/extra/json-analyze.py ./data/*.json ./data/small/*.json
python3 muon_py/mu-benchmark.py ./data/*.json ./data/small/*.json

Muon structure

Muon diagram

See more in documentation.

Disclaimer: the notation is still Work In Progress. If you have any ideas or comments, please feel free to post them here.


Stand With Ukraine

muon's People

Contributors

vshymanskyy, waterfountain1996


muon's Issues

Are we using signed or unsigned LEB128?

If I read the Python code correctly, the LEB128 lengths in strings and typed arrays are unsigned LEB128, and for the 0xBB typed values we use signed LEB128, but it isn't specified anywhere else. Maybe that should be clarified in the images and presentation.
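
To show why the distinction matters, here is a minimal sketch of both LEB128 variants in Python. These are the generic textbook algorithms, not code taken from the Muon repository, and the function names are made up for this example. The same bytes can decode to different values depending on which variant the reader assumes.

```python
def encode_uleb128(value: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit means 'more follows'."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_sleb128(value: int) -> bytes:
    """Signed LEB128: stop once the remaining bits are pure sign extension."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift: negative values converge to -1
        sign_bit = byte & 0x40
        if (value == 0 and not sign_bit) or (value == -1 and sign_bit):
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)


def decode_leb128(data: bytes, signed: bool, pos: int = 0):
    """Decode one LEB128 number; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    if signed and byte & 0x40:
        result -= 1 << shift  # sign-extend the last group
    return result, pos


# The single byte 7F is 127 when read as unsigned but -1 when read as signed.
print(decode_leb128(b"\x7f", signed=False)[0], decode_leb128(b"\x7f", signed=True)[0])
```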

Standard text representation

For debugging purposes, it is sometimes handy to just print an object to the console. It would be useful to have a cheap toString method that works on the fly, without fully decoding into a separate object, so that it could run on a microcontroller, for example.

JSON is an option, but I'm not sure whether it can be done cheaply (in terms of memory).
Also, as far as I understand, not every muon can be represented as JSON.

Explanation of what tags actually are

The encoding diagram has sections for both attr and tags but doesn't explain what they are or how they are encoded. Am I missing something really obvious?

I don't understand what kv is supposed to be: is it the same as the dict kv? What values are valid for tag and val? Several tag encodings are mentioned in the table of values, but appear to be allowed only outside of the main object?

It's not particularly clear that the encoding diagram is recursive, but does make sense once you get to the choices for lists and dicts.

Ideas and comments

This project is at an early stage of development, and should generally be treated as an RFC (Request for Comments). If you have any ideas/comments, please feel free to post them here.

Are 8C tagged strings length encoded?

Given #5 it is possible for strings to contain \u0000 values. With that in mind, should I assume all 8C tags are length-encoded instead of zero-terminated? Would it make sense to have separate tags for adding zero-terminated strings and length-encoded strings to the LRU?

Handling of duplicate keys in dicts

I have several questions relating to dicts:

  • What is the expected handling for dicts where the same key exists multiple times?
  • Should later values replace earlier ones? Is it implementation defined?
  • Is there any expectation of what sort of data structure should represent a dict internally - such as an ordered key-value map, or a hash map?
  • Does/should/can the order of keys matter in an encoding?
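
The options in these questions can be made concrete with a small Python sketch. It is not Muon's defined behavior (that is exactly what is being asked); `build_dict` and its `on_duplicate` parameter are hypothetical names, and the decoder is assumed to hand over key-value pairs in stream order.

```python
def build_dict(pairs, on_duplicate="last"):
    """Build a dict from (key, value) pairs under one duplicate-key policy.

    on_duplicate: "last"  - later values replace earlier ones
                  "first" - the first value wins, later ones are ignored
                  "error" - duplicates are rejected
    """
    result = {}
    for key, value in pairs:
        if key in result:
            if on_duplicate == "error":
                raise ValueError(f"duplicate key: {key!r}")
            if on_duplicate == "first":
                continue
        result[key] = value
    return result


pairs = [("a", 1), ("b", 2), ("a", 3)]
print(build_dict(pairs, "last"))   # {'a': 3, 'b': 2}
print(build_dict(pairs, "first"))  # {'a': 1, 'b': 2}
```

Python's `dict` keeps insertion order, so this sketch also preserves the order in which keys first appeared; a hash map in another language would not, which is why the ordering question needs its own answer.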

Request for clarification of how the 8C tag interacts with lists

Say I have a nested object that for some reason contains the strings "parallelepiped" and "therizinosaurus" many times, and I know this. Which of the two following ways of encoding an 8C tagged list is correct?

8C 90 <utf8 "parallelepiped"> 00 <utf8 "therizinosaurus"> 00 91

90 8C  <utf8 "parallelepiped"> 00 8C <utf8 "therizinosaurus"> 00 91

(I'm pretending #12 doesn't exist and that 8C-tagged strings are zero-terminated, will update example once that issue is resolved)

My guess is that they are both correct, but the former only adds the strings to the LRU without encoding a list into the object, while the latter encodes a list of strings that also get added to the LRU. Whatever the case, it might be good to be explicit about it in the docs somewhere.

Allow ints to be dictionary keys

Sometimes the key of a dictionary can be an integer (like an id or a timestamp), and it could be useful not to stringify it.

As far as I understood, it is enough to allow Ax, Bx for the keys to implement this ... does not require any major specification change.

Resynchronization

Muon is awesome, but the one thing it lacks compared with line-by-line JSON is the ability to seek around randomly in the stream.

The JSONL separator \n is hard to beat, but support for raw data makes that impossible.

I have found the Avro strategy of picking a random 16-byte delimiter to be nice.

I think some byte (maybe 0xf0) should indicate that a delimiter follows. The first occurrence defines the delimiter (which is chosen randomly); subsequent occurrences should just be validated.
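
A rough Python sketch of this proposal, to make the mechanics concrete. Nothing here is part of Muon: the 0xF0 prefix is the byte suggested above, and `write_stream`, `read_marker` and `resync` are hypothetical names. Records are treated as opaque byte strings.

```python
import os

SYNC_PREFIX = 0xF0   # byte suggested in the proposal; an assumption, not spec
MARKER_LEN = 16      # random delimiter length, as in the Avro strategy


def write_stream(records, marker=None):
    """Prefix every record with SYNC_PREFIX + marker; return (stream, marker)."""
    marker = marker or os.urandom(MARKER_LEN)
    out = bytearray()
    for record in records:
        out.append(SYNC_PREFIX)
        out += marker
        out += record
    return bytes(out), marker


def read_marker(stream):
    """The first occurrence defines the delimiter for the rest of the stream."""
    if stream[0] != SYNC_PREFIX:
        raise ValueError("stream does not start with a delimiter")
    return stream[1:1 + MARKER_LEN]


def resync(stream, marker, start):
    """Return the offset of the first record that begins at or after `start`,
    or -1 if no further delimiter is found."""
    needle = bytes([SYNC_PREFIX]) + marker
    hit = stream.find(needle, start)
    return -1 if hit < 0 else hit + len(needle)


stream, marker = write_stream([b"first record", b"second record"])
offset = resync(stream, read_marker(stream), start=5)  # jump in mid-record
print(stream[offset:])  # b'second record'
```

Raw binary payloads could in principle contain the delimiter by accident; with 16 random bytes that is very unlikely, but a reader would still want to validate what it finds after resynchronizing.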

Add support for `unums/posits`, a new floating point format

Posit is a number format that is similar to IEEE 754 format (floating point numbers).
The Posit Standard (2022) has been ratified by the Posit Working Group: https://posithub.org

The idea is to add a new posit tag that can apply to u8, u16 and u32 typed values and arrays, which are then treated as the corresponding Posits.
Also, this could be a good opportunity to add support for arbitrary-precision floats, but is there a rationale?

Are `Attributes` useful enough?

... or should we remove them for the sake of simplification?

Alternative: move Attributes to the Muon document level, so they can only appear at the beginning of the file or stream.

PSON and IOTMP libraries

Hello @vshymanskyy

Thanks for developing the muon and TinyGSM libraries

I'm a Maker and I use Thinger.io in my projects.
They developed PSON (A Serialization Format for IoT Sensor Networks) and the IOTMP protocol.

I hope the information about PSON and IOTMP can help in muon development.

Thank you for your work

PSON Article: https://www.mdpi.com/1424-8220/21/13/4559/htm
PSON Github: https://github.com/thinger-io/Protoson

IOTMP Article: https://www.mdpi.com/1424-8220/19/5/1044/htm
IOTMP Github: https://github.com/thinger-io/IOTMP

Integration: PSON - IOTMP - ARDUINO and TINYGSM with Thinger.io
Arduino-Library: https://github.com/thinger-io/Arduino-Library

Question regarding numbers being passed between JS and Python versions and what it means for their types

Apologies for the wall of text, it's because I really like the idea of Muon :)

TL;DR: key questions are bolded in the text below

So bit of context: I'm trying to write a JavaScript implementation, since the format is so elegantly simple that I feel I can achieve a basic version of it. My first goal is to have "feature parity" with how JavaScript handles JSON. That is: being able to roundtrip any object that you could also send through JSON.stringify and get back from JSON.parse, and ignoring the things that it can't. After that I'll worry about the extra things Muon supports.

Having said that, the fact that Muon has more ways to encode numbers was too tempting to not play around with for size savings. The way I have handled numbers so far is assuming that everything is a double unless explicitly made a BigInt (so basically how JavaScript handles numbers), and reserving i64, u64 and LEB128 for those BigInt values. This lets me use all the other number types in the AX and BX rows to always pick the minimum number of bytes necessary to encode numbers, e.g.

[8, 16, 1/16, 0.1] => 90 A8 B4 10 B9 00 00 80 3D BA 9A 99 99 99 99 99 B9 3F 91
                       |  |  |     |              |                          |
                       |  |  |     |              |                          List end
                       |  |  |     |              f64 approximation of 0.1
                       |  |  |     f32 encoding of 1/16
                       |  |  u8 encoding of 16
                       |  direct encoding of 8
                       List start
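
The selection logic behind this example can be sketched in Python as follows. It covers only the four encodings that appear in the example (direct 0..9, u8, f32, f64) and takes the tag bytes from the dump above; the function names are made up, and edge cases such as negative zero or very large integers are ignored.

```python
import struct

LIST_START, LIST_END = 0x90, 0x91
TAG_U8, TAG_F32, TAG_F64 = 0xB4, 0xB9, 0xBA  # as used in the dump above


def fits_f32(value: float) -> bool:
    """True if the value survives a round trip through a 32-bit float."""
    try:
        return struct.unpack("<f", struct.pack("<f", value))[0] == value
    except OverflowError:
        return False  # magnitude too large for f32


def encode_number(value) -> bytes:
    """Pick the shortest of the four encodings for a JS-style number."""
    if float(value).is_integer():
        n = int(value)
        if 0 <= n <= 9:
            return bytes([0xA0 + n])   # direct encoding, e.g. 8 -> A8
        if n <= 0xFF and n >= 0:
            return bytes([TAG_U8, n])  # u8
    if fits_f32(value):
        return bytes([TAG_F32]) + struct.pack("<f", value)
    return bytes([TAG_F64]) + struct.pack("<d", value)


def encode_list(values) -> bytes:
    body = b"".join(encode_number(v) for v in values)
    return bytes([LIST_START]) + body + bytes([LIST_END])


print(encode_list([8, 16, 1 / 16, 0.1]).hex(" ").upper())
```

1/16 is exactly representable as an f32, so it takes 5 bytes; 0.1 is not, so it falls through to the 9-byte f64 form.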

This is not a problem when just roundtripping JS-to-JS, since all values just get promoted back to doubles in the end.

But now imagine we're sending data between Python and JS code through Muon. Python uses variable sized numbers under the hood, right? One could say all integers are "BigInt" and all floats are doubles (I think), unless one is working with NumPy. The example Python encoder from the slides either directly encodes 0-9, or uses LEB128 for all other integers.

Imagine we have a list of integers between 0 and some value bigger than Number.MAX_SAFE_INTEGER in Python that we encode this way, and then decode in my JS implementation. We would end up with an array of mixed doubles and BigInt values. So one number type gets converted to two different ones.

One way to handle this would be to say that a JS implementation of Muon has to convert LEB128 back to doubles if it safely fits in a double, but that also potentially leads to issues: say that I start in JavaScript with a list that contains BigInts, some of which could also be safely converted to doubles. First we serialize this list. Let's assume this will use LEB128 encoding because of the BigInt type, like I have so far. Now we deserialize this list in JS. Because of the rule we just established some of the BigInt values will turn into doubles - we change the number types again!
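
The two decoding rules can be modelled in a few lines of Python, using `float` as a stand-in for a JS Number and `int` as a stand-in for a BigInt. This is only a toy model of the dilemma; `decode_integer_as_js` and its `demote_safe` flag are hypothetical names.

```python
MAX_SAFE_INTEGER = 2**53 - 1  # same value as JavaScript's Number.MAX_SAFE_INTEGER


def decode_integer_as_js(value: int, demote_safe: bool):
    """Model how a JS decoder might surface a LEB128-encoded integer.

    float stands in for a JS Number (double), int for a BigInt.
    """
    if demote_safe and abs(value) <= MAX_SAFE_INTEGER:
        return float(value)
    return value


# A list that was all BigInt on the JS side before serialization.
original = [5, 2**60]

with_rule = [decode_integer_as_js(v, demote_safe=True) for v in original]
without_rule = [decode_integer_as_js(v, demote_safe=False) for v in original]

print([type(v).__name__ for v in with_rule])     # ['float', 'int']
print([type(v).__name__ for v in without_rule])  # ['int', 'int']
```

With the rule, a JS-to-JS round trip changes the type of 5; without it, every integer sent from Python arrives as a BigInt, however small.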

So we basically have two needs that are a little bit at odds:

  1. serialization/deserialization within the same language should not result in type changes
  2. serialization in one language and deserialization in another should result in predictable number types

I think the best summary of this question is: how should Muon handle the different ways languages handle number types when transmitting data between these languages?

For now, for my own implementation I will prioritize 1 over 2 (because it's a toy implementation and I'm not planning to interact with Python in my own use-cases).

PS: I'm sure this question has come up with other encodings that have support for more than just doubles, so maybe it's worth looking up what the arguments + conclusions were in those situations?

Serialization/deserialization

I'm not sure whether this feature is a goal, but something like what protobuf provides would be useful.

In the end, you build and parse messages from/to some in-memory objects/structs. It would be useful to be able to generate the serialization/deserialization code instead of writing it by hand.

With possible backward-compatible schema changes in mind, a generator like protobuf's looks promising: you write a schema, and the generator builds your classes/structs consistently for, for example, a microcontroller, a backend and a mobile app.

On the other hand, the way JSON serialization is usually done is also a way to go.
