greglook / clj-cbor
Native Clojure CBOR codec implementation.
License: The Unlicense
A question came up recently about using cbor/decode to lazily parse an input stream, where data from the underlying stream was getting lost. This happens because, internally, decode coerces its argument into a java.io.DataInputStream by first calling clojure.java.io/input-stream on it. The implementation of input-stream always wraps the result in a java.io.BufferedInputStream, which means that the first read on a bare InputStream pulls more data than expected into the internal buffer; that extra data is then discarded! The immediate workaround is to make sure you always pass a DataInputStream to the decode methods, but a better approach would be to avoid coercing arguments that are already InputStreams, and to clearly document this behavior on the functions.
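The root cause is easy to demonstrate with plain java.io classes (a self-contained sketch; the ten-byte payload is arbitrary):

```clojure
(import '[java.io ByteArrayInputStream BufferedInputStream])

;; One logical read through a BufferedInputStream drains far more than
;; one byte from the underlying stream into its internal buffer. If the
;; wrapper is then discarded, the buffered bytes are lost with it.
(def raw (ByteArrayInputStream. (byte-array (range 10))))
(def buffered (BufferedInputStream. raw))

(.read buffered)   ;; logically reads a single byte...
(.available raw)   ;; ...but the source stream is already drained
;; => 0
```

Any consumer that reads one item through the buffered wrapper and then goes back to the bare stream will find the remaining bytes gone, which is exactly the reported data loss.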
Rather than having encode write a single item and decode read a lazy sequence, the semantics should be changed to include encode-all and decode-all methods that operate on sequences. This implies a breaking change to decode such that it always returns a single item.
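The one-item/many-item split can be illustrated with a toy codec over EDN text (a self-contained sketch; clj-cbor's real functions operate on binary streams, and the helper names here are hypothetical):

```clojure
(require '[clojure.edn :as edn])
(import '[java.io PushbackReader StringReader])

;; Toy illustration of the proposed semantics, using EDN in place of CBOR.
(defn decode-one
  "Read exactly one item, like the proposed breaking `decode`."
  [rdr]
  (edn/read {:eof ::eof} rdr))

(defn decode-all
  "Lazily read every remaining item, like the proposed `decode-all`."
  [rdr]
  (take-while #(not= ::eof %) (repeatedly #(decode-one rdr))))

(let [rdr (PushbackReader. (StringReader. "1 2 3"))]
  [(decode-one rdr) (doall (decode-all rdr))])
;; => [1 (2 3)]
```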
user=> (into {} (map (juxt identity (comp type first cbor/decode cbor/encode))) [0 1 23 24 255 256 65535 65536 1000000 (inc' Long/MAX_VALUE)])
{0 java.lang.Long,
 1 java.lang.Long,
 23 java.lang.Long,
 24 java.lang.Integer,
 255 java.lang.Integer,
 256 java.lang.Integer,
 65535 java.lang.Integer,
 65536 java.lang.Long,
 1000000 java.lang.Long,
 9223372036854775808N clojure.lang.BigInt}
Similarly, if a tagged CBOR type wraps an integer, its read handler may receive an Integer, a Long, or a BigInt depending on the encoded width.
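Until the codec normalizes this itself, a caller can coerce the result before using it (a sketch; normalize-int is a hypothetical helper, not library API):

```clojure
;; Coerce any of the integral types the decoder may hand back into a
;; uniform representation: Long where it fits, BigInt otherwise.
;; (normalize-int is an illustrative name, not part of clj-cbor.)
(defn normalize-int
  [x]
  (cond
    (instance? clojure.lang.BigInt x)  x
    (instance? java.math.BigInteger x) (bigint x)
    :else                              (long x)))

[(normalize-int (int 255)) (normalize-int 255) (normalize-int 9223372036854775808N)]
;; => [255 255 9223372036854775808N]
```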
Given that many values are entirely determined by the first byte of each data item, the codec could take a more direct approach to decoding by using a jump table instead of the normal branching logic. The table would have 256 entries, returning hard-coded values where possible and dispatching to the more nuanced decoding functions elsewhere. Using a case expression would give constant-time decoding for all single-byte values.
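A minimal sketch of the idea, independent of clj-cbor's internals (only a handful of the 256 entries are shown):

```clojure
;; Sketch of a first-byte jump table. CBOR packs a 3-bit major type and
;; a 5-bit additional-info field into the initial byte, so many values
;; are fully determined by that byte alone.
(defn decode-initial-byte
  [^long b]
  (case b
    ;; major type 0, info 0-23: the unsigned integer *is* the info value
    0x00 0, 0x01 1, 0x02 2, 0x17 23
    ;; major type 7 simple values
    0xF4 false, 0xF5 true, 0xF6 nil
    ;; everything else needs to read more bytes
    :complex))

[(decode-initial-byte 0x00) (decode-initial-byte 0x17)
 (decode-initial-byte 0xF5) (decode-initial-byte 0x18)]
;; => [0 23 true :complex]
```

Because case compiles to a constant-time table lookup rather than a chain of tests, the single-byte fast path costs the same regardless of which entry matches.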
CBOR supports a notion of streamed data, where instead of being prefixed, a collection is marked as being a stream. The reader is expected to read elements off the stream until they see the special break value. The codec currently supports reading streamed data, but has no way to write it.
The most obvious place to support streaming is when we encounter lazy sequences or other values which are Iterable and don't match any other handlers. Lazy sequences are currently fully realized in order to count them before writing any data, which may be undesirable. This could be controlled by a new codec option.
Streaming map support is a little trickier, and it's not clear that it's actually valuable to support yet. It could be accomplished with a wrapper type, though: one that holds an iterable value and signals that the encoder should read key/value pairs from it and write out a streaming CBOR map.
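The wrapper-type idea could look something like this (a hypothetical sketch; StreamingMap and streaming-map are illustrative names, not part of the library):

```clojure
;; A marker wrapper the encoder could test for: on seeing a StreamingMap
;; it would emit the CBOR indefinite-map header (0xBF), write each
;; key/value pair as it is pulled from the iterable, and finish with the
;; break code (0xFF) -- never counting the entries up front.
(defrecord StreamingMap [entries])

(defn streaming-map
  "Wrap an iterable of [key value] pairs for streamed encoding."
  [entries]
  (->StreamingMap entries))

(def m (streaming-map (map (fn [i] [i (* i i)]) (range))))

;; The entries stay an unrealized lazy sequence until encoding walks them.
(take 3 (:entries m))
;; => ([0 0] [1 1] [2 4])
```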
There are a few real-world use cases that have cropped up where the majority of a type's API uses an abstract parent class or interface, with multiple underlying concrete implementations. One example is Joda-Time's org.joda.time.DateTimeZone and the attendant CachedDateTimeZone, FixedDateTimeZone, and DateTimeZoneBuilder$PrecalculatedZone (yuck). The user wants to be able to treat all of these types uniformly, so having to declare every possible subclass makes type extension awkward here. This can be solved by attempting inheritance-based resolution, similar to puget.dispatch/inheritance-lookup.
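A hedged sketch of inheritance-based resolution (self-contained; the handler map and resolve-handler are illustrative, not clj-cbor or Puget API):

```clojure
;; Walk a value's class ancestry to find the nearest registered handler,
;; so concrete subclasses pick up the handler declared for a parent
;; class or interface.
(defn resolve-handler
  [handlers ^Class cls]
  (or (get handlers cls)
      (some handlers (ancestors cls))))

(def handlers
  {java.util.Date   :encode-date
   java.lang.Number :encode-number})

[(resolve-handler handlers java.util.Date)
 (resolve-handler handlers java.lang.Long)    ;; matched via the Number ancestor
 (resolve-handler handlers java.lang.String)]
;; => [:encode-date :encode-number nil]
```

A production version would also need a deterministic tie-break when several ancestors have handlers, which is the ambiguity puget.dispatch/inheritance-lookup deals with.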
The CBORCodec should support a strict mode per the RFC, which rejects inputs it doesn't know how to parse. This would impact simple and tagged values which don't have an explicit mapping or read handler.
Generate or otherwise encode some big, hairy data structures and establish performance benchmarks for writing them. Compare this lib to a few other formats/libraries to determine relative performance.
This is not an issue with your software, but: I would like to point out that the upcoming editorial revision of RFC 7049, informally called “rfc7049bis”, is in working group last call right now. If you can spare some time, please have a look at https://cbor-wg.github.io/CBORbis/draft-ietf-cbor-7049bis.html and submit any issues (including potential for further improvement of the text) you might find, at https://github.com/cbor-wg/CBORbis
Thank you.
Carsten Bormann
WARNING: bytes? already refers to: #'clojure.core/bytes? in namespace: clj-cbor.data.core, being replaced by: #'clj-cbor.data.core/bytes?
WARNING: boolean? already refers to: #'clojure.core/boolean? in namespace: clj-cbor.codec, being replaced by: #'clj-cbor.codec/boolean?
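These warnings arise because the namespaces define predicates whose names clash with ones added to clojure.core in Clojure 1.9. The conventional fix (sketched here for one namespace; the real file contains much more than this) is to exclude the core names in the ns declaration:

```clojure
;; Excluding the clashing clojure.core names silences the shadowing
;; warnings while keeping the local definitions. (Sketch only; the real
;; clj-cbor.data.core namespace is larger than shown.)
(ns clj-cbor.data.core
  (:refer-clojure :exclude [bytes? boolean?]))

(defn bytes?
  "True if the value is a primitive byte array."
  [x]
  (instance? (Class/forName "[B") x))
```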
The CBORCodec type should support canonical mode per the RFC. This would ensure that map keys and set entries are written to the output in sorted order, ensuring that data structures with equal content are always serialized to the same bytes. This is a useful property in many situations, including caching and content-addressable storage.
It should be noted that the specification calls for sorting the values by serialized bytes, which requires serializing all of the entries into memory before writing. This could be expensive and exhaust memory, so canonical mode must not be the default. In Puget, a compromise was to only sort the output if the data structure was below a configurable size, but that doesn't work in CBOR because a set with only two or three huge entries could cause issues just as much as a set with thousands of small entries.
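The RFC 7049 ordering sorts entries by their serialized form: shorter encodings first, then an unsigned bytewise comparison. A self-contained sketch of that comparator, assuming the entries have already been encoded to byte arrays:

```clojure
;; RFC 7049 canonical key ordering: compare encoded lengths first, then
;; the bytes themselves, treating each (signed JVM) byte as unsigned.
(defn compare-canonical
  [^bytes a ^bytes b]
  (let [la (alength a)
        lb (alength b)]
    (if (not= la lb)
      (compare la lb)
      (loop [i 0]
        (if (= i la)
          0
          (let [c (compare (bit-and 0xFF (aget a i))
                           (bit-and 0xFF (aget b i)))]
            (if (zero? c) (recur (inc i)) c)))))))

(map vec
     (sort compare-canonical
           [(byte-array [0x18 0x64])    ;; uint 100, two bytes
            (byte-array [0x0A])         ;; uint 10, one byte
            (byte-array [0x18 0x19])])) ;; uint 25, two bytes
;; => ([10] [24 25] [24 100])
```

Note this is the RFC 7049 rule; its successor (rfc7049bis) moves to a purely bytewise ordering without the length-first step, so a canonical mode may eventually need to choose between the two.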
The library currently uses tag code 13 to represent sets of unique elements, but this tag is not registered anywhere. Ideally, it should be added to the IANA registry in the two-byte tag range. This will unfortunately change the canonical serialization of any data using sets, but we can maintain backwards compatibility for some number of future library versions by reading in tag 13 objects as sets if no handler is registered.