greglook / clj-cbor
Native Clojure CBOR codec implementation.
License: The Unlicense
A question came up recently about using cbor/decode to lazily parse an input stream, where data from the underlying stream was getting lost. This happens because, internally, decode coerces its argument into a java.io.DataInputStream by first calling clojure.java.io/input-stream on it. The implementation of input-stream always wraps the result in a java.io.BufferedInputStream, which means that the first read on a bare InputStream pulls more data than expected into the internal buffer; that extra data is then discarded! The immediate workaround is to make sure you always pass a DataInputStream to the decode methods, but a better approach would be to avoid coercing arguments that are already InputStreams, and to clearly document this behavior on the functions.
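The root cause is easy to demonstrate with plain java.io classes (a self-contained sketch; the ten-byte payload is arbitrary):

```clojure
(import '[java.io ByteArrayInputStream BufferedInputStream])

;; One logical read through a BufferedInputStream drains far more than
;; one byte from the underlying stream into its internal buffer. If the
;; wrapper is then discarded, the buffered bytes are lost with it.
(def raw (ByteArrayInputStream. (byte-array (range 10))))
(def buffered (BufferedInputStream. raw))

(.read buffered)   ;; logically reads a single byte...
(.available raw)   ;; ...but the source stream is already drained
;; => 0
```

Any consumer that reads one item through the buffered wrapper and then goes back to the bare stream will find the remaining bytes gone, which is exactly the reported data loss.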
Rather than having encode write a single item and decode read a lazy sequence, the semantics should be changed to include encode-all and decode-all methods that operate on sequences. This implies a breaking change to decode such that it always returns a single item.
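The one-item/many-item split can be illustrated with a toy codec over EDN text (a self-contained sketch; clj-cbor's real functions operate on binary streams, and the helper names here are hypothetical):

```clojure
(require '[clojure.edn :as edn])
(import '[java.io PushbackReader StringReader])

;; Toy illustration of the proposed semantics, using EDN in place of CBOR.
(defn decode-one
  "Read exactly one item, like the proposed breaking `decode`."
  [rdr]
  (edn/read {:eof ::eof} rdr))

(defn decode-all
  "Lazily read every remaining item, like the proposed `decode-all`."
  [rdr]
  (take-while #(not= ::eof %) (repeatedly #(decode-one rdr))))

(let [rdr (PushbackReader. (StringReader. "1 2 3"))]
  [(decode-one rdr) (doall (decode-all rdr))])
;; => [1 (2 3)]
```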
user=> (into {} (map (juxt identity (comp type first cbor/decode cbor/encode))) [0 1 23 24 255 256 65535 65536 1000000 (inc' Long/MAX_VALUE)])
{0 java.lang.Long,
 1 java.lang.Long,
 23 java.lang.Long,
 24 java.lang.Integer,
 255 java.lang.Integer,
 256 java.lang.Integer,
 65535 java.lang.Integer,
 65536 java.lang.Long,
 1000000 java.lang.Long,
 9223372036854775808N clojure.lang.BigInt}
Similarly, if a tagged CBOR type wraps an integer, its read handler may receive an Integer, a Long, or a BigInt depending on the encoded width.
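Until the codec normalizes this itself, a caller can coerce the result before using it (a sketch; normalize-int is a hypothetical helper, not library API):

```clojure
;; Coerce any of the integral types the decoder may hand back into a
;; uniform representation: Long where it fits, BigInt otherwise.
;; (normalize-int is an illustrative name, not part of clj-cbor.)
(defn normalize-int
  [x]
  (cond
    (instance? clojure.lang.BigInt x)  x
    (instance? java.math.BigInteger x) (bigint x)
    :else                              (long x)))

[(normalize-int (int 255)) (normalize-int 255) (normalize-int 9223372036854775808N)]
;; => [255 255 9223372036854775808N]
```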
Given that many values are entirely determined by the first byte of each data item, the codec could take a more direct approach to decoding by using a jump table instead of the normal branching logic. The table would have 256 entries, returning hard-coded values where possible and dispatching to the more nuanced decoding functions elsewhere. Using a case expression would give constant-time decoding for all single-byte values.
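A minimal sketch of the idea, independent of clj-cbor's internals (only a handful of the 256 entries are shown):

```clojure
;; Sketch of a first-byte jump table. CBOR packs a 3-bit major type and
;; a 5-bit additional-info field into the initial byte, so many values
;; are fully determined by that byte alone.
(defn decode-initial-byte
  [^long b]
  (case b
    ;; major type 0, info 0-23: the unsigned integer *is* the info value
    0x00 0, 0x01 1, 0x02 2, 0x17 23
    ;; major type 7 simple values
    0xF4 false, 0xF5 true, 0xF6 nil
    ;; everything else needs to read more bytes
    :complex))

[(decode-initial-byte 0x00) (decode-initial-byte 0x17)
 (decode-initial-byte 0xF5) (decode-initial-byte 0x18)]
;; => [0 23 true :complex]
```

Because case compiles to a constant-time table lookup rather than a chain of tests, the single-byte fast path costs the same regardless of which entry matches.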
CBOR supports a notion of streamed data, where instead of being prefixed, a collection is marked as being a stream. The reader is expected to read elements off the stream until they see the special break value. The codec currently supports reading streamed data, but has no way to write it.
The most obvious place to support streaming is when we encounter lazy sequences or other values which are Iterable and don't match any other handlers. Lazy sequences are currently fully realized in order to count them before writing any data, which may be undesirable. This could be controlled by a new codec option.
Streaming map support is a little trickier, and it's not clear that it's actually valuable to support yet. It could be accomplished with a wrapper type, though: one that holds an iterable value and signals that the encoder should read key/value pairs from it and write out a streaming CBOR map.
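The wrapper-type idea could look something like this (a hypothetical sketch; StreamingMap and streaming-map are illustrative names, not part of the library):

```clojure
;; A marker wrapper the encoder could test for: on seeing a StreamingMap
;; it would emit the CBOR indefinite-map header (0xBF), write each
;; key/value pair as it is pulled from the iterable, and finish with the
;; break code (0xFF) -- never counting the entries up front.
(defrecord StreamingMap [entries])

(defn streaming-map
  "Wrap an iterable of [key value] pairs for streamed encoding."
  [entries]
  (->StreamingMap entries))

(def m (streaming-map (map (fn [i] [i (* i i)]) (range))))

;; The entries stay an unrealized lazy sequence until encoding walks them.
(take 3 (:entries m))
;; => ([0 0] [1 1] [2 4])
```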
There are a few real-world use cases that have cropped up where the majority of a type's API uses an abstract parent class or interface, with multiple underlying concrete implementations. One example is Joda-Time's org.joda.time.DateTimeZone and the attendant CachedDateTimeZone, FixedDateTimeZone, and DateTimeZoneBuilder$PrecalculatedZone (yuck). The user wants to be able to treat all of these types uniformly, so having to declare every possible subclass makes type extension awkward here. This can be solved by attempting inheritance-based resolution, similar to puget.dispatch/inheritance-lookup.
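A hedged sketch of inheritance-based resolution (self-contained; the handler map and resolve-handler are illustrative, not clj-cbor or Puget API):

```clojure
;; Walk a value's class ancestry to find the nearest registered handler,
;; so concrete subclasses pick up the handler declared for a parent
;; class or interface.
(defn resolve-handler
  [handlers ^Class cls]
  (or (get handlers cls)
      (some handlers (ancestors cls))))

(def handlers
  {java.util.Date   :encode-date
   java.lang.Number :encode-number})

[(resolve-handler handlers java.util.Date)
 (resolve-handler handlers java.lang.Long)    ;; matched via the Number ancestor
 (resolve-handler handlers java.lang.String)]
;; => [:encode-date :encode-number nil]
```

A production version would also need a deterministic tie-break when several ancestors have handlers, which is the ambiguity puget.dispatch/inheritance-lookup deals with.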
The CBORCodec should support a strict mode per the RFC, which rejects inputs it doesn't know how to parse. This would impact simple and tagged values which don't have an explicit mapping or read handler.
Generate or otherwise encode some big, hairy data structures and establish performance benchmarks for writing them. Compare this lib to a few other formats/libraries to determine relative performance.
This is not an issue with your software, but: I would like to point out that the upcoming editorial revision of RFC 7049, informally called “rfc7049bis”, is in working group last call right now. If you can spare some time, please have a look at https://cbor-wg.github.io/CBORbis/draft-ietf-cbor-7049bis.html and submit any issues (including potential for further improvement of the text) you might find, at https://github.com/cbor-wg/CBORbis
Thank you.
Carsten Bormann
WARNING: bytes? already refers to: #'clojure.core/bytes? in namespace: clj-cbor.data.core, being replaced by: #'clj-cbor.data.core/bytes?
WARNING: boolean? already refers to: #'clojure.core/boolean? in namespace: clj-cbor.codec, being replaced by: #'clj-cbor.codec/boolean?
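These warnings arise because the namespaces define predicates whose names clash with ones added to clojure.core in Clojure 1.9. The conventional fix (sketched here for one namespace; the real file contains much more than this) is to exclude the core names in the ns declaration:

```clojure
;; Excluding the clashing clojure.core names silences the shadowing
;; warnings while keeping the local definitions. (Sketch only; the real
;; clj-cbor.data.core namespace is larger than shown.)
(ns clj-cbor.data.core
  (:refer-clojure :exclude [bytes? boolean?]))

(defn bytes?
  "True if the value is a primitive byte array."
  [x]
  (instance? (Class/forName "[B") x))
```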
The CBORCodec type should support canonical mode per the RFC. This would ensure that map keys and set entries are written to the output in sorted order, ensuring that data structures with equal content are always serialized to the same bytes. This is a useful property in many situations, including caching and content-addressable storage.
It should be noted that the specification calls for sorting the values by serialized bytes, which requires serializing all of the entries into memory before writing. This could be expensive and exhaust memory, so canonical mode must not be the default. In Puget, a compromise was to only sort the output if the data structure was below a configurable size, but that doesn't work in CBOR because a set with only two or three huge entries could cause issues just as much as a set with thousands of small entries.
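The RFC 7049 ordering sorts entries by their serialized form: shorter encodings first, then an unsigned bytewise comparison. A self-contained sketch of that comparator, assuming the entries have already been encoded to byte arrays:

```clojure
;; RFC 7049 canonical key ordering: compare encoded lengths first, then
;; the bytes themselves, treating each (signed JVM) byte as unsigned.
(defn compare-canonical
  [^bytes a ^bytes b]
  (let [la (alength a)
        lb (alength b)]
    (if (not= la lb)
      (compare la lb)
      (loop [i 0]
        (if (= i la)
          0
          (let [c (compare (bit-and 0xFF (aget a i))
                           (bit-and 0xFF (aget b i)))]
            (if (zero? c) (recur (inc i)) c)))))))

(map vec
     (sort compare-canonical
           [(byte-array [0x18 0x64])    ;; uint 100, two bytes
            (byte-array [0x0A])         ;; uint 10, one byte
            (byte-array [0x18 0x19])])) ;; uint 25, two bytes
;; => ([10] [24 25] [24 100])
```

Note this is the RFC 7049 rule; its successor (rfc7049bis) moves to a purely bytewise ordering without the length-first step, so a canonical mode may eventually need to choose between the two.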
The library currently uses tag code 13 to represent sets of unique elements, but this tag is not registered anywhere. Ideally, it should be added to the IANA registry in the two-byte tag range. This will unfortunately change the canonical serialization of any data using sets, but we can maintain backwards compatibility for some number of future library versions by reading in tag 13 objects as sets if no handler is registered.