Comments (16)
What if this is stored as an app config - deserializing a config that may be a string and boolean may be ambiguous itself (depending on the format of course)
I don't find this argument convincing - if we cared about this we'd only ever pass in strings as config variables to any function. A limited type system in some legacy config format shouldn't limit the options we can use in python.
I think splitting the sorting capability into separate options is excessive
I also would really prefer to keep this configured via one keyword argument. Beyond limiting the number of options a user needs to understand, there's no real use case for sort_dataclass_like=True, sort_collections=False. The use cases are:
- No sorting: efficiency, the default
- Sort unordered collections only: determinism with a mild performance cost
- Sort unordered collections & object fields: human readability, or a canonical form that other tools can also produce
A few additional kwarg ideas:
# def encode(..., order: Literal[None, "deterministic", "sorted"] = None):
# could also call the kwarg `ordering`
encode(obj, order=None) # could also call this "unordered", but I prefer None
encode(obj, order="deterministic")
encode(obj, order="sorted")
# def encode(..., sort_level: Literal[0, 1, 2] = 0):
# treats sorting as a knob to turn up. I don't really like this one.
encode(obj, sort_level=0)
encode(obj, sort_level=1)
encode(obj, sort_level=2)
Right now I'm leaning towards the former. I agree it reads better than the sort kwarg I originally proposed above, and more explicitly describes the use cases for each mode. Also, by not explicitly describing what's being sorted in each case, it leaves it open to add additional normalization steps as needed for different protocols to achieve a deterministic output.
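For readers who want to see how a single literal-valued kwarg like this could be typed and validated, here is a minimal sketch. The `Order` alias and the `check_order` helper are hypothetical names made up for illustration; this is not msgspec's implementation.

```python
from typing import Literal, get_args

# Hypothetical alias mirroring the proposed signature:
#   def encode(..., order: Literal[None, "deterministic", "sorted"] = None)
Order = Literal[None, "deterministic", "sorted"]


def check_order(order):
    """Return `order` unchanged if it is one of the allowed literals, else raise."""
    allowed = get_args(Order)  # the literal values listed in the alias
    if order not in allowed:
        raise ValueError(f"order must be one of {allowed!r}, got {order!r}")
    return order


check_order(None)             # no sorting requested
check_order("deterministic")  # sort unordered collections only
check_order("sorted")         # sort collections and object fields
```

With a single kwarg there is no way to spell a combination that has no use case, and a static type checker can flag a misspelled literal before the code runs.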
from msgspec.
# def encode(..., order: Literal[None, "deterministic", "sorted"] = None):
# could also call the kwarg `ordering`
encode(obj, order=None) # could also call this "unordered", but I prefer None
encode(obj, order="deterministic")
encode(obj, order="sorted")
I like this one
from msgspec.
Sure, but boolean values are valid in literal types as well. This would be typed something like:
def encode(..., sort: Literal[True, False, "canonical"]=False): ...
I'm not attached to using True/False if there are some short string literals that make more sense. The implementation is essentially done, just need to do the hardest part, naming things :).
from msgspec.
No sort_keys option currently exists. Can you say more about your use case for one?
from msgspec.
I assume it's to make the output more human-readable.
My view is that msgspec output is designed to be machine-readable, and in my own use cases I use rich.print_json to pretty-print JSON (msgspec output), and optionally sort keys, for human consumption, e.g. output to console.
from msgspec.
I can provide a use case: we do a lot of snapshot testing, comparing JSON-serialized ASTs. Since we frequently change the tree structure and often have to manually review diffs, we sort the keys before dumping the output. It is pretty easy to sort the keys after the fact, but obviously slower. Although, to be fair, it's not a big deal since it only matters in testing.
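A minimal sketch of the "sort after the fact" approach, using only the standard library. The `normalize_snapshot` helper and the sample payloads are made up for illustration; the input bytes could come from any encoder.

```python
import json


def normalize_snapshot(raw: bytes) -> str:
    """Re-serialize already-encoded JSON with sorted keys so diffs stay stable."""
    # Decoding and re-encoding is the extra cost of sorting after the fact.
    return json.dumps(json.loads(raw), sort_keys=True, indent=2)


# Two encodings of the same tree that differ only in key order.
a = b'{"type": "BinOp", "left": {"type": "Num", "value": 1}, "op": "+"}'
b = b'{"op": "+", "left": {"value": 1, "type": "Num"}, "type": "BinOp"}'

same = normalize_snapshot(a) == normalize_snapshot(b)
```

The round trip through `json.loads` is why this is slower than sorting inside the encoder, which matters little in a test suite.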
from msgspec.
My purpose for using key sorting, and why I asked, was to make a smooth transition in my work project from the standard Flask JSON serializer (based on the built-in json library) to msgspec. The Flask developers made key sorting the default (https://github.com/pallets/flask/blob/main/src/flask/json/provider.py#L149), and I would like all our API users and testers not to notice the transition.
That said, I don't consider this feature mandatory. It would mainly be useful for compatibility with json/ujson/orjson, which have key sorting.
from msgspec.
I have exactly the same case as @evgenii-moriakhin, I'm trying to integrate msgspec into Flask applications.
It is also useful for caching: enforcing a key order means equivalent payloads serialize to identical bytes, which improves the cache hit/miss ratio.
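To illustrate the caching point, here is a small standard-library sketch. The `cache_key` helper is a hypothetical name; the idea is only that sorted output makes equal payloads hash to the same key.

```python
import hashlib
import json


def cache_key(payload: dict) -> str:
    """Hash a payload's sorted-key JSON form so key order does not affect the key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, different insertion order.
k1 = cache_key({"user": 1, "page": 2})
k2 = cache_key({"page": 2, "user": 1})
hit = k1 == k2
```

Without sorting, the two dicts above would serialize differently and produce two cache entries for the same logical request.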
from msgspec.
Update - I have a working implementation of this, only need to figure out a nice spelling.
In summary - we expose 3 levels of sorting:
- No Sorting. The default. Structured objects (structs/dataclasses/...) encode in the order of their defined fields, unordered collections (dicts, sets, ...) encode in the order of their current in-memory representation.
- Sort unordered collections alone. This mode ensures reproducible results across invocations by ordering items in semantically unordered collections (dicts and sets). This compromises a bit on speed, but ensures consistent output. Because the output is consistent, it can be used for caching or hashing, but not all keys are ordered for human readability.
- Fully sorted. Besides sorting dicts and sets, we also sort the field orders of all structured objects. This compromises more on speed, but ensures the output JSON has fields sorted alphabetically for all JSON object types.
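The three levels above can be emulated in pure Python to make the difference concrete. This is only a sketch under stated assumptions: the `prepare` helper and the mode names `"off"`, `"collections"`, and `"all"` are hypothetical, it uses dataclasses and the stdlib `json` module rather than msgspec, and it assumes dict keys and set items are mutually comparable.

```python
import json
from dataclasses import dataclass, fields, is_dataclass


def prepare(obj, mode="off"):
    """Convert obj to JSON-ready builtins, sorting according to `mode`.

    mode is a hypothetical spelling: "off", "collections", or "all".
    """
    if is_dataclass(obj) and not isinstance(obj, type):
        names = [f.name for f in fields(obj)]  # definition order
        if mode == "all":
            names.sort()  # level 3: also sort struct-like fields
        return {n: prepare(getattr(obj, n), mode) for n in names}
    if isinstance(obj, dict):
        keys = sorted(obj) if mode != "off" else list(obj)
        return {k: prepare(obj[k], mode) for k in keys}
    if isinstance(obj, (set, frozenset)):
        items = sorted(obj) if mode != "off" else list(obj)
        return [prepare(i, mode) for i in items]
    if isinstance(obj, (list, tuple)):
        return [prepare(i, mode) for i in obj]  # ordered: never re-sorted
    return obj


@dataclass
class Point:
    y: int
    x: int
    tags: dict


p = Point(y=2, x=1, tags={"b": 1, "a": 2})
off = json.dumps(prepare(p, "off"))
collections_only = json.dumps(prepare(p, "collections"))
everything = json.dumps(prepare(p, "all"))
# off              -> {"y": 2, "x": 1, "tags": {"b": 1, "a": 2}}
# collections_only -> {"y": 2, "x": 1, "tags": {"a": 2, "b": 1}}
# everything       -> {"tags": {"a": 2, "b": 1}, "x": 1, "y": 2}
```

Note that the middle level leaves `y` before `x` because dataclass fields already have a fixed, reproducible order; only the dict is reordered.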
The question is - what to name this kwarg and options?
For the kwarg name, I don't want to use sort_keys since we're sorting more than keys, but I do want "sort" to appear in the name. Right now I'm waffling between sort and sort_mode.
For values, I think I want False and True to correspond to the first and second levels above, respectively. I'm not sure what to call the third. "all"? "full"? "canonical"?
The reason I don't want True to map to the third level (sort everything) is that I suspect most users will just pass in True, and for most use cases a consistent (but not necessarily fully sorted) output will suffice. I'm trying to make the most performant sorting option also the one people use. For prior art, this also matches what Go's existing json library does.
# Example of potential kwarg naming and semantics
msgspec.json.encode(obj) # defaults to no sorting
msgspec.json.encode(obj, sort=False) # no sorting
msgspec.json.encode(obj, sort=True) # only sorts dicts and sets, Structs remain in field order
msgspec.json.encode(obj, sort="canonical") # sorts sets, dicts, and any object keys
Thoughts?
from msgspec.
I don't like the ability to pass both a boolean type and a string; from an API friendliness standpoint, it seems quite ambiguous.
At the same time, what alternatives can be suggested? The only things that come to mind are Enums or option flags like in https://github.com/ijl/orjson#option.
Neither of these two approaches is KISS-like, but they are more explicit (and more meaningful naming can be chosen), which, in my opinion, corresponds to the Zen of Python.
It is also worth considering whether there are plans for other serialization options in msgspec in the future? (Again, see example https://github.com/ijl/orjson#option).
In this case, the flags option seems the most advantageous, as it allows for expressing explicitness, meaningful naming, and the ability to extend serialization options in the future. This configuration approach with bitwise flags is also used in other popular libraries used in Python (not just orjson) - for example, libvirt.
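For comparison, a bitwise-flag design could look roughly like the sketch below in Python. The `Opt` class and its member names are hypothetical and belong to neither msgspec nor orjson; the sketch only shows how flags combine and how an encoder would test them.

```python
from enum import IntFlag


class Opt(IntFlag):
    """Hypothetical encoder options, combined with the | operator."""
    NONE = 0
    SORT_COLLECTIONS = 1
    SORT_FIELDS = 2


def wants(options: Opt, flag: Opt) -> bool:
    """Return True if `flag` is set in `options`."""
    return bool(options & flag)


full = Opt.SORT_COLLECTIONS | Opt.SORT_FIELDS
collections_only = Opt.SORT_COLLECTIONS
```

Flags leave room for unrelated options later, but they also permit combinations such as `Opt.SORT_FIELDS` alone, and they require an import just to configure a call.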
from msgspec.
I’d vote strongly in favor of an enum. Afaik this is much more widespread in the Python ecosystem and more explicit than the binary flags orjson uses. Flags and bits are more the domain of lower-level languages like C. Unless you plan to support dict/set and structured object sorting independently, in which case flags it is.
P.S. curious why sorting structured types is a bigger performance impact than dicts/sets. Thought that for structured types it is a “compile” (encoder build) time impact, while dicts/sets it is an actual item sort.
from msgspec.
I personally don't like using bitwise flags as they don't feel pythonic, and am against the usage of enums (or enums alone) to avoid the need for additional imports just to set a config. There's lots of precedent in python for usage of a fixed set of literals to configure an option (hence the existence of typing.Literal), and we're already using that in msgspec in several places. See the uuid_format here for example.
P.S. curious why sorting structured types is a bigger performance impact than dicts/sets. Thought that for structured types it is a “compile” (encoder build) time impact, while dicts/sets it is an actual item sort.
The implementation I have currently doesn't do the sorting at compile-time because I didn't want to bother with it yet and assumed any code paths requiring sorting would be less performance oriented than the default no-sorting path. We could optimize this path further with structured types (easiest for msgspec.Struct types, a bit of a refactor for attrs/dataclasses needed before handling it there). That said, my assumption here is that structured types are best represented in the fixed order of their fields, and sorting is really only needed for unordered collections like dict/set. Making the most intuitive config option only affect these types felt right to me. It also matches what golang does (structs are encoded in field order, maps are sorted before encoding).
from msgspec.
There's lots of precedent in python for usage of a fixed set of literals to configure an option (hence the existence of typing.Literal), and we're already using that in msgspec in several places. See the uuid_format here for example.
However, you suggested using boolean types. Are there any reasons to use True/False for values instead of string literals? It could be useful if the argument was called sort_keys, but the name is being changed anyway.
from msgspec.
The implementation I have currently doesn't do the sorting at compile-time because I didn't want to bother with it yet and assumed any code paths requiring sorting would be less performance oriented than the default no-sorting path.
Got it, thanks for the clarification. As I posted, in my use case it is indeed not performance critical.
Regarding enums vs. literals: if you want zero imports, I would then vote for two independent kwargs, e.g. sort_collections and sort_dataclass_like (or maybe even sort_structs).
Not sure if a union of boolean & string is pythonic. Since you are not evolving an existing API, but creating a new one from scratch, there is nothing that forces you to mix types like this. What if this is stored as an app config - deserializing a config that may be a string and boolean may be ambiguous itself (depending on the format of course)
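As a concrete case of that ambiguity, consider an INI-style config, where every value is read back as a string. The sketch below uses the stdlib `configparser`; the `[encoder]` section, the `sort` key, and the `parse_sort` helper are made up for illustration.

```python
import configparser

cfg = configparser.ConfigParser()
cfg.read_string("[encoder]\nsort = true\n")

# INI has no boolean type: the value comes back as the string "true".
raw = cfg["encoder"]["sort"]


def parse_sort(value: str):
    """Map a config string onto a bool-or-string option by convention."""
    lowered = value.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"  # treated as a boolean
    return value                  # anything else is kept as a string literal


from_config = parse_sort(raw)
named_mode = parse_sort("canonical")
```

The application has to invent a convention like `parse_sort` to recover the boolean, whereas an option that only accepts string literals could be passed through unchanged.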
from msgspec.
Would having two arguments work?
def encode(..., sort: bool, sort_options: Literal["collections", "all"]='collections'): ...
I'm not sure about the naming of the sort_options argument, but canonical doesn't seem quite right either (though maybe there are no good options for a descriptive name for unordered collections and struct fields!)
Edit: Just saw the above post 🤦
I guess it's slightly different in that there is only one sort argument and the second argument controls how the sorting is performed.
If you wanted control to sort struct fields independently of collections, the sort_options argument could be easily extended to Literal["collections", "fields", "all"]
from msgspec.
I think splitting the sorting capability into separate options is excessive, and overall I like the approach with string literals without boolean values
sort: Literal["no", "exclude_structs", "canonical"]
(as an example, the correct naming is what we are discussing right now).
It solves the problem of explicit and meaningful naming.
from msgspec.