GithubHelp home page GithubHelp logo

Comments (16)

jcrist avatar jcrist commented on May 28, 2024 5

What if this is stored as an app config - deserializing a config that may be a string and boolean may be ambiguous itself (depending on the format of course)

I don't find this argument convincing - if we cared about this we'd only ever pass in strings as config variables to any function. A limited type system in some legacy config format shouldn't limit the options we can use in python.

I think splitting the sorting capability into separate options is excessive

I also would really prefer to keep this configured via one keyword argument. Beyond limiting the number of options a user needs to understand, there's no real use case for the case of sort_dataclass_like=True, sort_collections=False. The use cases are:

  • No sorting: efficiency, the default
  • Sort unordered collections only: determinism with a mild performance cost
  • Sort unordered collections & object fields human readability or a canonical form that other tools can also produce

A few additional kwarg ideas:

# def encode(..., order: Literal[None, "deterministic", "sorted"] = None):
# could also call the kwarg `ordering`
encode(obj, order=None)   # could also call this "unordered", but I prefer None
encode(obj, order="deterministic")
encode(obj, order="sorted")

# def encode(..., sort_level: Literal[0, 1, 2] = 0):
# treats sorting as a knob to turn up. I don't really like this one.
encode(obj, sort_level=0)
encode(obj, sort_level=1)
encode(obj, sort_level=2)

Right now I'm leaning towards the former. I agree it reads better than the sort kwarg I originally proposed above, and more explicitly describes the use cases for each mode. Also, by not explicitly describing what's being sorted in each case it leaves it open to add additional normalization steps as needed for different protocols to achieve a deterministic output.

from msgspec.

mishamsk avatar mishamsk commented on May 28, 2024 3
# def encode(..., order: Literal[None, "deterministic", "sorted"] = None):
# could also call the kwarg `ordering`
encode(obj, order=None)   # could also call this "unordered", but I prefer None
encode(obj, order="deterministic")
encode(obj, order="sorted")

I like this one

from msgspec.

jcrist avatar jcrist commented on May 28, 2024 1

Sure, but boolean values are valid in literal types as well. This would be typed something like:

def encode(..., sort: Literal[True, False, "canonical"]=False): ...

I'm not attached to using True/False if there are some short string literals that make more sense. The implementation is essentially done, just need to do the hardest part, naming things :).

from msgspec.

jcrist avatar jcrist commented on May 28, 2024

No sort_keys option currently exists. Can you say more about your use case for one?

from msgspec.

dhirschfeld avatar dhirschfeld commented on May 28, 2024

I assume it's to make the output more human-readable.

My view is that msgspec output is designed to be machine-readable and in my own use-cases I use rich.print_json to pretty-print JSON (msgspec output), and optionally sort-keys, for human consumption e.g. output to console.

from msgspec.

mishamsk avatar mishamsk commented on May 28, 2024

I can provide a use-case: we do a lot of snapshot testing, comparing json serialized AST tree’s. Since we frequently change the tree structure and often have to manually review diffs, we sort the keys before dumping the output. It is pretty easy to sort the keys after the fact, but obviously slower. Although, to be fair, not a big deal since it is only important in testing.

from msgspec.

evgenii-moriakhin avatar evgenii-moriakhin commented on May 28, 2024

My purpose for using key sorting, and why I asked, was to make a smooth transition in my work project from the standard Flask json serializer (based on the built-in json library) to msgspec. The Flask developers have made the default key sorting - https://github.com/pallets/flask/blob/main/src/flask/json/provider.py#L149

and I would like all our API users and testers not to notice the transition

but I don't find this feature mandatory. It would be great only as compatibility with json/ujson/orjson that have key sorting

from msgspec.

noudin-ledger avatar noudin-ledger commented on May 28, 2024

I have exactly the same case as @evgenii-moriakhin, I'm trying to integrate msgspec into Flask applications.
It is also useful for caching, enforcing the key orders can allow to have a quicker cache hit and improve the hit/miss ratio

from msgspec.

jcrist avatar jcrist commented on May 28, 2024

Update - I have a working implementation of this, only need to figure out a nice spelling.

In summary - we expose 3 levels of sorting:

  1. No Sorting. The default. Structured objects (structs/dataclasses/...) encode in the order of their defined fields, unordered collections (dicts, sets, ...) encode in the order of their current in-memory representation.
  2. Sort unordered collections alone. This mode ensures reproducible results across invocations by ordering items in semantically unordered collections (dicts and sets). This compromises a bit on speed, but ensures consistent output. This can be used for caching or hashing the output purposes since it's consistent, but not all keys are ordered for human-readable purposes.
  3. Fully sorted. Besides sorting dicts and sets, we also sort the field orders of all structured objects. This compromises more on speed, but ensures the output JSON has fields sorted alphabetically for all JSON object types.

The question is - what to name this kwarg and options?

For kwarg name, I don't want to use sort_keys since we're sorting more than keys, but do want "sort" to appear in the name. Right now I'm waffling between sort and sort_mode.

For values, I think I want False and True to correspond to 1 & 2 respectively above. I'm not sure what to call 3. "all"? "full"? "canonical"?

The reason I want True to not map to 3 (sort everything) is that I suspect most users will just pass in True, and for most use cases a consistent (but not necessarily fully sorted) output will suffice. Trying to make the most performant sorting option also the one people use. For prior art, this also matches what go's existing json library does.

# Example of potential kwarg naming and semantics
msgspec.json.encode(obj)  # defaults to no sorting
msgspec.json.encode(obj, sort=False)  # no sorting
msgspec.json.encode(obj, sort=True)  # only sorts dicts and sets, Structs remain in field order
msgspec.json.encode(obj, sort="canonical")  # sorts sets, dicts, and any object keys

Thoughts?

from msgspec.

evgenii-moriakhin avatar evgenii-moriakhin commented on May 28, 2024

I don't like the ability to pass both a boolean type and a string; from an API friendliness standpoint, it seems quite ambiguous.

At the same time, what alternatives can be suggested? The only things that come to mind are Enums or option flags like in https://github.com/ijl/orjson#option.

Neither of these two approaches is KISS-like, but they are more explicit (and more meaningful naming can be chosen), which, in my opinion, corresponds to the Zen of Python.

It is also worth considering whether there are plans for other serialization options in msgspec in the future? (Again, see example https://github.com/ijl/orjson#option).

In this case, the flags option seems the most advantageous, as it allows for expressing explicitness, meaningful naming, and the ability to extend serialization options in the future. This configuration approach with bitwise flags is also used in other popular libraries used in Python (not just orjson) - for example, libvirt.

from msgspec.

mishamsk avatar mishamsk commented on May 28, 2024

I’d vote strongly in favor of an enum. Afaik this is much more widespread [in Python ecosystem] and explicit than binary flags like orjson uses. Flags & bits are more domain of lower level language like C. Unless you plan to support dict/set and structured objects sorting independently. Then flag it is.

P.S. curious why sorting structured types is a bigger performance impact than dicts/sets. Thought that for structured types it is a “compile” (encoder build) time impact, while dicts/sets it is an actual item sort.

from msgspec.

jcrist avatar jcrist commented on May 28, 2024

I personally don't like using bitwise flags as they don't feel pythonic, and am against the usage of enums (or enums alone) to avoid the need for additional imports just to set a config. There's lots of precedence in python for usage of a fixed set of literals to configure an option (hence the existence of typing.Literal), and we're already using that in msgspec in several places. See the uuid_format here for example.

P.S. curious why sorting structured types is a bigger performance impact than dicts/sets. Thought that for structured types it is a “compile” (encoder build) time impact, while dicts/sets it is an actual item sort.

The implementation I have currently doesn't do the sorting at compile-time because I didn't want to bother with it yet and assumed any code paths requiring sorting would be less performance oriented than the default no-sorting path. We could optimize this path further with structured types (easiest for msgspec.Struct types, a bit of a refactor for attrs/dataclasses needed before handling it there). That said, my assumption here is that structured types are best represented in the fixed order of their fields, and sorting is really only needed for unordered collections like dict/set. Making the most intuitive config option only affect these types felt right to me. It also matches what golang does (structs are encoded in field order, maps are sorted before encoding).

from msgspec.

evgenii-moriakhin avatar evgenii-moriakhin commented on May 28, 2024

There's lots of precedence in python for usage of a fixed set of literals to configure an option (hence the existence of typing.Literal), and we're already using that in msgspec in several places. See the uuid_format here for example.

However, you suggested using boolean types. Are there any reasons to use True/False for values instead of string literals? It could be useful if the argument was called sort_keys, but the name is being changed anyway.

from msgspec.

mishamsk avatar mishamsk commented on May 28, 2024

@jcrist

The implementation I have currently doesn't do the sorting at compile-time because I didn't want to bother with it yet and assumed any code paths requiring sorting would be less performance oriented than the default no-sorting path.

Got it, thanks for clarification. As I’ve posted, in my use case it is not performance critical indeed.

Rgd enums vs. literals. If you want zero imports, I would then vote for two independent kwargs, e.g. sort_collections and sort_dataclass_like (or maybe even sort_structs).

Not sure if a union of boolean & string is pythonic. Since you are not evolving an existing API, but creating a new one from scratch, there is nothing that forces you to mix types like this. What if this is stored as an app config - deserializing a config that may be a string and boolean may be ambiguous itself (depending on the format of course)

from msgspec.

dhirschfeld avatar dhirschfeld commented on May 28, 2024

Would having two arguments work?

def encode(..., sort: bool, sort_options: Literal["collections", "all"]='collections'): ...

I'm not sure about the naming of the sort_options arguments but canonical doesn't seem quite right either (though maybe there are no good options for a descriptive name for unordered collections and struct fields!)


Edit: Just saw the above post 🤦
I guess it's slightly different in that there is only one sort argument and the second argument controls how the sorting is performed.

If you wanted control to sort struct fields independently of collections the sort_options argument could be easily extended to Literal["collections", "fields", "all"]

from msgspec.

evgenii-moriakhin avatar evgenii-moriakhin commented on May 28, 2024

I think splitting the sorting capability into separate options is excessive, and overall I like the approach with string literals without boolean values

sort: Literal["no", "exclude_structs", "canonical"]
(as an example, the correct naming is what we are discussing right now).

it solves the problems of explicit and meaningful naming

from msgspec.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.