Validation on serialization (msgspec, open issue)

fungs commented on May 28, 2024
Validation on serialization

from msgspec.

Comments (5)

FHU-yezi commented on May 28, 2024

Maybe related to #513?

We can validate the data when we create the struct object.

fungs commented on May 28, 2024

Thanks @FHU-yezi for the linked issue. I've read through it, and it is definitely related.

Let me try to explain a little further for everyone to understand this request.

IMO there are architectural and practical differences depending on who performs the validation and when. The goal should be to guarantee that a data structure was validated and not modified before serialization.

Strategy 1: In-type validation (variant validate-frozen-on-construction)

I really like this concept because it merges the notions of type and constraints. The distinction between the two is, in my eyes, just an artifact of how computer systems commonly define and handle data types, mostly related to hardware architecture. However, to guarantee that the data stays valid all the way until serialization, we must either write-protect it effectively (aka frozen objects) or revalidate after each possible modification. The former is difficult in Python due to its dynamic nature. The latter requires you to rewrite or wrap a type with all its write-enabled methods, even its accessible members.

An example of this approach is Pydantic's NonNegativeInt type. If the type invariant says "I cannot be invalid", all is fine. I'd go for this approach in appropriate programming languages, but not in Python, where it would be really hard for anyone to write custom types.
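
As a rough illustration of validate-frozen-on-construction and of why write protection is hard to enforce in Python, here is a minimal sketch using only the standard library. The class name NonNegativeCount is hypothetical and is not Pydantic's NonNegativeInt or anything from msgspec.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class NonNegativeCount:
    """Hypothetical value type: validated once on construction, then frozen."""

    value: int

    def __post_init__(self) -> None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.value, bool) or not isinstance(self.value, int):
            raise TypeError("value must be an int")
        if self.value < 0:
            raise ValueError("value must be >= 0")


count = NonNegativeCount(3)

# Ordinary attribute assignment is blocked by the frozen dataclass...
try:
    count.value = -1
    blocked = False
except FrozenInstanceError:
    blocked = True

# ...but the protection is a convention only: object.__setattr__ bypasses it,
# leaving an instance that violates its own invariant.
object.__setattr__(count, "value", -1)
```

After the last line, count holds -1 even though the constructor would have rejected that value, which is the gap between "validated on construction" and "valid at serialization time".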

Strategy 2: Lazy validation (serialization)

If we cannot guarantee a validated state or safeguard the type object from modification during processing, the logical option is to defer the validation to the time of serialization, thus circumventing the problem. To me, this also makes sense because the serialization routine usually needs to touch and re-encode every single item in the data structure anyway, which would guarantee that we spend linear time on validation. It's important to note that the validation needs to be type-informed, just like the serialization: both require deep knowledge about the semantics and structure of the type being processed.

In msgspec, validation is only applied for the back-transform (deserialization). In this case, it doesn't really matter how it is done, because the full pipeline is implemented in msgspec itself. I assume that, for efficiency reasons, msgspec does validation on deserialization in C code, once the final data type objects are constructed in the chain.

Architecture

So why don't we just validate on instantiation and protect the data by code ownership until serialization?

The answer is software architecture. The data types in these kinds of frameworks (see attrs, pydantic, dataclass etc.) serve two different purposes: defining data models and interfaces and creating and working with objects easily and efficiently. So when building a standalone serialization layer for specific data, with a matching interface, the objects are constructed outside, maybe in a mutable version, maybe much earlier in the data processing pipeline, in custom code or in a different Python package, but relying on the very same interface definition. Thus, we cannot assume that all passed objects comply with the definition expected by the receiver.

That being said, if the struct constructor mentioned in #513 accepts an object of the same type with zero copy and can validate all the members, this would be equivalent to a simple validate(data) call to be run right before serialization (although probably less efficient than validation and serialization in the same procedure).

fungs commented on May 28, 2024

#614 is inspired by the same architectural considerations.

FHU-yezi commented on May 28, 2024

@fungs said something really meaningful.

For the strategy 2 he mentioned, there is another use case to consider: what if the struct will never be serialized?

In my case, the struct object is used directly by the user's code, and it exists only for autocompletion and type checking. The user will never serialize it unless they want to store it somewhere else.

In that case, if we don't support validation on init, the struct definition may differ from the real data, which will lead to misunderstandings.

fungs commented on May 28, 2024

This seems to be a well-structured approach to strategy 1: https://smarie.github.io/python-vtypes/

It might be compatible with msgspec, but I need to test that.
