
Core validation logic for pydantic, written in Rust

License: MIT License



pydantic-core


This package provides the core functionality for pydantic validation and serialization.

Pydantic-core is currently around 17x faster than pydantic V1. See tests/benchmarks/ for details.

Example of direct usage

NOTE: You should not need to use pydantic-core directly; instead, use pydantic, which in turn uses pydantic-core.

from pydantic_core import SchemaValidator, ValidationError


v = SchemaValidator(
    {
        'type': 'typed-dict',
        'fields': {
            'name': {
                'type': 'typed-dict-field',
                'schema': {
                    'type': 'str',
                },
            },
            'age': {
                'type': 'typed-dict-field',
                'schema': {
                    'type': 'int',
                    'ge': 18,
                },
            },
            'is_developer': {
                'type': 'typed-dict-field',
                'schema': {
                    'type': 'default',
                    'schema': {'type': 'bool'},
                    'default': True,
                },
            },
        },
    }
)

r1 = v.validate_python({'name': 'Samuel', 'age': 35})
assert r1 == {'name': 'Samuel', 'age': 35, 'is_developer': True}

# pydantic-core can also validate JSON directly
r2 = v.validate_json('{"name": "Samuel", "age": 35}')
assert r1 == r2

try:
    v.validate_python({'name': 'Samuel', 'age': 11})
except ValidationError as e:
    print(e)
    """
    1 validation error for model
    age
      Input should be greater than or equal to 18
      [type=greater_than_equal, context={ge: 18}, input_value=11, input_type=int]
    """

Getting Started

You'll need Rust stable installed, or Rust nightly if you want to generate accurate coverage.

With Rust and Python 3.8+ installed, compiling pydantic-core should be possible with roughly the following:

# clone this repo or your fork
git clone git@github.com:pydantic/pydantic-core.git
cd pydantic-core
# create a new virtual env
python3 -m venv env
source env/bin/activate
# install dependencies and install pydantic-core
make install

That should be it; the example shown above should now run.

You might find it useful to look at python/pydantic_core/_pydantic_core.pyi and python/pydantic_core/core_schema.py for more information on the Python API. Beyond that, tests/ provides a large number of examples of usage.

If you want to contribute to pydantic-core, you'll want to use some other make commands:

  • make build-dev to build the package during development
  • make build-prod to perform an optimised build for benchmarking
  • make test to run the tests
  • make testcov to run the tests and generate a coverage report
  • make lint to run the linter
  • make format to format python and rust code
  • make to run format build-dev lint test

Profiling

It's possible to profile the code using the flamegraph utility from flamegraph-rs. (Tested on Linux.) You can install this with cargo install flamegraph.

Run make build-profiling to install a release build with debugging symbols included (needed for profiling).

Once that is built, you can profile pytest benchmarks with (e.g.):

flamegraph -- pytest tests/benchmarks/test_micro_benchmarks.py -k test_list_of_ints_core_py --benchmark-enable

The flamegraph command will produce an interactive SVG at flamegraph.svg.

Releasing

  1. Bump the package version locally. Do not just edit Cargo.toml on GitHub; both Cargo.toml and Cargo.lock need to be updated.
  2. Make a PR for the version bump and merge it.
  3. Go to https://github.com/pydantic/pydantic-core/releases and click "Draft a new release"
  4. In the "Choose a tag" dropdown enter the new tag v<the.new.version> and select "Create new tag on publish" when the option appears.
  5. Enter the release title in the form "v<the.new.version>"
  6. Click the "Generate release notes" button
  7. Click "Publish release"
  8. Go to https://github.com/pydantic/pydantic-core/actions and ensure that all release builds complete successfully.
  9. Go to https://pypi.org/project/pydantic-core/ and ensure that the latest release is published.
  10. Done 🎉

pydantic-core's People

Contributors

adriangb, alexmojaki, aminalaee, art049, czotomo, danielsanchezq, davidhewitt, dependabot[bot], dmontagu, dswij, hramezani, jeanarhancet, kludex, lig, markussintonen, messense, mgorny, neevcohen, ollz272, philhchen, prettywood, realdragonium, samdobson, samuelcolvin, sanders41, stranger6667, sydney-runkle, viicos, vvanglro, yohanvalencia


pydantic-core's Issues

strict JSON types

I seem to have had a mental lapse and forgotten about strict working properly on JSON types.

JSON input should match Python behaviour, e.g. (sketched after the list):

  • only allow int for int input
  • only allow float for float inputs
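
A sketch of the intended behaviour (using the strict flag on an int schema; per this issue, it doesn't yet work properly for JSON input):

from pydantic_core import SchemaValidator

v = SchemaValidator({'type': 'int', 'strict': True})
assert v.validate_json('123') == 123  # a JSON number that is an int: accepted
# v.validate_json('123.0') should fail: a JSON float is not an int in strict mode
# v.validate_json('"123"') should fail: a JSON string is not an int in strict mode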

Python Exception for custom kind and message

We need a way for errors raised in Python to properly populate kind, message, context, maybe even loc.

Solutions:

  1. the simplest solution is to just look for some attributes on all ValueErrors
  2. we could also create and export a custom Exception, then check for that - thereby avoiding unnecessary getattr calls
  3. a variation on 1) would be to look for one attribute which must be a dict, then get items from that

I guess we should do some profiling to see which is fastest.
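
A rough Python sketch of option 2 - the class name and attributes here are illustrative, not an existing API:

# hypothetical custom exception carrying the error details, so the Rust side
# can do a single isinstance check instead of probing every ValueError
class CustomKindError(ValueError):
    def __init__(self, kind, message, context=None):
        super().__init__(message)
        self.kind = kind
        self.message = message
        self.context = context or {}

def check_age(value):
    if value < 18:
        raise CustomKindError('too_young', 'Input should be at least 18', {'ge': 18})
    return value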

`on_error` validator and field property

  • A validator which returns a default value when validation fails
  • A setting on TypedDict fields which allows the value to be omitted, or the default to be used, if an error occurs (a hypothetical schema shape is sketched below)
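
A hypothetical schema shape for this (the 'on-error' type and its keys are illustrative, not an existing API):

schema = {
    'type': 'on-error',
    'schema': {'type': 'int', 'ge': 18},
    'on_error': 'default',  # or 'omit' to drop the field from the output
    'default': 18,
}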

recursive dictionaries passed into SchemaValidator result in a panic

For obvious reasons, passing in a recursive dict causes a segfault. It's certainly user error, but it might be nice to raise a Python RecursionError instead of segfaulting.

from pydantic_core import SchemaValidator
schema = {"type": "union", "choices": []}
schema['choices'].append(schema)
SchemaValidator(schema)

Include internal details on some errors

We could add another property to line errors with internal details on what went wrong.

E.g. info that we wouldn't want to show to end users but which might help a developer debugging the errors.

Example would be DateTimeObjectInvalid, see #77.

`Sequence` Validator

Like the other sequence-like validators, but keeping the input type.

What do we do about str? I guess we have to allow it, but there are common scenarios where you want "any sequence other than a string".
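
The check users regularly want could look something like this (a sketch; note bytes is also a Sequence and probably wants excluding too):

from collections.abc import Sequence

def is_non_str_sequence(value):
    return isinstance(value, Sequence) and not isinstance(value, (str, bytes))

assert is_non_str_sequence([1, 2, 3])
assert not is_non_str_sequence('abc')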

Validating existing models

As per #21 we need to make sure values are copied.

As @tiangolo points out at in pydantic/pydantic#4218 (comment) we need to be able to validate a model without relying on isinstance.

I think we should therefore change how existing models are validated to effectively revalidate model.__dict__. That should solve copying (of models at least) and avoid subclasses being validated as parent classes.

This might have some performance impact, but it'll be much smaller than a hack in python to work around it.
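
A minimal Python sketch of the idea, where fields_validator stands in for a validator over the model's fields (names here are illustrative):

def revalidate(model_cls, instance, fields_validator):
    # re-run field validation over the instance's attributes instead of
    # accepting any isinstance(instance, model_cls) value as-is
    data = fields_validator.validate_python(instance.__dict__)
    new = model_cls.__new__(model_cls)
    new.__dict__.update(data)  # a fresh dict, so the result is a copy
    return new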

decide about `try_instance`

lax_dict takes a try_instance argument and can build a dict from a python object.

https://github.com/samuelcolvin/pydantic-core/blob/9330a17ea51e566539ba761709f8f5b9aec7553d/src/input/input_python.rs#L140

We need to decide when this should be used and when not; this also needs to be reflected in LookupKey.

Presumably this should be a config setting? But if it's just a config setting, we can't easily have a from_orm method (where this would apply).

Perhaps we're happy to have it set via config, then if from_orm = True in config, Model.parse_obj(my_object) would just work.

We should also use a consistent name, e.g. either from_orm or try_instance or something better, everywhere.

I would like to avoid runtime configuration options if possible.

@PrettyWood what do you think?

Also relates to #108.

make sure all values are copied

I guess in the Python versions of to_py we could somehow deep copy everything to match current pydantic behaviour.

Won't help performance but I think it's correct.

support positional arguments

We need a way to support positional arguments; this will be helpful for:

  • functions
  • named tuples
  • dataclasses - it should simplify our logic for dataclasses by allowing us to do all the validation before creating the dataclass

I'd like to reuse as much of the logic from TypedDictValidator as possible; here we can learn from the logic of validate_arguments in pydantic.

My proposal would be this (a rough sketch follows the list):

  • a new validator which wraps a TypedDictValidator
  • converts positional arguments to named arguments
  • we pass the resulting dict to the function
  • we say we can't support positional-only arguments for now - if we really need to support them, we could either have another validator or some custom logic; it'll be a bit slow, but that's fine
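
A rough sketch of the positional-to-named conversion, assuming the parameter names are known from the function signature:

def convert_args(arg_names, args, kwargs):
    named = dict(zip(arg_names, args))
    overlap = named.keys() & kwargs.keys()
    if overlap:
        raise TypeError(f'got multiple values for arguments: {sorted(overlap)}')
    named.update(kwargs)
    return named  # this dict is what the wrapped TypedDictValidator would see

assert convert_args(['a', 'b'], (1,), {'b': 2}) == {'a': 1, 'b': 2}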

switch to maturin

If this can be done easily, it should allow much faster cross-platform builds.

separation of validation and python interface

Hi!

I have been a long time user of pydantic (great library btw) and have been following the development of the rust library.

I was wondering if it would make sense/be possible to separate the Python-related code (use of Py* etc.) from the Rust implementation.

That way it'd be possible to use this great library not only in Python code, but also to port it to e.g. JS, or to use it for validation in Rust code.

I'd be happy to take a stab at it and create a PR - if wanted.

More types

  • union - move straight to smart union behaviour
  • literal - need specific cases for all strings and all ints (both using rust sets, call strict_str etc. before) and a general case using a python set #31
  • strict string
  • strict int etc.
  • custom types - perhaps done, can mostly use the function validators
  • date #77
  • datetime #77
  • set
  • frozen set #86
  • tuple #73
  • bytes #80
  • Decimal - this can just be a function

Performance questions?

I'm keen to "run onto the spike" and find any big potential performance improvements in pydantic-core while the API can be changed easily.

I'd therefore love anyone with experience of rust and/or pyo3 to have a look through the code and see if I'm doing anything dumb.

Particular concerns:

  • The "cast_as vs. extract" issues described in PyO3/pyo3#2278 was a bit scary as I only found the solution by chance, are there any other similar issues with pyo3?
  • generally copying and cloning values - am I copying stuff where I don't have to? In particular, is input or parts of input (in the case of a dict/list/tuple/set etc.) copied when it doesn't need to be?
  • Similarly, could we use a PyObject instead of PyAny or visa-versa and improve performance?
  • here and in a number of other implementations of ListInput and DictInput we do a totally unnecessary map, is this avoidable? Is this having a performance impact? Is there another way to give a general interface to the underlying datatypes that's more performance
  • The code for generating models here seems to be pretty slow compared to other validators, can anything be done?
  • Recursive models are slowing than I had expected, I thought it might be the use of RwLock that was causing the performance problems, but I managed to remove that (albeit in a slightly unsafe way) in #32 and it didn't make a difference. Is something else the problem? Could we remove Arc completely?
  • lifetimes get pretty complicated, I haven't even checked if get a memory leak from running repeat validation, should we/can we change any lifetimes?

I'll add to this list if anything else comes to me.

More generally I wonder if there are performance improvements that I'm not even aware of? "What you don't know, you can't optimise"

@pauleveritt @robcxyz

Changes to list, tuple, set, frozenset coercion

Stop coercing set / frozenset to list / tuple?

Although this is not "losing information", the result is not deterministic/repeatable.

E.g. if you have the field Tuple[PositiveInt, NegativeInt, str] then the input set {1, -1, 'a'} will sometimes work and sometimes fail - this is pretty confusing.

I think we should change this.
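
A sketch of the failure mode with pydantic V1 - across interpreter runs this sometimes validates and sometimes raises, because the iteration order of the set depends on string hash randomisation:

from typing import Tuple
from pydantic import BaseModel, PositiveInt, NegativeInt

class Model(BaseModel):
    f: Tuple[PositiveInt, NegativeInt, str]

Model(f={1, -1, 'a'})  # may or may not raise a ValidationError, run to run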

Stop coercing list / tuple to set / frozenset?

Should we allow coercing a list to a set? In this case we are "losing information" (e.g. order), however creating a set from a list is often desired - e.g. when parsing a format (yaml, toml etc.) that only has a list type.

I think we should not change this.

Add coercing dict_key to set / frozenset?

Not that common, but we have it now, and I think it kind of makes sense since dict_keys "feel like" (sorry to be fluffy) a set.

I guess since dict_keys are ordered, it should be fine to coerce them to list and tuple too.

I guess, as currently, we should allow dict_values to be coerced to all these types too?

I think we should change this.

Generators?

In pydantic V1 we allow converting a generator to any of these types.

I think we should allow converting a generator to a list or tuple, but not set or frozenset.

@PrettyWood @tiangolo thoughts?

First class field validator

Documenting in person discussion.

It might make sense to have a "type": "field" schema, or something along these lines, to collect options that apply to the field rather than to the type of the field - like a "not required" option that would leave the field unpopulated if it is not included.
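
Something like this, hypothetically (the 'field' type and 'required' key are illustrative, not an existing API):

schema = {
    'type': 'typed-dict',
    'fields': {
        'nickname': {
            'type': 'field',    # collects options about the field itself
            'required': False,  # leave the field unpopulated when absent
            'schema': {'type': 'str'},
        },
    },
}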

implement `timedelta`

Somehow I forgot timedelta. The work in speedate is done; this just needs the type implementing.

support NamedTuple / other initialization methods

From in person discussion.

Initializing a NamedTuple fails because it's immutable at a low level (object.__setattr__ and such tricks won't work).
Is this something we want to support?
Should we have a field option to specify how the field should be set (__new__ or setattr)? May be related to #59
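
A quick demonstration of the problem, and of the __new__ route:

from typing import NamedTuple

class Point(NamedTuple):
    x: int
    y: int

p = Point.__new__(Point, 1, 2)     # works: tuple contents are set in __new__
try:
    object.__setattr__(p, 'x', 3)  # the usual immutability workaround fails
except AttributeError as e:
    print(e)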

wasm support

I really want pydantic-core to support wasm. This is mostly so that the examples in pydantic's docs can be edited and run in the browser, but also for wider use of pydantic.

As per PyO3/pyo3#2412 (comment), it looks like it should be possible.

But I'm not sure how to integrate that with the maturin GitHub Actions. @messense any pointers? Or would you be willing to submit a PR?

Also, as well as getting wheels to build, what more do we need to do to get pydantic-core working with Pyodide?

Support self referencing models

E.g. like

from typing import Optional

from pydantic import BaseModel
from devtools import debug


class Branch(BaseModel):
    name: str
    sub_branch: Optional['Branch'] = None


b = Branch(name='main', sub_branch=Branch(name='sub'))
debug(b)

Feature Request: 3rd party non-JSON serialization/deserialization

Hi, author of pydantic-yaml here. I have no idea about anything Rust-related, unfortunately, but hopefully this feature request will make sense in Python land.

I'm going off this slide in this presentation by @samuelcolvin, specifically:

We could add support for other formats (e.g. yaml, toml) - the only side effect would be bigger binaries.

Here's a relevant discussion about "3rd party" deserialization from v1: pydantic/pydantic#3025

It would be great if pydantic-core were built in a way where non-JSON formats could be added "on top" rather than necessarily being built into the core. I understand performance is a big question in this rewrite, so ideally these would be high-level interfaces that can be hacked in Python (or implemented in Rust/etc. for better performance).

From the examples available already, it's possible that such a feature could be quite simple on the pydantic-core side - the 3rd party would create their own function à la validate_json, possibly just calling validate_python. However, care would be needed on how format-specific details are sent between pydantic and the implementation. In V1 this is done with the Config class and the special json_encoder/decoder attributes, which have been a pain to re-implement properly for YAML (without way too much hackery).

Ideally for V2, this would be something more easily addable and configurable. The alternative would be to just implement TOML, YAML etc. directly in the binary (and I wouldn't have to keep supporting my project, ha!)

Thanks again for Pydantic!

add tagged union

As you may already know I love unions: smart, strict and tagged ones 😄
I would like to work on a PR to add this.

The proposed syntax would be

'type': 'model',
'fields': {
    'pet': {
        'schema': {
            'type': 'union',
            'tag': 'species',
            'choices': [
                {
                    'type': 'model',
                    'fields': {
                        'species': {'schema': {'type': 'literal', 'expected': ['cat']}},
                        'lives': {'schema': {'type': 'int'}, 'default': 9},
                    },
                },
                {
                    'type': 'model',
                    'fields': {
                        'species': {'schema': {'type': 'literal', 'expected': ['dog']}},
                        'barks': {'schema': {'type': 'bool'}},
                    },
                },
            ],
        }
    },
},

I guess the UnionValidator would become something like

pub struct UnionValidator {
    choices: Vec<CombinedValidator>,
    strict: bool,
    tag: Option<Tag>
}

pub struct Tag {
    name: String,
    field_validator_mapping: HashMap<*const str, *const CombinedValidator>
}

What do you think @samuelcolvin?

Parsing JSON directly

Would be amazing if we could parse and validate JSON directly, without first creating Python objects and then validating them.

The basic idea would be to create traits to achieve all the conversions used here, then implement those traits for both serde types, and pyo3 types.

Then use those types instead of pyo3 types throughout validators.

If we did this, it also opens the door to using pydantic-core without Python 👀 - e.g. in an entirely theoretical "Tydantic" TypeScript package.

Recursion with function before

The recursion guard is unable to catch the following, which results in a segfault:

@pytest.mark.skip(reason='This case causes a seg-fault since the recursion checker cannot detect the cycle')
def test_function_change_id():
    def f(input_value, **kwargs):
        return input_value + ' Changed'

    v = SchemaValidator(
        {
            'choices': [
                {
                    'type': 'function',
                    'mode': 'before',
                    'function': f,
                    'schema': {'schema_ref': 'root-schema', 'type': 'recursive-ref'},
                },
                'int',
            ],
            'ref': 'root-schema',
            'type': 'union',
        }
    )

    with pytest.raises(ValidationError) as exc_info:
        assert v.validate_python('input value') == 'input value Changed'

    print(str(exc_info.value))

This is because the input is changing on each step, so the id isn't found in the recursion guard's lookup set.
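
A bounded sketch of why the id-based check never fires (assuming the guard keeps a set of id() values):

seen_ids, keep_alive = set(), []
value = 'input value'
for _ in range(5):  # bounded here; in the validator the recursion never ends
    assert id(value) not in seen_ids   # the guard's check never triggers
    seen_ids.add(id(value))
    keep_alive.append(value)           # mirrors values held on the recursion stack
    value = value + ' Changed'         # a new object, so a fresh id, every step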

I don't see how we can detect this without introducing a mini-stack, which would really harm performance.

I think we just put a note in the docs saying "if you do really dumb stuff, you can get the validator to recurse infinitely".

add `loc` and `file_position` to `PydanticValueError`

More changes after #185.

Also, pydantic/pydantic#4254.

We should add loc to PydanticValueError, which gets appended to the error loc.

Also add file_position (tuple[int, int] of (line, col)); one day when we have a custom JSON parser we can populate this in pydantic-core, until then we just add it via PydanticValueError.

file_position will require some pretty output in error messages.
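
A hypothetical sketch of what a resulting line error might carry (field names taken from this issue, not a settled API):

line_error = {
    'type': 'custom_error',
    'loc': ('parent_field',) + ('inner', 0),  # validator loc + the error's own loc appended
    'msg': 'something specific went wrong',
    'file_position': (3, 17),  # (line, col) in the source JSON, once a custom parser exists
}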

PyPy support

@tiangolo asked about PyPy support. @messense do you have input?

Looks like pyo3 does support PyPy, see here and PyO3/rust-numpy#219.

But I know pydantic-core uses some non-abi3 methods; maybe we use stuff that would cause problems with PyPy. We should probably try it sooner rather than later.

simplify recursive refs

I'm not sure if this would be possible (I'm guessing it's not), but it would be nice to be able to recursively refer to a parent validator without having to know a priori whether it will be recursive or not.

Currently, if you are parsing something like:

class Outer:
    inner: List[Outer]

You would have to know that Outer is recursive before you parse its fields.

Would it be possible to have an optional "id" property on every validator that acts as the reference for recursive schemas instead of a special "recursive-container" validator? So:

from typing import List

from pydantic_core import SchemaValidator


class Outer:
    inner: "List[Outer]"


v = SchemaValidator(
    {
        "type": "model-class",
        # id is optional, required for this to be usable as a recursive ref
        # the value is arbitrary, the id of the type seems like a safe choice
        "id": str(id(Outer)),
        "schema": {
            "type": "model",
            "fields": {
                "inner": {
                    "type": "list",
                    "items": {
                        "type": "recursive-ref",
                        "id": str(id(Outer)),
                    }
                },
            },
        },
    }
)

Cleanup and test `Config`

Currently I'm not clear where config is used, and what attributes are respected.

E.g. there are some properties that are used in string.rs that are not in the python types.

It also has minimal tests; we need to test it properly - perhaps in a separate test file to keep things clean.

What's going on here?

I realise I invited collaborators from pydantic to this repo without an explanation of what's going on.

The plan is obviously to make this repo public, but I want to get the basic design solid before flipping the switch (maybe that's unnecessary, I don't know).

I'd love feedback on this idea in general, and specifics if possible.

I'm not "announcing" this yet, but feel free to discuss it with others if that helps.

The idea is not particularly secret after this, but I'm hoping to build some suspense/FOMO before going public.

TODO before v0.1 release

This is not going to be the release which pydantic V2 is released with, but we should get a proper release out to give a target for other work.

What needs fixing before we do that?

cache `PyString` for short strings

As per PyO3/pyo3#2463, we could cache the PyString value for short (length <63 say) strings to achieve a significant time saving.

Although it would save time on the specific case of building dicts with repeated inputs, I wonder how much time it would really save in the real world?

Definitely this is an optimisation that should be looked at after pydantic v2 is released.

If we do do it, I guess it should be configurable.

cache.rs in orjson might be a useful starting point for this.
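
The idea in Python terms (a sketch of the caching strategy, not the pyo3 implementation):

_string_cache = {}

def cached_str(raw: bytes) -> str:
    # memoise the conversion for short strings only, as proposed above
    if len(raw) < 64:
        s = _string_cache.get(raw)
        if s is None:
            s = _string_cache[raw] = raw.decode()
        return s
    return raw.decode()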

Panic on Ctrl+C

Just got this error when hitting ctrl+c while running SchemaValidator - e.g. creating a validator.

^Cthread '<unnamed>' panicked at 'a Display implementation returned an error unexpectedly: Error', /rustc/90ca44752a79dd414d9a0ccf7a74533a99080988/library/alloc/src/string.rs:2478:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/samuel/code/pydantic-core/create_many.py", line 8, in <module>
    v = SchemaValidator(
pyo3_runtime.PanicException: a Display implementation returned an error unexpectedly: Error

Validation Context

See pydantic/pydantic@c8ba8f1 for an explanation of usage.

I implemented some of the plumbing for this early on, then forgot the porcelain. Needs completing and testing.

I guess we should also make sure the other kwargs to validation functions make sense at the same time.

Questions regarding performance

In your presentation you talk about achieving a 12x performance improvement for validating a list of dicts 100 elements long - does this test consistently achieve the same numbers for bigger lists?

The other question I have is about the decision to choose Rust as the language for the core of pydantic - what were the criteria for choosing Rust over other languages like C, C++, or even .NET?

Yaml native support?

I read here that pydantic v2 will have native JSON support....

however a lot of devs are using .yaml for configs (it is much more readable for humans).

I suspect I will be able to load it as a Python object, but then strict mode can't be used, and the data goes from native code (probably C) to Python and back to native code in Rust. That is at least kind of stupid.

I also suspect other people may want to add validation for other serialization formats (like BSON, protobuf, or whatever they need). Some of those people would like to do it in Rust (ideally (runtime) pluggable, since not everyone will want everything - I am not sure that is easily achievable, but it is possible).

That could make pydantic-core serialization-agnostic while keeping the same performance; validation would not care how the data got to it (and I understand it has to have a structure and types at least somewhat resembling JSON).

The applicability would be huge, because a good developer should always write some validation, and a lot of the time you are doing it all by hand in some validate method - checking ranges of numbers, writing regex matchers for strings, converting int/str to date objects... you get the idea. With pydantic it would be much easier to do it exhaustively.

Those are just ideas and I wanted to share them publicly. Maybe the effort to split serialization out from validation would be too big.

But that is for @samuelcolvin to decide.

The main thing is not to rush this... if it makes it in at some point (I am not saying it must be 2.0.0) I would be happy.
