jpmorganchase / py-avro-schema

Home Page: https://py-avro-schema.readthedocs.io/

License: Apache License 2.0

Topics: avro, dataclasses, python, schema, data types, deserialization, jpmorganchase, kafka, messaging

py-avro-schema's Introduction

py-avro-schema

Generate Apache Avro schemas for Python types including standard library data-classes and Pydantic data models.

📘 Documentation: https://py-avro-schema.readthedocs.io

Installing

python -m pip install py-avro-schema

Developing

To set up a scratch/development virtual environment (under .venv/), first install Tox. Then run:

tox -e dev

The py-avro-schema package is installed in editable mode inside the .venv/ environment.

Run tests by simply calling tox.

Install code quality Git hooks using pre-commit install --install-hooks.

Terms & Conditions

Copyright 2022 J.P. Morgan Chase & Co.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Contributing

See CONTRIBUTING.md

py-avro-schema's People

Contributors

cgtobi, chouvic, dada-engineer, faph, jcameron73, msinto93, t3rrym


py-avro-schema's Issues

Should we support un-annotated decimal types?

Currently, we require decimal.Decimal types to be annotated with a py_avro_schema.DecimalMeta object defining "precision" and optionally "scale".

Should we support plain decimal.Decimal types and default the precision parameter to be something sensible?

If so, is there any precedent for a default precision value?

Since we are generating Avro bytes schemas for decimals, not fixed schemas, does it actually matter if we default precision to something huge? The size of the serialized number would depend only on the digits actually used, not on the schema's maximum precision.

The reason for the above question is to align decimals with how we treat, say, dates and times, where we default to the maximum precision Avro supports: nanoseconds.
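To make the size point concrete: Avro's bytes-backed decimal encodes the unscaled integer as big-endian two's complement, so the byte length tracks the digits actually used, not the schema's precision. A stdlib sketch (the helper name is mine):

```python
def decimal_byte_length(unscaled: int) -> int:
    """Bytes needed for a non-negative unscaled decimal value encoded as
    big-endian two's complement (as Avro's bytes-backed decimal does)."""
    return unscaled.bit_length() // 8 + 1

# Decimal("1.50") with scale 2 has unscaled value 150:
print(decimal_byte_length(150))     # 2 bytes
# A huge schema precision (e.g. 38) changes nothing; only digits matter:
print(decimal_byte_length(10**37))  # 16 bytes
```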

Date is converted to INT and not STRING

Can I configure somehow the handling of datetime.date? Here is my use case:

I have a field in my pydantic model called start_date:

import datetime
from aifora.da.schema.customer.models._base import BaseModel
class MyModel(BaseModel):
   start_date: datetime.date 

When I let pydantic generate a JSON-Schema (MyModel.schema_json), this field is represented as string:

    "start_date": {
      "title": "Start Date",
      "type": "string",
      "format": "date"
    },

However, when I apply py-avro-schema, this field is represented as an int:

    {
      "name": "start_date",
      "type": {
        "type": "int",
        "logicalType": "date"
      },
    },

Is there a setting so that py-avro-schema converts dates to strings?
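For context, Avro's date logical type stores an int counting days since the Unix epoch; a stdlib sketch of the conversion in both directions (helper names are mine):

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)

def date_to_avro_int(d: datetime.date) -> int:
    """Encode a date as days since the Unix epoch (Avro 'date')."""
    return (d - EPOCH).days

def avro_int_to_date(days: int) -> datetime.date:
    """Decode an Avro 'date' int back into a Python date."""
    return EPOCH + datetime.timedelta(days=days)

print(date_to_avro_int(datetime.date(2022, 1, 13)))  # 19005
print(avro_int_to_date(0))                           # 1970-01-01
```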

Regression in 3.0.0 for Pydantic models defined using inheritance

We've encountered what seems like a regression in the behavior of the latest version of py-avro-schema when Pydantic models are defined via inheritance (e.g. to DRY up definitions where models have common fields). A minimal test case which passes in 2.2.1 but fails in 3.0.0:

def test_model_inheritance():
    class PyTypeCustomBase(pydantic.BaseModel):
        field_a: str

    class PyType(PyTypeCustomBase):
        field_b: str

    expected = {
        "type": "record",
        "name": "PyType",
        "fields": [
            {
                "name": "field_a",
                "type": "string"
            },
            {
                "name": "field_b",
                "type": "string"
            },
        ],
    }
    assert_schema(PyType, expected)

A potential fix may be as simple as the following 🤷‍♂️:

--- a/src/py_avro_schema/_schemas.py
+++ b/src/py_avro_schema/_schemas.py
@@ -835,7 +835,7 @@ class PydanticSchema(RecordSchema):
         # Pydantic 2 resolves forward references for us. To avoid infinite recursion, we check if the unresolved raw
         # annoation is a forward reference. If so, we use that instead of Pydantic's resolved type hint. There might be
         # a better way to un-resolve the forward reference...
-        if isinstance(self.raw_annotations[name], (str, ForwardRef)):
+        if isinstance(self.raw_annotations.get(name), (str, ForwardRef)):
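The underlying gotcha can be reproduced with plain classes: a class's own `__annotations__` does not include annotations inherited from base classes, so indexing `raw_annotations[name]` for an inherited field raises `KeyError`, while `.get(name)` returns `None`. An illustrative stdlib check (not the library's code):

```python
class PyTypeCustomBase:
    field_a: str

class PyType(PyTypeCustomBase):
    field_b: str

# Only the subclass's own annotations appear here:
own = PyType.__annotations__
print("field_a" in own)  # False -- direct indexing would raise KeyError
print("field_b" in own)  # True
```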

list class not allowed in pydantic nested schema default

I have a pydantic.BaseModel class with a child pydantic.BaseModel attribute that itself has a list attribute.

This is not serializable right now:

import py_avro_schema as pas
import pydantic


class Bar(pydantic.BaseModel):
    baz: list[str] = pydantic.Field(default_factory=list)


class Foo(pydantic.BaseModel):
    bar: Bar = pydantic.Field(default_factory=Bar)


print(pas.generate(Foo))

Traceback:


Traceback (most recent call last):
  File "/Users/dada_engineer/workspace/private/py-avro-schema/example.py", line 14, in <module>
    print(pas.generate(Foo))
  File "/Users/dada_engineer/workspace/private/py-avro-schema/.venv/lib/python3.9/site-packages/memoization/caching/plain_cache.py", line 42, in wrapper
    result = user_function(*args, **kwargs)
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/__init__.py", line 64, in generate
    schema_dict = schema(py_type, namespace=namespace, options=options)
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 139, in schema
    schema_data = schema_obj.data(names=names)
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 695, in data
    return self.data_before_deduplication(names=names)
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 766, in data_before_deduplication
    "fields": [field.data(names=names) for field in self.record_fields],
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 766, in <listcomp>
    "fields": [field.data(names=names) for field in self.record_fields],
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 826, in data
    field_data["default"] = self.schema.make_default(self.default)
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 908, in make_default
    return {key: _schema_obj(value.__class__).make_default(value) for key, value in py_default}
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 908, in <dictcomp>
    return {key: _schema_obj(value.__class__).make_default(value) for key, value in py_default}
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 162, in _schema_obj
    raise TypeNotSupportedError(f"Cannot generate Avro schema for Python type {py_type}")
py_avro_schema._schemas.TypeNotSupportedError: Cannot generate Avro schema for Python type <class 'list'>

Relates to handling of pydantic defaults in #64.

PositiveFloat is not supported

I use the PositiveFloat field type in my pydantic model:

from pydantic import BaseModel, PositiveFloat

class ModelExample(BaseModel):
    price: PositiveFloat

However, this does not seem to be supported by py-avro-schema yet:

TypeError: Cannot generate Avro schema for Python type <class 'pydantic.types.PositiveFloat'>

Avro Logical Type "DATE" is not compatible with Pydantic Date definition

I am currently facing the issue that data compliant with the Avro schema cannot be handled by Pydantic. E.g., I have a data model:

import datetime
from aifora.da.schema.customer.models._base import BaseModel
class MyModel(BaseModel):
   start_date: datetime.date 

which results in the Avro schema (using py-avro-schema):

    {
      "name": "start_date",
      "type": {
        "type": "int",
        "logicalType": "date"
      },
    },

According to the Avro specification, a date is the number of days after 1970-01-01. However, Pydantic does not accept this (it always leads to datetime.date(1970, 1, 1)); rather, seconds or milliseconds (as in a time/timestamp logical type) are expected. See Pydantic's date types documentation.

Request: use fastavro

Is there a way to support fastavro? Can avro be replaced by it, or do both libraries need to be supported? Why is avro used and not fastavro?

Use Pydantic's `model_config` to allow record name override

I'm migrating over to using Pydantic for defining models and this library for generating Avro schemas. It's awesome, but one challenge I'm running into is being able to override record names to maintain backwards compatibility.

For example, if I have one data asset partitioned across multiple Avro files in my data lake, BigQuery cannot successfully piece them together unless each file has consistent record names.

The workaround I've found is to subclass my Pydantic model with the legacy naming convention, but I feel like it would be much better if this library considered Pydantic's model_config. That way, I could keep everything in one, nicely named class, like so:

class StudentModel(BaseModel):
    model_config = ConfigDict(title="student_record")

    StudentID: str | None = None
    StudentSchoolID: str | None = None
    SecondaryStudentID: str | None = None
    StudentFirstName: str | None = None
    StudentMiddleName: str | None = None
    StudentLastName: str | None = None

Pydantic's BaseModel.model_json_schema() method behaves this way, but the way pas.generate() is set up, it just uses the literal class name.
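A sketch of the requested lookup (the `record_name` helper is hypothetical, not the library's API): prefer a `title` set in `model_config`, falling back to the class name. Pydantic v2's `model_config` is a plain dict, so stand-in classes suffice for illustration:

```python
def record_name(model_cls) -> str:
    """Return model_config['title'] if set, else the class name.
    Hypothetical helper illustrating the requested behaviour."""
    config = getattr(model_cls, "model_config", None) or {}
    return config.get("title") or model_cls.__name__

class StudentModel:
    # stand-in for: model_config = ConfigDict(title="student_record")
    model_config = {"title": "student_record"}

class Plain:
    pass

print(record_name(StudentModel))  # student_record
print(record_name(Plain))         # Plain
```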

union schema does not recursively create defaults

When I have a union in a model with a default value, this value might not be serialisable, e.g. when using a pydantic model as the default.

example.py

import py_avro_schema as pas
from pydantic import BaseModel, Field

from typing import Union, List
from uuid import UUID


class X(BaseModel):
    ids: List[int] = Field(default_factory=list)


class Y(BaseModel):
    ids: List[float] = Field(default_factory=list)


class Bar(BaseModel):
    baz: Union[int, List[int]] = Field(default_factory=list)
    baz2: List[Union[str, UUID]] = Field(default_factory=list)
    baz3: Union[X, Y] = Field(default_factory=X)


class Foo(BaseModel):
    bar: Bar = Field(default_factory=Bar)


print(pas.generate(Foo))

Error:

Traceback (most recent call last):
  File "/Users/user/workspace/private/py-avro-schema/example.py", line 26, in <module>
    print(pas.generate(Foo))
  File "/Users/user/workspace/private/py-avro-schema/.venv/lib/python3.9/site-packages/memoization/caching/plain_cache.py", line 42, in wrapper
    result = user_function(*args, **kwargs)
  File "/Users/user/workspace/private/py-avro-schema/src/py_avro_schema/__init__.py", line 69, in generate
    schema_json = orjson.dumps(schema_dict, option=json_options)
TypeError: Type is not JSON serializable: X

This is fixable by defining a make_default that calls the make_default of self.schema_items[0], because the items are sorted when there is a default so that the default's schema occupies the first list position.

pydantic classes as default values not serializable

Hi everyone,

I have a pydantic model with an attribute of another pydantic model type that should be generated by default. This fails because the schema dict produced by py_avro_schema sets the pydantic model instance as the default, which is not JSON-serializable via orjson by default.

example.py

import py_avro_schema as pas
from pydantic import BaseModel, Field


class Bar(BaseModel):
    baz: int = 0


class Foo(BaseModel):
    bar: Bar = Field(default_factory=Bar)


pas.generate(Foo)

This raises the following error:

Traceback (most recent call last):
  File "/Users/gellertd/workspace/procureai/foundation/constellation/example.py", line 13, in <module>
    pas.generate(Foo)
  File "/Users/gellertd/.pyenv/versions/constellation/lib/python3.11/site-packages/memoization/caching/plain_cache.py", line 42, in wrapper
    result = user_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gellertd/.pyenv/versions/constellation/lib/python3.11/site-packages/py_avro_schema/__init__.py", line 69, in generate
    schema_json = orjson.dumps(schema_dict, option=json_options)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Type is not JSON serializable: Bar

Suggestion

On a PydanticSchema, for each record field, check whether the py_type is a pydantic BaseModel subclass and, if so, call default.model_dump(mode="json") on it.

This would then produce this schema:

{
  "type": "record",
  "name": "Foo",
  "fields": [
    {
      "name": "bar",
      "type": {
        "type": "record",
        "name": "Bar",
        "fields": [{ "name": "baz", "type": "long", "default": 0 }],
        "namespace": "__main__",
        "doc": "Usage docs: https://docs.pydantic.dev/2.6/concepts/models/"
      },
      "default": { "baz": 0 }
    }
  ],
  "namespace": "__main__",
  "doc": "Usage docs: https://docs.pydantic.dev/2.6/concepts/models/"
}

which looks alright.

The downside is that pydantic would be needed at runtime, so the import must either be guarded with a try/except or done at class/method level.

Let me know what you think about this please.
Thanks a lot.
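A stdlib analogue of the suggested conversion, using dataclasses.asdict in place of Pydantic's model_dump(mode="json") purely for illustration:

```python
import dataclasses
import json

@dataclasses.dataclass
class Bar:
    baz: int = 0

bar = Bar()
# The Bar instance itself is not JSON-serializable, but its plain-dict
# form is -- the same idea as calling model_dump(mode="json"):
print(json.dumps(dataclasses.asdict(bar)))  # {"baz": 0}
```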

[QUESTION]: DecimalType assignment type linter error

Hi there,

In VS Code, the following decimal type assignment (according to the docs) creates a typing issue. Is this something that can be resolved in this project? 🤔

from decimal import Decimal
import py_avro_schema as pas

foo: pas.DecimalType[512, 255] = Decimal(10)

Pylance error:

Expression of type "Decimal" cannot be assigned to declared type "DecimalType"
  "Decimal" is incompatible with "DecimalType"

Python Version: 3.11.4
py-avro-schema Version: 3.2.0

how to serialize timedelta fields

The schema for timedelta is "fixed" with length 12. What is the actual serialization?

Is it a string containing the number of seconds, milliseconds, or microseconds? What is the format of that string? And why use a string (inefficient) instead of an int/long?
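For reference, a 12-byte fixed matches Avro's duration logical type, which is not a string: it packs three little-endian unsigned 32-bit integers for months, days, and milliseconds. A sketch assuming that encoding (the helper name is mine):

```python
import datetime
import struct

def encode_duration(td: datetime.timedelta) -> bytes:
    """Pack a timedelta as an Avro duration: three little-endian
    unsigned 32-bit ints (months, days, milliseconds)."""
    millis = td.seconds * 1000 + td.microseconds // 1000
    return struct.pack("<III", 0, td.days, millis)  # months: always 0 here

encoded = encode_duration(datetime.timedelta(days=2, seconds=3))
print(len(encoded))                    # 12
print(struct.unpack("<III", encoded))  # (0, 2, 3000)
```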
