jpmorganchase / py-avro-schema

Home Page: https://py-avro-schema.readthedocs.io/

License: Apache License 2.0

Topics: avro, dataclasses, python, schema, data types, deserialization, jpmorganchase, kafka, messaging

py-avro-schema's Introduction

py-avro-schema

Generate Apache Avro schemas for Python types including standard library data-classes and Pydantic data models.

📘 Documentation: https://py-avro-schema.readthedocs.io

Installing

python -m pip install py-avro-schema

Developing

To set up a scratch/development virtual environment (under .venv/), first install Tox. Then run:

tox -e dev

The py-avro-schema package is installed in editable mode inside the .venv/ environment.

Run tests by simply calling tox.

Install code quality Git hooks using pre-commit install --install-hooks.

Terms & Conditions

Copyright 2022 J.P. Morgan Chase & Co.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Contributing

See CONTRIBUTING.md

py-avro-schema's People

Contributors

cgtobi, chouvic, dada-engineer, faph, jcameron73, msinto93, t3rrym


py-avro-schema's Issues

Should we support un-annotated decimal types?

Currently, we require decimal.Decimal types to be annotated with a py_avro_schema.DecimalMeta object defining "precision" and optionally "scale".

Should we support plain decimal.Decimal types and default the precision parameter to be something sensible?

If so, is there any precedent for a default precision value?

Since we are generating Avro bytes schemas for decimals, not fixed schemas, does it actually matter if we default precision to something huge? The size of the serialized number would depend only on the digits actually used, not on the schema's maximum precision.

The reason for the above question is to align decimals with how we treat, say, dates and times, where we default to the maximum precision Avro supports: nanoseconds.
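To make the size point concrete: Avro's bytes-backed decimal encodes the unscaled integer as big-endian two's complement, so the byte length tracks the digits actually used, not the schema's precision. A stdlib sketch (the helper name is mine):

```python
def decimal_byte_length(unscaled: int) -> int:
    """Bytes needed for a non-negative unscaled decimal value encoded as
    big-endian two's complement (as Avro's bytes-backed decimal does)."""
    return unscaled.bit_length() // 8 + 1

# Decimal("1.50") with scale 2 has unscaled value 150:
print(decimal_byte_length(150))     # 2 bytes
# A huge schema precision (e.g. 38) changes nothing; only digits matter:
print(decimal_byte_length(10**37))  # 16 bytes
```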

Date is converted to INT and not STRING

Can I configure somehow the handling of datetime.date? Here is my use case:

I have a field in my pydantic model called start_date:

import datetime
from aifora.da.schema.customer.models._base import BaseModel
class MyModel(BaseModel):
   start_date: datetime.date 

When I let pydantic generate a JSON-Schema (MyModel.schema_json), this field is represented as string:

    "start_date": {
      "title": "Start Date",
      "type": "string",
      "format": "date"
    },

However, when I apply py-avro-schema, this field is represented as an int:

    {
      "name": "start_date",
      "type": {
        "type": "int",
        "logicalType": "date"
      },
    },

Is there a setting so that py-avro-schema converts dates to strings?
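For context, Avro's date logical type stores an int counting days since the Unix epoch; a stdlib sketch of the conversion in both directions (helper names are mine):

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)

def date_to_avro_int(d: datetime.date) -> int:
    """Encode a date as days since the Unix epoch (Avro 'date')."""
    return (d - EPOCH).days

def avro_int_to_date(days: int) -> datetime.date:
    """Decode an Avro 'date' int back into a Python date."""
    return EPOCH + datetime.timedelta(days=days)

print(date_to_avro_int(datetime.date(2022, 1, 13)))  # 19005
print(avro_int_to_date(0))                           # 1970-01-01
```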

Regression in 3.0.0 for Pydantic models defined using inheritance

We've encountered what seems like a regression in the behavior of the latest version of py-avro-schema when Pydantic models are defined via inheritance (e.g. to DRY up definitions where models have common fields). A minimal test case which passes in 2.2.1 but fails in 3.0.0:

def test_model_inheritance():
    class PyTypeCustomBase(pydantic.BaseModel):
        field_a: str

    class PyType(PyTypeCustomBase):
        field_b: str

    expected = {
        "type": "record",
        "name": "PyType",
        "fields": [
            {
                "name": "field_a",
                "type": "string"
            },
            {
                "name": "field_b",
                "type": "string"
            },
        ],
    }
    assert_schema(PyType, expected)

A potential fix may be as simple as the following 🤷‍♂️:

--- a/src/py_avro_schema/_schemas.py
+++ b/src/py_avro_schema/_schemas.py
@@ -835,7 +835,7 @@ class PydanticSchema(RecordSchema):
         # Pydantic 2 resolves forward references for us. To avoid infinite recursion, we check if the unresolved raw
         # annoation is a forward reference. If so, we use that instead of Pydantic's resolved type hint. There might be
         # a better way to un-resolve the forward reference...
-        if isinstance(self.raw_annotations[name], (str, ForwardRef)):
+        if isinstance(self.raw_annotations.get(name), (str, ForwardRef)):
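The underlying gotcha can be reproduced with plain classes: a class's own `__annotations__` does not include annotations inherited from base classes, so indexing `raw_annotations[name]` for an inherited field raises `KeyError`, while `.get(name)` returns `None`. An illustrative stdlib check (not the library's code):

```python
class PyTypeCustomBase:
    field_a: str

class PyType(PyTypeCustomBase):
    field_b: str

# Only the subclass's own annotations appear here:
own = PyType.__annotations__
print("field_a" in own)  # False -- direct indexing would raise KeyError
print("field_b" in own)  # True
```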

list class not allowed in pydantic nested schema default

I have a pydantic.BaseModel class with a child pydantic.BaseModel attribute that itself has a list attribute.

This is not serializable right now:

import py_avro_schema as pas
import pydantic


class Bar(pydantic.BaseModel):
    baz: list[str] = pydantic.Field(default_factory=list)


class Foo(pydantic.BaseModel):
    bar: Bar = pydantic.Field(default_factory=Bar)


print(pas.generate(Foo))

Traceback:


Traceback (most recent call last):
  File "/Users/dada_engineer/workspace/private/py-avro-schema/example.py", line 14, in <module>
    print(pas.generate(Foo))
  File "/Users/dada_engineer/workspace/private/py-avro-schema/.venv/lib/python3.9/site-packages/memoization/caching/plain_cache.py", line 42, in wrapper
    result = user_function(*args, **kwargs)
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/__init__.py", line 64, in generate
    schema_dict = schema(py_type, namespace=namespace, options=options)
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 139, in schema
    schema_data = schema_obj.data(names=names)
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 695, in data
    return self.data_before_deduplication(names=names)
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 766, in data_before_deduplication
    "fields": [field.data(names=names) for field in self.record_fields],
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 766, in <listcomp>
    "fields": [field.data(names=names) for field in self.record_fields],
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 826, in data
    field_data["default"] = self.schema.make_default(self.default)
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 908, in make_default
    return {key: _schema_obj(value.__class__).make_default(value) for key, value in py_default}
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 908, in <dictcomp>
    return {key: _schema_obj(value.__class__).make_default(value) for key, value in py_default}
  File "/Users/dada_engineer/workspace/private/py-avro-schema/src/py_avro_schema/_schemas.py", line 162, in _schema_obj
    raise TypeNotSupportedError(f"Cannot generate Avro schema for Python type {py_type}")
py_avro_schema._schemas.TypeNotSupportedError: Cannot generate Avro schema for Python type <class 'list'>

Relates to handling of pydantic defaults in #64.

PositiveFloat is not supported

I use the PositiveFloat field type in my pydantic model:

from pydantic import BaseModel, PositiveFloat

class ModelExample(BaseModel):
    price: PositiveFloat

However, this does not seem to be supported by py-avro-schema yet:

TypeError: Cannot generate Avro schema for Python type <class 'pydantic.types.PositiveFloat'>

Avro Logical Type "DATE" is not compatible with Pydantic Date definition

I am currently facing the issue that data compliant with the Avro schema cannot be handled by Pydantic. E.g., I have a data model:

import datetime
from aifora.da.schema.customer.models._base import BaseModel
class MyModel(BaseModel):
   start_date: datetime.date 

which results in the Avro schema (using py-avro-schema):

    {
      "name": "start_date",
      "type": {
        "type": "int",
        "logicalType": "date"
      },
    },

According to the Avro specification, a date is the number of days after 1970-01-01. However, Pydantic does not accept this (it always leads to datetime.date(1970, 1, 1)); rather, seconds or milliseconds (as in a time/timestamp logical type) are expected. See Pydantic's date types documentation.

Request: use fastavro

Is there a way to support fastavro? Can avro be replaced by it, or do both libraries need to be supported? Why is avro used and not fastavro?

Use Pydantic's `model_config` to allow record name override

I'm migrating over to using Pydantic for defining models and this library for generating Avro schemas. It's awesome, but one challenge I'm running into is being able to override record names to maintain backwards compatibility.

For example, if I have one data asset partitioned across multiple Avro files in my data lake, BigQuery cannot successfully piece them together unless each file has consistent record names.

The workaround I've found is to subclass my Pydantic model with the legacy naming convention, but I feel like it would be much better if this library considered Pydantic's model_config. That way, I could keep everything in one, nicely named class, like so:

class StudentModel(BaseModel):
    model_config = ConfigDict(title="student_record")

    StudentID: str | None = None
    StudentSchoolID: str | None = None
    SecondaryStudentID: str | None = None
    StudentFirstName: str | None = None
    StudentMiddleName: str | None = None
    StudentLastName: str | None = None

Pydantic's BaseModel.model_json_schema() method behaves this way, but the way pas.generate() is set up, it just uses the literal class name.
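A sketch of the requested lookup (the `record_name` helper is hypothetical, not the library's API): prefer a `title` set in `model_config`, falling back to the class name. Pydantic v2's `model_config` is a plain dict, so stand-in classes suffice for illustration:

```python
def record_name(model_cls) -> str:
    """Return model_config['title'] if set, else the class name.
    Hypothetical helper illustrating the requested behaviour."""
    config = getattr(model_cls, "model_config", None) or {}
    return config.get("title") or model_cls.__name__

class StudentModel:
    # stand-in for: model_config = ConfigDict(title="student_record")
    model_config = {"title": "student_record"}

class Plain:
    pass

print(record_name(StudentModel))  # student_record
print(record_name(Plain))         # Plain
```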

union schema does not recursively create defaults

When I have a union in a model with a default value, this value might not be serialisable, e.g. when using a pydantic model as the default.

example.py

import py_avro_schema as pas
from pydantic import BaseModel, Field

from typing import Union, List
from uuid import UUID


class X(BaseModel):
    ids: List[int] = Field(default_factory=list)


class Y(BaseModel):
    ids: List[float] = Field(default_factory=list)


class Bar(BaseModel):
    baz: Union[int, List[int]] = Field(default_factory=list)
    baz2: List[Union[str, UUID]] = Field(default_factory=list)
    baz3: Union[X, Y] = Field(default_factory=X)


class Foo(BaseModel):
    bar: Bar = Field(default_factory=Bar)


print(pas.generate(Foo))

Error:

Traceback (most recent call last):
  File "/Users/user/workspace/private/py-avro-schema/example.py", line 26, in <module>
    print(pas.generate(Foo))
  File "/Users/user/workspace/private/py-avro-schema/.venv/lib/python3.9/site-packages/memoization/caching/plain_cache.py", line 42, in wrapper
    result = user_function(*args, **kwargs)
  File "/Users/user/workspace/private/py-avro-schema/src/py_avro_schema/__init__.py", line 69, in generate
    schema_json = orjson.dumps(schema_dict, option=json_options)
TypeError: Type is not JSON serializable: X

This is fixable by defining a make_default that calls the make_default of self.schema_items[0], because the items are sorted when there is a default so that the default's schema occupies the first list position.

pydantic classes as default values not serializable

Hi everyone,

I have a pydantic model with an attribute of another pydantic model type that should be generated by default. This fails because the schema dict produced by py_avro_schema sets the pydantic model instance as the default, which is not JSON-serializable via orjson by default.

example.py

import py_avro_schema as pas
from pydantic import BaseModel, Field


class Bar(BaseModel):
    baz: int = 0


class Foo(BaseModel):
    bar: Bar = Field(default_factory=Bar)


pas.generate(Foo)

This raises the following error:

Traceback (most recent call last):
  File "/Users/gellertd/workspace/procureai/foundation/constellation/example.py", line 13, in <module>
    pas.generate(Foo)
  File "/Users/gellertd/.pyenv/versions/constellation/lib/python3.11/site-packages/memoization/caching/plain_cache.py", line 42, in wrapper
    result = user_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gellertd/.pyenv/versions/constellation/lib/python3.11/site-packages/py_avro_schema/__init__.py", line 69, in generate
    schema_json = orjson.dumps(schema_dict, option=json_options)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Type is not JSON serializable: Bar

Suggestion

On a PydanticSchema, for each record field, check whether the py_type is a pydantic BaseModel subclass and, if so, call default.model_dump(mode="json") on it.

This would then produce this schema:

{
  "type": "record",
  "name": "Foo",
  "fields": [
    {
      "name": "bar",
      "type": {
        "type": "record",
        "name": "Bar",
        "fields": [{ "name": "baz", "type": "long", "default": 0 }],
        "namespace": "__main__",
        "doc": "Usage docs: https://docs.pydantic.dev/2.6/concepts/models/"
      },
      "default": { "baz": 0 }
    }
  ],
  "namespace": "__main__",
  "doc": "Usage docs: https://docs.pydantic.dev/2.6/concepts/models/"
}

which looks alright.

The downside is that pydantic would be needed at runtime, so the import must either be guarded with a try/except or done at class/method level.

Let me know what you think about this please.
Thanks a lot.
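A stdlib analogue of the suggested conversion, using dataclasses.asdict in place of Pydantic's model_dump(mode="json") purely for illustration:

```python
import dataclasses
import json

@dataclasses.dataclass
class Bar:
    baz: int = 0

bar = Bar()
# The Bar instance itself is not JSON-serializable, but its plain-dict
# form is -- the same idea as calling model_dump(mode="json"):
print(json.dumps(dataclasses.asdict(bar)))  # {"baz": 0}
```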

[QUESTION]: DecimalType assignment type linter error

Hi there,

In VS Code, the following decimal type assignment (according to the docs) creates a typing issue. Is this something that can be resolved in this project? 🤔

from decimal import Decimal
import py_avro_schema as pas

foo: pas.DecimalType[512, 255] = Decimal(10)

Pylance error:

Expression of type "Decimal" cannot be assigned to declared type "DecimalType"
  "Decimal" is incompatible with "DecimalType"

Python Version: 3.11.4
py-avro-schema Version: 3.2.0

how to serialize timedelta fields

The schema for timedelta is "fixed" with length 12. What is the actual serialization?

Is it a string containing the number of seconds, milliseconds, or microseconds? What is the format of that string? And why use a string (inefficient) instead of an int/long?
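For reference, a 12-byte fixed matches Avro's duration logical type, which is not a string: it packs three little-endian unsigned 32-bit integers for months, days, and milliseconds. A sketch assuming that encoding (the helper name is mine):

```python
import datetime
import struct

def encode_duration(td: datetime.timedelta) -> bytes:
    """Pack a timedelta as an Avro duration: three little-endian
    unsigned 32-bit ints (months, days, milliseconds)."""
    millis = td.seconds * 1000 + td.microseconds // 1000
    return struct.pack("<III", 0, td.days, millis)  # months: always 0 here

encoded = encode_duration(datetime.timedelta(days=2, seconds=3))
print(len(encoded))                    # 12
print(struct.unpack("<III", encoded))  # (0, 2, 3000)
```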
