
pydantic classes as default values not serializable #64

Closed
dada-engineer opened this issue Mar 8, 2024 · 18 comments · Fixed by #65

Comments

@dada-engineer
Contributor

Hi everyone,

I have a pydantic model with an attribute of another pydantic model type that should be generated by default. This fails because the schema dict produced by py_avro_schema sets the pydantic object as the field default, which orjson cannot serialize to JSON out of the box.

example.py

import py_avro_schema as pas
from pydantic import BaseModel, Field


class Bar(BaseModel):
    baz: int = 0


class Foo(BaseModel):
    bar: Bar = Field(default_factory=Bar)


pas.generate(Foo)

This raises the following error:

Traceback (most recent call last):
  File "/Users/gellertd/workspace/procureai/foundation/constellation/example.py", line 13, in <module>
    pas.generate(Foo)
  File "/Users/gellertd/.pyenv/versions/constellation/lib/python3.11/site-packages/memoization/caching/plain_cache.py", line 42, in wrapper
    result = user_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gellertd/.pyenv/versions/constellation/lib/python3.11/site-packages/py_avro_schema/__init__.py", line 69, in generate
    schema_json = orjson.dumps(schema_dict, option=json_options)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Type is not JSON serializable: Bar

Suggestion

In PydanticSchema, for each record field, check whether the py_type is a pydantic BaseModel subclass and, if so, call default.model_dump(mode="json") on the default value.

This would then produce this schema:

{
  "type": "record",
  "name": "Foo",
  "fields": [
    {
      "name": "bar",
      "type": {
        "type": "record",
        "name": "Bar",
        "fields": [{ "name": "baz", "type": "long", "default": 0 }],
        "namespace": "__main__",
        "doc": "Usage docs: https://docs.pydantic.dev/2.6/concepts/models/"
      },
      "default": { "baz": 0 }
    }
  ],
  "namespace": "__main__",
  "doc": "Usage docs: https://docs.pydantic.dev/2.6/concepts/models/"
}

which looks alright.
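
For illustration, the check could look something like this (just a sketch; the function and parameter names are illustrative, not py-avro-schema's actual internals):

import pydantic


def avro_field_default(py_type, default):
    """Return a JSON-serializable Avro default for a record field."""
    # `default` is assumed to be an instance of `py_type`
    if isinstance(py_type, type) and issubclass(py_type, pydantic.BaseModel):
        # Dump the pydantic object to plain dicts/strings/numbers
        return default.model_dump(mode="json")
    return default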

The downside is that pydantic would be needed at runtime, so the import must either be guarded (catching ImportError) or deferred to class/method level.

Let me know what you think about this please.
Thanks a lot.

@dada-engineer
Contributor Author

Found a solution that avoids importing pydantic at runtime, as long as you are fine with # type: ignore, which I figured is okay since there are already a few in the codebase.
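
For illustration, one way to avoid the runtime import is to keep pydantic behind TYPE_CHECKING and duck-type the check; this is only a sketch of the general technique, not necessarily what the linked PR does:

from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:  # pydantic is only imported for static type checking
    import pydantic


def avro_field_default(default: "pydantic.BaseModel | Any") -> Any:
    """Return a JSON-serializable default without importing pydantic at runtime."""
    # Duck-type the check: anything exposing model_dump() is treated as a model
    if callable(getattr(default, "model_dump", None)):
        return default.model_dump(mode="json")
    return default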

@dada-engineer
Contributor Author

An alternative would be to allow setting orjson.dumps's default argument. According to orjson's docs this makes the JSON serialization less performant, though (no problem for our use case, but it might be for others).
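
For reference, that alternative would look roughly like this (orjson calls the default callback only for objects it cannot serialize natively):

import orjson
import pydantic


class Bar(pydantic.BaseModel):
    baz: int = 0


def _default(obj):
    """Called by orjson only for objects it cannot serialize natively."""
    if isinstance(obj, pydantic.BaseModel):
        return obj.model_dump(mode="json")
    raise TypeError  # fall back to orjson's normal error for anything else


schema_json = orjson.dumps({"bar": Bar()}, default=_default)  # b'{"bar":{"baz":0}}'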

dada-engineer added a commit to dada-engineer/py-avro-schema that referenced this issue Mar 8, 2024
@dada-engineer
Contributor Author

dada-engineer commented Mar 8, 2024

Personally I'd prefer to handle this within the PydanticSchema class, but I'd like to have your opinion on this.

Downside: this cannot handle a dataclass that has an attribute that is a pydantic class with a default. Not sure how common this use case is or whether I should add handling for it to DataclassSchema as well. 😅

dada-engineer added a commit to dada-engineer/py-avro-schema that referenced this issue Mar 8, 2024
@faph
Contributor

faph commented Mar 8, 2024

ack

@dada-engineer
Contributor Author

@faph any news on this?

@faph
Contributor

faph commented Mar 18, 2024

Hi @dada-engineer

I am just trying to get my head around this and what the expected behavior should be. Reading the spec here: https://avro.apache.org/docs/1.11.1/specification/#schema-record makes me think that we should be following the Avro spec exactly on how field defaults should be serialized. There is only 1 correct serialization for a given type, so that would suggest we should NOT make the serialization configurable.

Instead, we might need to consider using the make_default methods consistently and recursively when generating defaults for a record type?

I have to say, I have not thought about using an entire Pydantic object as a default. I guess the Avro spec supports it, so py-avro-schema should support it too.

@faph
Contributor

faph commented Mar 18, 2024

So the problem here is that PydanticSchema itself does not implement make_default. It probably should, and it should do this recursively by using each field's schema's make_default.
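
Roughly, the idea is something like this; a concept sketch only, with hypothetical attribute names rather than the real internals, and not the code that eventually landed:

from typing import Any, Dict


class RecordSchemaSketch:
    """Concept sketch of a record schema that builds defaults recursively.

    `field_schemas` maps field names to schema objects exposing make_default();
    nested record schemas handle their own fields the same way.
    """

    def __init__(self, field_schemas: Dict[str, Any]):
        self.field_schemas = field_schemas

    def make_default(self, py_default: Any) -> Dict[str, Any]:
        # Delegate each field's default to that field's schema, recursing into records
        return {
            name: schema.make_default(getattr(py_default, name))
            for name, schema in self.field_schemas.items()
        }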

@dada-engineer
Contributor Author

Thanks a lot for your feedback. This makes sense, yes.

I am not sure if I understand how you want this to be implemented though.

@faph
Contributor

faph commented Mar 18, 2024

Let me push a concept PR that would demonstrate this.

@dada-engineer
Contributor Author

I think I have some idea. I'll close the merge request regarding the orjson default and update the other one. Maybe you can tell me if there are other, more optimal fields to use 👍🏻

@faph
Contributor

faph commented Mar 18, 2024

It's not unusual for complex record fields like in your case to be "nullable" instead of defaulting to a complex/record object. It's up to you whether such a default should be present in the Avro schema itself.

For the Python class, it's not unusual for the class/function signature to default to None and for such a field then to be set to a given object in the body of __init__(). Certainly, for plain Python classes that do not support mutable defaults, you would use this approach.
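
For example, a plain-Python version of the Bar/Foo example above would look something like this:

from typing import Optional


class Bar:
    def __init__(self, baz: int = 0):
        self.baz = baz


class Foo:
    def __init__(self, bar: Optional[Bar] = None):
        # The signature defaults to None; the real default object is created
        # per instance here, avoiding a shared mutable default.
        self.bar = bar if bar is not None else Bar()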

@dada-engineer
Contributor Author

Are you saying we should rethink our defaults? 🤔

@faph
Contributor

faph commented Mar 18, 2024

No, no, if that's the right design for you that's fine. Avro and Pydantic support it. Just saying that plain Python classes don't.

@faph
Contributor

faph commented Mar 18, 2024

So, this PR, #67, actually passes your test, but it would not be the correct solution.

It is arguably better than the current implementation, though, so we can think about whether we should merge it like this in the meantime...

@dada-engineer
Contributor Author

Updated the merge request. Feel free to have a look.

@faph
Contributor

faph commented Mar 18, 2024

That looks good, seems simpler than I expected, which is good!

dada-engineer added a commit to dada-engineer/py-avro-schema that referenced this issue Mar 18, 2024
dada-engineer added a commit to dada-engineer/py-avro-schema that referenced this issue Mar 18, 2024
@faph faph closed this as completed in #65 Mar 18, 2024
@faph
Contributor

faph commented Mar 18, 2024

Released as 3.5.0.

Many thanks!

@koenlek

koenlek commented Oct 2, 2024

For anyone interested, I managed to backport this to v2 (which I'm stuck on) with this diff on src/py_avro_schema/_schemas.py:

@@ -522,6 +522,12 @@
             "items": self.items_schema.data(names=names),
         }
 
+    def make_default(self, py_default: collections.abc.Sequence) -> JSONArray:
+        """Return an Avro schema compliant default value for a given Python Sequence
+        :param py_default: The Python sequence to generate a default value for.
+        """
+        return [self.items_schema.make_default(item) for item in py_default]
+
 
 class DictSchema(Schema):
     """An Avro map schema for a given Python mapping"""
@@ -841,6 +847,22 @@
         )
         return field_obj
 
+    def make_default(self, py_default: pydantic.BaseModel) -> JSONObj:
+        """Return an Avro schema compliant default value for a given Python value"""
+        return {key: _schema_obj(self._annotation(key)).make_default(value) for key, value in py_default}
+
+    def _annotation(self, field_name: str) -> Type:
+        """
+        Fetch the raw annotation for a given field name
+
+        Pydantic "unpacks" annotated and forward ref types in their FieldInfo API. We need to access to full, raw
+        annotated type hints instead.
+        """
+        for class_ in self.py_type.mro():
+            if class_.__annotations__.get(field_name):
+                return class_.__annotations__[field_name]
+        raise ValueError(f"{field_name} is not a field of {self.py_type}")  # Should never happen
+
     @staticmethod
     def _py_type(py_field: pydantic.fields.ModelField) -> Any:
         """Return a Python type annotation for a given Pydantic field"""
