Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set a UUID when building a Schema object. #32399

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

scwhittle
Copy link
Contributor

Schemas are immutable so this meets the guarantee that a consistent UUID is used for matching schemas. Cleanup some cases setting a random UUID after creating Schema, fix case where same UUID was assigned to different Schema after sorting, and use Immutable data structures to enforce immutability.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

Copy link
Contributor

github-actions bot commented Sep 5, 2024

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

Copy link
Contributor

@Abacn Abacn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this PR contains several indepenent fixes, I'm good with most of them, but not sure about the oneOfType one, suggested to involve other reviewer who might now

}

@Override
public byte[] getArgument() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a breaking change. However I am not sure the implication, good to involve others reviewer, possibly @robertwb ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous behavior seems incorrect as proto serialization is not necessarily deterministic and the serialized proto equality doesn't match schema equality logic.

But I think that you are correct that this would break schema update compatability because LogicalType base class appears to be serializable.

@robertwb Any ideas on how to best proceed? We could

  1. keep OneOf as-is and just document that it's equality isn't guaranteed to be correct due to proto serialization and clear the uuid before serializing the proto.
  2. add a new type with correct semantics and deprecate this one
  3. try to integrate w/ update compatibility version but that seems difficult given schemas are created all over. We coudl have some static method we call that modifies static state instead of plumbing the option everywhere. getArgument base is to return an Object so we can have single class support both types based upon the configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question.

First, let's separate this change out into its own PR, so we can get the rest (which looks good) in.

I'm worried about what existing pipelines might break if we change this. Does a (null?) Row eventually get serialized with its schema as proto bytes anyway? We could look into explicitly using deterministic proto serialization here as an improvement at least.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some change of OneOf is needed because it's schema now has a uuid set and thus doesn't compare equal.

However I changed to keep using serialized proto but to recursively remove the UUID from the schema proto before serializing.

PTAL

Schemas are immutable so this meets the guarantee that a consistent UUID is used for matching schemas.

Cleanup some cases setting a random UUID after creating Schema.
Fix case where same UUID was assigned to different Schema after sorting.
Use Immutable data structures to enforce immutability.
Update OneOfType which using serialized proto equality which was incorrect if there was uuid, or encoding positions. Instead we can use a null Row using the generated oneof schema.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants