Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Extended schema JSON serde to support collations #3628

Open
wants to merge 36 commits into
base: master
Choose a base branch
from

Conversation

ilicmarkodb
Copy link
Contributor

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Extended serialization and deserialization to support collations in metadata.

How was this patch tested?

Tests added to DataTypeJsonSerDe.java and StructTypeSuite.scala.

Does this PR introduce any user-facing changes?

No.

@ilicmarkodb
Copy link
Contributor Author

@vkorukanti can you please review?

@vkorukanti vkorukanti changed the title Extended serialization and deserialization to support collations in metadata. [Kernel] Extended schema JSON serde to support collations Sep 6, 2024
Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Add bit more comments on how the collation property is stored. You can look at the method docs in ColumnMapping (where similar nested level field ids are stored in metadata) for an example docs.


import java.util.Optional;

public class CollationIdentifier {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc, @since version and @evolving tag

import java.util.Optional;

public class CollationIdentifier {
public static final String PROVIDER_SPARK = "SPARK";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do these constant mean? add some comment?

Comment on lines 45 to 47
if (parts.length == 1) {
throw new IllegalArgumentException(
String.format("Invalid collation identifier: %s", identifier));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkArgument(parts.length != 1, String.format("Invalid collation identifier: %s", identifier));

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or switch(parts.length) {
case 2:
case 3:
default: throw error
}

return String.format("%s.%s", provider, name);
}

public static CollationIdentifier fromString(String identifier) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All public APIs are going to show up in API docs. Please add proper javadoc.

Comment on lines 34 to 36
this.provider = provider;
this.name = name;
this.version = version;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null checks. Objects.requireNonNull

@@ -16,6 +16,7 @@
package io.delta.kernel.types;

import io.delta.kernel.annotation.Evolving;
import io.delta.kernel.expressions.CollationIdentifier;

/**
* The data type representing {@code string} type values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the doc to include collation info.

nestedCollatedFields.addAll(
getNestedCollatedFields(((ArrayType) parent).getElementType(), path + ".element"));
}
return nestedCollatedFields;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment, why we don't care about the DataType being StructType (basically we store the collation to the nearest structfield).

@@ -130,6 +132,14 @@ class DataTypeJsonSerDeSuite extends AnyFunSuite {
}
}

test("parseDataType: types with collated strings") {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some negative tests which cause the parser to fail.

case(json, structType) =>
assert(parse(json) == structType)
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about serialize tests?


import org.scalatest.funsuite.AnyFunSuite

class StructTypeSuite extends AnyFunSuite {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. Why not just add the tests in DataTypeJsonSerDeSuite itself?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, you can roundtrip and test both serialize and deserialize works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants