Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relax dict_id equality in field merging #6356

Open
brancz opened this issue Sep 4, 2024 · 4 comments
Open

Relax dict_id equality in field merging #6356

brancz opened this issue Sep 4, 2024 · 4 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@brancz
Copy link
Contributor

brancz commented Sep 4, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Trying to merge two schemas that are compatible, other than having fields whose dict_ids are different.

Describe the solution you'd like

Relax dict_id handling during merging of fields, such that if they are different the dict_id is unset, and only if they are equal are they kept.

Describe alternatives you've considered

Before opening this, I tried implementing my own merge function, but would have to change it quite significantly compared to the existing code, since I can't access internal fields. I could still make this work, but wanted to first check whether the project would be open to relaxing this, since this feels more aligned with how merging works otherwise (eg. making a field nullable).

Additional context

All of this is under the assumption that I understand dict_id correctly, which is that it's primarily used for IPC's ability to duplicate dicts.

@brancz brancz added the enhancement Any new improvement worthy of a entry in the changelog label Sep 4, 2024
@tustvold
Copy link
Contributor

tustvold commented Sep 5, 2024

#5981 is possibly related.

As it stands I am not sure we can ignore the dict ID when merging, as unlike with nulls, two arrays with different dictionary IDs can't be "merged", as there is no covering relationship between two different dict IDs.

@brancz
Copy link
Contributor Author

brancz commented Sep 5, 2024

Isn't that just an optimization though? I understand that if two dict IDs are identical one can just append the index arrays to each other safely, and if not then the dicts needs to be unified and indexes added transposed (using the terminology that the Go arrow package uses not sure it's universal it's just the library I have the most experience with).

@tustvold
Copy link
Contributor

tustvold commented Sep 5, 2024

Unfortunately that isn't how the IPC implementation treats them, in particular IPC files use dictionary IDs as unique identifiers to immutable dictionaries. It is all a bit confused, and it is peculiar that they're part of the schema definition at all, but that is what #5981 concerns.

@brancz
Copy link
Contributor Author

brancz commented Sep 6, 2024

If I understand this right then, it should be possible to remove dict_id from Field if we solve IPC dictionary ID handling some other way. And consequently dict_id would no longer be part of equality of Field. Do I understand correctly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants