feature request: have a TracingOption to represent enums as flattened structs instead of unions #221

Open
raj-nimble opened this issue Aug 13, 2024 · 2 comments


raj-nimble commented Aug 13, 2024

Hi Chris,

I'd like to propose and discuss a new feature: flattening Rust enums into a struct (or map?) with a wide schema. The fields of non-selected variants would be None/null, and deserialization would pick the variant by checking which variant's fields are populated.

This would be a workaround for the fact that unions are not supported in Parquet (and support doesn't appear to be coming any time soon). Ideally it would be generic enough to be useful to anyone who wants it, and the feature seems like a good fit for this crate.

For the interface, I would hope it would be a simple option set in TracingOptions, e.g. TracingOptions::new().flatten_enums_into_structs(true).

If you agree with the feature but don't think you'll have time to work on it yourself, I would like to help implement it, although obviously you would implement it much faster than I would. I notice from your branch activity that you appear to be actively working on version 0.12 of the crate. Are you perhaps thinking about this already, or would you want to delay a feature like this until the next version? Either way, I would love to discuss possibilities. Please let me know your thoughts.

Thanks,
Raj


raj-nimble commented Aug 14, 2024

As an example, imagine we had the following Rust enum:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
enum RecordEnum {
    Inside { room: String },
    Outside { street: String, zipcode: u16 },
}

With the option enabled, I think we could map that enum to an Arrow Field equivalent to the one traced for the following flattened struct:

#[derive(Serialize, Deserialize)]
struct RecordEnumStruct {
    inside_room: Option<String>,
    outside_street: Option<String>,
    outside_zipcode: Option<u16>,
}

Each field name is prefixed with its variant name to prevent collisions between variants that share a field name.
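
To make the mapping concrete, here is a minimal hand-written sketch of the round trip between the two types above. It is only an illustration under assumptions, not how the crate would implement the option: the serde derives are dropped so the block compiles on its own, and the rule "a variant is selected only when all of its fields, and no others, are populated" is my guess at the checking logic described earlier.

```rust
use std::convert::TryFrom;

// Same shapes as above, without the serde derives so the sketch is self-contained.
#[derive(Debug, Clone, PartialEq)]
enum RecordEnum {
    Inside { room: String },
    Outside { street: String, zipcode: u16 },
}

#[derive(Debug, Clone, PartialEq, Default)]
struct RecordEnumStruct {
    inside_room: Option<String>,
    outside_street: Option<String>,
    outside_zipcode: Option<u16>,
}

// Flattening: fill the selected variant's fields, leave all others as None.
impl From<RecordEnum> for RecordEnumStruct {
    fn from(value: RecordEnum) -> Self {
        match value {
            RecordEnum::Inside { room } => RecordEnumStruct {
                inside_room: Some(room),
                ..Default::default()
            },
            RecordEnum::Outside { street, zipcode } => RecordEnumStruct {
                outside_street: Some(street),
                outside_zipcode: Some(zipcode),
                ..Default::default()
            },
        }
    }
}

// Un-flattening: accept a row only if exactly one variant is fully populated.
// (Assumed rule; a real implementation might be more lenient.)
impl TryFrom<RecordEnumStruct> for RecordEnum {
    type Error = String;

    fn try_from(flat: RecordEnumStruct) -> Result<Self, Self::Error> {
        match (flat.inside_room, flat.outside_street, flat.outside_zipcode) {
            (Some(room), None, None) => Ok(RecordEnum::Inside { room }),
            (None, Some(street), Some(zipcode)) => {
                Ok(RecordEnum::Outside { street, zipcode })
            }
            (None, None, None) => Err("no variant has its fields populated".to_string()),
            _ => Err("fields are populated for more than one variant, or only partially".to_string()),
        }
    }
}
```

One caveat this sketch exposes: unit variants, or variants whose fields are all optional, cannot be told apart from "nothing set" by field inspection alone, so a real implementation would probably need an extra discriminant column or a stricter rule.
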
I have an outer record like this, holding both types:

#[derive(Serialize, Deserialize)]
struct Record {
    a: RecordEnum,
    b: RecordEnumStruct,
}

Represented as Arrow Fields, instead of this union for field a:

    Field {
        name: "a",
        data_type: Union(
            [
                (
                    0,
                    Field {
                        name: "Inside",
                        data_type: Struct(
                            [
                                Field {
                                    name: "room",
                                    data_type: LargeUtf8,
                                    nullable: false,
                                    dict_id: 0,
                                    dict_is_ordered: false,
                                    metadata: {},
                                },
                            ],
                        ),
                        nullable: false,
                        dict_id: 0,
                        dict_is_ordered: false,
                        metadata: {},
                    },
                ),
                (
                    1,
                    Field {
                        name: "Outside",
                        data_type: Struct(
                            [
                                Field {
                                    name: "street",
                                    data_type: LargeUtf8,
                                    nullable: false,
                                    dict_id: 0,
                                    dict_is_ordered: false,
                                    metadata: {},
                                },
                                Field {
                                    name: "zipcode",
                                    data_type: UInt16,
                                    nullable: false,
                                    dict_id: 0,
                                    dict_is_ordered: false,
                                    metadata: {},
                                },
                            ],
                        ),
                        nullable: false,
                        dict_id: 0,
                        dict_is_ordered: false,
                        metadata: {},
                    },
                ),
            ],
            Dense,
        ),
        nullable: false,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    }

We would auto-convert to this:

    Field {
        name: "b",
        data_type: Struct(
            [
                Field {
                    name: "inside_room",
                    data_type: LargeUtf8,
                    nullable: true,
                    dict_id: 0,
                    dict_is_ordered: false,
                    metadata: {},
                },
                Field {
                    name: "outside_street",
                    data_type: LargeUtf8,
                    nullable: true,
                    dict_id: 0,
                    dict_is_ordered: false,
                    metadata: {},
                },
                Field {
                    name: "outside_zipcode",
                    data_type: UInt16,
                    nullable: true,
                    dict_id: 0,
                    dict_is_ordered: false,
                    metadata: {},
                },
            ],
        ),
        nullable: false,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    }

With the first field (the union-typed a) removed from the record, I can write Parquet files just fine. If we can do this conversion for the user automatically, I think it could be quite useful.
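
The automatic conversion of the schema could be sketched roughly as below. Everything here is hypothetical: SketchType, SketchField and flatten_union are simplified stand-ins I made up for illustration, not types or functions from arrow or from this crate. The sketch only shows the renaming and nullability rule from the example above (lower-cased variant name, underscore, field name, all children nullable).

```rust
/// Simplified stand-in for an Arrow data type (hypothetical, not arrow's DataType).
#[derive(Debug, Clone, PartialEq)]
enum SketchType {
    LargeUtf8,
    UInt16,
    Struct(Vec<SketchField>),
    /// One child field per enum variant, mirroring a union.
    Union(Vec<SketchField>),
}

/// Simplified stand-in for an Arrow field (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct SketchField {
    name: String,
    data_type: SketchType,
    nullable: bool,
}

impl SketchField {
    fn new(name: &str, data_type: SketchType, nullable: bool) -> Self {
        SketchField { name: name.to_string(), data_type, nullable }
    }
}

/// Turn a union field into a struct field with one nullable child per
/// variant field, named `<variant>_<field>`. Non-union fields are returned
/// unchanged. Nested unions inside variants are not handled in this sketch.
fn flatten_union(field: &SketchField) -> SketchField {
    let variants = match &field.data_type {
        SketchType::Union(variants) => variants,
        _ => return field.clone(),
    };

    let mut flat = Vec::new();
    for variant in variants {
        let prefix = variant.name.to_lowercase();
        match &variant.data_type {
            // Struct variant: one prefixed, nullable column per field.
            SketchType::Struct(children) => {
                for child in children {
                    let name = format!("{}_{}", prefix, child.name);
                    flat.push(SketchField::new(&name, child.data_type.clone(), true));
                }
            }
            // Any other variant payload: a single nullable column named after the variant.
            other => flat.push(SketchField::new(&prefix, other.clone(), true)),
        }
    }

    SketchField::new(&field.name, SketchType::Struct(flat), field.nullable)
}
```

Applied to a stand-in for the union field a above, this yields a struct with the children inside_room, outside_street and outside_zipcode, all nullable, which matches the shape of field b.
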

@raj-nimble raj-nimble changed the title feature request: serialize enums as flattened structs feature request: have a TracingOption to represent enums as flattened structs instead of unions Aug 14, 2024
@raj-nimble

Initial draft MR #222
