A dataset with extension types cannot be queried by DuckDB #2884

Open
westonpace opened this issue Sep 13, 2024 · 4 comments

@westonpace
Contributor

Here's a simple reproducer.

import duckdb
import pyarrow as pa
import pyarrow.dataset
import uuid
import lance

class UuidType(pa.ExtensionType):

    def __init__(self):
        super().__init__(pa.binary(16), "my_package.uuid")

    def __arrow_ext_serialize__(self):
        # Since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Sanity checks, not required but illustrate the method signature.
        assert storage_type == pa.binary(16)
        assert serialized == b''
        # Return an instance of this subclass given the serialized
        # metadata.
        return UuidType()

storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
arr = pa.ExtensionArray.from_storage(UuidType(), storage_array)
tab = pa.table({"uuids": arr, "normal_type": [1, 2, 3, 4]})

ds = lance.write_dataset(tab, "/tmp/foo.lance", mode="overwrite", data_storage_version="2.0")

print(duckdb.sql("SELECT normal_type from ds").fetchall())

I get the error: duckdb.duckdb.NotImplementedException: Not implemented Error: Arrow Type with extension name: my_package.uuid and format: w:16, is not currently supported in DuckDB

@westonpace
Contributor Author

I've raised this with duckdb here: duckdb/duckdb#13931

I can work around this using a scanner:

scanner = ds.scanner(columns=["normal_type"])
print(duckdb.sql("SELECT normal_type from scanner").fetchall())

However, that has the downside of preventing filter pushdown.

I can also work around this in a rather hacky way:

class WrappedDataset(pyarrow.dataset.Dataset):

    def __init__(self, ds, schema):
        self._ds = ds
        self.pruned_schema = schema

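    @property
    def schema(self):
        # Never reached in practice: __getattribute__ below intercepts "schema".
        return object.__getattribute__(self, "pruned_schema")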

    def __getattribute__(self, attr):
        if attr == "schema":
            return object.__getattribute__(self, "pruned_schema")
        else:
            ds = super(WrappedDataset, self).__getattribute__("_ds")
            return object.__getattribute__(ds, attr)

pruned_schema = ds.schema
pruned_schema = pruned_schema.remove(0)
wrapped = WrappedDataset(ds, pruned_schema)

print(duckdb.sql("SELECT normal_type from wrapped").fetchall())

@westonpace
Contributor Author

In the meantime I suppose we could implement some kind of "dataset view" which automatically applies a projection and has a revised schema.

@wjones127
Contributor

Perhaps if we get a NotImplementedException we could do a fallback projection that strips the extension metadata and just deals with storage types.

I think that's sort of what we might want for vectors. If they don't support it, drop the FixedShapeTensorArray metadata and just provide them the FixedSizeList.

@westonpace
Contributor Author

westonpace commented Sep 13, 2024

Unfortunately we do not get the exception. DuckDB grabs the schema from us, determines that it cannot handle it, and raises the error directly to the user.
