[QUESTION] Could integration with Ibis be supported? #88

Open
galenseilis opened this issue Sep 19, 2024 · 2 comments

Comments

@galenseilis
Contributor

galenseilis commented Sep 19, 2024

Description

I am exploring using a combo of Ibis, Kedro, and Pandera if that's possible.

Context

With Ibis I could write more consistent dataframe code regardless of the backend (e.g. Polars, MSSQL, or PostgreSQL), get better performance than Pandas, and solve the parametrization problem that comes with integrating Python and SQL. With Kedro I get consistent data science project structures. With Pandera I get dataframe data validation. Anyone who cares about those things would similarly benefit from Kedro-Pandera integration with Ibis.

I would like something highly similar to what I see in the Kedro-Pandera plugin's documentation, except to also support Ibis datasets.

Possible Implementation

I'm not currently familiar with the internals of Kedro-Pandera, so my suggestion is limited by that lack of understanding.

Because Kedro-Pandera is responsible for integrating Kedro and Pandera, the implementation should depend on the current behaviour of Kedro, Pandera, and Ibis rather than modify their behaviour.

I've noted that Pandera supports Polars in addition to Pandas; however, Ibis has its own classes, which I do not expect Pandera to support. Instead, the implementation could take advantage of the fact that Ibis table objects provide to_pandas and to_polars methods.

Here is a summary of the logic I have in mind:

  • If a dataset is annotated to be one of the already-supported datasets, proceed as usual.
  • If a dataset is a kedro_datasets.ibis.TableDataset then load that dataset, convert it to polars/pandas, then run the Pandera validator on it.
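A rough sketch of that dispatch, to make the idea concrete. It is not based on the Kedro-Pandera internals: the function names `looks_like_ibis_table` and `validate_loaded` are hypothetical, the module-name check is only a heuristic for spotting Ibis tables without importing Ibis, and `schema` is assumed to be any Pandera schema object (something with a `validate` method) that matches the converted frame type.

```python
from typing import Any


def looks_like_ibis_table(data: Any) -> bool:
    """Heuristic (assumption): Ibis expression classes live in modules under ``ibis``."""
    module = type(data).__module__ or ""
    return module == "ibis" or module.startswith("ibis.")


def validate_loaded(data: Any, schema: Any, prefer: str = "pandas") -> Any:
    """Validate loaded data with a Pandera-style schema.

    Ibis tables are materialised with ``to_pandas`` or ``to_polars`` first;
    anything else is passed to the schema unchanged, as it is today.
    """
    if looks_like_ibis_table(data):
        # Executes the Ibis expression and pulls the result into memory.
        frame = data.to_polars() if prefer == "polars" else data.to_pandas()
    else:
        frame = data

    # Pandera schemas raise on failure, so reaching the return means it passed.
    schema.validate(frame)

    # Hand back the original object so downstream nodes still get the Ibis table.
    return data
```

Returning the original object rather than the converted frame is a design choice in this sketch: validation stays a side check and the rest of the pipeline keeps working with lazy Ibis expressions. The cost is that the table is materialised once just for validation.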

Possible Alternatives

Another option is for me to build a Kedro pipeline for this type of validation instead. This would involve converting the Ibis table dataset to a Polars dataframe myself, loading the schema as a YAML Kedro dataset, and running the Pandera validator against the Polars dataframe.
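As a minimal sketch of that alternative, the validation could live in an ordinary node. Everything named here is hypothetical: the node function `validate_ibis_table`, and the dataset names `raw_table`, `raw_table_schema`, and `validated_table`. The sketch also assumes the schema input already arrives as a Pandera schema object for Polars frames; turning the YAML content into that object is left out.

```python
from typing import Any


def validate_ibis_table(table: Any, schema: Any) -> Any:
    """Node function: validate an Ibis table against a Pandera-style schema."""
    # Materialise the Ibis expression as a Polars dataframe.
    frame = table.to_polars()
    # Raises if the frame does not satisfy the schema.
    schema.validate(frame)
    # Pass the original Ibis table on to the next node.
    return table


def create_pipeline():
    """Wire the validation node into a pipeline (dataset names are placeholders)."""
    # Imported here so the node function above can be used without Kedro installed.
    from kedro.pipeline import Pipeline, node

    return Pipeline(
        [
            node(
                func=validate_ibis_table,
                inputs=["raw_table", "raw_table_schema"],
                outputs="validated_table",
                name="validate_raw_table",
            )
        ]
    )
```

This keeps Kedro-Pandera out of the picture entirely, at the cost of one extra node and catalog entry per table that needs validating.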

@galenseilis galenseilis changed the title [QUESTION] Is integration with Ibis supported? [QUESTION] Could integration with Ibis be supported? Sep 19, 2024
@Galileo-Galilei
Owner

This is definitely valuable and should be added to the roadmap.

TBH I have had a hard time maintaining the plugins recently, and kedro-pandera is quite inactive. I plan to resume working on it one day, but I can't say when that will be.

I will definitely accept and release a PR, though.

@noklam
Collaborator

noklam commented Sep 19, 2024

Similar situation: I cannot take on any active development work, but I can spare some time for PR review if someone is willing to spend time on this.
