
Investigate using Ibis for the common interface library to any DF backend #89

rht opened this issue Aug 29, 2024 · 9 comments

@rht
Contributor

rht commented Aug 29, 2024

https://ibis-project.org/ has various backends: Dask, pandas, Polars, and many more. But apparently, the pandas backend is so problematic to maintain that they have decided to remove it.

@adamamer20
Collaborator

This seems amazing! A single API with a Polars-like feel, 20+ backends, and great performance; it should be a priority. The API seems very expressive and it also supports geospatial operations. It looks like it can also expose functions specific to the chosen backend, although that is not very straightforward (https://ibis-project.org/reference/scalar-udfs#ibis.expr.operations.udf.scalar.builtin). There is no mention of a GPU backend (maybe we can open an issue to ask them about it?), but they might use RAPIDS-polars when it comes out.

Regarding pandas performance, I believe it might always be a bottleneck. We might consider ditching it completely in the future.
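A minimal sketch of what that API feel looks like in practice, assuming a recent ibis-framework with its default DuckDB backend installed; the table and column names here are made up:

```python
import ibis

agents = ibis.memtable({"unique_id": [1, 2, 3], "wealth": [10.0, 5.0, 7.5]})

# Deferred, Polars-like expression API: nothing executes until .to_pandas()/.execute()
rich = agents.filter(agents.wealth > 6).order_by(ibis.desc("wealth"))
print(rich.to_pandas())

# Backend-specific functions can still be reached through "builtin" scalar UDFs,
# e.g. mapping directly to DuckDB's built-in hamming() function:
@ibis.udf.scalar.builtin
def hamming(a: str, b: str) -> int: ...
```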

@adamamer20
Collaborator

adamamer20 commented Aug 31, 2024

I read a bit about Ibis, and its greatest strength is the use of DuckDB as a backend (which might be even more performant than lazy Polars).
However, I have also read that some people find the Ibis API a bit counterintuitive.
Also, if DuckDB is objectively faster than any other backend, why would you switch to another one?
The greatest selling point is that, by decoupling the API from the backend, you can quickly switch backends in the future if anything faster comes along, without changing the code you have written.
We need to acknowledge that a lot of people are currently used to the pandas API. Maybe we could have pandas (perhaps switching to Dask or Modin), Polars, and Ibis as the three interfaces.

EDIT: Another pro of Ibis is the possibility of using a distributed engine like PySpark.
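A rough illustration of the decoupling argument above: the same deferred expression can be executed on DuckDB today and on another Ibis backend later, without rewriting it (hypothetical data; both backends must be installed):

```python
import ibis

agents = ibis.memtable({"wealth": [1.0, 2.0, 3.0, 4.0]})
mean_wealth = agents.wealth.mean()   # deferred expression, no engine chosen yet

ibis.set_backend("duckdb")           # execute on DuckDB
print(mean_wealth.to_pandas())

ibis.set_backend("polars")           # switch engines, same expression
print(mean_wealth.to_pandas())
```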

@rht
Contributor Author

rht commented Aug 31, 2024

The problem is that we already have an abstract DF interface, which is yet another Ibis. Ibis itself has long tried to support pandas because it is crucial to their user-base growth. Do Ibis' gripes resonate with your experience writing the concrete pandas backend for mesa-frames?

We need to acknowledge that a lot of people are currently used to the pandas API. Maybe we could have pandas (perhaps switching to Dask or Modin), Polars, and Ibis as the three interfaces.

I'd say the main reason for using mesa-frames is performance, and people would be willing to use a faster option than pandas as long as the syntax is not as tedious as CUDA/FLAME 2. What is the fastest you can get while staying on pandas syntax (i.e. Dask/Modin), and is it faster than lazy Polars? We need to know this before proceeding.
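One rough way to answer this before committing (hypothetical sizes; Dask and Polars are installed separately, and a proper benchmark would use actual mesa-frames workloads rather than a toy groupby):

```python
import time
import numpy as np
import pandas as pd
import polars as pl
import dask.dataframe as dd

n = 10_000_000
pdf = pd.DataFrame({"key": np.random.randint(0, 1_000, n), "val": np.random.rand(n)})

t0 = time.perf_counter()
pdf.groupby("key")["val"].mean()
print("pandas      :", time.perf_counter() - t0)

t0 = time.perf_counter()
dd.from_pandas(pdf, npartitions=8).groupby("key")["val"].mean().compute()
print("dask        :", time.perf_counter() - t0)

t0 = time.perf_counter()
pl.from_pandas(pdf).lazy().group_by("key").agg(pl.col("val").mean()).collect()
print("lazy polars :", time.perf_counter() - t0)
```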

@adamamer20
Collaborator

I'd say the main reason for using mesa-frames is performance, and people would be willing to use a faster option than pandas as long as the syntax is not as tedious as CUDA/FLAME 2. What is the fastest you can get while staying on pandas syntax (i.e. Dask/Modin), and is it faster than lazy Polars? We need to know this before proceeding.

I think both Dask and Modin are definitely slower than Polars / DuckDB (on a single node). I think performance is key. However, I fear modelers might not use mesa-frames because it's yet another API to learn. But having a single DF backend would simplify development greatly. We could also have a single Ibis backend and return values in a pandas DataFrame for convenience. pandas now supports the PyArrow backend, so it shouldn't be too slow. I have to run some experiments to compare performance between the different approaches, but this would be the easiest.
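A sketch of that "single Ibis backend, pandas returned for convenience" idea (hypothetical data; the Arrow-backed conversion needs pandas >= 2.0):

```python
import ibis

agents = ibis.memtable({"unique_id": [1, 2], "wealth": [3.0, 4.0]})

pdf = agents.to_pandas()                                  # familiar pandas DataFrame
pdf_arrow = pdf.convert_dtypes(dtype_backend="pyarrow")   # Arrow-backed dtypes
print(pdf_arrow.dtypes)
```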

@adamamer20
Collaborator

Opened an issue on GPU Acceleration

ibis-project/ibis#9986

@adamamer20
Collaborator

adamamer20 commented Sep 8, 2024

@rht check out the answer to the issue.

The potential with Ibis is quite promising. If Theseus gets released in the future, we could run Ibis models on multi-node GPUs, which would be a game-changer for companies using ABMs in production, since it could enable much larger-scale simulations. You could transition smoothly from prototyping on a single CPU to running on multi-node GPUs without code changes. Interestingly, I read that GPUs provide the most significant speed-up for join operations, which are our main bottleneck. Adopting Ibis also seems like a good way to future-proof the codebase and reduce complexity.
We can implement an option to return DFs in various formats anyway (Ibis, pandas, lazy Polars, Polars, etc.). This would allow users to write their method implementations with the tools they're familiar with while we benefit from Ibis under the hood.
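A hypothetical sketch of what such a return-format option could look like (the function name and format strings are illustrative, not an existing mesa-frames API; to_polars()/to_pyarrow() require a recent Ibis release):

```python
import ibis

def agents_as(table: ibis.Table, fmt: str = "ibis"):
    """Return the agent frame in the caller's preferred format."""
    if fmt == "ibis":
        return table                 # deferred Ibis expression
    if fmt == "pandas":
        return table.to_pandas()
    if fmt == "polars":
        return table.to_polars()
    if fmt == "pyarrow":
        return table.to_pyarrow()
    raise ValueError(f"unknown format: {fmt}")
```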

@rht
Contributor Author

rht commented Sep 8, 2024

Polars is adding GPU support, which Ibis gets "for free", covering the single-node GPU query engine case.

Which means the current mesa-frames will soon support single-node GPU via Polars. What Ibis brings to the table is limited to multi-node GPU support.
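For reference, the single-node GPU path looks roughly like this on the Polars side (assumes the GPU engine / cudf-polars package is installed; `engine="gpu"` is the opt-in added in Polars 1.x):

```python
import polars as pl

lf = pl.LazyFrame({"key": [1, 1, 2], "wealth": [1.0, 2.0, 3.0]})

# Same lazy query; only the collect() call opts into the GPU engine.
out = lf.group_by("key").agg(pl.col("wealth").sum()).collect(engine="gpu")
```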

yep, Theseus is not public (and probably won't be anytime soon) -- the general thinking is these modern single-node OLAP engines like DuckDB, DataFusion, and eventually Polars are sufficient for 90-99% of data use cases, as they allow you to scale up to ~10TB size queries

Unless there is another multi-node GPU library that Ibis already supports, I think adding an Ibis implementation now won't provide any immediate benefit to the user.

Also, to ease maintenance cost, supporting Ibis means we have to drop support for Polars, since Ibis already wraps Polars, and maintaining 3 backends is tedious.

@adamamer20
Collaborator

Unless there is another multi-node GPU library that Ibis already supports, I think adding an Ibis implementation now won't provide any immediate benefit to the user.

There is still the benefit of distributed CPU via the Spark backend (rough sketch below). But if Theseus gets released in the future (and I don't see why it wouldn't be), we would get multi-node GPU for free.

Also, to ease maintenance cost, supporting Ibis means we have to drop support for Polars, since Ibis already wraps Polars, and maintaining 3 backends is tedious.

Yes, completely agree on this one.
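A hedged sketch of the distributed-CPU path via Ibis' PySpark backend mentioned above (requires pyspark and the Ibis pyspark extra; the table name is hypothetical):

```python
import ibis
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
con = ibis.pyspark.connect(session=spark)

agents = con.table("agents")                  # hypothetical table registered in Spark
total_wealth = agents.wealth.sum().execute()  # executed by Spark, not locally
```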

@adamamer20
Collaborator

I'm creating an ibis branch, and I'm going to start making PRs there.
