Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add arbitrary matching function for joins #3445

Open
Gregstrq opened this issue Jul 6, 2024 · 3 comments
Open

Add arbitrary matching function for joins #3445

Gregstrq opened this issue Jul 6, 2024 · 3 comments
Labels
Milestone

Comments

@Gregstrq
Copy link

Gregstrq commented Jul 6, 2024

I have seen the discussion in #3363, and it was mentioned that adding arbitrary matching functions is part of 1.7 milestone. I would like to ask, what is the status of this feature request, are there any plans to implement it soon?

In principle, I could try to cook up a PR on my own, but I am not that familiar with DataFrames internals. Also the details of the syntax should be discussed in advance.

So, what is the principial difficulty with adding this feature? The way I see it, we just need to overload the equality comparison, which is used to relate the rows from the two DataFrames. Everything else, including the validation can stay without changes.

Now, there is a separate matter of the transformation of the columns, that was requested in #3363. I think, it makes sense to distinguish transformation and matching.
We should really only match the columns of the same type between two dataframes, because we also need to match different rows inside the same dataframe to do the validation.
And transformation should only provide the means to bring the columns to the same type: we first transform columns in one or both dataframes to the same type, and then run matching on the results of the transforms.

Transformaton can be already done by the users themselves simply by adding extra column. On the other hand, it is the arbitrary matching that requires tinkering with internals, but treating it separately from transformation simplifies the realization.

What do you think about it?

@bkamins bkamins added this to the 1.7 milestone Jul 7, 2024
@bkamins
Copy link
Member

bkamins commented Jul 7, 2024

See also #2738 for a related discussion.

Two key things are:

  1. syntax
  2. in general this is O(n^2) operation, but we could improve the performance in many cases; however, to allow for this detection we need syntax (see point 1. above) that would in the future allow to distinguish slow from fast operations.

@Gregstrq
Copy link
Author

Gregstrq commented Jul 7, 2024

in general this is O(n^2) operation, but we could improve the performance in many cases; however, to allow for this detection we need syntax (see point 1. above) that would in the future allow to distinguish slow from fast operations.

I don't understand this part. You need to do equality check to match the rows in two dataframes or to do validation. Equality check is a binary operation. What difference would it make to swap one binary operation with another one?

Could you may be provide the examples of slow and fast joins in the existing implementation?

  1. syntax

Why not simply pass :column_left=>binary_function=>:column_right to the on keyword?

@bkamins
Copy link
Member

bkamins commented Aug 3, 2024

Could you may be provide the examples of slow and fast joins in the existing implementation?

Existing implementation is always fast because it uses equality that is hashable (in short, as we do not always actually hash things).
If you can hash you can use dictionary for mapping.

However, for a general equality operator it does not have to be hashable, as e.g. it might not be transitive. In that case you need O(n^2) comparisons. An example of such equality operator is isapprox (a quite common case) or approximate distance matching of strings.

Why not simply pass :column_left=>binary_function=>:column_right to the on keyword?

Because, in general, it does not allow us to check if binary_function is a proper equality (i.e. if it transitive, reflexive and symmetric). We would need some syntax that would signal this and allow to e.g. hash :column_left to e.g. a dict and then do lookup o :column_right in this dict (this is just one example strategy that could be used, other are sorting, but then things would need to be sortable).

@bkamins bkamins modified the milestones: 1.7, 1.x Sep 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants