Change join_where to not allow ambiguous naming but do allow interchangeable order #18634

Closed
thomasfrederikhoeck opened this issue Sep 9, 2024 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@thomasfrederikhoeck
Contributor

thomasfrederikhoeck commented Sep 9, 2024

Description

The new functionality in #18365 is a really cool addition - nice work!

When looking at the docstring I noticed that the order of the columns in the *predicates matters. I think this is a potential source of errors for people coming from either SQL or DuckDB. I would suggest the following instead:

  1. Raise if both DataFrames have a column with the same name that is used in the join predicate. The following should raise an "Ambiguous Column" error:
import polars as pl

df1 = pl.DataFrame({"id": [1, 2, 3]})
df2 = pl.DataFrame({"id": [1, 2, 3]})

df1.join_where(df2, pl.col("id") >= pl.col("id"))

whereas the following would work:

df1 = pl.DataFrame({"id": [1, 2, 3]})
df2 = pl.DataFrame({"id": [1, 2, 3]})
df1.join_where(df2.rename({"id":"id_2"}), pl.col("id") >= pl.col("id_2"))

The following would also be allowed since the non-unique name is not in the join-predicates:

df1 = pl.DataFrame({"id": [1, 2, 3], "id_2": [2, 3, 4]})
df2 = pl.DataFrame({"id": [1, 2, 3]})

df1.join_where(df2, pl.col("id") >= pl.col("id_2"))
  2. Disregard whether a column is on the left- or right-hand side:

Since the columns are now guaranteed to be unique, the order can be made not to matter, so the following test should succeed:

df1 = pl.DataFrame({"id": [1, 2, 3]})
df2 = pl.DataFrame({"id_2": [1, 2, 3]})

from polars.testing import assert_frame_equal

assert_frame_equal(
    df1.join_where(df2, pl.col("id") >= pl.col("id_2")),
    df1.join_where(df2, pl.col("id_2") <= pl.col("id")),
    check_row_order=False,
)
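
To make the two rules concrete, here is a rough sketch of the name resolution I have in mind, written in plain Python on column-name lists rather than against the actual Polars internals. The helper `resolve_predicate_columns` is hypothetical and not part of Polars. In a real implementation the predicate's column names could presumably be collected with something like `expr.meta.root_names()`, but that wiring is an assumption and is left out here.

```python
def resolve_predicate_columns(left_columns, right_columns, predicate_columns):
    """Hypothetical helper: decide which frame each predicate column belongs to.

    Raises if a column used in the predicate exists in both frames (rule 1).
    The result does not depend on the order of predicate_columns (rule 2).
    """
    left, right = set(left_columns), set(right_columns)
    sides = {}
    for name in predicate_columns:
        in_left, in_right = name in left, name in right
        if in_left and in_right:
            raise ValueError(f"Ambiguous column: {name!r} exists in both frames")
        if not in_left and not in_right:
            raise ValueError(f"Unknown column: {name!r}")
        sides[name] = "left" if in_left else "right"
    return sides


# Same resolution regardless of which side of the comparison a column is on.
forward = resolve_predicate_columns(["id"], ["id_2"], ["id", "id_2"])
flipped = resolve_predicate_columns(["id"], ["id_2"], ["id_2", "id"])
```

Because every predicate column resolves to exactly one frame, `forward` and `flipped` hold the same mapping, which is what would let `pl.col("id") >= pl.col("id_2")` and `pl.col("id_2") <= pl.col("id")` be treated as the same join condition.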
@cmdlineluser
Contributor

There is some work going on here:

@ritchie46
Member

I think this can be closed now?

@thomasfrederikhoeck
Contributor Author

@ritchie46 Yes, I will close! The semantics you landed on seem very sound!
