
PDS Q19 failing since Polars 1.7.0 #18710

Closed
MarcoGorelli opened this issue Sep 12, 2024 · 9 comments · Fixed by #18714
Labels
accepted (Ready for implementation) · bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@MarcoGorelli
Collaborator

MarcoGorelli commented Sep 12, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

Q_NUM = 19


def q() -> pl.DataFrame:
    lineitem = pl.scan_parquet("../tpch-data/s1/lineitem.parquet")
    part = pl.scan_parquet("../tpch-data/s1/part.parquet")

    q_final = (
        part.join(lineitem, left_on="p_partkey", right_on="l_partkey")
        .filter(pl.col("l_shipmode").is_in(["AIR", "AIR REG"]))
        .filter(pl.col("l_shipinstruct") == "DELIVER IN PERSON")
        .filter(
            (
                (pl.col("p_brand") == "Brand#12")
                & pl.col("p_container").is_in(
                    ["SM CASE", "SM BOX", "SM PACK", "SM PKG"]
                )
                & (pl.col("l_quantity").is_between(1, 11))
                & (pl.col("p_size").is_between(1, 5))
            )
            | (
                (pl.col("p_brand") == "Brand#23")
                & pl.col("p_container").is_in(
                    ["MED BAG", "MED BOX", "MED PKG", "MED PACK"]
                )
                & (pl.col("l_quantity").is_between(10, 20))
                & (pl.col("p_size").is_between(1, 10))
            )
            | (
                (pl.col("p_brand") == "Brand#34")
                & pl.col("p_container").is_in(
                    ["LG CASE", "LG BOX", "LG PACK", "LG PKG"]
                )
                & (pl.col("l_quantity").is_between(20, 30))
                & (pl.col("p_size").is_between(1, 15))
            )
        )
        .select(
            (pl.col("l_extendedprice") * (1 - pl.col("l_discount")))
            .sum()
            .round(2)
            .alias("revenue")
        )
    )

    return q_final.collect()


if __name__ == "__main__":
    print(q())

Log output

join parallel: true
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
Traceback (most recent call last):
  File "/home/marcogorelli/scratch/q19.py", line 52, in <module>
    print(q())
          ^^^
  File "/home/marcogorelli/scratch/q19.py", line 48, in q
    return q_final.collect()
           ^^^^^^^^^^^^^^^^^
  File "/home/marcogorelli/scratch/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", li
ne 2034, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ColumnNotFoundError: "l_quantity" not found
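
The messages above ("join parallel: true", "parquet file must be read, ...") look like Polars' verbose scan output. A minimal sketch of how to capture the same kind of log when reproducing; enabling verbose mode via the config API is my assumption about how this log was produced, not something stated in the report:

import polars as pl

# Print scan/join decisions (e.g. the parallel strategy and the parquet
# statistics messages seen above) to stderr while the query runs.
pl.Config.set_verbose(True)

print(q())  # then run the reproducible example from above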

Issue description

Since Polars 1.7.0, Q19 has started failing.

Spotted in the Narwhals CI: https://github.com/narwhals-dev/narwhals/actions/runs/10823260522/job/30028496951?pr=951

Expected behavior

shape: (1, 1)
┌───────────┐
│ revenue   │
│ ---       │
│ f64       │
╞═══════════╡
│ 5696577.8 │
└───────────┘

Installed versions

--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.12.5 (main, Aug 14 2024, 05:08:31) [Clang 18.1.8 ]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.4.1
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.1.1
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

@MarcoGorelli MarcoGorelli added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer labels Sep 12, 2024
@ritchie46
Member

We're on it.

@ritchie46
Member

The rest runs OK?

The issue is our new parquet prefiltering strategy. Choosing a different parallel strategy works.
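
For anyone who needs to unblock before the fix lands, a possible workaround implied by the comment above is to pick a different parallel strategy on the scan. A minimal sketch; scan_parquet does accept a parallel argument, but using "columns" as the alternative here is my assumption, not something confirmed in this thread:

import polars as pl

# Override the default "auto" parallel strategy, which may otherwise select
# the new prefiltering code path mentioned above.
lineitem = pl.scan_parquet("../tpch-data/s1/lineitem.parquet", parallel="columns")
part = pl.scan_parquet("../tpch-data/s1/part.parquet", parallel="columns")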

@MarcoGorelli
Collaborator Author

MarcoGorelli commented Sep 12, 2024

Q6 and Q12 also fail with similar errors; the rest are fine.

@ritchie46
Member

Does this fix them? #18714

@MarcoGorelli
Collaborator Author

Now it's just hanging indefinitely for me.

@MarcoGorelli
Collaborator Author

Can confirm this still happens.

The code I'm running is just https://github.com/pola-rs/polars-benchmark/blob/main/queries/polars/q19.py

@MarcoGorelli MarcoGorelli reopened this Sep 12, 2024
@MarcoGorelli MarcoGorelli changed the title TPC-H Q19 failing since Polars 1.7.0 PDS Q19 failing since Polars 1.7.0 Sep 12, 2024
@ritchie46
Member

Culprit reverted. I am going to add benchmark runs to CI :')

@supermarin

+1, observing this with a very simple scan & collect. Here's some more debugging info.
Initially observed in this snippet:

sep = pl.scan_parquet(parquet.path("SEP")).filter(pl.col("date") < day).sort("date")
dates = sep.select("date").unique().tail(200)
prices = sep.select("date", "ticker", "closeadj", "open", "low", "high", "close")
last = dates.join(prices, on="date").collect()
# *** polars.exceptions.ColumnNotFoundError: "closeadj" not found

Then I went to reduce it down for this issue and observed something else: it complains about columns that are in the table on disk but weren't selected:

sep = pl.scan_parquet(parquet.path("SEP")).filter(pl.col("date") < day).sort("date")
dates = sep.select("date").unique().tail(200)
prices = sep.select("date", "closeadj", "open")
last = dates.join(prices, on="date").collect()
# *** polars.exceptions.ColumnNotFoundError: "high" not found

Or

sep = pl.scan_parquet(parquet.path("SEP")).filter(pl.col("date") < day).sort("date")
dates = sep.select("date").unique().tail(200)
prices = sep.select("date", "closeadj")
last = dates.join(prices, on="date").collect()
# *** polars.exceptions.ColumnNotFoundError: "open" not found

Will test & report back here when a fix is released. 1.6.0 is OK

@supermarin

Confirming the fix in 1.7.1.

@c-peters c-peters added the accepted Ready for implementation label Sep 16, 2024