
PDS Q19 failing since Polars 1.7.0 #18710

Closed
MarcoGorelli opened this issue Sep 12, 2024 · 9 comments · Fixed by #18714
Labels
accepted (Ready for implementation) · bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@MarcoGorelli
Collaborator

MarcoGorelli commented Sep 12, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

Q_NUM = 19


def q() -> pl.DataFrame:
    lineitem = pl.scan_parquet("../tpch-data/s1/lineitem.parquet")
    part = pl.scan_parquet("../tpch-data/s1/part.parquet")

    q_final = (
        part.join(lineitem, left_on="p_partkey", right_on="l_partkey")
        .filter(pl.col("l_shipmode").is_in(["AIR", "AIR REG"]))
        .filter(pl.col("l_shipinstruct") == "DELIVER IN PERSON")
        .filter(
            (
                (pl.col("p_brand") == "Brand#12")
                & pl.col("p_container").is_in(
                    ["SM CASE", "SM BOX", "SM PACK", "SM PKG"]
                )
                & (pl.col("l_quantity").is_between(1, 11))
                & (pl.col("p_size").is_between(1, 5))
            )
            | (
                (pl.col("p_brand") == "Brand#23")
                & pl.col("p_container").is_in(
                    ["MED BAG", "MED BOX", "MED PKG", "MED PACK"]
                )
                & (pl.col("l_quantity").is_between(10, 20))
                & (pl.col("p_size").is_between(1, 10))
            )
            | (
                (pl.col("p_brand") == "Brand#34")
                & pl.col("p_container").is_in(
                    ["LG CASE", "LG BOX", "LG PACK", "LG PKG"]
                )
                & (pl.col("l_quantity").is_between(20, 30))
                & (pl.col("p_size").is_between(1, 15))
            )
        )
        .select(
            (pl.col("l_extendedprice") * (1 - pl.col("l_discount")))
            .sum()
            .round(2)
            .alias("revenue")
        )
    )

    return q_final.collect()


if __name__ == "__main__":
    print(q())

Log output

join parallel: true
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
Traceback (most recent call last):
  File "/home/marcogorelli/scratch/q19.py", line 52, in <module>
    print(q())
          ^^^
  File "/home/marcogorelli/scratch/q19.py", line 48, in q
    return q_final.collect()
           ^^^^^^^^^^^^^^^^^
  File "/home/marcogorelli/scratch/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", li
ne 2034, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ColumnNotFoundError: "l_quantity" not found
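
The messages above ("join parallel: true", "parquet file must be read, ...") look like Polars' verbose scan output. A minimal sketch of how to capture the same kind of log when reproducing; enabling verbose mode via the config API is my assumption about how this log was produced, not something stated in the report:

import polars as pl

# Print scan/join decisions (e.g. the parallel strategy and the parquet
# statistics messages seen above) to stderr while the query runs.
pl.Config.set_verbose(True)

print(q())  # then run the reproducible example from above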

Issue description

Since Polars 1.7.0, Q19 has started failing.

Spotted in the Narwhals CI: https://github.com/narwhals-dev/narwhals/actions/runs/10823260522/job/30028496951?pr=951

Expected behavior

shape: (1, 1)
┌───────────┐
│ revenue   │
│ ---       │
│ f64       │
╞═══════════╡
│ 5696577.8 │
└───────────┘

Installed versions

--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.12.5 (main, Aug 14 2024, 05:08:31) [Clang 18.1.8 ]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.4.1
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.1.1
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

@MarcoGorelli MarcoGorelli added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer labels Sep 12, 2024
@ritchie46
Member

We're on it.

@ritchie46
Member

The rest runs OK?

The issue is our new parquet prefiltering strategy. Choosing a different parallel strategy works.
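
For anyone who needs to unblock before the fix lands, a possible workaround implied by the comment above is to pick a different parallel strategy on the scan. A minimal sketch; scan_parquet does accept a parallel argument, but using "columns" as the alternative here is my assumption, not something confirmed in this thread:

import polars as pl

# Override the default "auto" parallel strategy, which may otherwise select
# the new prefiltering code path mentioned above.
lineitem = pl.scan_parquet("../tpch-data/s1/lineitem.parquet", parallel="columns")
part = pl.scan_parquet("../tpch-data/s1/part.parquet", parallel="columns")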

@MarcoGorelli
Collaborator Author

MarcoGorelli commented Sep 12, 2024

Q6 and Q12 also fail with similar errors; the rest are fine.

@ritchie46
Member

Does this fix them? #18714

@MarcoGorelli
Collaborator Author

Now it's just hanging indefinitely for me.

@MarcoGorelli
Collaborator Author

Can confirm this still happens.

The code I'm running is just https://github.com/pola-rs/polars-benchmark/blob/main/queries/polars/q19.py

@MarcoGorelli MarcoGorelli reopened this Sep 12, 2024
@MarcoGorelli MarcoGorelli changed the title TPC-H Q19 failing since Polars 1.7.0 PDS Q19 failing since Polars 1.7.0 Sep 12, 2024
@ritchie46
Member

Culprit reverted. I am going to add benchmark runs to CI :')

@supermarin

+1, observing this with a very simple scan & collect. Here's some more debugging info.
Initially observed in this snippet:

sep = pl.scan_parquet(parquet.path("SEP")).filter(pl.col("date") < day).sort("date")
dates = sep.select("date").unique().tail(200)
prices = sep.select("date", "ticker", "closeadj", "open", "low", "high", "close")
last = dates.join(prices, on="date").collect()
# *** polars.exceptions.ColumnNotFoundError: "closeadj" not found

Then I went to reduce it down for this issue and observed something else: it complains about columns that are in the table on disk but weren't selected:

sep = pl.scan_parquet(parquet.path("SEP")).filter(pl.col("date") < day).sort("date")
dates = sep.select("date").unique().tail(200)
prices = sep.select("date", "closeadj", "open")
last = dates.join(prices, on="date").collect()
# *** polars.exceptions.ColumnNotFoundError: "high" not found

Or

sep = pl.scan_parquet(parquet.path("SEP")).filter(pl.col("date") < day).sort("date")
dates = sep.select("date").unique().tail(200)
prices = sep.select("date", "closeadj")
last = dates.join(prices, on="date").collect()
# *** polars.exceptions.ColumnNotFoundError: "open" not found

Will test & report back here when a fix is released. 1.6.0 is OK

@supermarin

Confirming the fix in 1.7.1.

@c-peters c-peters added the accepted Ready for implementation label Sep 16, 2024