[Feature] - Support column level statistics #514

nicor88 · 2023-11-18T00:59:27Z

https://aws.amazon.com/about-aws/whats-new/2023/11/aws-glue-data-catalog-generating-column-level-statistics/

Add additional configurations that allow the user to add column level statistics to the table.
Minimal config to make it work:

collect_statistics: boolean (can be configured on the project level)
glue role that collects statistics
columns to get statistics (if empty or None we must collect statistics for the all columns) - optional if statistics are enabled

Open questions

for incremental loads do we drop all statistics and recreate new ones? Or we just run a new start_column_statistics_task_run
~~are all table types supported?~~ seems only supported by hive tables, not iceberg.

Notes

Currently not available in all regions

roslovets · 2023-11-22T00:15:56Z

Did anyone find a use case for the column statistics feature?

I tried to apply it to the unique Id field of a big table and after several minutes it computed a totally wrong number of unique values. Also it did not speed up simple sql queries at all.

I agree that it looks appealing to automate these statistics with dbt. But would it be useful in real life? Given that it can slow down project building significantly.

jessedobbelaere · 2023-11-22T00:21:26Z

@roslovets I believe the main reason is a potential performance gain indeed, according to this new Cost-Based Optimizer for Athena. I haven't seen hands-on test results yet though.

nicor88 · 2023-11-22T21:10:09Z

Here another article where AWS shows improved performances
https://aws.amazon.com/blogs/big-data/enhance-query-performance-using-aws-glue-data-catalog-column-level-statistics/

roslovets · 2023-11-24T10:35:22Z

Thank you for the links folks. According to their fancy examples we should be able to really save time on downstream models and tests even if it takes up to several minutes to compute statistics for one table.

But I still cannot get why select count distinct differs from the value I see in computed statistics. It ruins the whole idea and potentially makes a query planning and data processing inadequate.

Maybe you could do tests on your big tables as well?

iconara · 2024-07-09T15:18:52Z

But I still cannot get why select count distinct differs from the value I see in computed statistics. It ruins the whole idea and potentially makes a query planning and data processing inadequate.

The reason is that neither Athena, Glue, or any other engine like Spark manages the data. Data warehouses like Redshift, Snowflake, and BigQuery manage data and can keep stats they know are reliable and use these to optimize things like COUNT(DISTINCT x). Data lake query engines are only query engines, they don't manage storage or catalogs, and therefore they can't rely on the catalog stats for optimizations like that. You could add and remove data from a table at any point in time by adding or removing an object on S3, so the stats cannot be reliable.

The stats are still useful. Knowing the magnitude of tables involved in a join can make an enormous difference in performance. If you put a ten billion row table on the right side of a join without table stats there's a good chance the query will run out of memory. With stats, it will run just fine. It's not about the exact numbers, but about how tables compare to one another.

I don't know why the distinct count in your table stats differ from the actual numbers, but I assume it's because the collector uses approximations. Because of the above, the exact count is not the goal, and it would be wasteful computing an exact count. The focus is on the magnitude, and not on the exact count.

Data lake query engines can leverage other types of statistics to optimize some queries. COUNT(*) on a table backed by Parquet or ORC data will read only the file footers and no row data, because the file footer contains the row count. Iceberg tables also store statistics in the metadata, which some engines can use both for query planning and for computing some results.

roslovets · 2024-07-27T21:53:51Z

I don't know why the distinct count in your table stats differ from the actual numbers, but I assume it's because the collector uses approximations. Because of the above, the exact count is not the goal, and it would be wasteful computing an exact count. The focus is on the magnitude, and not on the exact count.

Glue collects approximate count of distinct values, you are right. Now it's clearly declared in Glue UI.

AWS shows up to 90% performance, so careful statistics collection for some models potentially can optimize run time of the whole pipeline.

I'm going to do some tests with large and incremental tables to confirm this.

nicor88 · 2024-08-12T11:04:03Z

@roslovets could you share your results here? I would like to understand if it's worth to do some work about statistic collection.

jbwtan · 2024-08-13T20:42:03Z

@nicor88 it's difficult for me to share exact results but I've observed a CTAS that selects from multiple tables without column statistics take 8mins and with column statistics on all dependent tables take 30 seconds.

Starting task runs on one T byte sized table and 150columns takes about 45mins on 25DPUs, 100% sample size so it could be beneficial to provide capabilities for sync/async

HTH

nicor88 added the feature New feature or request label Nov 18, 2023

antonysouthworth-halter mentioned this issue Apr 18, 2024

[Feature] Support configurable management of Table Optimisers for Iceberg tables #627

Open

1 task

nicor88 changed the title ~~Support column level statistics~~ [Feature] - Support column level statistics Aug 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] - Support column level statistics #514

[Feature] - Support column level statistics #514

nicor88 commented Nov 18, 2023 •

edited

Loading

roslovets commented Nov 22, 2023

jessedobbelaere commented Nov 22, 2023

nicor88 commented Nov 22, 2023

roslovets commented Nov 24, 2023

iconara commented Jul 9, 2024

roslovets commented Jul 27, 2024

nicor88 commented Aug 12, 2024

jbwtan commented Aug 13, 2024 •

edited

Loading

[Feature] - Support column level statistics #514

[Feature] - Support column level statistics #514

Comments

nicor88 commented Nov 18, 2023 • edited Loading

Open questions

Notes

roslovets commented Nov 22, 2023

jessedobbelaere commented Nov 22, 2023

nicor88 commented Nov 22, 2023

roslovets commented Nov 24, 2023

iconara commented Jul 9, 2024

roslovets commented Jul 27, 2024

nicor88 commented Aug 12, 2024

jbwtan commented Aug 13, 2024 • edited Loading

nicor88 commented Nov 18, 2023 •

edited

Loading

jbwtan commented Aug 13, 2024 •

edited

Loading