Make use of clustering #6

bernard-wagner · 2022-02-27T14:07:38Z

Was playing around with the BigQuery public-datasets populated using ETL. Was wondering if it has been investigated whether it will be possible to add clustering in addition to partitioning. Would really reduce the processing costs if queries are structured properly.

https://cloud.google.com/bigquery/docs/clustered-tables

Although it possible to add clustering to an existing table, only new data will be clustered and have reduced processing costs:

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec

Instead, the data can be copied to a new table that has clustering enabled.

Example

As an example, the blocks table can get migrated using the following SQL script:

CREATE TABLE
   `bigquery-public-data.crypto_ethereum.blocks-cluster`
 PARTITION BY
   TIMESTAMP_TRUNC(`timestamp`, DAY)
 CLUSTER BY
   `number`, `hash` AS
 SELECT
   *
 FROM
   `bigquery-public-data.crypto_ethereum.blocks`

Thereafter the tables can be renamed to swap out the tables and delete the originals.

ALTER TABLE `bigquery-public-data:crypto_ethereum.blocks` RENAME TO `blocks-legacy`

ALTER TABLE `bigquery-public-data:crypto_ethereum.blocks-cluster` RENAME TO `blocks`

Benchmark

As a comparison, I created a copy of the blocks table and queried by number on the original table and the copy with clustering:

Original

Clustering

Comparison

Looking at the results, the overall amount of data processed was reduced from 14.7 GB to 6.9 MB.

Structure

The major design decision that will need to be made is which columns to cluster on and in what order. It is recommended to keep the order of clustered columns the same as the schema as it simplifies writing queries. Clustering only works if filters (WHERE) are applied in order of clustering.

Clustering has a minor storage overhead, but the benefits generally exceed the disadvantages and as many columns as possible should make use of clustering.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make use of clustering #6

Make use of clustering #6

bernard-wagner commented Feb 27, 2022 •

edited

Loading

Make use of clustering #6

Make use of clustering #6

Comments

bernard-wagner commented Feb 27, 2022 • edited Loading

Example

Benchmark

Original

Clustering

Comparison

Structure

bernard-wagner commented Feb 27, 2022 •

edited

Loading