Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make use of clustering #6

Open
bernard-wagner opened this issue Feb 27, 2022 · 0 comments
Open

Make use of clustering #6

bernard-wagner opened this issue Feb 27, 2022 · 0 comments

Comments

@bernard-wagner
Copy link

bernard-wagner commented Feb 27, 2022

Was playing around with the BigQuery public-datasets populated using ETL. Was wondering if it has been investigated whether it will be possible to add clustering in addition to partitioning. Would really reduce the processing costs if queries are structured properly.

https://cloud.google.com/bigquery/docs/clustered-tables

Although it possible to add clustering to an existing table, only new data will be clustered and have reduced processing costs:

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec

Instead, the data can be copied to a new table that has clustering enabled.

Example

As an example, the blocks table can get migrated using the following SQL script:

CREATE TABLE
   `bigquery-public-data.crypto_ethereum.blocks-cluster`
 PARTITION BY
   TIMESTAMP_TRUNC(`timestamp`, DAY)
 CLUSTER BY
   `number`, `hash` AS
 SELECT
   *
 FROM
   `bigquery-public-data.crypto_ethereum.blocks`

Thereafter the tables can be renamed to swap out the tables and delete the originals.

ALTER TABLE `bigquery-public-data:crypto_ethereum.blocks` RENAME TO `blocks-legacy`

ALTER TABLE `bigquery-public-data:crypto_ethereum.blocks-cluster` RENAME TO `blocks`

Benchmark

As a comparison, I created a copy of the blocks table and queried by number on the original table and the copy with clustering:

Original

Screenshot 2022-02-27 at 15 58 49

Clustering

Screenshot 2022-02-27 at 15 58 25

Comparison

Looking at the results, the overall amount of data processed was reduced from 14.7 GB to 6.9 MB.

Structure

The major design decision that will need to be made is which columns to cluster on and in what order. It is recommended to keep the order of clustered columns the same as the schema as it simplifies writing queries. Clustering only works if filters (WHERE) are applied in order of clustering.

Clustering has a minor storage overhead, but the benefits generally exceed the disadvantages and as many columns as possible should make use of clustering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant