
# Benchmarking Materialize

This document details how to run benchmarks against Materialize.

## chbench (CH-benCHmark)

We use the mixed-workload CH-benCHmark to generate an interesting dataset for Materialize. This dataset is interesting for a couple of reasons:

- The workload exercises our Debezium / Kafka / Avro support, because the load generator does not write directly to Materialize but instead writes to MySQL or Postgres. As such, it tests our ability to ingest data in a scenario that many customers use.
- The analytics queries are non-trivial and exercise many features of Materialize. They also let us showcase how fast our queries are for analytics use cases.

## Snapshot Benchmarks

Simply running chbench is not a good exercise of Materialize performance, as the benchmark is primarily bottlenecked by the upstream database. As such, there are chbench snapshots, currently generated by hand, that can be downloaded to test "replay performance".

### Downloading chbench Snapshots

Note: This step requires access to an S3 bucket in Materialize's account. Currently, only Materialize engineers have access to this bucket. Permissions are not required to generate or use snapshots, only to upload / download snapshots. This bucket is currently accessible using the dev profile defined in our infra repository.

To download a snapshot, use the `demo/chbench/bin/download_snapshot` helper script. You will need to wrap the call to this script with `aws-vault exec dev`:

```
aws-vault exec dev -- ./demo/chbench/bin/download_snapshot -d /tmp/snapshots <SNAPSHOT_ID>
```

The last argument is the unique identifier for the snapshot. Because snapshots can be quite large, they are currently hand-managed and are simply assigned a unique number starting from 1.

### Contents of a Snapshot

A snapshot contains four types of files:

- `*.sql`: These are memoized values for named queries. If Materialize successfully ingests
  all messages in the topics from a snapshot, then the results from `SELECT * FROM <view>`
  should match `<view>.sql`.
- `*.arrow`: These are Apache Arrow encoded files that contain all of the messages required to
  reconstruct a topic. Original message timestamps are preserved.
- `offsets.json`: A JSON file that maps each topic name to the number of messages in that
  topic. This information can be computed from the Arrow files, but it's much faster to read
  from this file.
- `schemas.json`: A JSON file that contains the raw values of the JSON schemas so that we can
  rebuild schema registry with the exact same schemas as when the snapshot was created.
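As a sketch, the per-topic counts in `offsets.json` can be totaled with a few lines of Python. The function name and the exact JSON shape (a flat object mapping topic name to message count) are assumptions based on the description above, not part of the snapshot tooling:

```python
import json

def total_messages(offsets_path):
    """Sum the per-topic message counts recorded in a snapshot's offsets.json.

    Assumes the file is a flat JSON object mapping topic name -> message
    count, as described above.
    """
    with open(offsets_path) as f:
        offsets = json.load(f)
    return sum(offsets.values())
```

A total like this can serve as a quick sanity check that every message was replayed, for example by comparing it against row counts over the corresponding sources.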

### Running an Ingest Benchmark

Once you have downloaded a snapshot, you can run the following command to measure Materialize ingest performance:

```
./demo/chbench/bin/snapshot_bench -s -d /tmp/snapshots/<SNAPSHOT_ID>
```

This script takes some time to run, as populating the Kafka topics is slow. Once it completes, it prints one timing per view, representing how long Materialize took to read a topic and arrive at the final results for that view.

More documentation on this script is forthcoming; in the meantime, run it with `-h` for the help text:

```
./demo/chbench/bin/snapshot_bench -h
```