This document details how to run benchmarks against Materialize.
We use the mixed-workload CH-benCHmark to generate an interesting dataset for Materialize. This dataset is interesting for a couple of reasons:
- This workload tests our Debezium / Kafka / Avro support because the load generator does not write directly to Materialize but instead writes to MySQL or Postgres. As such, it tests our ability to ingest data in a scenario that many customers use.
- The analytics queries are non-trivial and exercise many features of Materialize. They also allow us to showcase how fast our queries are for analytics use cases.
Simply running chbench is not a good exercise of Materialize performance, as the benchmark is primarily bottlenecked by the upstream database. As such, there are chbench snapshots, currently generated by hand, that can be downloaded to test "replay performance".
Note: This step requires access to an S3 bucket in Materialize's account. Currently, only Materialize engineers have access to this bucket. Permissions are not required to generate or use snapshots, only to upload / download them. This bucket is currently accessible using the `dev` profile defined in our infra repository.
To download a snapshot, use the `demo/chbench/bin/download_snapshot` helper script. You will need to wrap the call to this script with `aws-vault exec dev`:
```
aws-vault exec dev -- ./demo/chbench/bin/download_snapshot -d /tmp/snapshots <SNAPSHOT_ID>
```
The last argument is the unique identifier for the snapshot. Because snapshots can be quite large, they are currently hand-managed and are simply assigned a unique number starting from 1.
A snapshot contains 4 types of files:
- `*.sql`: These are memoized values for named queries. If Materialize successfully ingests
all messages in the topics from a snapshot, then the results from `SELECT * FROM <view>`
should match `<view>.sql`.
- `*.arrow`: These are Apache Arrow encoded files that contain all of the messages required to
reconstruct a topic. Original message timestamps are preserved.
- `offsets.json`: A JSON file that maps each topic name to the number of messages in that topic. This information can be computed from the Arrow files, but it is much faster to read it from this file.
- `schemas.json`: A JSON file that contains the raw values of the JSON schemas so that we can
rebuild schema registry with the exact same schemas as when the snapshot was created.
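To sanity-check a downloaded snapshot before replaying it, the `offsets.json` file can be inspected directly. The sketch below is a hypothetical helper (not part of the repository) that assumes the format described above: a JSON object mapping topic name to message count.

```python
import json
from pathlib import Path


def topic_counts(snapshot_dir):
    """Load offsets.json from a snapshot directory.

    Assumes the format described above: a JSON object mapping
    topic name -> number of messages in that topic.
    """
    with open(Path(snapshot_dir) / "offsets.json") as f:
        return json.load(f)


def total_messages(counts):
    """Total number of messages across all topics in the snapshot."""
    return sum(counts.values())
```

For example, `total_messages(topic_counts("/tmp/snapshots/1"))` gives a rough sense of how much data a replay will push through Kafka.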
Once you have downloaded a snapshot, you can run the following command to measure Materialize ingest performance:
```
./demo/chbench/bin/snapshot_bench -s -d /tmp/snapshots/<SNAPSHOT_ID>
```
This script takes quite some time to run, as populating the Kafka topics is slow. Once it completes, it prints multiple timings, each representing how long Materialize took to read a topic and arrive at the final results for a view.
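The measurement described above boils down to: start the replay, then poll each view until its contents match the memoized results from the snapshot, recording the elapsed time. A minimal sketch of that polling loop (hypothetical, not the script's actual implementation):

```python
import time


def time_until(poll, expected, timeout=600.0, interval=1.0):
    """Repeatedly call poll() until it returns `expected`.

    Returns the elapsed time in seconds, or raises TimeoutError if
    the result never converges within `timeout` seconds.
    """
    start = time.monotonic()
    while True:
        if poll() == expected:
            return time.monotonic() - start
        if time.monotonic() - start > timeout:
            raise TimeoutError("view did not reach expected contents")
        time.sleep(interval)
```

In practice, `poll` would run `SELECT * FROM <view>` against Materialize, and `expected` would be the memoized rows from the corresponding `<view>.sql` file in the snapshot.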
More documentation on this script is coming; in the meantime, you can run the following for help text:
```
./demo/chbench/bin/snapshot_bench -h
```