diff --git a/landing-page/content/common/index-and-statistics-format.md b/landing-page/content/common/index-and-statistics-format.md
new file mode 100644
index 000000000..486bc34b5
--- /dev/null
+++ b/landing-page/content/common/index-and-statistics-format.md
@@ -0,0 +1,144 @@
+---
+url: puffin
+toc: false
+---
+
+
+# Puffin file format
+
+This is a specification for Puffin, a file format designed to store
+information such as indexes and statistics about data managed in an
+Iceberg table that cannot be stored directly within the Iceberg manifest. A
+Puffin file contains arbitrary pieces of information (here called "blobs"),
+along with metadata necessary to interpret them. The blobs supported by Iceberg
+are documented at [Blob types](#blob-types).
+
+## Format specification
+
+A file conforming to the Puffin file format specification should have the
+structure described below.
+
+### Versions
+
+Currently, there is a single version of the Puffin file format, described below.
+
+### File structure
+
+A Puffin file has the following structure:
+
+```
+Magic Blob₁ Blob₂ ... Blobₙ Footer
+```
+
+where
+
+- `Magic` is four bytes 0x50, 0x46, 0x41, 0x31 (short for: Puffin _Fratercula
+  arctica_, version 1),
+- `Blobᵢ` is the i-th blob contained in the file, to be interpreted by the
+  application according to the footer,
+- `Footer` is defined below.
+
+### Footer structure
+
+The footer has the following structure:
+
+```
+Magic FooterPayload FooterPayloadSize Flags Magic
+```
+
+where
+
+- `Magic`: four bytes, same as at the beginning of the file,
+- `FooterPayload`: optionally compressed, UTF-8 encoded JSON payload describing the
+  blobs in the file, with the structure described below,
+- `FooterPayloadSize`: the length in bytes of the `FooterPayload` as stored
+  (that is, after compression, if compressed), stored as a 4-byte integer,
+- `Flags`: 4 bytes for boolean flags
+  - byte 0 (first)
+    - bit 0 (lowest bit): whether `FooterPayload` is compressed
+    - all other bits are reserved for future use and should be set to 0 on write
+  - all other bytes are reserved for future use and should be set to 0 on write
+
+A 4-byte integer is always signed, in two's complement representation, stored
+little-endian.
+
+### Footer Payload
+
+The footer payload is a UTF-8 encoded JSON document representing a single
+`FileMetadata` object. It is stored either uncompressed or LZ4-compressed (as a single
+[LZ4 compression frame](https://github.com/lz4/lz4/blob/77d1b93f72628af7bbde0243b4bba9205c3138d9/doc/lz4_Frame_format.md)
+with content size present).
+
+#### FileMetadata
+
+`FileMetadata` has the following fields:
+
+
+| Field Name | Field Type                              | Required | Description |
+| ---------- | --------------------------------------- | -------- | ----------- |
+| blobs      | list of BlobMetadata objects            | yes      | |
+| properties | JSON object with string property values | no       | Storage for arbitrary meta-information, like writer identification/version. See [Common properties](#common-properties) for properties that are recommended to be set by a writer. |
+
+#### BlobMetadata
+
+`BlobMetadata` has the following fields:
+
+| Field Name        | Field Type        | Required | Description |
+|-------------------|-------------------| -------- | ----------- |
+| type              | JSON string       | yes      | See [Blob types](#blob-types) |
+| fields            | list of JSON long | yes      | List of field IDs the blob was computed for; the order of items is used to compute sketches stored in the blob. |
+| offset            | JSON long         | yes      | The offset in the file where the blob contents start |
+| length            | JSON long         | yes      | The length of the blob as stored in the file (after compression, if compressed) |
+| compression-codec | JSON string       | no       | See [Compression codecs](#compression-codecs). If omitted, the data is assumed to be uncompressed. |
+
+### Blob types
+
+A blob must be of one of the types listed below.
+
+#### `ndv-long-little-endian` blob type
+
+An 8-byte unsigned integer, stored little-endian, representing the number of
+distinct values of a single field.
+
+#### `apache-datasketches-theta-v1` blob type
+
+A serialized form of a "compact" Theta sketch produced by the [Apache
+DataSketches](https://datasketches.apache.org/) library. The sketch is obtained by
+constructing an Alpha family sketch with the default seed and feeding it individual
+distinct values converted to bytes using Iceberg's single-value serialization.
+
+### Compression codecs
+
+Blob data may be stored uncompressed. If it is compressed, the codec should be one of
+the codecs listed below. For maximal interoperability, other codecs are not supported.
+
+| Codec name | Description |
+|------------|-------------|
+| lz4        | Single [LZ4 compression frame](https://github.com/lz4/lz4/blob/77d1b93f72628af7bbde0243b4bba9205c3138d9/doc/lz4_Frame_format.md), with content size present |
+| zstd       | Single [Zstandard compression frame](https://github.com/facebook/zstd/blob/8af64f41161f6c2e0ba842006fe238c664a6a437/doc/zstd_compression_format.md#zstandard-frames), with content size present |
+
+### Common properties
+
+When writing a Puffin file, it is recommended to set the following properties in the
+`properties` field of [FileMetadata](#filemetadata).
+
+- `created-by` - human-readable identification of the application writing the file,
+  along with its version. Example: "Trino version 381".
+- `source-snapshot-id` - ID of the table snapshot that was used to calculate the blob contents
+- `source-sequence-number` - sequence number of the table snapshot that was used to calculate the blob contents
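
The layout above can be exercised end to end with a small round trip. The following is a minimal sketch, not a reference implementation: the helper names `write_puffin`, `read_footer`, and `read_blob` and the `created-by` value are hypothetical, and the sketch assumes an uncompressed footer and uncompressed blobs (LZ4 and Zstandard handling is deliberately left out). It writes one `ndv-long-little-endian` blob, locates the footer by reading backwards from the end of the file, and reads the blob back using its `offset` and `length`.

```python
import json
import struct

MAGIC = bytes([0x50, 0x46, 0x41, 0x31])  # the four magic bytes, ASCII "PFA1"
FLAG_FOOTER_COMPRESSED = 0x01            # Flags byte 0, bit 0


def write_puffin(blobs, properties=None):
    """Hypothetical writer: `blobs` is a list of (type, fields, payload) tuples."""
    out = bytearray(MAGIC)
    blob_metadata = []
    for blob_type, fields, payload in blobs:
        blob_metadata.append({
            "type": blob_type,
            "fields": fields,
            "offset": len(out),      # offset from the start of the file
            "length": len(payload),  # length as stored in the file
        })
        out += payload
    file_metadata = {"blobs": blob_metadata}
    if properties:
        file_metadata["properties"] = properties
    footer_payload = json.dumps(file_metadata).encode("utf-8")
    out += MAGIC
    out += footer_payload
    out += struct.pack("<i", len(footer_payload))  # signed, little-endian
    out += bytes(4)                                # Flags: all zero, footer uncompressed
    out += MAGIC
    return bytes(out)


def read_footer(data):
    """Hypothetical reader: returns the FileMetadata object as a dict."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing Puffin magic")
    flags = data[-8:-4]
    (payload_size,) = struct.unpack("<i", data[-12:-8])
    if flags[0] & FLAG_FOOTER_COMPRESSED:
        raise NotImplementedError("compressed footer is not handled in this sketch")
    payload_end = len(data) - 12
    payload_start = payload_end - payload_size
    # The footer payload must leave room for the leading magic and the footer magic.
    if payload_size < 0 or payload_start < 8:
        raise ValueError("invalid footer payload size")
    if data[payload_start - 4:payload_start] != MAGIC:
        raise ValueError("missing footer magic")
    return json.loads(data[payload_start:payload_end].decode("utf-8"))


def read_blob(data, blob_metadata):
    """Return the stored bytes of one blob (uncompressed blobs only)."""
    if "compression-codec" in blob_metadata:
        raise NotImplementedError("compressed blobs are not handled in this sketch")
    start = blob_metadata["offset"]
    return data[start:start + blob_metadata["length"]]


# Round trip: one NDV blob for field ID 1 with a distinct-value count of 12345.
puffin_bytes = write_puffin(
    [("ndv-long-little-endian", [1], struct.pack("<Q", 12345))],
    {"created-by": "example-writer version 0"},
)
metadata = read_footer(puffin_bytes)
(ndv,) = struct.unpack("<Q", read_blob(puffin_bytes, metadata["blobs"][0]))
```

Reading from the end of the file means a reader needs only the trailing 12 bytes (`FooterPayloadSize`, `Flags`, `Magic`) to find the footer payload, and can then fetch individual blobs by `offset` without scanning the rest of the file.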