Add indices and stats file format specification

Add a specification for a container file format to store indices and stats for Iceberg tables.
apache · Apr 6, 2022 · b8268f1
1 parent 4f8dd64
commit b8268f1
Showing 1 changed file with 119 additions and 0 deletions.
diff --git a/landing-page/content/common/index-and-statistics-format.md b/landing-page/content/common/index-and-statistics-format.md
@@ -0,0 +1,119 @@
+---
+url: index-and-statistics-format
+toc: false
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Index and statistics file format
+
+This is a specification for the Plain Format for Indices and Statistics, a file
+format designed to store information such as statistics about data managed in an
+Iceberg table that cannot be stored directly within the Iceberg manifest. A
+statistics file contains arbitrary pieces of information (here called "blobs"),
+along with metadata necessary to interpret them. The blobs supported by Iceberg
+are documented at [Blob types](#blob-types).
+
+## Format specification
+
+A file conforming to this specification should have the structure described
+below.
+
+### File structure
+
+The file has the following structure:
+
+```
+Magic Blob₁ Blob₂ ... Blobₙ Footer
+```
+
+where
+
+- `Magic` is four bytes 0x50, 0x46, 0x49, 0x53 (short for: Plain Format for
+  Indices and Statistics),
+- `Blobᵢ` is the i-th blob contained in the file, to be interpreted by the
+  application according to the footer,
+- `Footer` is defined below.
+
+### Footer structure
+
+The footer has the following structure:
+
+```
+Magic FooterPayload FooterPayloadSize Reserved Flags FileFormatVersion Magic
+```
+
+where
+
+- `Magic` is four bytes, the same as at the beginning of the file,
+- `FooterPayload` is an optionally LZ4-compressed, UTF-8 encoded JSON payload describing the
+  blobs in the file, with the structure described below,
+- `FooterPayloadSize` is the length in bytes of the `FooterPayload` (the compressed length, if
+  compression is used), stored as a 4-byte integer, little-endian,
+- `Reserved` is 4 bytes reserved for future use; currently it should be written as
+  0x00, 0x00, 0x00, 0x00,
+- `Flags` is a 4-byte integer, stored little-endian, holding boolean flags:
+  - bit 0 (lowest bit): whether `FooterPayload` is compressed,
+  - all other bits are reserved for future use and should be set to 0 on write,
+- `FileFormatVersion` is a number, stored as a 4-byte integer, little-endian.
+
+### Footer Payload
+
+The footer payload is an uncompressed or LZ4-compressed, UTF-8 encoded JSON payload representing
+a single `FileMetadata` object.
+
+#### FileMetadata
+
+`FileMetadata` has the following fields:
+
+
+| Field Name | Field Type                              | Required | Description |
+| ---------- | --------------------------------------- | -------- | ----------- |
+| blobs      | list of BlobMetadata objects            | yes      |             |
+| properties | JSON object with string property values | no       | storage for arbitrary meta-information, like writer identification/version |
+
+#### BlobMetadata
+
+`BlobMetadata` has the following fields:
+
+| Field Name        | Field Type        | Required | Description |
+| ----------------- | ----------------- | -------- | ----------- |
+| type              | JSON string       | yes      | See [Blob types](#blob-types). |
+| columns           | list of JSON long | yes      | List of column IDs the blob was computed for. |
+| offset            | JSON long         | yes      | The offset in the file where the blob contents start. Readers should assume the value can exceed 2^32. |
+| length            | JSON long         | yes      | The length of the blob as stored in the file. |
+| compression_codec | JSON string       | no       | See [Compression codecs](#compression-codecs). If omitted, the data is assumed to be uncompressed. |
+
+### Blob types
+
+A blob can be of one of the types listed below:
+
+| Blob type                      | Description |
+| ------------------------------ | ----------- |
+| ndv-long-little-endian         | 8-byte integer, stored little-endian, representing the number of distinct values. |
+| apache-datasketches-theta-v1   | A serialized form of a "compact" Theta sketch produced by the [Apache DataSketches](https://datasketches.apache.org/) library. |
+
+### Compression codecs
+
+Blob data may be stored uncompressed. If it is compressed, the codec should be one of
+the codecs listed below. For maximal interoperability, other codecs are not supported.
+
+| Codec name | Description                        |
+|------------|------------------------------------|
+| lz4        | Single LZ4 compression frame       |
+| zstd       | Single Zstandard compression frame |
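
The file and footer layout added in this diff can be exercised with a minimal Python sketch that writes a file and reads it back. It is an illustration under stated assumptions, not a reference implementation: the footer payload and the blobs are left uncompressed (flag bit 0 unset, no `compression_codec`), the 4-byte footer fields are treated as unsigned, and `FileFormatVersion` is written as 1, a value the spec text does not fix. The names `write_file`, `read_file`, `TAIL` and `ASSUMED_VERSION` are hypothetical.

```python
import json
import struct

MAGIC = bytes([0x50, 0x46, 0x49, 0x53])  # the four magic bytes, ASCII "PFIS"
# Fixed-size end of the footer: FooterPayloadSize, Reserved, Flags,
# FileFormatVersion, Magic (20 bytes, little-endian; unsigned is an assumption).
TAIL = struct.Struct("<I4sII4s")
ASSUMED_VERSION = 1  # assumption: the spec does not state a version number


def write_file(blobs):
    """blobs: list of (type, column_ids, payload_bytes). Returns the file bytes."""
    out = bytearray(MAGIC)
    metadata = []
    for blob_type, columns, payload in blobs:
        # offset/length describe where the blob bytes sit in the file
        metadata.append({"type": blob_type, "columns": columns,
                         "offset": len(out), "length": len(payload)})
        out += payload
    footer_payload = json.dumps({"blobs": metadata}).encode("utf-8")
    out += MAGIC + footer_payload
    out += TAIL.pack(len(footer_payload), b"\x00" * 4, 0, ASSUMED_VERSION, MAGIC)
    return bytes(out)


def read_file(data):
    """Returns (FileMetadata as a dict, list of raw blob bytes)."""
    if data[:4] != MAGIC:
        raise ValueError("missing leading magic")
    size, _reserved, flags, _version, magic = TAIL.unpack(data[-TAIL.size:])
    if magic != MAGIC:
        raise ValueError("missing trailing magic")
    if flags & 1:
        raise NotImplementedError("compressed footer payload not handled in this sketch")
    payload_start = len(data) - TAIL.size - size
    if data[payload_start - 4:payload_start] != MAGIC:
        raise ValueError("missing footer magic")
    meta = json.loads(data[payload_start:payload_start + size].decode("utf-8"))
    blobs = [data[b["offset"]:b["offset"] + b["length"]] for b in meta["blobs"]]
    return meta, blobs


# Round trip with a single ndv-long-little-endian blob for column ID 1.
file_bytes = write_file([("ndv-long-little-endian", [1], struct.pack("<q", 12345))])
meta, blobs = read_file(file_bytes)
print(meta["blobs"][0]["type"], struct.unpack("<q", blobs[0])[0])
```

Reading from the end of the file first is what makes the trailing `Magic` and `FooterPayloadSize` useful: a reader can locate the JSON payload without scanning the blobs, then seek directly to each blob using `offset` and `length`.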