feat: teach FSSTArray to compress the offsets of its codes #928

Draft: wants to merge 3 commits into base `develop`

Commits on Sep 25, 2024

  1. feat: teach DeltaArray slice and scalar_at

    More subtle than I expected.
    
    DeltaArray is a sequence of chunks. All chunks except the last must be "full", i.e. containing 1,024
    values. The last chunk may contain as few as one value and is encoded differently from the rest.
    
    In this PR, I introduced an "offset" and a "limit". Together they enable logical/lazy slicing while
    preserving full chunks for later decompression. The offset, always less than 1024, is the number of
    values to skip at the start of the first chunk. The limit is either `None` or less than 1024. `None`
    means no limit, which lets callers avoid computing the length of the last chunk [1]. Internally, the
    limit is converted to a "right-offset", `trailing_garbage`, which is sliced away when decompression
    happens.
    
    [1] Which is, a bit annoyingly, this:
    
    ```
    match deltas.len() % 1024 {
        0 => 1024,
        n => n
    }
    ```
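
    The offset/limit bookkeeping described above can be sketched roughly as follows. This is only an
    illustration under stated assumptions, not the code in this PR: `LogicalSlice`,
    `trailing_garbage_from_limit` and `plan_slice` are hypothetical names, and the limit is assumed to
    mean "number of values to keep in the last chunk".

    ```rust
    use std::ops::Range;

    /// Number of values in a full chunk, as stated in the commit message.
    const CHUNK: usize = 1024;

    /// Hypothetical descriptor of a lazy slice over whole chunks (not the PR's actual type).
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    struct LogicalSlice {
        /// Values to skip at the start of the first retained chunk (always < CHUNK).
        offset: usize,
        /// Values to drop from the end of the last retained chunk after decompression.
        trailing_garbage: usize,
    }

    /// Length of the last chunk, assuming at least one value is stored.
    fn last_chunk_len(physical_len: usize) -> usize {
        match physical_len % CHUNK {
            0 => CHUNK,
            n => n,
        }
    }

    /// Convert an optional limit into a right-offset.
    /// Assumption: `Some(limit)` keeps the first `limit` values of the last chunk.
    fn trailing_garbage_from_limit(physical_len: usize, limit: Option<usize>) -> usize {
        match limit {
            // No limit: nothing to trim, and the last chunk's length is never computed.
            None => 0,
            Some(limit) => {
                let last = last_chunk_len(physical_len);
                assert!(limit <= last, "limit exceeds the last chunk");
                last - limit
            }
        }
    }

    /// Plan a slice of the value range `start..stop` over `physical_len` stored values.
    /// Returns the chunk range to retain plus the offsets to apply after decompression.
    fn plan_slice(physical_len: usize, start: usize, stop: usize) -> (Range<usize>, LogicalSlice) {
        assert!(start < stop && stop <= physical_len);
        let first_chunk = start / CHUNK;
        // Exclusive end chunk index, rounded up so the chunk containing `stop - 1` is kept.
        let end_chunk = (stop + CHUNK - 1) / CHUNK;
        // Physical end of the retained chunks; the final chunk of the array may be short.
        let kept_end = (end_chunk * CHUNK).min(physical_len);
        let slice = LogicalSlice {
            offset: start % CHUNK,
            trailing_garbage: kept_end - stop,
        };
        (first_chunk..end_chunk, slice)
    }
    ```

    For example, with 2,500 stored values (chunks of 1024, 1024 and 452), slicing `1030..2100` under
    these assumptions retains chunks `1..3` untouched and records an offset of 6 and 400 values of
    trailing garbage.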
    danking committed Sep 25, 2024
    60d0441
  2. fix docs

    danking committed Sep 25, 2024
    2453b4e
  3. feat: teach FSSTArray to compress the offsets of its codes

    The codes of an FSSTArray are a vector of binary strings, each a sequence of one-byte codes or an
    escape code followed by a data byte. The offsets, unexpectedly, grow quite large, increasing the file
    size (for example, with this PR the TPC-H l_comment column is 78% of its byte size on `develop`).
    Delta encoding notably decreases the size but also inflates the compression time, seemingly in
    proportion to the space savings (TPC-H l_comment compresses in 111% of the time it takes on
    `develop`).
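
    To illustrate why delta encoding helps here, the sketch below uses plain (unchunked) delta
    encoding, which is a simplification and not the DeltaArray layout in this PR. The function names
    are hypothetical, and the offsets are assumed to be non-decreasing.

    ```rust
    /// Plain delta encoding: each output is the difference from the previous offset,
    /// with an implicit previous value of 0 for the first element.
    /// Assumes `offsets` is non-decreasing, as string offsets are.
    fn delta_encode(offsets: &[u64]) -> Vec<u64> {
        let mut prev = 0u64;
        offsets
            .iter()
            .map(|&o| {
                let delta = o - prev;
                prev = o;
                delta
            })
            .collect()
    }

    /// Inverse of `delta_encode`: a running sum of the deltas.
    fn delta_decode(deltas: &[u64]) -> Vec<u64> {
        deltas
            .iter()
            .scan(0u64, |acc, &d| {
                *acc += d;
                Some(*acc)
            })
            .collect()
    }

    /// Bits needed to represent the largest value in `values` (0 for an empty or all-zero slice).
    fn bits_needed(values: &[u64]) -> u32 {
        values
            .iter()
            .copied()
            .max()
            .map_or(0, |m| u64::BITS - m.leading_zeros())
    }
    ```

    The raw offsets grow with the total size of the codes buffer, while the deltas are bounded by the
    length of the longest single string, so a subsequent bit-packing step can use far fewer bits per
    value. The extra encode pass is also one plausible source of the added compression time.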
    danking committed Sep 25, 2024
    1e83368