Skip to content

The Many Types of Vector Bundles

Paul Rogers edited this page Apr 7, 2017 · 3 revisions

Drill is a columnar query engine. The column, in the form of a value vector, is the unit of storage. Value vectors are generated from templates.

While value vectors are the core, queries work with rows: SQL is a query language for relations, defined as tables of rows. To represent rows, Drill must "bundle" vectors together. This requires a bit of explanation. In a row-based system, one can easily point out a single row:

...|field n][field 1|field 2|field 3|...|field n][field 1|...

The middle bit is a whole row (AKA record), surrounded on either side by the previous and next rows. In a columnar system, it is impossible to point to one thing and say, "AHA! There is a row!". Instead, rows are an emergent concept.

Let's start with a single vector. To keep things simple, let's take a required integer vector, something that is represented with a single buffer:

|  10 |
|  20 | <-- Second row
|  30 |
...
| 100 |

If our data set consisted of a single column, then the above represents a set of 10 rows. The arrow above points to the second row. We can also see that this is the second position in the array of integers that make up the vector.

Now, here is where things get interesting. Once we have two or more columns, we have a collection of vectors with no intrinsic relationship. One might be a vector of integers, another a vector of strings. In Drill, these are just vectors. We need a way to signal that the vectors are related. We do this by combining vectors into a "bundle" (my term). Vectors in a bundle observe a convention: that the data in each column position corresponds to the same row. So, if our rows are:

[ 10 | Fred     ]
[ 20 | Barney   ]
[ 30 | Wilma    ]

We end up with two columns, in a "bundle" with an implied correspondence between column positions:

|  10 |      | Fred     |
|  20 |      | Barney   |
|  30 |      | Wilma    |

It seems that vectors started as the primary concept in Drill, and "bundles" were added in ad-hoc manner as developers found it convenient to combine vectors. As a result, Drill has multiple bundle implementations.

Single-Column Bundles

Drill columns can be represented as a single value vector, or as a stack of value vectors (the so-called "hyper vectors"). This is a "vertical" bundle of vectors with one stacked on top of another:

|  10 |  \
|  20 |   > First vector
|  30 |  /
- - - -
|  40 |  \
|  50 |   > Second vector
|  60 |  /

A hyper vector occurs in operators such as the sort which accumulate a collection of record batches, then performs a single operation (sorting in this case) on all of them.

When working with single columns, the record count (really just value count) is held on the vector itself.

VectorWrapper

The VectorWrapper is the abstraction that handles one or more vertically-stacked vectors.

  • SimpleVectorWrapper<T extends ValueVector>: Holds a single value vector.
  • HyperVectorWrapper<T extends ValueVector>: Holds two or more vertically-stacked vectors.

Multiple-Column Bundles

The other form of vector bundle are "horizontal" collections. In this case, the vectors are heterogenous and each represents a separate column in the query. There is an assumed correspondence between vector indices as explained above.

Unlike, say, JDBC, columns in Drill are assumed to have no ordering. Each bundle is, essentially, an unordered collection of columns. Some bundles may provide an ordering, others do not. When no ordering exists, columns are still accessible by name, using a linear search across the unordered columns.

Bundled vectors continue to maintain their own value counts. But, in order for the horizontal bundle to represent a group of rows, it is necessary for all vectors to agree on the same count. Some bundles provide this unified count as a "record count".

Iterable<VectorWrapper<?>

The most basic representation of a vector bundle is as an iterator over the bundle. Or, specifically, an iterator over wrappers around vectors, since the columns themselves may be a hyper-vector (vertical stack of vectors.)

VectorAccessible

At the next level of abstraction is the vector accessible: a bundle of vectors that allows iterated access, provides a schema, a record count and optionally provides a selection vector.

Several dozen implementations of VectorAccessible exist, each with its own way to store a vector bundle.

VectorContainer

The VectorContainer class is the most common implementation of VectorAccessible. It provides an ordered collection of vectors, defined by the order that vectors are created in the container.

The container also provide a schema, which is an ordered collection of MaterializedField.

In general, the application first adds vectors, then builds the schema, to ensure that the fields listed in the schema match the order of the vectors.

RecordBatch

The term "record batch" in Drill is very confusing. Based on normal English, one would assume that a "record batch" is a collection of records. Given that records are shredded into vectors, a record batch would seem to be the same as what we are here calling a vector bundle. ("Vector bundle" takes a vertical view, "record batch" would take a horizontal view.)

However, the RecordBatch class is not just a collection of records, it is also an operator on those records. Thus, concrete classes such as the ProjectRecordBatch not only hold the batch, it does the projection operator on that batch.

As a result, there is no concept of a batch of rows in Drill separate from operations on that batch.

VectorAccessibleSerializable

A specialized form of the vector accessible used to serialize vectors to and from streams.

WritableBatch

The WritableBatch class is a remapping of a vector bundle into a form that can be serialized.

MapVector

The bundles described above all hold entire rows. Drill supports maps. For various reasons maps in Drill are actually more like nested rows. (Or, more accurately, both rows and maps are tuples: collections of columns with a shared schema.)

The MapVector class holds the set of vectors that make up a map. Unlike a VectorContainer vectors within a map are unordered: they reside in a name map rather than an ordered list. As a result, columns in a row can be accessed by position; columns in a map must be accessed by name. While access by name is reasonable for a map; Drill maps are not true maps; they are tuples, and so access by position should be possible.

Other Classes

A wide variety of other vector bundles exist. Each has its own purpose and semantics. Hopefully studying the above will provide insight into these other collections.

Clone this wiki locally