
Revisit Vector Accessors

Paul Rogers edited this page May 3, 2017 · 3 revisions

Drill has a complex stack of code used to write to, and read from, value vectors:

  • Application code, which keeps track of the current row position and the number of rows (when writing), and does type-specific access using the Accessor or Mutator class of each vector: {-|Nullable|Repeated}{type}Vector. Alternatively, some code (Parquet, ScanBatch) writes through an accessor from the next layer.
  • Accessors (generated in the vector.complex package) that write to, or read from, vectors, also in a type-specific way: {-|Nullable|Repeated}{type}Writer and {-|Nullable|Repeated}{type}Reader.
  • Accessor and Mutator classes on each value vector that provide type-specific access to data.
  • Various flavors of get/set methods on DrillBuf, often for primitive or byte[] arguments.
  • Repeated versions of (mostly) the same methods on the UnsafeDirectLittleEndian (UDLE) class.
  • Repeated versions of (mostly) the same methods on Netty's pooled or unpooled byte buffers.
  • Netty's PlatformDependent methods that repeat the primitive get/set methods to work with memory addresses.
  • Java's Unsafe class, which implements the above methods.

Each layer has a set (not always the same) of get/set methods.

Issues

A basic rule of performance engineering is to optimize inner loops. In Drill, the inner loop often includes reading or writing a specific column value. Each access must descend through the six or seven layers identified above.

Further, each layer has roughly the same (but slightly different) versions of the same get/set methods. The result is a very large footprint of code to be kept consistent. For each vector, for each primitive data type, multiple copies of get/set methods are needed.

Goals

The goal of this exercise is to determine whether we can optimize the accessor stack to improve performance and reduce code complexity. Indeed, other pages on this site found that we can double performance by eliminating most of the access layers.

Concepts

The above stack is redesigned to separate concerns:

  • Netty ByteBuf (Drill DrillBuf) holds data.
  • A set of encoders maps primitive types to/from Drill's little-endian (LE) storage format, given only an address, offset and data value.
  • Accessors (readers, writers) provide a type-independent API to read and write vectors. Generated implementations map the generic API to specific vector types.
  • Application code works with the readers and writers.
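The encoder idea can be sketched roughly as follows. This is a minimal illustration, not Drill code: the class name IntEncoder and its method signatures are hypothetical, and a java.nio.ByteBuffer stands in for the raw memory address that a real encoder would receive.

```java
import java.nio.ByteBuffer;

// Hypothetical encoder sketch; IntEncoder is an illustrative name, not a Drill class.
// The ByteBuffer stands in for a raw memory address.
final class IntEncoder {
    private IntEncoder() {}

    // Stores a 4-byte int in little-endian order at (base + offset),
    // least significant byte first, regardless of the buffer's own byte order.
    static void write(ByteBuffer buf, int base, int offset, int value) {
        int pos = base + offset;
        buf.put(pos, (byte) value);
        buf.put(pos + 1, (byte) (value >>> 8));
        buf.put(pos + 2, (byte) (value >>> 16));
        buf.put(pos + 3, (byte) (value >>> 24));
    }

    // Reassembles the int from its four little-endian bytes.
    static int read(ByteBuffer buf, int base, int offset) {
        int pos = base + offset;
        return (buf.get(pos) & 0xFF)
             | (buf.get(pos + 1) & 0xFF) << 8
             | (buf.get(pos + 2) & 0xFF) << 16
             | (buf.get(pos + 3) & 0xFF) << 24;
    }
}
```

The encoder holds no state and knows nothing about vectors, row positions or nullability; those concerns stay in the accessor that calls it.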

The call stack to access a value now is:

  • Application
  • Accessor
  • Encoder
  • PlatformDependent

In practice, even the encoder can probably be skipped for most simple types; perhaps it is needed only to encode decimal and period types.

The (column) accessor is defined as an interface, which is all the application needs to see. See the recently added ColumnReader and ColumnWriter classes for a prototype. Since the interface itself is vector-type neutral, the class hierarchy can be used to represent implementations rather than data types. For example:

  • ColumnWriter
    • VectorColumnWriter -- "classic" writers
      • IntVectorColumnWriter
      • NullableIntVectorColumnWriter
    • DirectColumnWriter -- "streamlined" writers for direct memory
      • DirectIntColumnWriter
      • DirectNullableIntColumnWriter
    • HeapColumnWriter -- hypothetical implementation using heap buffers
      • HeapIntColumnWriter
      • HeapNullableIntColumnWriter

Here, we have three different storage formats (classic vectors, direct memory and heap buffers), each with implementations for each vector type. As with the original ByteBuf(fer) concepts, a single interface works for all implementations. (Methods that don't apply to a given vector type just throw exceptions.)
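A rough sketch of how such a hierarchy might look is shown here. The method names (setInt, setString), the AbstractColumnWriter base class and the valueAt helper are assumptions made for illustration; the prototype ColumnWriter in Drill may be shaped differently. A java.nio.ByteBuffer and an int[] stand in for direct and heap storage, and each writer tracks its own row position to keep the example short.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical, simplified interface; not the actual Drill ColumnWriter.
interface ColumnWriter {
    void setInt(int value);
    void setString(String value);
}

// Default behavior: methods that don't apply to a vector type just throw.
abstract class AbstractColumnWriter implements ColumnWriter {
    @Override public void setInt(int value) {
        throw new UnsupportedOperationException("setInt");
    }
    @Override public void setString(String value) {
        throw new UnsupportedOperationException("setString");
    }
}

// "Streamlined" writer: little-endian ints written straight to direct memory.
final class DirectIntColumnWriter extends AbstractColumnWriter {
    private final ByteBuffer buf;
    private int rowIndex;

    DirectIntColumnWriter(int rowCount) {
        buf = ByteBuffer.allocateDirect(rowCount * Integer.BYTES)
                        .order(ByteOrder.LITTLE_ENDIAN);
    }
    @Override public void setInt(int value) {
        buf.putInt(rowIndex++ * Integer.BYTES, value);
    }
    // Illustrative read-back helper for inspection only.
    int valueAt(int row) {
        return buf.getInt(row * Integer.BYTES);
    }
}

// Hypothetical heap-based writer behind the same interface.
final class HeapIntColumnWriter extends AbstractColumnWriter {
    private final int[] values;
    private int rowIndex;

    HeapIntColumnWriter(int rowCount) {
        values = new int[rowCount];
    }
    @Override public void setInt(int value) {
        values[rowIndex++] = value;
    }
    int valueAt(int row) {
        return values[row];
    }
}
```

Application code holds only a ColumnWriter reference, so swapping DirectIntColumnWriter for HeapIntColumnWriter changes the storage format without touching the caller. Calling setString on either int writer throws UnsupportedOperationException.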
