Drill Maps

Drill veterans may be familiar with how maps work in Drill, but us novices can be surprised, so let’s explore how Drill maps actually work. This is an important prelude to a question I’ll ask in a follow-up e-mail.

According the Apache Drill videos, Drill was designed to follow the JSON standard. As we will learn below, that goal is, shall we say, more aspirational than operational: Drill does not actually support the full JSON data model.

JSON documents are of two forms, the (single) object:

{ name: “fred” }

Or the array:

[ { name: “fred”}, { name: “wilma” } ]

Shown here as an array of objects. Can also be an array of scalars.

Already here Drill does not follow the standard. Drill follows a harmless extension (that is nonetheless hotly debated on Slashdot that defines a Drill JSON document as a non-standard sequence of JSON maps:

{ name: “fred” }
{ name: “wilma” }

Notice the lack of commas between objects. (Also notice a harmless simplification in my examples, no quotes around field names.)

Drill is an SQL system and SQL is designed for relational (tabular) data. Drill makes assumptions about the JSON data. Each object in the input file maps to a row; object members map to columns:

{ id: 100, name: “fred” }
{ id: 200, name: “wilma” }

Becomes

id	name
100	fred
200	wilma

JSON objects are maps, each is independent of another; they are just collections of name/value pairs. Here again, Drill differs from JSON; Drill assumes all objects are of the same row (record) type, and so the idea of multiple document types per JSON file does not fit well with Drill.

Since all objects are of the same type, Drill assumes each member could occur in every row so:

{ id: 100, name: “fred”, status: “past-due" }
{ id: 200, name: “wilma”, children: 2 }

Which gives this schema;

id	name	status	children
100	fred	past-due	null
200	wilma	null	2

That is, though in JSON every object is independent of every other object, Drill treats objects as representations of a common row structure, assumes all columns exist for all rows, and fill in null values when members are missing. Kind of makes sense in the above: different customers may have different attributes, but all attributes could apply to any customer.

This makes much less sense if the JSON file happened to be a collection of different document types:

{ docType: “customer”, name: “fred” }
{ docType: “order”, product: “stones” }

docType	name	product
customer	fred	null
order	null	stones

Here, an order does not have a name and customer does not have a product. Yet, Drill will create a combined row schema; distorting the original meaning of the JSON data. The above usage gives rise to lots of bugs. The “name” column for an order is not simply null, it is, actually, not present. When doing a CTAS to JSON, we should not write:

{ docType: “customer”, name: “fred”, product: null }
{ docType: “order”, name: null, product: “stones” }

So, Drill is not fully JSON compliant; rather Drill uses JSON to represent tabular data.

Let’s move onto nested objects (what Drill calls “maps”):

{ orderId: 100, customer: { name: “fred”, city: “bedrock” },
  item: {quantity: 100, product: “stones” } }

Here we have an order with two nested objects: one for customer, another for the (single) ordered item (arrays will be a separate topic; that’s where we’re headed.)

Again, in general, objects are independent collections of name/value pairs, but Drill works with rows. This is true for nested objects (maps) as well.

Let’s introduce a new term: “tuple”. This is a relational-theory term that is a fancy term for what we usually call a row (or record) a collection of items each of which is an ordered list of values, along with a schema for those values. In JSON, we have to repeat the name for each member; in a tuple-based system, we specify the schema once, then just list the values, as we showed in the table examples.

In Drill, each row is a tuple. But, in Drill, each so-called “map” is also a tuple. This means that, in Drill, maps are just nested tuples. If column X is a map, then all rows must have the same set of map members. That is, column X is a nested tuple.

We can describe the nested members using the usual dotted-name notation: “customer.name”, “item.quantity”. So, for the example above, we really have two nested tuples. We can show the resulting table as:

orderId	customer.name	customer.city	item.quantity	item.stones
100	fred	bedrock	100	stones

This is the advantage of Drill’s idea of a map: it is a nested tuple and can be “flattened” into a simple row structure. (Note, however, that Drill code is conflicted on this point. The code implements maps as described above, but pretends it implements real maps, confusing all who wander into the code…)

Drill does not support using maps (nested tuples) in filters, in sort keys, and so on. Why not? Because nested tuples are not really objects, they are just a convenient way of grouping other columns. Drill should allow direct references to the scalar values within a nested tuple. There is nothing wrong with “item.quantity > 200”, say. But, we keep confusing ourselves about the difference between a “map” (nested tuple) and a scalar member of that map. The scalar member is no different from a top-level column, except that it has a dotted name.

And, of course, we can nest to any level:

{ orderId: 100, cust: { name: { first: “fred”, last: “flintstone” }, addr: { city: “bedrock” } },
  item: {quan: 100, prod: { name: “stones”, size: “medium" } } }

To give:

orderId	cust.name.first	cust.name.last	addr.city	item.quan	item.prod.name	item.prod.size
100	fred	flintsone	bedrock	100	stones	medium

All of this makes perfect sense. But, it can be a surprise to someone that sees the term “map” because a map is normally a collection of name/value pairs and each map instance is unrelated to others.

In fact, other products use different names to be clearer. Impala and Hive provide both a Map and a Struct type. A Hive Struct is similar to a Drill Map: both are tuples. Drill has no equivalent of the Hive Map type (that is, Drill does not support true maps, despite the name of our Map data type.)

In short:

Drill	Impala/Hive	JSON/Javascript	Java
Map	Struct	N/A	Class
N/A	Map	Object	Map

Note that a Drill Map has no corresponding type in JSON; and the JSON object type has no equivalent in Drill. Instead, we insist that JSON objects are defined by the file creator so that they can be treated as if they were structures. Because the JSON objects are structures, they can be flattened into a plain old relational tuple (albeit with dotted column names.) Again, this leads to all manner of bugs when people expect that Drill maps are “real” maps.

Now, it is not a bad thing to use JSON to represent structures. In fact, there is a standard for doing so and specifying (in JSON) the schema of that structure. This approach started with the Space Telescope group no less.

As noted earlier, even though Drill maps ended up being implemented as nested tuples (Hive structs), Drill code does not take advantage of this fact. In Drill, “orderID” and “item.quan” are far different constructs, but need not be. Drill cannot simply reference “cust.name.first”, it must go through costly projections that involve copying data.

A good debate for another time: given that a Drill map is a tuple, should we just go ahead and take advantage of that fact to improve performance and simplify code? And, given that Drill does not have a true map type, should we add that in some future release? (Yes, we will have fun with type naming since Map = Struct, ? = Map ?)

Take away: the Drill videos lead one to believe that Drill does, in fact, handle JSON objects as maps. But, actual implementation does just what the video claims we don’t do: it flattens JSON object into (nested) tuples. This forces mapping conventions on the JSON data we read: JSON objects must correspond to tuples.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drill Maps

Clone this wiki locally