Skip to content
Paul Rogers edited this page Mar 27, 2017 · 2 revisions

Drill veterans may be familiar with how maps work in Drill, but us novices can be surprised, so let’s explore how Drill maps actually work. This is an important prelude to a question I’ll ask in a follow-up e-mail.

According the Apache Drill videos, Drill was designed to follow the JSON standard. As we will learn below, that goal is, shall we say, more aspirational than operational: Drill does not actually support the full JSON data model.

JSON documents are of two forms, the (single) object:

{ name: “fred” }

Or the array:

[ { name: “fred”}, { name: “wilma” } ]

Shown here as an array of objects. Can also be an array of scalars.

Already here Drill does not follow the standard. Drill follows a harmless extension (that is nonetheless hotly debated on Slashdot that defines a Drill JSON document as a non-standard sequence of JSON maps:

{ name: “fred” }
{ name: “wilma” }

Notice the lack of commas between objects. (Also notice a harmless simplification in my examples, no quotes around field names.)

Drill is an SQL system and SQL is designed for relational (tabular) data. Drill makes assumptions about the JSON data. Each object in the input file maps to a row; object members map to columns:

{ id: 100, name: “fred” }
{ id: 200, name: “wilma” }

Becomes

id name
100 fred
200 wilma

JSON objects are maps, each is independent of another; they are just collections of name/value pairs. Here again, Drill differs from JSON; Drill assumes all objects are of the same row (record) type, and so the idea of multiple document types per JSON file does not fit well with Drill.

Since all objects are of the same type, Drill assumes each member could occur in every row so:

{ id: 100, name: “fred”, status: “past-due" }
{ id: 200, name: “wilma”, children: 2 }

Which gives this schema;

id name status children
100 fred past-due null
200 wilma null 2

That is, though in JSON every object is independent of every other object, Drill treats objects as representations of a common row structure, assumes all columns exist for all rows, and fill in null values when members are missing. Kind of makes sense in the above: different customers may have different attributes, but all attributes could apply to any customer.

This makes much less sense if the JSON file happened to be a collection of different document types:

{ docType: “customer”, name: “fred” }
{ docType: “order”, product: “stones” }
docType name product
customer fred null
order null stones

Here, an order does not have a name and customer does not have a product. Yet, Drill will create a combined row schema; distorting the original meaning of the JSON data. The above usage gives rise to lots of bugs. The “name” column for an order is not simply null, it is, actually, not present. When doing a CTAS to JSON, we should not write:

{ docType: “customer”, name: “fred”, product: null }
{ docType: “order”, name: null, product: “stones” }

So, Drill is not fully JSON compliant; rather Drill uses JSON to represent tabular data.

Let’s move onto nested objects (what Drill calls “maps”):

{ orderId: 100, customer: { name: “fred”, city: “bedrock” },
  item: {quantity: 100, product: “stones” } }

Here we have an order with two nested objects: one for customer, another for the (single) ordered item (arrays will be a separate topic; that’s where we’re headed.)

Again, in general, objects are independent collections of name/value pairs, but Drill works with rows. This is true for nested objects (maps) as well.

Let’s introduce a new term: “tuple”. This is a relational-theory term that is a fancy term for what we usually call a row (or record) a collection of items each of which is an ordered list of values, along with a schema for those values. In JSON, we have to repeat the name for each member; in a tuple-based system, we specify the schema once, then just list the values, as we showed in the table examples.

In Drill, each row is a tuple. But, in Drill, each so-called “map” is also a tuple. This means that, in Drill, maps are just nested tuples. If column X is a map, then all rows must have the same set of map members. That is, column X is a nested tuple.

We can describe the nested members using the usual dotted-name notation: “customer.name”, “item.quantity”. So, for the example above, we really have two nested tuples. We can show the resulting table as:

orderId customer.name customer.city item.quantity item.stones
100 fred bedrock 100 stones

This is the advantage of Drill’s idea of a map: it is a nested tuple and can be “flattened” into a simple row structure. (Note, however, that Drill code is conflicted on this point. The code implements maps as described above, but pretends it implements real maps, confusing all who wander into the code…)

Drill does not support using maps (nested tuples) in filters, in sort keys, and so on. Why not? Because nested tuples are not really objects, they are just a convenient way of grouping other columns. Drill should allow direct references to the scalar values within a nested tuple. There is nothing wrong with “item.quantity > 200”, say. But, we keep confusing ourselves about the difference between a “map” (nested tuple) and a scalar member of that map. The scalar member is no different from a top-level column, except that it has a dotted name.

And, of course, we can nest to any level:

{ orderId: 100, cust: { name: { first: “fred”, last: “flintstone” }, addr: { city: “bedrock” } },
  item: {quan: 100, prod: { name: “stones”, size: “medium" } } }

To give:

orderId cust.name.first cust.name.last addr.city item.quan item.prod.name item.prod.size
100 fred flintsone bedrock 100 stones medium

All of this makes perfect sense. But, it can be a surprise to someone that sees the term “map” because a map is normally a collection of name/value pairs and each map instance is unrelated to others.

In fact, other products use different names to be clearer. Impala and Hive provide both a Map and a Struct type. A Hive Struct is similar to a Drill Map: both are tuples. Drill has no equivalent of the Hive Map type (that is, Drill does not support true maps, despite the name of our Map data type.)

In short:

Drill Impala/Hive JSON/Javascript Java
Map Struct N/A Class
N/A Map Object Map

Note that a Drill Map has no corresponding type in JSON; and the JSON object type has no equivalent in Drill. Instead, we insist that JSON objects are defined by the file creator so that they can be treated as if they were structures. Because the JSON objects are structures, they can be flattened into a plain old relational tuple (albeit with dotted column names.) Again, this leads to all manner of bugs when people expect that Drill maps are “real” maps.

Now, it is not a bad thing to use JSON to represent structures. In fact, there is a standard for doing so and specifying (in JSON) the schema of that structure. This approach started with the Space Telescope group no less.

As noted earlier, even though Drill maps ended up being implemented as nested tuples (Hive structs), Drill code does not take advantage of this fact. In Drill, “orderID” and “item.quan” are far different constructs, but need not be. Drill cannot simply reference “cust.name.first”, it must go through costly projections that involve copying data.

A good debate for another time: given that a Drill map is a tuple, should we just go ahead and take advantage of that fact to improve performance and simplify code? And, given that Drill does not have a true map type, should we add that in some future release? (Yes, we will have fun with type naming since Map = Struct, ? = Map ?)

Take away: the Drill videos lead one to believe that Drill does, in fact, handle JSON objects as maps. But, actual implementation does just what the video claims we don’t do: it flattens JSON object into (nested) tuples. This forces mapping conventions on the JSON data we read: JSON objects must correspond to tuples.

Clone this wiki locally