
Drill Within the Big Data Ecosystem


Having worked as a developer in the Big Data space for a while now, I find that answers are emerging to the question: how does Drill fit into the Big Data ecosystem, and when would someone use Drill versus the many other excellent tools available? This note summarizes that (evolving) understanding.

Context

While there are many definitions of Big Data, a useful one here is simply the idea that an application has so much data that it goes beyond the capabilities of "traditional" technologies such as a relational database or data warehouse. Why this definition? An application developer just needs to get the job done. If Oracle (or Teradata or SQL Server, etc.) handled the task, just use that and get the app built.

The challenge is when the volume of data is simply too large for existing tools to handle (or handle in a way that fits the project budget). Most traditional tools were "scale-up": meaning the application needed ever larger machines, or expensive clusters of specialized hardware. Enter the idea of spreading the data across a large number of commodity machines, then writing simple tasks to process the data step-by-step.

Without a vendor to provide a proprietary storage format, data storage defaulted to simple, generic formats: CSV, JSON, and later others such as ORC and Parquet.

At first, Google and others hand-crafted the processing. Google quickly realized that most of the framework is generic and created Map-Reduce, which led to the open-source Hadoop. In short order, many DAG-based projects came about: Microsoft Dryad, Hadoop, Spark, Storm and many more.

The distributed DAG model has data transforms as nodes and (typically network) exchanges as edges. Writing a data transform is a specialized skill. Soon Hive came along to automate the process: submit HiveQL in one end and get a DAG out the other.

Old-time Hadoop used disk files as exchanges, which was convenient but slow. Dryad used network exchanges to keep data in memory, as did Storm and others.

The purpose here is not to provide a complete history of Big Data, but rather to provide the context for Drill. Apache Drill is a DAG-based, distributed execution engine with in-memory processing and network exchanges between nodes. Drill's defining feature is that Drill uses SQL to define the DAG: Drill is responsible for planning a DAG that efficiently implements a set of data transforms expressed as SQL.
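
To make that idea concrete, here is a minimal sketch (the file path and field names are hypothetical; dfs is Drill's stock file-system storage plugin). Drill's EXPLAIN PLAN FOR statement shows the DAG of scan, aggregation, and exchange operators that the planner derives from the SQL on the user's behalf:

```sql
-- Hypothetical path and field names; dfs is Drill's file-system plugin.
-- EXPLAIN PLAN FOR prints the plan Drill generates from the SQL: the DAG
-- of scans, aggregations, and exchanges that will actually execute.
EXPLAIN PLAN FOR
SELECT status, COUNT(*) AS hits
FROM dfs.`/data/logs/2017/07/01/server-log.json`
GROUP BY status;
```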

Understanding Drill within the Big Data Ecosystem

This definition of Drill helps us understand several ways that Drill differs from, or is like, the other members of the Big Data menagerie.

SQL-based Transforms

Drill transforms SQL into a DAG of data transforms. This means that Drill, not the user, defines the set of transform tasks. Unlike Spark, Storm, Apex and others, the user has no ability to code up an operator. Instead, the set of available operators is just those needed to execute a SQL statement.

As it turns out, SQL has a long history and a firm theoretical foundation; SQL can express a very large percentage of the operations that people perform on tabular (or tabular-like) data. Still, there are operations for which SQL is a poor fit. SQL is not ideal for machine-learning tasks (it has no way to express iteration, to build learning models, etc.) SQL is also not intended for stream processing (though it is great for working on such data once "parked" in storage.)

This fact is the first way to differentiate Drill. When a data transform task can be represented as SQL, Drill automates all the work required to execute the operation: just submit the query and get the results -- a huge savings relative to any tool that requires hand-created transforms or DAGs.

In-Memory and Network Exchanges

Unlike old-time Map-Reduce (or Hive), Drill does not use disk files to transfer data between exchanges. Instead, Drill buffers data in memory (in so-called "record batches") and transfers batches between nodes (actually, between clusters of nodes called "fragments") using the network.

The result is far faster processing than Map-Reduce or Hive can obtain. (Though Hive LLAP is beginning to offer network exchanges as well.) However, this speed-up comes at an important cost: Drill needs plenty of memory to hold in-flight batches. There is no free lunch: Drill allows an application to gain faster processing at the expense of more memory (relative to a tool which uses disk-based exchanges.)

Late-Schema, Many File Formats

As noted above, Big Data is a loose collection of tools. There is no single engine (such as Oracle, DB2, SQL Server, etc.) to define the on-disk format (typically highly optimized for that one engine.) Instead, Big Data offers the "Data Lake": a distributed storage system such as HDFS into which one can toss files. Without a reason to standardize on one format or another, Data Lakes often take the approach of "let 100 formats bloom": CSV, JSON, Parquet, ORC, Sequence File, you name it. And this does not include the very large number of specialized formats (Pcap, Web logs, binary formats and zillions of others.)

The lack of a standard, optimized format presents challenges to any Big Data tool. The most obvious is the need for parsers to transform some peculiar format into the record (or "row" or "tuple") format needed by a particular tool.

For example, Map-Reduce allows developers to write record readers that parse an arbitrary file into a standardized Hadoop key/value format. Drill supports "storage plugins" and "format plugins" to solve the same problem.
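
As a hedged illustration (the paths and column names are invented for the example), the format plugins of the stock dfs plugin let a single query read, and even join, files stored in different formats:

```sql
-- Hypothetical files and field names. The JSON and Parquet format plugins
-- each parse their file into rows; the join itself is ordinary SQL.
SELECT u.name, c.url
FROM dfs.`/data/users/users.json` u
JOIN dfs.`/data/clicks/2017/07/01/clicks.parquet` c
  ON u.user_id = c.user_id;
```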

Traditional databases spent significant time optimizing their on-disk formats to maximize throughput: structuring data into pages that directly map to disk blocks, adding indexes to increase lookup-speed, arranging data sequentially on disk to maximize read performance by minimizing disk seeks, and so on.

It should not come as a complete surprise that modern formats (JSON, Parquet), while good for many tasks, are not optimized for either disk or CPU performance. (Parquet, for example, compresses data which improves disk performance, but at the cost of greater CPU load to decompress the data.) Hence neither Drill, nor any other Big Data tool, can hope to achieve the same disk read performance that a dedicated, highly-tuned disk format can provide. (Perhaps a new project will come along to address this challenge...)

File Scans

Traditional databases provide indexes to speed access. Back in the day, when disks (and CPUs) were very slow, and result sets were small, a huge increase in performance was possible with the classic B-tree index. A single index format could handle key-based lookups along with range-based lookups, and would order the data as a bonus. Indexes are not free: they require that specialized disk format discussed above, and incur extra storage and update overhead.

In the Big Data world, with generic file formats, indexes have (thus far) been impractical. Yes, there are search engines (such as Solr or ElasticSearch), but these are not quite the same. TSDB provides the equivalent of an index over time series data while building on the key/value HBase store. Druid provides cube-based storage with index-like structures (and aggregation.) And so on.

Still, when using a generic Data Lake, there is no index mechanism available. Map-Reduce and others simply assumed that they would scan all files: searching the entire haystack looking for the needle. Drill inherits this same model: given a set of files, Drill must scan the entire set to find the data of interest. (This is called a "full table scan" in DB parlance.)

The full table scan nature of Drill is a very important consideration when using Drill. Yes, Drill provides SQL, just like a relational database. Relational databases provide fast (index-based) response. But, Drill has no indexes and thus uses table scans. Many people looking at Drill assume that somehow using SQL magically provides a speedup over table scans. It does not. Instead, a more apples-to-apples comparison would require that a user delete their Oracle (or MySQL or Postgres) indexes, forcing the engine to do a full table scan. Drill provides query performance on that order.

Drill does exploit a very common trick when working with file data: "partition" files according to some key, often date, resulting in a directory tree of the form logs/2017/07/01/22/server-log.csv. Drill will "prune" the set of candidate files if the query specifies a subset of directories. For the above example, we could restrict our query to only the logs for one specific day, say 2017-07-01.
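
Here is a hedged sketch of pruning in action (the /logs path refers to the hypothetical tree above). Drill exposes the directory levels as implicit columns dir0, dir1, dir2, and so on, and a filter on those columns lets the planner skip every directory outside the requested day:

```sql
-- dir0/dir1/dir2 map to the year/month/day directory levels under /logs.
-- The planner prunes all directories that fail the filter, so only the
-- files for 2017-07-01 are scanned.
SELECT *
FROM dfs.`/logs`
WHERE dir0 = '2017' AND dir1 = '07' AND dir2 = '01';
```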

While partitioning is very useful (and should be used whenever possible), it "indexes" only one dimension: time in this case. Suppose we wanted to find all accesses to our servers from some particular IP address. We'd need a full scan of all log files: perhaps 10s of TB of them.
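
Such a query might look something like the following sketch (client_ip is an assumed field name). Because no directory level encodes the client address, no pruning is possible and every file under /logs must be read:

```sql
-- client_ip is a hypothetical field; nothing in the directory layout
-- encodes it, so Drill must scan every log file to evaluate the filter.
SELECT *
FROM dfs.`/logs`
WHERE client_ip = '203.0.113.42';
```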

Of course, such a query is trivial with Drill, and this points out the proper way to think about Drill: it greatly reduces the human effort needed to query data, not necessarily the execution cost (a full table scan is needed whether we use Drill, Spark, Hive or any other tool.)

Aggregation

We humans have created Big Data, but we can't absorb it. None of us can make sense of billions of individual records about anything. We need summaries, trends, statistics -- preferably in the form of visualizations such as charts. In short, for human consumption, we often must aggregate the detailed data stored in our Data Lake.

An entire industry exists around data aggregation. Some tools have data formats optimized for aggregating data on the fly. Others "pre-aggregate" the data in the form of OLAP cubes. With pre-aggregation, the sums are done ahead of time, allowing the application to quickly read the sums themselves rather than the underlying detail data.

Aggregation in the Big Data world is, at present, the Wild West. The brute-force approach is to aggregate data as needed for each task: by running a Spark, Hive or Drill query. But, this approach incurs the table-scan costs mentioned above.

A number of projects have attempted to provide aggregation engines: Kylin, Druid, TSDB (for time series) and a number of commercial products (such as AtScale.)

One can even use a roll-your-own approach: use Drill (or Hive or Impala) to execute CREATE TABLE AS queries that summarize a chunk of data and write it to a new file. Later queries can explicitly target the summary files, providing much faster results than a query against the original detail data. If the summary files are compact enough, the results can even achieve query performance comparable to a purpose-built solution.
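
A hedged sketch of that approach (workspace and field names are illustrative; dfs.tmp is the writable workspace in Drill's default dfs configuration): a CREATE TABLE AS query materializes a compact summary, which later queries target instead of the raw detail data.

```sql
-- Summarize the raw logs once; field names are hypothetical.
CREATE TABLE dfs.tmp.`daily_ip_summary` AS
SELECT dir0 AS `year`, dir1 AS `month`, dir2 AS `day`,
       client_ip, COUNT(*) AS hits
FROM dfs.`/logs`
GROUP BY dir0, dir1, dir2, client_ip;

-- Later queries read only the compact summary files.
SELECT `day`, hits
FROM dfs.tmp.`daily_ip_summary`
WHERE client_ip = '203.0.113.42';
```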

Summary

The discussion above shows that Drill shines for Big Data tasks that map to SQL queries. Drill may not reduce execution cost relative to other tools (since that cost is driven, primarily, by the need to read and parse a very large amount of data in generic file formats), but it does vastly reduce the human time: from hours (using Map-Reduce) to seconds (using SQL.)

Drill is not well suited when an application needs quick response over detailed data across a variety of dimensions (due to the physics of the underlying storage, as discussed above.) Drill is very well suited to data partitioned along one dimension, with queries that use that same dimension.

Drill is ideal for on-the-fly aggregation, but at the cost of full table scans. Since Drill is a query engine (not a storage engine), Drill can also perform queries over aggregated data created using some other tool. In this way, Drill and an aggregation tool can provide quick response for dashboards and the like.

As with any Big Data tool, it falls upon the application developer to understand the capabilities of cluster hardware and Big Data tools. Choose the best tool (or combination of tools) needed for the job. Your Big Data application (like your home improvement project) will suffer from the wrong tool choice. Drill is not magic. However, it is a very handy tool to keep in your Big Data toolbox.
