Zone maps, or “queries go brrr”

Waiting sucks. Let's use zone maps to move fast and read less.

Zone maps, or “queries go brrr”
Ye olde zone map. Photograph by Tulane Public Relations. Retrieved from Wikipedia.

Waiting sucks, whether you're a trader eager for so-called “real-time” insights or a patient awaiting test results. In either case, a data system might be the bottleneck, waiting for records to stream in from storage. Zone maps reduce that delay by skipping irrelevant records. Each “zone” stores summary statistics (such as a minimum and maximum), so that the data system can skip reading records that fail to meet the query criteria. In this post, we’ll look at why zone maps matter and how they’re implemented in Vortex and Parquet.

A Simple Example of Min/Max Zone Maps

Zone maps record summary statistics about a specific region of a dataset. Typically, a “zone” is a contiguous set of records in a dataset. The simplest and most common kind of statistics used in a zone map are the minimum and maximum value of a particular column in that sequence of rows.

Consider, for example, a few rows from the familiar Iris dataset:

RowSepal lengthSepal widthPetal lengthPetal widthSpecies
15.13.51.40.2I. setosa
24.93.01.40.2I. setosa
34.73.21.30.2I. setosa
44.63.11.50.2I. setosa
55.03.61.40.3I. setosa
67.03.24.71.4I. versicolor
76.43.24.51.5I. versicolor
86.93.14.91.5I. versicolor
95.52.34.01.3I. versicolor
106.52.84.61.5I. versicolor
115.72.84.11.3I. versicolor
126.33.36.02.5I. virginica
135.82.75.11.9I. virginica
147.13.05.92.1I. virginica
156.32.95.61.8I. virginica
166.53.05.82.2I. virginica

The following is one possible zone map with five zones. Each zone stores a minima and maxima for three of the five columns.

ZoneRowsSepal lengthPetal lengthPetal width
11-3[4.9, 5.1][1.3, 1.4][0.2, 0.2]
24-6[4.6, 7.0][1.4, 4.7][0.2, 1.4]
37-10[5.5, 6.9][4.0, 4.9][1.3, 1.5]
411-13[5.7, 6.3][4.1, 6.0][1.3, 2.5]
514-16[6.3, 7.1][5.6, 5.9][1.8, 2.2]

Suppose we must evaluate this SQL query:

SELECT * from tbl WHERE "Petal width" < 1.0

Using only the zone map, we may conclude that every row from seven onward fails the WHERE clause because their petal widths are at least 1.3. We need not even load those last three zones from storage! We also know that every row in the first zone passes the WHERE clause and we know at least one row in the second zone passes.

This zone map allowed us to skip reading three out of five zones. Those zones comprise 5/8ths of the data. If bandwidth was our only bottleneck, our query might run over twice as fast!

However, not every query can leverage a zone map, for example, this query must scan our entire dataset:

SELECT * FROM tbl WHERE "Petal width" < 1.9

Moreover, not every column is suited to a zone map! Consider the following query which searches for samples with a small but reasonable range of Sepal length.

SELECT * FROM tbl where "Sepal length" >= 6.0 AND "Sepal length" <= 6.2 

Although no samples match this WHERE clause, our metadata table only allows us to eliminate zone 1 and zone 5 whose sepal length values are respectively fully below 6.0 or fully above 6.2.

The issue with the sepal length column is that it does not cluster with the ordering of the rows. In the context of zone maps, a numeric column is said to cluster when values from nearby rows are nearby on the number line. The petal width column demonstrates clustering. In particular, the rows of the Iris data happen to be ordered by their species which are, famously, fairly well distinguished by petal width and not so well distinguished by sepal length.

A demonstration of two columns: one which clusters with row index and one which does not.

In the top sub-figure, notice that points with similar row indices, tend to have similar y-coordinates. In contrast, in the bottom sub-figure, the range of y-coordinates is much larger, particularly after row five.

Timestamps are perhaps the most common example of a clustered column because logs are inserted roughly in timestamp-order. Order numbers can be a clustered column. In genetics & genomics, the gene and the genetic locus are often clustered columns.

Beyond Min/Max: Bloom Filters

While zone maps of minima and maxima are common, a zone map may store any kind of metadata that helps a query system to avoid reading data. For example, the count of null rows can accelerate IS (NOT) NULL queries. Some formats, such as Apache Parquet and ORC, store bloom filters, a kind of approximate membership query (AMQ) data structure. There are many good descriptions of Bloom filters, such as this one, so we won’t cover them again here; however, an AMQ data structure, such as a Bloom filter, can help where a minimum and maximum is too coarse.

Consider for example, a column of UUIDs. By design, a UUID column will not cluster because it is meant to be uniformly distributed. Instead, the minima and maxima will grow further and further apart as the zone grows in size. Such a zone map is of limited utility because very few values are not contained within that range.

Bloom filters are a solution to this problem. As long as a single zone has relatively few distinct values, the Bloom filter will compactly—albeit imprecisely—represent those distinct values. Equality and IN queries can instead query the Bloom filter and avoid reading any zone which the Bloom filter knows does not contain matching records.

Geospatial Metadata

We currently lack the expertise to describe it well, but there are even more exotic metadata in use in production. For example, authors from Google have described the use of geospatial metadata to accelerate queries about distances on the surface of Earth (see Section 5.3.10 of Edara & Pasumansky “Big Metadata” and Google’s S2 Geometry library). In general, any small summary that helps skip data can be included in a zone map.

Concrete Implementations

As far as this author can tell, the earliest discussion of zone maps was Moerkotte’s 1998 paper “Small Materialized Aggregates”. Since then, many systems have implemented them under a variety of names: ClickHouse “Data Skipping Indices”, Monet DB “correlated orderings”, IBM Netezza & Oracle “zone maps”, Postgres “Block Range Indices”, Parquet “page indices”, ORC “indexes”, Snowflake “micro-partitions”, Vertica “ROS containers”, and Google Capacitor’s “range constraints”, to name just eleven of the countless implementations.

Implementations also vary widely in how they size a zone: Oracle storage indices use 1 MB zones, Postgres blocks and Parquet pages are usually 8 KB, ORC stripes are 64 MiB and their row groups are 10,000 rows.

The Design and Implementation of Zone Maps

We recently implemented zone maps for Vortex, our new columnar file format. In preparation for doing so, we reviewed the extensive literature around zone maps and assessed how modern databases and hardware interact with the conclusions of that body of literature. Two major design goals crystalized for us:

  1. Reading a zone map from storage must not require any copies or deserializations.
  2. Zone maps need not be tied to physical blocks, thanks to slice-able lightweight compression.

The first goal is a bit old hat at this point. MessagePack, Cap’n Proto, Arrow, and FlatBuffers all emerged in the early twenty-teens, championing minimal-overhead data access: no decompression, no pointer swizzling. Modern storage systems are relatively high-bandwidth, so we can trade a big storage footprint for less CPU overhead. This is even more true for zone maps which lie on the critical path towards retrieving the data.

As for the second goal, zone maps of the last two decades were often used with block compression pages. The best compression ratio is achieved when the compressor has access to the most data, ideally the whole column. However, compressing the whole column into one page means reading any one row requires decompressing every row. Most formats expose a page size parameter and ask the user to choose a size that balances compression ratio against random access overhead.

Lightweight compression schemes like FSST, FastLanes, and ALP break this tension by delivering compression ratios competitive with block compressors (e.g. Zstd, Snappy) without compromising constant-time access to any one row. This allows us to tune the I/O block size to the physical characteristics of the underlying storage system while separately tuning the zone size to the logical characteristics of the data. We could choose the same zone size for every column to allow pruning one column by a predicate evaluated on the zone map of a different column. We could also choose variable and data-dependent zone sizes that minimize zone map size while maximizing the effectiveness of our metadata.

Zone maps in Vortex

Zone maps are stored in the footer of a Vortex file. In the best case, the initial read, which retrieves the file layout description, also includes the zone maps. Once read, no deserialization is necessary because Vortex pervasively uses zero-copy representations of both metadata and data.

The zones of a Vortex file with ~260,000 rows and two columns: one column of 32-bit integers and one of 50-character strings. The strings, being much larger, are split across many 1 MiB chunks whereas all 260,000 integers fit in one chunk.

By default, zones are blocks of 8,192 rows. As we mentioned above, zone blocks need not be aligned to compression block size which are instead tuned to the page size of the underlying storage system. A consequence of this decoupling is that zones and compression blocks arbitrarily overlap with one another. In columns with large values, such as column of images, a single zone might contain several compression blocks. A single block might even contain rows from two different zones. The situation is reversed for a 32-bit integer column where each block may contain tens of zones. In either case, our lightweight cascading compression allows efficient slicing, enabling us to read a row range without paying the cost of full decompression.

Zone Maps in Parquet

Until 2017, Parquet’s zone maps were scattered around the file, requiring multiple reads to load them and then Thrift deserializations to access them. The Parquet community knew this was a problem and addressed this issue by adding page indices to the spec. For each row group and column, all the page minima and maxima are stored together in the footer. This substantially improved upon the scattered situation, allowing the zone map for a column chunk to be read and deserialized as one object. Over time, the major Parquet implementations added support:

Six years later, the Parquet ecosystem has functional zone maps, but they remain entangled with physical structures in Parquet. Both pages and the row groups that contain them are chosen to meet byte-size targets. This is reasonable given that Parquet relies on block compression to achieve small file sizes.

Future Work

Dear reader, with what kinds of data do you work? Are there domain-relevant statistics that may accelerate queries against your data? Maybe a clever transformation like Google’s S2 cells or an approximate quantiles sketch? Share your insight with us in an issue and maybe contribute directly to the project!

Follow along with us as we develop a well-specified and backwards compatible file format and build out an extensive library for working with compressed arrays in-memory, on-disk, and over-the-wire. Star the Vortex GitHub repository and peruse the documentation, where you will find descriptions of the high-level concepts, the file format, and our integrations with the Python data science ecosystem.

Finally, subscribe to an email newsletter of this blog for a steady stream of both engineering practicalities as well as research digests on the design and implementation of column-oriented storage.


Annotated Bibliography

In this paper, Oracle describes an algorithm for selecting a set of columns on which to create zone maps in order to reduce the number of zones accessed by a set of queries.
Section 3.2 discusses ClickHouse’s zone maps, which are called “skipping indices”. They support three kinds of metadata: min-max, finite set, and bloom filters.
This is a fantastic paper. Section 4 describes how to work with zone maps weighing tens of terabytes. Section 5 describes, in depth, how to transform SQL predicates into predicates on zone maps that enable pruning of zones without matching rows. Sub-sub-section 5.3.10 describes the use of geospatial indices to prune queries which filter on intersecting geographies.
Section 6.1 describes Google’s Capacitor library which collects various smart data access techniques into a common library used by many of Google’s data systems including F1, MapReduce, and BigQuery. Capacitor provides, in the data access layer, zone map based pruning.