What if we just didn’t decompress it?

How push-down compute helps Vortex run faster and decompress less


All columnar file formats support the idea of projection push-down. That is, only reading columns that are requested by the reader or are required to evaluate any filter predicates.

Most columnar file formats support the idea of predicate push-down. That is, using Zone Maps to prune large chunks of rows whose statistics provably fail the predicate.
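As a rough illustration of what predicate push-down buys you, zone-map pruning only needs per-chunk min/max statistics. The sketch below is plain Python with a hypothetical Zone record, not any real file-format API.

from dataclasses import dataclass

@dataclass
class Zone:
    min: float  # smallest value in this chunk of rows
    max: float  # largest value in this chunk of rows

def can_prune_lt(zone: Zone, literal: float) -> bool:
    # For `a < literal`, the whole chunk can be skipped when even its
    # smallest value fails the comparison.
    return zone.min >= literal

# A zone with min=12.0 can never satisfy `a < 10.0`, so it is never read.
assert can_prune_lt(Zone(min=12.0, max=80.0), 10.0)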

Vortex is unique in the way it evaluates filter and projection expressions by supporting full compute push-down, in many cases avoiding decompression entirely.

Scalar Compute

Most compute functions in Vortex are “scalar”, meaning they operate element-wise, and are very often invoked with a literal value on the right-hand side. For example, a < 10 is much more common than a < b where both sides are arrays.

🤓
Examples of non-scalar compute functions are sum, min, max, binary_search, etc. These functions can also benefit from push-down but, in a tradition harking back to university, I will largely leave this as an exercise for the reader; a small taste follows below.
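To give that taste, here is how two of those non-scalar functions might collapse over the constant arrays introduced below. This is the same illustrative pseudocode style as the rest of this post, not the real Vortex API.

def min(array: ConstantArray) -> Scalar:
    # Every element is the same value, so the minimum is that value.
    return array.scalar()

def sum(array: ConstantArray) -> Scalar:
    # n copies of one value sum to value * n, with no per-element work.
    return array.scalar().multiply(Scalar(len(array)))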

It is not too hard to see how scalar compute functions can be pushed down through simple lightweight compression schemes.

Constant Encoding

Let’s start with the simplest compression codec of them all, constant. If all values of an array are the same, Vortex collapses the data into a ConstantArray holding only a Scalar and a len.

In this case, we can perform all scalar compute in constant time, with the result also being a ConstantArray.

def add(array: ConstantArray, rhs: Scalar) -> ConstantArray:
    # Compute once on the single stored scalar and reuse the original length.
    return ConstantArray(array.scalar().add(rhs), len=len(array))
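The comparisons we care about for filter push-down work the same way; a hypothetical less_than kernel (again, illustrative pseudocode rather than a real Vortex function) also produces a ConstantArray, this time of booleans.

def less_than(array: ConstantArray, rhs: Scalar) -> ConstantArray:
    # Comparing one constant against a literal yields a constant boolean mask.
    return ConstantArray(array.scalar().less_than(rhs), len=len(array))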

Dictionary Encoding

Another very common type of compression is dictionary encoding. This array holds one child containing the (typically) unique values, and a second child of integer codes pointing into those values.

Hopefully by now you can guess how one might perform scalar compute over dictionary arrays!

def add(array: DictArray, rhs: Scalar) -> DictArray:
    # Run the compute over the unique values only; the per-row codes are reused.
    return DictArray(values=array.values().add(rhs), codes=array.codes())

Scalar compute runs in time proportional to the number of unique values, rather than the length of the array. This can vastly reduce runtime, particularly for complex functions and large data types like strings.
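For intuition, picture 8 million rows drawn from only three distinct strings (a made-up example, not tied to any real dataset); the expensive per-value work happens once per unique value while the codes are reused unchanged.

values = ["GERMANY", "FRANCE", "KENYA"]   # the (typically) unique values
codes = [0, 2, 1, 0] * 2_000_000          # 8 million small integer codes

# Upper-casing the dictionary touches 3 strings rather than 8 million.
upper_values = [v.upper() for v in values]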

ALP Encoding

Perhaps a less well-known compression codec is Adaptive Lossless Floating-Point (ALP).

In this codec, two exponents e and f are chosen such that floating point numbers can be stored losslessly as integers. For example, 1.234 might be stored as 1234. The details are very interesting, and our MIT-licensed Rust implementation is available on GitHub, but all you really need to know is that decoding a single float from ALP takes two floating-point multiplications.

def alp_decode(a, e, f):
    # 10.0**f and 10.0**-e are fixed for a given array, so decoding each value
    # boils down to two floating-point multiplications.
    return a * 10.0**f * 10.0**-e

While not all operations can be pushed down over ALP, many can. For example, a comparison such as a < 1.567 can be turned into a < 1567 provided that 1.567 successfully round-trips the ALP compression with the given e and f exponents.
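A sketch of that rewrite in plain Python; the round-trip check here is a simplification of what the real ALP kernels do, and the helper name is made up for illustration.

def try_rewrite_lt(literal: float, e: int, f: int):
    # Encode the literal exactly as ALP encoded the column's floats.
    encoded = int(round(literal * 10.0**e * 10.0**-f))
    # The rewrite `a < literal` -> `encoded(a) < encoded` is only safe when the
    # literal survives the round-trip; otherwise fall back to decompressing.
    if encoded * 10.0**f * 10.0**-e == literal:
        return encoded
    return None

# With exponents giving a decimal scale of 1000, 1.567 rewrites to 1567
# (when the round-trip is exact).
encoded_literal = try_rewrite_lt(1.567, e=3, f=0)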

Running some quick benchmarks, a < 1567_i64 is ~40% faster than alp_decode(a) < 1.567_f64 (roughly 0.6x the baseline runtime), and so it makes sense for us to push comparisons down over the encoded ALP integers rather than decompress the floats.

Hierarchical Compression

Things get even more exciting when we consider that Vortex uses cascading hierarchical compression. What starts life as a float array may end up as a Dict + ALP + FastLanes BitPacked array and benefit from several layers of compute push-down.

💡
Recall that FastLanes BitPacking enables us to pack integers into their smallest bit-width with incredible SIMD acceleration.
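For readers who haven't met bit-packing before, here is a toy version in plain Python that shows the space saving. The real FastLanes layout is transposed so that packing and unpacking vectorise; this sketch ignores that entirely.

def pack_4bit(values: list[int]) -> bytes:
    # Two 4-bit values (0 <= x < 16) fit into every byte.
    out = bytearray()
    for i in range(0, len(values), 2):
        lo = values[i]
        hi = values[i + 1] if i + 1 < len(values) else 0
        out.append(lo | (hi << 4))
    return bytes(out)

# Eight i64 discounts (64 bytes uncompressed) pack into just 4 bytes.
packed = pack_4bit([0, 4, 10, 2, 7, 1, 9, 5])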

Let’s set aside dictionary push-down for now (given it is asymptotically beneficial in almost all cases) and focus on ALP + BitPacked.

The output of the Vortex command-line tool helpfully describes the structure of the l_discount column of the TPC-H dataset:

root: vortex.alp(f64, len=8388608) nbytes=4.19 MB (100.00%)
  metadata: ALPMetadata { exponents: Exponents { e: 14, f: 12 } }
  encoded: fastlanes.bitpacked(i64, len=8388608) nbytes=4.19 MB (100.00%)
    metadata: BitPackedMetadata { bit_width: 4, offset: 0, patches: None }
    buffer (align=8): 4.19 MB

In Parquet, a simple a < 10.0 operation first requires page decompression (typically a general purpose compressor like LZ4 or SNAPPY), followed by dictionary decoding, and finally running the floating point comparison.

Vortex avoids opaque general purpose compressors, meaning we are able to directly memory-map the hierarchical array before performing our push-down compute. Once the compressed array is in memory, we push the operator through ALP encoding into integer space arriving at the BitPacked array.
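Conceptually, the cascade for a comparison like a < 10.0 looks something like the sketch below. This is illustrative pseudocode only; the class and helper names are invented here, and the real kernels live in Vortex's Rust compute functions.

def less_than(array, rhs: Scalar):
    if isinstance(array, ALPArray):
        # Rewrite the float literal into ALP's integer space (when it
        # round-trips cleanly) and recurse into the encoded integer child.
        return less_than(array.encoded(), array.encode_scalar(rhs))
    if isinstance(array, BitPackedArray):
        # Compare directly against the packed integers, without unpacking
        # back out to the logical i64 width first.
        return bitpacked_less_than(array, rhs)
    # ... other encodings, or fall back to canonical decoding.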

SIMD, SIMD, SIMD

And this is where things get even more interesting. Much of the performance of lightweight compression comes from heavy use of SIMD instructions. My MacBook has 128-bit SIMD registers, meaning it can run 2 x 64-bit, 4 x 32-bit, 8 x 16-bit, or 16 x 8-bit operations at the same time.

As you might expect, code written to benefit from SIMD acceleration sees practically linear scaling with these factors. That is, code operating over i8s runs ~8x faster than code running over i64s.

Take our bit-packed integer array from l_discount:

fastlanes.bitpacked(i64, len=8388608) nbytes=4.19 MB (100.00%)
  metadata: BitPackedMetadata { bit_width: 4, offset: 0, patches: None }
  buffer (align=8): 4.19 MB

We see that 8.3 million i64 values are bit-packed into 4-bit unsigned integers, with no patched exceptions. That means all of the values fall in the range 0 <= x < 16.

Since we’re performing a comparison of a < 10, rather than any sort of integer arithmetic, we don’t care about overflow semantics. Vortex therefore performs the comparison over the narrowest plausible integer type, in this case a u8.
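A sketch of that narrowing decision, using a hypothetical helper rather than the actual Vortex code:

def comparison_type(bit_width: int) -> str:
    # Pick the smallest unsigned type that can hold the packed values;
    # a pure comparison can't overflow, so the logical i64 width is irrelevant.
    for name, bits in (("u8", 8), ("u16", 16), ("u32", 32), ("u64", 64)):
        if bit_width <= bits:
            return name

# l_discount is packed to 4 bits, so its comparison runs over u8 lanes.
assert comparison_type(4) == "u8"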

From our baseline of decompressing 8.3 million ALP-encoded floats (1.0), we push the comparison into 2 x i64 (recall my MacBook’s 128-bit SIMD registers) integer space (0.6), and finally into 16 x u8 space (0.075), for a total 83% improvement!

Can we go further?

I have to admit that Vortex doesn’t currently perform the last step of running the comparison in u8-space, but the architecture of Vortex means it’s very easy to add and is coming soon!

Besides this, we can ask ourselves whether there’s even more room for improvement. Much of what we’ve discussed here can be further optimized to make better use of CPU caching. By fusing the logic of the bit-unpacking and comparison operators, we are seeing early indications of another ~40-50% performance improvement, and we will be talking more about this soon.
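As a flavour of what fusion means here, a toy (scalar, non-SIMD) fused kernel compares each value as it is unpacked instead of materialising the unpacked integers first; it reuses the toy 4-bit packing sketched earlier, and the numbers quoted above come from the real vectorised kernels, not anything like this.

def fused_unpack_lt(packed: bytes, literal: int) -> list[bool]:
    # Compare each 4-bit value as it is unpacked, so the widened integers
    # never have to be written back to memory between the two steps.
    out = []
    for byte in packed:
        out.append((byte & 0x0F) < literal)
        out.append((byte >> 4) < literal)
    return out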

In the meantime, we are busy working away to stabilize the initial Vortex file format, ready for release into the world. Reach out if you’re facing problems in this space, or if you’re interested in joining our in-person teams in London 🇬🇧 + NYC 🇺🇸