What if we just didn’t decompress it?
How push-down compute helps Vortex run faster and decompress less

All columnar file formats support the idea of projection push-down. That is, only reading columns that are requested by the reader or are required to evaluate any filter predicates.
Most columnar file formats support the idea of predicate push-down. That is, using Zone Maps to prune large chunks of rows whose statistics provably fail the predicate.
Vortex is unique in the way it evaluates filter and projection expressions by supporting full compute push-down, in many cases avoiding decompression entirely.
Scalar Compute
Most compute functions in Vortex are “scalar”, meaning they operate element-wise, and are very often invoked with a literal value on the right-hand side. For example, a < 10 is much more common than a < b where both sides are arrays.
Other compute functions operate over entire arrays: sum, min, max, binary_search, etc. These functions can also benefit from push-down, but in a tradition harking back to university, I will leave this as an exercise for the reader. It is not too hard to see how scalar compute functions can be pushed down through simple light-weight compression schemes.
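As a nudge in the right direction, here is a minimal sketch of how two of those functions might push down. The names here are plain-Python stand-ins, not the Vortex API:

```python
def sum_constant(scalar, length):
    # sum over a constant array collapses to a single multiplication
    return scalar * length

def max_dict(values, codes):
    # max over a dictionary array needs only the (typically small) values
    # child, assuming every dictionary entry is referenced at least once
    return max(values)
```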
Constant Encoding
Let’s start with the simplest compression codec of them all: constant. If all values of an array are the same, Vortex collapses the data into a ConstantArray holding only a Scalar and a len.
In this case, we can perform all scalar compute in constant time, with the result also being a ConstantArray.
def add(array: ConstantArray, rhs: Scalar) -> ConstantArray:
    return ConstantArray(array.scalar().add(rhs), len=len(array))
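The same constant-time trick applies beyond arithmetic. A comparison against a constant array, for example, yields another constant array. A runnable sketch, with a plain-Python stand-in for Vortex's ConstantArray:

```python
class ConstantArray:
    # minimal stand-in for Vortex's ConstantArray: one scalar plus a length
    def __init__(self, scalar, length):
        self.scalar = scalar
        self.length = length

def less_than(array, rhs):
    # a < rhs has the same answer for every element, so the result is
    # itself a constant boolean array, computed in O(1)
    return ConstantArray(array.scalar < rhs, array.length)
```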
Dictionary Encoding
Another very common type of compression is dictionary encoding. This array holds one child containing (typically) unique values, and a second child containing pointers into those values.
Hopefully by now you can guess how one might perform scalar compute over dictionary arrays!
def add(array: DictArray, rhs: Scalar) -> DictArray:
    return DictArray(values=array.values().add(rhs), codes=array.codes())
Scalar compute runs in time proportional to the number of unique values, rather than the length of the array. This can vastly reduce runtime, particularly for complex functions and large data types like strings.
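To make that concrete, here is a runnable toy version of dictionary push-down, with plain lists standing in for the two child arrays:

```python
def dict_add(values, codes, rhs):
    # compute touches only the unique-values child; codes pass through untouched
    return [v + rhs for v in values], codes

def dict_decode(values, codes):
    # materialize the logical array, only needed here to check the result
    return [values[c] for c in codes]
```

Adding a scalar to a million-element array with two unique values performs exactly two additions.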
ALP Encoding
Perhaps a less well-known compression codec is Adaptive Lossless Floating-Point.
In this codec, two exponents e and f are chosen such that floating-point numbers can be stored losslessly as integers. For example, 1.234 might be stored as 1234. The details are very interesting, and our MIT-licensed Rust implementation is available on GitHub, but all you really need to know is that decoding a single float from ALP takes two floating-point multiplications.
def alp_decode(a):
    # e and f are the exponents chosen during compression
    return a * 10.0**f * 10.0**-e
While not all operations can be pushed down over ALP, many can. For example, a comparison such as a < 1.567 can be turned into a < 1567, provided that 1.567 successfully round-trips the ALP compression with the given e and f exponents.
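A sketch of that round-trip check (the helper names are illustrative, and real ALP also handles exceptional values via patches, which this ignores):

```python
def alp_encode(v, e, f):
    # inverse of alp_decode: scale the float into integer space and round
    return round(v * 10.0**e * 10.0**-f)

def alp_lt(encoded, literal, e, f):
    enc = alp_encode(literal, e, f)
    # the push-down is only sound if the literal round-trips exactly
    if enc * 10.0**f * 10.0**-e != literal:
        return None  # caller falls back to decompressing
    # decoding is monotone increasing, so alp_decode(a) < literal <=> a < enc
    return [code < enc for code in encoded]
```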
Running some quick benchmarks, a < 1567_i64 is ~40% faster (0.6x relative runtime) than alp_decode(a) < 1.567_f64, and so it makes sense for us to push down comparisons over the encoded ALP integers rather than decompress the floats.
Hierarchical Compression
Things get even more exciting when we consider that Vortex uses cascading hierarchical compression. What starts life as a float array may end up as a Dict + ALP + FastLanes BitPacked array and benefit from several layers of compute push-down.
Let’s set aside dictionary push-down for now (given it is asymptotically beneficial in almost all cases) and focus on ALP + BitPacked.
The output of the Vortex command line helpfully describes the structure of the l_discount
column of the TPC-H dataset.
root: vortex.alp(f64, len=8388608) nbytes=4.19 MB (100.00%)
  metadata: ALPMetadata { exponents: Exponents { e: 14, f: 12 } }
  encoded: fastlanes.bitpacked(i64, len=8388608) nbytes=4.19 MB (100.00%)
    metadata: BitPackedMetadata { bit_width: 4, offset: 0, patches: None }
    buffer (align=8): 4.19 MB
In Parquet, a simple a < 10.0 operation first requires page decompression (typically a general-purpose compressor like LZ4 or Snappy), followed by dictionary decoding, and finally running the floating-point comparison.
Vortex avoids opaque general-purpose compressors, meaning we are able to directly memory-map the hierarchical array before performing our push-down compute. Once the compressed array is in memory, we push the operator through the ALP encoding into integer space, arriving at the BitPacked array.
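Concretely, for l_discount the cascade looks like this. The following is a scalar sketch only: real FastLanes bit-packing uses a transposed, SIMD-friendly layout, and the names are illustrative:

```python
def bit_unpack(packed, bit_width, length):
    # unpack fixed-width little-endian integers from a byte buffer
    # (scalar reference version; works for bit_width <= 8 in this sketch)
    mask = (1 << bit_width) - 1
    out = []
    for i in range(length):
        bit = i * bit_width
        byte, off = divmod(bit, 8)
        word = int.from_bytes(packed[byte:byte + 2].ljust(2, b"\x00"), "little")
        out.append((word >> off) & mask)
    return out

def filter_lt(packed, bit_width, length, enc):
    # with e=14, f=12, the literal 0.04 encodes to the ALP integer
    # round(0.04 * 10**14 * 10**-12) == 4, so l_discount < 0.04
    # becomes code < 4 over the unpacked codes
    return [code < enc for code in bit_unpack(packed, bit_width, length)]
```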
SIMD, SIMD, SIMD
And this is where things get even more interesting. Much of the performance of lightweight compression comes from heavy use of SIMD instructions. My MacBook has 128-bit SIMD registers, meaning it can run 2 x 64-bit, 4 x 32-bit, 8 x 16-bit or 16 x 8-bit instructions at the same time.
As you might expect, code written to benefit from SIMD acceleration sees practically linear scaling with these factors. That is, code operating over i8s runs ~8x faster than code running over i64s.
Take our bit-packed integer array from l_discount:
fastlanes.bitpacked(i64, len=8388608) nbytes=4.19 MB (100.00%)
  metadata: BitPackedMetadata { bit_width: 4, offset: 0, patches: None }
  buffer (align=8): 4.19 MB
We see that 8.3 million i64 values are bit-packed into 4-bit unsigned integers, with no patched exceptions. That means all of the values lie in the range 0 <= x < 16.
Since we’re performing a comparison of a < 10, rather than any sort of integer arithmetic, we don’t care about overflow semantics. Vortex therefore performs the comparison over the narrowest plausible integer type, in this case a u8.
From our baseline of decompressing 8.3 million ALP-encoded floats (1.0), we push the comparison into 2xi64 (recall my MacBook’s 128-bit SIMD registers) integer space (0.6), and finally into 16xu8 space (0.075), for a total ~92% improvement!
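Sketched with NumPy standing in for Vortex's kernels (NumPy is purely an illustration here, not what Vortex uses, though its comparison loops are SIMD-vectorized in the same spirit):

```python
import numpy as np

# 4-bit codes held in the narrowest machine type, u8, rather than being
# widened back to the logical i64 before comparing
codes = np.array([3, 10, 4, 0, 15], dtype=np.uint8)

# with 128-bit SIMD registers this comparison can process 16 u8 lanes
# per instruction, versus 2 lanes for i64
mask = codes < np.uint8(10)
```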
Can we go further?
I have to admit that Vortex doesn’t currently perform the last step of running the comparison in u8-space, but the architecture of Vortex means it’s very easy to add and is coming soon!
Besides this, we can ask ourselves whether there’s even more room for improvement. Much of what we’ve discussed here can be further optimized to make better use of CPU caching. By fusing the logic of the bit-unpacking and comparison operators, we are seeing early indications of another ~40-50% performance improvement and will be talking more about this soon.
In the meantime, we are busy working away to stabilize the initial Vortex file format ready for release into the world. Reach out if you’re facing problems in this space, or if you’re interested in joining our in-person teams in London 🇬🇧 + NYC 🇺🇸