Comparing Compression Algorithms for Moving Big Data
In a previous post, I wrote about the best way to transfer a directory across the network.
In this post, I compare Brotli vs Bzip2 vs Gzip vs Zstd vs LZ4 on a blockchain dataset to determine which compression algorithm gives me the fastest effective transmission rates for moving data around.
I took a Badger DB directory that I needed to transmit over the wire, and ran various compression algorithms on it to see how fast they can compress it. All the compression tools were installed using apt
on Ubuntu 22.04.1 LTS (jammy); no custom packages were used. Note that Badger blocks (many blocks make up one SSTable file) are compressed via Snappy by default.
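For reference, this is roughly the install step (package names as I’d expect them on jammy; brotli included for completeness):
# Install the compression tools used below.
$ sudo apt install bzip2 gzip zstd lz4 brotli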
Before sending the directory over, it makes sense to tar it so that it forms a single stream. The directory is ~70 GiB; for this test, I picked the first 4 GiB of the tar output. I used pv -trab
to monitor the elapsed time, the bytes transferred, and the current and average throughput.
In other words, I’m comparing how fast these compression algorithms can compress 4 GiB of Badger output.
# With no compression. Yes, we're only picking the first 4 GiB.
$ tar cf - index | head -c 4294967296 | pv -trab > /dev/null
4.00GiB 0:00:00 [4.97GiB/s] [4.97GiB/s]
# With Bzip2
$ tar cf - index | head -c 4294967296 | bzip2 | pv -trab > /dev/null
2.53GiB 0:03:51 [11.2MiB/s] [11.2MiB/s]
# With Gzip
$ tar cf - index | head -c 4294967296 | gzip | pv -trab > /dev/null
2.79GiB 0:02:59 [15.9MiB/s] [15.9MiB/s]
# With ZSTD
$ tar cf - index | head -c 4294967296 | zstd | pv -trab > /dev/null
2.69GiB 0:00:20 [ 132MiB/s] [ 132MiB/s]
# With LZ4
$ tar cf - index | head -c 4294967296 | lz4 | pv -trab > /dev/null
3.55GiB 0:00:05 [ 660MiB/s] [ 660MiB/s]
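Worth noting: everything above runs with default settings, which for zstd means a single compression thread. zstd can use all cores with -T0; a variant of the same test (not part of the measured results above) would look like:
# Hypothetical multi-threaded variant; not benchmarked here.
$ tar cf - index | head -c 4294967296 | zstd -T0 | pv -trab > /dev/null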
Summarizing the results in this table. Since compression and transmission are pipelined, the total time at each link speed is roughly the slower of the two stages; parenthesized values are the network-only transmission times:
Compression Algorithm | Generation Time | Output | Ratio | Total @ 2.5 Gbps | Total @ 1 Gbps | Total @ 100 Mbps |
---|---|---|---|---|---|---|
None (Baseline) | 0s | 4.00 GiB | 1.00x | 13s | 32s | 327s |
Brotli | 1.5+ hrs | - | - | - | - | - |
Bzip2 | 231s | 2.53 GiB | 1.58x | 231s (> 8s) | 231s (> 20s) | 231s (> 202s) |
Gzip | 180s | 2.79 GiB | 1.43x | 180s (> 9s) | 180s (> 22s) | 228s |
Zstd | 20s | 2.69 GiB | 1.48x | 20s (> 9s) | 21s (win) | 220s (win) |
LZ4 | 5s | 3.55 GiB | 1.13x | 11s (win) | 28s | 290s |
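For transparency, here’s the back-of-the-envelope arithmetic behind the transmission columns (size in GiB × 8 ÷ link speed in Gbps, matching the baseline row), e.g. Zstd’s output at 1 Gbps:
# Network-only transfer time for Zstd's 2.69 GiB output on a 1 Gbps link:
$ echo "2.69 * 8 / 1" | bc -l
21.52000000000000000000
That ~21.5s of transfer dominates the 20s of compression, hence the ~21s total in the table.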
Right out of the gate, Brotli is unworkable. It’s really, really slow. It doesn’t seem to make any sense to use Brotli for datasets of any decent size. It really only makes sense for small payloads under 2 MiB, the kind you’d expect from HTTP traffic, which is what it’s designed for. In my experimentation, Brotli didn’t finish processing even after 1.5 hrs, so its full results are omitted.
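Part of the problem is likely that the brotli CLI defaults to its maximum quality level (11). A lower quality level should be far faster, though I didn’t benchmark it here:
# Untested here: brotli at a lower quality level instead of the default -q 11.
$ tar cf - index | head -c 4294967296 | brotli -q 5 | pv -trab > /dev/null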
What’s interesting to note here is that Gzip takes 9x more time than Zstd, with a worse compression ratio. Bzip2 achieves better compression, but is even slower than Gzip. Given how comparable the compression ratios of Bzip2, Gzip, and Zstd are, Zstd’s roughly 10x speed advantage wins outright.
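If you need an even better ratio and can spare the time, zstd also exposes higher compression levels (up to 19 by default). I didn’t benchmark those here, but the invocation would be:
# Untested here: higher zstd levels trade compression speed for ratio.
$ tar cf - index | head -c 4294967296 | zstd -19 | pv -trab > /dev/null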
Now it’s between Zstd and LZ4.
On a 2.5 Gbps connection, LZ4 wins by simply being the fastest compression algorithm while still giving a nice ~11% reduction in size. There’s a reason LZ4 is the recommended default compression on ZFS.
Things change on (relatively) slower connections. On both 1 Gbps and 100 Mbps connections, Zstd starts to win because of its better compression ratio, while still maintaining significantly higher compression throughput than Gzip.
Conclusion
You can’t go wrong with Zstd. Use Zstd for any significantly sized data transfer over the internet. LZ4 is cheaper and faster, so throw it in when you’re sitting on a really fast network, for example, when moving files within a rack or datacenter.
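Putting it together, here’s a minimal sketch of what the actual transfer looks like with Zstd in the pipe (host and destination path are placeholders):
# Sender streams the tarred directory through zstd and over SSH;
# the receiver decompresses and unpacks on the fly. "user@dest" and "/data" are placeholders.
$ tar cf - index | zstd | ssh user@dest 'zstd -d | tar xf - -C /data'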
What this blog post didn’t test for is CPU usage. From experience, I know Zstd uses significantly more CPU than LZ4. I prefer to use LZ4 for storage on ZFS, where it can catch any easily compressible data, while using the more powerful and (relatively) expensive Zstd for data transmission.
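For reference, that ZFS preference is just a per-dataset property (the dataset name here is a placeholder):
# Enable LZ4 compression on a ZFS dataset; "tank/data" is a placeholder name.
$ sudo zfs set compression=lz4 tank/data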
As this test shows, stay away from Gzip. It’s slow and expensive.