Comparing Compression Algorithms for Moving Big Data

In a previous post, I wrote about the best way to transfer a directory across the network.

In this post, I compare Brotli, Bzip2, Gzip, Zstd and LZ4 on a blockchain dataset to determine which one gives me the fastest transmission rate when moving data around.

I took a Badger DB directory that I needed to transmit over the wire and ran various compression algorithms on it to see how fast they can compress. All the compression tools were installed using apt on Ubuntu 22.04.1 LTS (jammy); no custom packages were used. Note that Badger blocks (many blocks make up one SSTable file) are compressed with Snappy by default, so the input is already partly compressed.

Before sending the directory over, it makes sense to tar it, so it creates a single stream. The directory is ~70 GiB. For this test, I picked the first 4 GiB of the tar output. I used pv -trab to monitor the output size, the current and the average throughput.

In other words, I'm comparing how fast these compression algorithms can compress 4 GiB of Badger output.

# With no compression. Yes, we're only picking the first 4 GiB.
$ tar cf - index | head -c 4294967296 | pv -trab > /dev/null
4.00GiB 0:00:00 [4.97GiB/s] [4.97GiB/s]

# With Bzip2
$ tar cf - index | head -c 4294967296 | bzip2 | pv -trab > /dev/null
2.53GiB 0:03:51 [11.2MiB/s] [11.2MiB/s]

# With Gzip
$ tar cf - index | head -c 4294967296 | gzip | pv -trab > /dev/null
2.79GiB 0:02:59 [15.9MiB/s] [15.9MiB/s]

# With ZSTD
$ tar cf - index | head -c 4294967296 | zstd | pv -trab > /dev/null
2.69GiB 0:00:20 [ 132MiB/s] [ 132MiB/s]

# With LZ4
$ tar cf - index | head -c 4294967296 | lz4 | pv -trab > /dev/null
3.55GiB 0:00:05 [ 660MiB/s] [ 660MiB/s]
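The runs above can also be scripted. The following is a minimal sketch, not the exact method used for the numbers in this post: it assumes the sample has first been saved to a file, and `bench_compressors` is a hypothetical helper name. It relies only on the `-c` (write to stdout) flag, which bzip2, gzip, zstd and lz4 all accept.

```shell
#!/usr/bin/env bash
# bench_compressors INPUT_FILE TOOL...
# For each tool found on PATH, print: tool name, compressed bytes, elapsed seconds.
# Hypothetical helper; timing uses whole seconds, which is enough for multi-second runs.
bench_compressors() {
  local input="$1"; shift
  local tool start end bytes
  for tool in "$@"; do
    # Skip tools that are not installed instead of failing the whole run.
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool skipped"
      continue
    fi
    start=$(date +%s)
    # Compress from stdin to stdout and count the bytes produced.
    bytes=$("$tool" -c < "$input" | wc -c)
    end=$(date +%s)
    echo "$tool $((bytes)) $((end - start))"
  done
}

# Example usage (writes a 4 GiB sample file to disk first):
#   tar cf - index | head -c 4294967296 > sample.tar
#   bench_compressors sample.tar bzip2 gzip zstd lz4
```

Reading the sample from a file rather than from a pipe means disk read speed can influence the fastest compressors, so the numbers may not match the piped runs exactly.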

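Summarizing the results in this table. Because compression and transmission are streamed, each transmission-time cell is the larger of the compression time and the time on the wire; a value in parentheses is the wire time in cases where compression is the bottleneck.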

Compression Algorithm	Compression Time	Output Size	Ratio	Transmission Time @ 2.5 Gbps	Transmission Time @ 1 Gbps	Transmission Time @ 100 Mbps
None (Baseline)	0s	4.00 GiB	1.00x	13s	32s	327s
Brotli	1.5+ hrs	-	-	-	-	-
Bzip2	231s	2.53 GiB	1.58x	231s (> 8s)	231s (> 20s)	231s (> 202s)
Gzip	180s	2.79 GiB	1.43x	180s (> 9s)	180s (> 22s)	228s
Zstd	20s	2.69 GiB	1.48x	20s (> 9s)	21s (win)	220s (win)
LZ4	5s	3.55 GiB	1.13x	11s (win)	28s	290s

Right out of the gate, Brotli is unworkable. It's really, really slow, and it doesn't seem to make sense for any decent-sized dataset. It really only seems to fit small payloads under 2 MiB, the kind you'd expect from HTTP traffic, which is what it's designed for. In my experiment, Brotli still hadn't finished after 1.5 hrs, so its full results are omitted.

What's interesting to note here is that Gzip takes 9x more time than Zstd, with a worse compression ratio. Bzip2 achieves better compression, but is even slower than Gzip (about 11x slower than Zstd). Given how comparable the compression ratios are between Bzip2, Gzip and Zstd, Zstd's roughly 10x speed advantage wins outright.

Now it’s between Zstd and LZ4.

On a 2.5 Gbps connection, LZ4 wins simply by being the fastest compression algorithm while still giving a nice reduction in size of roughly 11%. As ZFS recommends, this should just be the default for compression.

Things change on (relatively) slower connections. On both 1 Gbps and 100 Mbps connections, Zstd starts to win because of its better compression ratio, while still maintaining significantly faster compression throughput than Gzip.
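The transmission-time comparison can be sketched as a small calculation. This is an illustration under stated assumptions rather than the exact arithmetic behind the table: it assumes compression and transmission overlap fully, so the end-to-end time is whichever stage is slower, and it treats 1 Gbps as 2^30 bits per second, which appears to match most of the table's entries. `effective_time` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# effective_time COMPRESS_SECONDS OUTPUT_GIB LINK_GBPS
# Estimates end-to-end seconds for a streamed transfer.
# Assumption: binary prefixes throughout (GiB of data, 2^30 bits/s per "Gbps").
effective_time() {
  awk -v c="$1" -v size="$2" -v link="$3" 'BEGIN {
    tx = size * 8 / link        # seconds to push the compressed bytes over the link
    t = (c > tx) ? c : tx       # a streamed pipeline is limited by its slower stage
    printf "%.1f\n", t
  }'
}

effective_time 20 2.69 1      # Zstd on 1 Gbps   -> 21.5 (the link is the bottleneck)
effective_time 5 3.55 2.5     # LZ4 on 2.5 Gbps  -> 11.4 (the link is the bottleneck)
effective_time 180 2.79 1     # Gzip on 1 Gbps   -> 180.0 (compression is the bottleneck)
```

With decimal prefixes (1 Gbps as 10^9 bits per second) the wire times come out about 7% longer, which does not change which algorithm wins in each column.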

Conclusion

You can't go wrong with Zstd. Use Zstd for any significantly sized data transfer over the internet. LZ4 is cheaper and faster, so you can use it instead if you're sitting on a really fast network, for example, when moving files within the rack or datacenter.
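As a rough sketch of what such a transfer pipeline looks like, the function below tars a directory, compresses the stream, and unpacks it on the other side. To stay self-contained it runs both halves locally; `stream_dir` is a hypothetical helper name, and the host and path in the trailing comment are placeholders, not values from this setup.

```shell
#!/usr/bin/env bash
# stream_dir SRC_DIR DEST_DIR [COMPRESSOR]
# Tars SRC_DIR, compresses the stream, then decompresses and unpacks it under DEST_DIR.
# COMPRESSOR defaults to zstd; lz4 and gzip accept the same -c and -dc flags.
stream_dir() {
  local src="$1" dest="$2" comp="${3:-zstd}"
  mkdir -p "$dest" || return 1
  # -C keeps only the directory's own name in the archive, not its full path.
  tar cf - -C "$(dirname "$src")" "$(basename "$src")" \
    | "$comp" -c \
    | "$comp" -dc \
    | tar xf - -C "$dest"
}

# Over a network, the decompress-and-unpack half would run on the receiver,
# for example (hypothetical host and path):
#   tar cf - index | zstd -c | ssh user@receiver 'zstd -dc | tar xf - -C /data'
```

Swapping the third argument between `zstd` and `lz4` is the only change needed to follow the recommendation above for slower versus faster links.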

What this blog post didn't test for is CPU usage. From experience, I know Zstd uses significantly more CPU than LZ4. I prefer to use LZ4 for storage on ZFS so it can catch any easily compressible data, while using the more powerful (and relatively expensive) Zstd for data transmission.

As this test shows, stay away from Gzip. It’s slow and expensive.

Date

December 6, 2022
