Transfer a Giant Directory across the Network

Originally Posted: Aug 24, 2022

Dealing with huge web3 data, I find myself doing this a lot: moving data directories across machines. In this case, we have a 250 GB directory with 10K files. The network connection between machines is ~5 MBps. But using these techniques, we can achieve a 6x faster transfer (~30 MBps, measured on the uncompressed data).

There are two ways to do this:

  1. Use compressed tar (Newer and Older)
  2. Use rsync

Update (Dec 6, 2022)

I wrote another post comparing the performance of various compression algorithms for moving big data around.

Update (Nov 6, 2022): Use Compressed Tar (Newer, Faster)

There are a bunch of issues with the “compressed tar” approach I had previously described.

  1. ssh tends to slow the transfer down. nc is faster, if you’re OK with the connection being unencrypted.
  2. gzip is slow. I’ve switched over to zstd for these transfers, which is significantly faster. Note that zstd is quite CPU intensive. If you want something cheaper, go for lz4, though it doesn’t compress as efficiently as zstd. The best way to figure out what works is to look at the output from pv. You want the absolute highest throughput, and sometimes that can even come from no compression. So, keep a watch on those numbers.
  3. The destination server receives a tar file. For big datasets, untarring it can take a significant amount of additional time. The method below also untars while the transfer is running, so when the transfer completes, the directory is already laid out on the destination server.
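
Before committing to a multi-hour transfer, it can help to compare compression ratios on a small sample. The following is only a sketch under assumptions: the sample is synthetic hex text rather than real data, and the helper names compressed_size and bench are made up for illustration. It measures ratio, not speed. As long as the CPU keeps up, the effective rate is roughly the link rate times the ratio, so the pv numbers on the real pipeline remain the final judge.

```shell
#!/usr/bin/env bash
# Sketch: compare compression ratios on a sample before a long transfer.

sample="$(mktemp)"
trap 'rm -f "$sample"' EXIT

# Synthetic sample: 1 MiB of random bytes rendered as hex text.
# Replace this with a real slice of your own dataset.
head -c 1048576 /dev/urandom | od -An -tx1 > "$sample"
in_bytes=$(( $(wc -c < "$sample") ))

# compressed_size CMD [ARGS...]: bytes produced when the sample is piped
# through CMD (which must read stdin and write stdout).
compressed_size() {
  echo $(( $("$@" < "$sample" | wc -c) ))
}

# bench LABEL CMD [ARGS...]: print sizes and ratio, or skip if CMD is missing.
bench() {
  local label="$1"; shift
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "$label: not installed, skipped"
    return 0
  fi
  local out_bytes
  out_bytes=$(compressed_size "$@")
  awk -v l="$label" -v i="$in_bytes" -v o="$out_bytes" \
    'BEGIN { printf "%s: %d -> %d bytes (ratio %.2f)\n", l, i, o, i / o }'
}

bench none cat
bench gzip gzip -c
bench zstd zstd -c
bench lz4 lz4 -c
```

A high ratio from a slow compressor can still lose to a lower ratio from a fast one, which is why the throughput shown by pv matters more than the ratio alone.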

This is how my new system works:

# Run the following first on the destination server, listening on port 1499.
# (Some netcat variants want the port as: nc -l -p 1499)
nc -l 1499 | unzstd | pv -trab | tar xf -

# Then, run the following on the source server.
# Note that the pipeline stages mirror the destination's, in reverse order.
tar cf - <dir> | pv -trab | zstd | nc <ip-addr-dest-server> 1499

The caveat here is that if the transfer fails for some reason, there’s no good way to resume the transfer. It would have to restart from scratch. If you know of a way to resume a broken connection with the above technique, DM me on Twitter.
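
This does not resume anything, but after an interrupted run it can at least show which files arrived intact. The sketch below is an illustration only: the manifest helper and the temporary src and dst directories are hypothetical, and it uses the POSIX cksum tool for checksums. In practice you would run manifest on each machine and compare the two small text files.

```shell
#!/usr/bin/env bash
# Sketch: list which files differ between source and destination trees.

# manifest DIR: print "checksum size path" for every file under DIR,
# sorted by path so two manifests can be compared line by line.
manifest() {
  ( cd "$1" && find . -type f -exec cksum {} + | sort -k 3 )
}

# Local demo with throwaway directories standing in for the two servers.
work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src" "$work/dst"
printf 'alpha\n' > "$work/src/a.txt"
printf 'beta\n'  > "$work/src/b.txt"
cp "$work/src/a.txt" "$work/dst/a.txt"
printf 'be' > "$work/dst/b.txt"   # simulates a file cut off mid-transfer

manifest "$work/src" > "$work/src.manifest"
manifest "$work/dst" > "$work/dst.manifest"

# diff prints the lines that differ and exits non-zero if there are any.
diff "$work/src.manifest" "$work/dst.manifest" || echo "destination is incomplete"
```

Any file listed in the diff can then be re-sent on its own, for example with rsync.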

The other caveat here is that, even when the transfer is complete, the listener continues to listen. It’s a simple fix: just press Ctrl+C to stop it.

This method is significantly faster than the ones below.

Use Compressed Tar (Older, Slower)

In this example, we want to move the data directory, which has thousands of files.

$ tar cf - data | pv -trab | pigz --best -p 32 | ssh user@host "cat > /data/data.tgz"
 242GiB 2:22:36 [29.1MiB/s] [29.1MiB/s]

We first create a single stream out of the directory using tar. A single stream allows you to keep streaming at the maximum possible transfer rate, without having to stop-and-go for every file.

Here, pv is a great way to show the progress tar is making. It doesn’t interfere with the pipe. pigz makes compression faster by distributing the work across cores. And finally, the stream is copied via ssh to the destination machine.
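
To rehearse the streaming idea without a network, the pipeline can be run locally end to end. This is a minimal sketch with assumptions: gzip stands in for pigz or zstd, the temporary directories and file names are made up, and pv and ssh are left out so it runs with only tar, gzip, and diff installed.

```shell
#!/usr/bin/env bash
# Sketch: tar -> compress -> decompress -> untar as one stream,
# with no intermediate archive file written to disk.

work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src/data/nested" "$work/dst"
printf 'hello\n' > "$work/src/data/a.txt"
printf 'world\n' > "$work/src/data/nested/b.txt"

# -C makes tar change directory first, so the stream holds relative
# paths ("data/...") that unpack cleanly under the destination.
tar -cf - -C "$work/src" data | gzip -c | gzip -dc | tar -xf - -C "$work/dst"

# diff -r exits non-zero if the two trees differ in any file.
if diff -r "$work/src/data" "$work/dst/data"; then
  echo "round trip OK"
fi
```

Swapping the middle two stages for a network hop (ssh or nc) gives the remote versions shown in this post.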

Use Rsync (Most Common, Slower)

# rsync src folder to dest folder
# -r = recursive
# -t = preserve modification times
# -v = verbose
# -u = update: skip files that are newer on the destination
# -c = skip files based on checksum, not mod-time & size (expensive)
# -a = archive mode (same as -rlptgoD), preserves symbolic links (useful)
# -h = human-readable numbers
# --progress = show progress and ETA
# --delete = delete files in dst that no longer exist in src
# The trailing slash on src/ is important: with it, the "src" folder doesn't
# get created inside the dst folder (dst/src). Instead, src/* -> dst/*.
rsync -rtvuc --delete src/ dst/

# If sending data over network or between drives, also consider using
# -z option, which would compress data before transfer.
rsync -rtvucz --delete -e ssh src/ server:/dst/

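# Restart it in case of temporary network errors, until it exits
# successfully. In an interactive bash shell, !! expands to the previous
# command, so run this right after the rsync command above.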
until !!; do :; done
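
The !! trick relies on history expansion, which is normally available only in an interactive shell. For use inside a script, a small retry helper does the same job. This is a sketch, not part of the original workflow: the function name retry, the attempt limit, and the delay are all illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch: rerun a command until it exits successfully.

# retry MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Returns 0 as soon as CMD succeeds, or 1 after MAX_ATTEMPTS failures.
retry() {
  local max_attempts="$1" delay="$2"; shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (commented out because it needs a reachable server):
# retry 20 30 rsync -rtvucz --delete -e ssh src/ server:/dst/
```

Unlike the bare until loop, this version stops after a fixed number of attempts, so a permanently broken connection doesn't loop forever.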

Now you know how to move lots of data across the internet.

December 6, 2022