Transfer a Giant Directory across the Network
Originally Posted: Aug 24, 2022
Dealing with huge web3 data, I find myself doing this a lot: moving data directories across machines. In this case, we have a 250 GB directory with 10K files. The network connection between the machines is ~5 MBps, but using these techniques we can achieve a 6x faster transfer (~30 MBps).
There are two ways to do this:
- Use compressed tar (Newer and Older)
- Use rsync
Update (Dec 6, 2022)
I wrote another post comparing the performance of various compression algorithms for moving big data around.
Update (Nov 6, 2022): Use Compressed Tar (Newer, Faster)
There are a bunch of issues with the “compressed tar” approach I had previously described.
- `ssh` tends to be slower in terms of transfer. A faster way is to use `nc`, if you’re OK with the connection being insecure.
- `gzip` is slow. I’ve switched over to using `zstd` for these transfers, which is significantly faster. Note that `zstd` is quite CPU intensive. If you want something cheaper, go for `lz4`, though the latter doesn’t compress as efficiently as the former. The best way to figure out what works is to look at the output from `pv`. You want the absolute highest throughput, and sometimes that can even come from no compression. So, keep a watch on those numbers.
- The destination server receives a `tar` file. For big datasets, it could take a significant amount of additional time to `untar` it. The method below also has the benefit of simultaneously running `untar`, so when the transfer completes, the directory is nicely laid out on the destination server.
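As a rough local sketch of that compressor comparison (hypothetical `/tmp` paths; `gzip` and a no-compression baseline stand in here because they're available everywhere, and you'd swap in `zstd` or `lz4` the same way):

```shell
# Build a small sample directory: one compressible file, one incompressible.
mkdir -p /tmp/xfer_demo
yes "a highly repetitive line of text" | head -c 5000000 > /tmp/xfer_demo/text.txt
head -c 5000000 /dev/urandom > /tmp/xfer_demo/random.bin

# Run the same tar stream through each candidate compressor. On real data
# you'd put `pv -trab` in front of the compressor and watch its rate; here
# we just compare output sizes.
tar cf - -C /tmp xfer_demo | gzip > /tmp/xfer_demo.tgz
tar cf - -C /tmp xfer_demo | cat  > /tmp/xfer_demo.tar   # no-compression baseline
ls -l /tmp/xfer_demo.tgz /tmp/xfer_demo.tar
```

Whichever pipeline shows the highest sustained rate on your data and CPU is the one to use.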
This is how my new system works:
```shell
# Run the following first on the destination server on port 1499
nc -l 1499 | unzstd | pv -trab | tar xf -

# Then, run the following on the source server.
# Note the pipe commands are in reverse order.
tar cf - <dir> | pv -trab | zstd | nc <ip-addr-dest-server> 1499
```
The caveat here is that if the transfer fails for some reason, there’s no good way to resume the transfer. It would have to restart from scratch. If you know of a way to resume a broken connection with the above technique, DM me on Twitter.
The other caveat is that, even when the transfer is complete, the listener continues to listen. It’s a simple fix: just `ctrl+c` it to end the transfer.
This method is significantly faster than the ones below.
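To see the shape of this pipeline without two machines, you can replace the `nc` hop with a plain local pipe (a sketch with hypothetical `/tmp` paths; `gzip`/`gunzip` stand in for `zstd`/`unzstd` so it runs anywhere):

```shell
# Source and destination directories (hypothetical paths).
mkdir -p /tmp/nc_demo/src /tmp/nc_demo/dst
echo "block one" > /tmp/nc_demo/src/a.txt
echo "block two" > /tmp/nc_demo/src/b.txt

# Same shape as the transfer above: tar, compress, (network hop),
# decompress, untar, with the nc hop replaced by a local pipe.
tar cf - -C /tmp/nc_demo src | gzip | gunzip | tar xf - -C /tmp/nc_demo/dst

ls /tmp/nc_demo/dst/src   # a.txt  b.txt
```

Because extraction runs concurrently with the stream, the directory is fully laid out the moment the last byte arrives.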
Use Compressed Tar (Older, Slower)
In this example, we want to move the `data` directory, which has thousands of files.
```shell
$ tar cf - data | pv -trab | pigz --best -p 32 | ssh user@host "cat > /data/data.tgz"
 242GiB 2:22:36 [29.1MiB/s] [29.1MiB/s]
```
We first create a single stream out of the directory using `tar`. A single stream allows you to keep streaming at the maximum possible transfer rate, without having to stop-and-go for every file.
`pv` is a great way to show the progress `tar` is making. It doesn’t interfere with the pipe.
`pigz` makes compression faster by distributing the work across cores. Finally, the stream is copied via `ssh` to the destination machine.
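Since this older method lands a single `data.tgz` on the destination, untarring is a separate step you pay after the transfer finishes (a local sketch with hypothetical `/tmp` paths):

```shell
# Simulate what the destination ends up with: a compressed tarball.
mkdir -p /tmp/old_method/src
echo "payload" > /tmp/old_method/src/file.txt
tar czf /tmp/old_method/data.tgz -C /tmp/old_method src

# The extra step the newer nc/zstd pipeline avoids: extracting after
# the transfer has already completed.
mkdir -p /tmp/old_method/dst
tar xzf /tmp/old_method/data.tgz -C /tmp/old_method/dst
ls /tmp/old_method/dst/src   # file.txt
```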
Use Rsync (Most Common, Slower)
```shell
# rsync src folder to dest folder
# -r = recursive
# -t = preserve times
# -v = verbose
# -u = instead of a blind copy, check if the file already exists, and update it
# -c = skip files based on checksum, not mod-time & size (expensive)
# -a = archive mode, preserves symbolic links (useful)
# -h = human-readable output
# --progress shows progress and ETA
#
# Trailing slashes are important: with them, the "src" folder doesn't get
# created inside the dst folder, like so -> dst/src. Instead, src/* -> dst/*.
rsync -rtvuc --delete src/ dst/

# If sending data over the network or between drives, also consider the
# -z option, which compresses data before transfer.
rsync -rtvucz --delete src/ -e ssh server:/dst/

# In case of temporary network errors, re-run the last command until it
# exits successfully.
until !!; do :; done
```
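The `until !!; do :; done` trick replays your last interactive command. A more explicit retry loop does the same thing in a script (a sketch; the `flaky` command below is a hypothetical stand-in for the real `rsync` invocation):

```shell
# Generic retry helper (sketch): re-run a command until it exits 0.
retry() {
  until "$@"; do
    echo "command failed; retrying in 1s..." >&2
    sleep 1
  done
}

# Demo with a stand-in command that fails twice, then succeeds.
attempts=0
flaky() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}
retry flaky
echo "succeeded after $attempts attempts"   # -> succeeded after 3 attempts
```

For the real transfer you'd invoke something like `retry rsync -rtvucz --delete src/ -e ssh server:/dst/`; since rsync skips files that already match, each retry resumes roughly where the last attempt stopped.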
Now you know how to move lots of data across the internet.