Detailed Guide to Setting Up ZFS RAID on Ubuntu 22.04
Original: Sep 14, 2022
In a previous post, I showed how to put together a cheap load balancer server for $500. I later had an idea to use that server for storage too. Storage servers don’t need much CPU, just a lot of disk space, and this one had very little disk and a pretty decent CPU. Combining the two workloads, a load balancer and a NAS, on one box is a good fit.
Typically, you’d pick up TrueNAS or Unraid and install that. But in this case, since the server had to serve two purposes, I wanted to stick with Ubuntu Server. So it made sense to go with RAID software on Ubuntu rather than a dedicated NAS OS.
I initially started with mdadm, but later learnt that ZFS does checksumming. Or rather, the other way around: I realized that mdadm doesn’t do checksumming.
How can you build reliable storage without block checksums?! — I ask sarcastically
That sealed the decision in favor of ZFS. The project seems stable from what I could see, even though Linus Torvalds doesn’t want to merge the ZFS filesystem code into the Linux kernel, because … Oracle likes lawsuits1. Lawsuits aside, the OpenZFS project has active contributions2 and has been in development for a long time.
Before we get started, let’s familiarize ourselves with some basic concepts of ZFS:
- ZFS Pool: A pool is a collection of one or more virtual devices.
- Virtual Device (vdev): A vdev can be a file, a disk, a RAID array, a spare disk, a log, or a cache.
- Write Hole: When the parity data doesn’t match the data on the other drives, and you can’t determine which drive has the right data.
- RAID-Z: A variation of RAID 5 implemented by ZFS, which provides the write atomicity (via “copy-on-write”) required to avoid write holes.
- ZFS Intent Log (ZIL): A write-ahead log used to record operations to disk before they are written to the pool. If a log vdev is set, ZFS uses that, so it makes sense to use an NVMe drive as the log vdev (also called SLOG, for separate intent log; just a fancy term).
- ZFS Cache: ZFS uses ARC (Adaptive Replacement Cache) to speed up reads out of RAM. It also has a level-2 ARC, which can house data evicted from the ARC. An NVMe drive can be set as a cache vdev for this purpose.
Something that I found surprising about ZFS compared to mdadm is that it doesn’t allow expanding an array by adding a single disk — until recently3. The reasoning has been that adding a new drive puts strain on the array and elevates the chance of other disks failing. Plus, waiting days for the array to rebuild isn’t practical for businesses.
So, the (original) idea is to expand a pool by adding a whole new RAID array (vdev) to it, instead of adding a single disk to an existing RAID vdev. The new array can be of any RAID level. Hence, a pool is a collection of virtual devices.
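That expansion model can be sketched as follows. This is a hypothetical example, assuming a pool named lake already exists; the device paths are placeholders:

```shell
# Sketch: expand the pool by adding a whole new raidz1 vdev, rather than
# growing the existing one. Device names here are placeholders.
sudo zpool add lake raidz1 \
    /dev/disk/by-id/ata-DISK4 \
    /dev/disk/by-id/ata-DISK5 \
    /dev/disk/by-id/ata-DISK6

# The pool now stripes new writes across both raidz1 vdevs.
sudo zpool status lake
```

The existing vdev keeps its data; ZFS simply starts allocating new blocks across both vdevs.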
So, let’s start with ZFS. Our goal is to put together a 3-disk RAIDZ array, the equivalent of RAID level 5, which uses one disk’s worth of capacity for parity, allowing the loss of any single disk out of the three. Now, the thing is, one of the disks (Disk-Z) actually contains the data I want to keep. So, I’m going to create the array with a “fake device”, copy the data over from Disk-Z, then replace the fake device with Disk-Z.
Use Serial IDs of Disks
I’m using the serial numbers to identify the drives, and I want ZFS to use those instead of the sd* device names allocated by Linux. Note that one of my disks, sdc, actually has the data that I want to copy over. To find the serial IDs of the drives, run:
$ lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE
NAME        SIZE SERIAL   LABEL          FSTYPE
sda        12.7T XXXXXXXX
sdb        12.7T YYYYYYYY
sdc        12.7T ZZZZZZZZ
└─sdc1     12.7T          outserv-zero:0 linux_raid_member
  └─md127  12.7T                         ext4
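To map those serial numbers to the stable paths that ZFS will store, you can list /dev/disk/by-id. The names shown here are placeholders:

```shell
# The by-id symlinks embed the drive model and serial number, and stay
# stable across reboots, unlike sda/sdb/sdc. Names are placeholders.
ls -l /dev/disk/by-id/ | grep -v part
# lrwxrwxrwx 1 root root 9 ... ata-MODEL_XXXXXXXX -> ../../sda
# lrwxrwxrwx 1 root root 9 ... ata-MODEL_YYYYYYYY -> ../../sdb
```

These ata-* paths are what we’ll pass to zpool create below.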
Detour: Remove disk from Linux RAID
/dev/sdb was part of the Linux RAID array, along with /dev/sdc. To remove it from the array, I ran:
$ sudo mdadm --manage /dev/md127 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md127

$ sudo mdadm --detail /dev/md127
...

$ sudo mdadm --manage /dev/md127 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md127
Create a Fake Drive
Next, we want to get the disk size of the drives. In this case, all three of my drives are the same size, which is what you’d ideally want.
$ sudo fdisk -l /dev/sda
Disk /dev/sda: 12.73 TiB, 14000519643136 bytes, 27344764928 sectors
Disk model: XXXXXXXXXXXXXXXX
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

$ sudo fdisk -l /dev/sdb
Disk /dev/sdb: 12.73 TiB, 14000519643136 bytes, 27344764928 sectors
Disk model: YYYYYYYYYYYYYYYY
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
/dev/sdc contains the data that I want, and is currently mounted as a Linux RAID. So, let’s first create a fake drive to stand in for it. Even though it appears to be the same size as a 14 TB drive, it’s a sparse file, so it won’t actually consume any space on disk.
$ truncate -s 14000519643136 /tmp/tmp.img
$ ls -al /tmp/tmp.img
-rw-rw-r-- 1 out out 14000519643136 Sep 14 02:27 /tmp/tmp.img
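If you want to confirm that a truncated image really is sparse, du can compare the apparent size with the blocks actually allocated. A quick sketch with a smaller 1 GiB file, so it’s cheap to run:

```shell
# Create a sparse image: apparent size is 1 GiB, but no blocks are
# allocated until data is actually written into the file.
truncate -s 1G /tmp/sparse.img

# Apparent size: reports the full 1 GiB (1073741824 bytes).
du --apparent-size --block-size=1 /tmp/sparse.img

# Actual allocation: close to 0 bytes, since the file is all holes.
du --block-size=1 /tmp/sparse.img

rm /tmp/sparse.img
```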
Create a ZFS Pool
Let’s create a ZFS pool named lake, and immediately take the fake device offline. I’m using raidz1, which tolerates the loss of a single disk. While creating the pool, I had to use -f to overwrite the ata-YYY disk, which was previously part of the Linux RAID array.
$ sudo zpool create -f lake raidz1 /dev/disk/by-id/ata-XXX /dev/disk/by-id/ata-YYY /tmp/tmp.img
$ sudo zpool offline lake /tmp/tmp.img

$ sudo zpool status
  pool: lake
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
config:

        NAME              STATE     READ WRITE CKSUM
        lake              DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            ata-XXX       ONLINE       0     0     0
            ata-YYY       ONLINE       0     0     0
            /tmp/tmp.img  OFFLINE      0     0     0

errors: No known data errors

# Note the Solaris tagged filesystem on the disks.
$ sudo fdisk -l
...
Disk /dev/sda: 12.73 TiB, 14000519643136 bytes, 27344764928 sectors
Disk model: XXXXXXXXXXXXXXXX
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Device           Start          End      Sectors  Size Type
/dev/sda1         2048  27344746495  27344744448 12.7T Solaris /usr & Apple ZFS
/dev/sda9  27344746496  27344762879        16384    8M Solaris reserved 1
The above command also mounts the newly created pool at /lake, deriving the mount point from the name of the pool.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
lake             26T   58G   26T   1% /lake
Turn on LZ4 Compression
Next, let’s turn on the default lz4 compression on the pool. Having LZ4 compression is highly recommended4.
Set compression=lz4 on your pools’ root datasets so that all datasets inherit it unless you have a reason not to enable it. Userland tests of LZ4 compression of incompressible data in a single thread has shown that it can process 10GB/sec, so it is unlikely to be a bottleneck even on incompressible data. Furthermore, incompressible data will be stored without compression such that reads of incompressible data with compression enabled will not be subject to decompression. Writes are so fast that in-compressible data is unlikely to see a performance penalty from the use of LZ4 compression. The reduction in IO from LZ4 will typically be a performance win.
$ sudo zfs get compression lake
NAME  PROPERTY     VALUE  SOURCE
lake  compression  off    default

$ sudo zfs set compression=on lake

$ sudo zfs get compression lake
NAME  PROPERTY     VALUE  SOURCE
lake  compression  on     local

# Shows lz4 is active and working.
$ zpool get all lake | grep compress
lake  feature@lz4_compress   active   local
lake  feature@zstd_compress  enabled  local

# Initially compressratio would start at 1.00x.
$ sudo zfs get compressratio lake
NAME  PROPERTY       VALUE  SOURCE
lake  compressratio  1.00x  -
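Once data lands on the pool, you can also compare logical versus physical usage to see exactly how much space compression is saving; a small sketch against the lake pool:

```shell
# logicalused is the amount of data as written by applications;
# used is what that data occupies on disk after lz4 compression.
sudo zfs get used,logicalused,compressratio lake
```

The gap between the two values is the on-disk savings, which should line up with the reported compressratio.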
I have big files, largely doing sequential reads and writes. A record size of 1 MB is recommended for such workloads5.
$ sudo zfs get recordsize lake
NAME  PROPERTY    VALUE  SOURCE
lake  recordsize  128K   default

$ sudo zfs set recordsize=1M lake

$ sudo zfs get recordsize lake
NAME  PROPERTY    VALUE  SOURCE
lake  recordsize  1M     local
Moreover, write performance in this array can be expected to be 2x that of a single disk, while read performance should be 3x (to be tested). As per6:
If sequential writes are of primary importance, raidz will outperform mirrored vdevs. Sequential write throughput increases linearly with the number of data disks in raidz while writes are limited to the slowest drive in mirrored vdevs. Sequential read performance should be roughly the same on each.
Copy Data Over with Rsync
In my case, I have around 6 TB of data. At 200 MBps, this is an overnight operation.
# This ran for ~7 hrs.
$ rsync -rtvh --progress /data/ /lake/

# compressratio is now 1.18x.
$ sudo zfs get compressratio lake
NAME  PROPERTY       VALUE  SOURCE
lake  compressratio  1.18x  -

# Compression reduced disk usage by 0.7 TB.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/md127       13T  5.1T  7.0T  42% /data
lake             26T  4.4T   21T  18% /lake
Remove Linux RAID and Assign to ZFS
Once the data copy is done, we can swap out the fake device for the real one. Remember, the tmp.img device is already offline.
Let’s first remove Linux RAID from Disk-Z.
$ sudo umount /data

$ sudo mdadm --stop /dev/md127
mdadm: stopped /dev/md127

$ sudo mdadm --remove /dev/md127
mdadm: error opening /dev/md127: No such file or directory

$ lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE
NAME     SIZE SERIAL   LABEL          FSTYPE
...
sdc     12.7T ZZZZZZZZ
└─sdc1  12.7T          outserv-zero:0 linux_raid_member

$ sudo mdadm --zero-superblock /dev/sdc1

$ lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE
NAME     SIZE SERIAL   LABEL  FSTYPE
sdc     12.7T ZZZZZZZZ
└─sdc1  12.7T                 ext4

$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
unused devices: <none>
Now, we can assign this drive to ZFS.
$ sudo zpool replace lake /tmp/tmp.img /dev/disk/by-id/ZZZ

$ sudo zpool status
  pool: lake
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Sep 14 14:47:26 2022
        6.57T scanned at 1.60G/s, 2.06T issued at 513M/s, 6.57T total
        696G resilvered, 31.30% done, 02:33:40 to go
config:

        NAME                                     STATE     READ WRITE CKSUM
        lake                                     DEGRADED     0     0     0
          raidz1-0                               DEGRADED     0     0     0
            ata-XXXXXXXXXXXXXXXXXXXXXXXXXXXXX    ONLINE       0     0     0
            ata-YYYYYYYYYYYYYYYYYYYYYYYYYYYYY    ONLINE       0     0     0
            replacing-2                          DEGRADED     0     0     0
              /tmp/tmp.img                       OFFLINE      0     0     0
              ata-ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ  ONLINE       0     0     0  (resilvering)

errors: No known data errors
This starts the “resilvering” process, i.e., copying the data onto the newly added drive. Once this process finishes, the ZFS pool should be healthy.
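While the resilver runs, its progress can be followed without re-typing the status command; a quick sketch:

```shell
# Refresh zpool status every 30 seconds to follow the resilver's
# percent-done and time-remaining estimates.
watch -n 30 zpool status lake

# Or watch per-device bandwidth during the resilver, sampled every 5s.
zpool iostat -v lake 5
```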
In fact, the whole resilvering process finished within 3.5 hours, which was pretty impressive considering the rsync copy had run for 7 hours.
$ zpool status
  pool: lake
 state: ONLINE
  scan: resilvered 2.18T in 03:30:32 with 0 errors on Wed Sep 14 18:17:58 2022
config:

        NAME                                   STATE     READ WRITE CKSUM
        lake                                   ONLINE       0     0     0
          raidz1-0                             ONLINE       0     0     0
            ata-XXXXXXXXXXXXXXXXXXXXXXXXXXXXX  ONLINE       0     0     0
            ata-YYYYYYYYYYYYYYYYYYYYYYYYYYYYY  ONLINE       0     0     0
            ata-ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ  ONLINE       0     0     0

errors: No known data errors
Test Read/Write Throughput
At 550 MBps, the array achieved 2x the read speed of a single disk, and a write speed of 440 MBps. Note that /dev/random can only output at 500 MiBps on my machine. While /dev/zero can output at 8 GiBps, compression would render it useless for testing write throughput.
$ cat /lake/some/data | pv -trab > /dev/null
23.2GiB 0:00:42 [ 556MiB/s] [ 556MiB/s]

$ dd if=/dev/random of=/lake/test.img bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 23.7674 s, 441 MB/s
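An alternative to the cat/dd approach is fio, which generates buffers that aren’t easily compressible (so lz4 won’t inflate the numbers). A sketch, assuming fio is installed; note the read numbers can be inflated by the ARC unless the test size exceeds RAM:

```shell
# Sequential write test: 1M blocks to match the pool's recordsize,
# with an fsync at the end so buffered data is counted.
fio --name=seqwrite --filename=/lake/fio.test --rw=write --bs=1M \
    --size=10G --ioengine=psync --end_fsync=1

# Sequential read of the same file. With 64 GB RAM, a 10G file may be
# partially served from the ARC rather than the disks.
fio --name=seqread --filename=/lake/fio.test --rw=read --bs=1M \
    --size=10G --ioengine=psync

rm /lake/fio.test
```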
Add Cache and SLOG
My server has 64 GB of RAM. It’s generally recommended to add an L2ARC (level 2 ARC) cache only if there’s plenty of primary cache, which 64 GB qualifies as.
I also realized some of my workloads might do small writes repeatedly. So, a fast device for the intent log made sense too. Backing both is a 500 GB NVMe drive.
I didn’t have any partitions on the drive, and I didn’t want to muck with it. So, instead, I created two files on it and used them as the log and cache devices.
$ sudo mkdir /cache

$ sudo dd if=/dev/zero of=/cache/write bs=1M count=131072
131072+0 records in
131072+0 records out
137438953472 bytes (137 GB, 128 GiB) copied, 196.709 s, 699 MB/s

$ sudo dd if=/dev/zero of=/cache/read bs=1M count=262144
262144+0 records in
262144+0 records out
274877906944 bytes (275 GB, 256 GiB) copied, 860.288 s, 320 MB/s

$ sudo zpool add lake log /cache/write
$ sudo zpool add lake cache /cache/read

$ sudo zpool status
  pool: lake
 state: ONLINE
  scan: scrub canceled on Wed Sep 14 18:52:48 2022
config:

        NAME                                   STATE     READ WRITE CKSUM
        lake                                   ONLINE       0     0     0
          raidz1-0                             ONLINE       0     0     0
            ata-XXXXXXXXXXXXXXXXXXXXXXXXXXXXX  ONLINE       0     0     0
            ata-YYYYYYYYYYYYYYYYYYYYYYYYYYYYY  ONLINE       0     0     0
            ata-ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ  ONLINE       0     0     0
        logs
          /cache/write                         ONLINE       0     0     0
        cache
          /cache/read                          ONLINE       0     0     0

errors: No known data errors
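If the file-backed log and cache ever need to go away (say, to hand dedicated NVMe partitions to ZFS instead), both vdev types can be detached from a live pool; a sketch:

```shell
# Log and cache vdevs can be removed without affecting pool data.
sudo zpool remove lake /cache/write   # drop the SLOG
sudo zpool remove lake /cache/read    # drop the L2ARC

# Confirm they no longer appear under logs/cache.
sudo zpool status lake
```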
Create File Systems
We can create multiple file systems within the pool7 using the zfs create command. In this example, I create a file system outserv under the lake pool. It gets automatically mounted at /lake/outserv. This filesystem can have its own properties, like compression and recordsize, separate from the pool’s defaults.
$ sudo zfs create lake/outserv
$ sudo zfs set recordsize=16K lake/outserv
$ sudo zfs get recordsize lake/outserv
$ sudo zfs get compressratio lake/outserv

# To delete the file system:
$ sudo zfs destroy lake/outserv
Now that you have divided your workloads into separate filesystems, you can snapshot them independently. Snapshots are a good way to ensure that if you accidentally delete files, you can get them back. They can also be used for taking backups.
$ sudo zfs snapshot lake/outserv@Dec12.2022

# To create snapshots for all filesystems recursively:
$ sudo zfs snapshot -r lake@Dec12.2022

# To destroy older snapshots:
$ sudo zfs destroy lake/outserv@Dec12.2022

# To rename a snapshot:
$ sudo zfs rename lake/outserv@Dec12.2022 lake/outserv@today
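Getting files back out of a snapshot works either through the hidden .zfs directory or by rolling the filesystem back wholesale; a sketch using the snapshot names above ("some-file" is a placeholder):

```shell
# Browse a snapshot read-only via the hidden .zfs directory, and copy
# a single file back out.
ls /lake/outserv/.zfs/snapshot/Dec12.2022/
cp /lake/outserv/.zfs/snapshot/Dec12.2022/some-file /lake/outserv/

# Or revert the whole filesystem to the snapshot. This discards all
# changes made after the snapshot was taken.
sudo zfs rollback lake/outserv@Dec12.2022
```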
Share and Mount as NFS
Let’s share /lake as NFS:
$ sudo apt install nfs-kernel-server
$ sudo zfs set sharenfs='rw' lake

$ sudo zfs get sharenfs lake
NAME  PROPERTY  VALUE  SOURCE
lake  sharenfs  rw     local
To mount this NFS directory on a client machine:
$ sudo apt install nfs-common
$ sudo mount -t nfs SERVER_IP_ADDR:/lake /lake
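To make the client mount survive reboots, an /etc/fstab entry can be added; a sketch, with the server address kept as a placeholder:

```shell
# Append an fstab entry so the NFS share mounts at boot. _netdev tells
# the system to wait for the network before attempting the mount.
echo 'SERVER_IP_ADDR:/lake /lake nfs defaults,_netdev 0 0' | \
    sudo tee -a /etc/fstab

# Mount everything in fstab now to verify the entry works.
sudo mount -a
```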
Finally, put together a cron job to scrub the ZFS pool on the 1st of every month, and mail the status output to my email 13 hours later.
$ sudo crontab -e

# Add the following lines:
0 0 1 * * /sbin/zpool scrub lake
0 13 1 * * /sbin/zpool status | (echo "Subject: Zpool Status\n"; cat -) | (ssh some-machine-with-msmtp 'msmtp email@example.com')
The Zettabyte File System
Alright. We now have a fully functional ZFS setup, ready to go. Using 3x 14 TB drives, we got 26 TiB of space, with the ability to lose one drive without losing data. We achieved filesystem-level compression of 1.18x over 5 TB of data, reducing our disk usage by 0.7 TB compared to the original source.
This is all pretty impressive stuff. Most importantly, all the data now has checksums, so ZFS can detect corruption in any block on any drive and fix it via the scrub step.
Update (Apr 27, 2023): ZFS scare
Today, after rearranging a few disks, suddenly my ZFS pool stopped working.
$ sudo zpool import -f
   pool: nile
     id: 5480319285548755054
  state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
This was a HUGE scare for me. The pool contains a lot of data. It turns out the issue was that mdadm was still configured for a few drives which I had removed from the system. Because mdadm identifies drives by their relative location in the system (/dev/sdX), and those drives had been removed, their locations got reassigned to some of the ZFS drives. So, instead of realizing that its drives were gone, mdadm took over the ZFS drives, causing ZFS to error out.
This was plenty scary for me. So, I removed mdadm completely from the system.
$ sudo mdadm --stop /dev/md127
$ sudo apt remove --purge mdadm
After doing this, zpool could be properly imported and started working again. Phew!
There’s no place for mdadm in a reliable Linux box. Adios for good this time!