Detailed Guide to Setting Up ZFS RAID on Ubuntu 22.04
Original: Sep 14, 2022
In a previous post, I showed how to put together a cheap load balancer server for $500. I later had an idea to use that server for storage too. Storage servers don’t need much CPU, just a lot of disk space, and this one had very little disk and a pretty decent CPU. Combining the two workloads, a load balancer and a NAS, on one box is a good fit.
Typically, you’d pick up TrueNAS or Unraid and install that. But in this case, since the server had to serve two purposes, I wanted to stick with Ubuntu Server. So it made sense to go with RAID software on Ubuntu rather than a dedicated NAS OS.
I initially started with mdadm, but later learnt that ZFS does checksumming. Or rather, the other way around: I realized that mdadm doesn’t do checksumming.
How can you build reliable storage without block checksums?! — I ask sarcastically
That sealed the decision in favor of ZFS. The project seems stable from what I could see, even though Linus Torvalds doesn’t want to merge the ZFS filesystem code into the Linux kernel, because … Oracle likes lawsuits1. Lawsuits aside, the OpenZFS project has active contributions2 and has been in development for a long time.
Before we get started, let’s familiarize ourselves with some basic concepts of ZFS:
- ZFS Pool: A pool is a collection of one or more virtual devices.
- Virtual Device (vdev): A vdev can be a file, a disk, a RAID array, a spare disk, a log, or a cache.
- Write Hole: When the parity data doesn’t match the data on the other drives, and you can’t determine which drive has the right data.
- RAID-Z: A variation of RAID 5 implemented by ZFS, which provides the write atomicity (via “copy-on-write”) required to avoid write holes.
- ZFS Intent Log (ZIL): A write-ahead log used to record operations to disk before they are written to the pool. If a log vdev is set, ZFS uses that, so it makes sense to use an NVMe drive as the log vdev (also called SLOG, for separate intent log; just a fancy term).
- ZFS Cache: ZFS uses ARC (Adaptive Replacement Cache) to speed up reads out of RAM. It also has a level-2 ARC, which can house data evicted from the ARC. An NVMe drive can be set as a cache vdev for this purpose.
Something that I found surprising about ZFS compared to mdadm is that it doesn’t allow expanding an array by adding a single disk — until recently3. The reasoning has been that adding a new drive puts strain on the array and elevates the chance of other disks failing. Plus, waiting days for the array to rebuild isn’t practical for businesses.
So, the (original) idea is to expand a pool by adding a whole new RAID array (vdev) to it, instead of adding a single disk to an existing RAID vdev. The new array can be of any RAID level. Hence, a pool is a collection of virtual devices.
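That expansion model can be sketched as follows. This is a hypothetical example, assuming a pool named lake already exists; the device paths are placeholders:

```shell
# Sketch: expand the pool by adding a whole new raidz1 vdev, rather than
# growing the existing one. Device names here are placeholders.
sudo zpool add lake raidz1 \
    /dev/disk/by-id/ata-DISK4 \
    /dev/disk/by-id/ata-DISK5 \
    /dev/disk/by-id/ata-DISK6

# The pool now stripes new writes across both raidz1 vdevs.
sudo zpool status lake
```

The existing vdev keeps its data; ZFS simply starts allocating new blocks across both vdevs.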
So, let’s start with ZFS. Our goal is to put together a 3-disk RAIDZ array, the equivalent of RAID level 5, which uses one disk’s worth of capacity for parity, allowing the loss of any single disk out of the three. Now, the thing is, one of the disks (Disk-Z) actually contains the data I want to keep. So, I’m going to create the array with a “fake device”, copy the data over from Disk-Z, then replace the fake device with Disk-Z.
Use Serial IDs of Disks
I’m using the serial numbers to identify the drives, and I want ZFS to use those instead of the sd* device names allocated by Linux. Note that one of my disks, sdc, actually has the data that I want to copy over. To find the serial IDs of the drives, run:
$ lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE
NAME        SIZE SERIAL   LABEL          FSTYPE
sda        12.7T XXXXXXXX
sdb        12.7T YYYYYYYY
sdc        12.7T ZZZZZZZZ
└─sdc1     12.7T          outserv-zero:0 linux_raid_member
  └─md127  12.7T                         ext4
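To map those serial numbers to the stable paths that ZFS will store, you can list /dev/disk/by-id. The names shown here are placeholders:

```shell
# The by-id symlinks embed the drive model and serial number, and stay
# stable across reboots, unlike sda/sdb/sdc. Names are placeholders.
ls -l /dev/disk/by-id/ | grep -v part
# lrwxrwxrwx 1 root root 9 ... ata-MODEL_XXXXXXXX -> ../../sda
# lrwxrwxrwx 1 root root 9 ... ata-MODEL_YYYYYYYY -> ../../sdb
```

These ata-* paths are what we’ll pass to zpool create below.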
Detour: Remove disk from Linux RAID
/dev/sdb was part of the Linux RAID array, along with /dev/sdc. To remove it from the array, I ran:
$ sudo mdadm --manage /dev/md127 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md127

$ sudo mdadm --detail /dev/md127
...

$ sudo mdadm --manage /dev/md127 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md127
Create a Fake Drive
Next, we want to get the disk size of the drives. In this case, all three of my drives are the same size, which is what you’d ideally want.
$ sudo fdisk -l /dev/sda
Disk /dev/sda: 12.73 TiB, 14000519643136 bytes, 27344764928 sectors
Disk model: XXXXXXXXXXXXXXXX
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

$ sudo fdisk -l /dev/sdb
Disk /dev/sdb: 12.73 TiB, 14000519643136 bytes, 27344764928 sectors
Disk model: YYYYYYYYYYYYYYYY
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
/dev/sdc contains the data that I want, and is currently mounted as a Linux RAID. So, let’s first create a fake drive to stand in for it. Even though it appears to be the same size as a 14 TB drive, it’s a sparse file, so it won’t actually consume any space on disk.
$ truncate -s 14000519643136 /tmp/tmp.img
$ ls -al /tmp/tmp.img
-rw-rw-r-- 1 out out 14000519643136 Sep 14 02:27 /tmp/tmp.img
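If you want to confirm that a truncated image really is sparse, du can compare the apparent size with the blocks actually allocated. A quick sketch with a smaller 1 GiB file, so it’s cheap to run:

```shell
# Create a sparse image: apparent size is 1 GiB, but no blocks are
# allocated until data is actually written into the file.
truncate -s 1G /tmp/sparse.img

# Apparent size: reports the full 1 GiB (1073741824 bytes).
du --apparent-size --block-size=1 /tmp/sparse.img

# Actual allocation: close to 0 bytes, since the file is all holes.
du --block-size=1 /tmp/sparse.img

rm /tmp/sparse.img
```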
Create a ZFS Pool
Let’s create a ZFS pool named lake, and immediately take the fake device offline. I’m using raidz1, which tolerates the loss of a single disk. While creating the pool, I had to use -f to overwrite the ata-YYY disk, which was previously part of the Linux RAID array.
$ sudo zpool create -f lake raidz1 /dev/disk/by-id/ata-XXX /dev/disk/by-id/ata-YYY /tmp/tmp.img
$ sudo zpool offline lake /tmp/tmp.img

$ sudo zpool status
  pool: lake
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
config:

        NAME              STATE     READ WRITE CKSUM
        lake              DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            ata-XXX       ONLINE       0     0     0
            ata-YYY       ONLINE       0     0     0
            /tmp/tmp.img  OFFLINE      0     0     0

errors: No known data errors

# Note the Solaris tagged filesystem on the disks.
$ sudo fdisk -l
...
Disk /dev/sda: 12.73 TiB, 14000519643136 bytes, 27344764928 sectors
Disk model: XXXXXXXXXXXXXXXX
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Device           Start          End      Sectors  Size Type
/dev/sda1         2048  27344746495  27344744448 12.7T Solaris /usr & Apple ZFS
/dev/sda9  27344746496  27344762879        16384    8M Solaris reserved 1
The above command also mounts the newly created pool at /lake, deriving the mount point from the name of the pool.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
lake             26T   58G   26T   1% /lake
Turn on LZ4 Compression
Next, let’s turn on the default lz4 compression on the pool. Having LZ4 compression is highly recommended4.
Set compression=lz4 on your pools’ root datasets so that all datasets inherit it unless you have a reason not to enable it. Userland tests of LZ4 compression of incompressible data in a single thread has shown that it can process 10GB/sec, so it is unlikely to be a bottleneck even on incompressible data. Furthermore, incompressible data will be stored without compression such that reads of incompressible data with compression enabled will not be subject to decompression. Writes are so fast that in-compressible data is unlikely to see a performance penalty from the use of LZ4 compression. The reduction in IO from LZ4 will typically be a performance win.
$ sudo zfs get compression lake
NAME  PROPERTY     VALUE  SOURCE
lake  compression  off    default

$ sudo zfs set compression=on lake

$ sudo zfs get compression lake
NAME  PROPERTY     VALUE  SOURCE
lake  compression  on     local

# Shows lz4 is active and working.
$ zpool get all lake | grep compress
lake  feature@lz4_compress   active   local
lake  feature@zstd_compress  enabled  local

# Initially compressratio would start at 1.00x.
$ sudo zfs get compressratio lake
NAME  PROPERTY       VALUE  SOURCE
lake  compressratio  1.00x  -
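Once data lands on the pool, you can also compare logical versus physical usage to see exactly how much space compression is saving; a small sketch against the lake pool:

```shell
# logicalused is the amount of data as written by applications;
# used is what that data occupies on disk after lz4 compression.
sudo zfs get used,logicalused,compressratio lake
```

The gap between the two values is the on-disk savings, which should line up with the reported compressratio.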
I have big files, largely doing sequential reads and writes. A record size of 1 MB is recommended for such workloads5.
$ sudo zfs get recordsize lake
NAME  PROPERTY    VALUE  SOURCE
lake  recordsize  128K   default

$ sudo zfs set recordsize=1M lake

$ sudo zfs get recordsize lake
NAME  PROPERTY    VALUE  SOURCE
lake  recordsize  1M     local
Moreover, write performance in this array can be expected to be 2x that of a single disk, while read performance should be 3x (to be tested). As per6:
If sequential writes are of primary importance, raidz will outperform mirrored vdevs. Sequential write throughput increases linearly with the number of data disks in raidz while writes are limited to the slowest drive in mirrored vdevs. Sequential read performance should be roughly the same on each.
Copy Data Over with Rsync
In my case, I have around 6 TB of data. At 200 MBps, this is an overnight operation.
# This ran for ~7 hrs.
$ rsync -rtvh --progress /data/ /lake/

# compressratio is now 1.18x.
$ sudo zfs get compressratio lake
NAME  PROPERTY       VALUE  SOURCE
lake  compressratio  1.18x  -

# Compression reduced disk usage by 0.7 TB.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/md127       13T  5.1T  7.0T  42% /data
lake             26T  4.4T   21T  18% /lake
Remove Linux RAID and Assign to ZFS
Once the data copy is done, we can swap out the fake device for the real one. Remember, the tmp.img device is already offline.
Let’s first remove Linux RAID from Disk-Z.
$ sudo umount /data

$ sudo mdadm --stop /dev/md127
mdadm: stopped /dev/md127

$ sudo mdadm --remove /dev/md127
mdadm: error opening /dev/md127: No such file or directory

$ lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE
NAME     SIZE SERIAL   LABEL          FSTYPE
...
sdc     12.7T ZZZZZZZZ
└─sdc1  12.7T          outserv-zero:0 linux_raid_member

$ sudo mdadm --zero-superblock /dev/sdc1

$ lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE
NAME     SIZE SERIAL   LABEL  FSTYPE
sdc     12.7T ZZZZZZZZ
└─sdc1  12.7T                 ext4

$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
unused devices: <none>
Now, we can assign this drive to ZFS.
$ sudo zpool replace lake /tmp/tmp.img /dev/disk/by-id/ZZZ

$ sudo zpool status
  pool: lake
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Sep 14 14:47:26 2022
        6.57T scanned at 1.60G/s, 2.06T issued at 513M/s, 6.57T total
        696G resilvered, 31.30% done, 02:33:40 to go
config:

        NAME                                     STATE     READ WRITE CKSUM
        lake                                     DEGRADED     0     0     0
          raidz1-0                               DEGRADED     0     0     0
            ata-XXXXXXXXXXXXXXXXXXXXXXXXXXXXX    ONLINE       0     0     0
            ata-YYYYYYYYYYYYYYYYYYYYYYYYYYYYY    ONLINE       0     0     0
            replacing-2                          DEGRADED     0     0     0
              /tmp/tmp.img                       OFFLINE      0     0     0
              ata-ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ  ONLINE       0     0     0  (resilvering)

errors: No known data errors
This starts the “resilvering” process, i.e., copying the data onto the newly added drive. Once this process finishes, the ZFS pool should be healthy.
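While the resilver runs, its progress can be followed without re-typing the status command; a quick sketch:

```shell
# Refresh zpool status every 30 seconds to follow the resilver's
# percent-done and time-remaining estimates.
watch -n 30 zpool status lake

# Or watch per-device bandwidth during the resilver, sampled every 5s.
zpool iostat -v lake 5
```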
In fact, the whole resilvering process finished within 3.5 hours, which was pretty impressive considering the rsync copy had run for 7 hours.
$ zpool status
  pool: lake
 state: ONLINE
  scan: resilvered 2.18T in 03:30:32 with 0 errors on Wed Sep 14 18:17:58 2022
config:

        NAME                                   STATE     READ WRITE CKSUM
        lake                                   ONLINE       0     0     0
          raidz1-0                             ONLINE       0     0     0
            ata-XXXXXXXXXXXXXXXXXXXXXXXXXXXXX  ONLINE       0     0     0
            ata-YYYYYYYYYYYYYYYYYYYYYYYYYYYYY  ONLINE       0     0     0
            ata-ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ  ONLINE       0     0     0

errors: No known data errors
Test Read/Write Throughput
At 550 MBps, the array achieved 2x the read speed of a single disk, and a write speed of 440 MBps. Note that /dev/random can only output at 500 MiBps on my machine. While /dev/zero can output at 8 GiBps, compression would render it useless for testing write throughput.
$ cat /lake/some/data | pv -trab > /dev/null
23.2GiB 0:00:42 [ 556MiB/s] [ 556MiB/s]

$ dd if=/dev/random of=/lake/test.img bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 23.7674 s, 441 MB/s
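An alternative to the cat/dd approach is fio, which generates buffers that aren’t easily compressible (so lz4 won’t inflate the numbers). A sketch, assuming fio is installed; note the read numbers can be inflated by the ARC unless the test size exceeds RAM:

```shell
# Sequential write test: 1M blocks to match the pool's recordsize,
# with an fsync at the end so buffered data is counted.
fio --name=seqwrite --filename=/lake/fio.test --rw=write --bs=1M \
    --size=10G --ioengine=psync --end_fsync=1

# Sequential read of the same file. With 64 GB RAM, a 10G file may be
# partially served from the ARC rather than the disks.
fio --name=seqread --filename=/lake/fio.test --rw=read --bs=1M \
    --size=10G --ioengine=psync

rm /lake/fio.test
```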
Add Cache and SLOG
My server has 64 GB of RAM. It’s generally recommended to add an L2ARC (level 2 ARC) cache only if there’s plenty of primary cache, which 64 GB qualifies as.
I also realized some of my workloads might do small writes repeatedly. So, a fast device for the intent log made sense too. Backing both is a 500 GB NVMe drive.
I didn’t have any partitions on the drive, and I didn’t want to muck with it. So, instead, I created two files on it and used them as the log and cache devices.
$ sudo mkdir /cache

$ sudo dd if=/dev/zero of=/cache/write bs=1M count=131072
131072+0 records in
131072+0 records out
137438953472 bytes (137 GB, 128 GiB) copied, 196.709 s, 699 MB/s

$ sudo dd if=/dev/zero of=/cache/read bs=1M count=262144
262144+0 records in
262144+0 records out
274877906944 bytes (275 GB, 256 GiB) copied, 860.288 s, 320 MB/s

$ sudo zpool add lake log /cache/write
$ sudo zpool add lake cache /cache/read

$ sudo zpool status
  pool: lake
 state: ONLINE
  scan: scrub canceled on Wed Sep 14 18:52:48 2022
config:

        NAME                                   STATE     READ WRITE CKSUM
        lake                                   ONLINE       0     0     0
          raidz1-0                             ONLINE       0     0     0
            ata-XXXXXXXXXXXXXXXXXXXXXXXXXXXXX  ONLINE       0     0     0
            ata-YYYYYYYYYYYYYYYYYYYYYYYYYYYYY  ONLINE       0     0     0
            ata-ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ  ONLINE       0     0     0
        logs
          /cache/write                         ONLINE       0     0     0
        cache
          /cache/read                          ONLINE       0     0     0

errors: No known data errors
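If the file-backed log and cache ever need to go away (say, to hand dedicated NVMe partitions to ZFS instead), both vdev types can be detached from a live pool; a sketch:

```shell
# Log and cache vdevs can be removed without affecting pool data.
sudo zpool remove lake /cache/write   # drop the SLOG
sudo zpool remove lake /cache/read    # drop the L2ARC

# Confirm they no longer appear under logs/cache.
sudo zpool status lake
```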
Create File Systems
We can create multiple file systems within the pool7 using the zfs create command. In this example, I create a file system outserv under the lake pool. It gets automatically mounted at /lake/outserv. This filesystem can have its own properties, like compression and recordsize, separate from the pool’s defaults.
$ sudo zfs create lake/outserv
$ sudo zfs set recordsize=16K lake/outserv
$ sudo zfs get recordsize lake/outserv
$ sudo zfs get compressratio lake/outserv

# To delete the file system:
$ sudo zfs destroy lake/outserv
Now that you have divided your workloads into separate filesystems, you can snapshot them independently. Snapshots are a good way to ensure that if you accidentally delete files, you can get them back. They can also be used for taking backups.
$ sudo zfs snapshot lake/outserv@Dec12.2022

# To create snapshots for all filesystems recursively:
$ sudo zfs snapshot -r lake@Dec12.2022

# To destroy older snapshots:
$ sudo zfs destroy lake/outserv@Dec12.2022

# To rename a snapshot:
$ sudo zfs rename lake/outserv@Dec12.2022 lake/outserv@today
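Getting files back out of a snapshot works either through the hidden .zfs directory or by rolling the filesystem back wholesale; a sketch using the snapshot names above ("some-file" is a placeholder):

```shell
# Browse a snapshot read-only via the hidden .zfs directory, and copy
# a single file back out.
ls /lake/outserv/.zfs/snapshot/Dec12.2022/
cp /lake/outserv/.zfs/snapshot/Dec12.2022/some-file /lake/outserv/

# Or revert the whole filesystem to the snapshot. This discards all
# changes made after the snapshot was taken.
sudo zfs rollback lake/outserv@Dec12.2022
```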
Share and Mount as NFS
Let’s share /lake as NFS:
$ sudo apt install nfs-kernel-server
$ sudo zfs set sharenfs='rw' lake

$ sudo zfs get sharenfs lake
NAME  PROPERTY  VALUE  SOURCE
lake  sharenfs  rw     local
To mount this NFS directory on a client machine:
$ sudo apt install nfs-common
$ sudo mount -t nfs SERVER_IP_ADDR:/lake /lake
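To make the client mount survive reboots, an /etc/fstab entry can be added; a sketch, with the server address kept as a placeholder:

```shell
# Append an fstab entry so the NFS share mounts at boot. _netdev tells
# the system to wait for the network before attempting the mount.
echo 'SERVER_IP_ADDR:/lake /lake nfs defaults,_netdev 0 0' | \
    sudo tee -a /etc/fstab

# Mount everything in fstab now to verify the entry works.
sudo mount -a
```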
Finally, put together a cron job to scrub the ZFS pool on the 1st of every month, and mail the status output to my email 13 hours later.
$ sudo crontab -e

# Add the following lines:
0 0 1 * * /sbin/zpool scrub lake
0 13 1 * * /sbin/zpool status | (echo "Subject: Zpool Status\n"; cat -) | (ssh some-machine-with-msmtp 'msmtp email@example.com')
The Zettabyte File System
Alright. We now have a fully functional ZFS setup, ready to go. Using 3x 14 TB drives, we got 26 TiB of space, with the ability to lose one drive without losing data. We achieved filesystem-level compression of 1.18x over 5 TB of data, reducing our disk usage by 0.7 TB compared to the original source.
This is all pretty impressive stuff. Most importantly, all the data now has checksums, so ZFS can detect corruption in any block on any drive and fix it via the scrub step.
Update (Apr 27, 2023): ZFS scare
Today, after rearranging a few disks, suddenly my ZFS pool stopped working.
$ sudo zpool import -f
   pool: nile
     id: 5480319285548755054
  state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
This was a HUGE scare for me. The pool contains a lot of data. It turns out the issue was that mdadm was still configured for a few drives which I had removed from the system. Because mdadm identifies drives by their relative location in the system (/dev/sdX), and those drives had been removed, their locations got reassigned to some of the ZFS drives. So, instead of realizing that its drives were gone, mdadm took over the ZFS drives, causing ZFS to error out.
This was plenty scary for me. So, I removed mdadm completely from the system.
$ sudo mdadm --stop /dev/md127
$ sudo apt remove --purge mdadm
After doing this, zpool could be properly imported and started working again. Phew!
There’s no place for mdadm in a reliable Linux box. Adios for good this time!