ZFS


ZFS is different from a traditional filesystem in that it acts as a volume management system (like LVM) as well as handling the filesystem layer above (like ext4). The filesystem was originally developed by Sun Microsystems for the Solaris operating system, but an open source version maintained by the OpenZFS project has since been ported to FreeBSD, macOS, and Linux.

Cheat sheet

A quick reference of commonly used ZFS commands.

Description Command
Create a new zpool that spans all disks
# zpool create storage \
   /dev/disk1 /dev/disk2
Create a mirrored pool
# zpool create storage \
   mirror /dev/disk1 /dev/disk2
Create a raidz1 pool. Also available are raidz2 and raidz3.
# zpool create storage \
   raidz1 /dev/disk1 /dev/disk2 /dev/disk3
Create a striped mirror (similar to RAID 10)
# zpool create storage \
   mirror /dev/disk1 /dev/disk2 \
   mirror /dev/disk3 /dev/disk4
Create a pool with a specific ashift.

9 = 512B, 12 = 4K, 13 = 8K

# zpool create storage -o ashift=12 \
   /dev/disk1 /dev/disk2
Find the ashift value of a pool
# zdb storage | grep ashift
Enable and set compression for a data set
# zfs set compression=zstd storage
Create a new snapshot
# zfs snapshot storage@now
Rollback to a specific snapshot
# zfs rollback storage@now
Destroy a snapshot
# zfs destroy storage@now
List all snapshots in a zpool
# zfs list -t snapshot storage
Remove a zpool from the system
# zpool export storage
Import a zpool with a different name.
# zpool import storage storage-new
Import all zpools found on any disks
# zpool import -a
Show ARC summary
# arc_summary -s arc

Installation

ZFS is included with Solaris and FreeBSD out of the box.

Linux

On Linux, you can either use FUSE or compile a third-party kernel module available from the ZFS on Linux project at http://zfsonlinux.org/. For specific distributions, installation is made easier by pre-packaged DKMS-enabled packages which automate the building and loading of the kernel module.

Install ZFS on CentOS

Refer to the documentation at: https://openzfs.github.io/openzfs-docs/Getting%20Started/RHEL-based%20distro/index.html

In summary, enable the EPEL and ZFS repositories by installing the epel-release and the zfs-release packages. Proceed to install the zfs package which will automatically pull in the required dependencies and trigger the kernel module build using dkms.

# dnf install https://zfsonlinux.org/epel/zfs-release-2-2$(rpm --eval "%{dist}").noarch.rpm
# dnf install -y epel-release
# dnf install -y kernel-devel
# dnf install -y zfs

Installing the zfs package should also automatically build ZFS using DKMS for your kernel. Run dkms status to see whether zfs is installed.

# dkms status
zfs/2.1.9, 4.18.0-425.10.1.el8_7.x86_64, x86_64: installed

If you need to rebuild the zfs module for whatever reason, use the dkms build zfs/$version command (such as dkms build zfs/2.0.5). If you need to target a specific kernel version, you can also specify the kernel using the -k flag, like so: dkms build -m zfs -v 0.6.5 -k 4.2.5-300.fc23.x86_64. After building the dkms module, install it by running dkms install -m zfs -v $version.
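
For example, to rebuild and reinstall the module for the currently running kernel (the version here matches the dkms status output above; substitute your own):

# dkms build zfs/2.1.9
# dkms install zfs/2.1.9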

ArchLinux

# pacman -Sy base-devel linux-headers
# su alarm
$ gpg --keyserver pool.sks-keyservers.net --recv-keys 4F3BA9AB6D1F8D683DC2DFB56AD860EED4598027
$ curl https://aur.archlinux.org/cgit/aur.git/snapshot/zfs-linux.tar.gz > zfs-linux.tar.gz
$ tar -xzf zfs-linux.tar.gz
$ cd zfs-linux
$ makepkg

Introduction

A few fundamental concepts to understand when using ZFS are: devices, vdevs, zpools, and datasets. A quick summary of each:

  • A device can be any block device installed on the system. It can be an SSD, a traditional HDD, or even a file.
  • A vdev represents one or more devices in ZFS and uses one of several layouts: single device, mirror, RAIDz1, RAIDz2, or RAIDz3.
  • A zpool contains one or more vdevs. It treats all vdevs like a JBOD and distributes data depending on factors such as load and utilization.
  • Datasets are created within zpools. They are similar to volumes but behave like an already formatted filesystem and can be mounted to a mount point on the system.

Zpool

A zpool contains one or more vdevs in any configuration. Within a zpool, ZFS treats all member vdevs similar to disks in a JBOD but distributes data evenly. As a consequence, a failure of any member vdev will result in the failure of the zpool.

As a side note, recovery of data from a simple zpool is unlike that of a JBOD due to how data is evenly distributed on all vdevs. See: https://www.klennet.com/notes/2018-12-20-no-jbod-in-zfs-mostly.aspx

When creating a zpool, the virtual device configuration must be given.

Virtual Device

The concept of a virtual device, or vdev, encapsulates one or more physical storage devices. There are several types of vdev in ZFS:

  1. device - One or more physical disks or partitions on the system. The capacity of the vdev is the sum of all underlying devices. Beware that there is no redundancy or fault tolerance.
  2. file - An absolute path to a disk image.
  3. mirror - similar to a RAID 1 where all blocks are mirrored across all devices, providing high performance and fault tolerance. A mirror can survive any failure so long as at least one device remains healthy. Capacity limited by the smallest disk. Read performance is excellent since data can be retrieved from all storage devices simultaneously.
  4. raidz1/2/3 - similar to a RAID 5 or RAID 6. The number represents how many disk failures can be tolerated.
  5. spare - A hot spare that can be used as a temporary replacement. The pool's autoreplace property (disabled by default) must be enabled for a spare to be used automatically when a device fails.
  6. cache - A device used for level 2 adaptive read cache (L2ARC), more on this later
  7. log - A device for ZFS Intent Log (ZIL), more on this later

Virtual devices in ZFS have some limitations you must keep in mind:

  • Vdevs cannot shrink. You can only destroy and re-create a vdev at a smaller size.
  • You may not add additional devices to a raidz to expand it.
  • You may only grow a raidz by replacing each storage device with one of larger capacity.

Creating a zpool

Create a zpool and its corresponding vdevs using the zpool create command.

# zpool create [-fnd] [-o property=value] ... \
              [-O file-system-property=value] ... \
              [-m mountpoint] [-R root] ${POOL_NAME}  ( ${POOL_TYPE} ${DISK} ) ...

Parameters for zpool create are:

  • -f - Force
  • -n - Display creation but don't create pool
  • -d - Do not enable any features unless specified
  • -o - Set a pool property
  • -O - Set a property on root filesystem
  • -m - Mount point
  • -R - Set an alternate root location
  • POOL_NAME - the name of the pool
  • POOL_TYPE + DISK - one or more vdev configuration

When specifying devices on Linux, it is recommended to use disk IDs; in fact, this is the recommendation made by the ZoL project. Device names such as /dev/sdX are not reliable because they can change when udev rules change, which can prevent your pool from importing properly on startup.
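
For example, a mirrored pool created with persistent /dev/disk/by-id paths (the IDs below are placeholders; use the ones listed on your system):

# ls /dev/disk/by-id/
# zpool create storage \
   mirror /dev/disk/by-id/ata-WDC_WD40EFRX_SERIAL1 /dev/disk/by-id/ata-WDC_WD40EFRX_SERIAL2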

Pool Properties

Some pool properties can only be set when the pool is first created. Important ones to keep in mind are:

  • ashift=N. For disks with advanced format (AF) where sector sizes are 4K, you must set ashift=12 manually to avoid performance degradation. Other ashift values are covered below.
  • copies=N, where N is the number of copies. For important data, you may specify additional copies of the data to be written. In the event of corruption, the additional copies help safeguard against data loss.

Advanced Format Sector Alignment

When creating a pool, ensure that the sector alignment value is set appropriately for the underlying storage devices. The sector size used in I/O operations is expressed as a power-of-two exponent referred to as the ashift value. Incorrectly setting the sector alignment may cause write amplification (such as using 512B sectors on disks with a 4KiB AF format).

Set the ashift value depending on the underlying storage devices outlined below.

Device                               ashift value
Hard drives with 512B sectors        9
Flash media / HDD with 4K sectors    12
Flash media / HDD with 8K sectors    13
Amazon EC2                           12

Once a pool is created, the ashift value can be obtained from zdb:

# zdb | grep ashift
            ashift: 12

A simple pool

To create a simple zpool named storage, similar to a RAID 0/JBOD:

# zpool create storage \
   /dev/disk1 /dev/disk2

A mirrored pool

To create a zpool with a mirrored vdev

# zpool create storage \
   mirror /dev/disk1 /dev/disk2

To create a zpool that stripes data across two mirrors, specify two mirrors:

# zpool create storage \
   mirror /dev/disk1 /dev/disk2 \
   mirror /dev/disk3 /dev/disk4

To add or remove a disk from the mirror, use the zpool attach and zpool detach commands.

For example, to add a new disk (gpt/disk03) to an existing mirror:

## We start off with a 2 disk mirror
# zpool status storage
        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0

## Note that the existing device can be either gpt/disk01 or gpt/disk02.
# zpool attach storage gpt/disk02  gpt/disk03

# zpool status storage
        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0
            gpt/disk03  ONLINE       0     0     0
			
## Remove a device from a mirror vdev with zpool detach.
# zpool detach storage gpt/disk02
# zpool status storage
        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk03  ONLINE       0     0     0


A raidz pool

Another example using RAIDz1:

# zpool create storage \
   raidz1 /dev/disk/by-id/device01-part1 /dev/disk/by-id/device02-part1 /dev/disk/by-id/device03-part1

Once created, you can see the status of the pool using zpool status.

# zpool status
  pool: storage
  state: ONLINE
  scan: none requested
 config:
 
         NAME            STATE     READ WRITE CKSUM
         storage         ONLINE       0     0     0
           raidz1-0      ONLINE       0     0     0
             gpt/disk01  ONLINE       0     0     0
             gpt/disk02  ONLINE       0     0     0
             gpt/disk03  ONLINE       0     0     0
 
 errors: No known data errors

A file based zpool

vdevs can be backed by a file or disk image. This is useful when testing.

# for i in {1..4}; do dd if=/dev/zero of=/tmp/file$i bs=1G count=4 &> /dev/null; done
# zpool create storage \
   /tmp/file1 /tmp/file2 /tmp/file3 /tmp/file4

A hybrid zpool

A zpool can be created out of a combination of the different vdevs. Mix and match to your liking.

# zpool create storage \
   mirror /dev/disk1 /dev/disk2 \
   mirror /dev/disk3 /dev/disk4 \
   log mirror /dev/disk5 /dev/disk6 \
   cache /dev/disk7

Managing zpools

List zpools using zpool list:

# zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data     8.12T  5.42T  2.71T         -    50%    66%  1.00x  ONLINE  -
storage  9.06T  5.93T  3.13T         -    13%    65%  1.00x  ONLINE  -

The FRAG value is the average fragmentation of available space. There is no defragment operation in ZFS. If you wish to decrease the fragmentation in your pool, consider transferring your pool data to another location and then back with zfs send and zfs recv to rewrite entire chunks of data.
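
A rough sketch of that approach, assuming a second pool named scratch with enough free space (dataset names and snapshot labels are illustrative):

# zfs snapshot -r storage/data@defrag
# zfs send -R storage/data@defrag | zfs recv scratch/data

## After verifying the copy, destroy the original and send it back
# zfs destroy -r storage/data
# zfs send -R scratch/data@defrag | zfs recv storage/data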

Expanding an existing zpool

A zpool can be expanded by either adding additional vdevs to the pool or by expanding an underlying vdev.

Adding an additional vdev to a zpool is akin to adding additional disks to a JBOD to grow its size. The zpool grows in capacity by as much as the capacity of the vdev that is being added. To add an additional vdev to a zpool, use the zpool add command. For example:

# Adding 2 1TB disks mirrored will add an additional 1TB to the pool
# zpool add pool  mirror /dev/disk1 /dev/disk2

Expanding the underlying vdev used by a zpool is another way to increase a zpool's capacity. For mirror and raidz vdevs, the size can only increase if all underlying storage devices grow. On raidz vdevs, this involves replacing each member disk and resilvering one disk at a time. On larger arrays, this may not be practical as it requires resilvering as many times as there are disks in the vdev.

The zpool will only grow once every disk has been replaced and when the zpool has the autoexpand=on option set. If you forgot to set the autoexpand option before replacing all disks, you can still expand the pool by enabling autoexpand and bringing a disk online:

## After replacing all disks in a vdev, the zpool still shows the same size
# zpool list storage
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  9.06T  8.35T   732G     4.50T    33%    92%  1.00x  ONLINE  -

## Set autoexpand=on on the zpool and bring one of the devices in the affected vdev online again.
# zpool set autoexpand=on storage
# zpool online -e storage ata-Hitachi_HUS724030ALE641_P8GH7GVR

## The zpool should now pick up on the new vdev size and expand accordingly
# zpool list storage
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  13.6T  8.35T  5.28T         -    22%    61%  1.00x  ONLINE  -


Verifying data integrity

Data can be silently corrupted by faulty hardware, from failing sectors to bad memory, or through a fault in the ZFS implementation. To safeguard against data corruption, every block is checksummed (fletcher4 by default, with stronger algorithms such as SHA-256 available). A verification pass over every block, called a ZFS scrub, can then be used to ensure the correctness of each block. This check can be done while the system is online, though it may degrade performance. A scrub can be triggered manually with the zpool scrub command. Because the scrub checks each block, the amount of time required depends on the amount of data and the speed of the underlying storage devices. During a scrub, ZFS will attempt to fix any errors that are discovered.

To initiate a scrub, run:

# zpool scrub storage

To stop a scrub process, run:

# zpool scrub -s storage

Once the scrub process is underway, you can view its status by running:

# zpool status storage
   pool: storage
  state: ONLINE
  scan: scrub in progress since Mon Dec  3 23:54:53 2012
     18.3G scanned out of 6.05T at 211M/s, 8h20m to go
     0 repaired, 0.30% done
 config:
 
         NAME            STATE     READ WRITE CKSUM
         storage         ONLINE       0     0     0
           raidz1-0      ONLINE       0     0     0
             gpt/disk00  ONLINE       0     0     0
             gpt/disk01  ONLINE       0     0     0
             gpt/disk02  ONLINE       0     0     0
             gpt/disk03  ONLINE       0     0     0
             gpt/disk04  ONLINE       0     0     0
 
 errors: No known data errors

Any errors that have been found during the scrub will be fixed automatically. The zpool status shows 3 columns:

  • READ: IO errors while reading
  • WRITE: IO errors while writing
  • CKSUM: checksum errors that were found during a read.

ZFS datasets

Datasets are similar to filesystem volumes and are created within a ZFS zpool. Unlike a traditional filesystem, however, datasets do not need to be created with a particular size; they can use as much storage as the pool has available, though this can be limited with quotas if desired. This design allows for flexible storage layouts for many applications. Individual datasets can also be snapshotted, which is covered under the snapshot section below.

The following sections will go over ZFS dataset management and dataset properties.

Creating a dataset

Creating a new dataset is as simple as:

# zfs create zpool/dataset-name

To list all ZFS datasets:

# zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
storage                      153K  1.11T   153K  /storage
storage/logs                 153K  1.11T   153K  /storage/logs

Certain parameters can be set during the creation process with the -o flag, or set after creation using the zfs set command.
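
For example (the dataset name and values are illustrative):

## Set properties at creation time
# zfs create -o compression=zstd -o quota=100G storage/logs

## ... or change them later
# zfs set quota=200G storage/logs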

Dataset options and parameters

Like zpool properties, each ZFS dataset also contains properties that can be used to fine tune the filesystem to your storage needs.

Dataset properties are inherited from the parent zpool or parent datasets, or can be overridden. For example, if compression is enabled on the parent zpool, each ZFS dataset will inherit the compression setting, which can be overridden by setting the compression property on the dataset.

Custom user-defined properties can also be created. These have no effect on the filesystem and are merely used to annotate, tag, or label a dataset for applications designed around ZFS. Custom properties must include a ':' to distinguish them from the native dataset properties.
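
For example, a hypothetical backup:frequency user property used to tag a dataset:

# zfs set backup:frequency=daily storage/logs
# zfs get backup:frequency storage/logs

## Remove the custom property by inheriting it from the parent
# zfs inherit backup:frequency storage/logs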

Getting properties

To get a ZFS property, use the zfs get property-name zpool-name command. Multiple properties can be retrieved by using a comma separated list. For example:

# zfs get compressratio,available storage
NAME                        PROPERTY       VALUE  SOURCE
storage                     compressratio  1.01x  -
storage                     available      1.00G  -

The 'source' column specifies where the value originated from. For datasets that are inheriting properties from other datasets, this field will specify its source.

Some useful properties are:

  • all - to get all properties
  • compressratio - get the compression ratio of the dataset
  • readonly - Whether the dataset is read only
  • compression=X - the compression algorithm to use. Defaults to 'off' and depending on the system, valid choices are "on", "off", "lzjb", "gzip", "gzip-N", "zle", and "zstd".
  • copies=N - how many times data within this dataset are written
  • mountpoint=X - The dataset mount point

Setting properties

Properties are set using the zfs set property=value zpool-name command. Here are some examples.

To change a ZFS mount point:

# zfs set mountpoint=/export storage

To change the ZFS dataset compression:

# zfs set compression=zstd storage

Transferring and receiving datasets

You can transfer a snapshot of a ZFS dataset using the send and receive commands.

Using SSH:

# zfs send  zones/UUID@snapshot | ssh root@10.10.11.5 zfs recv zones/UUID

Using Netcat:

## On the source machine
# zfs send data/linbuild@B | nc -w 20 phoenix 8000

## On the destination machine
# nc -w 120 -l 8000 | pv | zfs receive -F data/linbuild

Using Bash /dev/tcp to send appears to have lower overhead than using netcat to send:

## On the source machine
# zfs send data/linbuild@B > /dev/tcp/phoenix/8000

## On the destination machine, use netcat to listen
# nc -w 120 -l 8000 | pv | zfs receive -F data/linbuild

ZFS volumes

A ZVOL is a ZFS volume that is exposed to the system as a block device. Like a dataset, a ZVOL can be snapshotted, scrubbed, compressed, and deduped.

Create a ZVOL

A ZVOL can be created with the same zfs create command but with the addition of the -V option followed by the volume size. For example:

# zfs create -V 1G storage/disk1

ZVOLs are exposed under the /dev/zvol path.

Using ZVOLs

Since a ZVOL is just a block device, you can:

  • Make a filesystem (like ext4) on it (with mkfs)
  • Use it as a swap space (with mkswap)

One interesting feature with using ZVOLs is that ZFS can provide transparent compression. This means if the volume has compression enabled, your swap or the filesystem is automatically compressed.
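
As a rough example, a compressed ZVOL used as swap space (the dataset name and size are illustrative):

# zfs create -V 4G -o compression=zstd storage/swap
# mkswap /dev/zvol/storage/swap
# swapon /dev/zvol/storage/swap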

Snapshots

ZFS snapshots create a copy of the filesystem at the exact moment of creation. They can be created on entire zpools or on specific datasets. Creating a snapshot is near-instantaneous and requires no additional storage.

While snapshots take no additional space initially, keep in mind that because of the copy-on-write nature of ZFS, storage use will increase as the data begins to diverge from the snapshot. As a consequence, keeping many old snapshots around on a filesystem with constant change can quickly reduce the amount of available space. Also remember that deleting data that has been snapshotted will not reclaim space until the snapshot is deleted.

You may have up to 2^64 snapshots. Each snapshot name may have up to 88 characters.

Creating snapshots

Snapshots are referred to by their name and follow this syntax:

  • Entire zpool snapshot: zpool@snapshot-name
  • Individual dataset snapshot: zpool/dataset@snapshot-name

Create a new snapshot with the zfs snapshot command followed by the snapshot name. Eg: zfs snapshot zpool/dataset@snapshot-name.
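
For example (names are illustrative):

## Snapshot a single dataset
# zfs snapshot storage/logs@2023-08-08

## Use -r to snapshot a dataset and all of its children atomically
# zfs snapshot -r storage@2023-08-08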

Listing snapshots

Snapshots can be listed by running zfs list -t snapshot.

# zfs list -t snapshot
NAME               USED  AVAIL  REFER  MOUNTPOINT
storage@20120820  31.4G      -  3.21T  -
storage@20120924   134G      -  4.15T  -
storage@20121028  36.2G      -  4.26T  -
storage@20121201  33.2M      -  4.55T  -

The USED column shows the amount of space used by the snapshot. This amount will go up as files from the snapshot are deleted since the space freed cannot be reclaimed until the snapshot is deleted.

The REFER column shows the amount of data referenced by the snapshot, i.e. the size of the dataset at the time the snapshot was taken.

To quickly get a dataset's snapshots ordered by when they were made, list them recursively with -r and sort by name with -s name (date-stamped snapshot names sort chronologically). Passing in only the fields you need (-o name) will speed this operation up.

# zfs list -t snapshot -o name -s name -r data/home
NAME
data/home@zbk-daily-20170502-003001
data/home@zbk-daily-20170503-003001
data/home@zbk-daily-20170504-003001
data/home@zbk-daily-20170505-003001

## Get the most recent snapshot name
# zfs list -t snapshot -o name -s name -r data/home | tail -n 1 | awk -F@ '{print $2}'
zbk-daily-20170505-003001

Using snapshots and rollback

Snapshot contents can be accessed through a special .zfs/snapshot/ directory. Each snapshot will contain a read-only copy of the data that existed when the snapshot was taken.

# ls /storage/.zfs/snapshot/
20120820/ 20120924/ 20121028/ 20121201/ today/

To rollback to a specific snapshot, run zfs rollback storage@yesterday. This will restore your dataset to the snapshot state.

# zfs rollback storage@yesterday

Destroying snapshots

To destroy a snapshot, use the zfs destroy command as you would to destroy a zpool or dataset. Destroying a snapshot will fail if the snapshot has any dependents, such as a clone created from it (see the example below).

# zfs destroy storage@today
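
If the snapshot does have dependent clones, the -R flag destroys them along with the snapshot. A dry run with -nv first shows what would be removed (the snapshot name is illustrative):

# zfs destroy -nv -R storage@today
# zfs destroy -R storage@today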

Holding a snapshot

Use the zfs hold command to hold a snapshot. Once held, the snapshot cannot be deleted until it is released. This is useful to avoid accidental deletion.
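
For example (the hold tag 'keep' is arbitrary):

# zfs hold keep storage@today
# zfs holds storage@today
# zfs release keep storage@today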

ZFS Clones

ZFS clones are writable filesystems created from a snapshot. You must destroy the cloned filesystem (or promote it, as shown below) before the source snapshot can be removed.

To create a cloned dataset 'yesterday' from a specific snapshot:

# zfs clone storage/test@yesterday storage/yesterday

Clones can be destroyed like any other dataset, using the zfs destroy command. Eg. zfs destroy storage/yesterday.
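
If you want to keep the clone but discard the original dataset, zfs promote reverses the dependency so the origin snapshot (and its dataset) can then be destroyed instead. Continuing the example above:

# zfs promote storage/yesterday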

Automated snapshots

Check out zfs-snap.

Administration

Enabling on Startup

Linux

On systemd based systems, you will need to enable the following services in order to have the zpool loaded and mounted on start up:
  • zfs-import-cache
  • zfs-mount
  • zfs-share

FreeBSD

Enable the zfs service on startup by appending the following to /etc/rc.conf:

zfs_enable="YES"

Systemd Service Side Notes
The systemd service files should be located at /usr/lib/systemd/system/.

Enable the services by running systemctl enable service-name.

There is an issue where, on reboot, the systemd services do not run properly, which results in the ZFS pools not being imported. The fix is to run:

# systemctl preset zfs-import-cache zfs-import-scan zfs-mount zfs-share zfs-zed zfs.target


Linux without systemd

On non-systemd Linux distros, check for any startup scripts in /etc/init.d. If you're doing everything manually, you may need to create a file in either /etc/modules-load.d or /etc/modprobe.d so that the ZFS module gets loaded.

To only load the module, create a file at /etc/modules-load.d/zfs.conf containing:

zfs

To load the module with options, create a file at /etc/modprobe.d/zfs.conf containing (for example):

options zfs zfs_arc_max=4294967296


At-Rest Encryption

At-rest encryption is a newer ZFS feature which can be enabled with zpool set feature@encryption=enabled <pool>. Data is written using authenticated encryption (AEAD) ciphers such as AES-CCM and AES-GCM, configurable through dataset properties. Encryption can also be enabled or disabled per dataset using the -o encryption=[on|off] flag.

User-defined keys can be inherited or set manually for each dataset and can be loaded from different sources in various formats. This key is used to encrypt the master key, which is generated randomly and is never exposed to the user directly. Data encryption is done using this master key, which allows the user-defined key to be changed without re-encrypting the data on the dataset.

For dedup to work, the ciphertext must match for the same plaintext. This is achieved by using the same salt and IV, generated from an HMAC of the plaintext.

What is and isn't encrypted is listed below.

Encrypted:

  • File data and metadata
  • ACLs, names, permissions, attrs
  • Directory listings
  • All Zvol data
  • FUID Mappings
  • Master encryption keys
  • All of the above in the L2ARC
  • All of the above in the ZIL

Unencrypted:

  • Dataset / snapshot names
  • Dataset properties
  • Pool layout
  • ZFS Structure
  • Dedup tables
  • Everything in RAM

The ZFS version in use must support encryption, and the feature must be enabled on the pool for any of this to work:

# truncate -s 1G block
# zpool create test ./block
# zpool set feature@encryption=enabled test

Then, create an encrypted dataset by passing in these additional parameters to the zfs create command.

Parameter Description
-o encryption=.. Controls the ciphersuite (cipher, key length and mode).

The default is aes-256-ccm, which is used if you specify -o encryption=on.

-o keyformat=.. Controls what format the encryption key will be provided as and where it should be loaded from.

Valid options include:

  • passphrase
  • hex
  • raw
-o keylocation=.. Controls where the key is loaded from. It can be provided via a user prompt, which appears when you first create the dataset, when you mount it, or when you load the key manually (zfs load-key), or it can be read from a file.

Valid options include:

  • prompt
  • file:///dev/shm/enckey
-o pbkdf2iters=.. Only used if a passphrase is used (-o keyformat=passphrase). It controls the number of PBKDF2 iterations used for key stretching. Higher is better as it slows down potential dictionary attacks on the password.

The default is -o pbkdf2iters=100000.

keystatus This is a read-only property, not something you set, and is included here for reference. Possible values are:
  • off - Not encrypted
  • available - key is loaded
  • unavailable - key is unavailable and data cannot be read

For example, to create an encrypted dataset using only default values and a passphrase:

$ zfs create \
    -o encryption=on \
    -o keylocation=prompt \
    -o keyformat=passphrase \
    test/enc1

# This will ask you to enter/confirm a password.

Because child datasets inherit the parameters of their parent, a child dataset will also be encrypted using the same properties as the parent. For example, running zfs create test/enc1/also-encrypted will create a new encrypted dataset with the same key source and encryption method.

Encryption properties can be read as any other ZFS properties.

# zfs get -p encryption,keystatus,keysource,pbkdf2iters

If a ZFS key is not available, it can be provided using the zfs load-key pool/dataset command. Attempting to mount an encrypted dataset without a valid key will also prompt you for a key.

A key can be unloaded using zfs unload-key pool/dataset after it has been unmounted.
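
A typical unlock/lock cycle using the dataset from the example above:

# zfs load-key test/enc1
# zfs mount test/enc1

# zfs unmount test/enc1
# zfs unload-key test/enc1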

NFS Export

FreeBSD

To share a ZFS pool via NFS on a FreeBSD system, ensure that you have the following in your /etc/rc.conf file.

mountd_enable="YES"
rpcbind_enable="YES"
nfs_server_enable="YES"
mountd_flags="-r -p 735"
zfs_enable="YES"

mountd is required for exports to be loaded from /etc/exports. You must also either reload or restart mountd every time you make a change to the exports file in order to have it reread.
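
For example, after editing /etc/exports (the mountd rc script on FreeBSD should accept a reload command; sending mountd a SIGHUP works as well):

# service mountd reload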

Set the sharenfs property using the zfs utility. To share it with anyone, set sharenfs to 'on' or provide a network that the export can be accessed from. Eg:

# zfs set sharenfs=off data
# zfs set sharenfs="-network 172.17.12.0/24" data/linbuild
# zfs get sharenfs
NAME                      PROPERTY  VALUE                    SOURCE
data                      sharenfs  off                      local
data/linbuild             sharenfs  -network 172.17.12.0/24  local
data/linbuild/centos      sharenfs  -network 172.17.12.0/24  inherited from data/linbuild
data/linbuild/fedora      sharenfs  -network 172.17.12.0/24  inherited from data/linbuild
data/linbuild/scientific  sharenfs  -network 172.17.12.0/24  inherited from data/linbuild

Add the paths you wish to export to /etc/exports and reload mountd.

By setting the sharenfs property, your system will automatically create an export for the zfs pool using mountd. By default, the NFS share will only be accessible locally.

# showmount -e
Exports list on localhost:
/data                           Everyone

If you want to restrict the share to a specific network, you can specify the network when setting the sharenfs property:

# zfs sharenfs="-network 10.1.1.0/24" storage
# showmount -e
Exports list on localhost:
/storage                           10.1.1.0

By the way, the exports are stored in /etc/zfs/exports and not in the usual /etc/exports. The ZFS and mountd service must be started for it to work. Therefore, you'll also need to append to /etc/rc.conf the following line:

mountd_enable="YES"

On Linux

To share a ZFS dataset via NFS, you may either do it traditionally by manually editing /etc/exports and exportfs, or by using ZFS's sharenfs property and managing shares using the zfs share and zfs unshare commands.

To use the traditional method of exportfs and /etc/exports, set sharenfs=off.
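
A minimal sketch of the traditional approach (the dataset and paths are examples):

# zfs set sharenfs=off storage/test
# echo '/storage/test *(ro,no_subtree_check)' >> /etc/exports
# exportfs -ra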

Sharing nested file systems
To share a nested dataset, use the crossmnt option. This will let you 'recursively' share datasets under the specified path by automatically mounting the child file system when accessed.

Eg. If I have multiple datasets under /storage/test:

# cat /etc/exports
/storage/test *(ro,no_subtree_check,crossmnt)
Cross mounting for nested datasets is only necessary if you use exportfs. This is not required when using ZFS share which is discussed next.


To use ZFS's automatic NFS exports, set sharenfs=on to allow world read/write access to the share. To control access based on network, add a rw=subnet/netmask option, e.g. rw=@192.168.1.0/24. Managing shares in this manner updates the ZFS sharetab file located at /etc/dfs/sharetab. The sharetab file is used by the zfs-share service. If for some reason the sharetab is out of sync with what is actually configured, you can force an update by running zfs share -a or by restarting the zfs-share service in systemd.
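
For example (the network and dataset names are illustrative):

# zfs set sharenfs='rw=@192.168.1.0/24' storage/test
# zfs share -a
# cat /etc/dfs/sharetab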


Handling Drive Failure

When a drive fails, you will see something similar to:

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 2h31m with 0 errors on Fri Jun  3 19:20:15 2016
config:

        NAME        STATE     READ WRITE CKSUM
        data        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     UNAVAIL      3   222     0  corrupted data

errors: No known data errors

The device sdg failed and was removed from the pool.

Dell Server Info
Since this happened on a server, I removed the failed disk from the machine and replaced it with another one.

Because this server uses a Dell RAID controller, and these Dell RAID controllers can't do a pass-through, each disk that is part of the ZFS array is its own Raid0 vdisk. When inserting a new drive, the vdisk information needs to be re-created via OpenManage before the disk comes online.

Follow the steps on the Dell OpenManage after inserting the replacement drive if this applies to you.

Once the vdisk is recreated, it should show up as the old device name again.


Once the replacement disk is installed on the system, reinitialize the drive with the GPT label.

On a FreeBSD system: I reinitialized the disks using the geometry settings described above.

# gpart create -s GPT da4
da4 created
# gpart add -b 2048 -s 3906617520 -t freebsd-zfs -l disk03 da4
da4p1 added

On a ZFS on Linux system:

# parted /dev/sdg
GNU Parted 2.1
Using /dev/sdg
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel GPT
Warning: The existing disk label on /dev/sdg will be destroyed and all data on this disk will be lost. Do
you want to continue?
Yes/No? y
(parted) quit

To replace the offline disk, use zpool replace pool old-device new-device.

# zpool replace data sdg /dev/sdg
# zpool status
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 10 16:36:35 2017
    35.5M scanned out of 3.03T at 3.55M/s, 249h19m to go
    5.45M resilvered, 0.00% done
config:

        NAME             STATE     READ WRITE CKSUM
        data             DEGRADED     0     0     0
          raidz2-0       DEGRADED     0     0     0
            sdb          ONLINE       0     0     0
            sdc          ONLINE       0     0     0
            sdd          ONLINE       0     0     0
            sde          ONLINE       0     0     0
            sdf          ONLINE       0     0     0
            replacing-5  UNAVAIL      0     0     0
              old        UNAVAIL      3   222     0  corrupted data
              sdg        ONLINE       0     0     0  (resilvering)

errors: No known data errors

After the resilvering completes, you may need to manually detach the failed drive before the pool comes back 'online'.

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 2.31T in 22h11m with 54 errors on Sun Oct  1 16:07:14 2017
config:

        NAME                                         STATE     READ WRITE CKSUM
        data                                         DEGRADED     0     0    54
          raidz1-0                                   DEGRADED     0     0   109
            replacing-0                              UNAVAIL      0     0     0
              13841903505263088383                   UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST3000DM001-1CH166_Z1F4HGS6-part1
              ata-ST4000DM005-2DP166_ZGY0956M-part1  ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F4HHZ7-part1    ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH01SND-part1    ONLINE       0     0     0
        cache
          pci-0000:02:00.0-ata-3-part1               ONLINE       0     0     0

errors: 26 data errors, use '-v' for a list

# zpool detach data 13841903505263088383



Clearing Data Errors

If you get data errors:

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 509G in 8h24m with 0 errors on Sat Oct 12 02:33:07 2019
config:

        NAME                                        STATE     READ WRITE CKSUM
        data                                        DEGRADED     0     0     0
          raidz2-0                                  DEGRADED     0     0     0
            scsi-3600188b04c5ec1002533815852b8193e  ONLINE       0     0     0
            sdb                                     FAULTED      0     0     0  too many errors
            sdc                                     ONLINE       0     0     0
            sdd                                     ONLINE       0     0     0
            sde                                     ONLINE       0     0     0
            sdf                                     ONLINE       0     0     0

If the device is faulted because of a temporary issue, you can bring it back online using the zpool clear command. This will cause the pool to resilver only data since it dropped offline and depending on the amount of data, should be relatively quick.

# zpool clear data sdb

Devices that are unavailable may be brought offline and then back online using zpool offline pool disk and zpool online pool disk. An onlined disk could still be faulted and can be restored with zpool clear.

## Take the device offline, then do whatever you need to the disk
# zpool offline data sdb

## Then bring it back online
# zpool online data sdb
warning: device 'sdb' onlined, but remains in faulted state
use 'zpool clear' to restore a faulted device
# zpool clear data sdb

If you do not care about data integrity, you could also clear errors by initiating a scrub and then cancelling it immediately.

# zpool scrub data
# zpool scrub -s data


Tuning and Monitoring

ZFS ARC

The ZFS Adaptive Replacement Cache (ARC) is an in-memory cache managed by ZFS to help improve read speeds by caching frequently accessed blocks in memory. With ARC, files accessed after the first time can be read from memory rather than from disk. The primary beneficiaries are workloads with heavy random file access, such as databases.

The ZFS ARC size from the ZFS on Linux implementation defaults to half the host's available memory and may decrease when available memory gets too low. Depending on the system setup, you may want to change the maximum amount of memory allocated to ARC.

To change the maximum ARC size, edit the ZFS zfs_arc_max kernel module parameter:

# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4294967296

or change the value on the fly:

# echo size_in_bytes >> /sys/module/zfs/parameters/zfs_arc_max

You may check the current ARC usage with arc_summary -s arc, by reading /proc/spl/kstat/zfs/arcstats, or with the arcstat.py script that is part of the zfs package.

# cat /proc/spl/kstat/zfs/arcstats
p                               4    1851836123
c                               4    4105979840
c_min                           4    33554432
c_max                           4    4294967296
size                            4    4105591928
hdr_size                        4    55529696
data_size                       4    3027917312
metadata_size                   4    747323904
other_size                      4    274821016
anon_size                       4    4979712
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    848103424
mru_evictable_data              4    605918208
mru_evictable_metadata          4    97727488
mru_ghost_size                  4    3224668160
mru_ghost_evictable_data        4    2267575296
mru_ghost_evictable_metadata    4    957092864
mfu_size                        4    2922158080
mfu_evictable_data              4    2421089280
mfu_evictable_metadata          4    495401984
mfu_ghost_size                  4    835275264
mfu_ghost_evictable_data        4    787349504
mfu_ghost_evictable_metadata    4    47925760
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    46505
memory_indirect_count           4    30062
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1077674616
arc_meta_limit                  4    3113852928
arc_meta_max                    4    1077674616
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    129740800

# arcstat.py 10 10
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
17:41:09     0     0      0     0    0     0    0     0    0   3.9G  3.9G
17:41:19   211    68     32     6    7    62   50    68   32   3.9G  3.9G
17:41:29   146    32     21     6   10    25   30    32   21   3.9G  3.9G
17:41:39   170    50     29     6    9    44   42    50   31   3.9G  3.9G
17:41:49   150    33     22     6    9    27   31    33   22   3.9G  3.9G
17:41:59   151    43     28     4    9    38   39    43   28   3.9G  3.9G


If you are running out of ARC, you might get arc_prune taking up all the CPU. See: https://github.com/zfsonlinux/zfs/issues/4345

# modprobe zfs zfs_arc_meta_strategy=0
# cat /sys/module/zfs/parameters/zfs_arc_meta_strategy

ZFS L2ARC

The ZFS L2ARC is similar to the ARC by providing data caching on faster-than-storage-pool disks such as SLC/MLC SSDs to help improve random read workloads.

Cached data from the ARC are moved to the L2ARC when ARC needs more room for more recently and frequently accessed blocks.

To add or remove a L2ARC device:

## Add /dev/disk/by-path/path-to-disk to zpool 'data'
# zpool add data cache /dev/disk/by-path/path-to-disk

## To remove a device, just remove it like any other disk:
# zpool remove data /dev/disk/by-path/path-to-disk

You can see the usage of the L2ARC by running zpool list -v.

# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data  8.12T  5.72T  2.40T         -    48%    70%  1.00x  ONLINE  -
  raidz1  8.12T  5.72T  2.40T         -    48%    70%
    ata-ST3000DM001-part1      -      -      -         -      -      -
    ata-ST3000DM001-part1      -      -      -         -      -      -
    ata-ST4000DM000-part1      -      -      -         -      -      -
cache      -      -      -         -      -      -
  pci-0000:02:00.0-ata-3-part1  55.0G  28.3G  26.7G         -     0%    51%

ZFS ZIL and SLOG

ZIL

The ZFS Intent Log (ZIL) is how ZFS keeps track of synchronous write operations so that they can be completed or rolled back after a crash or failure. The ZIL is not used for asynchronous writes, as those are still handled through the system's caches.

Because the ZIL is stored in the data pool, a synchronous write to a ZFS pool involves duplicate writes: once to the ZIL and again to the zpool. This is detrimental to performance since one write operation now requires two or more writes (i.e. write amplification).

SLOG

To improve performance, the ZIL can be moved to a separate device called a Separate Intent Log (SLOG), so that synchronous writes are logged to the SLOG instead of the main pool, thereby avoiding the write amplification on the pool's disks.

The storage device used by the SLOG should ideally be very fast and also reliable. The size of the SLOG doesn't need to be that big either - a couple gigabytes should be sufficient.

For example, assuming that data is flushed every 5 seconds, on a single gigabit connection the ZIL usage shouldn't be any more than ~600MB (1 Gbit/s is roughly 125 MB/s, times 5 seconds is about 625 MB).

For these reasons, SLC SSDs are preferable to MLC SSDs for SLOG.
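
Adding and removing a SLOG works like managing any other vdev (the device paths are placeholders, and the mirror-N name used for removal is whatever zpool status reports for the log vdev):

## Add a mirrored SLOG to an existing pool
# zpool add storage log mirror /dev/disk/by-id/nvme-SLOG1 /dev/disk/by-id/nvme-SLOG2

## Remove it again by referencing the log vdev's name from zpool status
# zpool remove storage mirror-1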

Troubleshooting

ZFS: removing nonexistent segment from range tree

One morning, a single 8TB zpool decided to hang and stop working. After restarting the system, zpool import hangs and the kernel logs show the following:

[Tue Aug  8 09:39:15 2023] PANIC: zfs: removing nonexistent segment from range tree (offset=2cb7c938000 size=68000)    
[Tue Aug  8 09:39:15 2023] Showing stack for process 3061
[Tue Aug  8 09:39:15 2023] CPU: 0 PID: 3061 Comm: z_wr_iss Tainted: P           OE    --------- -  - 4.18.0-425.19.2.el8_7.x86_64 #1
[Tue Aug  8 09:39:15 2023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[Tue Aug  8 09:39:15 2023] Call Trace:
[Tue Aug  8 09:39:15 2023]  dump_stack+0x41/0x60
[Tue Aug  8 09:39:15 2023]  vcmn_err.cold.0+0x50/0x68 [spl]                                                            
[Tue Aug  8 09:39:15 2023]  ? bt_grow_leaf+0xbd/0x160 [zfs]                                                            
[Tue Aug  8 09:39:15 2023]  ? pn_free+0x30/0x30 [zfs]
[Tue Aug  8 09:39:15 2023]  ? zfs_btree_insert_leaf_impl+0x21/0x40 [zfs]                                               
[Tue Aug  8 09:39:15 2023]  ? bt_shrink_leaf+0x8c/0xa0 [zfs]                                                           
[Tue Aug  8 09:39:15 2023]  ? pn_free+0x30/0x30 [zfs]
[Tue Aug  8 09:39:15 2023]  ? zfs_btree_find+0x183/0x300 [zfs]                                                         
[Tue Aug  8 09:39:15 2023]  zfs_panic_recover+0x6f/0x90 [zfs]                                                          
[Tue Aug  8 09:39:15 2023]  range_tree_remove_impl+0xabd/0xfa0 [zfs]                                                   
[Tue Aug  8 09:39:15 2023]  space_map_load_callback+0x22/0x90 [zfs]                                                    
[Tue Aug  8 09:39:15 2023]  space_map_iterate+0x1ae/0x3c0 [zfs]                                                        
[Tue Aug  8 09:39:15 2023]  ? spa_stats_destroy+0x190/0x190 [zfs]                                                      
[Tue Aug  8 09:39:15 2023]  space_map_load_length+0x64/0xe0 [zfs]                                                      
[Tue Aug  8 09:39:15 2023]  metaslab_load.part.26+0x13e/0x810 [zfs]                                                    
[Tue Aug  8 09:39:15 2023]  ? _cond_resched+0x15/0x30
[Tue Aug  8 09:39:15 2023]  ? spl_kmem_alloc+0xd9/0x120 [spl]                                                          
[Tue Aug  8 09:39:15 2023]  metaslab_activate+0x4b/0x230 [zfs]                                                         
[Tue Aug  8 09:39:15 2023]  ? metaslab_set_selected_txg+0x89/0xc0 [zfs]                                                
[Tue Aug  8 09:39:15 2023]  metaslab_group_alloc_normal+0x166/0xb30 [zfs]                                              
[Tue Aug  8 09:39:15 2023]  metaslab_alloc_dva+0x24c/0x8d0 [zfs]                                                       
[Tue Aug  8 09:39:15 2023]  ? vdev_disk_io_start+0x3d6/0x900 [zfs]                                                     
[Tue Aug  8 09:39:15 2023]  metaslab_alloc+0xc5/0x250 [zfs]                                                            
[Tue Aug  8 09:39:15 2023]  zio_dva_allocate+0xcb/0x860 [zfs]                                                          
[Tue Aug  8 09:39:15 2023]  ? spl_kmem_alloc+0xd9/0x120 [spl]                                                          
[Tue Aug  8 09:39:15 2023]  ? zio_push_transform+0x34/0x80 [zfs]                                                       
[Tue Aug  8 09:39:15 2023]  ? zio_io_to_allocate.isra.8+0x5f/0x80 [zfs]                                                
[Tue Aug  8 09:39:15 2023]  zio_execute+0x90/0xf0 [zfs]
[Tue Aug  8 09:39:15 2023]  taskq_thread+0x2e1/0x510 [spl]
[Tue Aug  8 09:39:15 2023]  ? wake_up_q+0x70/0x70
[Tue Aug  8 09:39:15 2023]  ? zio_taskq_member.isra.11.constprop.17+0x70/0x70 [zfs]                                    
[Tue Aug  8 09:39:15 2023]  ? taskq_thread_spawn+0x50/0x50 [spl]                                                       
[Tue Aug  8 09:39:15 2023]  kthread+0x10b/0x130
[Tue Aug  8 09:39:15 2023]  ? set_kthread_struct+0x50/0x50
[Tue Aug  8 09:39:15 2023]  ret_from_fork+0x35/0x40

I found a GitHub issue from last year (2022) with a similar problem: https://github.com/openzfs/zfs/issues/13483. The workaround, which seems to work, is to enable the following two tunables before importing the zpool with zpool import -f $zpool:

# echo 1 > /sys/module/zfs/parameters/zil_replay_disable
# echo 1 > /sys/module/zfs/parameters/zfs_recover

That seemed to have allowed the pool to import. Kernel logs show the following messages at the time of import:

[   55.358591] WARNING: zfs: removing nonexistent segment from range tree (offset=2cb7c938000 size=68000)
[   55.360567] WARNING: zfs: adding existent segment to range tree (offset=2bb7c938000 size=68000)

The issue could possibly have been caused by a flaky SATA connection, as the disk did appear to go offline a few times in the early hours. The disk was readable when I was diagnosing the issue before rebooting the system. SMART also seems to suggest the disk is healthy.

It's possible something corrupt was written as the zpool is used inside a VM via a QEMU disk passthrough. Perhaps too many layers causing corruption when the underlying SATA link got bumped?

See Also

ZFS encryption guide