ZFS is a file system and logical volume manager.

Unless otherwise noted, the instructions in this article apply to Linux, FreeBSD, and Solaris.

Installation[edit]

ZFS is included with Solaris and FreeBSD. Linux support is offered by various open source projects including the ZFS on Linux project.

Linux[edit]

Download the source or prebuilt packages from ZFS on Linux's website http://zfsonlinux.org/

Packages from ZFS on Linux use DKMS, so the kernel module can be rebuilt when a new kernel is installed. (eg: dkms build -m zfs -v 0.6.5 -k 4.2.5-300.fc23.x86_64)

Introduction[edit]

ZFS stores data in datasets, which live inside storage pools (zpools). Pools are built from one or more virtual devices (vdevs), which are in turn backed by block devices. When creating a zpool, the vdevs can be configured similarly to different RAID levels for different amounts of redundancy. There are also many other options for ZFS, which will be covered later.

Preparing Storage Devices[edit]

A ZFS storage pool can be created from a single disk, a disk partition, or a group of disks (eg. raidz, mirror, more on this later). The following section covers how to partition a disk for use with ZFS, but this step is entirely optional since ZFS can work with whole devices directly. The benefit of using partitions over whole disks is that you control the partition size and can leave buffer space in case replacement disks are slightly smaller in the future.

When partitioning, keep in mind that most modern disks use Advanced Format and partitions should be aligned to at least a 4 KiB boundary; starting the first partition at sector 2048 (1 MiB) on a 512-byte-sector disk satisfies this. A buffer of 200MB at the end of the disk is usually enough. In terms of sectors, that is 200MB * 1024 * 1024 bytes / 512 bytes per sector = 409600 sectors.
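As a quick check, the buffer size in sectors can be computed directly in the shell (assuming 512-byte sectors):

# echo $(( 200 * 1024 * 1024 / 512 ))
409600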

To determine the total number of sectors on your disks:

On Linux, use fdisk -l:
# fdisk -l /dev/sdf
Disk /dev/sdf: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 357F62D3-D1B9-11E3-BE34-00E04C801A50

On FreeBSD, use diskinfo -v:
# diskinfo -v da1
da1
        512             # sectorsize
        10737418240     # mediasize in bytes (10G)
        20971520        # mediasize in sectors
        1048576         # stripesize
        0               # stripeoffset
        1305            # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
                        # Disk ident.
        Not_Zoned       # Zone Mode

Create the partition with the required buffer space at the end: take the total sector count and subtract 409600 sectors.

On Linux, to partition /dev/sda:
# fdisk /dev/sda
## TBD
## Ensure you are using sectors by running 'u'
## Create a single partition starting at sector 2048

> p
Disk /dev/sda: 299.4 GB, 299439751168 bytes
255 heads, 63 sectors/track, 36404 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x3f42ab96

   Device Boot      Start         End      Blocks   Id  System
/dev/sda                1       36405   292420608   83  Linux

On FreeBSD, to partition /dev/ada1, /dev/ada2, and /dev/ada3:
## -b defines the starting sector, -s defines the size in sectors
# gpart create -s GPT /dev/ada1
# gpart add -b 2048 -s 1953113520 -t freebsd-zfs -l disk01 /dev/ada1

# gpart create -s GPT /dev/ada2
# gpart add -b 2048 -s 1953113520 -t freebsd-zfs -l disk02 /dev/ada2

# gpart create -s GPT /dev/ada3
# gpart add -b 2048 -s 1953113520 -t freebsd-zfs -l disk03 /dev/ada3
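
The partition size passed to -s above comes from the disk's total sector count. As a sanity check (assuming a disk that reports 1953525168 sectors, which is typical for a 1TB drive), the size is the total minus the 2048-sector start offset and the 409600-sector buffer:

# echo $(( 1953525168 - 2048 - 409600 ))
1953113520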

ZFS Pool Types[edit]

There are 3 basic types of storage pools: Basic, Mirrored, and Raid-Z. Data can be striped across multiple vdevs by specifying multiple mirror or raidz vdevs (see the compound configurations below).

Basic: Data is striped across all drives. Eg. JBOD, Raid0
# zpool create storage <device 1> <device 2> ...

Mirrored: Data is mirrored across all drives. Eg. Raid1
# zpool create storage mirror <device 1> <device 2> ...

Compound Mirror: Data is striped across all mirrors when two or more mirrors are created. Eg. Raid 1+0
# zpool create storage mirror <device 1> <device 2> ... mirror <device 3> <device 4> ...

Raid-Z: Data is striped across all drives, with one to three parity drives depending on the raidz level. Eg. Raid5, Raid6
# zpool create storage raidz[1-3] <device 1> <device 2> ...

Compound Raid-Z: Data is striped across all raidz vdevs when two or more are created. Eg. Raid 5+0, Raid 6+0
# zpool create storage raidz[1-3] <device 1> <device 2> ... raidz[1-3] <device 3> <device 4> ...
Raid-Z Limitations
It is not possible to add additional disks to, or remove any disks from an existing raidz configuration.

In order to change a Raid-Z configuration, it must be re-created from scratch.

Pool Creation[edit]

After determining the disks and pool type to use, create the pool using the zpool create command.

zpool create [-fnd] [-o property=value] ... \
              [-O file-system-property=value] ... \
              [-m mountpoint] [-R root] ${POOL_NAME} ${POOL_TYPE} ${DISK} ...

While all flags are optional, the required parameters are the pool name, pool type, and the disks to use. Flags:

  • -f - Force
  • -n - Display creation but don't create pool
  • -d - Do not enable any features unless specified
  • -o - Set a pool property
  • -O - Set a property on root filesystem
  • -m - Mount point
  • -R - Set an alternate root location

Some important pool properties will be described in the next section.
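For example, a sketch of a dry run that also sets a pool property and a mount point (the pool layout and device paths here are placeholders); -n prints what would be created without actually creating anything:

# zpool create -n -o ashift=12 -m /storage storage mirror \
    /dev/disk/by-id/diskA /dev/disk/by-id/diskB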

On Linux, it is recommended (by the ZFS on Linux project itself) to use persistent disk IDs such as /dev/disk/by-id/... rather than device names like /dev/sdX, since those names can change across reboots or when udev rules change.

For example, to create a ZFS storage pool named 'storage' across 3 disks using raidz1, we will use zpool create storage raidz1 followed by the devices.

On FreeBSD:
# zpool create \
   storage raidz1 \
   gpt/disk01 gpt/disk02 gpt/disk03

On Linux:
# zpool create \
   storage raidz1 \
   /dev/disk/by-id/device01-part1 \
   /dev/disk/by-id/device02-part1 \
   /dev/disk/by-id/device03-part1


Once created, you can see the status of the pool using zpool status.

# zpool status
  pool: storage
  state: ONLINE
  scan: none requested
 config:
 
         NAME            STATE     READ WRITE CKSUM
         storage         ONLINE       0     0     0
           raidz1-0      ONLINE       0     0     0
             gpt/disk01  ONLINE       0     0     0
             gpt/disk02  ONLINE       0     0     0
             gpt/disk03  ONLINE       0     0     0
 
 errors: No known data errors

Pool Properties[edit]

When creating a storage pool, you may specify certain pool properties. Some properties, such as the ashift value, can only be set at creation time, while others, such as compression, can be changed at any time. However, certain properties only apply to newly written data; for example, enabling compression does not compress existing data until it is rewritten.

Pool properties are set using the -o flag of zpool create, or afterwards with zpool set.

ashift for Advanced Format Disks[edit]

The ashift value specifies the pool's sector size as a power of two and is used to align writes correctly on Advanced Format disks.

Advanced Format Alignment
Make sure that your vdevs have the proper alignment in order to avoid write amplification from blocks straddling physical sectors. Use the following ashift values:

  • Hard drives with 512B sectors: 9
  • Flash media / HDDs with 4K sectors: 12
  • Flash media / HDDs with 8K sectors: 13
  • Amazon EC2: 12
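
Since ashift can only be set when a vdev is created, it is passed to zpool create (supported by ZFS on Linux and FreeBSD; the device names below are placeholders). For example, to force 4 KiB alignment on a new pool:

# zpool create -o ashift=12 storage raidz1 \
    /dev/disk/by-id/disk01 /dev/disk/by-id/disk02 /dev/disk/by-id/disk03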

You may query these parameters after the storage pool's creation with zdb. For instance, to see the ashift value of a storage pool:

# zdb | grep ashift
            ashift: 12


compression for Storage Compression[edit]

For newer ZFS pools (version 5000 and later), lz4 compression should generally be enabled: blocks that do not compress well are stored uncompressed, so the overhead is minimal, and most CPUs can easily keep up with the I/O.
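
For example, to enable lz4 on an existing dataset or pool root (already-written data stays uncompressed until it is rewritten):

# zfs set compression=lz4 storage
# zfs get compression,compressratio storage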

copies=N for Storage Redundancy[edit]

The copies property tells ZFS to maintain multiple copies of each block for redundancy. In case of bad sectors, ZFS can recover the damaged data from another copy, though this does not protect against losing an entire disk.
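
A minimal example, using a hypothetical dataset name; note that extra copies consume a corresponding amount of additional space for new writes:

# zfs set copies=2 storage/important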

Creating ZFS Datasets[edit]

Think of ZFS datasets as partitions of the ZFS pool created in the previous section, but with many additional features. ZFS datasets can be snapshotted, organized into a hierarchy, and given per-dataset options.

Datasets are managed using the zfs utility.

Creating a new dataset is as simple as:

# zfs create zpool/dataset-name

To list all ZFS datasets:

# zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
storage                      153K  1.11T   153K  /storage
storage/logs                 153K  1.11T   153K  /storage/logs

Certain parameters can be set during the creation process with the -o flag, or set after creation using the zfs set command.
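
For example, a sketch that sets properties at creation time and adjusts one afterwards (the dataset name and values are illustrative):

# zfs create -o compression=lz4 -o quota=10G storage/logs
# zfs set quota=20G storage/logs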

Options and Parameters[edit]

Each ZFS dataset inherits the properties of the pool (and parent dataset) it is created in, but these can be overridden on a per-dataset basis using zfs set.

To get a ZFS parameter, use the zfs get command.

# zfs get compressratio storage
NAME                        PROPERTY       VALUE  SOURCE
storage                     compressratio  1.01x  -


In addition to the pool properties described in the previous section, a ZFS dataset can have the following properties set as well.

mountpoint=x for Automatically Mounting[edit]

When the system starts up, ZFS automatically mounts datasets at their configured mountpoint (via zfs mount -a, run by the zfs-mount service or the FreeBSD rc script).

# zfs set mountpoint=/export storage


ZFS Snapshot[edit]

A ZFS snapshot is a read-only copy of a dataset from a previous state. Because ZFS makes use of Copy-On-Write, snapshots take no additional space until a change occurs.

Listing Snapshots[edit]

Snapshots can be listed by running zfs list -t snapshot.

# zfs list -t snapshot
NAME               USED  AVAIL  REFER  MOUNTPOINT
storage@20120820  31.4G      -  3.21T  -
storage@20120924   134G      -  4.15T  -
storage@20121028  36.2G      -  4.26T  -
storage@20121201  33.2M      -  4.55T  -

The USED column shows the amount of space consumed exclusively by the snapshot. This amount grows as files referenced by the snapshot are deleted or overwritten in the live dataset, since that space cannot be reclaimed until the snapshot is destroyed.

The REFER column shows the amount of data the snapshot references, i.e. the size of the dataset at the time the snapshot was taken.

To quickly get a dataset's snapshots ordered by when they were made, sort by name with -s name (which works when snapshot names include a timestamp) and recurse into the dataset with -r. Limiting the output to only the fields you need with -o name speeds the operation up.

# zfs list -t snapshot -o name -s name -r data/home
NAME
data/home@zbk-daily-20170502-003001
data/home@zbk-daily-20170503-003001
data/home@zbk-daily-20170504-003001
data/home@zbk-daily-20170505-003001

## Get the most recent snapshot name
# zfs list -t snapshot -o name -s name -r data/home | tail -n 1 | awk -F@ '{print $2}'
zbk-daily-20170505-003001

Creating Snapshots[edit]

To create a new snapshot, run zfs snapshot storage@snapshot-name.

Snapshot Name Limit
There seems to be a limit of 88 characters for the snapshot name and file system name. (See: https://lists.freebsd.org/pipermail/freebsd-fs/2010-March/007964.html)

ZFS snapshots make use of Copy-On-Write (COW) and use no additional space when created. However, deleting files that are part of an existing snapshot will not reclaim space; instead, the usable capacity appears to shrink, since the amount of space available for new data stays the same.

For example, if a pool with 2TB capacity holds 1TB of data and a 0.5TB file held by a snapshot is deleted, the free space does not increase. Tools will now report roughly 0.5TB used out of 1.5TB, because the 0.5TB still held by the snapshot is no longer counted as part of the live dataset.

If I were to run zfs snapshot storage@today now, listing the snapshots will yield:

# zfs list -t snapshot
NAME               USED  AVAIL  REFER  MOUNTPOINT
storage@20120820  31.4G      -  3.21T  -
storage@20120924   134G      -  4.15T  -
storage@20121028  36.2G      -  4.26T  -
storage@20121201  33.2M      -  4.55T  -
storage@today         0      -  4.58T  -

Accessing Snapshot Contents[edit]

Snapshot contents can be accessed through a special .zfs/snapshot/ directory. Each snapshot will contain a read-only copy of the data that existed when the snapshot was taken.

# ls /storage/.zfs/snapshot/
20120820/ 20120924/ 20121028/ 20121201/ today/

To roll back to a specific snapshot, run zfs rollback storage@yesterday. This restores the dataset to that snapshot's state. Note that only the most recent snapshot can be rolled back to directly; rolling back past more recent snapshots requires the -r flag, which destroys them.

# zfs rollback storage@yesterday

To delete a specific snapshot, run zfs destroy storage@today. Note that this will not work if other datasets depend on it, eg. if the snapshot has been cloned into another dataset.

# zfs destroy storage@today

Data Integrity[edit]

One of the strengths of ZFS is its resiliency thanks to its transactional design. The only way for data stored on a ZFS volume to end up in an inconsistent state is through hardware failure or a fault in the ZFS implementation itself. Similar in spirit to fsck on ext file systems, a ZFS scrub reads every block in the pool, verifies it against its checksum, and repairs it from redundancy where possible. To initiate a scrub, run:

# zpool scrub storage

Once the scrub process is underway, you can view its status by running:

# zpool status storage
   pool: storage
  state: ONLINE
  scan: scrub in progress since Mon Dec  3 23:54:53 2012
     18.3G scanned out of 6.05T at 211M/s, 8h20m to go
     0 repaired, 0.30% done
 config:
 
         NAME            STATE     READ WRITE CKSUM
         storage         ONLINE       0     0     0
           raidz1-0      ONLINE       0     0     0
             gpt/disk00  ONLINE       0     0     0
             gpt/disk01  ONLINE       0     0     0
             gpt/disk02  ONLINE       0     0     0
             gpt/disk03  ONLINE       0     0     0
             gpt/disk04  ONLINE       0     0     0
 
 errors: No known data errors

To stop a scrub process, run:

# zpool scrub -s storage

Administration[edit]

Listing Zpools and Datasets[edit]

List pools using zpool list:

# zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data     8.12T  5.42T  2.71T         -    50%    66%  1.00x  ONLINE  -
storage  9.06T  5.93T  3.13T         -    13%    65%  1.00x  ONLINE  -

The FRAG value is the average fragmentation of the pool's free space.

To list ZFS datasets, use zfs list

# zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
data                        3.61T  1.63T   139K  /data
data/backups                 944G  1.63T   901G  /data/backups
data/crypt                   125G  1.63T   125G  /data/crypt
data/dashcam                84.5G  1.63T  84.5G  /data/dashcam
data/home                    169G  1.63T   153G  /data/home
data/images                  999G  1.63T   999G  /data/images
data/public                 1.11T  1.63T  1.03T  /data/public
data/scratch                 167G  1.63T   158G  /data/scratch
data/vm                     71.2G  1.63T  57.8G  /data/vm
storage                     4.74T  2.27T  26.2G  /storage

To list snapshots, pass -t snapshot

# zfs list -t snapshot
NAME                                                   USED  AVAIL  REFER  MOUNTPOINT
data/backups@2015-Dec-15                               658M      -   588G  -
data/backups@2017-Jan-01                                  0      -   927G  -
data/backups@2017-Jan-08                                  0      -   927G  -
data/backups@2017-Jan-15                              74.6K      -   927G  -
data/backups@2017-Jan-22                               416K      -   927G  -
data/backups@2017-Jan-29                               767K      -   927G  -
data/backups@2017-Feb-05                                  0      -   927G  -
data/backups@2017-Feb-12                                  0      -   927G  -
data/backups@2017-Feb-19                               202K      -   927G  -


To get properties of your ZFS datasets, use the zfs get PROPERTYNAME command:

# zfs get compressratio
NAME  PROPERTY       VALUE  SOURCE
data  compressratio  1.02x  -


Transferring ZFS Datasets[edit]

You can transfer a snapshot of a ZFS dataset using the send and receive commands.

Using SSH:

# zfs send  zones/UUID@snapshot | ssh root@10.10.11.5 zfs recv zones/UUID

Using Netcat:

## On the source machine
# zfs send data/linbuild@B | nc -w 20 phoenix 8000

## On the destination machine
# nc -w 120 -l 8000 | pv | zfs receive -F data/linbuild
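
Incremental transfers are also possible with zfs send -i, which sends only the changes between two snapshots (the snapshot names below are placeholders; the destination must already have the first snapshot):

# zfs send -i data/linbuild@A data/linbuild@B | ssh root@10.10.11.5 zfs recv data/linbuild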

NFS Export[edit]

FreeBSD[edit]

To share a ZFS pool via NFS on a FreeBSD system, ensure that you have the following in your /etc/rc.conf file.

mountd_enable="YES"
rpcbind_enable="YES"
nfs_server_enable="YES"
mountd_flags="-r -p 735"
zfs_enable="YES"

mountd is required for exports to be loaded from /etc/exports. You must also either reload or restart mountd every time you make a change to the exports file in order to have it reread.

Set the sharenfs property using the zfs utility. To share it with anyone, set sharenfs to 'on' or provide a network that the export can be accessed from. Eg:

# zfs set sharenfs=off storage
# zfs set sharenfs="-network 172.17.12.0/24" storage/linbuild
# zfs get sharenfs
NAME                      PROPERTY  VALUE                    SOURCE
data                      sharenfs  off                      local
data/linbuild             sharenfs  -network 172.17.12.0/24  local
data/linbuild/centos      sharenfs  -network 172.17.12.0/24  inherited from data/linbuild
data/linbuild/fedora      sharenfs  -network 172.17.12.0/24  inherited from data/linbuild
data/linbuild/scientific  sharenfs  -network 172.17.12.0/24  inherited from data/linbuild

For any paths not managed through sharenfs, add them to /etc/exports manually and reload mountd.

By setting the sharenfs property, your system will automatically create an export for the ZFS dataset using mountd. With sharenfs=on, the export is not restricted to any particular host:

# showmount -e
Exports list on localhost:
/data                           Everyone

If you want to restrict the share to a specific network, you can specify the network when setting the sharenfs property:

# zfs sharenfs="-network 10.1.1.0/24" storage
# showmount -e
Exports list on localhost:
/storage                           10.1.1.0

Note that these exports are stored in /etc/zfs/exports rather than in the usual /etc/exports. Both the ZFS and mountd services must be running for this to work, which is why /etc/rc.conf needs the following line:

mountd_enable="YES"

On Linux[edit]

To share a ZFS dataset via NFS, you may either do it traditionally by manually editing /etc/exports and exportfs, or by using ZFS's sharenfs property and managing shares using the zfs share and zfs unshare commands.

To use the traditional method of exportfs and /etc/exports, set sharenfs=off.

Sharing nested file systems
To share a nested dataset, use the crossmnt option. This will let you 'recursively' share datasets under the specified path by automatically mounting the child file system when accessed.

Eg. If I have multiple datasets under /storage/test:

# cat /etc/exports
/storage/test *(ro,no_subtree_check,crossmnt)

Cross mounting for nested datasets is only necessary when using exportfs. It is not required when using ZFS's sharenfs, which is discussed next.

To use ZFS's automatic NFS exports, set sharenfs=on to allow world read/write access to the share. To restrict access by network, add an rw=subnet/netmask option instead, eg. rw=@192.168.1.0/24. Managing shares this way updates the ZFS sharetab file located at /etc/dfs/sharetab, which is used by the zfs-share service. If the sharetab ever falls out of sync with what is actually configured, force an update by running zfs share -a or by restarting the zfs-share service in systemd.
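
For example, a sketch that restricts an export to a single subnet and then refreshes the sharetab (the dataset name is illustrative):

# zfs set sharenfs="rw=@192.168.1.0/24" storage/public
# zfs share -a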

Enabling on Startup[edit]

On systemd-based Linux systems, you will need to enable the following services in order to have the zpool imported and mounted on startup:
  • zfs-import-cache
  • zfs-mount
  • zfs-share

On FreeBSD, enable ZFS on startup by appending the following to /etc/rc.conf:
zfs_enable="YES"
Systemd Service Side Notes
The systemd service files should be located at /usr/lib/systemd/system/.

Enable the services by running systemctl enable service-name.

There is a known issue where the systemd services do not run properly on reboot, which results in the ZFS pools not being imported. The fix is to run:

# systemctl preset zfs-import-cache zfs-import-scan zfs-mount zfs-share zfs-zed zfs.target

Linux without systemd[edit]

On non-systemd Linux distros, check for any startup scripts in /etc/init.d. If you're doing everything manually, you may need to create a file in either /etc/modules-load.d or /etc/modprobe.d so that the ZFS module gets loaded.

To only load the module, create a file at /etc/modules-load.d/zfs.conf containing:

zfs

To load the module with options, create a file at /etc/modprobe.d/zfs.conf containing (for example):

options zfs zfs_arc_max=4294967296

Handling Drive Failure[edit]

When a drive fails, you will see something similar to:

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 2h31m with 0 errors on Fri Jun  3 19:20:15 2016
config:

        NAME        STATE     READ WRITE CKSUM
        data        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     UNAVAIL      3   222     0  corrupted data

errors: No known data errors

The device sdg failed and was removed from the pool.

Dell Server Info
Since this happened on a server, I removed the failed disk from the machine and replaced it with another one.

Because this server uses a Dell RAID controller that cannot do pass-through, each disk in the ZFS array is configured as its own Raid0 vdisk. When inserting a new drive, the vdisk needs to be re-created via OpenManage before the disk comes online.

Follow the steps in Dell OpenManage after inserting the replacement drive if this applies to you.

Once the vdisk is recreated, it should show up as the old device name again.

Once the replacement disk is installed on the system, reinitialize the drive with the GPT label.

On a FreeBSD system: I reinitialized the disks using the geometry settings described above.

# gpart create -s GPT da4
da4 created
# gpart add -b 2048 -s 3906617520 -t freebsd-zfs -l disk03 da4
da4p1 added

On a ZFS on Linux system:

# parted /dev/sdg
GNU Parted 2.1
Using /dev/sdg
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel GPT
Warning: The existing disk label on /dev/sdg will be destroyed and all data on this disk will be lost. Do
you want to continue?
Yes/No? y
(parted) quit

To replace the offline disk, use zpool replace pool old-device new-device.

# zpool replace data sdg /dev/sdg
# zpool status
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 10 16:36:35 2017
    35.5M scanned out of 3.03T at 3.55M/s, 249h19m to go
    5.45M resilvered, 0.00% done
config:

        NAME             STATE     READ WRITE CKSUM
        data             DEGRADED     0     0     0
          raidz2-0       DEGRADED     0     0     0
            sdb          ONLINE       0     0     0
            sdc          ONLINE       0     0     0
            sdd          ONLINE       0     0     0
            sde          ONLINE       0     0     0
            sdf          ONLINE       0     0     0
            replacing-5  UNAVAIL      0     0     0
              old        UNAVAIL      3   222     0  corrupted data
              sdg        ONLINE       0     0     0  (resilvering)

errors: No known data errors

After the resilvering completes, you may need to manually detach the failed drive before the pool comes back 'online'.

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 2.31T in 22h11m with 54 errors on Sun Oct  1 16:07:14 2017
config:

        NAME                                         STATE     READ WRITE CKSUM
        data                                         DEGRADED     0     0    54
          raidz1-0                                   DEGRADED     0     0   109
            replacing-0                              UNAVAIL      0     0     0
              13841903505263088383                   UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST3000DM001-1CH166_Z1F4HGS6-part1
              ata-ST4000DM005-2DP166_ZGY0956M-part1  ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F4HHZ7-part1    ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH01SND-part1    ONLINE       0     0     0
        cache
          pci-0000:02:00.0-ata-3-part1               ONLINE       0     0     0

errors: 26 data errors, use '-v' for a list

# zpool detach data 13841903505263088383



Expanding Zpool[edit]

The ZFS zpool can be expanded two ways:

  1. Add additional disks to the zpool or
  2. Swap existing disks with larger disks

Adding Disks[edit]

The first method is done using the zpool add command. Do not confuse this with zpool attach, which attaches a device to an existing vdev (see the following section).

Expanding zpools is irreversible!
The following example expands the pool from 10G to 20G. You will not be able to undo the addition of the second mirror without destroying the zpool.

For example, a mirrored ZFS pool can be doubled in size by adding a second mirror (thereby striping the data across the two mirrors):

root@bsd:/root# zpool create storage mirror gpt/disk01 gpt/disk02
root@bsd:/root# zpool status
  pool: storage
 state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0

errors: No known data errors

root@bsd:/root# zpool add storage mirror gpt/disk03 gpt/disk04
root@bsd:/root# zpool status
  pool: storage
 state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0
          mirror-1      ONLINE       0     0     0
            gpt/disk03  ONLINE       0     0     0
            gpt/disk04  ONLINE       0     0     0

errors: No known data errors

Swapping with Larger Disks[edit]

Ensure that autoexpand=on is set on the pool by running zpool set autoexpand=on storage. Then replace and resilver each disk in the ZFS pool with a larger disk. After every disk has been swapped, the zpool size should increase automatically.

If you did not enable autoexpand before replacing the disks, the pool can still be expanded afterwards by enabling autoexpand and bringing one of the disks online again with zpool online -e.

root@nas:/data/scratch# zpool status storage
  pool: storage
 state: ONLINE
  scan: resilvered 1.67T in 8h9m with 0 errors on Sat Jun  9 08:20:15 2018
config:

        NAME                                      STATE     READ WRITE CKSUM
        storage                                   ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            ata-Hitachi_HUS724030ALE641_P8GH7GVR  ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F4HHZ7       ONLINE       0     0     0
            ata-Hitachi_HUS724030ALE641_P8GBAW7P  ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F4HGS6       ONLINE       0     0     0
            ata-Hitachi_HUS724030ALE641_P8G9NZSR  ONLINE       0     0     0

errors: No known data errors
root@nas:/data/scratch# zpool list storage
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  9.06T  8.35T   732G     4.50T    33%    92%  1.00x  ONLINE  -
root@nas:/data/scratch# zpool set autoexpand=on storage
## Pick a random disk from the pool to bring online.
root@nas:/data/scratch# zpool online -e storage ata-Hitachi_HUS724030ALE641_P8GH7GVR
root@nas:/data/scratch# zpool list storage
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  13.6T  8.35T  5.28T         -    22%    61%  1.00x  ONLINE  -

Attaching device to vdev[edit]

To attach a new device to an existing vdev, use the zpool attach command.

For example, to add a new disk (gpt/disk03) to an existing mirror:

root@bsd:/root# zpool status
  pool: storage
 state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0

errors: No known data errors

root@bsd:/root# zpool attach storage gpt/disk02  gpt/disk03
root@bsd:/root# zpool status
  pool: storage
 state: ONLINE
  scan: resilvered 78.5K in 0h0m with 0 errors on Sat Sep  2 16:22:54 2017
config:

        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0
            gpt/disk03  ONLINE       0     0     0

errors: No known data errors

You can specify any device in the vdev. In the example above, you can specify either gpt/disk01 or gpt/disk02.

Conversely, you can remove a device from a mirror vdev using zpool detach.
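
For example, to shrink the three-way mirror above back to two devices:

# zpool detach storage gpt/disk03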

Clearing Data Errors[edit]

If you get data errors:

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Oct  2 12:00:33 2017
    22.1M scanned out of 7.26T at 3.16M/s, 667h57m to go
    0 repaired, 0.00% done
config:

        NAME                                         STATE     READ WRITE CKSUM
        data                                         DEGRADED     0     0    54
          raidz1-0                                   DEGRADED     0     0   109
            replacing-0                              DEGRADED     0     0     0
              13841903505263088383                   UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST3000DM001-1CH166_Z1F4HGS6-part1
              ata-ST4000DM005-2DP166_ZGY0956M-part1  ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F4HHZ7-part1    ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH01SND-part1    ONLINE       0     0     0
        cache
          pci-0000:02:00.0-ata-3-part1               ONLINE       0     0     0

errors: 26 data errors, use '-v' for a list

You can clear the error by initiating a scrub and then cancelling it.

# zpool scrub data
# zpool scrub -s data

At-Rest Encryption[edit]

At-rest encryption is a newer ZFS feature which can be enabled with zpool set feature@encryption=enabled <pool>. Data is written using authenticated ciphers (AEAD) such as AES-CCM and AES-GCM, configurable through dataset properties. Encryption can be enabled or disabled per dataset using the -o encryption=[on|off] flag at creation time.

User-defined keys can be inherited or set manually for each dataset and can be loaded from different sources in various formats. This key is used to encrypt the master key, which is generated randomly and never exposed to the user directly. Data encryption is done with the master key, which makes it possible to change the user-defined key without re-encrypting the data on the dataset.

For dedup to work, the ciphertext must match for identical plaintext. This is achieved by deriving the salt and IV from an HMAC of the plaintext.

What is and isn't encrypted is listed below.

Encrypted:
  • File data and metadata
  • ACLs, names, permissions, attrs
  • Directory listings
  • All Zvol data
  • FUID Mappings
  • Master encryption keys
  • All of the above in the L2ARC
  • All of the above in the ZIL

Unencrypted:
  • Dataset / snapshot names
  • Dataset properties
  • Pool layout
  • ZFS Structure
  • Dedup tables
  • Everything in RAM

Using ZFS Encryption[edit]

The ZFS implementation in use must support encryption for any of this to work. First, enable the feature on the pool:

# truncate -s 1G block
# zpool create test ./block
# zpool set feature@encryption=enabled test

Then, create an encrypted dataset by passing in these additional parameters to the zfs create command.

-o encryption=..
Controls the ciphersuite (cipher, key length, and mode). The default is aes-256-ccm, which is used if you specify -o encryption=on.

-o keysource=..
Controls the format the encryption key is provided in and where it is loaded from. The key can be formatted as raw bytes, a hex representation, or a user passphrase, and it can be provided via a prompt (shown when you first create the dataset, when you mount it with zfs mount, or when you load the key manually with zfs key -l) or read from a file.

Valid options include:
  • passphrase,prompt
  • hex,file:///dev/shm/enckey
  • raw,file:///dev/shm/enckey

-o pbkdf2iters=..
Only used with a passphrase (-o keysource=passphrase,..). Controls the number of PBKDF2 iterations used for key stretching; higher is better as it slows down dictionary attacks on the passphrase. The default is -o pbkdf2iters=100000.

keystatus
A read-only property (not something you set), included here for reference. Possible values are:
  • off - not encrypted
  • available - key is loaded
  • unavailable - key is unavailable and data cannot be read


For example, to create an encrypted dataset using only default values and a passphrase:

$ zfs create \
    -o encryption=on \
    -o keysource=passphrase,prompt \
    test/enc1

## This will prompt you to enter and confirm a passphrase.

Because child datasets inherit the parameters of their parent, a child dataset will also be encrypted using the same properties. For example, running zfs create test/enc1/also-encrypted will create a new encrypted dataset with the same keysource and encryption method.


Encryption properties can be read as any other ZFS properties.

# zfs get -p encryption,keystatus,keysource,pbkdf2iters

If a ZFS key is not available, it can be provided using the zfs key -l pool/dataset command. Attempting to mount an encrypted dataset without a valid key will also prompt you for a key.

A key can be unloaded using zfs key -u pool/dataset after it has been unmounted.
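
A minimal sketch of unloading and reloading a key for the dataset created above, using the zfs key interface described in this section:

# zfs unmount test/enc1
# zfs key -u test/enc1
## Later, load the key and mount again
# zfs key -l test/enc1
# zfs mount test/enc1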

Tuning and Monitoring[edit]

ZFS ARC[edit]

The ZFS Adaptive Replacement Cache (ARC) is an in-memory cache managed by ZFS to improve read speeds by keeping frequently accessed blocks in memory. With the ARC, repeated file accesses can be served from memory rather than from disk. The primary benefit is for workloads with heavy random reads, such as databases.

The ZFS ARC size in the ZFS on Linux implementation defaults to half of the host's available memory and may shrink when free memory gets too low. Depending on the system, you may want to change the maximum amount of memory allocated to the ARC.

To change the maximum ARC size, edit the ZFS zfs_arc_max kernel module parameter:

# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4294967296

You can check the current ARC usage by reading /proc/spl/kstat/zfs/arcstats or by using the arcstat.py script that ships with the zfs package.

# cat /proc/spl/kstat/zfs/arcstats
p                               4    1851836123
c                               4    4105979840
c_min                           4    33554432
c_max                           4    4294967296
size                            4    4105591928
hdr_size                        4    55529696
data_size                       4    3027917312
metadata_size                   4    747323904
other_size                      4    274821016
anon_size                       4    4979712
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    848103424
mru_evictable_data              4    605918208
mru_evictable_metadata          4    97727488
mru_ghost_size                  4    3224668160
mru_ghost_evictable_data        4    2267575296
mru_ghost_evictable_metadata    4    957092864
mfu_size                        4    2922158080
mfu_evictable_data              4    2421089280
mfu_evictable_metadata          4    495401984
mfu_ghost_size                  4    835275264
mfu_ghost_evictable_data        4    787349504
mfu_ghost_evictable_metadata    4    47925760
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    46505
memory_indirect_count           4    30062
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1077674616
arc_meta_limit                  4    3113852928
arc_meta_max                    4    1077674616
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    129740800

# arcstat.py 10 10
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
17:41:09     0     0      0     0    0     0    0     0    0   3.9G  3.9G
17:41:19   211    68     32     6    7    62   50    68   32   3.9G  3.9G
17:41:29   146    32     21     6   10    25   30    32   21   3.9G  3.9G
17:41:39   170    50     29     6    9    44   42    50   31   3.9G  3.9G
17:41:49   150    33     22     6    9    27   31    33   22   3.9G  3.9G
17:41:59   151    43     28     4    9    38   39    43   28   3.9G  3.9G


If you are running out of ARC metadata space, you might see arc_prune taking up all the CPU (see: https://github.com/zfsonlinux/zfs/issues/4345). One workaround is to change the ARC metadata eviction strategy via the zfs_arc_meta_strategy module parameter:

# modprobe zfs zfs_arc_meta_strategy=0
# cat /sys/module/zfs/parameters/zfs_arc_meta_strategy

ZFS L2ARC[edit]

The ZFS L2ARC extends the ARC by caching data on devices faster than the main storage pool, such as SLC/MLC SSDs, to help improve random read workloads.

Cached data is moved from the ARC to the L2ARC when the ARC needs room for more recently and frequently accessed blocks.

To add or remove a L2ARC device:

## Add /dev/disk/by-path/path-to-disk to zpool 'data'
# zpool add data cache /dev/disk/by-path/path-to-disk

## To remove a device, just remove it like any other disk:
# zpool remove data /dev/disk/by-path/path-to-disk

You can see the usage of the L2ARC by running zpool list -v.

# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data  8.12T  5.72T  2.40T         -    48%    70%  1.00x  ONLINE  -
  raidz1  8.12T  5.72T  2.40T         -    48%    70%
    ata-ST3000DM001-part1      -      -      -         -      -      -
    ata-ST3000DM001-part1      -      -      -         -      -      -
    ata-ST4000DM000-part1      -      -      -         -      -      -
cache      -      -      -         -      -      -
  pci-0000:02:00.0-ata-3-part1  55.0G  28.3G  26.7G         -     0%    51%

ZFS ZIL and SLOG[edit]

ZIL[edit]

The ZFS Intent Log (ZIL) is how ZFS keeps track of synchronous write operations so that they can be completed or rolled back after a crash or power failure. The ZIL is not used for asynchronous writes, which go through the normal system caches.

Because the ZIL is stored in the data pool by default, a synchronous write involves writing to the pool twice: once to the ZIL and again when the data is committed to the pool. This hurts performance, since one logical write now requires two or more physical writes (ie. write amplification).

SLOG[edit]

To improve performance, the ZIL can be moved to a separate device called a Separate Intent Log (SLOG). The intent-log write then lands on the SLOG device while the data is written to the main pool only once, avoiding the double write on the pool disks.

The storage device used for the SLOG should ideally be very fast and reliable. The SLOG does not need to be large either; a couple of gigabytes is usually sufficient.

For example, assuming that data is flushed every 5 seconds, on a single gigabit connection, the ZIL usage shouldn't be any more than ~600MB.

For these reasons, SLC SSDs are preferable to MLC SSDs for SLOG.
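
A SLOG is added to a pool as a log vdev. A minimal sketch with placeholder device paths; mirroring the log device is advisable, since losing an unmirrored SLOG can lose the most recent synchronous writes:

## Add a mirrored SLOG to the 'data' pool
# zpool add data log mirror /dev/disk/by-id/ssd01-part1 /dev/disk/by-id/ssd02-part1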

See Also[edit]

ZFS encryption guide
