Introduction[edit | edit source]

ZFS is different from a traditional filesystem in that it acts as a volume manager (such as LVM) as well as the filesystem layer above it (such as ext4). A few fundamental concepts to understand when using ZFS are: devices, vdevs, zpools, and datasets. A quick summary of each concept is:

  • A device can be any block device installed on the system. It can be an SSD, a traditional HDD, or even a file.
  • A vdev represents one or more devices in ZFS and uses one of five layouts: single device, mirror, RAIDz1, RAIDz2, or RAIDz3.
  • A zpool contains one or more vdevs. It treats all vdevs like a JBOD and distributes data across them depending on factors such as load and utilization.
  • Datasets are created within zpools. They behave like volumes that come already formatted with a filesystem and can be mounted to a mount point on the system.

Storage Device[edit | edit source]

A storage device can be any random-access block device such as a traditional spinning hard drive, a solid state drive, or even a raw file on the system. A storage device can be designated as a hot spare, which ZFS will use as a temporary replacement for any failing device in the pool. Hot spares are meant to be temporary and are intended to be swapped out once the failed device has been permanently replaced.
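For example, a hot spare can be added to an existing pool with zpool add (a sketch; the pool name storage and device /dev/sdh are placeholders):

## Add /dev/sdh as a hot spare to the pool 'storage'
# zpool add storage spare /dev/sdh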

vdev[edit | edit source]

A vdev is a virtual device that is understood by ZFS and consists of one or more storage devices. vdevs serve different purposes in ZFS, the most common being plain storage. A storage vdev can be one of: a single device or file, a mirror, RAIDz1, RAIDz2, or RAIDz3.

A vdev configured to use a single device is backed by a single storage device. It provides no redundancy or fault tolerance and is inherently risky.

Mirrored vdevs mirror every block across all member devices and are typically used for both performance and fault tolerance. Read performance is excellent since data can be retrieved from all devices simultaneously, and a mirror vdev survives any combination of failures as long as at least one device remains healthy. Additional disks can be attached to an existing mirror.

RAIDz vdevs are similar to traditional RAID arrays, with the number indicating how many parity blocks are allocated for each stripe. However, unlike RAID configurations that use dedicated parity disks, RAIDz writes parity evenly across all member disks.

RAIDz Limitations
It is not possible to attach additional disks to or remove any disks from a vdev utilizing a RAIDz configuration.


vdevs can also serve other purposes, which can be ignored for now. These include: spares (hot spares), the intent log (SLOG), deduplication tables, the read cache (L2ARC), and special allocations.

zpool[edit | edit source]

A zpool contains one or more vdevs in any configuration. A zpool treats all member vdevs like disks in a JBOD. As a consequence, a failure of any member vdev will result in the failure of the zpool. Additionally, like a JBOD, vdevs cannot be removed once they are added without first destroying the zpool.



Installation[edit | edit source]

ZFS is included with Solaris and FreeBSD. Linux support is offered by various open source projects including the ZFS on Linux project.

Linux[edit | edit source]

Download the source or prebuilt packages from ZFS on Linux's website http://zfsonlinux.org/

Packages from ZFS on Linux use DKMS, so the kernel module can be rebuilt whenever a new kernel is installed. For example:

# dkms build -m zfs -v 0.6.5 -k 4.2.5-300.fc23.x86_64
# dkms install -m zfs -v 0.6.5 -k 4.2.5-300.fc23.x86_64


Set up storage devices[edit | edit source]

When setting up a storage device, you can either have ZFS use the entire device or a specific partition on it. The main benefit of using a partition is the ability to control how much of the device is available to ZFS. This matters if a future replacement device may be slightly smaller, since replacement devices must have the same or larger capacity; disks from a different manufacturer or model can have a slightly different number of blocks.

If you do partition a disk, keep in mind that most modern disks use Advanced Format with 4 KiB physical sectors. To avoid write amplification, partitions must be aligned to a 4 KiB boundary; starting the first partition at sector 2048 (1 MiB) is a common convention that satisfies this. To avoid having replacement disks be of a slightly different size, partitions should not take up the full size of the disk. A buffer of 200 MB is probably enough. Converted into sectors, 200 MB equates to 409600 sectors (200 MB * 1024 * 1024 bytes / 512 bytes per sector = 200 * 2048 sectors).

To determine the total number of sectors on your disks:

On Linux, use fdisk -l:
# fdisk -l /dev/sdf
Disk /dev/sdf: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 357F62D3-D1B9-11E3-BE34-00E04C801A50
On FreeBSD, use diskinfo -v:
# diskinfo -v da1
da1
        512             # sectorsize
        10737418240     # mediasize in bytes (10G)
        20971520        # mediasize in sectors
        1048576         # stripesize
        0               # stripeoffset
        1305            # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
                        # Disk ident.
        Not_Zoned       # Zone Mode

Create the partition with the required buffer space at the end: take the total sector count, subtract the starting offset, and subtract the 409600-sector buffer.
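For example, a 1 TB disk has 1953525168 sectors; with the partition starting at sector 2048 and a 409600-sector buffer at the end, the partition size is 1953525168 - 2048 - 409600 = 1953113520 sectors, which is the value used in the FreeBSD gpart example below.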

On Linux, to partition /dev/sda with fdisk:
# fdisk /dev/sda
## TBD
## Ensure you are using sectors by running 'u'
## Create a single partition starting at sector 2048

> p
Disk /dev/sda: 299.4 GB, 299439751168 bytes
255 heads, 63 sectors/track, 36404 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x3f42ab96

   Device Boot      Start         End      Blocks   Id  System
/dev/sda                1       36405   292420608   83  Linux
On FreeBSD, to partition /dev/ada1 through /dev/ada3 with gpart:
## -b defines the starting sector, -s defines the size in sectors
# gpart create -s GPT /dev/ada1
# gpart add -b 2048 -s 1953113520 -t freebsd-zfs -l disk01 /dev/ada1

# gpart create -s GPT /dev/ada2
# gpart add -b 2048 -s 1953113520 -t freebsd-zfs -l disk02 /dev/ada2

# gpart create -s GPT /dev/ada3
# gpart add -b 2048 -s 1953113520 -t freebsd-zfs -l disk03 /dev/ada3

vdev and zpool Creation[edit | edit source]

After setting up your storage devices, you can then create a zpool and its corresponding vdevs using the zpool create command.

# zpool create [-fnd] [-o property=value] ... \
              [-O file-system-property=value] ... \
              [-m mountpoint] [-R root] ${POOL_NAME}  ( ${POOL_TYPE} ${DISK} ) ...

Parameters for zpool create are:

  • -f - Force
  • -n - Display creation but don't create pool
  • -d - Do not enable any features unless specified
  • -o - Set a pool property
  • -O - Set a property on root filesystem
  • -m - Mount point
  • -R - Set an alternate root location
  • POOL_NAME - the name of the pool
  • POOL_TYPE + DISK - one or more vdev configurations

Some pool properties can only be set when the pool is first created. One important option that may affect performance is the pool sector size exponent, referred to internally as ashift. Other useful pool properties include enabling compression (compression=lz4) and storing additional copies of data (copies=N). For newer ZFS pools (pool versions >= 5000), lz4 compression should generally be enabled: blocks that do not compress well are stored uncompressed, so the overhead is minimal, and most CPUs should be able to keep up with the I/O.

Advanced Format Sector Alignment
When creating a pool, ensure that the sector alignment value is set appropriately for the underlying storage devices that are used. The sector size that is used in I/O operations is defined as an exponent value to the power of 2 referred to as the ashift value. Incorrectly setting the sector alignment value may cause write amplification (such as using 512B sectors on disks with 4KiB AF format).

Set the ashift value depending on the underlying storage devices outlined below.

Device                                   ashift value
Hard drives with 512 B sectors           9
Flash media / HDD with 4 KiB sectors     12
Flash media / HDD with 8 KiB sectors     13
Amazon EC2                               12

Once a pool is created, the ashift value can be obtained from zdb:

# zdb | grep ashift
            ashift: 12
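For example, a sketch of creating a pool on 4 KiB-sector disks with lz4 compression enabled on the root filesystem (the pool name and device paths are placeholders):

# zpool create -o ashift=12 -O compression=lz4 storage \
   mirror /dev/disk/by-id/device01 /dev/disk/by-id/device02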


When creating a new pool, you must provide the zpool name and at least one vdev configuration.

For example, to create a simple zpool named storage from two single-device vdevs (data is striped across them with no redundancy):

# zpool create storage \
   /dev/disk1 /dev/disk2

Another example creates a zpool with two mirrored vdevs (analogous to RAID 10):

# zpool create storage \
   mirror /dev/disk1 /dev/disk2 \
   mirror /dev/disk3 /dev/disk4

Another example using RAIDz1:

# zpool create storage \
   raidz1 /dev/disk/by-id/device01-part1 /dev/disk/by-id/device02-part1 /dev/disk/by-id/device03-part1

On Linux, it's recommended to reference disks by ID (/dev/disk/by-id/...), as advised by the ZoL project. Device names such as /dev/sdX are not stable: they can change across reboots or when udev rules change, which can prevent your pool from importing properly on startup.

Once created, you can see the status of the pool using zpool status.

# zpool status
  pool: storage
  state: ONLINE
  scan: none requested
 config:
 
         NAME            STATE     READ WRITE CKSUM
         storage         ONLINE       0     0     0
           raidz1-0      ONLINE       0     0     0
             gpt/disk01  ONLINE       0     0     0
             gpt/disk02  ONLINE       0     0     0
             gpt/disk03  ONLINE       0     0     0
 
 errors: No known data errors

Creating ZFS Datasets[edit | edit source]

Datasets are similar to volumes and are created within a zpool. Datasets can be nested within one another, and a child dataset inherits properties from its parent. Snapshots can be created for individual datasets or recursively for a dataset and all of its children.

Datasets are managed using the zfs utility.

Creating a new dataset is as simple as:

# zfs create zpool/dataset-name

To list all ZFS datasets:

# zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
storage                      153K  1.11T   153K  /storage
storage/logs                 153K  1.11T   153K  /storage/logs

Certain parameters can be set during the creation process with the -o flag, or set after creation using the zfs set command.
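For example, a sketch using a few common dataset properties (the dataset names and values are placeholders):

## Create a dataset with compression and a quota
# zfs create -o compression=lz4 -o quota=100G storage/backups

## Change a property after creation
# zfs set atime=off storage/backups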

Options and Parameters[edit | edit source]

Each ZFS dataset inherits the properties of the pool (and parent datasets) it is created in, but these can be overridden on a per-dataset basis using zfs set.

To get a ZFS parameter, use the zfs get command.

# zfs get compressratio storage
NAME                        PROPERTY       VALUE  SOURCE
storage                     compressratio  1.01x  -


In addition to the pool properties described in the previous section, a ZFS dataset can also have the following properties set.

mountpoint=x for Automatically Mounting[edit | edit source]

When the system starts up, ZFS automatically mounts datasets to their configured mountpoint (via zfs mount -a, typically run by the startup scripts or the zfs-mount service).

# zfs set mountpoint=/export storage


Enabling on Startup[edit | edit source]

On systemd-based Linux systems, you will need to enable the following services in order to have the zpool loaded and mounted on startup:
  • zfs-import-cache
  • zfs-mount
  • zfs-share

On FreeBSD, enable the zfs service on startup by appending the following to /etc/rc.conf:

zfs_enable="YES"
Systemd Service Side Notes
The systemd service files should be located at /usr/lib/systemd/system/.

Enable the services by running systemctl enable service-name.

There is an issue where, on reboot, the systemd services do not run properly, resulting in the ZFS pools not being imported. The fix is to run:

# systemctl preset zfs-import-cache zfs-import-scan zfs-mount zfs-share zfs-zed zfs.target


Linux without systemd[edit | edit source]

On non-systemd Linux distros, check for any startup scripts in /etc/init.d. If you're doing everything manually, you may need to create a file in either /etc/modules-load.d or /etc/modprobe.d so that the ZFS module gets loaded.

To only load the module, create a file at /etc/modules-load.d/zfs.conf containing:

zfs

To load the module with options, create a file at /etc/modprobe.d/zfs.conf containing (for example):

options zfs zfs_arc_max=4294967296


Administration[edit | edit source]

Managing datasets with zfs[edit | edit source]

To list ZFS datasets, use zfs list

# zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
data                        3.61T  1.63T   139K  /data
data/backups                 944G  1.63T   901G  /data/backups
data/crypt                   125G  1.63T   125G  /data/crypt
data/dashcam                84.5G  1.63T  84.5G  /data/dashcam
data/home                    169G  1.63T   153G  /data/home
data/images                  999G  1.63T   999G  /data/images
data/public                 1.11T  1.63T  1.03T  /data/public
data/scratch                 167G  1.63T   158G  /data/scratch
data/vm                     71.2G  1.63T  57.8G  /data/vm
storage                     4.74T  2.27T  26.2G  /storage


To get properties of your ZFS datasets, use the zfs get PROPERTYNAME command:

# zfs get compressratio
NAME  PROPERTY       VALUE  SOURCE
data  compressratio  1.02x  -


Transferring a dataset with zfs send and zfs recv[edit | edit source]

You can transfer a snapshot of a ZFS dataset using the send and receive commands.

Using SSH:

# zfs send  zones/UUID@snapshot | ssh root@10.10.11.5 zfs recv zones/UUID

Using Netcat:

## On the source machine
# zfs send data/linbuild@B | nc -w 20 phoenix 8000

## On the destination machine
# nc -w 120 -l 8000 | pv | zfs receive -F data/linbuild

Using Bash /dev/tcp to send appears to have lower overhead than using netcat to send:

## On the source machine
# zfs send data/linbuild@B > /dev/tcp/phoenix/8000

## On the destination machine, use netcat to listen
# nc -w 120 -l 8000 | pv | zfs receive -F data/linbuild
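For recurring transfers, an incremental send only transfers the changes between two snapshots. A sketch, assuming snapshot @A already exists on both sides and @B is the newer snapshot on the source:

## Send only the changes between snapshots A and B
# zfs send -i data/linbuild@A data/linbuild@B | ssh root@10.10.11.5 zfs recv data/linbuild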



Snapshot[edit | edit source]

A ZFS snapshot is a read-only copy of a dataset from a previous state. Because ZFS makes use of Copy-On-Write, snapshots take no additional space until a change occurs.

Listing Snapshots[edit | edit source]

Snapshots can be listed by running zfs list -t snapshot.

# zfs list -t snapshot
NAME               USED  AVAIL  REFER  MOUNTPOINT
storage@20120820  31.4G      -  3.21T  -
storage@20120924   134G      -  4.15T  -
storage@20121028  36.2G      -  4.26T  -
storage@20121201  33.2M      -  4.55T  -

The USED column shows the amount of space consumed exclusively by the snapshot. This amount grows as files referenced by the snapshot are deleted from the live dataset, since that space cannot be reclaimed until the snapshot is destroyed.

The REFER column shows the amount of data the dataset referenced at the time the snapshot was taken.

To quickly list snapshots in the order they were made, sort by name with -s name (date-stamped names sort chronologically) and use -r to include the dataset's children. Limiting the output to the fields you need (-o name) speeds this up.

# zfs list -t snapshot -o name -s name -r data/home
NAME
data/home@zbk-daily-20170502-003001
data/home@zbk-daily-20170503-003001
data/home@zbk-daily-20170504-003001
data/home@zbk-daily-20170505-003001

## Get the most recent snapshot name
# zfs list -t snapshot -o name -s name -r data/home | tail -n 1 | awk -F@ '{print $2}'
zbk-daily-20170505-003001

Creating Snapshots[edit | edit source]

To create a new snapshot, run zfs snapshot storage@snapshot-name.

Snapshot Name Limit
There seems to be a limit of 88 characters for the snapshot name and file system name. (See: https://lists.freebsd.org/pipermail/freebsd-fs/2010-March/007964.html)
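A snapshot of a dataset and all of its children can be taken in one atomic operation with the -r flag (a sketch; the names follow the examples above):

# zfs snapshot -r data/home@zbk-daily-20170506-003001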


ZFS snapshots make use of copy-on-write (COW) and initially consume no additional space. However, deleting files that are referenced by an existing snapshot will not reclaim space; instead, the reported capacity appears to shrink, since the amount of space available for new data stays the same while the reported usage drops.

For example, suppose a pool has 2 TB of capacity with 1 TB in use, and a 0.5 TB file that is held by a snapshot is deleted. The live data drops to 0.5 TB, but the available space stays at 1 TB because the snapshot still holds the deleted blocks, so tools will report usage as 0.5 TB out of an apparent 1.5 TB total.

If I were to run zfs snapshot storage@today now, listing the snapshots would yield:

# zfs list -t snapshot
NAME               USED  AVAIL  REFER  MOUNTPOINT
storage@20120820  31.4G      -  3.21T  -
storage@20120924   134G      -  4.15T  -
storage@20121028  36.2G      -  4.26T  -
storage@20121201  33.2M      -  4.55T  -
storage@today         0      -  4.58T  -

Accessing Snapshot Data[edit | edit source]

Snapshot contents can be accessed through a special .zfs/snapshot/ directory. Each snapshot will contain a read-only copy of the data that existed when the snapshot was taken.

# ls /storage/.zfs/snapshot/
20120820/ 20120924/ 20121028/ 20121201/ today/

To roll back to a specific snapshot, run zfs rollback storage@yesterday. This will restore your volume to the snapshot state.

# zfs rollback storage@yesterday

To delete a specific snapshot, run zfs destroy storage@today. Note that this will not work if other datasets depend on it, e.g. if you have cloned it as another dataset.

# zfs destroy storage@today



Managing zpools with zpool[edit | edit source]

List zpools using zpool list:

# zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data     8.12T  5.42T  2.71T         -    50%    66%  1.00x  ONLINE  -
storage  9.06T  5.93T  3.13T         -    13%    65%  1.00x  ONLINE  -

The FRAG value is the average fragmentation of available space. There is no defragment operation in ZFS. If you wish to decrease the fragmentation in your pool, consider transferring your pool data to another location and then back with zfs send and zfs recv to rewrite entire chunks of data.

Expanding a zpool[edit | edit source]

A zpool can be expanded by either adding additional vdevs to the pool or by expanding an underlying vdev.

Adding an additional vdev to a zpool is akin to adding additional disks to a JBOD to grow its size. The zpool grows in capacity by as much as the capacity of the vdev that is being added. To add an additional vdev to a zpool, use the zpool add command. For example:

## Adding two 1 TB disks as a mirror adds an additional 1 TB to the pool
# zpool add pool mirror /dev/disk1 /dev/disk2

Expanding the underlying vdev used by a zpool is another way to increase a zpool's capacity. A vdev's size only increases if all underlying storage devices grow. For systems using RAIDz, this involves replacing each member disk and resilvering one disk at a time. On larger arrays, this may not be practical as it requires resilvering as many times as there are disks in the vdev. The zpool will only grow once every disk has been replaced and when the zpool has the autoexpand=on option set. If you forgot to set the autoexpand option before replacing all disks, you can still expand the pool by enabling autoexpand and bringing a disk online:

## After replacing all disks in a vdev, the zpool still shows the same size
# zpool list storage
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  9.06T  8.35T   732G     4.50T    33%    92%  1.00x  ONLINE  -

## Set autoexpand=on on the zpool and bring one of the devices in the affected vdev online again.
# zpool set autoexpand=on storage
# zpool online -e storage ata-Hitachi_HUS724030ALE641_P8GH7GVR

## The zpool should now pick up on the new vdev size and expand accordingly
# zpool list storage
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  13.6T  8.35T  5.28T         -    22%    61%  1.00x  ONLINE  -

zpool attach and zpool detach[edit | edit source]

The zpool attach and zpool detach commands only manage devices that are part of mirror vdevs. Use zpool attach to attach a new device to an existing mirror vdev, or to a single top-level device to turn it into a mirror. The command takes the zpool name, an existing device in the vdev, and the new device to attach.

For example, to add a new disk (gpt/disk03) to an existing mirror:

# zpool status storage
        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0

## Note that the existing device can be either gpt/disk01 or gpt/disk02.
# zpool attach storage gpt/disk02  gpt/disk03

# zpool status storage
        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0
            gpt/disk03  ONLINE       0     0     0
			
## Remove a device from a mirror vdev with zpool detach.
# zpool detach storage gpt/disk02


At-Rest Encryption[edit | edit source]

At-rest encryption is a newer ZFS feature which can be enabled with zpool set feature@encryption=enabled <pool>. Data is written using authenticated (AEAD) ciphers such as AES-CCM and AES-GCM, configurable through dataset properties. Encryption is enabled per dataset at creation time using the -o encryption=... flag.

User-defined keys can be inherited or set manually for each dataset and can be loaded from different sources in various formats. This key is used to wrap (encrypt) the master key, which is generated randomly and never exposed to the user directly. Data is encrypted with the master key, which makes it possible to change the user-defined key without re-encrypting the data on the dataset.
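Because only the wrapping key changes, the user key can be rotated cheaply. In OpenZFS releases with native encryption, this is done with zfs change-key without rewriting any data (a sketch; pool/dataset is a placeholder):

# zfs change-key -o keyformat=passphrase pool/dataset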

For deduplication to work, the ciphertext must be identical for identical plaintext. This is achieved by deriving the salt and IV from an HMAC of the plaintext.

What is and isn't encrypted is listed below.

Encrypted:

  • File data and metadata
  • ACLs, names, permissions, attrs
  • Directory listings
  • All zvol data
  • FUID mappings
  • Master encryption keys
  • All of the above in the L2ARC
  • All of the above in the ZIL

Unencrypted:

  • Dataset / snapshot names
  • Dataset properties
  • Pool layout
  • ZFS structure
  • Dedup tables
  • Everything in RAM

Your ZFS version must support encryption for any of this to work. First, enable the feature on the pool:

# truncate -s 1G block
# zpool create test ./block
# zpool set feature@encryption=enabled test

Then, create an encrypted dataset by passing in these additional parameters to the zfs create command.

Parameter Description
-o encryption=.. Controls the ciphersuite (cipher, key length and mode).

The default is aes-256-ccm, which is used if you specify -o encryption=on.

-o keyformat=.. Controls what format the encryption key is provided in.

Valid options include:

  • passphrase
  • hex
  • raw
-o keylocation=.. Controls where the key is loaded from. The key itself can be raw bytes, a hex representation, or a user passphrase (as chosen by keyformat). It can be provided via a prompt, which appears when you first create the dataset, when you mount it, or when you load the key manually (zfs load-key), or it can be read from a file URI.

Valid options include:

  • prompt
  • file:///dev/shm/enckey
-o pbkdf2iters=.. Only used with a passphrase (-o keyformat=passphrase). It controls the number of PBKDF2 iterations used for key stretching. Higher is better, as it slows down potential dictionary attacks on the passphrase.

The default is -o pbkdf2iters=100000.

keystatus This is a read-only property (not something you set) and is included here for reference. Possible values are:
  • off - Not encrypted
  • available - key is loaded
  • unavailable - key is unavailable and data cannot be read


For example, to create an encrypted dataset using only default values and a passphrase:

$ zfs create \
    -o encryption=on \
    -o keylocation=prompt \
    -o keyformat=passphrase \
    test/enc1

# This will ask you to enter/confirm a password.

Because child datasets inherit the parameters of their parent, a child dataset will also be encrypted using the same properties as the parent. For example, running zfs create test/enc1/also-encrypted will create a new encrypted dataset with the same key source and encryption method.


Encryption properties can be read like any other ZFS property.

# zfs get -p encryption,keystatus,keyformat,keylocation,pbkdf2iters

If a ZFS key is not loaded, it can be provided using the zfs load-key pool/dataset command. Mounting an encrypted dataset with zfs mount -l will also load the key, prompting for it if necessary.

A key can be unloaded using zfs unload-key pool/dataset after it has been unmounted.
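A minimal sketch of the key lifecycle, using the test/enc1 dataset from the earlier example:

## Load the key and mount the dataset
# zfs load-key test/enc1
# zfs mount test/enc1

## Unmount the dataset and unload the key
# zfs umount test/enc1
# zfs unload-key test/enc1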

Verifying Data Integrity with zfs scrub[edit | edit source]

One of the strengths of ZFS is its resiliency, thanks to its transactional, checksummed design. Data can still be silently corrupted by faulty hardware, bad memory, or a fault in the ZFS implementation itself. To verify the integrity of all data in a pool, ZFS provides a mechanism called a scrub. During a scrub, ZFS verifies the checksums of every block and attempts to repair any errors it discovers on pools using mirror or RAIDz vdevs.

To initiate a scrub, run:

# zpool scrub storage

Once the scrub process is underway, you can view its status by running:

# zpool status storage
   pool: storage
  state: ONLINE
  scan: scrub in progress since Mon Dec  3 23:54:53 2012
     18.3G scanned out of 6.05T at 211M/s, 8h20m to go
     0 repaired, 0.30% done
 config:
 
         NAME            STATE     READ WRITE CKSUM
         storage         ONLINE       0     0     0
           raidz1-0      ONLINE       0     0     0
             gpt/disk00  ONLINE       0     0     0
             gpt/disk01  ONLINE       0     0     0
             gpt/disk02  ONLINE       0     0     0
             gpt/disk03  ONLINE       0     0     0
             gpt/disk04  ONLINE       0     0     0
 
 errors: No known data errors

To stop a scrub process, run:

# zpool scrub -s storage
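Scrubs are typically scheduled to run periodically. A sketch using root's crontab, assuming a pool named storage (some distributions instead ship their own scrub timer or periodic script; adjust the zpool path for your system):

## Scrub the pool 'storage' every Sunday at 03:00
0 3 * * 0 /sbin/zpool scrub storage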


NFS Export[edit | edit source]

FreeBSD[edit | edit source]

To share a ZFS pool via NFS on a FreeBSD system, ensure that you have the following in your /etc/rc.conf file.

mountd_enable="YES"
rpcbind_enable="YES"
nfs_server_enable="YES"
mountd_flags="-r -p 735"
zfs_enable="YES"

mountd is required for exports to be loaded from /etc/exports. You must also either reload or restart mountd every time you make a change to the exports file in order to have it reread.

Set the sharenfs property using the zfs utility. To share a dataset with everyone, set sharenfs to 'on', or provide a network that the export should be accessible from. For example:

# zfs set sharenfs=off data
# zfs set sharenfs="-network 172.17.12.0/24" data/linbuild
# zfs get sharenfs
NAME                      PROPERTY  VALUE                    SOURCE
data                      sharenfs  off                      local
data/linbuild             sharenfs  -network 172.17.12.0/24  local
data/linbuild/centos      sharenfs  -network 172.17.12.0/24  inherited from data/linbuild
data/linbuild/fedora      sharenfs  -network 172.17.12.0/24  inherited from data/linbuild
data/linbuild/scientific  sharenfs  -network 172.17.12.0/24  inherited from data/linbuild

(Paths not managed by ZFS can still be exported the usual way: add them to /etc/exports and reload mountd.)

By setting the sharenfs property, the system automatically creates an export for the ZFS dataset via mountd. With sharenfs=on and no network restriction, the export is accessible to everyone:

# showmount -e
Exports list on localhost:
/data                           Everyone

If you want to restrict the share to a specific network, you can specify the network when setting the sharenfs property:

# zfs sharenfs="-network 10.1.1.0/24" storage
# showmount -e
Exports list on localhost:
/storage                           10.1.1.0

Note that these exports are stored in /etc/zfs/exports rather than the usual /etc/exports. The zfs and mountd services must both be running for the shares to work, which is why mountd_enable="YES" and zfs_enable="YES" are required in /etc/rc.conf as shown above.

On Linux[edit | edit source]

To share a ZFS dataset via NFS, you may either do it traditionally by manually editing /etc/exports and exportfs, or by using ZFS's sharenfs property and managing shares using the zfs share and zfs unshare commands.

To use the traditional method of exportfs and /etc/exports, set sharenfs=off.

Sharing nested file systems
To share a nested dataset, use the crossmnt option. This will let you 'recursively' share datasets under the specified path by automatically mounting the child file system when accessed.

Eg. If I have multiple datasets under /storage/test:

# cat /etc/exports
/storage/test *(ro,no_subtree_check,crossmnt)
Cross mounting for nested datasets is only necessary if you use exportfs. This is not required when using ZFS share which is discussed next.


To use ZFS's automatic NFS exports, set sharenfs=on to allow world read/write access to the share. To restrict access by network, add an rw=subnet/netmask option, e.g. rw=@192.168.1.0/24. Managing shares this way updates the ZFS sharetab file located at /etc/dfs/sharetab, which is used by the zfs-share service. If the sharetab ever gets out of sync with what is actually configured, force an update by running zfs share -a or by restarting the zfs-share systemd service.
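A sketch of the ZFS-managed approach (the dataset name and network are placeholders):

## Restrict read/write access to the 192.168.1.0/24 network
# zfs set sharenfs="rw=@192.168.1.0/24" storage/public

## Re-export everything if the sharetab gets out of sync
# zfs share -a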


Handling Drive Failure[edit | edit source]

When a drive fails, you will see something similar to:

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 2h31m with 0 errors on Fri Jun  3 19:20:15 2016
config:

        NAME        STATE     READ WRITE CKSUM
        data        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     UNAVAIL      3   222     0  corrupted data

errors: No known data errors

The device sdg failed and was removed from the pool.

Dell Server Info
Since this happened on a server, I removed the failed disk from the machine and replaced it with another one.

Because this server uses a Dell RAID controller that cannot do pass-through, each disk that is part of the ZFS array is its own RAID0 vdisk. When inserting a new drive, the vdisk information needs to be re-created via OpenManage before the disk comes online.

Follow the steps in Dell OpenManage after inserting the replacement drive if this applies to you.

Once the vdisk is recreated, it should show up as the old device name again.


Once the replacement disk is installed on the system, reinitialize the drive with the GPT label.

On a FreeBSD system: I reinitialized the disks using the geometry settings described above.

# gpart create -s GPT da4
da4 created
# gpart add -b 2048 -s 3906617520 -t freebsd-zfs -l disk03 da4
da4p1 added

On a ZFS on Linux system:

# parted /dev/sdg
GNU Parted 2.1
Using /dev/sdg
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel GPT
Warning: The existing disk label on /dev/sdg will be destroyed and all data on this disk will be lost. Do
you want to continue?
Yes/No? y
(parted) quit

To replace the offline disk, use zpool replace pool old-device new-device.

# zpool replace data sdg /dev/sdg
# zpool status
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 10 16:36:35 2017
    35.5M scanned out of 3.03T at 3.55M/s, 249h19m to go
    5.45M resilvered, 0.00% done
config:

        NAME             STATE     READ WRITE CKSUM
        data             DEGRADED     0     0     0
          raidz2-0       DEGRADED     0     0     0
            sdb          ONLINE       0     0     0
            sdc          ONLINE       0     0     0
            sdd          ONLINE       0     0     0
            sde          ONLINE       0     0     0
            sdf          ONLINE       0     0     0
            replacing-5  UNAVAIL      0     0     0
              old        UNAVAIL      3   222     0  corrupted data
              sdg        ONLINE       0     0     0  (resilvering)

errors: No known data errors

After the resilvering completes, you may need to manually detach the failed drive before the pool comes back 'online'.

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 2.31T in 22h11m with 54 errors on Sun Oct  1 16:07:14 2017
config:

        NAME                                         STATE     READ WRITE CKSUM
        data                                         DEGRADED     0     0    54
          raidz1-0                                   DEGRADED     0     0   109
            replacing-0                              UNAVAIL      0     0     0
              13841903505263088383                   UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST3000DM001-1CH166_Z1F4HGS6-part1
              ata-ST4000DM005-2DP166_ZGY0956M-part1  ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F4HHZ7-part1    ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH01SND-part1    ONLINE       0     0     0
        cache
          pci-0000:02:00.0-ata-3-part1               ONLINE       0     0     0

errors: 26 data errors, use '-v' for a list

# zpool detach data 13841903505263088383



Clearing Data Errors[edit | edit source]

If you get data errors:

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 509G in 8h24m with 0 errors on Sat Oct 12 02:33:07 2019
config:

        NAME                                        STATE     READ WRITE CKSUM
        data                                        DEGRADED     0     0     0
          raidz2-0                                  DEGRADED     0     0     0
            scsi-3600188b04c5ec1002533815852b8193e  ONLINE       0     0     0
            sdb                                     FAULTED      0     0     0  too many errors
            sdc                                     ONLINE       0     0     0
            sdd                                     ONLINE       0     0     0
            sde                                     ONLINE       0     0     0
            sdf                                     ONLINE       0     0     0

If the device is faulted because of a temporary issue, you can bring it back online using the zpool clear command. This causes the pool to resilver only the data written since the device dropped offline, which, depending on the amount of data, should be relatively quick.

# zpool clear data sdb

Devices that are unavailable may be brought offline and then back online using zpool offline pool disk and zpool online pool disk. An onlined disk could still be faulted and can be restored with zpool clear.

## Take the device offline, then do whatever you need to the disk
# zpool offline data sdb

## Then bring it back online
# zpool online data sdb
warning: device 'sdb' onlined, but remains in faulted state
use 'zpool clear' to restore a faulted device
# zpool clear data sdb

If you do not care about data integrity, you could also clear errors by initiating a scrub and then cancelling it immediately.

# zpool scrub data
# zpool scrub -s data


Tuning and Monitoring[edit | edit source]

ZFS ARC[edit | edit source]

The ZFS Adaptive Replacement Cache (ARC) is an in-memory cache managed by ZFS to improve read speeds by keeping frequently accessed blocks in memory. With the ARC, subsequent accesses to a file can be served from memory rather than from disk. It primarily benefits workloads with heavy random reads, such as databases.

In the ZFS on Linux implementation, the maximum ARC size defaults to half of the host's memory, and the ARC shrinks when available memory runs low. Depending on the system setup, you may want to change the maximum amount of memory allocated to the ARC.

To change the maximum ARC size, edit the ZFS zfs_arc_max kernel module parameter:

# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4294967296
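The limit can also be changed at runtime (assuming the zfs module is already loaded) by writing to the module parameter directly:

# echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max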

You may check the current ARC usage by checking cat /proc/spl/kstat/zfs/arcstats or use the arcstat.py script that's part of the zfs package.

# cat /proc/spl/kstat/zfs/arcstats
p                               4    1851836123
c                               4    4105979840
c_min                           4    33554432
c_max                           4    4294967296
size                            4    4105591928
hdr_size                        4    55529696
data_size                       4    3027917312
metadata_size                   4    747323904
other_size                      4    274821016
anon_size                       4    4979712
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    848103424
mru_evictable_data              4    605918208
mru_evictable_metadata          4    97727488
mru_ghost_size                  4    3224668160
mru_ghost_evictable_data        4    2267575296
mru_ghost_evictable_metadata    4    957092864
mfu_size                        4    2922158080
mfu_evictable_data              4    2421089280
mfu_evictable_metadata          4    495401984
mfu_ghost_size                  4    835275264
mfu_ghost_evictable_data        4    787349504
mfu_ghost_evictable_metadata    4    47925760
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    46505
memory_indirect_count           4    30062
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1077674616
arc_meta_limit                  4    3113852928
arc_meta_max                    4    1077674616
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    129740800

# arcstat.py 10 10
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
17:41:09     0     0      0     0    0     0    0     0    0   3.9G  3.9G
17:41:19   211    68     32     6    7    62   50    68   32   3.9G  3.9G
17:41:29   146    32     21     6   10    25   30    32   21   3.9G  3.9G
17:41:39   170    50     29     6    9    44   42    50   31   3.9G  3.9G
17:41:49   150    33     22     6    9    27   31    33   22   3.9G  3.9G
17:41:59   151    43     28     4    9    38   39    43   28   3.9G  3.9G


If you are running out of ARC space, arc_prune may end up consuming all the CPU. See: https://github.com/zfsonlinux/zfs/issues/4345

# modprobe zfs zfs_arc_meta_strategy=0
# cat /sys/module/zfs/parameters/zfs_arc_meta_strategy


ZFS L2ARC[edit | edit source]

The ZFS L2ARC is similar to the ARC but provides caching on devices faster than the storage pool, such as SLC/MLC SSDs, to help improve random read workloads.

Cached data from the ARC are moved to the L2ARC when ARC needs more room for more recently and frequently accessed blocks.

To add or remove a L2ARC device:

## Add /dev/disk/by-path/path-to-disk to zpool 'data'
# zpool add data cache /dev/disk/by-path/path-to-disk

## To remove a device, just remove it like any other disk:
# zpool remove data /dev/disk/by-path/path-to-disk

You can see the usage of the L2ARC by running zpool list -v.

# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data  8.12T  5.72T  2.40T         -    48%    70%  1.00x  ONLINE  -
  raidz1  8.12T  5.72T  2.40T         -    48%    70%
    ata-ST3000DM001-part1      -      -      -         -      -      -
    ata-ST3000DM001-part1      -      -      -         -      -      -
    ata-ST4000DM000-part1      -      -      -         -      -      -
cache      -      -      -         -      -      -
  pci-0000:02:00.0-ata-3-part1  55.0G  28.3G  26.7G         -     0%    51%

ZFS ZIL and SLOG[edit | edit source]

ZIL[edit | edit source]

The ZFS Intent Log (ZIL) is how ZFS keeps track of synchronous write operations so that they can be completed or rolled back after a crash or failure. The ZIL is not used for asynchronous writes, as those are still handled through system caches.

Because the ZIL is stored in the data pool, a synchronous write involves writing the data twice: once to the ZIL and again to its final location in the pool. This is detrimental to performance, since one write operation now requires two or more writes (i.e. write amplification).

SLOG[edit | edit source]

To improve performance, the ZIL can be moved to a separate device called a Separate Intent Log (SLOG), so that synchronous writes go to the SLOG while the data is written only once to the zpool, avoiding write amplification on the pool disks.

The storage device used by the SLOG should ideally be very fast and also reliable. The size of the SLOG doesn't need to be that big either - a couple gigabytes should be sufficient.

For example, assuming that data is flushed every 5 seconds, on a single gigabit connection, the ZIL usage shouldn't be any more than ~600MB.

For these reasons, SLC SSDs are preferable to MLC SSDs for SLOG.
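To add a SLOG device to an existing pool, use zpool add with the log keyword (a sketch; the device paths are placeholders). Mirroring the log device protects in-flight synchronous writes if a SLOG device fails:

## Add a single log device
# zpool add data log /dev/disk/by-id/nvme-slog01

## Or add a mirrored log
# zpool add data log mirror /dev/disk/by-id/nvme-slog01 /dev/disk/by-id/nvme-slog02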


See Also[edit | edit source]

ZFS encryption guide