ZFS
ZFS is different from a traditional filesystem in that it acts as both a volume manager (like LVM) and a filesystem (like ext4). It was originally developed by Sun Microsystems for the Solaris operating system, and an open source version maintained by the OpenZFS project has since been ported to FreeBSD, macOS, and Linux.
Cheat sheet
A quick reference of commonly used ZFS commands.
Description | Command |
---|---|
Create a new zpool that spans all disks | # zpool create storage /dev/disk1 /dev/disk2 |
Create a mirrored pool | # zpool create storage mirror /dev/disk1 /dev/disk2 |
Create a raidz1 pool (raidz2 and raidz3 are also available) | # zpool create storage raidz1 /dev/disk1 /dev/disk2 /dev/disk3 |
Create a striped mirror (similar to RAID 10) | # zpool create storage mirror /dev/disk1 /dev/disk2 mirror /dev/disk3 /dev/disk4 |
Create a pool with a specific ashift (9 = 512B, 12 = 4K, 13 = 8K) | # zpool create storage -o ashift=12 /dev/disk1 /dev/disk2 |
Find the ashift value of a pool | # zdb storage \| grep ashift |
Enable and set compression for a dataset | # zfs set compression=zstd storage |
Create a new snapshot | # zfs snapshot storage@now |
Roll back to a specific snapshot | # zfs rollback storage@now |
Destroy a snapshot | # zfs destroy storage@now |
List all snapshots in a zpool | # zfs list -t snapshot storage |
Remove a zpool from the system | # zpool export storage |
Import a zpool under a different name | # zpool import storage storage-new |
Import all zpools found on any disks | # zpool import -a |
Show ARC summary | # arc_summary -s arc |
Installation
ZFS is included with Solaris and FreeBSD out of the box.
Linux
On Linux, you can either use FUSE or compile a third-party kernel module available from the ZFS on Linux project at http://zfsonlinux.org/. For specific distributions, installation is made easier by pre-packaged dkms-enabled packages which automate building and loading the kernel module.
Install ZFS on CentOS
Refer to the documentation at: https://openzfs.github.io/openzfs-docs/Getting%20Started/RHEL-based%20distro/index.html
In summary, enable the EPEL and ZFS repositories by installing the epel-release and zfs-release packages. Proceed to install the zfs package, which will automatically pull in the required dependencies and trigger the kernel module build using dkms.
# dnf install https://zfsonlinux.org/epel/zfs-release-2-2$(rpm --eval "%{dist}").noarch.rpm
# dnf install -y epel-release
# dnf install -y kernel-devel
# dnf install -y zfs
Installing the zfs package should also automatically build ZFS using DKMS for your kernel. Run dkms status to see whether zfs is installed.
# dkms status
zfs/2.1.9, 4.18.0-425.10.1.el8_7.x86_64, x86_64: installed
If you need to rebuild the zfs module for whatever reason, use the dkms build zfs/$version command (such as dkms build zfs/2.0.5). If you need to target a specific kernel version, you can also specify the kernel using the -k flag like so: dkms build -m zfs -v 0.6.5 -k 4.2.5-300.fc23.x86_64. After building the dkms module, you can install it by running dkms install -m zfs.
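For example, a minimal sketch of rebuilding and installing the module version shown in the dkms status output above (substitute the version installed on your system):
# dkms build -m zfs -v 2.1.9
# dkms install -m zfs -v 2.1.9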
ArchLinux
# pacman -Sy base-devel linux-headers
# su alarm
$ gpg --keyserver pool.sks-keyservers.net --recv-keys 4F3BA9AB6D1F8D683DC2DFB56AD860EED4598027
$ curl https://aur.archlinux.org/cgit/aur.git/snapshot/zfs-linux.tar.gz > zfs-linux.tar.gz
$ tar -xzf zfs-linux.tar.gz
$ cd zfs-linux
$ makepkg
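Once makepkg finishes, install the resulting package as root. The package file name pattern below is an assumption and depends on the version that was built:
# pacman -U zfs-linux-*.pkg.tar.zst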
Introduction
A few fundamental concepts to understand when using ZFS are devices, vdevs, zpools, and datasets. A quick summary of each:
- A device can be any block device installed on the system. It can be an SSD, a traditional HDD, or even a file.
- A vdev represents one or more devices in ZFS and uses one of several layouts: single device, mirror, RAIDz1, RAIDz2, or RAIDz3.
- A zpool contains one or more vdevs. It treats all vdevs like a JBOD and distributes data depending on factors such as load and utilization.
- Datasets are created within zpools. They are similar to volumes but behave like an already formatted filesystem and can be mounted to a mount point on the system.
Zpool
A zpool contains one or more vdevs in any configuration. Within a zpool, ZFS treats all member vdevs similar to disks in a JBOD but distributes data evenly. As a consequence, the failure of any member vdev will result in the failure of the zpool.
As a side note, recovery of data from a simple zpool is unlike that of a JBOD due to how data is evenly distributed on all vdevs. See: https://www.klennet.com/notes/2018-12-20-no-jbod-in-zfs-mostly.aspx
When creating a zpool, the virtual device configuration must be given.
Virtual Device
The concept of a virtual device or vdev encapsulates one or more physical storage devices. There are several types of vdevs in ZFS:
- device - One or more physical disks or partitions on the system. The capacity of the vdev is the sum of all underlying devices. Beware that there is no redundancy or fault tolerance.
- file - An absolute path to a disk image.
- mirror - Similar to RAID 1, where all blocks are mirrored across all devices, providing high performance and fault tolerance. A mirror can survive any failure so long as at least one device remains healthy. Capacity is limited by the smallest disk. Read performance is excellent since data can be retrieved from all storage devices simultaneously.
- raidz1/2/3 - Similar to RAID 5 or RAID 6. The number represents how many disk failures can be tolerated.
- spare - A hot spare that can be used as a temporary replacement. You must enable a pool setting for it to be dynamically added to a failed vdev; this is disabled by default.
- cache - A device used for the level 2 adaptive read cache (L2ARC); more on this later.
- log - A device for the ZFS Intent Log (ZIL); more on this later.
Virtual devices in ZFS have some limitations you must keep in mind:
- Vdevs cannot shrink. You can only destroy and re-create a vdev with a smaller size.
- You may not add additional devices to a raidz vdev to expand it.
- You may only grow a raidz vdev by replacing each storage device with a larger one.
Creating a zpool
Create a zpool and its corresponding vdevs using the zpool create
command.
# zpool create [-fnd] [-o property=value] ... \
[-O file-system-property=value] ... \
[-m mountpoint] [-R root] ${POOL_NAME} ( ${POOL_TYPE} ${DISK} ) ...
Parameters for zpool create are:
- -f - Force
- -n - Display the creation but don't create the pool
- -d - Do not enable any features unless specified
- -o - Set a pool property
- -O - Set a property on the root filesystem
- -m - Mount point
- -R - Set an alternate root location
- POOL_NAME - the name of the pool
- POOL_TYPE + DISK - one or more vdev configurations
When specifying devices on Linux, it's recommended to use disk IDs; in fact, it's a recommendation made by the ZoL project. Using disk names such as /dev/sdX is not reliable because the naming can change when udev rules change, which can potentially prevent your pool from importing properly on startup.
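For example, a minimal sketch of a mirrored pool using by-id paths (the device IDs below are placeholders):
# zpool create storage mirror \
    /dev/disk/by-id/ata-WDC_WD40EFRX-AAAA1111 \
    /dev/disk/by-id/ata-WDC_WD40EFRX-BBBB2222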
Pool Properties
Some pool properties can only be set when the pool is first created. Important ones to keep in mind are:
- ashift=N - For disks with advanced format (AF) where sector sizes are 4K, you must set ashift=12 manually to avoid performance degradation. Other ashift values are covered below.
- copies=N - where N is the number of copies. For important data, you may specify additional copies of the data to be written. In the event of any corruption, the additional copies help safeguard against data loss.
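For example, a sketch that sets both properties at pool creation time (device paths are placeholders; copies is a filesystem property, hence -O):
# zpool create -o ashift=12 -O copies=2 storage \
    /dev/disk/by-id/disk1 /dev/disk/by-id/disk2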
Advanced Format Sector Alignment
When creating a pool, ensure that the sector alignment value is set appropriately for the underlying storage devices. The sector size used in I/O operations is defined as a power-of-2 exponent referred to as the ashift value. Incorrectly setting the sector alignment value may cause write amplification (such as using 512B sectors on disks with a 4KiB AF format).
Set the ashift value according to the underlying storage devices as outlined below.
Device | ashift value |
---|---|
Hard Drives with 512B sectors | 9 |
Flash Media / HDD with 4K sectors | 12 |
Flash Media / HDD with 8K sectors | 13 |
Amazon EC2 | 12 |
Once a pool is created, the ashift
value can be obtained from zdb
:
# zdb | grep ashift
ashift: 12
A simple pool
To create a simple zpool named storage, similar to a RAID 0/JBOD:
# zpool create storage \
/dev/disk1 /dev/disk2
A mirrored pool
To create a zpool with a mirrored vdev:
# zpool create storage \
mirror /dev/disk1 /dev/disk2
To create a zpool that stripes data across two mirrors, specify two mirrors:
# zpool create storage \
mirror /dev/disk1 /dev/disk2 \
mirror /dev/disk3 /dev/disk4
To add or remove a disk from the mirror, use the zpool attach
and zpool detach
commands.
For example, to add a new disk (gpt/disk03) to an existing mirror:
## We start off with a 2 disk mirror
# zpool status storage
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gpt/disk01 ONLINE 0 0 0
gpt/disk02 ONLINE 0 0 0
## Note that the existing device can be either gpt/disk01 or gpt/disk02.
# zpool attach storage gpt/disk02 gpt/disk03
# zpool status storage
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gpt/disk01 ONLINE 0 0 0
gpt/disk02 ONLINE 0 0 0
gpt/disk03 ONLINE 0 0 0
## Remove a device from a mirror vdev with zpool detach.
# zpool detach storage gpt/disk02
# zpool status storage
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gpt/disk01 ONLINE 0 0 0
gpt/disk03 ONLINE 0 0 0
A raidz pool
Another example using RAIDz1:
# zpool create storage \
raidz1 /dev/disk/by-id/device01-part1 /dev/disk/by-id/device02-part1 /dev/disk/by-id/device03-part1
Once created, you can see the status of the pool using zpool status.
# zpool status
pool: storage
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gpt/disk01 ONLINE 0 0 0
gpt/disk02 ONLINE 0 0 0
gpt/disk03 ONLINE 0 0 0
errors: No known data errors
A file based zpool
Vdevs can be backed by a file or disk image, which is useful for testing.
# for i in {1..4}; do dd if=/dev/zero of=/tmp/file$i bs=1G count=4 &> /dev/null; done
# zpool create storage \
/tmp/file1 /tmp/file2 /tmp/file3 /tmp/file4
A hybrid zpool
A zpool can be created out of a combination of the different vdevs. Mix and match to your liking.
# zpool create storage \
mirror /dev/disk1 /dev/disk2 \
mirror /dev/disk3 /dev/disk4 \
log mirror /dev/disk5 /dev/disk6 \
cache /dev/disk7
Managing zpools
List zpools using zpool list
:
# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
data 8.12T 5.42T 2.71T - 50% 66% 1.00x ONLINE -
storage 9.06T 5.93T 3.13T - 13% 65% 1.00x ONLINE -
The FRAG value is the average fragmentation of the available space. There is no defragment operation in ZFS. If you wish to decrease the fragmentation in your pool, consider transferring your pool data to another location and back with zfs send and zfs recv, which rewrites the data in full.
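A hedged sketch of that approach (the destination pool 'backup' is an assumption and must have enough free space for the full dataset):
# zfs snapshot -r storage@migrate
# zfs send -R storage@migrate | zfs recv -F backup/storage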
Expanding an existing zpool
A zpool can be expanded by either adding additional vdevs to the pool or by expanding an underlying vdev.
Adding an additional vdev to a zpool is akin to adding additional disks to a JBOD to grow its size. The zpool grows in capacity by as much as the capacity of the vdev that is being added. To add an additional vdev to a zpool, use the zpool add
command. For example:
# Adding 2 1TB disks mirrored will add an additional 1TB to the pool
# zpool add pool mirror /dev/disk1 /dev/disk2
Expanding the underlying vdev used by a zpool is another way to increase a zpool's capacity. For mirror and raidz vdevs, the size can only increase if all underlying storage devices grow. On raidz vdevs, this involves replacing each member disk and resilvering one disk at a time. On larger arrays, this may not be practical as it requires resilvering as many times as there are disks in the vdev.
The zpool will only grow once every disk has been replaced and when the zpool has the autoexpand=on
option set. If you forgot to set the autoexpand option before replacing all disks, you can still expand the pool by enabling autoexpand and bringing a disk online:
## After replacing all disks in a vdev, the zpool still shows the same size
# zpool list storage
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
storage 9.06T 8.35T 732G 4.50T 33% 92% 1.00x ONLINE -
## Set autoexpand=on on the zpool and bring one of the devices in the affected vdev online again.
# zpool set autoexpand=on storage
# zpool online -e storage ata-Hitachi_HUS724030ALE641_P8GH7GVR
## The zpool should now pick up on the new vdev size and expand accordingly
# zpool list storage
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
storage 13.6T 8.35T 5.28T - 22% 61% 1.00x ONLINE -
Verifying data integrity
Data can be silently corrupted by faulty hardware, from failing sectors to bad memory, or through a fault in the ZFS implementation. To safeguard against data corruption, every block is checksummed (fletcher4 by default, with SHA-256 among the available algorithms). A verification of each block, called a ZFS scrub, can then be used to ensure the correctness of each block. This check can be done while the system is online, though it may degrade performance. A ZFS scrub can be triggered manually with the zpool scrub command. Because the scrub checks each block, the amount of time required depends on the amount of data and the speed of the underlying storage devices. During a scrub, ZFS will attempt to fix any errors that are discovered.
To initiate a scrub, run:
# zpool scrub storage
To stop a scrub process, run:
# zpool scrub -s storage
Once the scrub process is underway, you can view its status by running:
# zpool status storage
pool: storage
state: ONLINE
scan: scrub in progress since Mon Dec 3 23:54:53 2012
18.3G scanned out of 6.05T at 211M/s, 8h20m to go
0 repaired, 0.30% done
config:
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gpt/disk00 ONLINE 0 0 0
gpt/disk01 ONLINE 0 0 0
gpt/disk02 ONLINE 0 0 0
gpt/disk03 ONLINE 0 0 0
gpt/disk04 ONLINE 0 0 0
errors: No known data errors
Any errors found during the scrub will be fixed automatically. The zpool status output shows 3 error columns:
- READ: I/O errors while reading
- WRITE: I/O errors while writing
- CKSUM: checksum errors found during a read
ZFS datasets
Datasets are similar to filesystem volumes and are created within a ZFS zpool. Unlike a traditional filesystem, however, datasets do not need to be created with a fixed size; they can use as much storage as the pool has available, though this can be limited with quotas if desired. This design allows for flexible storage layouts for many applications. Individual datasets can also be snapshotted, which is covered in the snapshot section below.
The following sections will go over ZFS dataset management and dataset properties.
Creating a dataset
Creating a new dataset is as simple as:
# zfs create zpool/dataset-name
To list all ZFS datasets:
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
storage 153K 1.11T 153K /storage
storage/logs 153K 1.11T 153K /storage/logs
Certain parameters can be set during the creation process with the -o flag, or set after creation using the zfs set command.
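For example, a small sketch combining a creation-time option with a later change (the dataset name and mount point are assumptions):
# zfs create -o compression=zstd -o mountpoint=/srv/logs storage/logs
# zfs set quota=10G storage/logs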
Dataset options and parameters
Like zpool properties, each ZFS dataset also has properties that can be used to fine-tune the filesystem to your storage needs.
Dataset properties are inherited from the parent zpool or parent datasets, and can be overridden. For example, if compression is enabled on the parent zpool, each ZFS dataset will inherit that compression setting unless its own compression property is set.
Custom user-defined properties can also be created. These have no effect on the filesystem and are merely used to annotate, tag, or label a dataset for applications designed around ZFS. Custom properties must include a ':' to distinguish them from the native dataset properties.
Getting properties
To get a ZFS property, use the zfs get property-name zpool-name
command. Multiple properties can be retrieved by using a comma separated list. For example:
# zfs get compressratio,available storage
NAME PROPERTY VALUE SOURCE
storage compressratio 1.01x -
storage available 1.00G -
The 'source' column specifies where the value originated. For datasets that inherit properties from other datasets, this field names the dataset the value was inherited from.
Some useful properties are:
- all - get all properties
- compressratio - get the compression ratio of the dataset
- readonly - whether the dataset is read only
- compression=X - the compression algorithm to use. Defaults to 'off' and, depending on the system, valid choices are "on", "off", "lzjb", "gzip", "gzip-N", "zle", and "zstd".
- copies=N - how many times data within this dataset is written
- mountpoint=X - the dataset mount point
Setting properties
Properties are set using the zfs set property=value zpool-name
command. Here are some examples.
To change a ZFS mount point:
# zfs set mountpoint=/export storage
To change the ZFS dataset compression:
# zfs set compression=zstd storage
Transferring and receiving datasets
You can transfer a snapshot of a ZFS dataset using the send
and receive
commands.
Using SSH:
# zfs send zones/UUID@snapshot | ssh root@10.10.11.5 zfs recv zones/UUID
Using Netcat:
## On the source machine
# zfs send data/linbuild@B | nc -w 20 phoenix 8000
## On the destination machine
# nc -w 120 -l 8000 | pv | zfs receive -F data/linbuild
Using Bash's /dev/tcp to send appears to have lower overhead than using netcat:
## On the source machine
# zfs send data/linbuild@B > /dev/tcp/phoenix/8000
## On the destination machine, use netcat to listen
# nc -w 120 -l 8000 | pv | zfs receive -F data/linbuild
ZFS volumes
A ZVOL is a ZFS volume that is exposed to the system as a block device. Like a dataset, a ZVOL can be snapshotted, scrubbed, compressed, and deduped.
Create a ZVOL
A ZVOL can be created with the same zfs create
command but with the addition of the -V
option followed by the volume size. For example:
# zfs create -V 1G storage/disk1
ZVOLs are exposed under the /dev/zvol
path.
Using ZVOLs
Since a ZVOL is just a block device, you can:
- Make a filesystem (like ext4) on it (with mkfs)
- Use it as a swap space (with mkswap)
One interesting feature with using ZVOLs is that ZFS can provide transparent compression. This means if the volume has compression enabled, your swap or the filesystem is automatically compressed.
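A quick sketch using the example ZVOL created above (pick one use per volume):
## Format and mount it like any other block device
# mkfs.ext4 /dev/zvol/storage/disk1
# mount /dev/zvol/storage/disk1 /mnt
## Or turn it into swap space
# mkswap /dev/zvol/storage/disk1
# swapon /dev/zvol/storage/disk1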
Snapshots
ZFS snapshots create a copy of the filesystem at the exact moment of creation. They can be created on entire zpools or on specific datasets. Creating a snapshot is nearly instantaneous and requires no additional storage.
While snapshots initially take no additional space, keep in mind that because of the copy-on-write nature of ZFS, storage usage will increase once the data begins to diverge from the snapshot. As a consequence, keeping many old snapshots around on a filesystem with constant change can quickly reduce the amount of available space. Also remember that deleting data that has been snapshotted will not reclaim space until the snapshot is deleted.
You may have up to 2^64 snapshots. Each snapshot name may have up to 88 characters.
Creating snapshots
Snapshots are referred to by name, using the following syntax:
- Entire zpool snapshot:
zpool@snapshot-name
- Individual dataset snapshot:
zpool/dataset@snapshot-name
Create a new snapshot with the zfs snapshot command followed by the snapshot name. Eg: zfs snapshot zpool/dataset@snapshot-name.
Listing snapshots
Snapshots can be listed by running zfs list -t snapshot.
# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
storage@20120820 31.4G - 3.21T -
storage@20120924 134G - 4.15T -
storage@20121028 36.2G - 4.26T -
storage@20121201 33.2M - 4.55T -
The USED column shows the amount of space used exclusively by the snapshot. This amount will go up as files captured in the snapshot are deleted from the live dataset, since the freed space cannot be reclaimed until the snapshot is deleted.
The REFER column shows the amount of data the snapshot references, i.e. the size of the dataset at the time the snapshot was taken.
To quickly get a dataset's snapshots sorted by name (which also sorts by date when snapshot names are date-stamped), use -s name along with -r to include child datasets. Passing in only the fields you need (-o name) will speed this operation up.
# zfs list -t snapshot -o name -s name -r data/home
NAME
data/home@zbk-daily-20170502-003001
data/home@zbk-daily-20170503-003001
data/home@zbk-daily-20170504-003001
data/home@zbk-daily-20170505-003001
## Get the most recent snapshot name
# zfs list -t snapshot -o name -s name -r data/home | tail -n 1 | awk -F@ '{print $2}'
zbk-daily-20170505-003001
Using snapshots and rollback
Snapshot contents can be accessed through a special .zfs/snapshot/
directory. Each snapshot will contain a read-only copy of the data that existed when the snapshot was taken.
# ls /storage/.zfs/snapshot/
20120820/ 20120924/ 20121028/ 20121201/ today/
To roll back to a specific snapshot, run zfs rollback storage@yesterday. This will restore your dataset to the state it was in when the snapshot was taken.
# zfs rollback storage@yesterday
Destroying snapshots
To destroy a snapshot, use the zfs destroy command as you would to destroy a zpool or dataset. Destroying a snapshot will fail if anything still depends on it, such as a clone created from the snapshot.
# zfs destroy storage@today
Holding a snapshot
Use the zfs hold
command to hold a snapshot. Once held, the snapshot cannot be deleted until it is released. This is useful to avoid accidental deletion.
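A brief sketch (the hold tag 'keep' is an arbitrary label):
# zfs hold keep storage@today
# zfs holds storage@today
# zfs release keep storage@today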
ZFS Clones
ZFS clones are writable filesystems created from a snapshot. You must destroy the cloned filesystem before the source snapshot can be removed.
To create a cloned dataset 'yesterday' from a specific snapshot:
# zfs clone storage/test@yesterday storage/yesterday
Clones can be destroyed like any other dataset, using the zfs destroy command. Eg. zfs destroy storage/yesterday.
Automated snapshots
Check out zfs-snap.
Administration
Enabling on Startup
Linux | FreeBSD |
---|---|
On systemd based systems, you will need to enable the ZFS systemd services (see the notes below) in order to have the zpool imported and mounted on startup. | Enable the zfs service on startup by appending the following to /etc/rc.conf: zfs_enable="YES" |
Systemd Service Side Notes
The systemd service files should be located at /usr/lib/systemd/system/.
Enable the services by running systemctl enable service-name.
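For example, a typical set of units on a ZFS on Linux install (the same units named in the preset command below):
# systemctl enable zfs-import-cache zfs-mount zfs-share zfs-zed zfs.target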
There is an issue where, on reboot, the systemd services do not run properly, which results in the ZFS pools not being imported. The fix is to run:
# systemctl preset zfs-import-cache zfs-import-scan zfs-mount zfs-share zfs-zed zfs.target
Linux without systemd
On non-systemd Linux distros, check for any startup scripts in /etc/init.d
. If you're doing everything manually, you may need to create a file in either /etc/modules-load.d
or /etc/modprobe.d
so that the ZFS module gets loaded.
To only load the module, create a file at /etc/modules-load.d/zfs.conf
containing:
zfs
To load the module with options, create a file at /etc/modprobe.d/zfs.conf
containing (for example):
options zfs zfs_arc_max=4294967296
At-Rest Encryption
At-rest encryption is a newer ZFS feature which can be enabled with zpool set feature@encryption=enabled <pool>. Data is written using authenticated ciphers (AEAD) such as AES-CCM and AES-GCM, configurable through dataset properties. Per-dataset encryption can be enabled or disabled using the -o encryption=[on|off] flag.
User-defined keys can be inherited or set manually for each dataset and can be loaded from different sources in various formats. This key is used to encrypt the master key, which is generated randomly and never exposed to the user directly. Data encryption is done with this master key, which allows changing the user-defined key without re-encrypting the data on the dataset.
For dedup to work, the ciphertext must match for the same plaintext. This is achieved by using the same salt and IV, generated from an HMAC of the plaintext.
What is and isn't encrypted is listed below.
Encrypted | Unencrypted |
---|---|
File data and file metadata (file names, directory listings, permissions, ACLs, attributes) | Dataset and snapshot names, dataset properties, pool layout, file sizes and holes, dedup tables |
The ZFS version must have encryption installed and enabled for any of this to work. Enable the feature on the pool.
# truncate -s 1G block
# zpool create test ./block
# zpool set feature@encryption=enabled test
Then, create an encrypted dataset by passing in these additional parameters to the zfs create command.
Parameter | Description |
---|---|
-o encryption=.. | Controls the ciphersuite (cipher, key length, and mode). Setting encryption=on picks the default ciphersuite (aes-256-gcm on current OpenZFS releases). |
-o keyformat=.. | Controls what format the encryption key will be provided as. Valid options include raw, hex, and passphrase. |
-o keylocation=.. | Controls the source of the key. The key can be formatted as raw bytes, as a hex representation, or as a user passphrase. It can be provided via a user prompt, which appears when you first create the dataset, when you mount it (zfs mount), or when you load the key manually (zfs load-key). Valid options include prompt and file:///path/to/keyfile. |
-o pbkdf2iters=.. | Only used if a passphrase is used (-o keyformat=passphrase). It controls the number of PBKDF2 iterations used for key stretching. Higher is better as it slows down potential dictionary attacks on the password. The default is 350000. |
keystatus | This is a read-only value, not something you set, and is included here for reference. Possible values are none, available, and unavailable. |
For example, to create an encrypted dataset using only default values and a passphrase:
$ zfs create \
-o encryption=on \
-o keylocation=prompt \
-o keyformat=passphrase \
test/enc1
# This will ask you to enter/confirm a password.
Because all child datasets inherit the parameters of their parent, a child dataset will also be encrypted using the same properties as the parent. For example, running zfs create test/enc1/also-encrypted will create a new encrypted dataset with the same key settings and encryption method.
Encryption properties can be read as any other ZFS properties.
# zfs get -p encryption,keystatus,keyformat,keylocation,pbkdf2iters
If a ZFS key is not available, it can be provided using the zfs load-key pool/dataset
command. Attempting to mount an encrypted dataset without a valid key will also prompt you for a key.
A key can be unloaded using zfs unload-key pool/dataset
after it has been unmounted.
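A short sketch of the key workflow on the example dataset above (assuming a passphrase-based key):
# zfs unmount test/enc1
# zfs unload-key test/enc1
# zfs load-key test/enc1
# zfs mount test/enc1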
NFS Export
FreeBSD
To share a ZFS pool via NFS on a FreeBSD system, ensure that you have the following in your /etc/rc.conf
file.
mountd_enable="YES"
rpcbind_enable="YES"
nfs_server_enable="YES"
mountd_flags="-r -p 735"
zfs_enable="YES"
mountd
is required for exports to be loaded from /etc/exports
. You must also either reload or restart mountd
every time you make a change to the exports file in order to have it reread.
Set the sharenfs property using the zfs utility. To share it with anyone, set sharenfs to 'on', or provide a network that the export can be accessed from. Eg:
# zfs set sharenfs=off storage
# zfs set sharenfs="-network 172.17.12.0/24" storage/linbuild
# zfs get sharenfs
NAME PROPERTY VALUE SOURCE
data sharenfs off local
data/linbuild sharenfs -network 172.17.12.0/24 local
data/linbuild/centos sharenfs -network 172.17.12.0/24 inherited from data/linbuild
data/linbuild/fedora sharenfs -network 172.17.12.0/24 inherited from data/linbuild
data/linbuild/scientific sharenfs -network 172.17.12.0/24 inherited from data/linbuild
Add the paths you wish to export to /etc/exports
and reload mountd
.
By setting the sharenfs
property, your system will automatically create an export for the zfs pool using mountd
. By default, the NFS share will only be accessible locally.
# showmount -e
Exports list on localhost:
/data Everyone
If you want to restrict the share to a specific network, you can specify the network when setting the sharenfs property:
# zfs set sharenfs="-network 10.1.1.0/24" storage
# showmount -e
Exports list on localhost:
/storage 10.1.1.0
By the way, the exports are stored in /etc/zfs/exports
and not in the usual /etc/exports
. The ZFS and mountd service must be started for it to work. Therefore, you'll also need to append to /etc/rc.conf
the following line:
mountd_enable="YES"
On Linux
To share a ZFS dataset via NFS, you may either do it traditionally by manually editing /etc/exports
and exportfs
, or by using ZFS's sharenfs
property and managing shares using the zfs share
and zfs unshare
commands.
To use the traditional method of exportfs and /etc/exports, set sharenfs=off.
Sharing nested file systems
To share a nested dataset, use the crossmnt option. This will let you 'recursively' share datasets under the specified path by automatically mounting the child filesystem when it is accessed.
Eg. If I have multiple datasets under /storage/test:
# cat /etc/exports
/storage/test *(ro,no_subtree_check,crossmnt)
To use ZFS's automatic NFS exports, set sharenfs=on to allow world read/write access to the share. To control access based on network, add an additional rw=subnet/netmask property, e.g. rw=@192.168.1.0/24. Managing shares this way updates the ZFS sharetab file located at /etc/dfs/sharetab. The sharetab file is used by the zfs-share service. If for some reason the sharetab is out of sync with what is actually configured, you can force an update by running zfs share -a or by restarting the zfs-share service in systemd.
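A minimal sketch tying these together (the dataset name is an assumption):
# zfs set sharenfs="rw=@192.168.1.0/24" storage/nfs
# zfs share -a
# showmount -e localhost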
Handling Drive Failure
When a drive fails, you will see something similar to:
# zpool status
pool: data
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0 in 2h31m with 0 errors on Fri Jun 3 19:20:15 2016
config:
NAME STATE READ WRITE CKSUM
data DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg UNAVAIL 3 222 0 corrupted data
errors: No known data errors
The device sdg
failed and was removed from the pool.
Dell Server Info
Since this happened on a server, I removed the failed disk from the machine and replaced it with another one. Because this server uses a Dell RAID controller, and these controllers can't do pass-through, each disk that is part of the ZFS array is its own RAID0 vdisk. When inserting a new drive, the vdisk information needs to be re-created via OpenManage before the disk comes online.
Follow the steps in Dell OpenManage after inserting the replacement drive if this applies to you.
Once the vdisk is recreated, it should show up under the old device name again.
Once the replacement disk is installed on the system, reinitialize the drive with the GPT label.
On a FreeBSD system: I reinitialized the disks using the geometry settings described above.
# gpart create -s GPT da4
da4 created
# gpart add -b 2048 -s 3906617520 -t freebsd-zfs -l disk03 da4
da4p1 added
On a ZFS on Linux system:
# parted /dev/sdg
GNU Parted 2.1
Using /dev/sdg
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel GPT
Warning: The existing disk label on /dev/sdg will be destroyed and all data on this disk will be lost. Do
you want to continue?
Yes/No? y
(parted) quit
To replace the offline disk, use zpool replace pool old-device new-device
.
# zpool replace data sdg /dev/sdg
# zpool status
pool: data
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Apr 10 16:36:35 2017
35.5M scanned out of 3.03T at 3.55M/s, 249h19m to go
5.45M resilvered, 0.00% done
config:
NAME STATE READ WRITE CKSUM
data DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
replacing-5 UNAVAIL 0 0 0
old UNAVAIL 3 222 0 corrupted data
sdg ONLINE 0 0 0 (resilvering)
errors: No known data errors
After the resilvering completes, you may need to manually detach the failed drive before the pool comes back 'online'.
# zpool status
pool: data
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 2.31T in 22h11m with 54 errors on Sun Oct 1 16:07:14 2017
config:
NAME STATE READ WRITE CKSUM
data DEGRADED 0 0 54
raidz1-0 DEGRADED 0 0 109
replacing-0 UNAVAIL 0 0 0
13841903505263088383 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST3000DM001-1CH166_Z1F4HGS6-part1
ata-ST4000DM005-2DP166_ZGY0956M-part1 ONLINE 0 0 0
ata-ST3000DM001-1CH166_Z1F4HHZ7-part1 ONLINE 0 0 0
ata-ST4000DM000-2AE166_WDH01SND-part1 ONLINE 0 0 0
cache
pci-0000:02:00.0-ata-3-part1 ONLINE 0 0 0
errors: 26 data errors, use '-v' for a list
# zpool detach data 13841903505263088383
Clearing Data Errors
If you get data errors:
# zpool status
pool: data
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: resilvered 509G in 8h24m with 0 errors on Sat Oct 12 02:33:07 2019
config:
NAME STATE READ WRITE CKSUM
data DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
scsi-3600188b04c5ec1002533815852b8193e ONLINE 0 0 0
sdb FAULTED 0 0 0 too many errors
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
If the device is faulted because of a temporary issue, you can bring it back online using the zpool clear command. This will cause the pool to resilver only the data written since the device dropped offline and, depending on the amount of data, should be relatively quick.
# zpool clear data sdb
Devices that are unavailable may be taken offline and then brought back online using zpool offline pool disk and zpool online pool disk. An onlined disk could still be in a faulted state, which can be cleared with zpool clear.
## Take the device offline, then do whatever you need to the disk
# zpool offline data sdb
## Then bring it back online
# zpool online data sdb
warning: device 'sdb' onlined, but remains in faulted state
use 'zpool clear' to restore a faulted device
# zpool clear data sdb
If you do not care about data integrity, you could also clear errors by initiating a scrub and then cancelling it immediately.
# zpool scrub data
# zpool scrub -s data
Tuning and Monitoring
ZFS ARC
The ZFS Adaptive Replacement Cache (ARC) is an in-memory cache managed by ZFS to help improve read speeds by caching frequently accessed blocks in memory. With the ARC, file accesses after the first one can be served from memory rather than from disk. The ARC primarily benefits workloads with heavy random reads, such as databases.
On ZFS on Linux, the ARC size defaults to half the host's available memory and may shrink when available memory gets too low. Depending on the system setup, you may want to change the maximum amount of memory allocated to the ARC.
To change the maximum ARC size, edit the ZFS zfs_arc_max
kernel module parameter:
# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4294967296
or change the value on the fly:
# echo size_in_bytes >> /sys/module/zfs/parameters/zfs_arc_max
You may check the current ARC usage by checking arc_summary -s arc
, cat /proc/spl/kstat/zfs/arcstats
or use the arcstat.py
script that's part of the zfs
package.
# cat /proc/spl/kstat/zfs/arcstats
p 4 1851836123
c 4 4105979840
c_min 4 33554432
c_max 4 4294967296
size 4 4105591928
hdr_size 4 55529696
data_size 4 3027917312
metadata_size 4 747323904
other_size 4 274821016
anon_size 4 4979712
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 848103424
mru_evictable_data 4 605918208
mru_evictable_metadata 4 97727488
mru_ghost_size 4 3224668160
mru_ghost_evictable_data 4 2267575296
mru_ghost_evictable_metadata 4 957092864
mfu_size 4 2922158080
mfu_evictable_data 4 2421089280
mfu_evictable_metadata 4 495401984
mfu_ghost_size 4 835275264
mfu_ghost_evictable_data 4 787349504
mfu_ghost_evictable_metadata 4 47925760
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_cdata_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_compress_successes 4 0
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 46505
memory_indirect_count 4 30062
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 1077674616
arc_meta_limit 4 3113852928
arc_meta_max 4 1077674616
arc_meta_min 4 16777216
arc_need_free 4 0
arc_sys_free 4 129740800
# arcstat.py 10 10
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
17:41:09 0 0 0 0 0 0 0 0 0 3.9G 3.9G
17:41:19 211 68 32 6 7 62 50 68 32 3.9G 3.9G
17:41:29 146 32 21 6 10 25 30 32 21 3.9G 3.9G
17:41:39 170 50 29 6 9 44 42 50 31 3.9G 3.9G
17:41:49 150 33 22 6 9 27 31 33 22 3.9G 3.9G
17:41:59 151 43 28 4 9 38 39 43 28 3.9G 3.9G
If you are running out of ARC, you might get arc_prune taking up all the CPU. See:
https://github.com/zfsonlinux/zfs/issues/4345
# modprobe zfs zfs_arc_meta_strategy=0
# cat /sys/module/zfs/parameters/zfs_arc_meta_strategy
ZFS L2ARC
The ZFS L2ARC is similar to the ARC, providing a data cache on devices faster than the storage pool (such as SLC/MLC SSDs) to help improve random read workloads.
Cached data is moved from the ARC to the L2ARC when the ARC needs more room for more recently and frequently accessed blocks.
To add or remove a L2ARC device:
## Add /dev/disk/by-path/path-to-disk to zpool 'data'
# zpool add data cache /dev/disk/by-path/path-to-disk
## To remove a device, just remove it like any other disk:
# zpool remove data /dev/disk/by-path/path-to-disk
You can see the usage of the L2ARC by running zpool list -v
.
# zpool list -v
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
data 8.12T 5.72T 2.40T - 48% 70% 1.00x ONLINE -
raidz1 8.12T 5.72T 2.40T - 48% 70%
ata-ST3000DM001-part1 - - - - - -
ata-ST3000DM001-part1 - - - - - -
ata-ST4000DM000-part1 - - - - - -
cache - - - - - -
pci-0000:02:00.0-ata-3-part1 55.0G 28.3G 26.7G - 0% 51%
ZFS ZIL and SLOG
ZIL
The ZFS Intent Log (ZIL) is how ZFS keeps track of synchronous write operations so that they can be completed or rolled back after a crash or failure. The ZIL is not used for asynchronous writes, as those still go through the system caches.
Because the ZIL is stored in the data pool, a synchronous write to a ZFS pool involves duplicate writes: once to the ZIL and again to the zpool. This is detrimental to performance since one write operation now requires two or more writes (i.e. write amplification).
SLOG
To improve performance, the ZIL can be moved to a separate device called a Separate Intent Log (SLOG). Synchronous writes then land on the SLOG rather than on a ZIL inside the pool, avoiding the double write to the pool itself.
The storage device used for the SLOG should ideally be very fast and reliable. The SLOG doesn't need to be large either; a couple of gigabytes should be sufficient.
For example, assuming that data is flushed every 5 seconds, a single gigabit connection (~125 MB/s) can only accumulate about 5 x 125 MB, or roughly 600 MB, in the ZIL.
For these reasons, SLC SSDs are preferable to MLC SSDs for the SLOG.
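For example, a hedged sketch of adding a mirrored SLOG to the 'data' pool (the device IDs are placeholders):
# zpool add data log mirror /dev/disk/by-id/nvme-slog1 /dev/disk/by-id/nvme-slog2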
See Also:
- http://www.freenas.org/blog/zfs-zil-and-slog-demystified/
- https://www.ixsystems.com/blog/o-slog-not-slog-best-configure-zfs-intent-log/
Troubleshooting
ZFS: removing nonexistent segment from range tree
One morning, a single 8TB zpool decided to hang and stop working. After restarting the system, zpool import hangs and the kernel logs show the following:
[Tue Aug 8 09:39:15 2023] PANIC: zfs: removing nonexistent segment from range tree (offset=2cb7c938000 size=68000)
[Tue Aug 8 09:39:15 2023] Showing stack for process 3061
[Tue Aug 8 09:39:15 2023] CPU: 0 PID: 3061 Comm: z_wr_iss Tainted: P OE --------- - - 4.18.0-425.19.2.el8_7.x86_64 #1
[Tue Aug 8 09:39:15 2023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[Tue Aug 8 09:39:15 2023] Call Trace:
[Tue Aug 8 09:39:15 2023] dump_stack+0x41/0x60
[Tue Aug 8 09:39:15 2023] vcmn_err.cold.0+0x50/0x68 [spl]
[Tue Aug 8 09:39:15 2023] ? bt_grow_leaf+0xbd/0x160 [zfs]
[Tue Aug 8 09:39:15 2023] ? pn_free+0x30/0x30 [zfs]
[Tue Aug 8 09:39:15 2023] ? zfs_btree_insert_leaf_impl+0x21/0x40 [zfs]
[Tue Aug 8 09:39:15 2023] ? bt_shrink_leaf+0x8c/0xa0 [zfs]
[Tue Aug 8 09:39:15 2023] ? pn_free+0x30/0x30 [zfs]
[Tue Aug 8 09:39:15 2023] ? zfs_btree_find+0x183/0x300 [zfs]
[Tue Aug 8 09:39:15 2023] zfs_panic_recover+0x6f/0x90 [zfs]
[Tue Aug 8 09:39:15 2023] range_tree_remove_impl+0xabd/0xfa0 [zfs]
[Tue Aug 8 09:39:15 2023] space_map_load_callback+0x22/0x90 [zfs]
[Tue Aug 8 09:39:15 2023] space_map_iterate+0x1ae/0x3c0 [zfs]
[Tue Aug 8 09:39:15 2023] ? spa_stats_destroy+0x190/0x190 [zfs]
[Tue Aug 8 09:39:15 2023] space_map_load_length+0x64/0xe0 [zfs]
[Tue Aug 8 09:39:15 2023] metaslab_load.part.26+0x13e/0x810 [zfs]
[Tue Aug 8 09:39:15 2023] ? _cond_resched+0x15/0x30
[Tue Aug 8 09:39:15 2023] ? spl_kmem_alloc+0xd9/0x120 [spl]
[Tue Aug 8 09:39:15 2023] metaslab_activate+0x4b/0x230 [zfs]
[Tue Aug 8 09:39:15 2023] ? metaslab_set_selected_txg+0x89/0xc0 [zfs]
[Tue Aug 8 09:39:15 2023] metaslab_group_alloc_normal+0x166/0xb30 [zfs]
[Tue Aug 8 09:39:15 2023] metaslab_alloc_dva+0x24c/0x8d0 [zfs]
[Tue Aug 8 09:39:15 2023] ? vdev_disk_io_start+0x3d6/0x900 [zfs]
[Tue Aug 8 09:39:15 2023] metaslab_alloc+0xc5/0x250 [zfs]
[Tue Aug 8 09:39:15 2023] zio_dva_allocate+0xcb/0x860 [zfs]
[Tue Aug 8 09:39:15 2023] ? spl_kmem_alloc+0xd9/0x120 [spl]
[Tue Aug 8 09:39:15 2023] ? zio_push_transform+0x34/0x80 [zfs]
[Tue Aug 8 09:39:15 2023] ? zio_io_to_allocate.isra.8+0x5f/0x80 [zfs]
[Tue Aug 8 09:39:15 2023] zio_execute+0x90/0xf0 [zfs]
[Tue Aug 8 09:39:15 2023] taskq_thread+0x2e1/0x510 [spl]
[Tue Aug 8 09:39:15 2023] ? wake_up_q+0x70/0x70
[Tue Aug 8 09:39:15 2023] ? zio_taskq_member.isra.11.constprop.17+0x70/0x70 [zfs]
[Tue Aug 8 09:39:15 2023] ? taskq_thread_spawn+0x50/0x50 [spl]
[Tue Aug 8 09:39:15 2023] kthread+0x10b/0x130
[Tue Aug 8 09:39:15 2023] ? set_kthread_struct+0x50/0x50
[Tue Aug 8 09:39:15 2023] ret_from_fork+0x35/0x40
I found a GitHub issue from 2022 with a similar problem: https://github.com/openzfs/zfs/issues/13483. The workaround that seems to work is to enable the following two tunables before importing the zpool with zpool import -f $zpool:
# echo 1 > /sys/module/zfs/parameters/zil_replay_disable
# echo 1 > /sys/module/zfs/parameters/zfs_recover
That seemed to have allowed the pool to import. Kernel logs show the following messages at the time of import:
[ 55.358591] WARNING: zfs: removing nonexistent segment from range tree (offset=2cb7c938000 size=68000)
[ 55.360567] WARNING: zfs: adding existent segment to range tree (offset=2bb7c938000 size=68000)
The issue could possibly have been caused by a flaky SATA connection, as the disk did appear to go offline a few times in the early hours. The disk was readable when I was diagnosing the issue before rebooting the system, and SMART also suggests the disk is healthy.
It's possible something corrupt was written, as the zpool is used inside a VM via QEMU disk passthrough. Perhaps too many layers caused corruption when the underlying SATA link got bumped?
See Also
- https://wiki.archlinux.org/index.php/ZFS
- https://www.freebsd.org/doc/handbook/zfs-zfs.html
- http://superuser.com/questions/608612/4k-hard-drives-freebsd-gpart-and-zfs
- ZFS encryption guide