SMART
SMART (Self-Monitoring, Analysis and Reporting Technology) is a monitoring system built into most modern hard drives and solid-state drives.
Usage
On Linux, most distros package the SMART tools as the smartmontools package. Two utilities come with this package:
- smartctl, which can be used to perform SMART-related tasks such as starting tests and printing health status reports
- smartd, which monitors SMART and performs self-tests periodically. It may be configured to alert you when issues are detected.
Show Disk Information
To show all capabilities, monitored attributes, errors, etc. of a particular disk, run:
# smartctl -a /dev/sda
Run Self-Test
Use the --test=long or the --test=short option to begin a self-test on the device. To abort a test, use the -X or --abort option. Test results will be listed under the self-test section in smartctl -a or smartctl -l selftest.
If a test failed on a bad sector, you may try to force a reallocation of that sector.
It is a good idea to schedule long self-tests periodically, for example once a month.
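As a rough sketch, the options above can be wrapped in a small shell function. The function name smart_selftest and its action words (abort, results) are made up for this page; only the smartctl options themselves come from smartmontools. Running it against a real disk normally requires root.

```shell
# Hypothetical wrapper around the smartctl self-test options described above.
# Usage: smart_selftest DISK [short|long|abort|results]
smart_selftest() {
    case "${2:-short}" in
        short|long)
            # Start a short or long self-test on the device.
            smartctl --test="${2:-short}" "$1"
            ;;
        abort)
            # Abort a self-test that is currently running.
            smartctl -X "$1"
            ;;
        results)
            # Print the self-test log for the device.
            smartctl -l selftest "$1"
            ;;
        *)
            echo "usage: smart_selftest DISK [short|long|abort|results]" >&2
            return 2
            ;;
    esac
}
```

For example, smart_selftest /dev/sda long starts a long test, and smart_selftest /dev/sda results prints the self-test log afterwards.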
Troubleshooting
USB Support
When working with hard drives connected via USB, you may need to disable USB Attached SCSI (UAS). Otherwise, smartctl may report:
# smartctl -a /dev/sdb
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.14-300.fc32.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
Read Device Identity failed: scsi error unsupported field in scsi command
Use one of the three methods described in the How to disable USB Attached Storage page. Once UAS is disabled, smartctl should be able to report the SMART parameters. For some devices, you may also need to pass a -d option to smartctl. Check the supported devices page at https://www.smartmontools.org/wiki/Supported_USB-Devices for more information.
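A minimal sketch of the -d fallback, assuming UAS has already been dealt with and that the enclosure speaks SAT (SCSI/ATA Translation), which is only one of several device types smartctl accepts. The function name smart_usb_report is hypothetical, and the right -d value for a given bridge should be taken from the supported devices page rather than from this example.

```shell
# Hypothetical helper: print SMART data for a USB disk.
# It first probes the device with "smartctl -i" using auto-detection and,
# if that fails, retries the full report with "-d sat" (an assumption:
# your USB bridge may need a different device type).
smart_usb_report() {
    if smartctl -i "$1" >/dev/null 2>&1; then
        smartctl -a "$1"
    else
        smartctl -a -d sat "$1"
    fi
}
```

For example, smart_usb_report /dev/sdb prints the full report using whichever of the two invocations works.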
Attributes
SMART attributes are listed as a table that looks similar to the following:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 80
3 Spin_Up_Time 0x0007 158 158 024 Pre-fail Always - 279 (Average 512)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 72
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 121 121 020 Pre-fail Offline - 34
9 Power_On_Hours 0x0012 094 094 000 Old_age Always - 47400
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 69
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 1168
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 1168
194 Temperature_Celsius 0x0002 181 181 000 Old_age Always - 33 (Min/Max 19/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
The columns from left to right are:
- Attribute number (an 8-bit ID from 1 through 255, sometimes reported in hexadecimal as 01 to FF).
- Attribute name
- Attribute handling flag, which you may ignore
- The current value of this attribute. The value is normalized by the drive and typically runs from either 100 (best) or 200 (best) down to 1 (worst). Note that 0, 254, and 255 are reserved values and 253 means the attribute value is not yet set (typically only seen on a new drive).
- The worst value is the lowest value recorded for this attribute over the drive's lifetime.
- The threshold value is the value at or below which the attribute is considered failed. The severity of this attribute failing is determined by the next column, type.
- Type is either pre-fail or old age. A pre-fail attribute that has reached its threshold is a critical failure and means the drive is at risk of imminent failure. An old age attribute relates to normal aging and is for informational use only.
- Updated determines when the value is updated and is either Always or Offline. Always means the value is 'live'. Offline means the value is only updated during an offline test.
- When failed identifies whether and when the attribute failed: - if it has never failed, FAILING_NOW if it is currently at or below its threshold, or In_the_past if it was at some earlier point.
- Raw value is the raw value of the attribute. Its format and meaning are defined by the manufacturer.
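The threshold rule described above can be sketched as a small awk filter over the attribute table. The function name smart_failing_attrs is hypothetical, and the column positions are an assumption based on the table layout shown above.

```shell
# Hypothetical helper: reads a smartctl attribute table on stdin and prints
# every attribute whose normalized VALUE is at or below its THRESH.
smart_failing_attrs() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 fields;
        # this skips the header line and any surrounding text.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            value  = $4 + 0   # VALUE column, e.g. "094" becomes 94
            thresh = $6 + 0   # THRESH column
            # A THRESH of 0 effectively means no failure threshold is set.
            if (thresh > 0 && value <= thresh)
                printf "%s %s value=%d thresh=%d\n", $7, $2, value, thresh
        }
    '
}
```

A typical use would be smartctl -A /dev/sda | smart_failing_attrs, which prints nothing for a healthy disk. Any line starting with Pre-fail deserves immediate attention.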
Notes on specific attributes
- Reallocated_Sector_Ct (5)
- Disks automatically remap bad sectors to a pool of reserved sectors. This attribute is the number of sectors that have been reallocated. You should be wary of disks with a non-zero value.
- Current_Pending_Sector (197)
- The number of sectors that the disk had issues reading. A pending sector will be reallocated when either (1) the sector can be read properly or (2) the sector is written over, which may force a reallocation.
- Temperature_Celsius (194)
- The temperature of the disk.
- Airflow_Temperature_Cel (190)
- The airflow temperature of the disk, though I can't find any information on what this value actually represents. I have only ever seen this value mirror the value in Temperature_Celsius (194).
- Seagate drives may report this attribute as having failed in the past. If desired, you may have smartd ignore this attribute with -I 190 -i 190.
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 073 040 040 Old_age Always In_the_past 27 (Min/Max 22/27)
- Raw_Read_Error_Rate (1)
- An indicator of the current rate of errors when performing low level sector read operations. Error correction handled by the firmware of the drive will transparently correct any errors.
- Do not rely on the RAW_VALUE number. Most manufacturers report this value as 0. Seagate, on the other hand, reports a raw value that may be a very large number. Instead, when reading this attribute, use the scaled VALUE metric.
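To look only at the scaled columns for a single attribute, something like the following awk filter may help. The function name smart_attr_scaled is hypothetical, and the column positions are an assumption based on the table layout shown earlier.

```shell
# Hypothetical helper: reads a smartctl attribute table on stdin and prints
# the normalized VALUE, WORST and THRESH columns for one attribute ID,
# leaving out RAW_VALUE entirely.
smart_attr_scaled() {
    awk -v id="$1" '
        # Match the row whose ID column equals the requested attribute ID.
        $1 == id && NF >= 10 {
            printf "%s value=%d worst=%d thresh=%d\n", $2, $4, $5, $6
        }
    '
}
```

For example, smartctl -A /dev/sda | smart_attr_scaled 1 prints the scaled numbers for Raw_Read_Error_Rate.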