SMART

From Leo's Notes
Last edited on 17 July 2021, at 00:06.

SMART (Self-Monitoring, Analysis and Reporting Technology) is a monitoring system built into most modern hard drives and solid-state drives.

Usage

On Linux, most distros ship the SMART tooling as the smartmontools package. Two utilities come with this package:

  • smartctl, which can be used to perform SMART-related tasks such as starting tests and printing health status reports
  • smartd, which monitors SMART and performs self-tests periodically. It may be configured to alert you when issues are detected.

Show Disk Information

To show all capabilities, monitored attributes, errors, etc. of a particular disk, run:

# smartctl -a /dev/sda

Run Self-Test

Use the --test=long or --test=short option to begin a self-test on the device. The drive runs the test itself in the background. To abort a test, use the -X or --abort option. Test results are listed under the self-test section of smartctl -a, or on their own with smartctl -l selftest.
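The options above could be wrapped in a few small shell helpers. This is only a sketch: the function names and the DISK variable are made up for illustration, and /dev/sda is a placeholder device.

```shell
# Sketch only: DISK is a placeholder; point it at the drive you want to test.
DISK="${DISK:-/dev/sda}"

# Start a self-test; pass "short" or "long" (defaults to short).
# The drive runs the test itself, so smartctl normally returns right away.
selftest_start() { smartctl --test="${1:-short}" "$DISK"; }

# Print the self-test log for the drive.
selftest_log() { smartctl -l selftest "$DISK"; }

# Abort a self-test that is still running.
selftest_abort() { smartctl -X "$DISK"; }
```

Run as root, for example selftest_start long, then selftest_log once the estimated run time has passed.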

If a test failed on a bad sector, you may try to force a reallocation of that sector.

It is a good idea to schedule long self-tests periodically, for example once a month. smartd can run them for you.
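A monthly long test could be scheduled through smartd with a line like the one below. This is a config fragment, not a tested setup: the device name, the day and the hour are arbitrary picks, and the config file may live somewhere other than /etc/smartd.conf on your distro.

```
# /etc/smartd.conf (path may differ per distro)
# Monitor /dev/sda with the default checks (-a) and run a long
# self-test (L) on the 1st of every month at 03:00.
# Schedule fields are T/MM/DD/d/HH; "." matches any single character.
/dev/sda -a -s L/../01/./03
```

If the file already contains a DEVICESCAN line, the device line most likely needs to go above it, since smartd stops reading at DEVICESCAN.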

Troubleshooting

USB Support

When working with hard drives connected via USB, you may need to disable USB Attached SCSI (UAS). Otherwise, smartctl may report an error like this:

# smartctl -a /dev/sdb
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.14-300.fc32.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: scsi error unsupported field in scsi command

Use one of the three methods described in the How to disable USB Attached Storage page. Once UAS is disabled, smartctl should be able to report the SMART parameters. For some devices, you may also need to pass a -d option to smartctl. Check the supported devices page (https://www.smartmontools.org/wiki/Supported_USB-Devices) for more information.
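As a sketch of the -d option mentioned above: sat (SCSI to ATA Translation) is a device type that many USB-to-SATA bridges accept, but it is an assumption here, and the right type for a given enclosure should come from the supported devices page. The helper name usb_smart_info is made up.

```shell
# Sketch: query a USB disk, telling smartctl which pass-through to use.
# $1 = device path, $2 = device type for -d (defaults to "sat", an assumption).
usb_smart_info() {
    smartctl -d "${2:-sat}" -a "$1"
}
```

For example, usb_smart_info /dev/sdb runs smartctl -d sat -a /dev/sdb.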

See also: https://askubuntu.com/questions/637450/cannot-perform-smart-data-and-self-test-on-external-hard-drive

Attributes

SMART attributes are listed as a table and look similar to the following:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       80
  3 Spin_Up_Time            0x0007   158   158   024    Pre-fail  Always       -       279 (Average 512)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       72
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   121   121   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       47400
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1168
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       1168
194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Min/Max 19/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

The columns from left to right are:

  1. Attribute number (8-bit: 1 through 255, sometimes reported in hex as 01 to FF).
  2. Attribute name
  3. Attribute handling flag, which you may ignore
  4. The current value of this attribute, normalized to a scale of either 100 (best) to 1 (worst) or 200 to 1. Note that 0, 254, and 255 are reserved values, and 253 means the attribute value is not yet set (typically only seen on a new drive).
  5. The worst value is the lowest (worst) value recorded so far, like a low-water mark.
  6. The threshold value is the limit for this attribute: once the current value drops to or below it, the attribute is considered failed. The severity of this attribute failing is determined by the next column, type.
  7. Type is either pre-fail or old age. A pre-fail attribute at or below its threshold is a critical failure: the disk is at risk of imminent failure. An old-age attribute is related to normal aging and is for informational use only.
  8. Updated determines when the value is updated and is either Always or Offline. Always means the value is 'live'. Offline means the value is only updated during an offline test.
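  9. When failed identifies when the attribute failed: - if it has never failed, FAILING_NOW if it is currently at or below its threshold, or In_the_past if it was at some earlier point.
  10. Raw value is the unscaled value of the attribute. Its format and meaning are defined by the manufacturer.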
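The VALUE/THRESH comparison described above can be checked mechanically. The following is a minimal sketch that assumes the ten-column layout shown in the table; smart_failed_attrs is a made-up helper name, not part of smartmontools.

```shell
# Sketch: read an attribute table on stdin and print the ID, name and
# type of every attribute whose VALUE is at or below its THRESH.
smart_failed_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        # $4 = VALUE, $6 = THRESH, $7 = TYPE; "+ 0" forces a numeric compare
        if ($4 + 0 <= $6 + 0) print $1, $2, $7
    }'
}
```

Typical use would be smartctl -A /dev/sda | smart_failed_attrs, where -A prints only the attribute table. No output means no attribute is currently at or below its threshold.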

Notes on specific attributes

Reallocated_Sector_Ct (5)
Disks automatically remap bad sectors to a pool of reserved sectors. This attribute is the number of sectors that have been reallocated. You should be wary of disks with a non-zero value.
Current_Pending_Sector (197)
The number of sectors that the disk had trouble reading and that are waiting to be dealt with. A pending sector is resolved when either (1) the sector is later read properly, or (2) the sector is written over, which may force a reallocation.
Temperature_Celsius (194)
The temperature of the disk.
Airflow_Temperature_Cel (190)
The airflow temperature of the disk, though I can't find any information on what this value actually represents. I have only ever seen this value mirror the value in Temperature_Celsius (194).
Seagate drives may report this attribute as having failed in the past (WHEN_FAILED shows In_the_past), as in the example below. If desired, you may have smartd ignore this attribute with -I 190 -i 190.
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   073   040   040    Old_age   Always   In_the_past 27 (Min/Max 22/27)
Raw_Read_Error_Rate (1)
An indicator of the current rate of errors when performing low-level sector read operations. The drive's firmware corrects these errors transparently.
Do not rely on the RAW_VALUE number for this attribute. Most manufacturers report it as 0. Seagate, on the other hand, reports a raw number that may be very large. Use the scaled VALUE column instead.
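For the Airflow_Temperature_Cel (190) note above, the -I 190 -i 190 options would go on the device's line in the smartd config. A sketch of such a line, with /dev/sda and the -a default checks as assumptions:

```
# /etc/smartd.conf (path may differ per distro)
# -I 190: do not report changes in the value of attribute 190
# -i 190: do not report attribute 190 as a failed usage attribute
/dev/sda -a -I 190 -i 190
```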

See Also