IBM Spectrum Archive

IBM Spectrum Archive Enterprise Edition (EE) can be used to tier data from a GPFS storage pool to a tape library. Files stored on tape are exposed through LTFS and can be accessed like any other file stored on spinning disk.

Usage

Use the ltfsee command to interact with Spectrum Archive. Other binaries related to LTFS can be found at /opt/ibm/ltfsee/bin/.

Here is a quick cheat sheet of common commands:

Command	Description
ltfsee status	Shows Spectrum Archive status
ltfsee info libraries	Shows library information
ltfsee info drives	Shows drive status
ltfsee info jobs	Shows all jobs that are waiting to be processed
ltfsee info pools	Shows all pools configured in Spectrum Archive
ltfsee info tapes	Shows all tapes that are managed by Spectrum Archive
ltfsee info files -f /path/to/file	Show migration status of the given file path
ltfsee pool remove -p POOL -t TAPEID	Removes the tape from the pool
ltfsee pool remove -p POOL -t TAPEID -r	Force removes the tape from the pool
ltfsee pool add -p POOL -t TAPEID	Adds a tape to the pool
ltfsee pool add -p POOL -t TAPEID -f	Formats the tape before adding it to the pool
ltfsee pool add -p POOL -t TAPEID -c	Checks the tape before adding it to the pool
ltfsee pool add -p POOL -t TAPEID -d	Does a deep recovery and adds the tape to the pool
ltfsee reclaim -p POOL -t TAPEID	Reclaims all data from the tape and removes it from the pool
ltfsee retrieve	Retrieves the inventory from the underlying tape library
ltfsee drive remove -d DRIVESERIAL	Removes a drive
ltfsee drive add DRIVESERIAL[:ROLES] NODEID	Adds the drive to LTFS on the given node, with optional roles
ltfsee tape move homeslot -t TAPEID	Moves the tape to its home slot
ltfsee tape move ieslot -t TAPEID	Moves the tape to an import/export (I/E) slot
/opt/ibm/ltfsee/bin/ltfsee_log_collection	Collects logs to the current directory for IBM support
dsmrecall -resident <file name>	Changes files from the premigrated state to resident

Spectrum Archive Status

ltfsee info libraries shows all libraries.

# ltfsee info libraries 
Library Name  Status  Model     Serial Number     Ctrl Node   
T_ARCH        Active  03584L32  0000078BA6130402  172.26.3.249

ltfsee status shows whether MD and MMM are active.

# ltfsee status
Ctrl Node     MD      MMM     Library
172.26.3.249  Active  Active  T_ARCH

ltfsee info pools shows all pools.

# ltfsee info pools
Pool Name  Total(TiB)  Used(TiB)  Free(TiB)  Reclaimable(TiB)  Tapes  Type  Library  Node Group
TIERED1         644.3       84.6      559.7               0.0     61  LTO   T_ARCH   G0        
TIERED2         638.9       80.8      558.1               0.0     61  LTO   T_ARCH   G0
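
The pool listing can be reduced to a usage percentage per pool. The following is a minimal sketch, assuming the column layout shown above (pool name, total, used in the first three columns); the function name pool_usage is hypothetical and not part of Spectrum Archive.

```shell
# pool_usage: read `ltfsee info pools` output on stdin and print
# "<pool> <used-percent>" for each pool. The header line is skipped.
# Assumes column 2 is Total(TiB) and column 3 is Used(TiB), as in the sample above.
pool_usage() {
    awk 'NR > 1 && $2 > 0 { printf "%s %.1f\n", $1, ($3 / $2) * 100 }'
}

# Possible usage on the LTFS server:
#   ltfsee info pools | pool_usage
```

Reading from stdin keeps the helper independent of ltfsee itself, so it can also be run against saved output.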

Tape Drive Management

To show the status of all tape drives and which tapes are mounted in them:

# ltfsee info drives
Drive S/N   Status   Type  Role  Library  Address  Node ID  Tape      Node Group
0007807A4B  In use   LTO8   mrg  T_ARCH   260      4        T00031L8  G0        
0007807A0B  Mounted  LTO8   mrg  T_ARCH   261      4        -         G0        
000780765B  In use   LTO8   mrg  T_ARCH   262      4        T00054L8  G0

A detailed list of drive status can be found at https://www.ibm.com/support/knowledgecenter/ST9MBR_1.2.3/ltfs_ee_ltfsee_info_drives.html#ltfs_ee_ltfsee_info_drives. Drives that are in use are most likely in use by a running job. You may check jobs that are in progress and their run-time (the Idle column).

# ltfsee info jobs
Job Type         Status       Idle(sec)  Scan ID     Tape      Pool     Library  Node  File Name or inode
Reclaim(Source)  In-progress     573811  2944737025  T00031L8  TIERED1  T_ARCH      4  -                 
Reclaim(Target)  In-progress     573683  2944737281  T00054L8  TIERED1  T_ARCH      4  -                 
Validate         Unscheduled       4469  1602429441  T00054L8  TIERED1  T_ARCH      -  -
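
To spot jobs that have been sitting in progress for a long time, the jobs listing can be filtered on the Idle(sec) column. This is a sketch only, assuming columns are separated by two or more spaces as in the output above; the function name stalled_jobs and the threshold are illustrative choices.

```shell
# stalled_jobs MIN_IDLE_SECONDS: read `ltfsee info jobs` output on stdin and
# print "<job type> <tape> <idle seconds>" for every In-progress job whose
# Idle(sec) value is at least MIN_IDLE_SECONDS.
# Fields are split on runs of two or more spaces so that multi-word
# job types stay in a single field.
stalled_jobs() {
    awk -F '  +' -v min="${1:-0}" '
        NR > 1 && $2 == "In-progress" && $3 + 0 >= min + 0 { print $1, $5, $3 }
    '
}

# Possible usage: list jobs idle for more than a day
#   ltfsee info jobs | stalled_jobs 86400
```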

When performing maintenance on a drive, first remove it from Spectrum Archive so that it won't be used by any jobs. Remove a drive with ltfsee drive remove -d DRIVESERIAL. Drives can be added back with the ltfsee drive add command. A drive can be assigned any combination of three roles, each represented by a flag value: Migration (4), Recall (2), and Generic (1). To assign roles, specify the sum of the desired flag values. The default is to allow all roles (7).

## To add a drive to node 4 for recall jobs only:
# ltfsee drive add 0007807A0B:2 4

## To add a drive to node 4 for all 3 (Migrate, Recall, Generic) roles, do not specify a role.
# ltfsee drive add 0007807A0B 4
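
Because the role value is a combination of flags, it can be computed from role names rather than memorized. A small sketch follows; the function drive_roles and the role keywords migrate, recall, and generic are hypothetical labels chosen here, while the numeric values come from the description above.

```shell
# drive_roles ROLE...: combine the named role flags into the numeric value
# expected after the colon in `ltfsee drive add DRIVESERIAL:ROLES NODEID`.
# Flag values: Migration = 4, Recall = 2, Generic = 1.
drive_roles() {
    total=0
    for role in "$@"; do
        case "$role" in
            migrate) total=$((total | 4)) ;;
            recall)  total=$((total | 2)) ;;
            generic) total=$((total | 1)) ;;
            *) echo "unknown role: $role" >&2; return 1 ;;
        esac
    done
    echo "$total"
}

# Possible usage: add a drive to node 4 for migration and recall only
#   ltfsee drive add "0007807A0B:$(drive_roles migrate recall)" 4
```

Using a bitwise OR means that naming the same role twice does not change the result.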

Tape Cartridge Management

List Tape Cartridges

To show all tapes in the library, run ltfsee info tapes. Consult the documentation for more status codes at https://www.ibm.com/support/knowledgecenter/ST9MBR_1.2.6/ltfs_ee_ltfsee_info_tapes.html. Common tape status codes are:

Status Code	Description
Critical	The Critical status indicates that an attempt to write to this cartridge failed. To avoid data loss, recover the data on this cartridge by using the `ltfsee recover` command, then discard this cartridge.
Error	The Error status indicates that the cartridge has an error and cannot be recovered. This cartridge cannot be used with the IBM Spectrum Archive Enterprise Edition.
Invalid	The Invalid status indicates that the cartridge is inconsistent with the LTFS format. To check and repair this tape before you add it to a tape storage pool, use the `ltfsee pool add` command with the check option.
Unavailable	The Unavailable status indicates that the cartridge is not available in the IBM Spectrum Archive Enterprise Edition system. Tapes that are newly inserted into the tape library have an Unavailable status. To recall files from a tape with this status, first import the tape by using the ltfsee import command. To add a tape with this status to the IBM Spectrum Archive Enterprise Edition system: First, move the tape to a home slot by using the `ltfsee tape move` command. Then, add the tape to a tape storage pool by using the `ltfsee pool` command.
Unknown	A tape has an Unknown status if it was removed and then re-added to Spectrum Archive. Tapes in this state can be made valid again by validating the tape using `ltfsee tape validate -p POOL -t TAPEID`.
Valid	The Valid status indicates that the cartridge is valid.

# ltfsee info tapes
Tape ID   Status       Type  Capacity(GiB)  Used(GiB)  Free(GiB)  Reclaimable(GiB)  Pool     Library  Address  Drive       Appendable
T00000L8  Valid        L8            10907          0      10907                 0  TIERED1  T_ARCH   1161     -           yes       
T00001L8  Valid        L8            10907          0      10907                 0  TIERED1  T_ARCH   1160     -           yes       
T00003L8  Valid        L8            10907          0      10907                 0  TIERED1  T_ARCH   1158     -           yes       
T00004L8  Valid        L8            10907          0      10907                 0  TIERED1  T_ARCH   1157     -           yes
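
With many tapes in the library, it helps to list only the cartridges that need attention. The sketch below filters out every tape whose status starts with "Valid"; it assumes the column layout shown above, and the function name problem_tapes is hypothetical.

```shell
# problem_tapes: read `ltfsee info tapes` output on stdin and print
# "<tape id> <status>" for every tape whose status does not start with "Valid".
# Fields are split on runs of two or more spaces so that a multi-word
# status stays in a single field.
problem_tapes() {
    awk -F '  +' 'NR > 1 && $1 != "" && $2 !~ /^Valid/ { print $1, $2 }'
}

# Possible usage:
#   ltfsee info tapes | problem_tapes
```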

Adding Tapes to a Pool

To add tapes to a particular pool, use ltfsee pool add -p pool -t tape ...:

# ltfsee pool add -p TIERED2 -t T00099L8
GLESL042I(00894): Adding tape T00099L8 to storage pool TIERED2.
Added tape T00099L8 to pool TIERED2 successfully.

Tapes can only be added if they contain a valid format. Tapes that require formatting need the -f option. Tapes that have been used previously and have old metadata associated with the cartridge cannot be formatted using this method, as a safeguard against unintended data loss. For these tapes, an exception format using the -e option is required to ignore the metadata during formatting.

# ltfsee pool add -p TIERED1 -t T00030L8 -f
GLESL042I(00894): Adding tape T00030L8 to storage pool TIERED1.
GLESL091E(00937): This operation is not allowed to this state of tape. Need to check the status of Tape T00030L8 by using the ltfsee info tapes command.
# ltfsee pool add -p TIERED1 -t T00030L8 -e
GLESL042I(00894): Adding tape T00030L8 to storage pool TIERED1.
Tape T00030L8 successfully formatted.
Added tape T00030L8 to pool TIERED1 successfully.

When dealing with tapes that are in an Error state, you must first remove the tape from the storage pool using ltfsee pool remove -p ... -t ... -r, unassign it from the logical library in the tape library GUI, and run ltfsee retrieve to remove the tape entirely from LTFS. Then re-add the tape to the logical library and re-run ltfsee retrieve. These steps put the tape into an Unavailable state; only then can the cartridge be added back to a storage pool by re-formatting or checking it.

Removing Tapes from a Pool

To remove a tape containing migrated data from a pool, use ltfsee reclaim. This command will move all migrated data to other tapes in the storage pool and then remove the tape from the storage pool. Reclaiming a damaged tape may take a very long time. You can always interrupt a reclaim process with a SIGINT (ctrl-c) and wait for the process to terminate.

# ltfsee reclaim -p TIERED2 -t T00101L8
GLESL682E(01020): Tape with ID: T00093L8 is an invalid state. Target tapes must be in state "Valid LTFS".
GLESL682E(01020): Tape with ID: T00100L8 is an invalid state. Target tapes must be in state "Valid LTFS".
Start reclaiming 1 tapes in the following list of tapes:
T00101L8 .
Files in tape T00101L8 are copied to tape T00071L8.
GLESL086I(01596): Reclamation has completed but some of the files remain, and a reconcile is required. At the least tape T00101L8 must be reconciled.

Tapes containing no migrated data can also be removed from a pool with ltfsee pool remove -p pool -t tape ....

# ltfsee pool remove -p TIERED1 -t T00030L8 
GLESL043I(01134): Removing tape T00030L8 from storage pool TIERED1.
Removed tape T00030L8 from pool TIERED1 successfully.

For tapes that Spectrum Archive believes contain migrated data when in fact they don't (verified by checking every file listed in the tape volume_cache or using the list_by_tape.policy), you may force remove a tape using the -r option. If you force remove a tape containing migrated data, you may still be able to recover the data if you do a deep recovery check when adding the tape without reformatting.

Force removing a mounted cartridge

If you force remove a tape that's mounted in a drive, the tape will be removed from the pool and unmounted. However, the tape will not be moved to its home slot because Spectrum Archive is no longer aware of the tape's existence. The tape will sit ejected in the drive and will prevent the tape changer from fulfilling any requests to this tape drive made by Spectrum Archive until it is moved manually via the tape library interface.

# ltfsee pool remove -p TIERED1 -t T00037L8
GLESL043I(01134): Removing tape T00037L8 from storage pool TIERED1.
GLESL357E(01223): Tape T00037L8 has migrated files or saved files. It has not been removed from the pool.
# ltfsee pool remove -p TIERED1 -t T00037L8 -r
GLESL043I(01134): Removing tape T00037L8 from storage pool TIERED1.
GLESL361W(01199): Tape T00037L8 was removed from pool TIERED1 forcefully. Cannot recall files on tape T00037L8. Add the tape to the same pool again if you need to recall files from the tape without formatting.

Moving a tape cartridge

If a tape is in a drive, you can move it back to its homeslot using ltfsee tape move homeslot -t tape -p pool. Alternatively, use ieslot to move it to the IO slot of the tape library for extraction.

# ltfsee tape move homeslot -t T00037L8 -p TIERED1
GLESL373I(00890): Moving tape T00037L8.
Tape T00037L8 is unmounted because it is inserted into the drive.
Tape T00037L8 is moved successfully.

File statuses, recalls

A file is migrated when its contents have been copied to one or more tapes and its original GPFS location has been turned into a file stub. Files that are stored only on GPFS are resident. Files whose data has been copied to tape but still resides in full on GPFS are premigrated.

Side Notes on Migrated Files

Migrated files are replaced by a stub file that occupies no disk blocks, so du reports them as 0. To count the original file sizes with du, use --apparent-size. For example:

[root@node001 xuexu]# du -sh dbgap_UK_OTTO
23T	dbgap_UK_OTTO

[root@node001 xuexu]# du -sh --apparent-size dbgap_UK_OTTO
35T	dbgap_UK_OTTO 

[root@node001 dbgap_UK_OTTO]# du SRR3021288_2.fastq
0	SRR3021288_2.fastq

[root@node001 dbgap_UK_OTTO]# du -sh --apparent-size SRR3021288_2.fastq
15G	SRR3021288_2.fastq

See file status by path

File statuses can be viewed with ltfsee info files -f filepath. The location (or locations if multiple copies were made) is given as the tape ID and library name under each file entry.

# ltfsee info files -f /tiered/ewang_scratch/xuexu/dbgap_UK_OTTO/SRR302*
Name: /tiered/ewang_scratch/xuexu/dbgap_UK_OTTO/SRR3021220_2.fastq
Tape id:-          Status: resident
Name: /tiered/ewang_scratch/xuexu/dbgap_UK_OTTO/SRR3021242_1.fastq
Tape id:T00042L8@T_ARCH:T00102L8@T_ARCH Status: migrated
Name: /tiered/ewang_scratch/xuexu/dbgap_UK_OTTO/SRR3021288_1.fastq
Tape id:T00055L8@T_ARCH:T00107L8@T_ARCH Status: migrated
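
The two-line-per-file output is awkward to grep, so it can be collapsed into one line per file. This is a sketch under the assumption that every file produces a "Name:" line followed by a "Tape id: ... Status: ..." line, as in the output above; the function name file_states is hypothetical.

```shell
# file_states: read `ltfsee info files` output on stdin and print one
# tab-separated "<status><TAB><path>" line per file.
file_states() {
    awk '
        /^Name: /   { name = substr($0, 7) }                      # text after "Name: "
        /^Tape id:/ { sub(/^.*Status: */, ""); print $0 "\t" name }
    '
}

# Possible usage: show only files that are still resident
#   ltfsee info files -f /tiered/some/dir/* | file_states | grep '^resident'
```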

File status by tape ID

There are two ways to get a list of files that are migrated to a particular tape:

Use the list_by_tape.policy policy file. This requires a full filesystem scan using the policy which may take a long time. The file can be obtained from IBM and the usage is documented in the header of the policy file.

# mmapplypolicy /dev/gpfs1 -P list_by_tape.policy -I defer -f /tmp/output_t00100l8 -M tape=T00100L8
## Generates /tmp/output_t00100l8.list.mig and /tmp/output_t00100l8.list.premig
## No output files means no files were found on this tape.

Read the tape's metadata from the volume_cache directory. The .schema file contains a list of all migrated and premigrated files for a particular tape.

# cat /tiered/.ltfsee/meta/library-id/volume_cache/T00100*schema|grep gpfs -A 1 | grep tiered | sed "s/<\/*value>//g" | while read i ; do ltfsee info files -f "$i" ; done
Name: /tiered/file/path.txt
Tape id:-          Status: resident
...

Movement Policies

File movements are defined and applied by mmapplypolicy. A scheduled job invoking mmapplypolicy daily can be used to tier files on a regular basis.

The primary policy is located on the LTFS server at /mmpolicies/mmpolicyLATEST.txt and is applied by a daily cron job that invokes:

# mmapplypolicy /dev/gpfs1 -P /mmpolicies/mmpolicyLATEST.txt >/dev/null 2>&1
# mmapplypolicy /dev/gpfs0 -P /mmpolicies/mmpolicyLATEST.txt >/dev/null 2>&1

The policy file contains the rules that govern which files are moved off to tape. LTFS/GPFS-related files (.ltfsee), Space Manager files (.SpaceMan), and specific logs such as dsmerror.log should not be migrated.

To avoid tiering small amounts of data, which could cause excessive wear from loading and unloading tapes, tiering can be configured to start only when filesystem usage exceeds 90%; it then attempts to bring usage down to 80%, as defined by the THRESHOLD values. Small files can also be ignored with the FILE_SIZE condition.

define(user_exclude_list,(PATH_NAME LIKE '/ibm/gpfs/.ltfsee/%' OR PATH_NAME LIKE '/ibm/gpfs/.SpaceMan/%' OR NAME LIKE 'dsmerror.log'))
define(user_include_list,(PATH_NAME LIKE '/tiered/%'))
define(is_premigrated,(MISC_ATTRIBUTES LIKE '%M%' AND MISC_ATTRIBUTES NOT LIKE '%V%'))
define(is_migrated,(MISC_ATTRIBUTES LIKE '%V%'))
define(is_resident,(NOT MISC_ATTRIBUTES LIKE '%M%'))

RULE 'DATA_POOL_PLACEMENT_RULE' SET POOL 'data'
RULE EXTERNAL POOL 'LTFSEE_FILES'
EXEC '/opt/ibm/ltfsee/bin/ltfsee'
OPTS '-p TIERED1 TIERED2'

RULE 'LTFSEE_FILES_RULE' MIGRATE FROM POOL 'data'
THRESHOLD(90,80)
TO POOL 'LTFSEE_FILES'
WHERE FILE_SIZE > 1048576 
AND (CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '525600' MINUTES )
AND (CURRENT_TIMESTAMP - MODIFICATION_TIME > INTERVAL '525600' MINUTES )
AND (is_resident OR is_premigrated)
AND NOT user_exclude_list
AND user_include_list
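
The numeric literals in the rule are easier to review in familiar units. The helper functions below are hypothetical and only perform the unit conversion with integer arithmetic.

```shell
# Convert the policy's literal values into more readable units.
# Integer division is used, so results are rounded down.
minutes_to_days() { echo $(( $1 / 60 / 24 )); }
bytes_to_mib()    { echo $(( $1 / 1024 / 1024 )); }

# Possible usage:
#   minutes_to_days 525600   # the INTERVAL used for ACCESS_TIME and MODIFICATION_TIME
#   bytes_to_mib 1048576     # the FILE_SIZE lower bound
```

With the values in this policy, 525600 minutes works out to 365 days and 1048576 bytes to 1 MiB, so only files larger than 1 MiB that have not been accessed or modified for a year are candidates.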

Tivoli Storage Manager HSM

The underlying system that handles the data migration and recall is the Tivoli Storage Manager (TSM) Hierarchical Storage Manager (HSM). When a client attempts to read a migrated file, the HSM recall daemon (dsmrecalld) automatically attempts to recall the file from the source (such as LTFSEE) and then serves the file request transparently.

HSM has a few utilities specific for data migration.

Command	Description
dsmls	Shows information of migrated files.
dsmrecall	Recalls a file.
dsmmigrate -filelist=x.txt	Migrates files listed in x.txt.

For example, dsmls on a migrated file:

root@ltfs# dsmls /tiered/ewang/data/ukbiobank/EGAD00010001497/ukb_int_chr1_v2.bin.gz.cip
IBM Spectrum Protect
Command Line Space Management Client Interface
  Client Version 8, Release 1, Level 4.1
  Client date/time: 04/07/2020 11:47:09
(c) Copyright by IBM Corporation and other(s) 1990, 2018. All Rights Reserved.

        ActS         ResS         ResB   FSt   Prvd       FName
213526643339            0            0   m      ltfs      ukb_int_chr1_v2.bin.gz.cip

The file status of 'm' denotes that the file has been migrated to LTFS.

Tasks

Shutdown and Startup

If the tape library needs to go offline, turn off LTFS to ensure drives are not accessed.

# ltfsee stop

When completed, start up LTFS:

# ltfsee start

Remove or Reclaim Tape

If a tape needs to be removed from the pool or tape library, you need to first reclaim its data so that migrated or saved files on the tape are retained. The reclaim command will automatically remove the tape from the pool if successful.

To start a reclamation for a specific tape, use ltfsee reclaim -p pool -t tapeid:

# ltfsee reclaim -p TIERED1 -t T00060L8
Start reclaiming 1 tapes in the following list of tapes:
T00060L8 .
Files in tape T00060L8 are copied to tape T00000L8.
GLESL373I(00890): Moving tape T00060L8.
Tape T00060L8 is unmounted because it is inserted into the drive.
Tape T00060L8 successfully reclaimed, formatted, and removed from storage pool TIERED1.
Reclamation complete. 1 tapes reclaimed, 1 tapes removed from the storage pool.

Once completed, the tape can be moved to an IO slot (ltfsee tape move ieslot -t T00060L8), or re-added to a pool after reformatting.

If you don't reclaim the contents first or if for some reason LTFS thinks there is migrated data, you will get these error messages:

# ltfsee pool remove -p TIERED1 -t T00037L8
GLESL043I(01134): Removing tape T00037L8 from storage pool TIERED1.
GLESL357E(01223): Tape T00037L8 has migrated files or saved files. It has not been removed from the pool.

# ltfsee tape move ieslot -t T00037L8 -p TIERED1
GLESL170E(00472): Failed to move tape T00037L8 because tape is assigned to a pool and not offline.

If you are absolutely sure that all the data has been moved off the tape, you can force remove the tape and reformat it.

Cronjobs for automatic reconciliation and reclamation

The following cron jobs are installed:

# Cron Job Used to Move Data Via policy to LTFS Tape at 1:00 AM
00 1 * * * /mmpolicies/movetogpfs1.sh 
# Cron Job Used to Move Data Via policy to LTFS Tape at 3:00 AM
00 3 * * * /mmpolicies/movetogpfs0.sh 
# Cron Job Used to Reconcile disk data with LTFS Tape data Pool TIERED1
00 20 * * * /opt/ibm/ltfsee/bin/ltfsee reconcile -p TIERED1 -l T_ARCH -g /tiered
# Cron Job Used to Reconcile disk data with LTFS Tape data Pool TIERED2
00 22 * * * /opt/ibm/ltfsee/bin/ltfsee reconcile -p TIERED2 -l T_ARCH -g /tiered
# Cron Job Used to Reclaim LTFS Tape data in Pool TIERED1
00 5 * * sun /opt/ibm/ltfsee/bin/ltfsee reclaim -p TIERED1 -l T_ARCH
# Cron Job Used to Reclaim LTFS Tape Data in Pool TIERED2
00 7 * * sun /opt/ibm/ltfsee/bin/ltfsee reclaim -p TIERED2 -l T_ARCH

Recovering files from second replica

If a tape was forcefully removed and it no longer has a valid LTFS format, any files that are stored on that tape should be recalled from the second copy so that they can be re-copied elsewhere.

IBM provides a script called relocate_replica.sh which recalls data from a replica. For tapes that are in an "Invalid LTFS" state, you will need to modify the script because it isn't able to handle the space in the status. I needed to hard code the variables POOL_TO_REMOVE and LIB_TO_REMOVE based on the tape that I was trying to recover data from. To use this script:

## Usage:
# sh relocate_replica.sh -t TAPE_TO_RECOVER -p COPY1:COPY2 -P GPFS_DEV

## For example:
# sh relocate_replica.sh -t T00050L8 -p TIERED1@T_ARCH:TIERED2@T_ARCH -P /dev/gpfs1

Once the files have been recalled, they are in a premigrated state. They can be changed to resident by running dsmrecall -resident filename or by using the ltfsee repair command.

Alternatively, if you have a list of files you want to recall (such as from reading the volume schema file at /tiered/.ltfsee/meta/$LIBRARY/volume_cache/*schema), you can create a text file containing a list of filenames prefixed with -- and then run ltfsee recall filelist.txt to recall these specific files. For example:

# cat <<EOF > /tmp/files
 -- /tiered/xyz/file1.txt
 -- /tiered/xyz/file2.txt
EOF
# ltfsee recall /tmp/files
GLESL268I(00156): 2 file name(s) have been provided to recall.
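
When the list of paths already exists (one path per line), the recall list can be generated instead of typed by hand. A minimal sketch, mirroring the " -- " prefix used in the example above; the function name make_recall_list is hypothetical.

```shell
# make_recall_list: read one file path per line on stdin and write the
# corresponding recall-list lines (each path prefixed with " -- ") to stdout.
# Empty lines are dropped.
make_recall_list() {
    sed -n '/./s|^| -- |p'
}

# Possible usage:
#   make_recall_list < /tmp/paths.txt > /tmp/files
#   ltfsee recall /tmp/files
```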

More information is available in Chapter 10 of the IBM Spectrum Archive Enterprise Edition Redbook.

Updating lin_tape

Download the lin_taped and lin_tape packages from IBM Fix Central. The lin_tape package is an SRPM and needs to be built into an RPM on your system.

## Build the srpm
# rpmbuild --rebuild lin_tape-3.0.52-1.src.rpm

## Stop LTFSEE and hsm. Unload the lin_tape module before upgrading.
# ltfsee stop
# systemctl stop hsm.service

## Update lin_tape
# yum update lin_taped-3.0.52-rhel7.x86_64.rpm /root/rpmbuild/RPMS/x86_64/lin_tape-3.0.52-1.x86_64.rpm

## Restart services
# systemctl start hsm.service
# ltfsee start

Troubleshooting

Logging

Logs worth investigating when things go wrong.

Service	Log Location
Spectrum Archive Logs	/var/log/ltfsee.log
Spectrum Archive Debug Logs	ltfsee_catcsvlog
HSM Logs	/opt/tivoli/tsm/client/hsm/bin/dsmerror.log
Tape Cartridge Volume Cache	/tiered/.ltfsee/meta/library-id/volume_cache
lin_tape driver	/var/log/lin_tape.*
LTFSEE Log Collection	Run: `/opt/ibm/ltfsee/bin/ltfsee_log_collection`

`The reclamation process failed`

Two tape drives have tapes in them but don't appear to be doing anything. Looking at /var/log/ltfsee.log, we can see that the reclamation process failed. The T00037L8 tape was in a drive that stopped functioning, showing a '5' in the single character display with the cartridge ejected.

Addendum: Bad Tape?

After a week, the same error occurred again with the same tape cartridge. It's most likely that the T00037L8 tape cartridge is faulty.

2020-01-20T10:23:05.514522-07:00 ltfs reclaim_target[10652]: GLESA112E(00590): The following command failed with (rc:256:1) : /bin/cp /ltfs/T00037L8/.LTFSEE_DATA/10991691794470275686-15679318773138264748-78339434-18551699-0 /ltfs/T00022L8/.LTFSEE_DATA 2>&1.
2020-01-20T10:23:14.925509-07:00 ltfs reclaim_target[10652]: GLESA112E(00590): The following command failed with (rc:256:1) : /bin/cp /ltfs/T00037L8/.LTFSEE_DATA/10991691794470275686-15679318773138264748-78339434-18551699-0 /ltfs/T00022L8/.LTFSEE_DATA 2>&1.
2020-01-20T10:23:14.925818-07:00 ltfs reclaim_target[10652]: GLESR035E(01182): The copy process from source to destination tape failed for the file 10991691794470275686-15679318773138264748-78339434-18551699-0.
2020-01-20T10:23:14.926141-07:00 ltfs reclaim_target[10652]: GLESR004E(01930): Processing file /tiered/ewang/xuexu/dbgap_tcga_germline/SRR3341182_SRR3341183_varscan.pileup failed: exiting the reclamation driver.
2020-01-20T10:23:14.926413-07:00 ltfs reclaim_target[10652]: GLESR026E(00158): The reclamation process failed (1932).#012           Have a look for previous messages.
2020-01-20T10:23:14.927489-07:00 ltfs mmm[2692]: GLESM221E(02136): Generic job with identifier REC_TGTT00022L8 failed.
2020-01-20T10:23:15.501824-07:00 ltfs mmm[2692]: GLESM223E(02024): Not all generic requests for session 1447695617 have been successful: 1 failed.
2020-01-20T10:23:15.502533-07:00 ltfs mmm[2692]: GLESM221E(02136): Generic job with identifier REC_SRCT00037L8 failed.
2020-01-20T10:23:16.130885-07:00 ltfs mmm[2692]: GLESM223E(02024): Not all generic requests for session 1442649089 have been successful: 1 failed.
2020-01-20T10:23:16.132052-07:00 ltfs ltfsee[14335]: GLESL082E(01590): Reclamation failed while reclaiming tape T00037L8 to target tape T00022L8.
2020-01-26T05:00:02.516999-07:00 ltfs ltfsee[23826]: GLESL668E(00979): Unable to get the state of tape T00030L8. Skip to reclaim. Consult the log files. (rc=1040)
2020-01-26T05:00:02.823888-07:00 ltfs ltfsee[23826]: GLESL682E(01028): Tape with ID: T00030L8 is an invalid state. Source tapes must be in state either "Valid LTFS" or "Warning".
2020-01-26T05:04:36.512781-07:00 ltfs reclaim_target[31377]: GLESR030E(00178): The reclamation process failed. (1655)#012           Have a look for previous messages.
2020-01-26T05:04:36.513142-07:00 ltfs mmm[2692]: GLESM221E(02136): Generic job with identifier REC_TGTT00029L8 failed.
2020-01-26T05:04:36.620168-07:00 ltfs mmm[2692]: GLESM223E(02024): Not all generic requests for session 2135234305 have been successful: 1 failed.
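
In a long log, counting how often each error message ID occurs gives a quick overview before reading individual lines. The sketch below assumes that message IDs follow the pattern seen in the excerpt above (GLES, one letter, three digits, and a trailing E for errors); the function name error_counts is hypothetical.

```shell
# error_counts: read ltfsee.log lines on stdin and print "<count> <message id>"
# for each error-level message ID, most frequent first.
# Only the first ID on each line is counted, which is the ID of the message itself.
error_counts() {
    awk '
        match($0, /GLES[A-Z][0-9][0-9][0-9]E/) {
            count[substr($0, RSTART, RLENGTH)]++
        }
        END { for (id in count) print count[id], id }
    ' | sort -k1,1nr -k2,2
}

# Possible usage:
#   error_counts < /var/log/ltfsee.log
```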

The two drives that are still in use have been idle for the past week. Interestingly, the ltfsee log did not show any information for the T00031L8 and T00054L8 tapes.

# ltfsee info jobs
Job Type         Status       Idle(sec)  Scan ID     Tape      Pool     Library  Node  File Name or inode
Reclaim(Source)  In-progress     576566  2944737025  T00031L8  TIERED1  T_ARCH      4  -                 
Reclaim(Target)  In-progress     576438  2944737281  T00054L8  TIERED1  T_ARCH      4  -                 
Validate         Unscheduled       7224  1602429441  T00054L8  TIERED1  T_ARCH      -  -

The job cannot be stopped. Force an LTFS stop using ltfsee stop -f and wait. The tape drives should eventually eject the tapes. I got the following error when stopping, but the tapes were ejected and everything seemed to have stopped.

# ltfsee stop -f
Library name: T_ARCH, library serial: 0000078BA6130402, control node (ltfsee_md) IP address: 172.26.3.249.
Running stop command - sending request and waiting for the completion.
GLESL030E(00909): Unable to connect to the MMM service. Check whether the IBM Spectrum Archive EE has been started.
GLESL358E(00494): Error on processing tape T00054L8 (1).
GLESL661E(00104): IPC got failure result (result=1).
GLESL646E(00164): Unable to stop the IBM Spectrum Archive EE monitor daemon for library T_ARCH.

Cannot Start LTFS

# /opt/ibm/ltfsee/bin/ltfsee start
Library name: T_ARCH, library serial: 0000078BA6130402, control node (ltfsee_md) IP address: 172.26.3.249.
Running start command - sending request : T_ARCH.
Running start command - waiting for completion : T_ARCH.
...
GLESL657E(00191): Fail to start the IBM Spectrum Archive EE service (MMM) for library T_ARCH.
                  Use the 'ltfsee info nodes' command to see the error modules.
                  The monitor daemon will start the recovery sequence.
# ltfsee info nodes

Spectrum Archive EE service (MMM) for library T_ARCH fails to start or is not running on ltfs-ib.gpfs.net Node ID:4

Problem Detected:
Node ID  Error Modules
      4  MMM;

Looking at /var/log/ltfsee.log, we see:

2020-02-03T13:15:35.266926-07:00 ltfs mmm[31142]: GLESM709E(00369): Assign tape (T00099L8) command error: 172.26.3.249:7600 (4): Request Error (070E): [Cartridge.cc:61]: Cartridge add is failed: 7c05 LTFSI1079E The operation is not allowed.

It seems to have started up by itself:

# /opt/ibm/ltfsee/bin/ltfsee start
Library name: T_ARCH, library serial: 0000078BA6130402, control node (ltfsee_md) IP address: 172.26.3.249.
GLESL519I(00344): The IBM Spectrum Archive EE service (ltfsee_md) for library T_ARCH is already running.

# ltfsee info nodes
Node ID  Status     Node IP       Drives  Ctrl Node    Library  Node Group  Host Name       
4        Available  172.26.3.249       3  yes(active)  T_ARCH   G0          ltfs-ib.gpfs.net

Not sure what happened there...

Invalid tapes due to bad tape drive

# ltfsee info tapes
Tape ID   Status       Type  Capacity(GiB)  Used(GiB)  Free(GiB)  Reclaimable(GiB)  Pool     Library  Address  Drive       Appendable
T00030L8  Invalid      L8            10907      10895          0                22  TIERED1  T_ARCH   1131     -           no            
T00099L8  Invalid      L8            10907      10883          0                 0  TIERED2  T_ARCH   1031     -           no

According to IBM's documentation:

The Invalid status indicates that the cartridge is inconsistent with the LTFS format.
To check and repair this tape before you add it to a tape storage pool, use the ltfsee pool add command with the check option.

—IBM https://www.ibm.com/support/knowledgecenter/ST9MBR_1.2.6/ltfs_ee_ltfsee_info_tapes.html

The root cause of these invalid tapes was a bad tape drive. The bad tape drive damaged the tape file markers, which caused read problems and, in some cases, caused the tape to be marked as Invalid LTFS. Re-adding these invalid tapes using the deep recovery option was not possible.

The fix that IBM proposed was to recall all files using ltfsee reclaim on the bad tape and, if necessary, to recall data manually using ITDT and then replace the stub file with the recovered file to make the file resident again.

Bad tape drives

One of the tape drives is in 'Error' state.

# ltfsee info drives
Drive S/N   Status  Type  Role  Library  Address  Node ID  Tape      Node Group
0007807A4B  In use  LTO8   mrg  T_ARCH   260      4        T00000L8  G0        
0007807A0B  Error   LTO8   mrg  T_ARCH   261      4        -         G0        
000780765B  In use  LTO8   mrg  T_ARCH   262      4        T00037L8  G0

The drive shows up as online to the tape library. No visible errors displayed on the tape drive itself.

I couldn't remove and re-add it:

# ltfsee drive remove -d 0007807A0B
GLESL132E(00247): Could not remove a drive 0007807A0B. Drive is not in mount or not mounted state. The tape drive status:2.

Stopping and starting LTFSEE seemed to have cleared this error.

# ltfsee stop
Library name: T_ARCH, library serial: 0000078BA6130402, control node (ltfsee_md) IP address: 172.26.3.249.
Running stop command - sending request and waiting for the completion.
...
Stopped the IBM Spectrum Archive EE services for library T_ARCH.
# ltfsee start
Library name: T_ARCH, library serial: 0000078BA6130402, control node (ltfsee_md) IP address: 172.26.3.249.
Running start command - sending request : T_ARCH.
Running start command - waiting for completion : T_ARCH.
.....................................
Started the IBM Spectrum Archive EE services for library T_ARCH with good status.
# ltfsee info drives
Drive S/N   Status       Type  Role  Library  Address  Node ID  Tape  Node Group
0007807A4B  Not mounted  LTO8   mrg  T_ARCH   260      4        -     G0        
0007807A0B  Not mounted  LTO8   mrg  T_ARCH   261      4        -     G0        
000780765B  Not mounted  LTO8   mrg  T_ARCH   262      4        -     G0

Failing Tape Drive

The TS4500 tape library reported a tape drive issue with the error "Drive internal power-on self-tests failed." code 001F. The LTFS operations that were running on this tape drive at the time stopped, and the tape cartridge itself was marked as 'Critical'.

/var/log/ltfsee.log showed:

2020-03-24T19:43:39.873263-06:00 ltfs ltfseecp[21628]: GLESG081E(00468): Migrating data of GPFS file /tiered/kkurek/sbarclay/speedseq_align/73.L005.realign.bam: write failed to tape T00086L8 and file /ltfs/T00086L8/.LTFSEE_DATA/10991691794470275686-15679318773138264748-228651564-67167434-0 (data length: 524288, rc: -1, errno: 5).
2020-03-24T19:43:39.873579-06:00 ltfs ltfseecp[21628]: GLESG506E(00803): Migration file (/tiered/kkurek/sbarclay/speedseq_align/73.L005.realign.bam) to tape T00086L8 failed (1091).
2020-03-24T19:43:39.873870-06:00 ltfs ltfseecp[21628]: GLESC003E(01158): Redundant copy for file /tiered/kkurek/sbarclay/speedseq_align/73.L005.realign.bam to tape T00086L8 failed.
2020-03-24T19:43:39.979980-06:00 ltfs mmm[28988]: GLESM110W(00210): Tape T00086L8 got critical.
2020-03-24T19:44:09.243830-06:00 ltfs mmm[28988]: GLESM709E(00442): Unmount T00086L8 command error: 172.26.3.249:7600 (4): Request Error (077E): [Cartridge.cc:146]: Cartridge unmount is failed: 578a LTFSI1086E This operation is not allowed on a cartridge with a critical error.
2020-03-24T19:44:09.244112-06:00 ltfs mmm[28988]: GLESM118E(00060): Unmount of tape T00086L8 failed (drive 000780765B). Check the state of tapes and drives.
2020-03-24T19:44:39.166261-06:00 ltfs mmm[28988]: GLESM709E(00442): Unmount T00086L8 command error: 172.26.3.249:7600 (4): Request Error (077E): [Cartridge.cc:146]: Cartridge unmount is failed: 5b10 LTFSI1086E This operation is not allowed on a cartridge with a critical error.
2020-03-24T19:44:39.166568-06:00 ltfs mmm[28988]: GLESM118E(00060): Unmount of tape T00086L8 failed (drive 000780765B). Check the state of tapes and drives.
2020-03-24T19:45:09.283571-06:00 ltfs mmm[28988]: GLESM709E(00442): Unmount T00086L8 command error: 172.26.3.249:7600 (4): Request Error (077E): [Cartridge.cc:146]: Cartridge unmount is failed: 5d71 LTFSI1086E This operation is not allowed on a cartridge with a critical error.
2020-03-24T19:45:09.283890-06:00 ltfs mmm[28988]: GLESM118E(00060): Unmount of tape T00086L8 failed (drive 000780765B). Check the state of tapes and drives.
...
2020-03-25T01:19:14.689053-06:00 ltfs ltfsee[14508]: GLESL062E(00577): Tape with ID: T00086L8 is in an invalid state. Tapes must be either in state "Valid LTFS" or "Unknown".
2020-03-25T01:19:14.695195-06:00 ltfs ltfsee[14501]: GLESL062E(00577): Tape with ID: T00086L8 is in an invalid state. Tapes must be either in state "Valid LTFS" or "Unknown".
2020-03-25T01:19:14.700692-06:00 ltfs ltfsee[14503]: GLESL062E(00577): Tape with ID: T00086L8 is in an invalid state. Tapes must be either in state "Valid LTFS" or "Unknown".
2020-03-25T01:19:14.706384-06:00 ltfs ltfsee[14504]: GLESL062E(00577): Tape with ID: T00086L8 is in an invalid state. Tapes must be either in state "Valid LTFS" or "Unknown".
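
When a log excerpt like the one above runs to hundreds of repeated lines, a tally per message ID makes the sequence easier to read. The following is a minimal sketch, not an IBM-provided tool: the function name gles_summary is made up for illustration, and it assumes message IDs follow the GLES + letter + three digits + severity (E, W or I) pattern seen in this log.

```shell
# Count occurrences of each GLES message ID, most frequent first.
# Reads log text on stdin. The function name is illustrative only.
gles_summary() {
    grep -oE 'GLES[A-Z][0-9]{3}[EWI]' |   # keep only the message IDs
        sort | uniq -c | sort -rn |       # tally and order by count
        awk '{ print $1, $2 }'            # normalise to "COUNT ID"
}

# Usage on the Spectrum Archive node:
#   gles_summary < /var/log/ltfsee.log
```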

# ltfsee info tapes 
Tape ID   Status       Type  Capacity(GiB)  Used(GiB)  Free(GiB)  Reclaimable(GiB)  Pool     Library  Address  Drive       Appendable                
T00086L8  Critical     L8            10907          0          0                 0  TIERED2  T_ARCH   262      000780765B  no                       
...
T00084L8  Critical     L8            10907      10859          0                 0  TIERED2  T_ARCH   262      000780765B  no

T00084L8 went Critical the next day, in the same drive.
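
To pick the Critical tapes out of a long listing, the Status column can be filtered with awk. This is a small sketch that assumes the column layout shown above (tape ID in the first column, status in the second); critical_tapes is a made-up helper name, not an ltfsee subcommand.

```shell
# Print the ID of every tape whose Status column reads "Critical".
# Reads `ltfsee info tapes` output on stdin; the header line is skipped
# naturally because its second field is "ID", not "Critical".
critical_tapes() {
    awk '$2 == "Critical" { print $1 }'
}

# Usage on the Spectrum Archive node:
#   ltfsee info tapes | critical_tapes
```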

You must recover all files from the tape prior to moving it:

# ltfsee tape move homeslot -t T00084L8 -p TIERED2
GLESL373I(00890): Moving tape T00084L8.
GLESL630E(00463): Cannot move tape T00084L8 because the status is "Critical".

The recovery process from IBM's documentation can be read at https://www.ibm.com/support/knowledgecenter/en/ST9MBR_1.2.4/ltfs_ee_recovering_critical_tapes.html

In summary, the recovery process is to:

Recover the files from tape back to Spectrum Scale.

## Show all files to recover
# ltfsee recover -s -p TIERED2 -l T_ARCH -t T00084L8
## Recover these files
# ltfsee recover -c -p TIERED2 -l T_ARCH -t T00084L8

Remove the tape from the LTFS library.

# ltfsee recover -r -p TIERED2 -l T_ARCH -t T00084L8

Move the tape to its home slot (use ieslot instead of homeslot to send it to an import/export slot).

# ltfsee tape move homeslot -t T00084L8 -p TIERED2
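
The three ltfsee recover steps above can be wrapped in one helper so the pool, library and tape ID are typed only once. This is a sketch under stated assumptions: the function name recover_critical_tape and the RUN variable are hypothetical, and only the recover flags already shown above are used. By default it just prints the commands; set RUN=1 to execute them, stopping at the first failure.

```shell
# Print (or, with RUN=1, execute) the recover steps for one tape.
# Arguments: POOL LIBRARY TAPEID
# The body runs in a subshell, so its variables do not leak to the caller.
recover_critical_tape() (
    pool=$1 library=$2 tape=$3
    # -s lists the files to recover, -c recovers them, -r removes the tape
    for step in -s -c -r; do
        set -- ltfsee recover "$step" -p "$pool" -l "$library" -t "$tape"
        if [ "${RUN:-0}" = 1 ]; then
            "$@" || exit 1      # stop if a step fails
        else
            echo "$*"           # dry run: show the command only
        fi
    done
)

# Dry run:
#   recover_critical_tape TIERED2 T_ARCH T00084L8
# Execute for real:
#   RUN=1 recover_critical_tape TIERED2 T_ARCH T00084L8
```

Printing first is deliberate: the -r step removes the tape from the pool, so it is worth reviewing the exact commands before running them.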

If files are stuck in a premigrated state, the removal step of the recover process fails:

# ltfsee recover -r -p TIERED1 -l T_ARCH -t T00037L8
Scanning GPFS file systems to find migrated/saved objects in tape T00037L8.
Tape T00037L8 has 14 files to be recovered. The list is saved to /tmp/ltfs.29601.tiered.recoverlist.
GLESL613E(00822): Cannot remove tape T00037L8 because there are files to be recovered.

You can try to use the ltfsee repair command to change the status of the files from premigrated to resident.

# ltfsee repair "`cat /tmp/a`"
ANS9294I No files matching '/tiered/morph/....czi' were found.
GLESL257I(00083): Non-empty regular file /tiered/morph/....czi was in premigrated state.  The file is repaired to resident state.

However, despite the message, the file status does not actually change:

# ltfsee info files -f "`cat /tmp/a`"
Name: /tiered/morph/....czi
Tape id:T00037L8@T_ARCH:T00069L8@T_ARCH Status: premigrated
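
When many files are involved, the same check can be scripted. The sketch below assumes the two-line layout shown above (a "Name: " line followed by a line containing "Status: ..."); premigrated_files is a made-up helper name, and the recover list path in the usage comment is only an example taken from the earlier output.

```shell
# Print the path of every file still reported as premigrated.
# Reads the output of one or more `ltfsee info files -f ...` calls on stdin.
premigrated_files() {
    awk '
        /^Name: /             { name = substr($0, 7) }  # text after "Name: "
        /Status: premigrated/ { print name }
    '
}

# Usage, assuming a file with one path per line:
#   while IFS= read -r f; do ltfsee info files -f "$f"; done \
#       < /tmp/ltfs.29601.tiered.recoverlist | premigrated_files
```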

Since the files causing the issue are also stored on another tape and, being premigrated, still have their data on the GPFS storage, it should be safe to force remove the tape cartridge.

# ltfsee pool remove -p TIERED1 -t T00037L8 -r
GLESL043I(01134): Removing tape T00037L8 from storage pool TIERED1.
GLESL361W(01199): Tape T00037L8 was removed from pool TIERED1 forcefully. Cannot recall files on tape T00037L8. Add the tape to the same pool again if you need to recall files from the tape without formatting.

The tape was removed from the pool but was left in the tape drive. Because the tape is no longer managed by LTFSEE, you have to move it to its home slot using the tape library interface. The bad drive was then removed:

[root@ltfs tiered]# ltfsee drive remove -d 000780765B -n 4 -l T_ARCH
GLESL121I(00279): Drive serial 000780765B is removed from the tape drive list.

The 'bad' tape, which was flagged while in the failing drive, can then be re-added to the pool.

Cannot Add Tape

The tape library is unable to read 7 tape cartridges. When I attempt to add one of these tapes to the LTFS pool, I get the following errors:

# ltfsee pool add -p TIERED2 -t T00102L8 -c
GLESL042I(00894): Adding tape T00102L8 to storage pool TIERED2.
GLESL091E(00937): This operation is not allowed to this state of tape. Need to check the status of Tape T00102L8 by using the ltfsee info tapes command.

# ltfsee pool add -p TIERED2 -t T00102L8 -d
GLESL042I(00894): Adding tape T00102L8 to storage pool TIERED2.
GLESL091E(00937): This operation is not allowed to this state of tape. Need to check the status of Tape T00102L8 by using the ltfsee info tapes command.

dmesg and /var/log/ltfsee.log show:

# dmesg | tail
[  459.877936] lin_tape: IBMChgr0-----30402 changer_check_result sensekey: 5 asc: 24 ascq: 0
[ 1275.717464] lin_tape_set_active_partition: LOCATE_16 failed: -5
[ 3866.466645] lin_tape_set_active_partition: LOCATE_16 failed: -5
[ 4145.200700] lin_tape_set_active_partition: LOCATE_16 failed: -5

# tail /var/log/ltfsee.log 
2020-04-09T11:05:23.662537-06:00 ltfs mmm[31580]: GLESM709E(00369): Assign tape (T00102L8) command error: 172.26.3.249:7600 (4): Request Error (070E): [Cartridge.cc:61]: Cartridge add is failed: 165b LTFSI1079E The operation is not allowed.
2020-04-09T11:50:22.403860-06:00 ltfs mmm[31580]: GLESM709E(00559): Recovery T00102L8 command error: 172.26.3.249:7600 (4): Request Error (082E): [Cartridge.cc:242]: Cartridge recovery is failed: 167e LTFSD0301E Read or write permanent error.
2020-04-09T11:52:43.831125-06:00 ltfs mmm[31580]: GLESM555E(00341): Failed to format tape (T00102L8).
2020-04-09T11:52:44.348738-06:00 ltfs mmm[31580]: GLESM223E(02024): Not all generic requests for session 1439891713 have been successful: 1 failed.
2020-04-09T11:52:44.349214-06:00 ltfs ltfsee[5720]: GLESL091E(00937): This operation is not allowed to this state of tape. Need to check the status of Tape T00102L8 by using the ltfsee info tapes command.
2020-04-09T11:53:13.302341-06:00 ltfs mmm[31580]: GLESM709E(00369): Assign tape (T00102L8) command error: 172.26.3.249:7600 (4): Request Error (070E): [Cartridge.cc:61]: Cartridge add is failed: 7e7d LTFSI1084E This operation is not allowed on an invalid cartridge.

The tape library also shows the following entries in the event log.

Description: Cartridge T00102L8 had a read, write, or positioning error.
Error Code: 0003

Description: Cartridge T00102L8 had a read, write, or positioning error because of faulty media.
Error Code: 0004

It looks like the tape has simply gone bad, along with 6 others in a short period of time.
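
To see which cartridges the errors cluster on, the tape IDs in error-level ltfsee.log messages can be tallied. This is a rough sketch with two assumptions: tape IDs are eight characters ending in L plus a digit (as with the L8 labels here), and error messages carry a GLES ID ending in E. The name failing_tapes is illustrative, and grep -w is a GNU/BSD option rather than strict POSIX.

```shell
# Count how often each tape ID appears in error-level (…E) GLES messages.
# Reads log text on stdin and prints "COUNT TAPEID", highest count first.
failing_tapes() {
    grep -E 'GLES[A-Z][0-9]{3}E' |            # error-level messages only
        grep -owE '[A-Z0-9]{6}L[0-9]' |       # whole-word tape IDs, e.g. T00102L8
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# Usage on the Spectrum Archive node:
#   failing_tapes < /var/log/ltfsee.log
```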