IBM Spectrum Archive
IBM Spectrum Archive Enterprise Edition (EE) can be used to tier data from a GPFS storage pool to a Tape Library. Files stored on tape can be accessed like any other files stored on spinning disk via LTFS.
Usage
Use the ltfsee command to interact with Spectrum Archive. Other binaries related to LTFS can be found at /opt/ibm/ltfsee/bin/.
Here is a quick cheat sheet of common commands:
Command | Description |
---|---|
ltfsee status | Shows Spectrum Archive status |
ltfsee info libraries | Shows library information |
ltfsee info drives | Shows drive status |
ltfsee info jobs | Shows all jobs that are waiting to be processed |
ltfsee info pools | Shows all pools configured in Spectrum Archive |
ltfsee info tapes | Shows all tapes that are managed by Spectrum Archive |
ltfsee info files -f /path/to/file | Shows the migration status of the given file path |
ltfsee pool remove -p POOL -t TAPEID | Removes the tape from the pool |
ltfsee pool remove -p POOL -t TAPEID -r | Forcibly removes the tape from the pool |
ltfsee pool add -p POOL -t TAPEID | Adds a tape to the pool |
ltfsee pool add -p POOL -t TAPEID -f | Formats the tape before adding it to the pool |
ltfsee pool add -p POOL -t TAPEID -c | Checks the tape before adding it to the pool |
ltfsee pool add -p POOL -t TAPEID -d | Performs a deep recovery and adds the tape to the pool |
ltfsee reclaim -p POOL -t TAPEID | Reclaims all data from the tape and removes it from the pool |
ltfsee retrieve | Retrieves the inventory from the underlying tape library |
ltfsee drive remove -d DRIVESERIAL | Removes a drive |
ltfsee drive add DRIVESERIAL[:ROLES] NODEID | Adds the drive to LTFS on the given node, with optional roles |
ltfsee tape move homeslot -t TAPEID | Moves the tape to its home slot |
ltfsee tape move ieslot -t TAPEID | Moves the tape to an import/export (IE) slot |
/opt/ibm/ltfsee/bin/ltfsee_log_collection | Collects logs into the current directory for IBM support |
dsmrecall -resident <file name> | Changes files in the premigrated state to resident |
Spectrum Archive Status
ltfsee info libraries
shows all libraries.
# ltfsee info libraries
Library Name Status Model Serial Number Ctrl Node
T_ARCH Active 03584L32 0000078BA6130402 172.26.3.249
ltfsee status shows whether the MD and MMM services are active.
# ltfsee status
Ctrl Node MD MMM Library
172.26.3.249 Active Active T_ARCH
ltfsee info pools
shows all pools.
# ltfsee info pools
Pool Name Total(TiB) Used(TiB) Free(TiB) Reclaimable(TiB) Tapes Type Library Node Group
TIERED1 644.3 84.6 559.7 0.0 61 LTO T_ARCH G0
TIERED2 638.9 80.8 558.1 0.0 61 LTO T_ARCH G0
Tape Drive Management
To show the status of every tape drive and which tapes are mounted:
# ltfsee info drives
Drive S/N Status Type Role Library Address Node ID Tape Node Group
0007807A4B In use LTO8 mrg T_ARCH 260 4 T00031L8 G0
0007807A0B Mounted LTO8 mrg T_ARCH 261 4 - G0
000780765B In use LTO8 mrg T_ARCH 262 4 T00054L8 G0
A detailed list of drive statuses can be found at https://www.ibm.com/support/knowledgecenter/ST9MBR_1.2.3/ltfs_ee_ltfsee_info_drives.html#ltfs_ee_ltfsee_info_drives. Drives that are in use are most likely in use by a running job. You can check the jobs that are in progress and how long they have been idle (the Idle column).
# ltfsee info jobs
Job Type Status Idle(sec) Scan ID Tape Pool Library Node File Name or inode
Reclaim(Source) In-progress 573811 2944737025 T00031L8 TIERED1 T_ARCH 4 -
Reclaim(Target) In-progress 573683 2944737281 T00054L8 TIERED1 T_ARCH 4 -
Validate Unscheduled 4469 1602429441 T00054L8 TIERED1 T_ARCH - -
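To spot jobs like the two reclaim jobs above, which have been idle for days, the Idle column can be filtered with awk. This is a minimal sketch, not an IBM-provided tool: the function name stalled_jobs and the default threshold of one day are assumptions, and it relies on the column layout shown in the listing above (status, then idle seconds, then scan ID, then tape ID).

```shell
# Print "TAPEID idle_days" for every In-progress job whose idle time
# exceeds the threshold in seconds (default: 86400, i.e. one day).
# Reads `ltfsee info jobs` output on stdin.
stalled_jobs() {
    max_sec="${1:-86400}"
    awk -v max="$max_sec" '
    {
        # The first all-digit field is Idle(sec). The status precedes it
        # and the tape ID sits two fields after it (past the scan ID).
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^[0-9]+$/) {
                if ($(i - 1) == "In-progress" && $i + 0 > max + 0)
                    printf "%s %.1f\n", $(i + 2), $i / 86400
                break
            }
        }
    }'
}
```

It would be used as: ltfsee info jobs | stalled_jobs 86400. The header row is skipped naturally because it contains no all-digit field.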
When performing maintenance on a drive, first remove it from Spectrum Archive so that it won't be used by any jobs. Remove a drive with ltfsee drive remove -d DRIVESERIAL. Drives can be added with the ltfsee drive add command. A drive can be assigned any combination of three roles, expressed as flags: Migration (4), Recall (2), and Generic (1). Assign roles by specifying the sum of the desired flags. The default is to allow all roles (7).
## To add a drive to node 4 for recall jobs only:
# ltfsee drive add 0007807A0B:2 4
## To add a drive to node 4 for all 3 (Migrate, Recall, Generic) roles, do not specify a role.
# ltfsee drive add 0007807A0B 4
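The role value is just the sum of the flags described above. As a small illustration, the helper below computes it from role names. The function name drive_role_mask and the role keywords are made up for this sketch and are not part of ltfsee.

```shell
# Combine drive role flags: migrate=4, recall=2, generic=1.
drive_role_mask() {
    mask=0
    for role in "$@"; do
        case "$role" in
            migrate) bit=4 ;;
            recall)  bit=2 ;;
            generic) bit=1 ;;
            *) echo "unknown role: $role" >&2; return 1 ;;
        esac
        mask=$(( mask | bit ))   # OR the flag into the running value
    done
    echo "$mask"
}
```

For example, drive_role_mask migrate recall prints 6, so a drive limited to migration and recall on node 4 would be added as 0007807A0B:6 4.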
Tape Cartridge Management
List Tape Cartridges
To show all tapes in the library, run ltfsee info tapes
. Common tape status codes are:
Status Code | Description |
---|---|
Critical | The Critical status indicates that an attempt to write to this cartridge failed.
To avoid data loss, recover the data on this cartridge by using the |
Error | The Error status indicates that the cartridge has an error and cannot be recovered.
This cartridge cannot be used with the IBM Spectrum Archive Enterprise Edition. |
Invalid | The Invalid status indicates that the cartridge is inconsistent with the LTFS format.
To check and repair this tape before you add it to a tape storage pool, use the ltfsee pool add command with the check option. |
Unavailable | The Unavailable status indicates that the cartridge is not available in the IBM Spectrum Archive Enterprise Edition system.
Tapes that are newly inserted into the tape library have an Unavailable status. To recall files from a tape with this status, first import the tape by using the ltfsee import command. To add a tape with this status to the IBM Spectrum Archive Enterprise Edition system:
First, move the tape to a home slot by using the |
Unknown | A tape has an Unknown status if it was removed and then re-added to Spectrum Archive. Tapes in this state can be made valid again by validating the tape using ltfsee tape validate -p POOL -t TAPEID. |
Valid | The Valid status indicates that the cartridge is valid. |
# ltfsee info tapes
Tape ID Status Type Capacity(GiB) Used(GiB) Free(GiB) Reclaimable(GiB) Pool Library Address Drive Appendable
T00000L8 Valid L8 10907 0 10907 0 TIERED1 T_ARCH 1161 - yes
T00001L8 Valid L8 10907 0 10907 0 TIERED1 T_ARCH 1160 - yes
T00003L8 Valid L8 10907 0 10907 0 TIERED1 T_ARCH 1158 - yes
T00004L8 Valid L8 10907 0 10907 0 TIERED1 T_ARCH 1157 - yes
Adding Tapes to a Pool
To add tapes to a particular pool, use ltfsee pool add -p pool -t tape ...
:
# ltfsee pool add -p TIERED2 -t T00099L8
GLESL042I(00894): Adding tape T00099L8 to storage pool TIERED2.
Added tape T00099L8 to pool TIERED2 successfully.
Tapes can only be added if they contain a valid format. Tapes that require formatting need the -f option. Tapes that have been used previously and have old metadata associated with the cartridge cannot be formatted this way, as a safeguard against unintended data loss. For these, an exception format is required: use the -e option to ignore the metadata during formatting.
# ltfsee pool add -p TIERED1 -t T00030L8 -f
GLESL042I(00894): Adding tape T00030L8 to storage pool TIERED1.
GLESL091E(00937): This operation is not allowed to this state of tape. Need to check the status of Tape T00030L8 by using the ltfsee info tapes command.
# ltfsee pool add -p TIERED1 -t T00030L8 -e
GLESL042I(00894): Adding tape T00030L8 to storage pool TIERED1.
Tape T00030L8 successfully formatted.
Added tape T00030L8 to pool TIERED1 successfully.
When dealing with tapes that are in an Error state, you must first remove the tape from the storage pool using ltfsee pool remove -p ... -t ... -r, unassign it from the logical library in the tape library GUI, and run ltfsee retrieve to remove the tape entirely from LTFS. Then re-add the tape to the logical library and re-run ltfsee retrieve. These steps put the tape into an Unavailable state; only then can the cartridge be added back to a storage pool via re-formatting or checking.
Removing Tape from a pool
To remove a tape containing migrated data from a pool, use ltfsee reclaim
. This command will move all migrated data to other tapes in the storage pool and then remove the tape from the storage pool. Reclaiming a damaged tape may take a very long time. You can always interrupt a reclaim process with a SIGINT (ctrl-c) and wait for the process to terminate.
# ltfsee reclaim -p TIERED2 -t T00101L8
GLESL682E(01020): Tape with ID: T00093L8 is an invalid state. Target tapes must be in state "Valid LTFS".
GLESL682E(01020): Tape with ID: T00100L8 is an invalid state. Target tapes must be in state "Valid LTFS".
Start reclaiming 1 tapes in the following list of tapes:
T00101L8 .
Files in tape T00101L8 are copied to tape T00071L8.
GLESL086I(01596): Reclamation has completed but some of the files remain, and a reconcile is required. At the least tape T00101L8 must be reconciled.
Tapes containing no migrated data can also be removed from a pool with ltfsee pool remove -p pool -t tape ...
.
# ltfsee pool remove -p TIERED1 -t T00030L8
GLESL043I(01134): Removing tape T00030L8 from storage pool TIERED1.
Removed tape T00030L8 from pool TIERED1 successfully.
For tapes that Spectrum Archive believes contain migrated data when in fact they don't (verified by checking every file listed in the tape's volume_cache or by using list_by_tape.policy), you may force remove the tape using the -r option. If you force remove a tape that does contain migrated data, you may still be able to recover the data by performing a deep recovery check when adding the tape back without reformatting.
Force removing a mounted cartridge
If you force remove a tape that's mounted in a drive, the tape will be removed from the pool and unmounted. However, the tape will not be moved to its home slot, since Spectrum Archive is no longer aware of the tape's existence. The tape will sit ejected in the drive and will prevent the tape changer from fulfilling any requests that Spectrum Archive makes to this tape drive until the tape is moved manually via the Tape Library interface.
# ltfsee pool remove -p TIERED1 -t T00037L8
GLESL043I(01134): Removing tape T00037L8 from storage pool TIERED1.
GLESL357E(01223): Tape T00037L8 has migrated files or saved files. It has not been removed from the pool.
# ltfsee pool remove -p TIERED1 -t T00037L8 -r
GLESL043I(01134): Removing tape T00037L8 from storage pool TIERED1.
GLESL361W(01199): Tape T00037L8 was removed from pool TIERED1 forcefully. Cannot recall files on tape T00037L8. Add the tape to the same pool again if you need to recall files from the tape without formatting.
Moving a tape cartridge
If a tape is in a drive, you can move it back to its homeslot using ltfsee tape move homeslot -t tape -p pool
. Alternatively, use ieslot
to move it to the IO slot of the tape library for extraction.
# ltfsee tape move homeslot -t T00037L8 -p TIERED1
GLESL373I(00890): Moving tape T00037L8.
Tape T00037L8 is unmounted because it is inserted into the drive.
Tape T00037L8 is moved successfully.
File statuses, recalls
A file is migrated when its contents have been copied to one or more tapes and its original GPFS location has been turned into a file stub. Files that are stored only on GPFS are resident. Files whose data has been copied to tape but whose contents are still present on GPFS are premigrated.
Side Notes on Migrated Files
Migrated files are replaced by a stub that uses no disk space. To include their full sizes when using du, pass --apparent-size. For example:
[root@node001 xuexu]# du -sh dbgap_UK_OTTO
23T dbgap_UK_OTTO
[root@node001 xuexu]# du -sh --apparent-size dbgap_UK_OTTO
35T dbgap_UK_OTTO
[root@node001 dbgap_UK_OTTO]# du SRR3021288_2.fastq
0 SRR3021288_2.fastq
[root@node001 dbgap_UK_OTTO]# du -sh --apparent-size SRR3021288_2.fastq
15G SRR3021288_2.fastq
See file status by path
File statuses can be viewed with ltfsee info files -f filepath
. The location (or locations if multiple copies were made) is given as the tape ID and library name under each file entry.
# ltfsee info files -f /tiered/ewang_scratch/xuexu/dbgap_UK_OTTO/SRR302*
Name: /tiered/ewang_scratch/xuexu/dbgap_UK_OTTO/SRR3021220_2.fastq
Tape id:- Status: resident
Name: /tiered/ewang_scratch/xuexu/dbgap_UK_OTTO/SRR3021242_1.fastq
Tape id:T00042L8@T_ARCH:T00102L8@T_ARCH Status: migrated
Name: /tiered/ewang_scratch/xuexu/dbgap_UK_OTTO/SRR3021288_1.fastq
Tape id:T00055L8@T_ARCH:T00107L8@T_ARCH Status: migrated
File status by tape ID
There are two ways to get a list of files that are migrated to a particular tape:
- Use the list_by_tape.policy policy file. This requires a full filesystem scan using the policy, which may take a long time. The file can be obtained from IBM and its usage is documented in the header of the policy file.
# mmapplypolicy /dev/gpfs1 -P list_by_tape.policy -I defer -f /tmp/output_t00100l8 -M tape=T00100L8
## Generates /tmp/output_t00100l8.list.mig and /tmp/output_t00100l8.list.premig
## No output files means no files were found on this tape.
- Read the tape's metadata from the volume_cache directory. The .schema file contains a list of all migrated and premigrated files for a particular tape.
# cat /tiered/.ltfsee/meta/library-id/volume_cache/T00100*schema|grep gpfs -A 1 | grep tiered | sed "s/<\/*value>//g" | while read i ; do ltfsee info files -f "$i" ; done
Name: /tiered/file/path.txt
Tape id:- Status: resident
...
Movement Policies
File movements are defined and applied by mmapplypolicy
. A scheduled job invoking mmapplypolicy
daily can be used to tier files on a regular basis.
The primary policy is located on the LTFS server at /mmpolicies/mmpolicyLATEST.txt and is applied with a daily cronjob that invokes:
# mmapplypolicy /dev/gpfs1 -P /mmpolicies/mmpolicyLATEST.txt >/dev/null 2>&1
# mmapplypolicy /dev/gpfs0 -P /mmpolicies/mmpolicyLATEST.txt >/dev/null 2>&1
The policy file contains the rules that govern which files are moved off to tape. LTFS/GPFS related files, specific filesystem logs, Space Manager(?) should not be migrated.
To avoid tiering small amounts of data, which could cause excessive wear from loading and unloading tapes, tiering can also be configured to start only when filesystem usage exceeds 90% and to continue until usage drops to 80%, as defined by the THRESHOLD values. Small files can also be ignored with the FILE_SIZE condition.
define(user_exclude_list,(PATH_NAME LIKE '/ibm/gpfs/.ltfsee/%' OR PATH_NAME LIKE '/ibm/gpfs/.SpaceMan/%' OR NAME LIKE 'dsmerror.log'))
define(user_include_list,(PATH_NAME LIKE '/tiered/%'))
define(is_premigrated,(MISC_ATTRIBUTES LIKE '%M%' AND MISC_ATTRIBUTES NOT LIKE '%V%'))
define(is_migrated,(MISC_ATTRIBUTES LIKE '%V%'))
define(is_resident,(NOT MISC_ATTRIBUTES LIKE '%M%'))
RULE 'DATA_POOL_PLACEMENT_RULE' SET POOL 'data'
RULE EXTERNAL POOL 'LTFSEE_FILES'
EXEC '/opt/ibm/ltfsee/bin/ltfsee'
OPTS '-p TIERED1 TIERED2'
RULE 'LTFSEE_FILES_RULE' MIGRATE FROM POOL 'data'
THRESHOLD(90,80)
TO POOL 'LTFSEE_FILES'
WHERE FILE_SIZE > 1048576
AND (CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '525600' MINUTES )
AND (CURRENT_TIMESTAMP - MODIFICATION_TIME > INTERVAL '525600' MINUTES )
AND (is_resident OR is_premigrated)
AND NOT user_exclude_list
AND user_include_list
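The numeric constants in the rule are easier to adjust when derived rather than typed by hand: the INTERVAL of 525600 minutes is 365 days, and the FILE_SIZE cutoff of 1048576 bytes is 1 MiB. A small sketch of the arithmetic follows. The helper names are illustrative only.

```shell
# Convert days to the minute count used in INTERVAL '<n>' MINUTES.
days_to_minutes() {
    echo $(( $1 * 24 * 60 ))
}

# Convert MiB to the byte count used with FILE_SIZE.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}
```

For instance, days_to_minutes 365 prints 525600 and mib_to_bytes 1 prints 1048576, matching the values in the rule above.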
Tivoli Storage Manager HSM
The underlying system that handles the data migration and recall is the Tivoli Storage Manager (TSM) Hierarchical Storage Manager (HSM). When a client attempts to read a file that's migrated, the dsmrecalld daemon will automatically attempt to recall the file from the source (such as LTFSEE) and then serve the file request transparently.
HSM has a few utilities specific to data migration.
Command | Description |
---|---|
dsmls | Shows information of migrated files.
root@ltfs# dsmls /tiered/ewang/data/ukbiobank/EGAD00010001497/ukb_int_chr1_v2.bin.gz.cip
IBM Spectrum Protect
Command Line Space Management Client Interface
Client Version 8, Release 1, Level 4.1
Client date/time: 04/07/2020 11:47:09
(c) Copyright by IBM Corporation and other(s) 1990, 2018. All Rights Reserved.
ActS ResS ResB FSt Prvd FName
213526643339 0 0 m ltfs ukb_int_chr1_v2.bin.gz.cip
The file status of 'm' denotes the file has been migrated to LTFS. |
dsmrecall | Recalls a file |
dsmmigrate -filelist=x.txt | Migrates files listed in x.txt. |
Tasks
Shutdown and Startup
If the tape library needs to go offline, turn off LTFS to ensure drives are not accessed.
# ltfsee stop
When completed, start up LTFS:
# ltfsee start
Remove or Reclaim Tape
If a tape needs to be removed from the pool or tape library, you need to first reclaim its data so that migrated or saved files on the tape are retained. The reclaim command will automatically remove the tape from the pool if successful.
To start a reclamation for a specific tape, use ltfsee reclaim -p pool -t tapeid
:
# ltfsee reclaim -p TIERED1 -t T00060L8
Start reclaiming 1 tapes in the following list of tapes:
T00060L8 .
Files in tape T00060L8 are copied to tape T00000L8.
GLESL373I(00890): Moving tape T00060L8.
Tape T00060L8 is unmounted because it is inserted into the drive.
Tape T00060L8 successfully reclaimed, formatted, and removed from storage pool TIERED1.
Reclamation complete. 1 tapes reclaimed, 1 tapes removed from the storage pool.
Once completed, the tape can be moved to an IO slot (ltfsee tape move ieslot -t T00060L8
), or re-added to a pool after reformatting.
If you don't reclaim the contents first or if for some reason LTFS thinks there is migrated data, you will get these error messages:
# ltfsee pool remove -p TIERED1 -t T00037L8
GLESL043I(01134): Removing tape T00037L8 from storage pool TIERED1.
GLESL357E(01223): Tape T00037L8 has migrated files or saved files. It has not been removed from the pool.
# ltfsee tape move ieslot -t T00037L8 -p TIERED1
GLESL170E(00472): Failed to move tape T00037L8 because tape is assigned to a pool and not offline.
If you are absolutely sure that no migrated data remains on the tape, you can force remove the tape and reformat it.
Cronjobs for automatic reconciliation and reclamation
The following cron jobs are installed:
# Cron Job Used to Move Data Via policy to LTFS Tape at 1:00 AM
00 1 * * * /mmpolicies/movetogpfs1.sh
# Cron Job Used to Move Data Via policy to LTFS Tape at 3:00 AM
00 3 * * * /mmpolicies/movetogpfs0.sh
# Cron Job Used to Reconcile disk data with LTFS Tape data Pool TIERED1
00 20 * * * /opt/ibm/ltfsee/bin/ltfsee reconcile -p TIERED1 -l T_ARCH -g /tiered
# Cron Job Used to Reconcile disk data with LTFS Tape data Pool TIERED2
00 22 * * * /opt/ibm/ltfsee/bin/ltfsee reconcile -p TIERED2 -l T_ARCH -g /tiered
# Cron Job Used to Reclaim LTFS Tape data in Pool TIERED1
00 5 * * sun /opt/ibm/ltfsee/bin/ltfsee reclaim -p TIERED1 -l T_ARCH
# Cron Job Used to Reclaim LTFS Tape Data in Pool TIERED2
00 7 * * sun /opt/ibm/ltfsee/bin/ltfsee reclaim -p TIERED2 -l T_ARCH
Recovering files from second replica
If a tape was forcefully removed and it no longer has a valid LTFS format, any files that are stored on that tape should be recalled from the second copy so that they can be re-copied elsewhere.
IBM provides a script called relocate_replica.sh which recalls data from a replica. For tapes that are in an "Invalid LTFS" state, you will need to modify the script because it isn't able to handle the space in the status. I needed to hard code the variables POOL_TO_REMOVE and LIB_TO_REMOVE based on the tape that I was trying to recover data from. To use this script:
## Usage:
# sh relocate_replica.sh -t TAPE_TO_RECOVER -p COPY1:COPY2 -P GPFS_DEV
## For example:
# sh relocate_replica.sh -t T00050L8 -p TIERED1@T_ARCH:TIERED2@T_ARCH -P /dev/gpfs1
Once the files have been recalled, they are in a premigrated state. This can be changed to resident by running dsmrecall -resident filename or by using the ltfsee repair command.
Alternatively, if you have a list of files you want to recall (such as from reading the volume schema file at /tiered/.ltfsee/meta/$LIBRARY/volume_cache/*schema), you can create a text file containing a list of filenames, each prefixed with --, and then run ltfsee recall filelist.txt to recall these specific files. For example:
# cat <<EOF > /tmp/files
-- /tiered/xyz/file1.txt
-- /tiered/xyz/file2.txt
EOF
# ltfsee recall /tmp/files
GLESL268I(00156): 2 file name(s) have been provided to recall.
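When the list of paths comes from another command rather than being typed by hand, the "-- " prefix can be added with a small helper. This is only a sketch: the function name make_recall_list is an assumption, and the line format is inferred from the example above.

```shell
# Write one "-- /path" line per argument, the format that the
# recall file list example above uses.
make_recall_list() {
    for path in "$@"; do
        printf '%s %s\n' '--' "$path"
    done
}
```

The output can be redirected to a file, for example make_recall_list /tiered/xyz/file1.txt /tiered/xyz/file2.txt > /tmp/files, and then passed to ltfsee recall /tmp/files as shown above.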
More information is available in Chapter 10 of the IBM Spectrum Archive Enterprise Edition Redbook.
Updating lin_tape
Download the lin_taped and lin_tape packages from IBM Fix Central. The lin_tape package is an SRPM and needs to be built into an RPM on your system.
## Build the srpm
# rpmbuild --rebuild lin_tape-3.0.52-1.src.rpm
## Stop LTFSEE and hsm. Unload the lin_tape module before upgrading.
# ltfsee stop
# systemctl stop hsm.service
## Update lin_tape
# yum update lin_taped-3.0.52-rhel7.x86_64.rpm /root/rpmbuild/RPMS/x86_64/lin_tape-3.0.52-1.x86_64.rpm
## Restart services
# systemctl start hsm.service
# ltfsee start
Troubleshooting
Logging
Logs worth investigating when things go wrong.
Service | Log Location |
---|---|
Spectrum Archive Logs | /var/log/ltfsee.log |
Spectrum Archive Debug Logs | ltfsee_catcsvlog |
HSM Logs | /opt/tivoli/tsm/client/hsm/bin/dsmerror.log |
Tape Cartridge Volume Cache | /tiered/.ltfsee/meta/library-id/volume_cache |
lin_tape driver | /var/log/lin_tape.* |
LTFSEE Log Collection | Run: /opt/ibm/ltfsee/bin/ltfsee_log_collection |
See Also: https://www.ibm.com/support/knowledgecenter/en/ST9MBR_1.3.0/ltfs_ee_logging_facilities_config.html
The reclamation process failed
Two tape drives have tapes in them but don't appear to be doing anything. Looking at /var/log/ltfsee.log, we can see that the reclamation process failed. The T00037L8 tape was in a drive that stopped functioning, showing a '5' in the single character display with the cartridge ejected.
Addendum: Bad Tape?
After a week, the same error occurred again with the same tape cartridge. It's most likely that the T00037L8 tape cartridge is faulty.
2020-01-20T10:23:05.514522-07:00 ltfs reclaim_target[10652]: GLESA112E(00590): The following command failed with (rc:256:1) : /bin/cp /ltfs/T00037L8/.LTFSEE_DATA/10991691794470275686-15679318773138264748-78339434-18551699-0 /ltfs/T00022L8/.LTFSEE_DATA 2>&1.
2020-01-20T10:23:14.925509-07:00 ltfs reclaim_target[10652]: GLESA112E(00590): The following command failed with (rc:256:1) : /bin/cp /ltfs/T00037L8/.LTFSEE_DATA/10991691794470275686-15679318773138264748-78339434-18551699-0 /ltfs/T00022L8/.LTFSEE_DATA 2>&1.
2020-01-20T10:23:14.925818-07:00 ltfs reclaim_target[10652]: GLESR035E(01182): The copy process from source to destination tape failed for the file 10991691794470275686-15679318773138264748-78339434-18551699-0.
2020-01-20T10:23:14.926141-07:00 ltfs reclaim_target[10652]: GLESR004E(01930): Processing file /tiered/ewang/xuexu/dbgap_tcga_germline/SRR3341182_SRR3341183_varscan.pileup failed: exiting the reclamation driver.
2020-01-20T10:23:14.926413-07:00 ltfs reclaim_target[10652]: GLESR026E(00158): The reclamation process failed (1932).#012 Have a look for previous messages.
2020-01-20T10:23:14.927489-07:00 ltfs mmm[2692]: GLESM221E(02136): Generic job with identifier REC_TGTT00022L8 failed.
2020-01-20T10:23:15.501824-07:00 ltfs mmm[2692]: GLESM223E(02024): Not all generic requests for session 1447695617 have been successful: 1 failed.
2020-01-20T10:23:15.502533-07:00 ltfs mmm[2692]: GLESM221E(02136): Generic job with identifier REC_SRCT00037L8 failed.
2020-01-20T10:23:16.130885-07:00 ltfs mmm[2692]: GLESM223E(02024): Not all generic requests for session 1442649089 have been successful: 1 failed.
2020-01-20T10:23:16.132052-07:00 ltfs ltfsee[14335]: GLESL082E(01590): Reclamation failed while reclaiming tape T00037L8 to target tape T00022L8.
2020-01-26T05:00:02.516999-07:00 ltfs ltfsee[23826]: GLESL668E(00979): Unable to get the state of tape T00030L8. Skip to reclaim. Consult the log files. (rc=1040)
2020-01-26T05:00:02.823888-07:00 ltfs ltfsee[23826]: GLESL682E(01028): Tape with ID: T00030L8 is an invalid state. Source tapes must be in state either "Valid LTFS" or "Warning".
2020-01-26T05:04:36.512781-07:00 ltfs reclaim_target[31377]: GLESR030E(00178): The reclamation process failed. (1655)#012 Have a look for previous messages.
2020-01-26T05:04:36.513142-07:00 ltfs mmm[2692]: GLESM221E(02136): Generic job with identifier REC_TGTT00029L8 failed.
2020-01-26T05:04:36.620168-07:00 ltfs mmm[2692]: GLESM223E(02024): Not all generic requests for session 2135234305 have been successful: 1 failed.
The two drives that are still in use have been idle for the past week. Interestingly, the ltfsee log did not show any information for the 31L8 and 54L8 tapes.
# ltfsee info jobs
Job Type Status Idle(sec) Scan ID Tape Pool Library Node File Name or inode
Reclaim(Source) In-progress 576566 2944737025 T00031L8 TIERED1 T_ARCH 4 -
Reclaim(Target) In-progress 576438 2944737281 T00054L8 TIERED1 T_ARCH 4 -
Validate Unscheduled 7224 1602429441 T00054L8 TIERED1 T_ARCH - -
The job cannot be stopped. Force an LTFS stop using ltfsee stop -f and wait. The tape drives should eventually eject the tapes. I got the following error when stopping, but the tapes were ejected and everything seemed to have stopped.
# ltfsee stop -f
Library name: T_ARCH, library serial: 0000078BA6130402, control node (ltfsee_md) IP address: 172.26.3.249.
Running stop command - sending request and waiting for the completion.
GLESL030E(00909): Unable to connect to the MMM service. Check whether the IBM Spectrum Archive EE has been started.
GLESL358E(00494): Error on processing tape T00054L8 (1).
GLESL661E(00104): IPC got failure result (result=1).
GLESL646E(00164): Unable to stop the IBM Spectrum Archive EE monitor daemon for library T_ARCH.
Cannot Start LTFS
# /opt/ibm/ltfsee/bin/ltfsee start
Library name: T_ARCH, library serial: 0000078BA6130402, control node (ltfsee_md) IP address: 172.26.3.249.
Running start command - sending request : T_ARCH.
Running start command - waiting for completion : T_ARCH.
...
GLESL657E(00191): Fail to start the IBM Spectrum Archive EE service (MMM) for library T_ARCH.
Use the 'ltfsee info nodes' command to see the error modules.
The monitor daemon will start the recovery sequence.
# ltfsee info nodes
Spectrum Archive EE service (MMM) for library T_ARCH fails to start or is not running on ltfs-ib.gpfs.net Node ID:4
Problem Detected:
Node ID Error Modules
4 MMM;
Looking at /var/log/ltfsee.log
, we see:
2020-02-03T13:15:35.266926-07:00 ltfs mmm[31142]: GLESM709E(00369): Assign tape (T00099L8) command error: 172.26.3.249:7600 (4): Request Error (070E): [Cartridge.cc:61]: Cartridge add is failed: 7c05 LTFSI1079E The operation is not allowed.
Seemed to have started up by itself?
# /opt/ibm/ltfsee/bin/ltfsee start
Library name: T_ARCH, library serial: 0000078BA6130402, control node (ltfsee_md) IP address: 172.26.3.249.
GLESL519I(00344): The IBM Spectrum Archive EE service (ltfsee_md) for library T_ARCH is already running.
# ltfsee info nodes
Node ID Status Node IP Drives Ctrl Node Library Node Group Host Name
4 Available 172.26.3.249 3 yes(active) T_ARCH G0 ltfs-ib.gpfs.net
Not sure what happened there...
Invalid tapes due to bad tape drive
# ltfsee info tapes
Tape ID Status Type Capacity(GiB) Used(GiB) Free(GiB) Reclaimable(GiB) Pool Library Address Drive Appendable
T00030L8 Invalid L8 10907 10895 0 22 TIERED1 T_ARCH 1131 - no
T00099L8 Invalid L8 10907 10883 0 0 TIERED2 T_ARCH 1031 - no
According to IBM's documentation:
The Invalid status indicates that the cartridge is inconsistent with the LTFS format. To check and repair this tape before you add it to a tape storage pool, use the ltfsee pool add command with the check option.
The root cause of these invalid tapes was a bad tape drive. The bad tape drive damaged the tape file markers, which caused read errors and in some cases caused the tape to be marked as Invalid LTFS. Re-adding these invalid tapes using the deep recovery option was not possible.
The fix that IBM proposed was to recall all files using ltfsee reclaim on the bad tape and, if necessary, to recover data manually using ITDT and then replace the stub file with the recovered file to make the file resident again.
Bad tape drives
One of the tape drives is in 'Error' state.
# ltfsee info drives
Drive S/N Status Type Role Library Address Node ID Tape Node Group
0007807A4B In use LTO8 mrg T_ARCH 260 4 T00000L8 G0
0007807A0B Error LTO8 mrg T_ARCH 261 4 - G0
000780765B In use LTO8 mrg T_ARCH 262 4 T00037L8 G0
The drive shows up as online to the tape library. No visible errors displayed on the tape drive itself.
I can't remove/readd it:
# ltfsee drive remove -d 0007807A0B
GLESL132E(00247): Could not remove a drive 0007807A0B. Drive is not in mount or not mounted state. The tape drive status:2.
Stopping and starting LTFSEE seemed to have cleared this error.
# ltfsee stop
Library name: T_ARCH, library serial: 0000078BA6130402, control node (ltfsee_md) IP address: 172.26.3.249.
Running stop command - sending request and waiting for the completion.
...
Stopped the IBM Spectrum Archive EE services for library T_ARCH.
# ltfsee start
Library name: T_ARCH, library serial: 0000078BA6130402, control node (ltfsee_md) IP address: 172.26.3.249.
Running start command - sending request : T_ARCH.
Running start command - waiting for completion : T_ARCH.
.....................................
Started the IBM Spectrum Archive EE services for library T_ARCH with good status.
# ltfsee info drives
Drive S/N Status Type Role Library Address Node ID Tape Node Group
0007807A4B Not mounted LTO8 mrg T_ARCH 260 4 - G0
0007807A0B Not mounted LTO8 mrg T_ARCH 261 4 - G0
000780765B Not mounted LTO8 mrg T_ARCH 262 4 - G0
Failing Tape Drive
The TS4500 tape library reported a tape drive issue with the error "Drive internal power-on self-tests failed." code 001F. The LTFS operations that were running at the time on this tape drive stopped and the tape cartridge itself is marked as 'Critical'.
/var/log/ltfsee.log
showed:
2020-03-24T19:43:39.873263-06:00 ltfs ltfseecp[21628]: GLESG081E(00468): Migrating data of GPFS file /tiered/kkurek/sbarclay/speedseq_align/73.L005.realign.bam: write failed to tape T00086L8 and file /ltfs/T00086L8/.LTFSEE_DATA/10991691794470275686-15679318773138264748-228651564-67167434-0 (data length: 524288, rc: -1, errno: 5).
2020-03-24T19:43:39.873579-06:00 ltfs ltfseecp[21628]: GLESG506E(00803): Migration file (/tiered/kkurek/sbarclay/speedseq_align/73.L005.realign.bam) to tape T00086L8 failed (1091).
2020-03-24T19:43:39.873870-06:00 ltfs ltfseecp[21628]: GLESC003E(01158): Redundant copy for file /tiered/kkurek/sbarclay/speedseq_align/73.L005.realign.bam to tape T00086L8 failed.
2020-03-24T19:43:39.979980-06:00 ltfs mmm[28988]: GLESM110W(00210): Tape T00086L8 got critical.
2020-03-24T19:44:09.243830-06:00 ltfs mmm[28988]: GLESM709E(00442): Unmount T00086L8 command error: 172.26.3.249:7600 (4): Request Error (077E): [Cartridge.cc:146]: Cartridge unmount is failed: 578a LTFSI1086E This operation is not allowed on a cartridge with a critical error.
2020-03-24T19:44:09.244112-06:00 ltfs mmm[28988]: GLESM118E(00060): Unmount of tape T00086L8 failed (drive 000780765B). Check the state of tapes and drives.
2020-03-24T19:44:39.166261-06:00 ltfs mmm[28988]: GLESM709E(00442): Unmount T00086L8 command error: 172.26.3.249:7600 (4): Request Error (077E): [Cartridge.cc:146]: Cartridge unmount is failed: 5b10 LTFSI1086E This operation is not allowed on a cartridge with a critical error.
2020-03-24T19:44:39.166568-06:00 ltfs mmm[28988]: GLESM118E(00060): Unmount of tape T00086L8 failed (drive 000780765B). Check the state of tapes and drives.
2020-03-24T19:45:09.283571-06:00 ltfs mmm[28988]: GLESM709E(00442): Unmount T00086L8 command error: 172.26.3.249:7600 (4): Request Error (077E): [Cartridge.cc:146]: Cartridge unmount is failed: 5d71 LTFSI1086E This operation is not allowed on a cartridge with a critical error.
2020-03-24T19:45:09.283890-06:00 ltfs mmm[28988]: GLESM118E(00060): Unmount of tape T00086L8 failed (drive 000780765B). Check the state of tapes and drives.
...
2020-03-25T01:19:14.689053-06:00 ltfs ltfsee[14508]: GLESL062E(00577): Tape with ID: T00086L8 is in an invalid state. Tapes must be either in state "Valid LTFS" or "Unknown".
2020-03-25T01:19:14.695195-06:00 ltfs ltfsee[14501]: GLESL062E(00577): Tape with ID: T00086L8 is in an invalid state. Tapes must be either in state "Valid LTFS" or "Unknown".
2020-03-25T01:19:14.700692-06:00 ltfs ltfsee[14503]: GLESL062E(00577): Tape with ID: T00086L8 is in an invalid state. Tapes must be either in state "Valid LTFS" or "Unknown".
2020-03-25T01:19:14.706384-06:00 ltfs ltfsee[14504]: GLESL062E(00577): Tape with ID: T00086L8 is in an invalid state. Tapes must be either in state "Valid LTFS" or "Unknown".
# ltfsee info tapes
Tape ID Status Type Capacity(GiB) Used(GiB) Free(GiB) Reclaimable(GiB) Pool Library Address Drive Appendable
T00086L8 Critical L8 10907 0 0 0 TIERED2 T_ARCH 262 000780765B no
...
T00084L8 Critical L8 10907 10859 0 0 TIERED2 T_ARCH 262 000780765B no (this happened the next day)
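To spot affected tapes quickly, the listing can be filtered on its Status column. This is a minimal sketch, assuming the column layout shown above (tape ID in the first field, status in the second); the helper name `critical_tapes` is hypothetical and not part of LTFSEE.

```shell
# Print the IDs of tapes whose Status column reads "Critical".
# Reads `ltfsee info tapes` output on stdin. Assumes the layout shown
# above: tape ID in field 1, status in field 2. The header line and
# "..." lines never match because their second field is not "Critical".
critical_tapes() {
    awk '$2 == "Critical" { print $1 }'
}

# Usage on the LTFS node (not executed here):
#   ltfsee info tapes | critical_tapes
```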
A tape in the Critical state cannot be moved until all of its files have been recovered:
# ltfsee tape move homeslot -t T00084L8 -p TIERED2
GLESL373I(00890): Moving tape T00084L8.
GLESL630E(00463): Cannot move tape T00084L8 because the status is "Critical".
The recovery process from IBM's documentation can be read at https://www.ibm.com/support/knowledgecenter/en/ST9MBR_1.2.4/ltfs_ee_recovering_critical_tapes.html
In summary, the recovery process is to:
- Recover the files from tape back to Spectrum Scale.
## Show all files to recover
# ltfsee recover -s -p TIERED2 -l T_ARCH -t T00084L8
## Recover these files
# ltfsee recover -c -p TIERED2 -l T_ARCH -t T00084L8
- Remove the tape from the LTFS library.
# ltfsee recover -r -p TIERED2 -l T_ARCH -t T00084L8
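- Move the tape out of the drive: to its home slot (as shown here) or, to eject it from the library, to an I/E slot with ltfsee tape move ieslot.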
# ltfsee tape move homeslot -t T00084L8 -p TIERED2
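The three `ltfsee recover` steps above can be sketched as a dry-run helper that only prints the commands in order, so they can be reviewed before being run by hand. The function name `recover_plan` is hypothetical; the flags are the ones used in the steps above.

```shell
# Print the three `ltfsee recover` invocations for one critical tape:
# -s (show files to recover), -c (recover them), -r (remove the tape).
# Nothing is executed; this only prints the commands.
# Arguments: pool name, library name, tape ID.
recover_plan() {
    pool=$1
    library=$2
    tape=$3
    for flag in -s -c -r; do
        printf 'ltfsee recover %s -p %s -l %s -t %s\n' \
            "$flag" "$pool" "$library" "$tape"
    done
}

recover_plan TIERED2 T_ARCH T00084L8
```

Check the output of each step before running the next one; as the following example shows, the `-r` step refuses to remove the tape while files are still listed for recovery.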
If files are stuck in a premigrated state, the recover process won't succeed:
# ltfsee recover -r -p TIERED1 -l T_ARCH -t T00037L8
Scanning GPFS file systems to find migrated/saved objects in tape T00037L8.
Tape T00037L8 has 14 files to be recovered. The list is saved to /tmp/ltfs.29601.tiered.recoverlist.
GLESL613E(00822): Cannot remove tape T00037L8 because there are files to be recovered.
You can try the ltfsee repair command to change the status of the files from premigrated to resident.
# ltfsee repair "`cat /tmp/a`"
ANS9294I No files matching '/tiered/morph/....czi' were found.
GLESL257I(00083): Non-empty regular file /tiered/morph/....czi was in premigrated state. The file is repaired to resident state.
However, the file is still reported as premigrated afterwards, despite the message saying it was repaired to the resident state:
# ltfsee info files -f "`cat /tmp/a`"
Name: /tiered/morph/....czi
Tape id:T00037L8@T_ARCH:T00069L8@T_ARCH Status: premigrated
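When many files are involved, the same check can be run over the whole list. This is a minimal sketch, assuming the two-line `Name:` / `Status:` layout shown above; the helper name `still_premigrated` is hypothetical.

```shell
# Print the names of files that `ltfsee info files` still reports as
# premigrated. Reads that command's output on stdin and assumes each file
# is reported as a "Name: <path>" line followed by a line ending in
# "Status: <state>", as in the output above.
still_premigrated() {
    awk '
        /^Name: /  { name = substr($0, 7); next }   # text after "Name: "
        / Status: / { if ($NF == "premigrated") print name }
    '
}

# Usage on the LTFS node (not executed here):
#   ltfsee info files -f /path/to/file | still_premigrated
```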
The files causing the issue exist in three places: on another tape (T00069L8 in the output above), in the Spectrum Archive storage used for premigration, and on the GPFS storage. It should therefore be safe to force-remove the tape cartridge:
# ltfsee pool remove -p TIERED1 -t T00037L8 -r
GLESL043I(01134): Removing tape T00037L8 from storage pool TIERED1.
GLESL361W(01199): Tape T00037L8 was removed from pool TIERED1 forcefully. Cannot recall files on tape T00037L8. Add the tape to the same pool again if you need to recall files from the tape without formatting.
The tape was removed from the pool but left in the tape drive. Because the tape is no longer managed by LTFSEE, you have to move it to its home slot through the tape library's own interface. The bad drive was then removed:
[root@ltfs tiered]# ltfsee drive remove -d 000780765B -n 4 -l T_ARCH
GLESL121I(00279): Drive serial 000780765B is removed from the tape drive list.
The 'bad' tape can then be re-added.
Cannot Add Tape
The tape library is unable to read 7 tape cartridges. When I attempt to add one of these tapes to the LTFS pool, I get the following errors:
# ltfsee pool add -p TIERED2 -t T00102L8 -c
GLESL042I(00894): Adding tape T00102L8 to storage pool TIERED2.
GLESL091E(00937): This operation is not allowed to this state of tape. Need to check the status of Tape T00102L8 by using the ltfsee info tapes command.
# ltfsee pool add -p TIERED2 -t T00102L8 -d
GLESL042I(00894): Adding tape T00102L8 to storage pool TIERED2.
GLESL091E(00937): This operation is not allowed to this state of tape. Need to check the status of Tape T00102L8 by using the ltfsee info tapes command.
dmesg and /var/log/ltfsee.log show the following:
# dmesg | tail
[ 459.877936] lin_tape: IBMChgr0-----30402 changer_check_result sensekey: 5 asc: 24 ascq: 0
[ 1275.717464] lin_tape_set_active_partition: LOCATE_16 failed: -5
[ 3866.466645] lin_tape_set_active_partition: LOCATE_16 failed: -5
[ 4145.200700] lin_tape_set_active_partition: LOCATE_16 failed: -5
# tail /var/log/ltfsee.log
2020-04-09T11:05:23.662537-06:00 ltfs mmm[31580]: GLESM709E(00369): Assign tape (T00102L8) command error: 172.26.3.249:7600 (4): Request Error (070E): [Cartridge.cc:61]: Cartridge add is failed: 165b LTFSI1079E The operation is not allowed.
2020-04-09T11:50:22.403860-06:00 ltfs mmm[31580]: GLESM709E(00559): Recovery T00102L8 command error: 172.26.3.249:7600 (4): Request Error (082E): [Cartridge.cc:242]: Cartridge recovery is failed: 167e LTFSD0301E Read or write permanent error.
2020-04-09T11:52:43.831125-06:00 ltfs mmm[31580]: GLESM555E(00341): Failed to format tape (T00102L8).
2020-04-09T11:52:44.348738-06:00 ltfs mmm[31580]: GLESM223E(02024): Not all generic requests for session 1439891713 have been successful: 1 failed.
2020-04-09T11:52:44.349214-06:00 ltfs ltfsee[5720]: GLESL091E(00937): This operation is not allowed to this state of tape. Need to check the status of Tape T00102L8 by using the ltfsee info tapes command.
2020-04-09T11:53:13.302341-06:00 ltfs mmm[31580]: GLESM709E(00369): Assign tape (T00102L8) command error: 172.26.3.249:7600 (4): Request Error (070E): [Cartridge.cc:61]: Cartridge add is failed: 7e7d LTFSI1084E This operation is not allowed on an invalid cartridge.
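To see at a glance which errors a tape has accumulated, the log can be reduced to a count per message ID. This is a minimal sketch, assuming message IDs of the form seen above (`GLES`, one letter, three digits, then `E`, `W`, or `I`); the helper name `gles_summary` is hypothetical.

```shell
# Count GLES message IDs on log lines that mention a given tape.
# Reads ltfsee.log-style lines on stdin; argument: tape ID.
# Only the first message ID on each line is counted.
gles_summary() {
    awk -v tape="$1" '
        index($0, tape) && match($0, /GLES[A-Z][0-9][0-9][0-9][EWI]/) {
            count[substr($0, RSTART, RLENGTH)]++
        }
        END { for (id in count) print count[id], id }
    ' | sort -k2
}

# Usage on the LTFS node (not executed here):
#   gles_summary T00102L8 < /var/log/ltfsee.log
```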
The tape library also shows the following entries in the event log.
Description: Cartridge T00102L8 had a read, write, or positioning error.
Error Code: 0003
Description: Cartridge T00102L8 had a read, write, or positioning error because of faulty media.
Error Code: 0004
It looks like the tape has just gone bad... along with 6 others in a short period of time.