#618 [critical] hdd issue on out-of-warranty backup server
Closed: Fixed with Explanation 6 months ago by arrfab. Opened 7 months ago by arrfab.

The backup server for the whole CentOS infra (including backups for Stream 8 builds, etc.) is suffering from an HDD in PFA (Predictive Failure Analysis) mode:

```
[49727990.700965] sd 6:0:6:0: [sdh] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=16s
[49727990.700972] sd 6:0:6:0: [sdh] Sense Key : Medium Error [current] [descriptor]
[49727990.700978] sd 6:0:6:0: [sdh] Add. Sense: Unrecovered read error - auto reallocate failed
[49727990.700984] sd 6:0:6:0: [sdh] CDB: Read(16) 88 00 00 00 00 00 08 19 f6 d8 00 00 00 80 00 00
[49727990.700989] blk_update_request: critical medium error, dev sdh, sector 135919320
[49728003.052364] raid5_end_read_request: 5 callbacks suppressed
[49728003.052367] md/raid:md1: read error corrected (8 sectors at 135919320 on sdh)
```
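For triage, the affected device and sector can be pulled straight out of the kernel log; a minimal sketch, with the relevant dmesg line inlined for illustration (on the real host you would pipe `dmesg` output in instead):

```shell
# Inlined dmesg excerpt from the ticket; on the real host, replace the
# echo with `dmesg` (or `journalctl -k`).
dmesg_excerpt='[49727990.700989] blk_update_request: critical medium error, dev sdh, sector 135919320'

echo "$dmesg_excerpt" | awk '/critical medium error/ {
    gsub(",", "", $7)               # strip trailing comma from the device name
    print "device=" $7, "sector=" $9
}'
# prints: device=sdh sector=135919320
```

From there, `smartctl -a /dev/sdh` would confirm the drive's reallocated and pending-sector counts before deciding on replacement vs. migration.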

It slows down all backup operations, and because the node is out of warranty we need to investigate where to move it; as we don't have any nodes under warranty, that will be to another out-of-warranty node.

Let's use this ticket to track rebalancing some services with respect to NFS/iSCSI and local storage usage in RDU2c.


This has a domino/cascade effect, but after analysis and a risk-mitigation plan (for the time being), I'll move some iSCSI LUNs to different storage, do the same for NFS exports, and recondition another node as the new backup server (still out of warranty, but with more disks and so more capacity, allowing some HDDs to be used as hot-spare devices in the software RAID array in case of other disk issues).
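The hot-spare arrangement described above can be sketched in mdadm.conf; the UUID placeholders and the spare-group name are hypothetical, not values from the actual host:

```
# /etc/mdadm.conf (fragment, hypothetical values)
# Arrays sharing a spare-group also share their hot spares: when a member
# of either array fails, `mdadm --monitor` moves a spare across as needed.
ARRAY /dev/md0 metadata=1.2 UUID=<uuid-of-md0> spare-group=backup
ARRAY /dev/md1 metadata=1.2 UUID=<uuid-of-md1> spare-group=backup
MAILADDR root
```

The spare disks themselves would be attached with `mdadm /dev/md1 --add /dev/sdX`, after which they show up flagged `(S)` in `/proc/mdstat`.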

First identified iSCSI LUN (WIP) being moved to free up the new backup storage node: the iSCSI LUN used for the main primary mirror (where all CentOS artifacts, Stream, Linux and SIGs, are pushed).

Also moving the NFS exports used by CBS, and announcing downtime for the NFS switch to the new host.
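The NFS side of the move is essentially recreating the exports on the new host before the announced cutover; a hedged sketch (the path and subnet are hypothetical placeholders, not the real CBS values):

```
# /etc/exports on the new NFS host (fragment, hypothetical path/subnet)
# Keep the same export options as on the old host so CBS clients only
# need the new hostname; clients remount during the downtime window.
/exports/cbs    10.0.0.0/24(rw,sync,no_root_squash)
```

After editing, `exportfs -ra` re-reads the table without restarting the NFS server.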

Could we also change the default Vendor to CentOS Community Build Service for CBS, since everything is going down anyway...?

> Could we also change the default Vendor to CentOS Community Build Service for CBS, since everything is going down anyway...?
That was my initial intention, as updating that through Ansible and rolling it out would just restart kojid. On second thought, though: in ticket #621 we mentioned doing it at the tag level, and if we want to implement that change "system-wide", maybe we should first announce the intention on the centos-devel list, wait for reaction/feedback, and if there is none, implement what was announced. So I'm not doing it in a hurry just because there is a small maintenance window (kojid can be restarted later, even during a quiet period, so it's fine for later).
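For reference, the "system-wide" variant discussed here would be a one-line builder setting rolled out via Ansible; a sketch of the relevant kojid.conf fragment, assuming kojid's standard `vendor` option (the value is the one proposed in this ticket):

```
# /etc/kojid/kojid.conf (fragment)
# 'vendor' sets the Vendor tag stamped into RPMs built by this builder;
# changing it requires a kojid restart, hence the maintenance-window
# discussion above.
[kojid]
vendor=CentOS Community Build Service
```

The per-tag alternative mentioned for ticket #621 would instead scope the change to specific build tags rather than every builder.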

Moved multiple iSCSI LUNs and NFS exports over the last few days, so now working on the backup server itself (the last one in the chain). WIP, and content is slowly getting there.

(Is replacing the troubled hard drive not possible?)

The backup server is now operational on a different host (still not under warranty), but at least we have a local backup pool and remote (encrypted) snapshots on S3. Closing this ticket.

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)


