#618 [critical] hdd issue on out-of-warranty backup server
Closed: Fixed with Explanation 6 months ago by arrfab. Opened 7 months ago by arrfab.

The backup server for the whole CentOS infra (including backups for Stream 8 builds, etc.) is suffering from an HDD in PFA (Predictive Failure Analysis) mode:

```
[49727990.700965] sd 6:0:6:0: [sdh] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=16s
[49727990.700972] sd 6:0:6:0: [sdh] Sense Key : Medium Error [current] [descriptor]
[49727990.700978] sd 6:0:6:0: [sdh] Add. Sense: Unrecovered read error - auto reallocate failed
[49727990.700984] sd 6:0:6:0: [sdh] CDB: Read(16) 88 00 00 00 00 00 08 19 f6 d8 00 00 00 80 00 00
[49727990.700989] blk_update_request: critical medium error, dev sdh, sector 135919320
[49728003.052364] raid5_end_read_request: 5 callbacks suppressed
[49728003.052367] md/raid:md1: read error corrected (8 sectors at 135919320 on sdh)
```
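For triage, the affected device and sector can be pulled straight out of the kernel log; a minimal sketch, with the relevant dmesg line inlined for illustration (on the real host you would pipe `dmesg` output in instead):

```shell
# Inlined dmesg excerpt from the ticket; on the real host, replace the
# echo with `dmesg` (or `journalctl -k`).
dmesg_excerpt='[49727990.700989] blk_update_request: critical medium error, dev sdh, sector 135919320'

echo "$dmesg_excerpt" | awk '/critical medium error/ {
    gsub(",", "", $7)               # strip trailing comma from the device name
    print "device=" $7, "sector=" $9
}'
# prints: device=sdh sector=135919320
```

From there, `smartctl -a /dev/sdh` would confirm the drive's reallocated and pending-sector counts before deciding on replacement vs. migration.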

It slows down all backup operations, and because the node is out of warranty we need to investigate where to move it; as we don't have any nodes under warranty, that will be to another out-of-warranty node.

Let's use this ticket to track rebalancing some services with respect to NFS/iSCSI and local storage usage in RDU2c.


This has a domino/cascade effect, but after analysis and a risk-mitigation plan (for the time being), I'll move some iSCSI LUNs to different storage, do the same for NFS exports, and recondition another node as the new backup server (still out of warranty, but with more disks and so more capacity, allowing some HDDs to be used as hot-spare devices in the software RAID array in case of other disk issues).
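The hot-spare arrangement described above can be sketched in mdadm.conf; the UUID placeholders and the spare-group name are hypothetical, not values from the actual host:

```
# /etc/mdadm.conf (fragment, hypothetical values)
# Arrays sharing a spare-group also share their hot spares: when a member
# of either array fails, `mdadm --monitor` moves a spare across as needed.
ARRAY /dev/md0 metadata=1.2 UUID=<uuid-of-md0> spare-group=backup
ARRAY /dev/md1 metadata=1.2 UUID=<uuid-of-md1> spare-group=backup
MAILADDR root
```

The spare disks themselves would be attached with `mdadm /dev/md1 --add /dev/sdX`, after which they show up flagged `(S)` in `/proc/mdstat`.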

First identified iSCSI LUN (WIP) being moved to free up the new backup storage node: the iSCSI LUN used for the main primary mirror (where all CentOS artifacts, Stream, Linux and SIGs, are pushed).

Also moving the NFS exports used by CBS, and announcing downtime for the NFS switch to the new host.
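The NFS side of the move is essentially recreating the exports on the new host before the announced cutover; a hedged sketch (the path and subnet are hypothetical placeholders, not the real CBS values):

```
# /etc/exports on the new NFS host (fragment, hypothetical path/subnet)
# Keep the same export options as on the old host so CBS clients only
# need the new hostname; clients remount during the downtime window.
/exports/cbs    10.0.0.0/24(rw,sync,no_root_squash)
```

After editing, `exportfs -ra` re-reads the table without restarting the NFS server.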

Could we also change the default Vendor to CentOS Community Build Service for CBS, since everything is going down anyway...?

> Could we also change the default Vendor to CentOS Community Build Service for CBS, since everything is going down anyway...?
That was my initial intention, as updating that through Ansible and rolling it out would just restart kojid. On second thought, though: in ticket #621 we mentioned doing it at the tag level, and if we want to implement that change "system-wide", maybe we should first announce the intention on the centos-devel list, wait for reaction/feedback, and if there is none, implement what was announced. So I'm not doing it in a hurry just because there is a small maintenance window (kojid can be restarted later, even during a quiet period, so it's fine for later).
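For reference, the "system-wide" variant discussed here would be a one-line builder setting rolled out via Ansible; a sketch of the relevant kojid.conf fragment, assuming kojid's standard `vendor` option (the value is the one proposed in this ticket):

```
# /etc/kojid/kojid.conf (fragment)
# 'vendor' sets the Vendor tag stamped into RPMs built by this builder;
# changing it requires a kojid restart, hence the maintenance-window
# discussion above.
[kojid]
vendor=CentOS Community Build Service
```

The per-tag alternative mentioned for ticket #621 would instead scope the change to specific build tags rather than every builder.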

Moved multiple iSCSI LUNs and NFS exports over the last few days, so now working on the backup server itself (the last one in the chain). WIP, and content is slowly getting there.

(Is replacing the troubled hard drive not possible?)

The backup server is now operational on a different host (still not under warranty), but at least we have a local backup pool and remote (encrypted) snapshots on S3. Closing this ticket.

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)


