Issue #11029: bvmhost-x86-03 has lost 2 drives - fedora-infrastructure

Closed: Fixed with Explanation a year ago by kevin. Opened a year ago by smooge.

From the mdadm emails and the configs, it looks like sdh and sdb have failed at the hardware level. I'm putting in a ticket because this could cause a big outage, which will need to be tracked.

This is an automatically generated mail message from mdadm
running on bvmhost-x86-03.iad2.fedoraproject.org

A Fail event had been detected on md device /dev/md/0.

It could be related to component device /dev/sdb1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdh2[7](F) sdg2[6] sdf2[5] sde2[4] sdc2[2] sda2[0] sdd2[3]
      488384 blocks super 1.0 [8/6] [U_UUUUU_]
      bitmap: 0/1 pages [0KB], 65536KB chunk
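
As a reading aid for the mdstat excerpt above: `[8/6]` means the array expects 8 members and 6 are active, and each `_` in `[U_UUUUU_]` marks a slot with no working member. The following is a minimal Python sketch of how that excerpt could be checked automatically. It is not part of the original report; the function name `parse_mdstat` and the regular expressions are assumptions for illustration, and they only cover the simple line layout shown in the mail.

```python
import re

# Sample copied from the mail above; on a live host, read /proc/mdstat instead.
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdh2[7](F) sdg2[6] sdf2[5] sde2[4] sdc2[2] sda2[0] sdd2[3]
      488384 blocks super 1.0 [8/6] [U_UUUUU_]
      bitmap: 0/1 pages [0KB], 65536KB chunk
"""

# Hypothetical patterns: they assume the "mdN : state level members" layout
# seen in the sample and do not handle every format the kernel can emit.
ARRAY_RE = re.compile(r"^(md\d+) : (\S+) (\S+) (.*)$")
MEMBER_RE = re.compile(r"(\w+)\[(\d+)\](\(F\))?")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return a per-array summary of failed members and empty slots."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, _state, level, members = header.groups()
            current = {
                "level": level,
                # Members carrying the (F) flag are marked faulty by md.
                "failed": [dev for dev, _role, flag
                           in MEMBER_RE.findall(members) if flag],
                "expected": None,
                "active": None,
                "missing_slots": [],
            }
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
            # Each "_" is a slot without a working member.
            current["missing_slots"] = [
                i for i, flag in enumerate(status.group(3)) if flag == "_"
            ]
    return arrays


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE).items():
        print(name, info)
```

For the sample, this reports sdh2 as failed and slots 1 and 7 as missing. sdb2 does not appear in the member list at all, which is consistent with a drive that has dropped off the bus entirely.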

bvmhost-x86-03.iad2.fedoraproject.org:compose-iot01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:compose-rawhide01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:koji02.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:mbs-backend01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:oci-registry02.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:odcs-backend01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:sign-bridge01.iad2.fedoraproject.org:running:1

kevin commented a year ago

I'll try to call Dell tomorrow about this.

One drive is completely gone (it doesn't even show up in the management interface). The other one looks fine in the management interface, but produces errors in Linux.

Ideally they would replace both drives. Failing that, they would replace the one that's completely offline, we would reboot and re-add it, and then do the same for the other.

In the event of doom, all the VMs on this server could be redeployed on another one; none of them should have local data.
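
The "replace, reboot and re-add" step above could look roughly like the sketch below. It only builds and prints the `mdadm --manage ... --add` commands as a dry run; it does not run them. The mapping of arrays to partition numbers is an assumption inferred from the mail in this ticket (md0 on the first partition, md1 on the second), the device names `/dev/md0` and `/dev/md1` are assumed kernel names, and any array not visible in the truncated mdstat output is not covered. A brand-new disk would also need a matching partition table first, which is out of scope here, and device letters can change across a reboot.

```python
# Hypothetical mapping, inferred only from the mail in this ticket:
# /dev/md0 uses partition 1 of each disk, /dev/md1 uses partition 2.
ARRAY_PARTITIONS = {"/dev/md0": 1, "/dev/md1": 2}


def readd_commands(disk, array_partitions=None):
    """Return mdadm argv lists that add each partition of `disk` to its array.

    Nothing is executed; the caller decides whether to run the commands.
    """
    if array_partitions is None:
        array_partitions = ARRAY_PARTITIONS
    return [
        ["mdadm", "--manage", array, "--add", f"{disk}{part}"]
        for array, part in sorted(array_partitions.items())
    ]


if __name__ == "__main__":
    # Dry run for the replaced drive; verify the device name on the host first.
    for argv in readd_commands("/dev/sdb"):
        print(" ".join(argv))
```

Keeping this as a dry run is deliberate: the commands should be reviewed against the actual `/proc/mdstat` on the host before anyone runs them.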

Metadata Update from @zlopez:
- Issue tagged with: medium-gain, medium-trouble, ops

a year ago

kevin commented a year ago

Dell is sending a tech to swap 2 drives and some memory. This will happen later today.

We will try to keep downtime to a minimum.

kevin commented a year ago

Drive 1 was replaced. Drive 7 actually didn't produce errors on reboot, so we think it might have just been in a bad state or the memory issue was affecting it.

Memory was replaced.

Machine is back up and running.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

a year ago

Metadata

Assignee: None
Tags: medium-gain, medium-trouble, ops
Blocking: None
Depending on: None
Priority: 🔥 URGENT 🔥
Boards: ops (Status: Backlog)