#11029 bvmhost-x86-03 has lost 2 drives
Closed: Fixed with Explanation a year ago by kevin. Opened a year ago by smooge.

From emails and configs it looks like sdh and sdb have failed on the hardware. I'm filing a ticket because this could cause a big outage, which will need to be tracked.

This is an automatically generated mail message from mdadm
running on bvmhost-x86-03.iad2.fedoraproject.org

A Fail event had been detected on md device /dev/md/0.

It could be related to component device /dev/sdb1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdh2[7](F) sdg2[6] sdf2[5] sde2[4] sdc2[2] sda2[0] sdd2[3]
      488384 blocks super 1.0 [8/6] [U_UUUUU_]
      bitmap: 0/1 pages [0KB], 65536KB chunk
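
To pull the degraded state out of a mail like this without reading it by eye, a minimal sketch follows. It assumes the usual `/proc/mdstat` layout seen above; `parse_mdstat` and the `SAMPLE` text are illustrative names, not an existing tool, and only the `(F)` flag is handled.

```python
import re

# One member entry, e.g. "sdh2[7](F)" or "sdg2[6]".
MEMBER_RE = re.compile(r"(\w+)\[(\d+)\](\(F\))?")
# Array header, e.g. "md1 : active raid1 sdh2[7](F) sdg2[6] ...".
HEADER_RE = re.compile(r"^(md\S*)\s*:\s*(\S+)\s+(\S+)\s+(.*)$")
# Slot summary, e.g. "[8/6] [U_UUUUU_]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def parse_mdstat(text):
    """Return a dict per md array with members, failed members and slot counts."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            name, state, level, rest = header.groups()
            members = MEMBER_RE.findall(rest)
            current = arrays[name] = {
                "state": state,
                "level": level,
                "members": [dev for dev, _, _ in members],
                "failed": [dev for dev, _, flag in members if flag],
            }
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
            # Positions of "_" in the slot pattern are the slots with no working member.
            current["missing_slots"] = [
                i for i, c in enumerate(status.group(3)) if c == "_"
            ]
    return arrays


# Copied from the mail above.
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdh2[7](F) sdg2[6] sdf2[5] sde2[4] sdc2[2] sda2[0] sdd2[3]
      488384 blocks super 1.0 [8/6] [U_UUUUU_]
      bitmap: 0/1 pages [0KB], 65536KB chunk
"""

if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE).items():
        print(name, info["failed"], info["missing_slots"])
```

For the mail above this reports sdh2 as failed and slots 1 and 7 as empty, which matches sdb2 being absent from the member list.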

bvmhost-x86-03.iad2.fedoraproject.org:compose-iot01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:compose-rawhide01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:koji02.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:mbs-backend01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:oci-registry02.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:odcs-backend01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:sign-bridge01.iad2.fedoraproject.org:running:1
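
For tracking which guests would be affected, the list above can be split into fields. This is a rough sketch; it assumes the four colon-separated fields are host, guest, state and an autostart flag (the meaning of the last field is a guess from the output, not confirmed here), and `parse_guest_list` is a hypothetical helper.

```python
from collections import namedtuple

# Field meanings are assumed from the listing; "autostart" in particular is a guess.
Guest = namedtuple("Guest", "host name state autostart")


def parse_guest_list(text):
    """Parse 'host:guest:state:flag' lines, skipping anything that does not fit."""
    guests = []
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 4:
            continue
        host, name, state, flag = parts
        guests.append(Guest(host, name, state, flag == "1"))
    return guests


# Two lines copied from the listing above.
LISTING = """\
bvmhost-x86-03.iad2.fedoraproject.org:koji02.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:sign-bridge01.iad2.fedoraproject.org:running:1
"""

if __name__ == "__main__":
    for guest in parse_guest_list(LISTING):
        if guest.state == "running":
            print(guest.name)
```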

I'll try to call Dell tomorrow about this.

One drive is completely gone (it doesn't even show in mgmt). The other one shows fine in mgmt, but gives errors in Linux.

Ideally they would replace both drives. Failing that, they would replace the one that's completely offline, we would reboot and re-add it, and then do the same for the other.
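
The re-add step could look roughly like the sketch below, which only builds the `mdadm --manage ... --add` command lines and does not run them. The array-to-partition mapping is inferred from the mail (`/dev/md/0` with sdb1, md1 with the sdX2 partitions); the `/dev/md1` path and the helper name `readd_commands` are assumptions. A replacement disk would also need a matching partition table before any of this, which is not shown.

```python
def readd_commands(disks, arrays):
    """Build mdadm commands to add each disk's partitions back to its arrays.

    disks:  disk names in the order they should be handled, e.g. ["sdb", "sdh"].
    arrays: mapping of md device path to the partition number it uses on each disk.
    """
    commands = []
    for disk in disks:
        for md_device, part in sorted(arrays.items()):
            commands.append(
                ["mdadm", "--manage", md_device, "--add", f"/dev/{disk}{part}"]
            )
    return commands


if __name__ == "__main__":
    # Assumed layout: partition 1 in /dev/md/0, partition 2 in /dev/md1.
    layout = {"/dev/md/0": 1, "/dev/md1": 2}
    for cmd in readd_commands(["sdb", "sdh"], layout):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before running anything by hand, and watching `/proc/mdstat` between disks shows when the resync has finished.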

In the event of doom, all the VMs on this server could be redeployed on another host; none of them should have local data.

Metadata Update from @zlopez:
- Issue tagged with: medium-gain, medium-trouble, ops

a year ago

Dell is sending a tech to swap 2 drives and some memory. This will happen later today.

We will try to keep downtime to a minimum.

Drive 1 was replaced. Drive 7 actually didn't produce errors on reboot, so we think it might have just been in a bad state or the memory issue was affecting it.

Memory was replaced.

Machine is back up and running.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

a year ago

