#7907 qa14 down
Closed: Fixed 4 years ago by kevin. Opened 4 years ago by kevin.

qa14 is down.

I noticed the other day it had 2 dropped disks, so I bet it dropped another one and was unable to continue.

md2 : active raid6 sde3[6] sdd3[11] sdf3[8] sdc3[10] sda3[9] sdb3[2]
      2719623168 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/6] [U_U_UUUU]
      bitmap: 3/4 pages [12KB], 65536KB chunk

md1 : active raid6 sda1[9] sde1[6] sdd1[11] sdc1[10] sdf1[8] sdb1[2]
      786038784 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/6] [U_U_UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md0 : active raid1 sda2[9] sde2[6] sdc2[10] sdd2[11] sdf2[8] sdb2[2]
      1023424 blocks super 1.2 [8/6] [U_U_UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
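
For anyone who wants to spot this state from a script rather than by eye, here is a minimal sketch that reads the `[total/active] [U_...]` status fields from the output above. The names `MDSTAT` and `degraded_arrays` are made up for illustration, and the parsing assumes the mdstat layout shown here, not every variant the kernel can print.

```python
import re

# Sample copied from the mdstat output pasted in this ticket.
MDSTAT = """\
md2 : active raid6 sde3[6] sdd3[11] sdf3[8] sdc3[10] sda3[9] sdb3[2]
      2719623168 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/6] [U_U_UUUU]
      bitmap: 3/4 pages [12KB], 65536KB chunk

md1 : active raid6 sda1[9] sde1[6] sdd1[11] sdc1[10] sdf1[8] sdb1[2]
      786038784 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/6] [U_U_UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md0 : active raid1 sda2[9] sde2[6] sdc2[10] sdd2[11] sdf2[8] sdb2[2]
      1023424 blocks super 1.2 [8/6] [U_U_UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map array name -> (total slots, active members, missing slot indexes)
    for every array that has fewer active members than slots."""
    result = {}
    name = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line like "md2 : active raid6 ...".
        header = re.match(r"(md\d+)\s*:", line)
        if header:
            name = header.group(1)
            continue
        # The status line carries "[8/6] [U_U_UUUU]": total/active, then one
        # flag per slot, where "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and name:
            total, active = int(status.group(1)), int(status.group(2))
            missing = [i for i, flag in enumerate(status.group(3)) if flag == "_"]
            if active < total:
                result[name] = (total, active, missing)
    return result


if __name__ == "__main__":
    for array, (total, active, missing) in degraded_arrays(MDSTAT).items():
        print(f"{array}: {active}/{total} active, missing slots {missing}")
```

On the output above, this reports all three arrays as running with 6 of 8 members. RAID 6 tolerates the loss of two members, so md1 and md2 had no redundancy left at that point.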

Likely we will need to replace drives and do a reinstall.

cc @adamwill @tflink @smooge


OK, thanks for the heads-up. For the record, qa14 is the 'medium-sized' production host and runs 10 worker instances; losing it cuts our prod capacity from 44 to 34 (the 'small' host, qa05, hosts 4 instances and the 'large' host, qa02, hosts 30). So it's not a disaster, but it is a problem; if you can get it back up soon, I'd appreciate that. If we lose it for a long time, we can consider switching qa09 from staging to prod so the loss affects staging rather than prod, I guess.
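
To make the capacity numbers easy to recheck, here is a small sketch of the arithmetic. The per-host worker counts are the ones quoted above; the names `PROD_WORKERS` and `remaining_capacity` are hypothetical and only used for this illustration.

```python
# Worker instances per production host, as quoted in this comment.
PROD_WORKERS = {"qa02": 30, "qa14": 10, "qa05": 4}


def remaining_capacity(workers, down):
    """Sum the worker instances on hosts that are not listed in `down`."""
    return sum(count for host, count in workers.items() if host not in down)


if __name__ == "__main__":
    total = remaining_capacity(PROD_WORKERS, down=set())
    without_qa14 = remaining_capacity(PROD_WORKERS, down={"qa14"})
    print(f"prod workers: {total}, with qa14 down: {without_qa14}")
```

qa09 is left out because its worker count is not stated in this ticket.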

So the box had 2 disks not added and 1 disk dead. I am reinstalling the box and we can add the failed disk later.

Metadata Update from @smooge:
- Issue assigned to smooge

4 years ago

Looks like it's all back up.

Feel free to reopen if there's anything further to do.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

Actually, I wasn't done yet, as I haven't documented it in the ticket. I will do that now:
1. The bad drive has not been replaced. I need to get various debug data from the system and call it in to Dell.
2. The system was built with 7 drives, so when the drive gets replaced I recommend it be brought back online and then added as a spare to the RAID. The R520s eat disks a lot, so it would be better to have an online spare.
3. The kickstart I used has problems due to using authconfig. This was used in various other kickstarts as well, and we need to just edit those.
