#11755 vmhost-x86-copr02 hardware issues
Closed: Fixed with Explanation 5 days ago by kevin. Opened 2 months ago by kevin.

vmhost-x86-copr02 is currently down with hardware issues.

I opened a ticket with Dell on it on Friday but haven't heard back yet; I will ping them again.

It refuses to power up and shows a nasty-sounding hardware message, so I think it needs something replaced.

Filing this ticket for visibility.

CC: @frostyx @praiskup

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 months ago

Thank you @kevin for taking care of this.

I called Dell today since the support ticket I submitted via the chat thing 10 days ago was not going anywhere.

They filed a new ticket for me, had me gather debug info and try a few things.

As next steps, they want to do a 'flea drain' on it and see if that helps. If not, they want us to try pulling some memory banks that were showing errors in the past and see if it boots.

I've filed an internal ticket to get someone on site at rdu2 to do that.

Metadata Update from @praiskup:
- Issue tagged with: copr

2 months ago

Just an FYI, I am still working on this... it's proving hard to isolate which part is faulty.

Hopefully we will have some news soon.

So, with some DC folks' help we were finally able to isolate the problem to one of the CPUs.

The machine is back up now, but without the bad CPU. We have asked the vendor to ship us a replacement, so there will need to be some downtime to swap in the new CPU when it arrives.

Thank you for the update and good news! I re-enabled the machine in Copr (with only half = 10 workers for now).

Ah, the replacement CPU came in and they are replacing it right now. :(
Hopefully that doesn't mess up anything... it should be back up with full CPUs in a few here.

The CPU has been installed and the machine is back up now.

I do want to check its mgmt console later today to confirm that there are no errors, but after that we should hopefully finally be done here.

Nice, I increased the per-host quota for this machine again. Feel free to close when you are OK with doing so (the machine seems OK from my perspective). Thank you again!

So, there's an error showing up for one of the memory sticks.

They sent us some new memory, and we swapped it in... but the error kept happening.

Then, we swapped that stick with another one and the error stayed with the slot. ;(

I have sent Dell a new diag log and am waiting to see what they suggest.

Finally the saga has come to an end. :)

Dell admitted that the motherboard was not working right and sent a new one (and a tech to replace it).

The machine is back up and everything is finally showing green. :)

Thanks for all the patience here...

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

5 days ago
