#11950 vmhost-x86-copr01.rdu-cc.fedoraproject.org DOWN
Opened 2 months ago by praiskup. Modified 4 hours ago

iDRAC reports:

The system board Pfault fail-safe voltage is outside of range.  Sun 26 May 2024 23:48:49
The OCP PG voltage is outside of range.  Sun 26 May 2024 23:48:42

The machine cannot be turned on.


Yeah, this happened last night... I wasn't able to file a ticket on it then, so many thanks for doing so.

We will need to engage Dell tomorrow and see what can be done... ;(

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 months ago

I think @dkirwan, @jnsamyak and @patrikp are going to work on this one and try to get Dell on the line to fix things. ;)

Working with Dell tech support to resolve.

Latest update:

  • Updating the iDRAC firmware
  • Failing to update the server firmware
  • Requesting that an RH tech in the datacenter reseat an "OCP card" on the server

The RH tech is unavailable until the week after June 10th to reseat the OCP card on the server. We are blocked on the Dell tech support steps until this work is carried out.

James Gibson responded to me this afternoon. He will be in the RDU2-CC datacenter and can take care of reseating this OCP card.

James messaged that he discovered an issue with a PSU. Updated Dell with information.

It's a bad PSU: with one PSU removed, the server reboots itself.
With the other one removed, it boots just fine.

So what's the current status here? I see the machine is up. Do we need to replace the bad PSU, or is something else needed?

Yup, just going through the tech support process once more, and will try to get them to replace the damaged PSU.

Metadata Update from @dkirwan:
- Issue assigned to dkirwan

24 days ago

Updated the internal ticket for James Gibson to carry out the next troubleshooting steps requested by Dell.

James swapped the PSUs on Friday, and the server also boots correctly with the other PSU, so it is not a PSU issue on its own. Updated Dell tech support with the information. We should get another update from them today.

Checking to see if there is an issue external to the server, perhaps with the UPS.

Asking James Gibson to check:

  • Are the PSUs set to redundant?
  • When both are plugged in at the same time, are they plugged into the same outlet/UPS?
  • If so, can we test by plugging them into different outlets/UPSes?
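
For the redundancy question, a rough sketch of how the check could be scripted, assuming the iDRAC exposes the standard DMTF Redfish Power resource. The function name `summarize_power` and the sample payload are made up for illustration and are not data from this host; the sketch only inspects an already-fetched JSON document and does not talk to the iDRAC itself.

```python
def summarize_power(power: dict) -> dict:
    """Summarize PSU health and redundancy from a Redfish-style Power resource."""
    supplies = power.get("PowerSupplies", [])

    # Any PSU whose reported health is not "OK" is listed as unhealthy.
    unhealthy = [
        psu.get("Name", "unknown")
        for psu in supplies
        if psu.get("Status", {}).get("Health") != "OK"
    ]

    # Treat the system as redundant only if some redundancy group has at
    # least two members and itself reports an "OK" health.
    redundant = any(
        len(group.get("RedundancySet", [])) >= 2
        and group.get("Status", {}).get("Health") == "OK"
        for group in power.get("Redundancy", [])
    )

    return {
        "psu_count": len(supplies),
        "unhealthy": unhealthy,
        "redundant": redundant,
    }


# Hypothetical payload (assumption): two PSUs, the second one reporting a fault.
SAMPLE_POWER = {
    "PowerSupplies": [
        {"Name": "PS1", "Status": {"State": "Enabled", "Health": "OK"}},
        {"Name": "PS2", "Status": {"State": "Enabled", "Health": "Critical"}},
    ],
    "Redundancy": [
        {
            "Name": "System Board PS Redundancy",
            "RedundancySet": [{"@odata.id": "PS1"}, {"@odata.id": "PS2"}],
            "Status": {"State": "Enabled", "Health": "Critical"},
        }
    ],
}

summary = summarize_power(SAMPLE_POWER)
print(summary)
```

This only covers the first question. Whether both PSUs share an outlet or UPS still has to be checked physically in the rack.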

OK, looks like we're fully back in action: the 2nd PSU is connected, and the two PSUs are now powered from different power sockets. Seems like it may have been a faulty power outlet?

Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

10 days ago

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

4 hours ago

Cannot be turned on :shrug:


Metadata
Boards 1
ops Status: Backlog