#11950 vmhost-x86-copr01.rdu-cc.fedoraproject.org DOWN
Opened 8 months ago by praiskup. Modified a month ago

IDRAC claims:

The system board Pfault fail-safe voltage is outside of range.  Sun 26 May 2024 23:48:49
The OCP PG voltage is outside of range.     Sun 26 May 2024 23:48:42

The machine cannot be turned on.


Yeah, this happened yesterday night... I wasn't able to file a ticket on it then, so many thanks for doing so.

We will need to engage Dell tomorrow and see what can be done... ;(

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

8 months ago

I think @dkirwan, @jnsamyak and @patrikp are going to work on this one and try to get Dell on the line to fix things. ;)

Working with Dell tech support to resolve.

Latest update:

  • updating idrac firmware
  • failing to update server firmware
  • Requesting RH tech in datacenter reseat an "OCP card" on the server.

RH Tech is unavailable until the week after June 10th to carry out a reseat of the OCP card on the server. We are blocked on the Dell tech support steps until this work is carried out.

James Gibson has responded to me this afternoon, he will be in the RDU2-CC datacenter and can take care of reseating this OCP card.

James messaged that he discovered an issue with a PSU. Updated Dell with information.

It's a bad PSU: with one removed, the server reboots itself;
with the other removed, it boots just fine.

So what's the current status here? I see the machine is up; do we need to replace the bad PSU? Or?

Yup, just going through the tech support process once more, and will try to get them to replace the damaged PSU.

Metadata Update from @dkirwan:
- Issue assigned to dkirwan

7 months ago

Updated internal ticket for James Gibson, to carry out the next troubleshooting steps requested by Dell.

James swapped the PSUs on Friday, and the server also boots up correctly with the other PSU, so it is not a PSU issue on its own. Updated Dell tech support with the information; should get another update from them today.

Checking to see if there is an issue external to the server, perhaps with the UPS.

Asking James Gibson to check:

  • Are the PSUs set to redundant?
  • When plugged in at the same time, are they plugged into the same outlet/UPS?
  • If so, can we test by plugging them into different outlets/UPSes?

OK, looks like we're fully back in action: the 2nd PSU is connected, and we've connected the PSUs to different power sockets. Seems like it may have been a faulty power outlet?

Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

6 months ago

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

6 months ago

Cannot be turned on :shrug:

Yeah, it has some weird power error. ;(

"The system board Pfault fail-safe voltage is outside of range."

Seems like this box is just cursed. ;(

@dkirwan can you look at this and get onsite/dell folks to work on it?

Spoke with Dell, and with James Gibson at RH. Opened tickets and got the following troubleshooting steps to carry out next:

1. Power the server down.  
2. Disconnect server from all power cables, Network cables. 
3. Hold down the power button continuously for at least 10 seconds.  
4. Insert power cables and network cables back to the system.  
5. Wait about 2 minutes before powering on the server for iDRAC to be refreshed.
6. Power the system on.  

If the issue persists after the flea power drain, please perform the following:

Reseat the OCP card and perform another flea drain.

Once that is performed, please let us know the results. If the server still doesn't turn on, we'll have to perform a minimum to POST.

The components mentioned below are the minimum configuration to POST:

● One processor (CPU) in socket processor 1
● One memory module (DIMM) in socket A1
● One power supply unit
● System board + LOM card + RIO 

Everything else must be disconnected / removed (please take pictures to confirm the configuration).

James Gibson has configured the server with minimal config. Server is showing orange light, gathering logs and reopening case with Dell to troubleshoot further.

Got caught here with tickets timing out and closing, Flock, and PTO; having to open a new ticket with Dell support :crying_cat_face:

Some new support steps to follow. Reaching out to James Gibson to try to arrange someone to handle it.

1. Check the electrical environment for any external voltage issues.
2. Update the Firmware 

iDRAC with Lifecycle Controller to v7.10.50.10 

BIOS to v2.16.2 (System restart is required) 

3.  Swap the Network Card from Slot 1 to Slot 2 

Collect a TSR at this point to identify if the errors are following the Card or the Slot. 

If the error persists, please connect one card at a time and check if the error persists. 

Keep a note of which card is throwing the errors when connected. 

Network Cards are located in the Riser 2. 

4. Perform a flea power drain 

- Power the server down.  
- Disconnect server from all power cables, Network cables. 
- Hold down the power button continuously for at least 10 seconds.  
- Insert power cables and network cables back to the system.  
- Wait about 2 minutes before powering on the server for iDRAC to be refreshed.
- Power the system on. 

James will get to visit later this week, currently in RDU3.

James carried out the steps today. The orange light is still on the chassis; captured logs and uploaded them to Dell.

  • James Gibson has managed to swap the network cards as requested
  • Unfortunately the server is still showing a yellow light, with the same voltage warnings in the iDRAC logs.
  • Uploaded the logs to Dell once more, and now they are going to replace the OCP card, MB and the CPU.
  • Contacted James to get information required in order to have Dell engineer call out and perform this hardware swap out.

Dell has responded with an appointment date.

We have successfully placed the order for the parts replacement. The appointment is scheduled for 9/9/2024.

Dell engineer has replaced the hardware and the system now appears to be up and running although with different networking config. Will look into getting external network access restored asap to this machine now.

We should just need to update the ansible vars and re-run the playbook for the main interfaces... the mgmt is static, so shouldn't matter.

The network device with mac f4:02:70:d0:05:00 is still present - that's the one we use for networking; so I think everything is OK right now (Copr allocated VMs, and system is utilized).

The changed devices are not used I think: https://pagure.io/fedora-infra/ansible/c/79ee807af52a0ed12ef7d6588d39f3198d917b9f?branch=main
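The check described above (is the NIC with MAC f4:02:70:d0:05:00 still present after the hardware swap?) can be sketched roughly as follows. This is only an illustration, not what was actually run on the host: the function name `find_interface_by_mac` is hypothetical, and it assumes a Linux host where each interface exposes its MAC in `/sys/class/net/<iface>/address`.

```python
from pathlib import Path
from typing import Optional


def find_interface_by_mac(mac: str, sysfs_net: str = "/sys/class/net") -> Optional[str]:
    """Return the name of the interface whose MAC matches, or None.

    Hypothetical helper: reads <sysfs_net>/<iface>/address for each entry.
    """
    wanted = mac.strip().lower()
    root = Path(sysfs_net)
    if not root.is_dir():
        return None
    for iface in sorted(root.iterdir()):
        try:
            current = (iface / "address").read_text().strip().lower()
        except OSError:
            # Not an interface directory (or unreadable); skip it.
            continue
        if current == wanted:
            return iface.name
    return None


# MAC taken from the comment above; prints the interface name, or None if absent.
print(find_interface_by_mac("f4:02:70:d0:05:00"))
```

If the interface name printed on the host differs from what the ansible vars expect, that would be the part to update before re-running the playbook.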

Thank you for the help here! :heart:

Metadata Update from @praiskup:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 months ago

Hm, for the record, I spent a while on the ssh command line and the system alerts:

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: Corrected error, no action required.

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: CPU:0 (17:31:0) MC255_STATUS[-|CE|-|AddrV|-|-|-|-|-]: 0x940000000000009f

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: Error Addr: 0x0000001a1c188820

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: PPIN: 0x02b49efcae18c05e

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: IPID: 0x0000000000000000

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: cache level: L3/GEN, tx: RESV
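For anyone reading the MC255_STATUS value above, here is a minimal sketch of decoding it, assuming the common x86 machine-check status layout (bits 63..57 are Val, Over, UC, En, MiscV, AddrV, PCC; the low 16 bits hold the error code). The function name `decode_mca_status` is made up for illustration, and vendor-specific fields are deliberately not decoded.

```python
# Flag bits in the top byte of an x86 machine-check status register
# (assumed common layout; vendor-specific bits are ignored here).
STATUS_FLAGS = {
    63: "Val",    # register contains a valid error
    62: "Over",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "En",     # error reporting was enabled
    59: "MiscV",  # the MISC register holds valid data
    58: "AddrV",  # the ADDR register holds a valid address
    57: "PCC",    # processor context corrupted
}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit status value into named flags plus the low 16-bit error code."""
    flags = {name: bool((status >> bit) & 1) for bit, name in STATUS_FLAGS.items()}
    flags["error_code"] = status & 0xFFFF
    return flags


# Value copied from the kernel message above.
decoded = decode_mca_status(0x940000000000009F)
for name, value in decoded.items():
    print(f"{name}: {value}")
```

For this value Val, En and AddrV are set while UC and PCC are clear, which matches the kernel's own `CE` / `AddrV` summary: a corrected error with a valid address, not an uncorrected one.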

That doesn't look good. I guess it didn't crash though?

Perhaps we need a bios/firmware update?

I'll update the firmware on there and see if that improves anything so.

Metadata Update from @dkirwan:
- Issue status updated to: Open (was: Closed)

4 months ago

This server is already running the latest bios version apparently.

Need to get the server updated to RHEL 8.10 in order to install the Dell iDRAC Service Module iSM utility, so we can gather host bundle logs from iDRAC for Dell tech support.

Do it when you have time for it; copr will re-start the builds that were taken by this box. If you want to be super gentle on Copr users, let us know ~3 hours in advance, we'll deallocate the machine.

Hi @praiskup, can I give you 3 hours' notice now? I'll take care of this upgrade later this afternoon!

System upgraded to RHEL 8.10, and the Dell Service Module iSM service is installed.

So we are waiting on Dell here?

No, this is currently with me. I need to figure out the connection between iDRAC and the iSM service module so that it can gather logs. Once it is capable of gathering the host logs, I'll be able to approach Dell tech support and re-engage on further troubleshooting.

ok, cool. Thanks for the update.

So the ISM service is installed and running:

systemctl status dcismeng.service:
EventCategory="Audit" EventSeverity="info" IsPastEvent="false" language="en-US"] The iDRAC Service Module is started on the operating system (OS) of server.     
8194" EventCategory="Audit" EventSeverity="warn" IsPastEvent="false" language="en-US"] The iDRAC Service Module is unable to discover iDRAC from the operating system of the server.
8194" EventCategory="Audit" EventSeverity="warn" IsPastEvent="false" language="en-US"] The iDRAC Service Module is running with Limited Functionality Mode hence some features are unavailable. Possible reasons are: 1) OS-to-BMC Passthrough setting in iDRAC is disabled 2) USBNIC interface on the host OS does not have a configured IP address.
 The iDRAC Service Module detected an OS to iDRAC Pass-through in the disabled mode. Enable the OS to iDRAC Pass-through (USB NIC) or reinstall iSM.

lsusb:
Bus 001 Device 009: ID 413c:a102 Dell Computer Corp. iDRAC Virtual NIC                                                 

dmesg: 
[5088820.001846] usb 1-1.3: USB disconnect, device number 9
[5088826.065569] usb 1-1.3: new high-speed USB device number 10 using xhci_hcd
[5088826.160723] usb 1-1.3: New USB device found, idVendor=413c, idProduct=a102, bcdDevice= 3.16
[5088826.160732] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[5088826.160735] usb 1-1.3: Product: iDRAC Virtual NIC USB Device
[5088826.160738] usb 1-1.3: Manufacturer: Dell(TM)
[5088826.160740] usb 1-1.3: SerialNumber: 5678

It can see the OS to iDRAC Pass-through ethernet device being enabled and disabled, but it's not actually creating an ethernet device on the system. It might require a module to be enabled; need to do some more research.
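As a rough aid for repeating the first half of this check, the sketch below looks for the iDRAC virtual NIC in `lsusb` output by its vendor:product ID (413c:a102, taken from the lsusb line above). The helper name `has_usb_device` is hypothetical. It only confirms that the USB device is enumerated; it says nothing about whether a network interface was created for it, which is the actual open question here.

```python
import re

# vendor:product pair copied from the lsusb output in this comment
IDRAC_USB_ID = "413c:a102"


def has_usb_device(lsusb_output: str, usb_id: str = IDRAC_USB_ID) -> bool:
    """Return True if any lsusb line lists the given vendor:product ID."""
    pattern = re.compile(r"\bID\s+" + re.escape(usb_id) + r"\b", re.IGNORECASE)
    return any(pattern.search(line) for line in lsusb_output.splitlines())


sample = "Bus 001 Device 009: ID 413c:a102 Dell Computer Corp. iDRAC Virtual NIC"
print(has_usb_device(sample))
```

On the host itself, the text passed in would be the captured output of `lsusb` rather than the sample line.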

Hi @praiskup, would it be possible to upgrade this system to RHEL 9.0 without affecting the copr workloads? Wondering if this, together with the latest Dell iDRAC iSM on RHEL 9.0, might unblock me here..

Metadata
Boards 1
ops Status: Backlog