- buildhw-a64-01, 02, 21, 22, 23, and 24 are all Lenovo eMAGs that were running Fedora 39 happily.
- buildhw-a64-19 and 20 are ThunderX2s that were running Fedora 39 happily.
- bvmhost-a64-05 through bvmhost-a64-11 are all eMAGs that were running RHEL 8 (they were/are slated to be reinstalled with Fedora and made into buildhw's).
After upgrading the Fedora 39 buildhw's to Fedora 40 and rebooting, they fail to boot with:
```
Checkpoint 9D
Checkpoint 9C
Checkpoint B4
Checkpoint B4
Checkpoint B4
Checkpoint B4
Checkpoint B4
Checkpoint 9D
Checkpoint 9C
Checkpoint B4
Checkpoint 92
Checkpoint A0
Checkpoint A2
Checkpoint A2
Checkpoint A0
Checkpoint A2
Checkpoint A2
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint A0
Checkpoint A0
Checkpoint A2
Checkpoint A2
Checkpoint AD
Synchronous Exception at 0x000000BFF184B4A0
```
I was able to get one of them to PXE boot ok and start the installer, so perhaps we just need to reinstall them. ;( But it would be nice to know if f40 will work on them at all.
Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops
Spent a large chunk of today on this without too much to show for it. ;(
I did both f39 and f40 installs and neither boots.
I tried downgrading grub/shim/kernel and nothing booted.
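For reference, a downgrade attempt like the one described would look roughly like this (a sketch only; the exact package versions tried aren't recorded here, and the kernel is install-only so it needs an explicit versioned install rather than `downgrade`):

```shell
# Roll shim/grub2 back to the previous builds (dnf picks the
# latest older build available in the enabled repos).
sudo dnf downgrade shim-aa64 grub2-efi-aa64 grub2-common

# The kernel is an install-only package, so install an older
# version explicitly instead (version shown is illustrative).
sudo dnf install kernel-core-6.8.5-301.fc40

# Re-copy the boot loader files onto the EFI system partition.
sudo dnf reinstall shim-aa64 grub2-efi-aa64
```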
openqa-a64-worker03.iad2.fedoraproject.org is an eMAG, on f40 and up and working... but it's not been rebooted in about 35 days. ( @adamwill I advise against rebooting it until we sort this out.)
CC: @pbrobinson @pwhalen @ausil :)
There was an image that Kevin posted to the channel from one of the systems about the installer not being able to update the UEFI boot loader (or something very similar to that). I don't know if that is consistent across the boxes, but someone mentioned a similar problem on different hardware a while ago.. it seemed to be something with the UEFI only having so many boot entry slots available and not really cleaning them out. The fix on that board was to do some sort of factory reset with a paperclip on the board, but I don't know what the Ampere will need. [The other similar problem I found last night was that the NVRAM holding the UEFI variables is hosed once it is full and has to be replaced physically.. I am hoping this isn't the case.]
Installed F40 on an ampere-hr350 in beaker, booted ok with the latest bits.
Firmware details:

```
SMpro FW version: 1.08
PMpro FW version: 1.08
FW date: 20200326
UEFI version: 1.15 hve104r
UEFI date: 20210226
```
Ours seem to have:
```
SMpro FW version: 1.07
PMpro FW version: 1.07
FW date: 20190523
UEFI version: 1.12 HVE104N
UEFI date: 20191129
```
Is there any chance you might have the firmware files available somewhere I could try updating with? I looked at Lenovo's and Ampere's sites and found nothing yesterday. ;(
Hi Kevin - need the idiot's guide as to what is missing, what you're testing on, etc - but happy to try and help track down anything. What are you running on, and what's an emag :) Mark
These are Lenovo eMAG aarch64 servers.
The current summary is that I can reinstall them by PXE booting just fine, but the end of the install results in:
<img alt="Screenshot_from_2024-07-11_13-28-17.png" src="/fedora-infrastructure/issue/raw/files/452dc6d9ef095c4451f366ade72555fdc9777fea8badadfc8db781f599a87263-Screenshot_from_2024-07-11_13-28-17.png" />
I also cannot change any efi boot entries with efibootmgr.
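For context, these are the kinds of `efibootmgr` operations that were failing (a sketch; the disk, partition, and entry numbers below are illustrative, not the actual values on these hosts):

```shell
# List the current UEFI boot entries and boot order.
efibootmgr -v

# Try to create a new entry pointing at shim on the EFI
# system partition (device/partition are placeholders).
sudo efibootmgr -c -d /dev/sda -p 1 -L "Fedora" \
    -l '\EFI\fedora\shimaa64.efi'

# Try to delete a stale entry, e.g. Boot0003.
sudo efibootmgr -b 0003 -B
```

On these boxes, the writes to the UEFI variable store were what failed, which is consistent with a firmware/NVRAM problem rather than an OS one.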
RHEL 8/9 and Fedora 39/40 installs all fail the same way.
It has to be shim and/or grub2, but downgrading those doesn't get it working.
So, I think it's a firmware issue and we need to upgrade to get them booting again. ;( I am working on getting the firmware files.
OK - Let me check with the server team if there is someone there who can help. Mark
My contact in the server team didn't know what that server was either :) The suggestion was to dump the dmidecode information (`-t system`?) so we can see if it matches anything we know about.
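The suggested dump would be something like (a sketch; `-t bios` is an extra record I'd add since firmware version is the question here):

```shell
# Dump the DMI "system" records: manufacturer, product
# name, serial number, UUID.
sudo dmidecode -t system

# The BIOS/firmware record is often useful for matching
# hardware and firmware revisions too.
sudo dmidecode -t bios
```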
These are Lenovo HR350a systems, a short production run of hardware that Lenovo rebadged for Ampere. All support for these was done via Ampere, but they stopped supporting the hardware in 2022 and, as far as I could tell when I went looking, archived all the data on it.
The systems have a Supermicro BMC on the mainboard, which might be able to be 'reset via paperclip', but I am not sure.
I was able to obtain the last firmware(s) and update them... buildhw-a64-01/02 and 21/22/23/24 have all been updated, reinstalled, and ansiblized.
buildhw-a64-19 and 20 are Cavium boxes that we can't seem to reach via mgmt anymore, so those are going to have to wait for a person to be on-site.
There's also 7 more that are old bvmhost's... I plan to update the firmware on them and reinstall them as buildhw's, but I will not get to that today.
Thanks everyone for help on it.
I reinstalled those 7 other hosts before the mass rebuild.
So, that leaves 19 and 20 still down. I will get onsite folks to look at them (tentatively the week of Aug 5th).
So that visit was delayed. It's scheduled now for next week.
I got 19 and 20 back alive and reinstalled and redeployed them. \o/
Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)