#12049 buildhw-a64 01,02,19,20,21,22,23,24 not booting
Closed: Fixed with Explanation 5 months ago by kevin. Opened 6 months ago by kevin.

buildhw-a64 01,02,21,22,23,24 are all lenovo emags that were running Fedora 39 happily
buildhw-a64-19,20 are thunderx2 boxes that were running Fedora 39 happily
bvmhost-a64-05 to bvmhost-a64-11 are all emags that were running RHEL 8 (they were/are slated to be reinstalled with Fedora and made into buildhw's)

After upgrading the Fedora 39 buildhw's to 40 and rebooting, they fail to boot with:

Checkpoint 9D
Checkpoint 9C
Checkpoint B4
Checkpoint B4
Checkpoint B4
Checkpoint B4
Checkpoint B4
Checkpoint 9D
Checkpoint 9C
Checkpoint B4
Checkpoint 92
Checkpoint A0
Checkpoint A2
Checkpoint A2
Checkpoint A0
Checkpoint A2
Checkpoint A2
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint A0
Checkpoint A0
Checkpoint A2
Checkpoint A2
Checkpoint AD


Synchronous Exception at 0x000000BFF184B4A0

I was able to get one of them to PXE boot OK and start the installer, so perhaps we just need to reinstall them. ;(
But it would be nice to know if f40 will work on them at all.


Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

6 months ago

Spent a large chunk of today on this without too much to show for it. ;(

I did both f39 and f40 installs and neither boots.

I tried downgrading grub/shim/kernel and nothing booted.

openqa-a64-worker03.iad2.fedoraproject.org is an emag, on f40 and up and working... but it has not been rebooted in about 35 days.
( @adamwill I advise against rebooting it until we sort this out).

There was an image that Kevin posted to the channel from one of the systems showing the installer unable to update the UEFI boot entries (or something very similar to that). I don't know if that is consistent across the boxes, but someone mentioned a similar problem on different hardware a while ago. It seemed to be caused by the UEFI having only so many slots available and not really cleaning them out. The fix on that board was to do some sort of factory reset with a paperclip on the board, but I don't know what the Ampere will need. [The other similar problem I found last night was that the specific NVRAM for the UEFI is hosed once it is full and has to be replaced physically. I am hoping this isn't the case.]

Installed F40 on an ampere-hr350 in beaker, booted ok with the latest bits.

Firmware details:
SMpro FW version: 1.08
PMpro FW version: 1.08
FW date: 20200326
UEFI version: 1.15 hve104r
UEFI date: 20210226

Ours seem to have:

SMpro FW version: 1.07
PMpro FW version: 1.07
FW date: 20190523
UEFI version: 1.12 HVE104N
UEFI date: 20191129

Is there any chance you might have the firmware files available somewhere I could try updating with?
I looked at lenovo and ampere's sites and found nothing yesterday. ;(
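For anyone following along, here is a rough sketch of comparing the two firmware sets quoted above field by field. The values are copied from this ticket; the dict layout and the `firmware_diff` helper are just illustrative names, not part of any tool.

```python
# Firmware fields as quoted in this ticket. The dict layout is illustrative.
working = {  # the ampere-hr350 in beaker that boots F40
    "SMpro FW version": "1.08",
    "PMpro FW version": "1.08",
    "FW date": "20200326",
    "UEFI version": "1.15 hve104r",
    "UEFI date": "20210226",
}
ours = {  # the buildhw emags that fail to boot
    "SMpro FW version": "1.07",
    "PMpro FW version": "1.07",
    "FW date": "20190523",
    "UEFI version": "1.12 HVE104N",
    "UEFI date": "20191129",
}


def firmware_diff(reference, candidate):
    """Return {field: (reference_value, candidate_value)} for fields that differ."""
    return {
        field: (reference[field], candidate.get(field))
        for field in reference
        # Compare case-insensitively so "hve104r" vs "HVE104R" would not count.
        if reference[field].lower() != (candidate.get(field) or "").lower()
    }


for field, (ref, cur) in firmware_diff(working, ours).items():
    print(f"{field}: ours={cur} working={ref}")
```

Every field differs here, which fits the idea that our boxes are simply on an older firmware set than the one that boots.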

Hi Kevin - need the idiots guide as to what is missing, what you're testing on, etc - but happy to try and help track down anything.
What are you running on and what's an emag :)
Mark

These are lenovo emag aarch64 servers.

The current summary is that I can reinstall them by pxe booting just fine, but the end of the install results in:

Screenshot_from_2024-07-11_13-28-17.png

I also cannot change any efi boot entries with efibootmgr.
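Even if writes fail, a read-only listing may still show what the firmware is holding. Below is a minimal sketch that parses `efibootmgr`-style output into a list of entries, assuming the usual `BootNNNN* label` line format. The `SAMPLE` text is a made-up placeholder, not output from these machines, and `parse_boot_entries` is a hypothetical helper name.

```python
import re

# Matches lines such as "Boot0001* Fedora"; the "*" marks an active entry.
# "BootCurrent:", "BootOrder:" and "BootNext:" do not match because the four
# characters after "Boot" must all be hex digits.
BOOT_ENTRY = re.compile(r"^Boot([0-9A-Fa-f]{4})(\*?)\s+(.*)$")

# Placeholder text for illustration only, NOT captured from the buildhw boxes.
SAMPLE = """\
BootCurrent: 0001
Timeout: 5 seconds
BootOrder: 0001,0000
Boot0000  UEFI: PXE IPv4
Boot0001* Fedora
"""


def parse_boot_entries(text):
    """Return a list of dicts describing each BootNNNN entry in the text."""
    entries = []
    for line in text.splitlines():
        match = BOOT_ENTRY.match(line.strip())
        if match:
            entries.append({
                "num": match.group(1),
                "active": match.group(2) == "*",
                # Some versions append a tab-separated device path; drop it.
                "label": match.group(3).split("\t")[0].strip(),
            })
    return entries


for entry in parse_boot_entries(SAMPLE):
    print(entry["num"], "active" if entry["active"] else "inactive", entry["label"])
```

On a real host the text could come from running `efibootmgr` as root and capturing its stdout. A long list of stale entries would at least be consistent with the "firmware ran out of slots" theory, though that is only a guess at this point.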

All rhel8/9, fedora 39/40 fail the same way.

It has to be shim and/or grub2, but downgrading those doesn't get it working.

So, I think it's a firmware issue and we need to upgrade to get them booting again. ;(
I am working on getting the firmware files.

OK - Let me check with the server team if there is someone there who can help.
Mark

My contact in the server team didn't know what that server was either :)
Suggestion was to dump the dmidecode information (-t system?) so we can see if it matches anything we know about?
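As a rough sketch of what that dump could feed into, the snippet below pulls the key/value pairs out of the "System Information" block of `dmidecode -t system` output. The sample text uses placeholder values (`ExampleVendor`, `ExampleModel`), not real data from these servers, and `parse_dmi_system` is a hypothetical helper name.

```python
# Placeholder dmidecode-style text for illustration only.
SAMPLE_DMI = (
    "Handle 0x0001, DMI type 1, 27 bytes\n"
    "System Information\n"
    "\tManufacturer: ExampleVendor\n"
    "\tProduct Name: ExampleModel\n"
    "\tVersion: Not Specified\n"
    "\n"
    "Handle 0x0020, DMI type 32, 11 bytes\n"
)


def parse_dmi_system(text):
    """Return the indented 'Key: Value' pairs under 'System Information'."""
    fields = {}
    in_section = False
    for line in text.splitlines():
        if line.strip() == "System Information":
            in_section = True
            continue
        if in_section:
            # The block ends at the first blank or non-indented line.
            if not line[:1].isspace():
                break
            key, sep, value = line.strip().partition(":")
            if sep:
                fields[key.strip()] = value.strip()
    return fields


info = parse_dmi_system(SAMPLE_DMI)
print(info.get("Manufacturer"), info.get("Product Name"))
```

The manufacturer and product name strings are the bits most likely to help someone on the server team recognize the hardware.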

These are Lenovo HR350a systems which were a short run set of hardware that Lenovo rebadged for Ampere. All support for these was done via Ampere but they stopped supporting the hardware in 2022 and archived all the data on it as far as I could tell when I went to look for things.

The systems have a Supermicro BMC on the mainboard which might be able to be 'reset via paperclip', but I am not sure.

I was able to obtain the last firmware(s) and update them... buildhw-a64-01/02 and 21/22/23/24 have all been updated, reinstalled and ansiblized.

buildhw-a64-19 and 20 are cavium boxes that we can't seem to get to mgmt on anymore, so those are going to have to wait for a person to be on-site.

There's also 7 more that are old bvmhost's... I plan to update firmware on them and reinstall them as buildhw's... but will not get to that today.

Thanks everyone for help on it.

I reinstalled those 7 other hosts before the mass rebuild.

So, that leaves 19 and 20 still down. I will get onsite folks to look at them (tentatively the week of Aug 5th).

So that visit was delayed. It's scheduled now for next week.

I got 19 and 20 back alive and reinstalled and redeployed them. \o/

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

5 months ago
