#8750 VMs don't boot correctly on openqa-aarch64-02.qa and we don't know why
Closed: Will Not/Can Not fix 4 years ago by smooge. Opened 4 years ago by adamwill.

We have a very weird issue with openqa-aarch64-02.qa.fedoraproject.org . That's one of the three X-Gene boxes that were recently deployed as openQA worker hosts, the other two being openqa-aarch64-01 and openqa-aarch64-03.

Sometime between around 2020-03-12T20:11:15.0753 UTC and 2020-03-12T23:55:53.0381 UTC, all openQA jobs run on that host started failing in early boot: it seems the UEFI firmware cannot find any attached drives (optical or hard) and so fails to boot. It happens every time on this host, and never seems to happen on the other two.

The system logs did not indicate that anything of note - or really anything at all - happened on the system between those times. In the server's logs I do see something kinda interesting: some NFS errors like these, appearing around the relevant time:

Mar 13 00:00:07 openqa-stg01.qa.fedoraproject.org kernel: NFSD: client 10.5.131.26 testing state ID with incorrect client ID
Mar 13 00:00:17 openqa-stg01.qa.fedoraproject.org kernel: NFSD: client 10.5.131.26 testing state ID with incorrect client ID
Mar 13 00:00:27 openqa-stg01.qa.fedoraproject.org kernel: NFSD: client 10.5.131.26 testing state ID with incorrect client ID

but those have gone away since we recreated the box, and the problem persists.

We've looked at everything we can think of that could cause this. We've checked the package loadout differences between the boxes, and tried different versions of anything possibly significant; no dice. We've tried the older, official qemu build (rather than the infra one). We've reinstalled various packages. We've reinstalled the entire box from scratch. We've tried SELinux permissive and enforcing. We've rebooted it a dozen times. Nothing we do seems to make any difference - the VMs still always fail on 02, and always boot on 01 and 03.
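For reference, the package comparison was nothing sophisticated, just something along these lines (anyone can repeat it):

ssh openqa-aarch64-01.qa.fedoraproject.org 'rpm -qa | sort' > pkgs-01.txt
ssh openqa-aarch64-02.qa.fedoraproject.org 'rpm -qa | sort' > pkgs-02.txt
diff -u pkgs-01.txt pkgs-02.txt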

For now I'm going to take 02 "out of circulation": I'm not actually going to stop the worker instances, but I'm going to adjust its config so all the worker instances have a WORKER_CLASS such that they will not pick up any 'normal' jobs. This will also allow us to manually run jobs targeted at that WORKER_CLASS for convenient testing (those jobs will only be picked up on that host).
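Concretely, the change is just to the worker config on that host, roughly this in /etc/openqa/workers.ini (setting it in [global] applies it to every worker instance on the box; the exact class name is arbitrary, it just has to match what test jobs get targeted at):

[global]
WORKER_CLASS = aarch64-02-broken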


Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: hardware

4 years ago

Is there a way I could launch an instance manually from a command line? Or make it try and boot one so I could watch?

Can I install the libvirt goo on it to see if that works?

And I suppose we could try f32 on the off chance?

"Is there a way I could launch a instance manually from a command line?"

Er...sort of, but not really. You can run one, but you don't get to see any output (or at least I couldn't figure out a way to); it seems the UEFI firmware doesn't output to the console, only to a video device, and you can't really get at a VM GUI from infra (at least I haven't figured out how). To get a qemu command, you can start from the one os-autoinst actually runs - they're logged in each job - and then hack at it. This is a typical command:

/usr/bin/qemu-system-aarch64 \
    -device virtio-gpu-pci -only-migratable \
    -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 \
    -soundhw hda -m 3072 -machine virt,gic-version=max -cpu host \
    -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 \
    -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 \
    -device nec-usb-xhci -device usb-tablet -device usb-kbd \
    -smp 2 -enable-kvm -no-shutdown -vnc :95,share=force-shared \
    -device virtio-serial \
    -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on \
    -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console \
    -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on \
    -qmp chardev:qmp_socket -S \
    -device virtio-scsi-pci,id=scsi0 \
    -blockdev driver=file,node-name=hd0-file,filename=/var/lib/openqa/pool/5/raid/hd0,cache.no-flush=on \
    -blockdev driver=qcow2,node-name=hd0,file=hd0-file,cache.no-flush=on \
    -device virtio-blk,id=hd0-device,drive=hd0,serial=hd0 \
    -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/5/raid/cd0-overlay0,cache.no-flush=on \
    -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on \
    -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 \
    -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/pool/5/raid/pflash-code-overlay0,unit=0,readonly=on \
    -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/pool/5/raid/pflash-vars-overlay0,unit=1

You'd want to drop the -device virtio-gpu-pci unless you actually want to try and get at the graphical output somehow, and probably drop some of the console outputs (like the virtio_console.log one). You'd also want to tweak the storage devices: os-autoinst creates overlay images for everything in special directories, so you can either create your own overlay images and point at those instead, or just point directly at some ISO image...you probably don't need a hard disk attached at all.
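As a very rough sketch only (the ISO and firmware paths here are placeholders, adjust for whatever is actually on the host, and as noted above the firmware may still not show anything on the serial console), a trimmed-down command might look something like:

# make a writable copy of the UEFI vars template for the firmware to use
cp /usr/share/edk2/aarch64/vars-template-pflash.raw /tmp/test-vars.raw

/usr/bin/qemu-system-aarch64 \
    -machine virt,gic-version=max -cpu host -smp 2 -m 3072 -enable-kvm \
    -nographic \
    -device virtio-scsi-pci,id=scsi0 \
    -blockdev driver=file,node-name=cd0-file,filename=/path/to/some.iso,read-only=on \
    -blockdev driver=raw,node-name=cd0,file=cd0-file \
    -device scsi-cd,id=cd0-device,drive=cd0 \
    -drive id=pflash-code,if=pflash,file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,unit=0,readonly=on \
    -drive id=pflash-vars,if=pflash,file=/tmp/test-vars.raw,unit=1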

"or make it try and boot one so I could watch?"

Yeah, we can do this. Running this command as root on openqa-stg01.qa should do the trick:

/usr/share/openqa/script/clone_job.pl --skip-download --from localhost 772923 WORKER_CLASS=aarch64-02-broken

That should clone the job in such a way that it's guaranteed to run on the broken worker. It'll give you the new job ID, and you can visit https://openqa.stg.fedoraproject.org/tests/(jobid) to view it in the web UI, and/or monitor it any way you can think of from the worker host itself.
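If you want to watch from the worker host itself, the live log for a running job lands in that worker instance's pool directory, so something like this (the pool number corresponds to whichever worker instance picked the job up) should let you follow along:

tail -f /var/lib/openqa/pool/*/autoinst-log.txt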

"Can I install the libvirt goo on it to see if that works?"

Installing it shouldn't hurt anything, but the trick would be actually running a VM through it and seeing what happens...
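If you do want to try it, something roughly like this ought to push a VM through the same qemu/edk2 path via libvirt (the name, ISO path, and os-variant here are just examples, not anything actually staged on the host):

virt-install --name aarch64-02-test --memory 3072 --vcpus 2 \
    --boot uefi --os-variant fedora31 \
    --cdrom /path/to/Fedora-Server-netinst-aarch64-31-1.9.iso \
    --disk size=10 --graphics none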

"And I suppose we could try f32 on the off chance?"

I mean, we could, I guess. It feels like an inelegant solution given the other two identical boxes have no issue on F31, though...

OK, so I installed virt-install / libvirt and did an install... it came up fine. ;(

So, are there any assets that are specific to this machine in openQA? Might it be using a corrupt disk image/overlay or cache of some kind that's separate from the ones used by the other workers? I guess that would have been cleaned up on reinstall, but is there anything on the server end that would have stuck around?

This system was removed from inventory and sent with another box to IAD2. When it gets there we will see if we can get it working again or swap it out for another system.

Metadata Update from @smooge:
- Issue close_status updated to: Will Not/Can Not fix
- Issue status updated to: Closed (was: Open)

4 years ago
