#7703 ppc builders seem broken: KVM implementation does not support Transactional Memory
Closed: Fixed 9 months ago Opened 10 months ago by dustymabe.

  • Describe the issue

koji image build tasks are failing with "qemu-system-ppc64: KVM implementation does not support Transactional Memory, try cap-htm=off".

See:
- https://pagure.io/dusty/failed-composes/issue/673
- https://pagure.io/dusty/failed-composes/issue/674
- https://pagure.io/dusty/failed-composes/issue/675

  • When do you need this? (YYYY/MM/DD)

soon

  • When is this no longer needed or useful? (YYYY/MM/DD)

never

  • If we cannot complete your request, what is the impact?

ppc users sad


So, I am not sure what exactly is causing this. I did see that both of the affected builders were running some old, stuck compose VMs, so perhaps it was a resource issue?

In any case, I updated them, killed the old compose jobs and rebooted them.

If that doesn't fix it, at least we have a recent kernel/base to start debugging with.

@kevin, what's the hypervisor on the builders? RHV?

The builders are running as KVM VMs on a host installed with rhel7-alt.

I have not yet tried to update the host kernel/libvirt/qemu stack. Perhaps that's worth a shot?

I have now rebooted them with 'ppc_tm=off' to see if that helps any...

The things that come to mind:
- issue with Transactional Memory on host or guest
- regression in nested virt on either RHEL-Alt or Fedora (could be kernel/libvirt/qemu)

It would be useful to know if there was any patching of either the host or the build VM around the time it started to fail.

Yes, knowing the history of updates (host, builder VM) would be helpful. Handling HTM is a new feature in libvirt 4.6.0.

disabling HTM from the bottom up might be a good workaround

This is still breaking basically all of our nightly composes for container/cloud/atomic. Can we look closer at it?

The builders are Fedora 28. This started happening on the 25th.

On the 25th, they applied the following security updates:

qemu-kvm-2.11.2-2.fc28.ppc64le                Sat 25 Aug 2018 04:46:16 AM UTC
qemu-2.11.2-2.fc28.ppc64le                    Sat 25 Aug 2018 04:46:16 AM UTC
kernel-modules-4.17.17-200.fc28.ppc64le       Sat 25 Aug 2018 04:46:11 AM UTC
kernel-core-4.17.17-200.fc28.ppc64le          Sat 25 Aug 2018 04:46:09 AM UTC
qemu-system-xtensa-core-2.11.2-2.fc28.ppc64le Sat 25 Aug 2018 04:46:05 AM UTC
qemu-system-xtensa-2.11.2-2.fc28.ppc64le      Sat 25 Aug 2018 04:46:05 AM UTC
qemu-img-2.11.2-2.fc28.ppc64le                Sat 25 Aug 2018 04:46:05 AM UTC
qemu-system-x86-core-2.11.2-2.fc28.ppc64le    Sat 25 Aug 2018 04:46:04 AM UTC
qemu-system-x86-2.11.2-2.fc28.ppc64le         Sat 25 Aug 2018 04:46:04 AM UTC
qemu-system-unicore32-core-2.11.2-2.fc28.ppc64le Sat 25 Aug 2018 04:46:03 AM UTC
qemu-system-unicore32-2.11.2-2.fc28.ppc64le   Sat 25 Aug 2018 04:46:03 AM UTC
qemu-system-tricore-core-2.11.2-2.fc28.ppc64le Sat 25 Aug 2018 04:46:02 AM UTC
qemu-system-tricore-2.11.2-2.fc28.ppc64le     Sat 25 Aug 2018 04:46:02 AM UTC
qemu-system-sparc-core-2.11.2-2.fc28.ppc64le  Sat 25 Aug 2018 04:46:02 AM UTC
qemu-system-sparc-2.11.2-2.fc28.ppc64le       Sat 25 Aug 2018 04:46:02 AM UTC
qemu-system-sh4-core-2.11.2-2.fc28.ppc64le    Sat 25 Aug 2018 04:46:01 AM UTC
qemu-system-sh4-2.11.2-2.fc28.ppc64le         Sat 25 Aug 2018 04:46:01 AM UTC
qemu-system-s390x-core-2.11.2-2.fc28.ppc64le  Sat 25 Aug 2018 04:46:00 AM UTC
qemu-system-s390x-2.11.2-2.fc28.ppc64le       Sat 25 Aug 2018 04:46:00 AM UTC
qemu-system-ppc-core-2.11.2-2.fc28.ppc64le    Sat 25 Aug 2018 04:45:59 AM UTC
qemu-system-ppc-2.11.2-2.fc28.ppc64le         Sat 25 Aug 2018 04:45:59 AM UTC
qemu-system-or1k-2.11.2-2.fc28.ppc64le        Sat 25 Aug 2018 04:45:58 AM UTC
qemu-system-or1k-core-2.11.2-2.fc28.ppc64le   Sat 25 Aug 2018 04:45:57 AM UTC
qemu-system-nios2-core-2.11.2-2.fc28.ppc64le  Sat 25 Aug 2018 04:45:57 AM UTC
qemu-system-nios2-2.11.2-2.fc28.ppc64le       Sat 25 Aug 2018 04:45:57 AM UTC
qemu-system-moxie-2.11.2-2.fc28.ppc64le       Sat 25 Aug 2018 04:45:57 AM UTC
qemu-system-moxie-core-2.11.2-2.fc28.ppc64le  Sat 25 Aug 2018 04:45:56 AM UTC
qemu-system-mips-core-2.11.2-2.fc28.ppc64le   Sat 25 Aug 2018 04:45:56 AM UTC
qemu-system-mips-2.11.2-2.fc28.ppc64le        Sat 25 Aug 2018 04:45:56 AM UTC
qemu-system-microblaze-core-2.11.2-2.fc28.ppc64le Sat 25 Aug 2018 04:45:54 AM UTC
qemu-system-microblaze-2.11.2-2.fc28.ppc64le  Sat 25 Aug 2018 04:45:54 AM UTC
qemu-system-m68k-core-2.11.2-2.fc28.ppc64le   Sat 25 Aug 2018 04:45:53 AM UTC
qemu-system-m68k-2.11.2-2.fc28.ppc64le        Sat 25 Aug 2018 04:45:53 AM UTC
qemu-system-lm32-core-2.11.2-2.fc28.ppc64le   Sat 25 Aug 2018 04:45:52 AM UTC
qemu-system-lm32-2.11.2-2.fc28.ppc64le        Sat 25 Aug 2018 04:45:52 AM UTC
qemu-system-cris-core-2.11.2-2.fc28.ppc64le   Sat 25 Aug 2018 04:45:52 AM UTC
qemu-system-cris-2.11.2-2.fc28.ppc64le        Sat 25 Aug 2018 04:45:52 AM UTC
qemu-system-arm-core-2.11.2-2.fc28.ppc64le    Sat 25 Aug 2018 04:45:51 AM UTC
qemu-system-arm-2.11.2-2.fc28.ppc64le         Sat 25 Aug 2018 04:45:51 AM UTC
qemu-system-alpha-core-2.11.2-2.fc28.ppc64le  Sat 25 Aug 2018 04:45:50 AM UTC
qemu-system-alpha-2.11.2-2.fc28.ppc64le       Sat 25 Aug 2018 04:45:50 AM UTC
qemu-system-aarch64-2.11.2-2.fc28.ppc64le     Sat 25 Aug 2018 04:45:50 AM UTC
qemu-user-2.11.2-2.fc28.ppc64le               Sat 25 Aug 2018 04:45:49 AM UTC
qemu-system-aarch64-core-2.11.2-2.fc28.ppc64le Sat 25 Aug 2018 04:45:49 AM UTC
qemu-block-ssh-2.11.2-2.fc28.ppc64le          Sat 25 Aug 2018 04:45:46 AM UTC
qemu-block-rbd-2.11.2-2.fc28.ppc64le          Sat 25 Aug 2018 04:45:45 AM UTC
qemu-block-nfs-2.11.2-2.fc28.ppc64le          Sat 25 Aug 2018 04:45:45 AM UTC
qemu-block-iscsi-2.11.2-2.fc28.ppc64le        Sat 25 Aug 2018 04:45:45 AM UTC
qemu-block-gluster-2.11.2-2.fc28.ppc64le      Sat 25 Aug 2018 04:45:45 AM UTC
qemu-block-dmg-2.11.2-2.fc28.ppc64le          Sat 25 Aug 2018 04:45:45 AM UTC
qemu-common-2.11.2-2.fc28.ppc64le             Sat 25 Aug 2018 04:45:44 AM UTC
qemu-block-curl-2.11.2-2.fc28.ppc64le         Sat 25 Aug 2018 04:45:44 AM UTC

This led me to:

https://bugzilla.redhat.com/show_bug.cgi?id=1614948

which I am pretty sure is this same bug... they are now looking at reverting it there.
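For anyone retracing this kind of "what changed on the 25th?" question: below is a minimal sketch for filtering an update history by date. It assumes the list above came from something like `rpm -qa --last` and that timestamps use the same English-locale format shown there; the function names are made up for illustration.

```python
from datetime import date, datetime


def parse_rpm_last(line):
    """Split one 'package  timestamp' line into (package, datetime)."""
    name, stamp = line.split(None, 1)
    # Example stamp: "Sat 25 Aug 2018 04:46:16 AM UTC" (English locale assumed)
    when = datetime.strptime(stamp.strip(), "%a %d %b %Y %I:%M:%S %p UTC")
    return name, when


def installed_on(lines, day):
    """Return the packages whose install timestamp falls on the given day."""
    hits = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the pasted output
        name, when = parse_rpm_last(line)
        if when.date() == day:
            hits.append(name)
    return hits


if __name__ == "__main__":
    sample = [
        "qemu-kvm-2.11.2-2.fc28.ppc64le                Sat 25 Aug 2018 04:46:16 AM UTC",
        "kernel-core-4.17.17-200.fc28.ppc64le          Sat 25 Aug 2018 04:46:09 AM UTC",
    ]
    for pkg in installed_on(sample, date(2018, 8, 25)):
        print(pkg)
```

Comparing the resulting list against the date the composes started failing narrows the suspects to the packages updated that day.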

Good catch, it looks like the culprit to me.

I suppose the question is: do we downgrade qemu on just the builders that build images/containers, so the fix can settle while we get beta freeze out of the way?

@sayanchowdhury FYI: we need to monitor the status of this ticket. Currently we can't create ppc64le cloud images so we won't be able to release FAH next week. @kevin, we'll monitor this ticket, but let us know if we can help you with anything.

cc @sinnykumari.

I've upgraded the builders to the 'fixed' qemu... so this should be fixed now.

There still seems to be an issue (a different one) after upgrading the F28 builders to qemu-2.11.2-3.fc28. Image builds have been failing since the 20180831.n.0 composes.
https://pagure.io/dusty/failed-composes/issue/730
https://pagure.io/dusty/failed-composes/issue/733

I can reproduce the problem locally but haven't found the cause yet.

I see this in the logs:

2018-08-31T06:15:45.899543Z qemu-system-ppc64: warning: Failed to set KVM's VSMT mode to 8 (errno -22)
2018-08-31T06:15:45.979785Z qemu-system-ppc64: KVM implementation does not support Transactional Memory, try cap-htm=off

I think these log messages are from the qemu that introduced the bug. qemu-2.11.2-3.fc28 was built to fix this issue and is available under the f28-infra tag, and I believe @kevin upgraded the power builders to qemu-2.11.2-3.fc28.

In today's composes, e.g. f28-container, we now see a different error:

ApplianceError: Image status is FAILED: guestfs_launch failed.
This usually means the libguestfs appliance failed to start or crashed.
Do:
  export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1
and run the command again.

Any more ideas? Unfortunately, this is going to block FAH releases if we don't get it fixed before next week.

@kevin could we downgrade as @pbrobinson suggested?

I am not sure that downgrading the qemu version is going to help much. I did a fresh install of F28 on a ppc64le VM and installed the imagebuild dependencies from the F28 stable repo. I ran imagefactory and it worked fine. I then updated only the qemu packages to 2.11.2-2 and ran imagefactory again. I got the Transactional Memory issue, as expected. I then updated only the qemu packages to 2.11.2-3 and ran imagefactory. It worked fine.

So I suspect that the current error is coming from some other package update. I haven't had time to update the rest of the packages individually to the latest versions to see which one causes the issue. I might try this over the weekend; if not, then on Monday.

I may be wrong, so feel free to downgrade qemu to the last working version and see if it helps.
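Updating packages one at a time works, but with a long update list bisection usually gets there in fewer builds. A minimal sketch follows, assuming exactly one update triggers the failure and that it fails whenever that update is applied. `build_fails` is a hypothetical callback (not part of any tool named in this ticket) that would apply only the given updates to a clean F28 VM, run the image build, and return True on failure.

```python
def find_culprit(updates, build_fails):
    """Binary-search a list of package updates for the single bad one.

    updates     -- candidate package names, one of which breaks the build
    build_fails -- hypothetical callback: takes a list of updates to apply
                   on a clean system and returns True if the build fails
    """
    candidates = list(updates)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if build_fails(half):
            # The culprit is among the updates we just applied.
            candidates = half
        else:
            # The build passed, so the culprit is in the other half.
            candidates = candidates[len(half):]
    return candidates[0]


if __name__ == "__main__":
    # Stand-in for a real build: pretend one package is the bad update.
    pretend_bad = "pkg-c"
    names = ["pkg-a", "pkg-b", "pkg-c", "pkg-d", "pkg-e"]
    print(find_culprit(names, lambda applied: pretend_bad in applied))
```

Each round costs one image build and halves the candidate list. The single-culprit assumption matters: if two updates only break things in combination, this search can point at the wrong package.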

yeah, just downgrading qemu doesn't do it. ;(

I'm trying to isolate the problem package, but it's pretty daunting. ;(

I can't seem to isolate it. ;(

When everything is updated I do see in the libvirt logs:

2018-09-01T03:20:46.049548Z qemu-system-ppc64: warning: Failed to set KVM's VSMT mode to 8 (errno -22)
2018-09-01T03:20:49.443963Z qemu-system-ppc64: error creating device tree: (fdt_begin_node(fdt_skel, "")): FDT_ERR_BADSTATE
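When the libvirt logs get long, it helps to separate the warnings (the VSMT line shows up in both the failing and the earlier logs) from the messages that actually stop the guest. A rough sketch is below; it assumes every line of interest looks like the two above, that is `<timestamp> qemu-system-<arch>: [warning: ]<message>`. The helper name is made up.

```python
import re

# Assumed line shape: "<timestamp> qemu-system-<arch>: [warning: ]<message>"
LINE = re.compile(
    r"^(?P<ts>\S+) (?P<prog>qemu-system-\S+): (?P<warn>warning: )?(?P<msg>.*)$"
)


def non_warning_messages(log_text):
    """Return qemu messages that are not prefixed with 'warning:'."""
    messages = []
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match and not match.group("warn"):
            messages.append(match.group("msg"))
    return messages


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < some-guest.log
    for msg in non_warning_messages(sys.stdin.read()):
        print(msg)
```

Run against the two lines above, this keeps only the `error creating device tree` message, which is the one that changed between the working and broken setups.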

https://bugzilla.redhat.com/show_bug.cgi?id=1624539

Found the package which might be causing the issue: it is the libfdt package update to 1.4.7-1. I downgraded it to 1.4.6-5 and the problem seems to be gone.

The libfdt package is part of dtc. Maybe let's try again after downgrading to libfdt-1.4.6-5 on the power image builders?

The image build succeeded for me locally with the downgraded libfdt-1.4.6-5 and the latest qemu-2.11.2-3.

Found the package which might be causing the issue

:heart:

Interesting, I wouldn't have thought of that. It would be good to get the problem reported to the maintainer upstream, and then I can pull in a fix.

You rock @sinnykumari ! That does indeed seem to be it. ;)

I'll move my bug to that package...and I have downgraded the image builders.

@sinnykumari You are awesome.

Thanks for figuring it out.

Closing this ticket.

Metadata Update from @mohanboddu:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

9 months ago