Our team is seeing x86_64-specific build failures for OpenJDK packages (8, 11, 17 - latest) whenever they get scheduled on a virtualized Cascadelake builder. The same package builds fine if it is scheduled on a builder that isn't virtualized. We are unable to reproduce the build failures locally, so we need help to analyze the problem further. The failures are also very strange, with very little detail in the logs. First question: was there a change in the rawhide builder setup at the end of October or in early November? We have been seeing these problems since early November and nothing has changed since.
hw_info shows this for builders where we see the issues:
CPU info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 6
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel Xeon Processor (Cascadelake)
Stepping: 6
CPU MHz: 2095.076
BogoMIPS: 4190.15
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 24 MiB
L3 cache: 96 MiB
NUMA node0 CPU(s): 0-5
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities

Memory:
total used free shared buff/cache available
Mem: 15337816 1048260 1768176 812 12521380 13949200
Swap: 11534328 70400 11463928

Storage:
Filesystem Size Used Avail Use% Mounted on
/dev/vda5 271G 55G 216G 21% /
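For anyone triaging which builder a task landed on, a minimal sketch of how the CPU section of a downloaded hw_info.log could be checked for virtualization. It assumes the log carries one `Key: value` field per line, as lscpu prints it; the function names `parse_lscpu` and `is_virtualized` and the shortened sample text are made up for illustration and are not part of Koji.

```python
def parse_lscpu(text):
    """Parse 'Key: value' lines (lscpu-style) into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # "Mitigation; TSX disabled" are kept intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_virtualized(fields):
    """Treat a 'Hypervisor vendor' field or the 'hypervisor' CPU flag as a sign of a guest."""
    flags = fields.get("Flags", "").split()
    return "Hypervisor vendor" in fields or "hypervisor" in flags


# Shortened, illustrative sample; a real hw_info.log has many more fields.
sample = """\
Model name: Intel Xeon Processor (Cascadelake)
Hypervisor vendor: KVM
Flags: fpu vme hypervisor avx512f
"""

fields = parse_lscpu(sample)
print(fields["Model name"], "virtualized:", is_virtualized(fields))
```

Running the same check over the hw_info.log of a passing and a failing task is a quick way to confirm that the failures line up with the virtualized builders.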
Examples:
JDK 8 (same NVRA):
https://koji.fedoraproject.org/koji/buildinfo?buildID=1853322 (successful build)
https://koji.fedoraproject.org/koji/taskinfo?taskID=78767856 (failed build; eln)
https://koji.fedoraproject.org/koji/taskinfo?taskID=78504433 (failed build; f36)
JDK 11 (only failed builds):
https://koji.fedoraproject.org/koji/buildinfo?buildID=1853354
https://koji.fedoraproject.org/koji/buildinfo?buildID=1853314
JDK 17 (only failed builds):
https://koji.fedoraproject.org/koji/taskinfo?taskID=78757736
https://koji.fedoraproject.org/koji/taskinfo?taskID=78877368
https://koji.fedoraproject.org/koji/taskinfo?taskID=78836045
As soon as possible.
Any help with this would be greatly appreciated!
Builders were updated on 9 November, see #10302. How can we help to investigate this issue? What information do you need?
Metadata Update from @mizdebsk: - Issue assigned to mizdebsk - Issue priority set to: Waiting on Reporter (was: Needs Review) - Issue tagged with: koji
Thanks, as discussed offline I have the required info for the time being. Hopefully we'll be able to reproduce with that.
I'll close this issue - I provided Severin with initial information on the setup of Fedora Koji builders that is needed to reproduce the problem outside of Koji. We'll be in touch if more info or help from Fedora sysadmins is needed.
Metadata Update from @mizdebsk: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Is this possibly related to/the same as: https://bugzilla.redhat.com/show_bug.cgi?id=2022075 ?
@kevin @mizdebsk Could be. Do the hosts where the VMs run set kvm_intel nested=1, i.e. enable nested virtualization?
I'm guessing since we see the pdcm CPU flag in the hw_info.log[1] file that's a yes?
[1] https://kojipkgs.fedoraproject.org//work/tasks/4487/78504487/hw_info.log
Nested virtualization docs: https://docs.fedoraproject.org/en-US/quick-docs/using-nested-virtualization-in-kvm/
Yes, the virtualization host enables nested virt:
[root@bvmhost-x86-06 ~][PROD-IAD2]# cat /etc/modprobe.d/kvm_intel.conf
options kvm_intel nested=1
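For anyone trying to replicate this host setup, a hedged sketch of how the nested setting could be checked from a script. It reads the same `options kvm_intel nested=...` line shown above and, separately, the value the loaded module exposes under /sys/module/kvm_intel/parameters/nested. The helper names are made up for illustration, and treating "1" and "Y" as enabled is an assumption that covers the spellings different kernel versions report.

```python
from pathlib import Path


def nested_from_modprobe(conf_text, module="kvm_intel"):
    """Return the nested= value configured for `module`, or None if it is not set."""
    value = None
    for line in conf_text.splitlines():
        # Drop comments, then look for "options <module> name=value ...".
        words = line.split("#", 1)[0].split()
        if len(words) >= 3 and words[0] == "options" and words[1] == module:
            for opt in words[2:]:
                name, _, val = opt.partition("=")
                if name == "nested":
                    value = val  # a later line overrides an earlier one
    return value


def nested_is_enabled(value):
    """Assumption: '1' or 'Y' (any case) means nested virtualization is on."""
    return value is not None and value.strip().upper() in {"1", "Y"}


def nested_at_runtime(module="kvm_intel"):
    """Return what the loaded module reports, or None if it is not loaded."""
    path = Path("/sys/module") / module / "parameters" / "nested"
    return path.read_text().strip() if path.exists() else None


conf = "options kvm_intel nested=1\n"
print("configured:", nested_is_enabled(nested_from_modprobe(conf)))
print("runtime value:", nested_at_runtime())
```

The config file only says what will be applied when the module is loaded, so the runtime value is the one to trust on a host that has not been rebooted since the file changed.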
For the record, I was able to replicate a setup where this reproduces. Thanks for your help!
How is this fixed??? It still fails:
https://koji.fedoraproject.org/koji/taskinfo?taskID=79443858
https://kojipkgs.fedoraproject.org/work/tasks/3858/79443858/hw_info.log
This issue isn't about fixing the build failure, but about getting help with the builder setup so that we can reproduce it. That part was successful; finding the root cause, not so much :-/
You can add your thoughts on: https://bugzilla.redhat.com/show_bug.cgi?id=2026399
I don't really have any thoughts, other than Fedora need to fix their broken build hardware. OpenJDK was building fine until they moved it to this weird setup and I don't see this as an OpenJDK issue.
I've already wasted enough time on this and will just concentrate on RHEL and F35 while rawhide remains unusable.
I thought builds for F34, F35 and rawhide are built on the same hardware so I don't know what 'this weird setup' is.
I've added information to the bug... some upcoming changes we are planning might affect this and I'm happy to gather more info for rhel qemu/virt maintainers.
The only thing that has changed in our setup is that we updated our hypervisors from RHEL 8.4 to 8.5; we have made no radical changes or weird setups.
I don't know how else to characterise what we're seeing here. We have one build of java-1.8.0-openjdk on non-virtualised hardware that completes fine on x86_64:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1853322
https://kojipkgs.fedoraproject.org//packages/java-1.8.0-openjdk/1.8.0.312.b07/2.fc36/data/logs/x86_64/hw_info.log
I just ran a scratch build of the exact same package and it fails:
https://koji.fedoraproject.org/koji/taskinfo?taskID=79491184
https://kojipkgs.fedoraproject.org//work/tasks/1184/79491184/hw_info.log
when it is allocated this virtualised Cascadelake hardware. We see similar failures with java-11-openjdk and java-latest-openjdk, but have never been lucky enough to have one of those get the "good" hardware and pass.
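One way to make the comparison between the two hw_info.log files concrete is to diff their CPU flag sets. The following is only a sketch: it assumes each log has a single `Flags:` line as lscpu prints it, the helper names are invented for illustration, and the two sample strings are made-up fragments rather than the contents of the logs linked above.

```python
def cpu_flags(hw_info_text):
    """Return the set of CPU flags from the first 'Flags:' line, or an empty set."""
    for line in hw_info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Flags":
            return set(value.split())
    return set()


def flag_diff(good_text, bad_text):
    """Return (flags only on the failing builder, flags only on the passing builder)."""
    good, bad = cpu_flags(good_text), cpu_flags(bad_text)
    return sorted(bad - good), sorted(good - bad)


# Made-up fragments for illustration only.
good_log = "Model name: example bare-metal CPU\nFlags: fpu sse2 avx2\n"
bad_log = "Hypervisor vendor: KVM\nFlags: fpu sse2 avx2 avx512f hypervisor\n"

only_bad, only_good = flag_diff(good_log, bad_log)
print("only on failing builder:", only_bad)
print("only on passing builder:", only_good)
```

A flag that shows up only on the failing builders is not proof of a cause, but it narrows down which instruction-set paths are worth testing in a local reproducer.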
The failures are incomprehensible. We see odd failures of basic text processing tools and presumably file copies, because libjvm.so is created but not then copied to the location make expects. I've tried running a serial build and setting LANG, but to no avail.
I can't see how it is anything in OpenJDK itself when basic facilities of the build environment seem to be unreliable and the same package builds fine in other environments. All our builds were fine until about a month ago, when we started being allocated this hardware and seeing failures on every build. It's frustrating and blocking us from updating OpenJDK in rawhide.
f35 builds on this same Cascadelake hardware do seem to pass:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1861147
so maybe it is something to do with this hardware and rawhide.
I share your frustration, but alas I don't magically know what's going on. ;(
I wonder if this could be some glibc issue in rawhide? Anyhow, let's move the discussion to the bug and hopefully we can work together to figure it out.