#10348 Build failures on Cascadelake virtualized systems (rawhide only)
Closed: Fixed 2 years ago by mizdebsk. Opened 2 years ago by jerboaa.

Describe what you would like us to do:


Our team is seeing x86_64-specific build failures for OpenJDK packages (8, 11, 17 - latest) when they get scheduled on a virtualized Cascadelake builder. The same package builds fine if scheduled on a builder which isn't virtualized. We are unable to reproduce the build failures locally, so we need help to analyze the problem further. The failures are also very strange, with very little detail in the logs. First question: was there a change in the rawhide builder setup at the end of October or in early November? We have been seeing these problems since early November and nothing has changed since.

hw_info shows this for builders where we see the issues:

CPU info:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          6
On-line CPU(s) list:             0-5
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       6
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel Xeon Processor (Cascadelake)
Stepping:                        6
CPU MHz:                         2095.076
BogoMIPS:                        4190.15
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        24 MiB
L3 cache:                        96 MiB
NUMA node0 CPU(s):               0-5
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities


Memory:
               total        used        free      shared  buff/cache   available
Mem:        15337816     1048260     1768176         812    12521380    13949200
Swap:       11534328       70400    11463928


Storage:
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda5       271G   55G  216G  21% /
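
To tell at a glance whether a given hw_info.log comes from one of the affected builders, something like the following could be used. This is only a rough sketch: it assumes the CPU section is lscpu-style "Key: value" text as shown above, and the function names `parse_lscpu` and `is_virtualized_cascadelake` are made up for illustration. The sample input is an abbreviated excerpt of the dump above, not a full log.

```python
def parse_lscpu(text):
    """Parse 'Key:   value' lines from lscpu-style output into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' or ';' stay intact.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


def is_virtualized_cascadelake(info):
    """Heuristic (an assumption, not a Koji feature): guest CPU reported as Cascadelake."""
    flags = info.get("Flags", "").split()
    virtualized = "hypervisor" in flags or "Hypervisor vendor" in info
    return virtualized and "cascadelake" in info.get("Model name", "").lower()


# Abbreviated excerpt of the hw_info output quoted in this ticket.
sample = """\
Model name:                      Intel Xeon Processor (Cascadelake)
Hypervisor vendor:               KVM
Flags:                           fpu vme avx512f hypervisor pdcm
"""

info = parse_lscpu(sample)
print(is_virtualized_cascadelake(info))
```

Section headers such as "CPU info:" have no value after the colon and are skipped, so the whole hw_info text can be passed in unmodified.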

Examples:

JDK 8 (same NVRA):
https://koji.fedoraproject.org/koji/buildinfo?buildID=1853322 (successful build)
https://koji.fedoraproject.org/koji/taskinfo?taskID=78767856 (failed build; eln)
https://koji.fedoraproject.org/koji/taskinfo?taskID=78504433 (failed build; f36)

JDK 11 (only failed builds):
https://koji.fedoraproject.org/koji/buildinfo?buildID=1853354
https://koji.fedoraproject.org/koji/buildinfo?buildID=1853314

JDK 17 (only failed builds):
https://koji.fedoraproject.org/koji/taskinfo?taskID=78757736
https://koji.fedoraproject.org/koji/taskinfo?taskID=78877368
https://koji.fedoraproject.org/koji/taskinfo?taskID=78836045

When do you need this to be done by?


As soon as possible.

Any help with this would be greatly appreciated!


Builders were updated on 9 November, see #10302
How can we help to investigate this issue? What information do you need?

Metadata Update from @mizdebsk:
- Issue assigned to mizdebsk
- Issue priority set to: Waiting on Reporter (was: Needs Review)
- Issue tagged with: koji

2 years ago

Thanks, as discussed offline I have the required info for the time being. Hopefully we'll be able to reproduce with that.

I'll close this issue - I provided Severin with initial information on the setup of Fedora Koji builders that is needed to reproduce the problem outside of Koji. We'll be in touch if more info or help from Fedora sysadmins is needed.

Metadata Update from @mizdebsk:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

@kevin @mizdebsk Could be. Do the hosts where the VMs run set kvm_intel nested=1, i.e. enable nested virtualization?

I'm guessing since we see the pdcm CPU flag in the hw_info.log[1] file that's a yes?

[1] https://kojipkgs.fedoraproject.org//work/tasks/4487/78504487/hw_info.log

Yes, the virtualization host enables nested virt:

[root@bvmhost-x86-06 ~][PROD-IAD2]# cat /etc/modprobe.d/kvm_intel.conf 
options kvm_intel nested=1
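
For anyone trying to replicate the host setup, the same check can be scripted. The following is a minimal sketch that only parses modprobe.d-style text like the line above; the helper names `module_options` and `nested_enabled` are hypothetical. Note that the config file only shows what was requested at module load time; the value actually in effect on a running host can be read from /sys/module/kvm_intel/parameters/nested.

```python
def module_options(conf_text, module):
    """Collect 'options <module> key=value ...' settings from modprobe.d-style text."""
    opts = {}
    for line in conf_text.splitlines():
        # Drop comments, then split the remainder into whitespace-separated words.
        parts = line.split("#", 1)[0].split()
        if len(parts) >= 3 and parts[0] == "options" and parts[1] == module:
            for item in parts[2:]:
                key, _, value = item.partition("=")
                opts[key] = value
    return opts


def nested_enabled(conf_text):
    """True if the text requests nested virtualization for kvm_intel."""
    value = module_options(conf_text, "kvm_intel").get("nested", "")
    return value.lower() in {"1", "y"}


# The exact line shown for bvmhost-x86-06 above.
print(nested_enabled("options kvm_intel nested=1\n"))
```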

For the record, I was able to replicate a setup where this reproduces. Thanks for your help!

How is this fixed??? It still fails:

https://koji.fedoraproject.org/koji/taskinfo?taskID=79443858
https://kojipkgs.fedoraproject.org/work/tasks/3858/79443858/hw_info.log

This issue isn't about fixing the build failure, but about getting help with the setup so that we can reproduce it. That part was successful; finding the root cause, not so much :-/

You can add your thoughts on:
https://bugzilla.redhat.com/show_bug.cgi?id=2026399

I don't really have any thoughts, other than Fedora need to fix their broken build hardware. OpenJDK was building fine until they moved it to this weird setup and I don't see this as an OpenJDK issue.

I've already wasted enough time on this and will just concentrate on RHEL and F35 while rawhide remains unusable.

I thought builds for F34, F35 and rawhide are built on the same hardware so I don't know what 'this weird setup' is.

I've added information to the bug... some upcoming changes we are planning might affect this and I'm happy to gather more info for rhel qemu/virt maintainers.

The only thing that has changed in our setup is the update of our hypervisors from RHEL 8.4 to 8.5; we have made no radical changes and introduced no weird setups.

I don't know how else to characterise what we're seeing here. We have one build of java-1.8.0-openjdk on non-virtualised hardware that completes fine on x86_64:

https://koji.fedoraproject.org/koji/buildinfo?buildID=1853322
https://kojipkgs.fedoraproject.org//packages/java-1.8.0-openjdk/1.8.0.312.b07/2.fc36/data/logs/x86_64/hw_info.log

I just ran a scratch build of the exact same package and it fails:

https://koji.fedoraproject.org/koji/taskinfo?taskID=79491184
https://kojipkgs.fedoraproject.org//work/tasks/1184/79491184/hw_info.log

when it is allocated this virtualised Cascadelake hardware. We see similar results with java-11-openjdk and java-latest-openjdk, but have never been lucky enough to have one of those get the "good" hardware and pass.

The failures are incomprehensible. We see odd failures of basic text processing tools and presumably file copies, because libjvm.so is created but not then copied to the location make expects. I've tried running a serial build and setting LANG, but to no avail.

I can't see how it is anything in OpenJDK itself when basic facilities of the build environment seem to be unreliable and the same package builds fine in other environments. All our builds were fine until about a month ago, when we started being allocated this hardware and seeing failures on every build. It's frustrating and blocking us from updating OpenJDK in rawhide.

f35 builds on this same Cascadelake hardware do seem to pass:

https://koji.fedoraproject.org/koji/buildinfo?buildID=1861147

so maybe it is something to do with this hardware and rawhide.

I share your frustration, but alas I don't magically know what's going on. ;(

I wonder if this could be some glibc issue in rawhide? Anyhow, let's move discussion to the bug and hopefully we can work together to figure it out.
