#6653 Aarch64 build failure for package "root"
Closed: Fixed 7 years ago Opened 7 years ago by ellert.

I don't really know where to report this, since I have difficulty understanding what the underlying cause is.

I am trying to get one of my packages that failed the F26 rebuild to compile. There were a few minor issues, such as missing headers, and I now have a version that builds on all architectures except aarch64.

The package build does not fail during compilation. It fails when the build tries to execute the newly built software, which results in a segfault. I am not sure what would cause this. There is a new compiler version and a new glibc, there are changes to the default compiler flags, and a lot of other things could make a difference. And I have no way to debug this, since I don't have access to an aarch64 system.

What I find really confusing is that if I try to build on EPEL 7 I see the same failure as on rawhide for the aarch64 build. This is confusing since the previous EPEL 7 build succeeded. And in this case the new build uses the same gcc, the same glibc and the same kernel-headers as the previous successful build. So I really see no reason for it to fail. I can not see anything that changed between the builds that would cause the previously working build to now fail in the EPEL 7 case. And since the failure in the rawhide build is the same my guess is that the cause is the same.

Previous successful F26 build (23 January)
root-6.08.04-1.fc26 (including successful aarch64 build)
https://koji.fedoraproject.org/koji/buildinfo?buildID=835310

Previous successful EPEL 7 build (15 January)
root-6.08.04-1.el7 (including successful aarch64 build)
https://koji.fedoraproject.org/koji/buildinfo?buildID=833913

F26 scratch build (21 February)
Fails for aarch64
https://koji.fedoraproject.org/koji/taskinfo?taskID=17975998

EPEL 7 scratch build (21 February)
Fails for aarch64
https://koji.fedoraproject.org/koji/taskinfo?taskID=17972237

That the EPEL 7 build now fails does not make sense to me. And while the rawhide build could possibly fail for a number of different reasons, the failure is the same as in the EPEL 7 build and therefore possibly due to the same issue.

Has anything changed in koji or the aarch64 build servers since the successful builds that could explain this?


The aarch64 build hosts changed from RHEL to Fedora 25 between these two builds.

It may or may not be the culprit, but it's a starting point.

I also noted the following stack trace in the middle of the build logs:

===========================================================
The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum.
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#6 0x0000ffff995d0308 in ?? ()
#7 0x0000ffffa0a998b0 in cling::Interpreter::executeTransaction (this=0xffff995d028c, T=...) at /builddir/build/BUILD/root-6.08.04/interpreter/cling/lib/Interpreter/Interpreter.cpp:1192
#8 0x0000ffffa0af35a0 in cling::IncrementalParser::commitTransaction (this=0xffffa280f000, PRT=...) at /builddir/build/BUILD/root-6.08.04/interpreter/cling/lib/Interpreter/IncrementalParser.cpp:457
#9 0x0000ffffa0a9c22c in cling::Interpreter::Interpreter (this=0x19fef900, argc=<optimized out>, argv=<optimized out>, llvmdir=<optimized out>, noRuntime=24, parentInterp=0x0) at /builddir/build/BUILD/root-6.08.04/interpreter/cling/lib/Interpreter/Interpreter.cpp:203
#10 0x0000ffffa0a20088 in Interpreter (noRuntime=false, llvmdir=0x19fef8b0 "/builddir/build/BUILD/root-6.08.04/builddir/etc/cling", argv=0x19fd7f80, argc=8, this=0x19fef900) at /builddir/build/BUILD/root-6.08.04/interpreter/cling/include/cling/Interpreter/Interpreter.h:303
#11 TCling::TCling (this=0x19fef0e0, name=<optimized out>, title=<optimized out>) at /builddir/build/BUILD/root-6.08.04/core/meta/src/TCling.cxx:1115
#12 0x0000ffffa0a20fd8 in CreateInterpreter (interpLibHandle=<optimized out>) at /builddir/build/BUILD/root-6.08.04/core/meta/src/TCling.cxx:592
#13 0x0000ffffa30ee004 in TROOT::InitInterpreter (this=0xffffa3358c80 <ROOT::Internal::GetROOT1()::alloc>) at /builddir/build/BUILD/root-6.08.04/core/base/src/TROOT.cxx:1924
#14 0x0000ffffa30ee3a4 in ROOT::Internal::GetROOT2 () at /builddir/build/BUILD/root-6.08.04/core/base/src/TROOT.cxx:364
#15 0x0000ffffa30ea90c in ROOT::GetROOT () at /builddir/build/BUILD/root-6.08.04/core/base/src/TROOT.cxx:488
#16 0x0000ffffa31623f0 in TApplication::TApplication (this=0x19fd7df0, appClassName=0x401540 "Rint", argc=0xfffff8ad70cc, argv=0xfffff8ad7218, numOptions=0) at /builddir/build/BUILD/root-6.08.04/core/base/src/TApplication.cxx:152
#17 0x0000ffffa33a1248 in TRint::TRint (this=0x19fd7df0, appClassName=<optimized out>, argc=<optimized out>, argv=<optimized out>, options=<optimized out>, numOptions=<optimized out>, noLogo=false) at /builddir/build/BUILD/root-6.08.04/core/rint/src/TRint.cxx:147
#18 0x0000000000401208 in main (argc=9, argv=<optimized out>) at /builddir/build/BUILD/root-6.08.04/main/src/rmain.cxx:27
===========================================================

We've never used RHEL for aarch64 builders, so I doubt that is the case. In fact, even in mainline x86 and friends we've not used RHEL since prior to F-20.

We may have moved from F-24 based build VMs to F-25 in that time. TBH I don't remember the date we moved over, but that hasn't affected any other package in Fedora that I'm aware of.

Could this be related to the change to 48 bit VA space?
https://fedoraproject.org/wiki/Changes/aarch64-48bitVA
Do the build machines run a 48 bit VA kernel?
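
One rough way to check this without access to an aarch64 machine is to look at the addresses in the stack trace posted above and count how many bits they need. This is only a sketch: it assumes the frame addresses copied from the trace are representative user-space addresses, and the helper name `bits_needed` is made up for illustration. An address that needs 48 bits would not fit in a smaller VA configuration such as 39 or 42 bits.

```python
# Frame addresses copied from the stack trace earlier in this ticket
# (frames #6, #7, #13 and #18).
TRACE_ADDRESSES = [
    0x0000ffff995d0308,
    0x0000ffffa0a998b0,
    0x0000ffffa30ee004,
    0x0000000000401208,
]


def bits_needed(addr: int) -> int:
    """Hypothetical helper: number of bits needed to represent an address."""
    return addr.bit_length()


# The widest address gives a lower bound on the VA size the process ran with.
widest = max(bits_needed(a) for a in TRACE_ADDRESSES)
print(f"widest address in the trace needs {widest} bits")
```

This only gives a lower bound on the VA size, and it says nothing about whether the VA size is what triggers the crash.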

The Fedora aarch64 kernels have had 48 bit VA for some time; this isn't new to F-25. The feature was to ensure all the things were covered. The main things that I believe were affected by that were mozjs (in its various versions) and luajit.

It's possible that might be the case. It could also be the 64K page size that we currently enable in the kernel (that might change before release), but again that's not a new feature in F-25 (I think we enabled that around F-22).
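
For anyone who does have access to a builder or another aarch64 machine, the page size the running kernel reports can be checked with a few lines of Python. This is a minimal sketch using the standard `os.sysconf` call; the function name `page_size_bytes` is made up for illustration, and the result describes only the machine the script runs on, not the Koji builders in general.

```python
import os


def page_size_bytes() -> int:
    """Return the memory page size the running kernel reports, in bytes."""
    return os.sysconf("SC_PAGE_SIZE")


size = page_size_bytes()
print(f"page size: {size} bytes ({size // 1024} KiB)")

# 64 KiB pages are what is being discussed here; 4 KiB is the common
# default elsewhere.
if size == 64 * 1024:
    print("64K pages: code that assumes 4K pages may misbehave here")
```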

I seem to remember that root bundles a forked copy of llvm. I'm not sure what version that is, but you might want to make sure that it has any appropriate arch fixes backported.

At the time there were no aarch64-related patches in the llvm package. However, in an update of the llvm package this week (llvm-3.9.1-4), a patch addressing relocation issues on aarch64 was added. Using this patch in the root package resolves this issue (root-6.08.06-2).

For reference, I added that patch for rhbz 1429050. Good news that it fixes this issue, though.

Metadata Update from @pbrobinson:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

7 years ago
