#10544 Unhealthy armv7hl builder
Closed: Fixed 2 years ago by kevin. Opened 2 years ago by catanzaro.

Describe what you would like us to do:

I have a F35 WebKitGTK security update https://koji.fedoraproject.org/koji/taskinfo?taskID=82578306 that's been building for over 13 hours. All architectures except armv7hl finished quickly, but armv7hl is still less than halfway done building after 13 hours of progress. Currently it has built 2253 build targets out of 4933 total, but the first 1000-1500 or thereabouts are file copies that only take a few seconds, so my guess is it's really only 1/4 to 1/3 of the way done. I'm not sure if the build will eventually finish, or if it's going to die to a timeout.
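For what it's worth, the "1/4 to 1/3" guess above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes the trivial file-copy targets all come first and that every remaining target costs about the same, neither of which is exactly true for a real WebKitGTK build. The helper name `real_progress` is made up for illustration.

```python
def real_progress(done: int, total: int, trivial: int) -> float:
    """Fraction of the non-trivial targets that are finished.

    Assumes the first `trivial` targets are quick file copies that are
    already done and should not count toward real progress.
    """
    return (done - trivial) / (total - trivial)


# Figures from the report: 2253 of 4933 targets built,
# with roughly 1000 to 1500 of them being quick file copies.
low = real_progress(2253, 4933, trivial=1500)
high = real_progress(2253, 4933, trivial=1000)
print(f"{low:.0%} to {high:.0%}")  # prints "22% to 32%"
```

So the estimate holds up: somewhere between a bit under a quarter and a bit under a third of the real work was done after 13 hours.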

I've never seen this happen before. The corresponding F34 update https://koji.fedoraproject.org/koji/buildinfo?buildID=1915115 finished in just under four hours, which is a more typical amount of time, so needless to say something is wrong with the builder that's building the F35 update.

I will kick off a rawhide build now and see what happens there.

When do you need this to be done by? (YYYY/MM/DD)

ASAP, this is blocking/delaying updates


Metadata Update from @mohanboddu:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 years ago

Oddly I don't see anything obviously wrong with the builder. ;(

It's pegging 1 cpu at 100%, but is really responsive:

  `-kojid,29559 /usr/sbin/kojid --fg --force-lock --verbose
      `-kojid,29560 /usr/sbin/kojid --fg --force-lock --verbose
          `-mock,29958 -tt /usr/libexec/mock/mock -r koji/f35-build-33016337-4450510 --old-chroot --no-clean --target armv7hl ...
              `-rpmbuild,30508 -bb --target armv7hl --nodeps /builddir/build/SPECS/webkit2gtk3.spec
                  `-sh,30551 -e /var/tmp/rpm-tmp.tfZ1ii
                      `-cmake,1970 --build redhat-linux-build -j5 --verbose -j1
                          `-ninja-build,1971 -v -j 1
                              `-g++,20697 -DBUILDING_GTK__=1 -DBUILDING_WEBKIT -DBUILDING_WITH_CMAKE=1 -DBUILDING_WebKit ...
                                  |-cc1plus,20698 -quiet -I/builddir/build/BUILD/webkitgtk-2.34.5/redhat-linux-build/WebKit2
                                  `-as,20699 -I /builddir/build/BUILD/webkitgtk-2.34.5/redhat-linux-build/WebKit2Gtk/Headers -I...

I do see that it restarted twice... perhaps one of the earlier two ones was on the actual bad builder and it's on an ok one now?

Metadata Update from @kevin:
- Issue untagged with: medium-gain, medium-trouble, ops
- Issue priority set to: Needs Review (was: Waiting on Assignee)

2 years ago

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 years ago

> I do see that it restarted twice... perhaps one of the earlier two ones was on the actual bad builder and it's on an ok one now?

I didn't know that it was even possible for a build to restart. :O Maybe indeed. It looks like the build is now almost done, so that suggests it did almost all of its work in the past few hours....

It looks like it has restarted again. O_O

I'm not sure if it is going to finish unless you investigate to see what is going wrong. :(

> It looks like it has restarted again. O_O

And yet again. :(

yep. it's hitting OOM and restarting.

I have forced it over to a btrfs builder. Those have been very stable and I need to just reinstall all of them that way.

So, fingers crossed...
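For anyone wanting to confirm an OOM kill on a builder themselves, here is a minimal sketch that scans kernel log lines (for example the output of `journalctl -k`) for the OOM killer's report. It assumes the "Out of memory: Killed process PID (name)" wording used by recent Linux kernels; older kernels word it differently. The sample lines are hypothetical and were not taken from the actual builder.

```python
import re

# OOM-killer report line as printed by recent Linux kernels (assumed wording).
OOM_RE = re.compile(r"Out of memory: Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a list of (pid, command) for every OOM kill in the given lines."""
    kills = []
    for line in log_lines:
        match = OOM_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Hypothetical sample lines, for illustration only.
sample = [
    "kernel: cc1plus invoked oom-killer: gfp_mask=0x100cca, order=0, oom_score_adj=0",
    "kernel: Out of memory: Killed process 29559 (kojid) total-vm:123456kB, anon-rss:65432kB",
    "kernel: usb 1-1: new high-speed USB device number 2",
]
print(find_oom_kills(sample))  # prints [(29559, 'kojid')]
```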

> yep. it's hitting OOM and restarting.

Odd. In my experience, that typically results in a failed build, not an infinite loop of failed builds. Both are bad, but one is worse....

> So, fingers crossed...

Looks like it got stuck setting up the buildroot?

Oh no, it's "assigned" which must mean it's waiting for a builder to become available. I was confused because I'd never seen that before. OK then.

The OOM issue is that the build runs along, takes up all the memory, and the OOM killer kills kojid. We have kojid set to restart on failure, so it restarts and says "oh, I have this job I should do, let me start that". Lather, rinse, repeat.
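The loop described here can be illustrated with a toy simulation. This is a sketch only: it does not use any real kojid or Koji API, and the function name, memory figures, and restart cap are all made up. It just shows why "restart on failure" plus "resume the same task" never converges when the task cannot fit in memory, unless something caps the restarts or the task moves to a builder where it fits.

```python
def run_with_restart(memory_needed_mb, memory_available_mb, max_restarts):
    """Simulate a daemon that is restarted on failure and re-picks its task.

    Returns (outcome, attempts). All numbers are hypothetical.
    """
    attempts = 0
    while attempts <= max_restarts:
        attempts += 1
        if memory_needed_mb <= memory_available_mb:
            # The build fits in memory, so this attempt completes.
            return ("finished", attempts)
        # Otherwise the OOM killer ends the daemon, the supervisor
        # restarts it, and it picks the same task up again.
    return ("gave up", attempts)


# A task that never fits keeps failing until the (made-up) cap is hit.
print(run_with_restart(4096, 2048, max_restarts=3))  # prints ('gave up', 4)
# The same task on a builder where it fits completes on the first try.
print(run_with_restart(4096, 8192, max_restarts=3))  # prints ('finished', 1)
```

Without the cap (which is the situation described in this ticket), the first case would simply loop until someone intervenes.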

"assigned" means it's assigned to a specific builder, but it seems I assigned it to one that's messed up. ;( Moved it again to one that's working properly.

Seems to be building OK now! I suppose we'll find out tomorrow whether this worked.

That worked: it finished in just 2.5 hours!

Looks like my F36 build is also going to require rescue as it has similarly restarted: https://koji.fedoraproject.org/koji/taskinfo?taskID=82602050.

done. I will try and get all of them moved over soon. sorry for the troubles...

My builds have succeeded, thanks to your shepherding.

Remaining task here is to take down the builder that was having trouble.

ok. All 32-bit arm builders are re-installed. Let us know if you see this again.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

In this case, it was not 32 bit arm but instead x86_64, s390x, aarch64, ppc64le that seem to have oom'd or something and the tasks kept restarting. ppc64le has finished now though, but the others are still struggling and restarting.

Here's a link to the top level task: https://koji.fedoraproject.org/koji/taskinfo?taskID=89676281

