#8664 Module mass-rebuild for F32 Rawhide needed
Closed: Fixed 4 years ago by mohanboddu. Opened 4 years ago by sgallagh.

  • Describe the issue
    At the branch point, we need to perform a mass rebuild of modular content so that all modules are rebuilt against the new base module (in this case platform:f32). This does not seem to have happened during the F31 branch from Rawhide, as https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20190820.n.0/logs/global/ shows only those modules whose maintainers have done a manual rebuild since the branch.

  • When do you need this? (YYYY/MM/DD)
    Immediately (2019-08-21)

  • When is this no longer needed or useful? (YYYY/MM/DD)
    When F32 goes EOL

  • If we cannot complete your request, what is the impact?
    Upgrade path on Rawhide will be broken for anyone with modules enabled. (That means pretty much everyone at this point).

CC @kevin


@sgallagh I did rebuild them, and the only module build from the mass branching that is available right now is the subversion one. I am guessing most of them failed to build, but I am not sure how to retrieve the stats.

So, it looks like the mass rebuild was indeed initiated, but something strange happened in MBS and none of the module builds started at that time have completed (though they have built all of their components).

Currently still in progress:
unfinished.txt

I've enlisted @mprahl to look into what's happening on the MBS side.

There are two peculiarities:

  1. The module-build tasks aren't completing, even though the components have all been built or failed.
  2. The components are not being built with the appropriate %{dist} tag. They seem to be inheriting .fc32 from the standard build instead of the module-specific one that should be added by the module-build-macros package as part of the build.

The components are not being built with the appropriate %{dist} tag. They seem to be inheriting .fc32 from the standard build instead of the module-specific one that should be added by the module-build-macros package as part of the build.

OK, this one looks like it's already been solved in MBS: https://pagure.io/fm-orchestrator/pull-request/1360

We'll need to make sure that's included on prod before we restart the mass-rebuild.

The mass rebuild for modules failed because of architectural issues in MBS that only become apparent under high load, such as a mass rebuild. One such issue is that MBS has only a single worker processing messages, so under high load it can take anywhere from several minutes to an hour to start processing a new message. MBS also has a separate process called the "poller" that tries to detect missed messages. Under these circumstances it assumes a message has been lost and triggers it again, even though MBS may simply not have gotten to it yet. This causes very strange conditions that make debugging failed builds difficult. These issues are being addressed by issue 1311 [1], which the team will start on soon and expects to finish before December 1st.
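To illustrate the race (a toy sketch, not MBS code, built only from the behaviour described above): a single worker drains an in-process backlog while the poller re-enqueues anything that has not made progress within a timeout, even if the original message is still sitting unprocessed in that backlog.

import queue
import time

POLLER_TIMEOUT = 600      # seconds before the poller assumes a message was lost
backlog = queue.Queue()   # the single worker's in-process message backlog
last_progress = {}        # build id -> timestamp of its last observed state change

def poller_pass():
    """Re-enqueue builds that look stuck. When the worker is merely behind,
    this duplicates messages it has not gotten to yet, producing the
    confusing conditions described above."""
    now = time.time()
    for build_id, timestamp in last_progress.items():
        if now - timestamp > POLLER_TIMEOUT:
            backlog.put(build_id)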

The other issue is that MBS has an internal Python queue which stores all the messages it receives; if MBS is restarted, all of those messages are lost. The fedmsg-hub service crashed due to the file descriptor leak, so all of those messages were lost, and the poller was not smart enough to recover from there. The Python queue issue is being addressed by issue 1311 [1], but a fix for the fedmsg-hub issue is not currently planned and is tracked in issue 1234 [2].
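To make the message-loss part concrete, here is a minimal, hypothetical illustration (not the actual MBS internals): anything buffered in an in-process queue.Queue exists only in memory, so a crash before the queue is drained discards it, with no durable record for the poller to recover from.

import queue

pending = queue.Queue()
for msg in ('module build 1 tagged', 'module build 2 tagged'):
    pending.put(msg)          # buffered in RAM only

# If the process dies here (as fedmsg-hub did when it crashed), the loop
# below never runs and the queued messages are simply gone.
while not pending.empty():
    print('processing:', pending.get())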

My suggestion is to cancel all existing builds that are stuck and restart the mass rebuild for the ones that didn't succeed, but at a much slower pace. This can be done by checking the number of module builds that are active, or by sleeping an arbitrary amount of time between builds.
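As a rough sketch of that pacing idea (the state names treated as "active", the concurrency cap, and the caller-provided submit callable are my assumptions, not existing releng tooling):

import time

import requests

MBS = 'https://mbs.fedoraproject.org/module-build-service/1/module-builds/'
# Treating 'init', 'wait' and 'build' as the non-terminal states is an assumption.
ACTIVE_URL = MBS + '?state=init&state=wait&state=build&short=true&per_page=100'
MAX_ACTIVE = 5                # arbitrary concurrency cap for illustration

def active_build_count():
    """Count module builds MBS is still working on, following pagination."""
    url, count = ACTIVE_URL, 0
    while url:
        rv_json = requests.get(url).json()
        count += len(rv_json['items'])
        url = rv_json['meta']['next']
    return count

def submit_throttled(modules, submit):
    """submit is a caller-provided callable that kicks off one module build,
    e.g. by running fedpkg module-build in that module's dist-git checkout."""
    for name, stream in modules:
        while active_build_count() >= MAX_ACTIVE:
            time.sleep(60)    # arbitrary pause between checks
        submit(name, stream)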

I'm sorry about the inconvenience, but the team is working on addressing the issue.

[1] https://pagure.io/fm-orchestrator/issue/1311
[2] https://pagure.io/fm-orchestrator/issue/1234

I will work with @mprahl on this sometime this week.

I filed the following PR as a work-around:
https://pagure.io/releng/pull-request/8739

It will allow the Release Engineer to wait for each module build to complete before moving on to the next one during a mass rebuild. It will be slow, but it is likely the fastest way to unblock us.
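For reference, the waiting part amounts to something like the sketch below; the per-build endpoint and the set of states treated as terminal are my assumptions, and the actual logic in the PR may differ.

import time

import requests

MBS_BUILD = 'https://mbs.fedoraproject.org/module-build-service/1/module-builds/{}'

def wait_for_module_build(build_id, poll_interval=120):
    """Poll one MBS module build until it leaves the non-terminal states."""
    terminal = {'ready', 'failed', 'garbage'}   # assumed terminal state names
    while True:
        state = requests.get(MBS_BUILD.format(build_id)).json()['state_name']
        if state in terminal:
            return state
        time.sleep(poll_interval)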

@sgallagh I got stuck in the IST timezone this week due to some network issues at home. I will be back in the US next week and will work with @mprahl on this then.

So, I worked on this some yesterday. I canceled all the old mass-rebuild tasks that were stuck and ran a new one.

I am not convinced it did everything it was supposed to, but it did build a bunch of modules.

On my Rawhide laptop, it's only complaining about the stratis module (which failed to build) needing platform:31.

@sgallagh can you look over the Rawhide modules and see if everything looks OK now, or if we have more work to do here?

Working with @mohanboddu, we developed the following script, which tells us which f32 module builds have never succeeded:

import requests

# Page through every f32 module build that ended in 'ready' or 'failed' state.
url = 'https://mbs.fedoraproject.org/module-build-service/1/module-builds/?base_module_br_name=platform&base_module_br_stream=f32&state=ready&state=failed&per_page=100&short=true'
ready = set()
failed = set()
while True:
    rv_json = requests.get(url).json()
    for module in rv_json['items']:
        if module['state_name'] == 'ready':
            ready.add('{}:{}'.format(module['name'], module['stream']))
        elif module['state_name'] == 'failed':
            failed.add('{}:{}'.format(module['name'], module['stream']))
    # Follow pagination until MBS reports no further pages.
    if rv_json['meta']['next']:
        url = rv_json['meta']['next']
    else:
        break

# A name:stream that failed at some point but also has a successful build does
# not need a rebuild; only streams that never succeeded are printed.
for mod_to_rebuild in (failed - ready):
    print(mod_to_rebuild)

The output was:

nest:2.16.0
dwm:latest
eclipse:2019-03
postgresql:9.6
mongodb:3.6
kubernetes:1.10
swig:3.0
octave:4.4
tomcat:stream-tomcat-201908
postgresql:master
rawtherapee:devel
ghc:8.6
ghc:8.4
eclipse:latest
nest:2.18.0
ghc:8.8
hub:pre-release
tycho:1.4
cri-o:1.11
setools:4.2
setools:4.2.0
ruby:master
tycho:1.3
avocado:master
kubernetes:openshift-3.10
lsd:rolling
dwm:6.1
dwm:6.0
openmpi:4.0
postgresql:10
rawtherapee:stable
cri-o:1.15
mysql:5.6
skychart:devel
openmpi:2.1
dwm:6.2
octave:5.1
openmpi:3.1

This has long since been fixed.

Metadata Update from @mohanboddu:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)
