#10850 Modular builds stuck in init state
Closed: Fixed a month ago by kevin. Opened 2 months ago by yselkowitz.


Metadata Update from @phsmoura:
- Issue tagged with: medium-gain, medium-trouble, ops

2 months ago

I have poked MBS and it is processing builds again. If the issue appears again please reopen this ticket.

Metadata Update from @humaton:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 months ago

Metadata Update from @yselkowitz:
- Issue status updated to: Open (was: Closed)

2 months ago

Not having the ability to build modules has also broken the Fedora CoreOS CI: https://bugzilla.redhat.com/show_bug.cgi?id=2096890

So I have restarted the services and seems to update some tasks, but ignores the build result and stays stuck in the waiting/build state

there are no tracebacks in the logs, but this is weird:

Jun 23 12:39:21 mbs-backend01.iad2.fedoraproject.org fedmsg-hub[4981]: [2022-06-23 12:39:21][moksha.hub WARNING] Received invalid message RuntimeWarning('Failed to authn message.',)

some warnings from the logs

Jun 23 12:47:37 mbs-backend01.iad2.fedoraproject.org celery[5395]: [2022-06-23 12:47:37,365: WARNING/ForkPoolWorker-1] /usr/lib/python2.7/site-packages/dnf/base.py:212: Warning: cannot register existing type 'ModulemdModuleStream'
Jun 23 12:47:37 mbs-backend01.iad2.fedoraproject.org celery[5395]: self.conf.module_platform_id, False, self.conf.debug_solver)
Jun 23 12:47:37 mbs-backend01.iad2.fedoraproject.org celery[5395]: [2022-06-23 12:47:37,365: WARNING/ForkPoolWorker-1] /usr/lib/python2.7/site-packages/dnf/base.py:212: Warning: g_once_init_leave: assertion 'result != 0' failed

This seems like a dnf or libmodulemd level issue.

Here is the surrounding code from the warning:

for excl in set(self.conf.excludepkgs):
    subj = dnf.subject.Subject(excl)
    exclude_query = exclude_query.union(subj.get_best_query(
        self.sack, with_nevra=True, with_provides=False, with_filenames=False))

"cannot register existing type" is a Glib-Gobject warning.

That said, I'm not actually sure that warning is related to the build stall

This is what I see in the logs for 14614

[2022-06-23 08:05:53,081: INFO/ForkPoolWorker-1] State transition: 'init' -> 'wait', <ModuleBuild transmission-remote-gtk, id=14614, stream=stable, version=3620220620000329, scratch=False, state 'wait', batch 0, state_reason u'Resubmitted by yselkowitz'>
[2022-06-23 08:05:53,195: DEBUG/ForkPoolWorker-1] Running event scheduler with following events:
[2022-06-23 08:05:53,195: INFO/ForkPoolWorker-1] Task module_build_service.scheduler.handlers.modules.init[b616d4f1-b707-44aa-a862-ec75b71e01eb] succeeded in 142.863135703s: None
...
[2022-06-23 08:05:55,148: WARNING/ForkPoolWorker-1] Module build <ModuleBuild transmission-remote-gtk, id=14614, stream=stable, version=3620220620000329, scratch=False, state 'wait', batch 0, state_reason u'Resubmitted by yselkowitz'> has moved to wait state already.
[2022-06-23 08:05:55,149: WARNING/ForkPoolWorker-1] Ignore this message 2022-30fe31d7-92d9-49b0-8094-84f84cd8cc46. Is there something wrong with the frontend that sends duplicate messages?
[2022-06-23 08:05:55,150: INFO/MainProcess] Received task: module_build_service.scheduler.handlers.modules.init[fb24ffb3-fb8d-40f9-9a8a-f83dfac14652]
[2022-06-23 08:05:55,150: DEBUG/ForkPoolWorker-1] Running event scheduler with following events:
[2022-06-23 08:05:55,150: INFO/ForkPoolWorker-1] Task module_build_service.scheduler.handlers.modules.init[c5c68bff-43c8-4bcc-8ba7-fbc87ebeb4f2] succeeded in 0.00507247820497s: None
...
[2022-06-23 08:05:56,446: INFO/ForkPoolWorker-1] State transition: 'wait' -> 'failed', <ModuleBuild transmission-remote-gtk, id=14614, stream=stable, version=3620220620000329, scratch=False, state 'failed', batch 0, state_reason u'Resubmitted by yselkowitz'>
[2022-06-23 08:05:56,562: DEBUG/ForkPoolWorker-1] Running event scheduler with following events:
[2022-06-23 08:05:56,563: INFO/ForkPoolWorker-1] Task module_build_service.scheduler.handlers.modules.failed[4dfb35b7-5109-4f7a-8b82-8cfc2dd5a264] succeeded in 0.131899863016s: None
...
[2022-06-23 08:06:01,593: WARNING/ForkPoolWorker-1] Module build <ModuleBuild transmission-remote-gtk, id=14614, stream=stable, version=3620220620000329, scratch=False, state 'failed', batch 0, state_reason u'Resubmitted by yselkowitz'> has moved to failed state already.
[2022-06-23 08:06:01,594: WARNING/ForkPoolWorker-1] Ignore this message 2022-f4f40f00-9f08-4737-b9a0-7affff78baa0. Is there something wrong with the frontend that sends duplicate messages?
[2022-06-23 08:06:01,595: INFO/MainProcess] Received task: module_build_service.scheduler.handlers.modules.init[5596861b-5573-4348-aa6b-7799eee8b16e]
[2022-06-23 08:06:01,595: DEBUG/ForkPoolWorker-1] Running event scheduler with following events:
[2022-06-23 08:06:01,595: INFO/ForkPoolWorker-1] Task module_build_service.scheduler.handlers.modules.init[eb726cde-6d84-4908-a770-730c1afcbf81] succeeded in 0.00582038611174s: None
[2022-06-23 08:06:01,599: WARNING/ForkPoolWorker-1] Module build <ModuleBuild transmission-remote-gtk, id=14614, stream=stable, version=3620220620000329, scratch=False, state 'failed', batch 0, state_reason u'Resubmitted by yselkowitz'> has moved to failed state already.
[2022-06-23 08:06:01,600: WARNING/ForkPoolWorker-1] Ignore this message 2022-f4f40f00-9f08-4737-b9a0-7affff78baa0. Is there something wrong with the frontend that sends duplicate messages?

We have some changes in HEAD that aren't yet in a release that might help with these duplicate message issues. Has Fedora applied these yet? I thought this had come up before.

@mikem latest applied patch would be this one:
https://pagure.io/fm-orchestrator/pull-request/1711

There's also patches for 1709 and 1729. Were there other PRs you were thinking of?

Yeah, I was just comparing. The only thing that might help is https://pagure.io/fm-orchestrator/pull-request/1718

I thought there was something else though....

Actually, I think 1718 is probably the one I'm thinking of

I've deployed the 1718 patch to production, let's see if that helps nudge things along.

I've noticed the waitrepo tasks mbs does like https://koji.fedoraproject.org/koji/taskinfo?taskID=88851559 have "Parameters are not correct for this method."

perhaps it's related?

Hello @mikem , @breilly when do you expect new release to happen?

@humaton I deployed the mentioned patch to production on monday, unfortunately it did not fix the issue. I'm still investigating to find the root cause.

@jkaluza Hey, if you have some time, I wonder if you could help figure out what's going on with MBS?

Still investigating this issue, as of now it seems the backend is not picking up tasks that are handed off by the frontend. It's unclear if there's some issue with fedmsg, though there aren't any authentication problems.

As mentioned in previous comments, there is the dnf / libmodulemd related warning before the backend stalls. It always appears as MBS attempts to process this specific build: https://mbs.fedoraproject.org/module-build-service/1/module-builds/14588

It's possible there's something unexpected in the modulemd that's causing that warning. Since this build is past the point where MBS would normally fail it due to age, i tried to cancel it and restart MBS to see if that got things moving again, but I got a 500 error.

If I submit a scratch module build, I see the following (double) message on the backend (and nothing else).

Jul 07 19:32:36 mbs-backend01.iad2.fedoraproject.org fedmsg-hub[15272]: [2022-07-07 19:32:36][       MBS    INFO] Calling init: mbs_module_state_change, 2022-4aeec083-46f4-47f5-b160-8ec9d955af80
Jul 07 19:32:36 mbs-backend01.iad2.fedoraproject.org fedmsg-hub[15272]: [2022-07-07 19:32:36][       MBS    INFO] Calling init: mbs_module_state_change, 2022-4aeec083-46f4-47f5-b160-8ec9d955af80

This is the mbs code handling the message that the frontend sent about a new module build. It handles it by queuing a celery task. It seems that the celery tasks are going nowhere.

Checking journalctl, all the workers take tasks when they're rebooted, but they all stop processing tasks after reporting the "cannot register existing type 'ModulemdModuleStream'" warning.

I believe the "cannot register existing type 'ModulemdModuleStream'' error is a warning from libmodulemd. I've seen it before and I do not think it is the cause here.

Fairly sure that the issue is somewhere between celery and the rabbitmq server it is configured to use.

If I manually run the init handler from the command line, it manages to perform the init (it even gets the build into wait), but dies when it tries to send the notification on the message bus

Traceback (most recent call last):
  File "kick-init.py", line 24, in <module>
    init('some_msg_id', module_build_id, models.BUILD_STATES["init"])
  File "/usr/lib/python2.7/site-packages/celery/local.py", line 191, in __call__
    return self._get_current_object()(*a, **kw)
  File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 375, in __call__
    return self.run(*args, **kwargs)
  File "/root/mikem/fm-orchestrator/module_build_service/scheduler/events.py", line 136, in wrapper
    raise e
RuntimeError: Unable to send message to backend 'fedmsg'

This error is raised by mbs in it's publish() function. It is masking the real error.

With manual debugging, that error appears to be: KeyError('Could not find endpoint for fedmsg-relay. Try installing fedmsg-relay.',)

Hmm, the above might be due the the way I'm running the handler (but I've never had an issue running them that way in the RH instance)

The fedmsg configuration, and the celery / rabbitmq configuration, has not changed since MBS was deployed (and all certs are up to date MBS side). Looking at the logs, the backend celery workers are establishing connections to the queue as expected.
I wonder if something has changed for the queue backend, or the messagebus, that could cause this behavior.

I wouldn't think so... mbs is still using fedmsg, so it shouldn't be constrained.

So is the problem that it doesn't get messages from koji? or that it can't send messages?

We did move our message-bridges from openshift3 cluster to openshift4, but they seem to be operating normally as far as I can see. ;(

When I look at mbs messages in fedmsg from today, all I see are the messages generated by the frontend — build init messages and build fail messages (due to user cancel). No other mbs messages appear.

http get https://apps.fedoraproject.org/datagrepper/v2/search    since==2022-07-11    category==mbs

On the backend, we can see the fedmsg receiving these messages and triggering a handler, but we don't see evidence of those handlers actually running.

The backend handlers are celery tasks, so they don't run directly under fedmsg; fedmsg just queues them. Your celery config is set up to use rabbitmq.fedoraproject.org

CELERY_BROKER_URL = 'amqps://mbs-private-queue@rabbitmq.fedoraproject.org//mbs-private-queue'

So it seems like something is going wrong very early in the celery process, possibly because messages aren't getting through.

The fedmsg error I pasted above was something I get by basically manually calling one of the handlers in python. This may also be a problem, but I don't think it's the problem. I don't see any evidence of the backend handlers trying to run at all and I don't see errors like this in the logs.

The active tasks reported by celery are from last week

$ celery -A module_build_service.scheduler.celery_app inspect active -t 10
-> celery@mbs-worker-2: OK
    * {u'args': u"('internal:mbs.module.state.change', 14588, 0)", u'time_start': 1657217629.670392, u'name': u'module_build_service.scheduler.handlers.modules.init', u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'mbs-2', u'exchange': u''}, u'hostname': u'celery@mbs-worker-2', u'acknowledged': True, u'kwargs': u'{}', u'type': u'module_build_service.scheduler.handlers.modules.init', u'id': u'd732946d-9067-48bb-8d08-359b63bc106a', u'worker_pid': 15760}
-> celery@mbs-worker-1: OK
    * {u'args': u'()', u'time_start': 1657217210.4333358, u'name': u'module_build_service.scheduler.handlers.modules.init', u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'mbs-1', u'exchange': u''}, u'hostname': u'celery@mbs-worker-1', u'acknowledged': True, u'kwargs': u"{'module_build_state': 0, 'msg_id': '2022-f7675924-f657-4605-a0a8-81cc13f8cbea', 'module_build_id': 14659}", u'type': u'module_build_service.scheduler.handlers.modules.init', u'id': u'd78f5a2a-75da-4245-a082-f8895e866114', u'worker_pid': 15397}
-> celery@mbs-worker-0: OK
    * {u'args': u"('internal:mbs.module.state.change', 14559, 0)", u'time_start': 1657217209.3966403, u'name': u'module_build_service.scheduler.handlers.modules.init', u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'mbs-0', u'exchange': u''}, u'hostname': u'celery@mbs-worker-0', u'acknowledged': True, u'kwargs': u'{}', u'type': u'module_build_service.scheduler.handlers.modules.init', u'id': u'2a5cf779-71b4-40be-a695-3d7d68082f86', u'worker_pid': 15395}
$ date -d '@1657217209.3966403'
Thu Jul  7 02:06:49 PM EDT 2022

This dates from when the service was last restarted.

I see some celery worker processes with matching runtimes. They all seem to be stuck on futex calls. Judging from their open files they were managing to run some mbs code (e.g. files under tmp/mbs/dnf/module-el8-build-37).

All three processes were stuck at the same point in libmodulemd

(gdb) bt
#0  0x00007f514331ae29 in syscall () from /lib64/libc.so.6
#1  0x00007f512abaf0cf in g_cond_wait () from /lib64/libglib-2.0.so.0
#2  0x00007f512ab9171b in g_once_init_enter () from /lib64/libglib-2.0.so.0
#3  0x00007f51263a7088 in modulemd_modulestream_get_type () from /lib64/libmodulemd.so.1
#4  0x00007f51263c1d0d in _parse_yaml () from /lib64/libmodulemd.so.1
#5  0x00007f51263c2642 in parse_yaml_string () from /lib64/libmodulemd.so.1
#6  0x00007f512639a71e in modulemd_objects_from_string_ext () from /lib64/libmodulemd.so.1
#7  0x00007f5126afa067 in ModuleMetadata::metadataFromString(std::string const&) () from /lib64/libdnf.so.2
#8  0x00007f5126b04142 in ModulePackageContainer::add(std::string const&, std::string const&) () from /lib64/libdnf.so.2
#9  0x00007f5126b0459c in ModulePackageContainer::add(_DnfSack*) () from /lib64/libdnf.so.2
#10 0x00007f5126a8350b in dnf_sack_filter_modules_v2(_DnfSack*, ModulePackageContainer*, char const**, char const*, char const*, bool, bool) () from /lib64/libdnf.so.2
#11 0x00007f5123322cc7 in filter_modules(_SackObject*, _object*, _object*) () from /usr/lib64/python2.7/site-packages/hawkey/_hawkeymodule.so

I tried killing one of these off (had to use -9) and this briefly resulted in som mbs activity in the logs. However it seem the new celery worker process is now stuck at the same point.

So, I may be wrong about the message bus. Looks like there is some indefinite stall in libmodulemd.

The current wedged processes are all looking a different builds for different packages (nginx, postgresql, cobbler), so I doubt this is due to some sort of strange data in a build submission.

@mikem / @breilly Thanks for looking into this.

Do let me/us know if there's anything specific you want us to check on the infra side. Happy to help...

I'm able to replicate the stall with a manually run init handler (didn't hit this before, so likely does depend on some details of the module build).

Running the init handler for build 14559, stalls with:

2022-07-11 17:05:35,993 - MainThread - dnf - DEBUG - repo_x86_64: using metadata from Mon 11 Jul 2022 04:49:01 PM UTC.
2022-07-11 17:05:35,993 - MainThread - dnf - INFO - Last metadata expiration check: 0:00:15 ago on Mon 11 Jul 2022 05:05:16 PM UTC.
/usr/lib/python2.7/site-packages/dnf/base.py:212: Warning: cannot register existing type 'ModulemdModuleStream'
  self.conf.module_platform_id, False, self.conf.debug_solver)
/usr/lib/python2.7/site-packages/dnf/base.py:212: Warning: g_once_init_leave: assertion 'result != 0' failed
  self.conf.module_platform_id, False, self.conf.debug_solver)

So, looks like @breilly was right the first time

After forcibly killing off several stalled celery tasks and canceling a handful of older stuck builds, the remainder seems to be moving again.

It is not clear what about those builds was triggering this behavior in libdnf, since it seems that the other builds are not hitting it.

Because I don't know the root cause, I can't promise it won't clog again.

After forcibly killing off several stalled celery tasks and canceling a handful of older stuck builds, the remainder seems to be moving again.

It is not clear what about those builds was triggering this behavior in libdnf, since it seems that the other builds are not hitting it.

Because I don't know the root cause, I can't promise it won't clog again.

If this keeps happening, one workaround might be to add a time limit for the celery worker and poller tasks (e.g. pass --time-limit the the celery worker invocation in the service scripts). We don't expect mbs handlers to take very long, so an hour or two here would be pretty conservative and help unclog situations like this.

I wouldn't rush to do this though. It would be better to fix the underlying problem if it recurs.

I can just add that from my subjective experience about 2/3 of my module builds (nginx) require me to issue a module build more than once to get all platforms to build successfully. Possibly unrelated though, I'm not sure.

(gdb) bt
#0  0x00007f514331ae29 in syscall () from /lib64/libc.so.6
#1  0x00007f512abaf0cf in g_cond_wait () from /lib64/libglib-2.0.so.0
#2  0x00007f512ab9171b in g_once_init_enter () from /lib64/libglib-2.0.so.0
#3  0x00007f51263a7088 in modulemd_modulestream_get_type () from /lib64/libmodulemd.so.1

This looks like looks*like https://gitlab.gnome.org/GNOME/glib/-/issues/541 - which is an inherent issue when different threads are trying to register new GObject types that depend on each other that is worked around in various places. But actually, I think it's something different. Once we've hit:

/usr/lib/python2.7/site-packages/dnf/base.py:212: Warning: cannot register existing type 'ModulemdModuleStream'
/usr/lib/python2.7/site-packages/dnf/base.py:212: Warning: g_once_init_leave: assertion 'result != 0' failed

We're in an entirely broken state - since once the second warning triggers, we don't unlock an internal GObject mutex.

The "Cannot register existing type" warning typically comes from reasons like:
* Multiple different versions of a GObject-using library are linked into the same process
* A library was unloaded and then reloaded in the same process

Since the root cause tends to be that the static variable holding the type ID doesn't have a value in it, but the type has already been registered.

The content of /proc/<pid>/maps for a stuck process might potentially be interesting.

Isn't MBS supposed to use libmodulemd2 from EPEL? modulemd_modulestream_get_type() does not exist in libmodulemd.so.2 There is modulemd_module_stream_get_type(). Be ware of the underscore. Isn't the problem that libdnf loads libmodulemd, but MBS loads libmodulemd2? The symbol names clashes:

# nm -D /lib64/libmodulemd.so.2 | grep _get_type
                 U g_variant_get_type
                 U g_variant_get_type_string
00000000000198b0 T modulemd_build_config_get_type
000000000001b650 T modulemd_buildopts_get_type
000000000001d660 T modulemd_component_get_type
000000000001ece0 T modulemd_component_module_get_type
00000000000205f0 T modulemd_component_rpm_get_type
00000000000181f0 T modulemd_compression_type_enum_get_type
0000000000023ae0 T modulemd_defaults_get_type
0000000000018260 T modulemd_defaults_version_enum_get_type
0000000000025570 T modulemd_defaults_v1_get_type
0000000000029580 T modulemd_dependencies_get_type
0000000000018340 T modulemd_error_enum_get_type
00000000000182d0 T modulemd_error_get_type
000000000002a850 T modulemd_module_get_type
000000000002cc60 T modulemd_module_index_get_type
0000000000030a20 T modulemd_module_index_merger_get_type
0000000000031210 T modulemd_module_stream_get_type
0000000000018490 T modulemd_module_stream_version_enum_get_type
0000000000034640 T modulemd_module_stream_v1_get_type
000000000003c900 T modulemd_module_stream_v2_get_type
000000000005ef50 T modulemd_obsoletes_get_type
0000000000018500 T modulemd_obsoletes_version_enum_get_type
0000000000018570 T modulemd_packager_version_enum_get_type
0000000000045a20 T modulemd_packager_v3_get_type
000000000004c660 T modulemd_profile_get_type
000000000004e280 T modulemd_rpm_map_entry_get_type
0000000000050540 T modulemd_service_level_get_type
00000000000519f0 T modulemd_subdocument_info_get_type
0000000000055180 T modulemd_translation_entry_get_type
0000000000052bb0 T modulemd_translation_get_type
0000000000018420 T modulemd_yaml_error_enum_get_type
00000000000183b0 T modulemd_yaml_error_get_type
[root@rhel-7-9 ~]# nm -D /lib64/libmodulemd.so.1 | grep _get_type
                 U g_date_get_type
                 U g_error_get_type
                 U g_hash_table_get_type
                 U g_ptr_array_get_type
                 U g_strv_get_type
                 U g_variant_get_type_string
000000000001cd20 T modulemd_buildopts_get_type
000000000001e330 T modulemd_component_get_type
000000000001f0e0 T modulemd_component_module_get_type
000000000001fa10 T modulemd_component_rpm_get_type
00000000000209c0 T modulemd_defaults_get_type
0000000000022c40 T modulemd_dependencies_get_type
0000000000023750 T modulemd_improvedmodule_get_type
0000000000024570 T modulemd_intent_get_type
00000000000256f0 T modulemd_module_get_type
000000000002a060 T modulemd_modulestream_get_type
000000000002f020 T modulemd_prioritizer_get_type
000000000002f940 T modulemd_profile_get_type
00000000000310a0 T modulemd_servicelevel_get_type
0000000000030650 T modulemd_simpleset_get_type
0000000000031ac0 T modulemd_subdocument_get_type
0000000000033490 T modulemd_translation_entry_get_type
0000000000032230 T modulemd_translation_get_type

The content of /proc/<pid>/maps for a stuck process might potentially be interesting.

I can confirm that in the maps file of one of the stuck processes both libmodulemd2 and libmodule are loaded:

7f9dea4b3000-7f9dea53b000 r-xp 00000000 fd:00 16779651                   /usr/lib64/libmodulemd.so.2.14.0
7f9dea53b000-7f9dea73a000 ---p 00088000 fd:00 16779651                   /usr/lib64/libmodulemd.so.2.14.0
7f9dea73a000-7f9dea73b000 r--p 00087000 fd:00 16779651                   /usr/lib64/libmodulemd.so.2.14.0
7f9dea73b000-7f9dea73d000 rw-p 00088000 fd:00 16779651                   /usr/lib64/libmodulemd.so.2.14.0
7f9de7036000-7f9de70a1000 r-xp 00000000 fd:00 16787785                   /usr/lib64/libmodulemd.so.1.6.3
7f9de70a1000-7f9de72a0000 ---p 0006b000 fd:00 16787785                   /usr/lib64/libmodulemd.so.1.6.3
7f9de72a0000-7f9de72a1000 r--p 0006a000 fd:00 16787785                   /usr/lib64/libmodulemd.so.1.6.3
7f9de72a1000-7f9de72a3000 rw-p 0006b000 fd:00 16787785                   /usr/lib64/libmodulemd.so.1.6.3

Isn't MBS supposed to use libmodulemd2 from EPEL? modulemd_modulestream_get_type() does not exist in libmodulemd.so.2 There is modulemd_module_stream_get_type(). Be ware of the underscore.

MBS is supposed to be using libmodulemd2, yes. MBS itself is also not calling modulemd_module_stream_get_type() or modulemd_modulestream_get_type() directly as far as I can tell.

Isn't the problem that libdnf loads libmodulemd, but MBS loads libmodulemd2?

If that is the case, what needs to be done to avoid the symbol names clashing?

modulemd_module_stream_get_type() will indirectly be called behind the scenes via the Python GObject bindings.

I think having libmodulemd.so.1 and libmodulemd.so.2 in the same process cannot work properly and has to be avoided - if MBS is using the DNF module, it needs to use the same libmodulemd as DNF. What is the host OS? What are we using for RHEL? It looks like RHEL8 should be using libmodulemd.so.2 for DNF.

To clarify: "What are we using for RHEL?" what I meant there is what are we using for a host OS for the internal RHEL deployment of MBS.

The OS is the same for both our internal deployment and Fedora's deployment - RHEL 7.9.

Where are the python dnf bindings coming from? DNF/yum4 isn't part of rhel7 is it?

Various DNF items were added to EL7 before EL8 came out so that transitioning and actually building modules would be possible. This system has the following:

dnf-data-4.0.9.2-2.el7_9.noarch
libdnf-0.22.5-2.el7_9.x86_64
libmodulemd-1.6.3-1.el7.x86_64
libmodulemd2-2.14.0-1.el7.x86_64
python2-dnf-4.0.9.2-2.el7_9.noarch
python2-libdnf-0.22.5-2.el7_9.x86_64
python2-libmodulemd2-2.14.0-1.el7.x86_64

all of them but the libmodulemd2 is from RHEL.

You can try creating a ModuleStream object in MBS after importing Modulemd2 but before spawning any threads and importing libdnf. It could mask this particular failure.

But in general having both libraries in the same process won't work. Specifically it depends how Python and Glib Introspection dlopen libraries (RTLD_LOCAL vs. RTLD_GLOBAL) and where glib stores its global data structures. But there are to many holes to be plugged in local-binding approach.

I'm thinking that the most straightforward solution here is probably to make _get_rpms_in_external_repo() run in a separate process:

subprocess.check_output(['mbs-get-rpms-in-external-repo', repo_url, cache_dir_name] + arches)

and make mbs-as-a-server not import dnf at all.

I see, thanks. And the non-work-around solution would be to upgrade the host to RHEL 8, right? Since dnf there would be using libmodulemd2.

I'm surprised we haven't seen this issue arise until now, and we as yet haven't seen it in our near identical internal deployment.

Yeah, on RHEL 8, dnf uses libmodulemd2, so this problem wouldn't exist. It's not clear to me why this problem wasn't happening before, or why it isn't happening internally - it could be as @ppisar says that it's masked for some orders of using dnf and libmodulemd - or maybe handle_collisions_with_base_module_rpms() is simply not getting called at all.

The non-work-around would be upgrade to Rawhide. Because Fedora MBS wants to process modules for Rawhide. If MBS freezes on RHEL-8 libmodulemd, it won't be possible to support Fedora features. If you want to keep using RHEL, you will need to rebuild Fedora's libmodulemd and install it into the host. And that will be only doable as long as Fedora's libmodulemd remains compatible with RHEL's libmodulemd. Now it's so (2.14.0 vs. 2.13.0).

For the sake of expediency I'm working on implementing the workaround fix. This should be only a temporary measure though, and an OS upgrade is going to be required for a longer term fix.

I'm happy to rebuild the backend/frontend/both vm's to a different os.

I am against running rawhide, we don't do that anywhere else (aside from rawhide-test machine). I think it would cause undue strain on scarce sysadmin cycles to keep it working.

I'm fine with rhel8, rhel9, fedora-36...

I am against rawhide mainly because the few times we have tried it for a service it led to complete burnout and quitting of people.

I would like to suggest RHEL9. RHEL8 is going to be in maintenance mode when RHEL7 goes out of life in 2024. That means while it will work, we won't be able to get any 'fixes' we might need for newer packages. After that I would say Fedora with a scheduled rotation of upgrading it every 6 months with the MBS team scheduled to help on it.

I've deployed the workaround patch to stage, now running some test builds before deploying to production.

Deployed patch to prod, will keep an eye on builds to see how they're going.

There's now a new, unrelated issue to the first on the backend - the cert fedmsg-hub uses is invalid:

Jul 14 13:40:38 mbs-backend01 fedmsg-hub: [2022-07-14 13:40:38][fedmsg.crypto.utils    INFO] Downloading x509 certificate from https://fedoraproject.org/fedmsg/ca.crt
Jul 14 13:40:38 mbs-backend01 fedmsg-hub: [2022-07-14 13:40:38][fedmsg.crypto.utils    INFO] Downloading x509 certificate from https://fedoraproject.org/fedmsg/crl.pem
Jul 14 13:40:38 mbs-backend01 fedmsg-hub: [2022-07-14 13:40:38][fedmsg.crypto.x509 WARNING] Failed validation.  X509 certificate is not valid.
Jul 14 13:40:38 mbs-backend01 fedmsg-hub: [2022-07-14 13:40:38][fedmsg.consumers WARNING] Received invalid message Failed to authn message.

@kevin can you investigate?

The fedmsg ca expired. ;(

I pushed out a new cert signed by the old ca... please try again now?

Seeing a new error now:

Jul 14 15:50:48 mbs-backend01 fedmsg-hub: [2022-07-14 15:50:48][moksha.hub WARNING] Received invalid message RuntimeWarning('Failed to authn message.',)
Jul 14 15:50:48 mbs-backend01 fedmsg-hub: [2022-07-14 15:50:48][fedmsg.crypto WARNING] Message specified an impossible crypto backend

And it is still saying the cert is invalid:

Jul 14 15:53:25 mbs-backend01 fedmsg-hub: [2022-07-14 15:53:25][fedmsg.crypto.utils    INFO] Downloading x509 certificate from https://fedoraproject.org/fedmsg/ca.crt
Jul 14 15:53:25 mbs-backend01 fedmsg-hub: [2022-07-14 15:53:25][fedmsg.crypto.utils    INFO] Downloading x509 certificate from https://fedoraproject.org/fedmsg/crl.pem
Jul 14 15:53:25 mbs-backend01 fedmsg-hub: [2022-07-14 15:53:25][fedmsg.crypto.x509 WARNING] Failed validation.  X509 certificate is not valid.
Jul 14 15:53:25 mbs-backend01 fedmsg-hub: [2022-07-14 15:53:25][fedmsg.consumers WARNING] Received invalid message Failed to authn message.

This was after a fedmsg-hub restart.

After another fedmsg-hub restart, MBS is working again.

Awesome. Thanks for everyone's work here...

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a month ago

Login to comment on this ticket.

Metadata
Boards 1
Ops Status: Backlog