#10558 Fedora message bus instability - lost messages
Closed: Insufficient data 2 years ago by kevin. Opened 2 years ago by mvadkert.

We are struggling with Fedora CI stability for pull requests. Both Zuul CI and the Jenkins-based CI for Fedora rely on the Fedora message bus. We think that some messages get lost, which causes problems for users.

Please see this issue (sorry for the internal link):

https://issues.redhat.com/browse/OSCI-2908?focusedCommentId=19753369&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-19753369

Do we have any monitoring or stats for this service? We need it to be very reliable. We do not have these issues with the internal RH message bus. Maybe we could move it to AWS if the hosting turns out to be the problem?

Any insight would help, we are struggling, thanks!


I... don't think it's possible we are dropping messages. We are using a 3-node RabbitMQ cluster, and consumers have to ack that they have gotten the message...

But perhaps @abompard could speak more to that than I.

What messages exactly aren't showing? pagure? src.fedoraproject.org? both?

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 years ago

Hello! It is indeed unlikely that the message bus server is dropping messages. I've checked the logs and I didn't see anything abnormal there. It's possible however that the clients are at fault:
- the sender may have crashed before it managed to send the event. I would check the sender's logs for that (pagure?)
- when multiple receivers are connected to the same queue, they get messages in a round-robin fashion. Is it possible that another debug client is running somewhere and people forgot about it?
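
To illustrate the second point, here is a minimal sketch of round-robin dispatch on a shared queue. It is a plain Python simulation, not RabbitMQ client code, and the consumer names and message labels are made up for illustration. It assumes the broker hands each message to exactly one consumer in turn, which is how it would look from the outside if a forgotten client were attached to the same queue.

```python
from collections import defaultdict
from itertools import cycle


def dispatch_round_robin(messages, consumers):
    """Give each message to the next consumer in turn, as a shared queue would."""
    received = defaultdict(list)
    turn = cycle(consumers)
    for msg in messages:
        received[next(turn)].append(msg)
    return dict(received)


# Hypothetical names: one real CI listener and one forgotten debug client.
messages = [f"pr-{n}" for n in range(1, 7)]
result = dispatch_round_robin(messages, ["ci-listener", "forgotten-debug-client"])

for consumer, got in result.items():
    print(consumer, got)
```

In this sketch the CI listener only ever sees every other message, so from its side it looks as if half of the messages were lost even though the broker delivered all of them.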

You can go to https://apps.fedoraproject.org/datagrepper/ and verify that the message you expected has indeed been sent.
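
For a scripted check, something like the sketch below could build a datagrepper query URL. Note the assumptions: the `/raw` endpoint and the `topic`, `delta` and `rows_per_page` parameters are taken from the historical datagrepper API and may differ on the current deployment, and the topic name is only an example. The sketch only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Assumption: the classic "raw" query endpoint; check the datagrepper
# front page for the endpoints the current deployment actually offers.
DATAGREPPER_RAW = "https://apps.fedoraproject.org/datagrepper/raw"


def build_query_url(topic, delta_seconds, rows_per_page=20):
    """Build a URL asking for messages on `topic` from the last `delta_seconds`."""
    params = {
        "topic": topic,
        "delta": delta_seconds,
        "rows_per_page": rows_per_page,
    }
    return f"{DATAGREPPER_RAW}?{urlencode(params)}"


# Example topic only; substitute the topic of the message you expected.
url = build_query_url("org.fedoraproject.prod.pagure.pull-request.new", 3600)
print(url)
```

If the expected message shows up in the response, it was published on the bus and the problem is more likely on the consumer side.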

Anything more we can help with here? Do you know specific messages you never got?

Anything else we can do from this end?

@kevin @abompard thanks, so the message bus is fine. Can we also check the logs for the service which updates src.fedoraproject.org/rpms merge request statuses?

We do not own it and I am not sure where it is running nowadays. I believe @pingou will know ...

I think that's done in pagure itself.

Sadly, that message is too old to be in our current logs. ;(

;( @churchyard can you maybe update this ticket once you hit the problem again?

When the CI won't even start? Sure.

@churchyard yep, either it does not start or the results are not updated on the PR. And this is just Fedora CI; Zuul CI, afaik, is not involved here ...

Can you re-open this or file a new one when you see that happen? Thanks.

Metadata Update from @kevin:
- Issue close_status updated to: Insufficient data
- Issue status updated to: Closed (was: Open)

2 years ago

yeah, I believe we are moving away from the message bus and will just POST directly to resultsdb instead ... fewer systems involved = fewer problems ...
