#7493 The fedmsg relay on hub.fedoraproject.org is dead
Closed: Fixed 5 years ago by kevin. Opened 5 years ago by jcline.

  • Describe what you need us to do:

Something is broken with the fedmsg relay publishing to hub.fedoraproject.org:9940. It accepts connections, but no messages are being published.

I assume it's set up to connect to all the endpoints in infrastructure and relay their messages, so perhaps it's not actually connected. Apparently it broke yesterday.

  • When do you need this? (YYYY/MM/DD)
    ASAP

  • When is this no longer needed or useful? (YYYY/MM/DD)
    When we don't use fedmsg anymore

  • If we cannot complete your request, what is the impact?
    Infra tools relying on hub.fedoraproject.org miss messages (Jenkins CI among them).


At least three other people were complaining about this issue on IRC. One of them said this started to happen on Tuesday.

Metadata Update from @mizdebsk:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

5 years ago

Is there a possibility to monitor this kind of issue in the future? Downstream we have prometheus.io + alertmanager, for example, and a semaphore tool (not sure if it is open source yet, using cachet.io).

Maybe https://status.fedoraproject.org/ checks could be improved?

This seems to be exactly the same as https://pagure.io/fedora-infrastructure/issue/7424, only on prod, not staging. I reported this same problem for the staging instance last month; it never got fixed, and it has now spread to prod. :/

I think there is a kind of dumb workaround you can do on any system that consumes fedmsgs and does stuff: just restart the fedmsg-hub service (or fedmsg-hub-3 if that's what you use). The messages do seem to be reaching datagrepper, so when you restart fedmsg-hub, its 'get a message backlog from datagrepper and handle any messages we have not yet seen' mechanism will kick in and it will 'see' all the messages it missed (at least, going back as far as the backlog retrieval mechanism goes, anyway).
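
For anyone who wants to check by hand what a consumer missed, here is a minimal sketch of the "fetch a backlog and keep only unseen messages" idea. It is an illustration, not how fedmsg-hub itself implements its backlog retrieval. The datagrepper base URL, the query parameter names (`delta`, `rows_per_page`, `page`, `order`), the `raw_messages` response key, and all function names are assumptions for this example and should be checked against the datagrepper documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the datagrepper "raw" endpoint.
DATAGREPPER_RAW = "https://apps.fedoraproject.org/datagrepper/raw"


def backlog_url(delta_seconds, rows_per_page=100, page=1):
    """Build a query URL for messages from the last `delta_seconds` seconds."""
    query = urlencode({
        "delta": delta_seconds,          # assumed parameter: look-back window
        "rows_per_page": rows_per_page,  # assumed parameter: page size
        "page": page,
        "order": "asc",                  # oldest first, so replay is in order
    })
    return f"{DATAGREPPER_RAW}?{query}"


def unseen(messages, seen_ids):
    """Return messages whose msg_id is not in seen_ids, preserving order."""
    return [m for m in messages if m.get("msg_id") not in seen_ids]


def fetch_backlog(delta_seconds, seen_ids):
    """Fetch one page of recent messages and keep the unseen ones (network)."""
    with urlopen(backlog_url(delta_seconds), timeout=30) as resp:
        payload = json.load(resp)
    # Assumption: the message list lives under the "raw_messages" key.
    return unseen(payload.get("raw_messages", []), seen_ids)
```

Calling `fetch_backlog(3600, seen_ids)` with the set of message IDs the consumer has already handled would return the messages from the last hour that it has not processed. Only the first page is fetched here; a real catch-up would have to walk all pages.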

OK, I am pretty sure I have this fixed. Please re-open if anyone still sees issues.

As for monitoring, we will have to look at an external consumer check somehow. All our existing checks were fine because everything was running and processing correctly; haproxy just wasn't looking in the right place for the messages, so they never went out from there. ;(
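
One possible shape for such an external consumer check, as a rough sketch only: subscribe to the public relay from outside the infrastructure, count what arrives in a time window, and report a Nagios-style status. It assumes pyzmq is installed and that the relay is a plain ZeroMQ PUB socket at the endpoint from the ticket description. The 300-second window and the names `classify`, `count_messages`, and `main` are made up for this example.

```python
import time

# Endpoint taken from the ticket description.
RELAY_ENDPOINT = "tcp://hub.fedoraproject.org:9940"

OK, CRITICAL = 0, 2  # Nagios-style exit codes


def classify(message_count, window_seconds):
    """Map a message count to an (exit_code, status_text) pair."""
    if message_count > 0:
        return OK, f"OK: {message_count} message(s) in {window_seconds}s"
    return CRITICAL, f"CRITICAL: no messages in {window_seconds}s"


def count_messages(endpoint, window_seconds):
    """Subscribe to every topic and count messages until the window ends."""
    import zmq  # pyzmq, imported lazily so classify() works without it

    ctx = zmq.Context()
    sock = ctx.socket(zmq.SUB)
    sock.setsockopt(zmq.SUBSCRIBE, b"")  # empty prefix: all topics
    sock.connect(endpoint)
    deadline = time.monotonic() + window_seconds
    count = 0
    try:
        while True:
            remaining_ms = int((deadline - time.monotonic()) * 1000)
            if remaining_ms <= 0:
                break
            # poll() returns 0 on timeout, non-zero when a message is ready.
            if sock.poll(remaining_ms):
                sock.recv_multipart()
                count += 1
    finally:
        sock.close(linger=0)
        ctx.term()
    return count


def main(window_seconds=300):
    """Run the check once; returns the exit code for the monitoring system."""
    code, text = classify(count_messages(RELAY_ENDPOINT, window_seconds),
                          window_seconds)
    print(text)
    return code
```

A wrapper would call `sys.exit(main())` from cron or a monitoring plugin. The point of running it externally is that it exercises the same path real consumers use, including haproxy, so a relay that accepts connections but publishes nothing would show up as CRITICAL. The window needs to be long enough that a quiet bus is not mistaken for a dead one.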

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 years ago
