#11829 Recurring issue that Bugzilla stops sending messages to fedora-messaging
Closed: Fixed with Explanation 2 months ago by kevin. Opened 10 months ago by frostyx.

This issue appears several times a week. For some reason Bugzilla randomly stops sending messages to fedora-messaging and the Fedora Infra team needs to "poke it", which probably means redeploying Bugzilla2fedmsg in OpenShift.

Describe what you would like us to do:

Is this issue fixable? Or can we bandage it with an automatic daily redeploy of Bugzilla2fedmsg?

When do you need this to be done by? (YYYY/MM/DD)

At your convenience, but the sooner the better. I get pinged at least once a week and then I need to ping you to fix it.


Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

10 months ago

I was looking at the pod today and there isn't anything helpful in the logs. So I'm not sure what actually causes bugzilla2fedmsg to stop processing.

Yeah, it's pretty odd. It used to be pretty reliable, but now it seems to get stuck pretty often. ;(

Perhaps @abompard has some ideas here?

I have restarted bugzilla2fedmsg with debug logging; hopefully it'll tell us what's going on next time it gets stuck.

I've had to restart it a few times... in the logs it looked like it was getting heartbeat messages back and forth, but nothing else. ;(

So, I don't think the debugging as it is now is going to tell us much.

This is happening again :-/
Can anybody poke it, please?

Done. Would it help if we lowered the threshold for the nagios alert?

I think so. I am not sure what the nagios alert threshold is, but when we talked about this issue the last time it occurred multiple times a day, so I guess nagios needs to check on a scale of hours?

Right now it's set in roles/nagios_client/templates/check_datanommer_history.cfg.j2

command[check_datanommer_bugzilla]={{libdir}}/nagios/plugins/check_datanommer_timesince.py bugzilla 86400 259200

so, warning is 24 hours and critical is 72 hours.

Note that it's a balancing act: if we set it too low, a legitimate gap in bug changes could trigger a false alarm, but if we set it too high we miss when messages actually stop flowing.
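For reference, this is roughly what an age-based check like that boils down to. The sketch below is not the real check_datanommer_timesince.py plugin (it goes over HTTP to datagrepper, and the endpoint and response field names are assumptions), but it shows how thresholds like 86400/259200 map onto Nagios warning/critical exit codes:

```python
#!/usr/bin/env python3
# Illustrative sketch only -- NOT the real check_datanommer_timesince.py.
# It fetches the newest message of a category from datagrepper and maps its
# age onto Nagios exit codes (0=OK, 1=WARNING, 2=CRITICAL).
import sys
import time

import requests

# Assumed datagrepper endpoint and response layout.
DATAGREPPER = "https://apps.fedoraproject.org/datagrepper/raw"


def newest_message_age(category):
    resp = requests.get(
        DATAGREPPER,
        params={"category": category, "rows_per_page": 1},
        timeout=30,
    )
    resp.raise_for_status()
    messages = resp.json().get("raw_messages", [])
    if not messages:
        return float("inf")
    return time.time() - messages[0]["timestamp"]


def main():
    # e.g. ./check_age.py bugzilla 86400 259200
    category, warn, crit = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
    age = newest_message_age(category)
    if age >= crit:
        print("CRITICAL: last %s message is %.0fs old" % (category, age))
        sys.exit(2)
    if age >= warn:
        print("WARNING: last %s message is %.0fs old" % (category, age))
        sys.exit(1)
    print("OK: last %s message is %.0fs old" % (category, age))
    sys.exit(0)


if __name__ == "__main__":
    main()
```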

Do you think we can solve this caveman style and set a timer to restart the Bugzilla2fedmsg service every hour? At this point it sounds like the least painful solution to me.
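If we go that route, the crude version is just a cron job or systemd timer that rolls the deployment. A minimal sketch, assuming the app runs as an OpenShift deployment named bugzilla2fedmsg (the namespace and deployment names here are guesses, not the real values from ansible) and that `oc` is already authenticated:

```python
#!/usr/bin/env python3
# Caveman-style workaround sketch: restart bugzilla2fedmsg on a schedule
# (run hourly from cron or a systemd timer). Namespace and deployment names
# are assumptions.
import subprocess

NAMESPACE = "bugzilla2fedmsg"
DEPLOYMENT = "bugzilla2fedmsg"

subprocess.run(
    ["oc", "-n", NAMESPACE, "rollout", "restart", "deployment/" + DEPLOYMENT],
    check=True,
)
```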

I restarted it again today, so I think a restart every hour would probably be a good solution.

I restarted bugzilla2fedmsg; hopefully that helps again.

I've restarted it a bunch of times in the past few weeks. It's not working.

I wonder if it's something internal? We may need to try to find someone on the other side to help us debug...

The same is happening for both staging and production. I can look into it when I have some spare time.

Metadata Update from @zlopez:
- Issue assigned to zlopez

5 months ago

I sent an e-mail to bugzilla-owner@redhat.com describing the issue we have; hopefully I will get a response from them.

Thank you both @kevin and @zlopez for trying to fix this. It is much appreciated as bugzilla2fedmsg is very important to me.

It was confirmed in the ticket that the issue is on the Bugzilla side. Waiting for them to fix it.

Metadata Update from @zlopez:
- Issue priority set to: Waiting on External (was: Waiting on Assignee)

5 months ago

I think this is fixed. A bunch of messages came through this afternoon.

Can you confirm?

Oh, and @zlopez could you ask them about bugzilla.stage too?

I asked them about both in the ticket. Will check the ticket today.

They confirmed that production is fixed and I asked them to check bugzilla.stage as well.

> I think this is fixed. A bunch of messages came through this afternoon.
> Can you confirm?

I can confirm, Fedora Review Service successfully triggered a Copr build from my Bugzilla comment.

Something happened again; the last Bugzilla message I see in datagrepper is ca192b40-cbf0-46e3-9666-6b8fb75f9c09 (from 2024-08-26T23:23:49+00:00).

The STG instance works fine.

Not sure if it is an issue with Bugzilla itself, or bugzilla2fedmsg.
Could you please poke it again?

Also, IIRC bugzilla2fedmsg uses some private auth keys to listen to Bugzilla messages because the messaging bus isn't publicly available. Do you think you could give me those auth keys? At least for the STG instance but ideally for production as well? I'd like to run my own instance for debugging purposes.

I restarted it yesterday after seeing this (forgot to note that here).

I can't really give you our certs, but I can point you at how to get your own. ;)
I've done that in a direct message (since it's an internal document).
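For anyone wanting to run their own listener once they have certs, the general shape is something like the sketch below (assuming stomp.py >= 8; the broker host, port, queue name, and certificate paths are placeholders, not the real values from the internal doc):

```python
# Sketch of a standalone STOMP listener, roughly the shape of what
# bugzilla2fedmsg's consumer does. Broker, queue and cert paths below are
# placeholders.
import time

import stomp

BROKER = ("umb.example.redhat.com", 61612)   # placeholder host/port
QUEUE = "/queue/your.subscription.queue"     # placeholder queue name


class DebugListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # In stomp.py 8.x the callback receives a Frame with .headers/.body.
        print(frame.headers.get("message-id"), frame.body[:200])

    def on_error(self, frame):
        print("Broker error:", frame.body)


conn = stomp.Connection([BROKER], heartbeats=(10000, 10000))
conn.set_ssl(
    for_hosts=[BROKER],
    cert_file="client-cert.pem",   # placeholder cert/key paths
    key_file="client-key.pem",
)
conn.set_listener("debug", DebugListener())
conn.connect(wait=True)
conn.subscribe(destination=QUEUE, id="debug-1", ack="auto")

while True:
    time.sleep(60)
```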

I see this in the logs for bugzilla2fedmsg:

2024-09-02 11:51:38,463 [bugzilla2fedmsg.consumer ERROR] Exception when relaying the message:
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.9/site-packages/fasjson_client/response.py", line 40, in __call__
    call_result = self.operation(**kwargs).response().result
  File "/opt/app-root/lib64/python3.9/site-packages/bravado/http_future.py", line 200, in response
    swagger_result = self._get_swagger_result(incoming_response)
  File "/opt/app-root/lib64/python3.9/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/app-root/lib64/python3.9/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
    unmarshal_response(
  File "/opt/app-root/lib64/python3.9/site-packages/bravado/http_future.py", line 353, in unmarshal_response
    raise_on_expected(incoming_response)
  File "/opt/app-root/lib64/python3.9/site-packages/bravado/http_future.py", line 420, in raise_on_expected
    raise make_http_exception(
bravado.exception.HTTPUnauthorized: 401 UNAUTHORIZED

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/src/bugzilla2fedmsg/consumer.py", line 97, in consume
    self.relay.on_stomp_message(body, frame.headers)
  File "/opt/app-root/src/bugzilla2fedmsg/relay.py", line 63, in on_stomp_message
    message_body = self._get_message_body(body, headers)
  File "/opt/app-root/src/bugzilla2fedmsg/relay.py", line 134, in _get_message_body
    agent_name = email_to_fas(event["user"]["login"], self._fasjson)
  File "/opt/app-root/src/bugzilla2fedmsg/utils.py", line 45, in email_to_fas
    results = fasjson.search(rhbzemail=email).result
  File "/opt/app-root/lib64/python3.9/site-packages/fasjson_client/response.py", line 42, in __call__
    raise APIError.from_bravado_error(e)
fasjson_client.errors.APIError: Credential lifetime has expired

It seems that our credentials expired.

That's... strange...

I rolled out new pods this morning and it seems like it's processing normally?

Was this staging or prod?

This was the production deployment, but if the issue isn't there anymore I think we can close this ticket.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 months ago

Feel free to reopen the ticket if the issue happens again.

Thank you very much for the fix. I can confirm it worked for a while (I got this message 2652d7eb-991e-4562-8a4f-1616a9c9f650) but it looks dead again :-/

I just checked the bugzilla2fedmsg pod in OpenShift and according to the logs it works fine:

2024-09-04 14:30:33,897 [bugzilla2fedmsg.consumer DEBUG] Received message on STOMP with ID ID:umb-prod-2.umb-001.prod.us-east-1.aws.redhat.com-34515-1724356741325-5:186988:-1:1:1083
2024-09-04 14:30:33,897 [bugzilla2fedmsg.relay DEBUG] DROP: message has no object field. Non public.
2024-09-04 14:30:38,360 [bugzilla2fedmsg.consumer DEBUG] Sending heartbeat

It seems that there is still some issue with the production instance. It hasn't sent messages since yesterday; the last received message failed with this error:

2024-09-05 07:51:37,777 [bugzilla2fedmsg.consumer DEBUG] Received message on STOMP with ID ID:umb-prod-2.umb-001.prod.us-east-1.aws.redhat.com-34515-1724356741325-5:204563:-1:1:297
2024-09-05 07:51:37,777 [bugzilla2fedmsg.utils DEBUG] Looking for a FAS user with rhbzemail = christian.rohrer@switch.ch
2024-09-05 07:51:37,931 [bugzilla2fedmsg.consumer ERROR] Exception when relaying the message:
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.9/site-packages/fasjson_client/response.py", line 40, in __call__
    call_result = self.operation(**kwargs).response().result
  File "/opt/app-root/lib64/python3.9/site-packages/bravado/http_future.py", line 200, in response
    swagger_result = self._get_swagger_result(incoming_response)
  File "/opt/app-root/lib64/python3.9/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/app-root/lib64/python3.9/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
    unmarshal_response(
  File "/opt/app-root/lib64/python3.9/site-packages/bravado/http_future.py", line 353, in unmarshal_response
    raise_on_expected(incoming_response)
  File "/opt/app-root/lib64/python3.9/site-packages/bravado/http_future.py", line 420, in raise_on_expected
    raise make_http_exception(
bravado.exception.HTTPUnauthorized: 401 UNAUTHORIZED

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/src/bugzilla2fedmsg/consumer.py", line 97, in consume
    self.relay.on_stomp_message(body, frame.headers)
  File "/opt/app-root/src/bugzilla2fedmsg/relay.py", line 63, in on_stomp_message
    message_body = self._get_message_body(body, headers)
  File "/opt/app-root/src/bugzilla2fedmsg/relay.py", line 134, in _get_message_body
    agent_name = email_to_fas(event["user"]["login"], self._fasjson)
  File "/opt/app-root/src/bugzilla2fedmsg/utils.py", line 45, in email_to_fas
    results = fasjson.search(rhbzemail=email).result
  File "/opt/app-root/lib64/python3.9/site-packages/fasjson_client/response.py", line 42, in __call__
    raise APIError.from_bravado_error(e)
fasjson_client.errors.APIError: Credential lifetime has expired

After multiple failures it stopped receiving messages; only the heartbeat messages are visible.
It's hard to tell if the same is happening on staging, as there isn't much traffic on the staging instance (I didn't see any message in the pod logs, just heartbeats).

I restarted the pod for now, but I'm reopening this issue as it seems that there is still something left to fix.

EDIT: After restart it's processing the messages without issue.

Metadata Update from @zlopez:
- Issue status updated to: Open (was: Closed)

4 months ago

Ah, I wonder if its session to fasjson for looking up users is expiring there?
Like it needs to catch this and re-auth/reconnect?

Yeah, it's likely that the ticket it got using the keytab expires and it does not renew it. I'll have a look.
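Until a proper fix lands, one possible workaround shape (just a sketch of the idea, not what bugzilla2fedmsg currently does; it assumes the keytab-based ccache is kept fresh by something like gssproxy or k5start, so a freshly built client picks up a valid ticket) would be to catch the expiry and rebuild the fasjson client before retrying:

```python
from fasjson_client import Client
from fasjson_client.errors import APIError

FASJSON_URL = "https://fasjson.fedoraproject.org"  # assumed URL


class FasjsonLookup:
    """Wrap the rhbzemail search so an expired credential triggers one
    reconnect-and-retry instead of failing every message that needs it."""

    def __init__(self, url=FASJSON_URL):
        self._url = url
        self._client = Client(url)

    def search_by_bz_email(self, email):
        try:
            return self._client.search(rhbzemail=email).result
        except APIError:
            # e.g. "Credential lifetime has expired" (HTTP 401): rebuild the
            # client, which is assumed to pick up a renewed ticket, and retry.
            self._client = Client(self._url)
            return self._client.search(rhbzemail=email).result
```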

I noticed that the pod isn't running, so I started it again.

On the standup I found out that @abompard is working on that, so I disabled the pod again.

No, it's still happening; I'm testing things and moving forward with the diagnosis. The latest update is in this ticket: https://github.com/gssapi/mod_auth_gssapi/issues/316

Yeah, I saw this / the alerts... but there's a debug pod running and I wasn't sure if just deleting that would start a new one or mess up some debugging that @abompard has in place.

Hopefully he can look in the morning when he gets in.

Yep, that was me, sorry. On the plus side, I think I finally found a way to fix/work around this issue. It would require an update of fasjson (easy) and fasjson-client (not so easy, since it means updating all the apps that use it).
I still think the proper fix should be in mod_auth_gssapi, so I'll follow up in the upstream ticket.

Metadata Update from @zlopez:
- Assignee reset

3 months ago

Metadata Update from @zlopez:
- Issue assigned to abompard

3 months ago

Sounds like an upstream solution was found... what do we need to do to get this deployed?

I think I deployed the fix already, so it should work now.

ok, great! Then I guess we can close this now? If it happens again let us know!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

2 months ago
