It looks like dummy-test-package-gloster has failed gating tests since June 17, 2023. After that point, no updates of this "canary" package were pushed to stable. They all seem to have ended up stuck in "testing" purgatory due to failed gating tests, and were only obsoleted by subsequent builds of the package.
The last stable build is from about a year ago, as can be seen here:
https://bodhi.fedoraproject.org/updates/?search=&packages=dummy-test-package-gloster&status=stable
All builds since were stuck in "testing" and then obsoleted:
https://bodhi.fedoraproject.org/updates/?search=&packages=dummy-test-package-gloster
Should these automated builds either be fixed, or turned off? Who is responsible for the bot that submits these builds? All "bot" accounts are supposed to have a person who looks after them, but if things have literally been broken for a year, that obviously isn't happening here.
AFAICS, this package is actually intentionally rigged to fail gating, and has been since 2020.
I think what's gone wrong is that a bot is supposed to waive the failures, but that isn't happening any more. The last bot-filed waiver appears to be https://waiverdb.fedoraproject.org/api/v1.0/waivers/?subject_identifier=dummy-test-package-gloster-0-9131.fc37 , for https://bodhi.fedoraproject.org/updates/FEDORA-2022-b6216202e8 . Since then a couple of other updates went stable, but only because waivers were filed (presumably manually) by mattia and patrikp.
The bot-filed waivers always used the message "This is fine, we are testing the workflow", so searching for that string is probably the best way to find the bot and start to figure out what's wrong with it.
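A minimal sketch of that search, assuming the waiver records have already been fetched and that each one is a dict with "comment" and "username" keys. The sample records and usernames below are hypothetical, not real WaiverDB data:

```python
from __future__ import annotations

from typing import Iterable

# The fixed message the bot-filed waivers used, as quoted above.
BOT_COMMENT = "This is fine, we are testing the workflow"


def bot_waivers(waivers: Iterable[dict], comment: str = BOT_COMMENT) -> list[dict]:
    """Return only the waivers whose comment is exactly the bot's message."""
    return [w for w in waivers if w.get("comment") == comment]


# Hypothetical sample records, for illustration only.
sample = [
    {"username": "example-bot", "comment": BOT_COMMENT},
    {"username": "example-human", "comment": "Waiving manually"},
]

for waiver in bot_waivers(sample):
    print(waiver["username"])
```

Grouping the matches by username should then point at whichever account the bot runs as.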
Took me a bit of poking around, but https://pagure.io/fedora-ci/monitor-gating seems to be the codebase.
Manually running the command that waive_update should run (bodhi updates waive FEDORA-2024-92d23d0013 "This is fine, we are testing the workflow" --debug) worked, so it's not that the syntax has gone out of date or anything.
I'll look into it more tomorrow, but at a guess, either this thing is trying to use user/password auth and that doesn't work any more, or its token has gone stale, or for some reason it's not reaching the waive code any more.
Tagging @zlopez and @patrikp , who seem to have touched this thing most recently (other than nirik). Will also poke the CI team on chat.
Metadata Update from @patrikp: - Issue assigned to patrikp
Ah, I do see this from nirik in 2023: https://pagure.io/fedora-infra/ansible/c/39ecc928f0734813733d19175c0964c2f8752cea?branch=main . That might not have worked as expected.
I'm not tagging him ATM because he's on PTO, it'd be best if we can figure it out without bothering him.
Metadata Update from @phsmoura: - Issue tagged with: medium-gain, medium-trouble, ops
Aha. Well, I figured out how to look at the monitor-gating logs, and...as it turns out, I'm pretty sure I broke this!
That changed the Bodhi UpdateReadyForTesting message (bodhi.update.status.testing.koji-build-group.build.complete), reducing the amount of stuff in its artifact dict to just the stuff strictly needed by Fedora CI, but it turns out monitor-gating used it too. So we're failing in utils.lookup_results_datagrepper because we're expecting the artifact dict to have an id key, which it does not any more. That happens before utils.waive_update, so we never do that.
I'll send a PR to fix this (and any other use of the artifact dict I can find in the code).
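To illustrate the failure mode (this is not the actual monitor-gating code or the content of the PR), here is a sketch under stated assumptions. The function names are hypothetical stand-ins: the strict version indexes artifact["id"] and so never reaches the waive step when the key is gone, while the tolerant version skips the lookup and still waives.

```python
def lookup_results(artifact: dict) -> str:
    # Hypothetical stand-in for the lookup step; it assumes an "id" key exists.
    return f"results for {artifact['id']}"


def process_strict(artifact: dict, waive) -> None:
    lookup_results(artifact)  # raises KeyError when "id" is missing
    waive()                   # never reached in that case


def process_tolerant(artifact: dict, waive) -> None:
    # Only do the lookup when the key it needs is actually present.
    if "id" in artifact:
        lookup_results(artifact)
    waive()


calls = []
try:
    process_strict({}, lambda: calls.append("strict"))
except KeyError:
    pass
process_tolerant({}, lambda: calls.append("tolerant"))
print(calls)
```

Only the tolerant call records a waive, which matches the symptom: the exception happens first, so the waiver is never filed.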
Metadata Update from @adamwill: - Issue untagged with: medium-gain, medium-trouble, ops
gahhh, pagure...
Metadata Update from @adamwill: - Issue tagged with: medium-gain, medium-trouble, ops
https://pagure.io/fedora-ci/monitor-gating/pull-request/47 should fix this.
Doesn't seem to have worked. https://bodhi.fedoraproject.org/updates/?packages=dummy-test-package-gloster
I merged the PR and ran this playbook: https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/monitor_gating.yml
Is there anything else that should have been done?
Odd. That should be all that's required as far as I know. Did you run the playbook right after merging the PR? There's a small delay before it syncs to batcave01... but it's like 15-20 seconds.
I reran the playbook yesterday (just to be sure) and it hasn't fixed the issue, unfortunately.
(Posted this to https://pagure.io/fedora-ci/monitor-gating/pull-request/47#comment-205935 instead of here accidentally :sweat_smile:)
I added some comments in https://pagure.io/fedora-ci/monitor-gating/issue/46 after looking through some of the code. Removing --user and --password seems straightforward, but I'm not clear on how bodhi expects its kerberos information. The code looks like it may be either a configuration file on the host OR it may require a preexisting ticket to forward. I wasn't able to find docs around what was expected, but it's highly likely I just missed them. Any pointers would be appreciated 🙏
@pingou Hello. :wave: Would you happen to have any idea about how this could be fixed?
So monitor-gating uses a keytab for getting a Kerberos ticket.
I rsh'd into the running pod and ran:
$ kinit -k -t /etc/keytabs/monitor-gating-keytab packagerbot/os-control01.iad2.fedoraproject.org@FEDORAPROJECT.ORG
And when I run klist I can see a valid Kerberos ticket.
Based on this I conclude that the keytab itself is fine and we aren't any closer to figuring out what's wrong.
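The same check can be scripted. A minimal sketch, assuming MIT Kerberos's klist is on the PATH: its -s flag runs silently and exits non-zero when the credential cache cannot be read or has expired. The function name and the injectable run parameter are illustrative choices, not part of monitor-gating.

```python
import subprocess


def has_valid_ticket(run=subprocess.run) -> bool:
    """Return True if `klist -s` reports a readable, unexpired credential cache."""
    try:
        result = run(["klist", "-s"], check=False)
    except FileNotFoundError:
        # klist is not installed in this environment.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("valid ticket" if has_valid_ticket() else "no valid ticket")
```

Note that this only confirms a ticket exists locally. It says nothing about whether the service on the other end accepts that principal.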
Two tickets worth noting that may or may not be related. It looks like more people are having issues waiving failed tests: 1) https://pagure.io/fedora-infrastructure/issue/12073 2) https://github.com/release-engineering/waiverdb/issues/219
But both were opened relatively recently, and monitor-gating has been unable to waive failed tests for many months now.
For the heck of it I checked F38 and F40 code for differences to see if maybe we were using an older version of the client. There are differences but they shouldn't have any effect:
$ diff -ur bodhi_client-8.0.0 bodhi_client-8.1.0
diff -ur bodhi_client-8.0.0/PKG-INFO bodhi_client-8.1.0/PKG-INFO
--- bodhi_client-8.0.0/PKG-INFO 1969-12-31 19:00:00.000000000 -0500
+++ bodhi_client-8.1.0/PKG-INFO 1969-12-31 19:00:00.000000000 -0500
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: bodhi-client
-Version: 8.0.0
+Version: 8.1.0
 Summary: Bodhi client
 Home-page: https://bodhi.fedoraproject.org/
 License: GPL-2.0-or-later
diff -ur bodhi_client-8.0.0/pyproject.toml bodhi_client-8.1.0/pyproject.toml
--- bodhi_client-8.0.0/pyproject.toml 2023-12-09 09:51:40.000000000 -0500
+++ bodhi_client-8.1.0/pyproject.toml 2024-04-09 07:57:53.241559300 -0400
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "bodhi-client"
-version = "8.0.0"
+version = "8.1.0"
 description = "Bodhi client"
 authors = ["Fedora Infrastructure team"]
 maintainers = ["Fedora Infrastructure Team <infrastructure@lists.fedoraproject.org>"]
diff -ur bodhi_client-8.0.0/setup.py bodhi_client-8.1.0/setup.py
--- bodhi_client-8.0.0/setup.py 1969-12-31 19:00:00.000000000 -0500
+++ bodhi_client-8.1.0/setup.py 1969-12-31 19:00:00.000000000 -0500
@@ -20,7 +20,7 @@
 setup_kwargs = {
     'name': 'bodhi-client',
-    'version': '8.0.0',
+    'version': '8.1.0',
     'description': 'Bodhi client',
     'long_description': 'None',
     'author': 'Fedora Infrastructure team',
Looking through the bodhi client code itself, I was able to trace the authentication calls from bindings.py into oidcclient.py. The reused method names are a little confusing, but the calls should flow without issue for kerberos use. use_kerberos ends up True for our use case through bindings.py, which is what we want. Specifically, send_request calls it when we waive.
We know we have a valid kerberos ticket available. We see it in the log output during the run, and @patrikp did a manual kinit to test, but the bodhi client fails when attempting to use it with the bodhi service.
Since we are no longer using --user/--password calls with the bodhi client and are relying only on kerberos, some questions come to mind:
- Should we check this account's privileges to ensure nothing was accidentally removed?
- Do we see anything in the bodhi logs that could provide some insight?
- Have we verified there isn't clock drift? (long shot, but kerberos doesn't like drift!)
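On the clock drift question, a rough sketch of the check, assuming the MIT Kerberos default tolerance (clockskew) of 300 seconds. How the reference time is obtained (for example from the KDC host or an NTP source) is left out, and the timestamps below are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# MIT Kerberos default clockskew; a site may configure a different value.
MAX_SKEW = timedelta(seconds=300)


def within_skew(local: datetime, reference: datetime,
                max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if the two clocks differ by no more than max_skew."""
    return abs(local - reference) <= max_skew


# Hypothetical timestamps: the pod clock is 7 minutes behind the reference.
reference = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
pod_clock = reference - timedelta(minutes=7)
print(within_skew(pod_clock, reference))
```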
@abompard I saw you were part of helping make the authentication changes in the bodhi client. Do you have any ideas of where to look next?
There's a small typo in the deploymentconfig: the path to the keytab is wrong. I'll fix it. But that's not the only problem, because I still get a kerberos auth failure when I try to run the bodhi client in the pod:
kerberos.GSSError: (('Invalid token was supplied', 589824), ('Success', 100001))
There's a sort of error in ipsilon as well:
GSS ERROR gss_localname() failed: [The operation or option is not available or unsupported (No such file or directory)]
So maybe there's something wrong in ipsilon as well. I'll have a look.
So, the keytab belongs to the service packagerbot/os-control01.iad2.fedoraproject.org@FEDORAPROJECT.ORG. I am not sure that Ipsilon accepts that as an authentication, because it's not a user, so it may not find it.
We could use a keytab for the packagerbot user instead, maybe?
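For anyone following along, the difference between the two principals is the instance component. A small sketch that splits a principal name of the form primary/instance@REALM. It is a simplification: it does not handle escaped characters or multi-component names, and having an instance is only a convention for service principals, not a guarantee.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str
    instance: Optional[str]
    realm: str


def parse_principal(name: str) -> Principal:
    """Split 'primary[/instance]@REALM' into its parts (simplified)."""
    rest, _, realm = name.rpartition("@")
    if not rest or not realm:
        raise ValueError(f"not a principal with a realm: {name!r}")
    primary, _, instance = rest.partition("/")
    return Principal(primary, instance or None, realm)


service = parse_principal(
    "packagerbot/os-control01.iad2.fedoraproject.org@FEDORAPROJECT.ORG"
)
user = parse_principal("packagerbot@FEDORAPROJECT.ORG")
print(service.instance, user.instance)
```

The first has an instance (the host name) and the second has none, which is the distinction Ipsilon may be tripping over.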
Interesting! @patrikp do you know if that's possible or have an idea of who may be able to answer that question?
There's no automated way to get a user keytab in our ansible yet, but I can make a role based on the service keytab we already have. In fact, I'll do that, it may end up useful anyway.
It's apparently not so easy to retrieve a user keytab, except by logging in with kerberos as that user, but we have to be very careful with this on Openshift because last I heard the default location for kerberos credentials was not isolated between containers. Also, in bodhi the update is created by the packagerbot/os-control01.iad2.fedoraproject.org user. Those updates are automatically created because the build is done in Rawhide, and Bodhi's consumer just takes koji's build owner as the user. So we should probably keep using the service keytab and find a way to make Bodhi accept it. Not sure exactly how. I'll think about it, but I'm happy to hear other ideas or suggestions.
Small update on where my thinking brought me: we could teach Bodhi to log in with Kerberos in addition to OIDC (pretty much like Copr). But we either need to have Apache with mod_auth_gssapi in front of it, which would be going back to the system we had pre-2020 when Bodhi was running in Apache with mod_wsgi, or we need to use a WSGI middleware to do Kerberos auth. There's one here but it's using the outdated python-kerberos library. Thankfully someone has recently sent a PR to upgrade to python-gssapi. So I'm going to test that and help get it to a mergeable state. Then add it to bodhi and add a login-gssapi endpoint. And then teach bodhi-client to use it. As you can see, it's a lot of steps. Any other simpler idea?
Huh. I don't recall hearing that kerberos credentials are not isolated. I would think they sure would be.
Could we use an OIDC token here? Like wikitcms uses for openqa or the kerneltest cli uses?
Yeah we could totally use an OIDC token, that would make things much easier :-) But those expire at some point, even though I think it's a long time (a year, no?)
You can specify the expiry when you make the token. We could set it for 5 years? or just leave it with 1 year so we know that it needs renewed / checking from time to time?
Sounds reasonable to me to use an OIDC token. Maybe split the difference and have a time frame that is a little longer than our renewal cadence to allow a little wiggle room just in case.
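As a sketch of that arithmetic: pick an expiry equal to the renewal cadence plus a margin, and flag the token for renewal once the cadence has elapsed. The 365-day cadence and 30-day margin are placeholder assumptions, not agreed values.

```python
from datetime import date, timedelta


def token_expiry(issued: date, renew_every_days: int = 365,
                 margin_days: int = 30) -> date:
    """Expiry to request: the renewal cadence plus some wiggle room."""
    return issued + timedelta(days=renew_every_days + margin_days)


def renewal_due(today: date, issued: date, renew_every_days: int = 365) -> bool:
    """True once the renewal cadence has elapsed, before the token expires."""
    return today >= issued + timedelta(days=renew_every_days)


issued = date(2024, 1, 1)  # hypothetical issue date
print(token_expiry(issued))
print(renewal_due(date(2024, 6, 1), issued))
```

With this shape, a missed renewal still leaves the margin before anything actually breaks.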
Any progress on ideas here?
I've been thinking about this some more and was wondering... Do we even need monitor-gating?
Its goal [1] states: "This project contains a script to monitor the health of gating in Fedora." But the waiving has been broken for a long time and we are still functioning. I wonder if anyone ever actually looks at it.
In other words, what would be some of the downsides of decommissioning it?
[1] https://pagure.io/fedora-ci/monitor-gating
That's a really good question. If the system failing for such a long time didn't cause problems, there isn't a clear owner/maintainer, documentation is sparse, and it's unclear whether this is critical or helpful for anyone, then the time could be better spent elsewhere once it is decommissioned.
What do people think? Is anyone relying on this at this point?
It's definitely a good question!
I think it was important at first to make sure the entire flow was working, but now that it's used by lots of people an automated test may not be as useful. Also, yeah, no one is really watching it/fixing it.
I guess I'd be ok retiring it at this point.
If we don't get dissenting points of view on this by the end of next week let's look at retiring the service.
Sounds ok to me. We might want to note it on the infrastructure list or in discussion to see if anyone knows something we missed?
What would be the most appropriate section of Discussion to post it in? Project Discussion?
https://discussion.fedoraproject.org/c/project/7
Yes, I think so and tag with '#infra-team' I guess?
Here's the Discussion thread: https://discussion.fedoraproject.org/t/retirement-of-monitor-gating/130577
The retirement of monitor-gating is now tracked in this issue: https://pagure.io/fedora-infrastructure/issue/12202
Metadata Update from @patrikp: - Issue close_status updated to: Can't Fix - Issue status updated to: Closed (was: Open)