#105 Consider not reporting auto-restarted failures to resultsdb
Closed: Fixed 2 months ago by adamwill. Opened a year ago by adamwill.

Talking to @catanzaro, it's clear the experience for a packager when an update test fails and is auto-restarted is not the best. See https://bodhi.fedoraproject.org/updates/FEDORA-2023-f954593595 . The update's gating status goes to 'failed' on the initial failure, then back to 'waiting' a while later (I'm not entirely sure what causes the delay there and will have to look into it again; it may only get recalculated the next time a test passes), then eventually to 'passed'.

Arguably, since we know we're retrying the test, the status should not go to 'failed' at all, but should stay at 'waiting'. Achieving that is a bit tricky, though. I don't want to hack ugly special-casing into greenwave or bodhi. Possibly the best idea is, when a test fails and we know it's going to be retried, simply not to report the failure to resultsdb at all. That should keep the gating status from changing until the restarted test completes.
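To make the idea concrete, here's a minimal sketch of the proposed rule. All the names (maybe_report, will_be_retried, report_to_resultsdb) are hypothetical stand-ins for illustration, not the reporter's actual code; the only grounded detail is that retry candidates carry the RETRY variable in their settings.

```python
def will_be_retried(job: dict) -> bool:
    """Hypothetical check for whether openQA will auto-restart this job."""
    return bool(job.get("settings", {}).get("RETRY"))


def report_to_resultsdb(job: dict) -> None:
    """Stand-in for the real resultsdb submission."""
    print(f"reporting {job['result']} for job {job['id']}")


def maybe_report(job: dict) -> None:
    # Skipping the FAILED report means greenwave never recomputes the
    # gating status on the initial failure: it stays at 'waiting' until
    # the restarted job reports PASSED (or fails for real).
    if job["result"] == "failed" and will_be_retried(job):
        return
    report_to_resultsdb(job)
```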

I need to think through a bit whether this is viable and what the possible consequences are, though.


Also, the test shows as 'failed' in the Automated Results table all the way up until the restart completes.

I just came back to looking at this (it's been on my mind for a while, but I haven't had time). I was thinking of going back to using our old auto-retry plugin to do the retrying (instead of openQA's own support for retrying, which we switched to when it was added), but looking at that today, I'm not sure it would actually solve all the problems here.

So, this might be the best way forward indeed. Never reporting the failure, if we're sure the job will be retried, should keep things clean all the way down the chain. I just have to check that we can be sure whether or not a job will be retried (I think we can).

So I poked at this a bit more, developed a theory, and decided the easiest thing to do is to fix it as if the theory is true, deploy that to prod, and see what happens. It seems quite awkward to check for sure whether the theory is right.

We already have code in the openQA result reporter to not report results for cloned jobs. But that's not helping here; it seems clear that when an update test fails and the reporter is triggered, the job dict it gets from the openQA API at that instant does not include the clone_id entry that indicates the job has a clone. My theory is that the event that triggers the reporter actually fires just before the clone is created, and the reporter kicks in fast enough to fetch the job dict before the cloning happens.
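For reference, this is roughly what that existing check looks like when sketched with the openqa_client library; treat the exact class and method names as assumptions rather than a copy of the reporter's real code, and the job id is arbitrary.

```python
from openqa_client.client import OpenQA_Client

client = OpenQA_Client(server="openqa.fedoraproject.org")
# Arbitrary job id, purely for illustration.
job = client.openqa_request("GET", "jobs/123456")["job"]

# If the job has already been cloned (restarted), clone_id points at the
# new job and we skip reporting. The race described above is that, at the
# instant the reporter fires, clone_id may not be populated yet.
if job.get("clone_id"):
    print("job was cloned, skipping report")
```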

So my 'fix' is to have the reporter check whether a result looks like one that will be retried (i.e. it has the RETRY variable set and is not itself a clone), and if so, wait a few seconds and then re-fetch the job dict. Yeah, it's a bit dumb, but if my theory is right it should work: when the reporter re-fetches the dict after the wait, the clone_id entry should be present, and reporting will be skipped.
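A sketch of that wait-and-refetch logic, under the same caveats: the helper name, the wait length, and exactly how a job identifies itself as a clone are all assumptions, not the deployed code.

```python
import time

RETRY_WAIT = 5  # seconds; an assumed value, the real delay may differ


def should_skip_report(client, job):
    """Decide whether to skip reporting this job's result to resultsdb."""
    settings = job.get("settings", {})
    # Retry candidates have the RETRY variable set and are not clones
    # themselves ('origin_id' is a hypothetical field name standing in
    # for however a clone is actually identified).
    if settings.get("RETRY") and not job.get("origin_id"):
        # Give openQA a few seconds to create the clone, then re-fetch:
        # if the theory is right, clone_id will be populated by now.
        time.sleep(RETRY_WAIT)
        job = client.openqa_request("GET", "jobs/{0}".format(job["id"]))["job"]
    # Existing rule: never report jobs that have been cloned.
    return bool(job.get("clone_id"))
```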

My initial plan was just to unconditionally skip reporting whenever the job looks like it should be retried, so this approach should be safer than that, anyway. I've gone ahead and deployed it to prod and will keep an eye on things for a few days to see if it improves them.

OK, I think this is looking good. https://bodhi.fedoraproject.org/updates/FEDORA-2024-b7c1336a97 is an example: the desktop_terminal test failed, was restarted, and passed, and we only reported QUEUED and then PASSED, not FAILED. The update never hit gating status 'failed'.

Metadata Update from @adamwill (2 months ago):
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)
