The tasks named in the subject still sometimes fail to reach identity.api.openshift.com, see e.g. https://koji.fedoraproject.org/koji/taskinfo?taskID=93971173
I looked into https://pagure.io/fedora-infra/ansible/blob/main/f/roles/koji_builder/templates/osbuildapi-update.sh and everything looks fine. Given that IP addresses of these domains are pretty stable, we shouldn't get these errors as often.
Maybe also run the second part on prod, because api.openshift.com might have different IPs than identity.api.openshift.com? That's my only idea; I'm currently monitoring it locally.
Do we have any kind of logs that would show that something is going wrong in this script?
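For anyone who wants to check the "different IPs" theory themselves, here is a minimal sketch of the kind of local check I mean. It is not taken from osbuildapi-update.sh; the function names (`resolve_ipv4`, `compare_hosts`) are made up for illustration, and it assumes only IPv4 addresses matter here. It resolves both hostnames and reports the addresses that belong to one host but not the other.

```python
import socket
from typing import Callable, Dict, Iterable, Set

# The two hostnames discussed in this ticket.
HOSTS = ("api.openshift.com", "identity.api.openshift.com")


def resolve_ipv4(host: str) -> Set[str]:
    """Return the IPv4 addresses the system resolver currently gives for host."""
    infos = socket.getaddrinfo(
        host, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (address, port) tuple.
    return {info[4][0] for info in infos}


def compare_hosts(
    hosts: Iterable[str] = HOSTS,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> Dict[str, Set[str]]:
    """Map each host to the addresses that only that host resolves to."""
    resolved = {host: set(resolver(host)) for host in hosts}
    unique: Dict[str, Set[str]] = {}
    for host, addrs in resolved.items():
        # Union of the addresses of every other host in the comparison.
        others: Set[str] = set()
        for other_host, other_addrs in resolved.items():
            if other_host != host:
                others |= other_addrs
        unique[host] = addrs - others
    return unique
```

Calling `compare_hosts()` every so often and logging the result would show whether identity.api.openshift.com ever resolves to an address that a lookup of api.openshift.com alone would miss. An empty set for both hosts means they currently resolve to the same addresses.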
https://koji.fedoraproject.org/koji/taskinfo?taskID=93918683
https://koji.fedoraproject.org/koji/taskinfo?taskID=93753539
https://koji.fedoraproject.org/koji/taskinfo?taskID=93673051
https://koji.fedoraproject.org/koji/taskinfo?taskID=93672778
https://koji.fedoraproject.org/koji/taskinfo?taskID=92954073
Metadata Update from @phsmoura: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: medium-gain, medium-trouble, ops
So is this failing all the time, or just sometimes?
We get emails from the script when it takes longer than a minute to run, but I haven't seen any other issues and that one is rare.
Has this always been happening? Or can you tell when it started?
It's failing randomly. I have a feeling that this has been happening on and off ever since we integrated osbuild into koji.
@kevin I found the root cause! Here's a PR:
https://pagure.io/fedora-infra/ansible/pull-request/1260
:)
I merged that and pushed it out to builders.
Let us know if that fixes things.
We got one case today, see https://koji.fedoraproject.org/koji/taskinfo?taskID=94690440
Is buildvm-s390x-21.s390.fedoraproject.org running the updated code?
Hum... those are gonna be a problem.
The s390x builders cannot reach the internet at all. They can only reach select IPs in our internal network. ;(
Let me ponder on a solution.
Oh no, that's unfortunate, sorry... Thanks for looking into this! :)
Could we run osbuild jobs in a special group, so that we can just exclude the s390x builders from it? Given the jobs are really only a pass-through to the osbuild API, this shouldn't be a problem when we need to bring PPC/s390 osbuild workers online, as they're on different infra.
@kevin Do we have any ideas? @pbrobinson told me that this is now seriously blocking the IoT development.
Ah, I thought this was very sporadic and not affecting things... thanks for letting me know.
I guess I can make a new channel for it. Just a pain.
OK. I set up an 'osbuild' channel and set koji to use it for osbuild jobs. I put all our buildhw-x86 machines in it (16 of them). I also changed the job to run every 5 minutes instead of every minute. Running it every minute was causing it to sometimes hang and send us emails about it.
Hopefully that will fix things.
Shall we keep this open? Or just close it and open a new one if you hit anything further?
I think close it. There's another problem we're seeing but that's unrelated to this and @obudai was going to open another ticket for that.
Thanks @kevin :thumbsup:
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)