G'day, about a year ago we had some issues reaching certain API hosts (https://pagure.io/fedora-infrastructure/issue/10982). Those issues became intermittent enough that they weren't hampering builds.
Over the past week or two, however, the issue has become consistent and IoT can barely do any builds. Did something change firewall- or deployment-wise?
Some builds affected by this over the past week:
https://koji.fedoraproject.org/koji/taskinfo?taskID=118162559
https://koji.fedoraproject.org/koji/taskinfo?taskID=118206033
https://koji.fedoraproject.org/koji/taskinfo?taskID=117946966
https://koji.fedoraproject.org/koji/taskinfo?taskID=118158177
Tagging in @pwhalen and @pbrobinson as it affects their builds.
Metadata Update from @zlopez: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: Needs investigation
Odd. Nothing I can see would have changed here. We still have a cron job that updates the IPs and refreshes an ipset which is allowed in the firewall. ;(
Is it always sso.redhat.com?
It's always sso.redhat.com
Sorry, I didn't get an email about your reply, hence my late response.
I'm not sure what could be going on here. ;(
Is it still happening?
Any news here? Are you still seeing the problem? I did see one rawhide osbuild that hit an issue, but it wasn't exactly the same and I only saw it the one time. ;(
Are you still seeing this?
Please re-open if so and we can look more...
Metadata Update from @kevin: - Issue close_status updated to: Insufficient data - Issue status updated to: Closed (was: Open)
Reopening, this week osbuild tasks have been failing again:
https://koji.fedoraproject.org/koji/taskinfo?taskID=125846804
https://koji.fedoraproject.org/koji/taskinfo?taskID=125846805
https://koji.fedoraproject.org/koji/taskinfo?taskID=125846806
Stuck builds: https://koji.fedoraproject.org/koji/tasks?method=osbuildImage&state=active&view=tree&order=-method
There is a new build of koji-osbuild that increases retries and might help: https://koji.fedoraproject.org/koji/buildinfo?buildID=2584829
Metadata Update from @kevin: - Issue status updated to: Open (was: Closed)
ok. I updated to that koji-osbuild version on all the builders and freed all the stuck ones (but I am not sure they handle that right).
@kevin after looking around a bit in the Ansible stuff for the builders together with @obudai there's a few things that might be relevant:
sso.redhat.com
osbuild-update.sh
ipset
api.opensift.com
api.openshift.com
It'd be interesting if we can take a look at what's in the relevant ipset during a failure and/or if we could know if there are resolver errors happening :)
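For whoever ends up poking at a builder during a failure, here is a rough sketch of that check: resolve the host, note any resolver error, and compare the answer with what the firewall set currently allows. The function names are made up for illustration, and the ipset members are passed in as a plain list because I don't know the set name used on the builders (the list could be copied from `ipset list` output there).

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return (addresses, error) for hostname.

    A resolver failure comes back as an empty set plus the error text,
    so a cron-style caller can log it instead of crashing.
    """
    try:
        infos = resolver(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return set(), str(exc)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return {info[4][0] for info in infos}, None


def missing_from_ipset(resolved, ipset_members):
    """Addresses DNS hands out right now that the firewall set does not allow."""
    return sorted(set(resolved) - set(ipset_members))
```

If `missing_from_ipset` returns anything while a task is failing, the set is stale relative to DNS; if `resolve_ipv4` returns an error, the resolver is the more likely suspect.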
There's a race condition even when the DNS resolution succeeds. I filed a PR to fix it: https://pagure.io/fedora-infra/ansible/pull-request/2375
@kevin Can you please take a look, merge it, and deploy it?
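For readers who don't want to open the PR, a minimal sketch of the general idea, assuming the race comes from flushing the live set and refilling it in place (which leaves a window where the set is empty or only partly filled). This is not the code from the PR: the set names and the function are hypothetical. It only builds the text one would feed to `ipset restore`, filling a temporary set first and then exchanging it with the live one in a single `swap`.

```python
def build_swap_script(live_set, addresses, tmp_suffix="-tmp"):
    """Build `ipset restore` input that refreshes live_set without emptying it.

    The new addresses go into a temporary set, which is then swapped with
    the live set, so the firewall never sees a half-filled set.
    """
    unique = sorted(set(addresses))
    if not unique:
        # Assumption: on an empty result it is safer to keep the old set.
        raise ValueError("no addresses resolved; refusing to swap in an empty set")
    tmp_set = live_set + tmp_suffix
    lines = [
        f"create {live_set} -exist hash:ip",  # no-op if the set already exists
        f"create {tmp_set} -exist hash:ip",
        f"flush {tmp_set}",                   # clear leftovers from a failed run
    ]
    lines += [f"add {tmp_set} {addr}" for addr in unique]
    lines += [
        f"swap {tmp_set} {live_set}",         # contents exchanged in one step
        f"destroy {tmp_set}",                 # tmp now holds the old entries
    ]
    return "\n".join(lines) + "\n"
```

Refusing to swap when nothing resolved is my own choice here: a transient resolver error then leaves the previous addresses in place rather than blocking everything.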
This was deployed, unfortunately we still can't get a compose completed.
Logs: https://koji.fedoraproject.org/koji/taskinfo?taskID=126098428
Pretty strange. Going to that builder and running a curl on the same url works just fine. :(
So, right now we have osbuild going to only buildhw-x86 (our 16 old x86_64 hardware builders).
Perhaps I should try moving it to some buildvms in case there is some networking oddity there? But I don't see anything in the logs about it. ;(
Is this still happening after the PR to make it swap tables?
2 out of the last 100 jobs failed because of this issue; I think that's acceptable for now.
Looks much better lately, thanks to everyone looking at it. I think we can close for now?
ok. Let us know if it starts in again.
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)