#11953 osbuildImage regularly has firewall issues
Closed: Fixed 4 months ago by kevin. Opened 11 months ago by supakeen.

G'day! About a year ago we had some issues reaching certain API hosts (https://pagure.io/fedora-infrastructure/issue/10982); those issues became intermittent enough that they weren't hampering builds.

Over the past week or two, however, the issue has become consistent and IoT can barely complete any builds. Did something change firewall- or deployment-wise?

Some builds affected by this over the past week:

https://koji.fedoraproject.org/koji/taskinfo?taskID=118162559
https://koji.fedoraproject.org/koji/taskinfo?taskID=118206033
https://koji.fedoraproject.org/koji/taskinfo?taskID=117946966
https://koji.fedoraproject.org/koji/taskinfo?taskID=118158177

Tagging in @pwhalen and @pbrobinson as it affects their builds.


Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: Needs investigation

11 months ago

Odd. Nothing I can see would have changed here. We still have a cron job that updates the IPs and refreshes an ipset that the firewall allows. ;(
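For context, the pattern is roughly this (a minimal sketch; the real set name, rule, and script live in the infra Ansible repo and may differ):

```
# Hypothetical set name and rule; illustrative only.
ipset create osbuild-allow hash:ip -exist
iptables -A OUTPUT -p tcp --dport 443 \
    -m set --match-set osbuild-allow dst -j ACCEPT

# The cron job periodically re-resolves the allowed hosts and
# repopulates the set with the current A records.
for host in api.openshift.com sso.redhat.com; do
    for ip in $(dig +short "$host" A | grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$'); do
        ipset add osbuild-allow "$ip" -exist
    done
done
```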

Is it always sso.redhat.com?

It's always sso.redhat.com.

Sorry, I didn't get an email about your reply, hence my late response.

I'm not sure what could be going on here. ;(

Is it still happening?

Any news here? Are you still seeing the problem? I did see one rawhide osbuild that hit an issue, but it wasn't exactly the same and I only saw it the one time. ;(

Are you still seeing this?

Please re-open if so and we can look more...

Metadata Update from @kevin:
- Issue close_status updated to: Insufficient data
- Issue status updated to: Closed (was: Open)

8 months ago

Metadata Update from @kevin:
- Issue status updated to: Open (was: Closed)

5 months ago

OK. I updated to that koji-osbuild version on all the builders and freed all the stuck ones (but I'm not sure they handle that correctly).

@kevin After poking around in the Ansible configuration for the builders together with @obudai, there are a few things that might be relevant:

  1. sso.redhat.com has AAAA records, but the osbuild-update.sh script only puts the v4 addresses in the ipset (note: api.openshift.com has no AAAA records at all). Whether this matters probably depends on the rest of the iptables configuration and on whether there are connectivity differences between builders.
  2. In osbuild-update.sh the ipset is flushed directly after the lookup of api.openshift.com; if the next lookup fails, the script exits and the sso.redhat.com IPs never make it into the ipset. If lookup failures then persist, subsequent cron runs won't repopulate the ipset at all. It might be worthwhile to flush the ipset and add the new addresses only once both lookups have succeeded (see the sketch below). I don't think the IPs change often, but this cron job runs very frequently.

It'd be interesting to take a look at what's in the relevant ipset during a failure, and/or to find out whether resolver errors are happening. :)
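To make point 2 concrete, here's a minimal sketch of the safer flow (assuming a dig-based lookup; the set name is illustrative, not necessarily what the playbook uses). Like the current script, it only handles A records, so point 1 is unaffected:

```
#!/bin/bash
# Resolve both hosts first; leave the existing set untouched if either
# lookup fails, so stale-but-valid entries survive a resolver hiccup.
api_ips=$(dig +short api.openshift.com A | grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$')
sso_ips=$(dig +short sso.redhat.com A | grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$')

if [ -z "$api_ips" ] || [ -z "$sso_ips" ]; then
    exit 1  # resolver error: keep the previous entries in place
fi

ipset flush osbuild-allow
for ip in $api_ips $sso_ips; do
    ipset add osbuild-allow "$ip" -exist
done
```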

There's a race condition even when the DNS resolution succeeds; I filed a PR to fix it: https://pagure.io/fedora-infra/ansible/pull-request/2375
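(The race being the window between the `ipset flush` and the re-adds, during which the set is momentarily empty and matching connections get dropped.) The PR switches to building a second set and swapping it in atomically, roughly like this (set names illustrative):

```
# Populate a scratch set, then swap it with the live one atomically,
# so the live set is never observed empty.
ipset create osbuild-allow-tmp hash:ip -exist
ipset flush osbuild-allow-tmp
for ip in $api_ips $sso_ips; do
    ipset add osbuild-allow-tmp "$ip" -exist
done
ipset swap osbuild-allow-tmp osbuild-allow
ipset destroy osbuild-allow-tmp
```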

@kevin Can you please take a look, merge it, and deploy it?

This was deployed; unfortunately, we still can't get a compose completed.

Logs: https://koji.fedoraproject.org/koji/taskinfo?taskID=126098428

Pretty strange. Going to that builder and running curl against the same URL works just fine. :(

So, right now we have osbuild going only to buildhw-x86 (our 16 old x86_64 hardware builders).

Perhaps I should try moving it to some buildvms in case there's some networking oddity there? But I don't see anything in the logs about it. ;(

Is this still happening after the PR that made it swap the sets?

2 out of the last 100 jobs failed because of this issue; I think that's acceptable for now.

Looks much better lately, thanks to everyone looking at it. I think we can close for now?

OK. Let us know if it starts up again.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 months ago
