#8056 openshift(staging): networking often stalls coreos-cincinnati builds (seemingly related to egress-policy)
Opened 2 months ago by lucab. Modified 9 days ago

Before starting: the egress network policy itself is fine. Retrying the same build-config (without any change) eventually results in a success.

For context, coreos-cincinnati-stub contains a build-config to build a Rust application from GitHub. The service itself is still a work in progress while we iron out the details of the auto-update system for Fedora CoreOS.

Initially, the "coreos-cincinnati" project was egress-unconstrained, and builds were fine.
Since the introduction of egress network policies on the namespace, I'm seeing a 50% rate of build jobs getting stuck while trying to access "crates.io", "github.com", and "static.crates.io". Interestingly, restarting the build with the exact same inputs eventually succeeds, seemingly at random.

The symptoms are a bit vague and I cannot debug further myself due to cluster RBAC policies. The application builder (cargo) is not reporting any name resolution errors, transport errors, or HTTP timeouts. Earlier commands in the build-config (e.g. dnf install) can access the Internet.

It looks like something is misbehaving at the network or egress-policy level: networking looks fine on the surface, but HTTPS flows to certain destinations stall.
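To tell a silent stall apart from an explicit failure, a probe with a hard timeout could be run from inside a build pod. The following is only a minimal sketch under assumptions: the `probe` helper, the 10-second timeout, and the three outcome labels are hypothetical and not part of the build-config. Note that the `timeout` passed to `urlopen` applies to each blocking socket operation, so a connection that trickles data very slowly would not be flagged.

```python
import socket
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch url once and classify the outcome as 'ok', 'stalled', or 'error'.

    `opener` is injectable so the classification can be exercised offline.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read(1024)  # force at least one read over the connection
        outcome = "ok"
    except (socket.timeout, TimeoutError):
        # The connection was accepted but no data arrived in time.
        outcome = "stalled"
    except urllib.error.URLError as exc:
        # urllib wraps connect-time timeouts in URLError; HTTP error
        # statuses (HTTPError) also land here and count as 'error'.
        timed_out = isinstance(exc.reason, (socket.timeout, TimeoutError))
        outcome = "stalled" if timed_out else "error"
    except OSError:
        outcome = "error"
    return outcome, time.monotonic() - start
```

For example, calling `probe("https://static.crates.io/")` in a loop and counting the "stalled" results would give a rough stall rate per destination.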

As egress policies are defined by DNS names and most of those endpoints are behind CDNs, I wonder if there may be an underlying de-sync in the name-to-IP whitelist due to rapidly changing DNS entries.
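One way to check how quickly those DNS entries actually rotate is to resolve the names repeatedly and log every change in the returned address set. This is a rough sketch, not something taken from the cluster configuration: `resolve_ipv4`, `watch_dns`, the round count, and the interval are all hypothetical, and the results reflect the local resolver's view, which may differ from what the egress-policy controller sees.

```python
import socket
import time


def resolve_ipv4(host):
    """Return the set of IPv4 addresses the local resolver gives for host."""
    infos = socket.getaddrinfo(
        host, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return {info[4][0] for info in infos}


def watch_dns(hosts, rounds=5, interval=10.0,
              resolver=resolve_ipv4, sleep=time.sleep):
    """Resolve each host `rounds` times and record address-set changes.

    Returns a list of (round_index, host, old_addrs, new_addrs) tuples.
    `resolver` and `sleep` are injectable so the logic can run offline.
    """
    last = {}
    changes = []
    for i in range(rounds):
        for host in hosts:
            addrs = resolver(host)
            if host in last and addrs != last[host]:
                changes.append((i, host, last[host], addrs))
            last[host] = addrs
        if i < rounds - 1:
            sleep(interval)  # wait before the next round of lookups
    return changes
```

Running `watch_dns(["crates.io", "static.crates.io", "github.com"])` from a build pod and comparing the change times against the stalled builds would show whether the two correlate.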

Amazingly, the OpenShift folks are one step ahead here: they actually query the zone TTL and refresh at that interval.

So, I don't think it's a DNS / A record desync.

We should possibly schedule some time to debug this (or, if you like, I can just try a bunch of builds until one gets stuck). I am out almost all of next week for Flock, though, so it might be the week after, or someone else could pick it up.

Does this happen in both stg and prod?

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: OpenShift

a month ago

I am at Flock too, so we could schedule a live-debugging session there, if you wish.

I only have stg up at this point, so I don't know whether this would also happen in prod.

Metadata Update from @kevin:
- Issue tagged with: backlog

9 days ago
