#8056 openshift(staging): networking often stalls coreos-cincinnati builds (seemingly related to egress-policy)
Closed: Fixed 6 months ago by lucab. Opened 12 months ago by lucab.

Before starting: the egress network policy itself is fine. Simply retrying the same build-config (without any change) will eventually result in a success.

For context, coreos-cincinnati-stub contains a build-config to build a Rust application from GitHub. The service itself is still a work-in-progress while we iron out the details of the auto-update system for fedora-coreos.

Initially, the "coreos-cincinnati" project was egress-unconstrained, and builds were fine.
Since the introduction of egress network policies on the namespace, I'm seeing roughly 50% of build jobs just getting stuck while trying to access "crates.io", "github.com", and "static.crates.io". Interestingly, just restarting the build under the exact same input conditions will eventually, and seemingly at random, result in a success.

The symptoms are a bit vague and I cannot debug further myself due to cluster RBAC policies. The application builder (cargo) is not reporting any name resolution errors, transport errors, or HTTP timeouts. Other earlier commands in the build-config (e.g. dnf install) can access the Internet.

It looks like something is misbehaving at the network or egress-policy level, resulting in networking that looks fine on the surface but stalls HTTPS flows to certain destinations.

As egress policies are defined by DNS names and most of those endpoints are behind CDNs, I wonder if there may be an underlying desync between the whitelisted names and the resolved IPs, due to rapidly changing DNS entries.
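For reference, a DNS-name-based egress policy for the destinations mentioned above could look roughly like the following. This is only a sketch: the resource names and rule ordering are assumed, not copied from the actual policy on the namespace.

```yaml
# Sketch of an OpenShift EgressNetworkPolicy allowing the build's
# destinations by DNS name; the real policy on the namespace may differ.
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: coreos-cincinnati-egress  # hypothetical name
spec:
  egress:
  - type: Allow
    to:
      dnsName: crates.io
  - type: Allow
    to:
      dnsName: static.crates.io
  - type: Allow
    to:
      dnsName: github.com
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
```

With dnsName rules, the SDN resolves each name to a set of IPs and enforces the policy on those IPs, which is why rapidly rotating CDN records are a plausible failure mode.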

Amazingly, the OpenShift folks are one step ahead here and actually query the zone TTL, refreshing the resolved IPs at that interval.
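The TTL-driven refresh described above can be sketched roughly as follows. This is an illustrative model, not OpenShift's actual implementation; the class and method names are invented, and the default resolver uses `socket.getaddrinfo` as a stand-in for a real DNS query that would also return the record's TTL.

```python
import socket

class DnsEntry:
    """Hypothetical sketch of a per-name cache that re-resolves a whitelisted
    dnsName once its TTL expires, so the allowed-IP set can track
    fast-changing CDN records."""

    def __init__(self, name, ttl):
        self.name = name
        self.ttl = ttl            # seconds, as reported by the zone
        self.ips = set()
        self.next_refresh = 0.0   # unix timestamp of the next re-resolve

    def needs_refresh(self, now):
        return now >= self.next_refresh

    def refresh(self, now, resolver=None):
        """Re-resolve the name and schedule the next refresh at now + ttl."""
        resolve = resolver or (
            lambda n: {ai[4][0] for ai in socket.getaddrinfo(n, 443)}
        )
        self.ips = resolve(self.name)
        self.next_refresh = now + self.ttl
        return self.ips
```

Even with this scheme, a record whose answers rotate faster than the advertised TTL (or a CDN serving different answers to the SDN node and the build pod) could still leave the enforced IP set stale for a window.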

So I don't think it's a DNS / A-record desync.

We should probably schedule some time to debug this (or, if you like, I can just try a bunch of builds until one gets stuck). I am out almost all next week for Flock, though, so it might have to be the week after, or someone else.

Does this happen in both stg and prod?

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: OpenShift

11 months ago

I am at Flock too, so we could also try to schedule a live-debugging session, if you wish.

I only have stg up at this point, so I don't know whether this would also happen in prod.

Metadata Update from @kevin:
- Issue tagged with: backlog

10 months ago

@lucab do you still see this happening? Or should we close this ticket?

I performed a few rebuilds over the last weeks (~10) and none of them failed. I'm closing this ticket for now.

Metadata Update from @lucab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

6 months ago
