#8300 Staging OpenShift build stuck, kill it please (Fedora Docs)
Closed: Fixed 16 days ago by kevin. Opened a month ago by asamalik.

In the staging OpenShift (os.stg.fedoraproject.org), in the "docsbuilding" project, I have a stuck build (#21) that has been running for 9 hours now.

https://os.stg.fedoraproject.org/console/project/docsbuilding/browse/builds/builder-build/builder-build-21?tab=details

Can someone please kill it?

(I have already resubmitted the build, but the new one is waiting for the old, stuck one to finish.)

Thanks!


Canceled (unfortunately I canceled yours as well... so I started another new build)

FYI, this is using the Fedora 29 container; we may want to move it to 30 or 31 soon. :)

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a month ago

Thanks!

So, unfortunately, the new one is stuck again. :-(

I don't believe there's an error in the Dockerfile: it's the very same one that was succeeding before, and it builds fine locally. Could this be something with OpenShift?

PS: Thanks for noticing it's an f29, I'll fix that when I'm back from vacation.

Metadata Update from @asamalik:
- Issue status updated to: Open (was: Closed)

a month ago

We have a Bodhi build stuck too. I have cancelled it and retriggered the build a couple of times, but it always gets stuck. :-(

I could not see anything useful in the Events or logs.

IIUC we should be able to cancel our own builds now. Read https://pagure.io/fedora-infrastructure/issue/8005#comment-602921 and the following comments.

I fear this may be a cluster-wide issue. On the staging cluster, I tried building the same coreos-cincinnati config five times and every attempt got stuck.

The Dockerfile is unchanged and looks like this:

FROM fedora:30
RUN dnf -y install g++ openssl-devel
RUN dnf -y install rust cargo
...

The build job never reaches the second RUN and just hangs after the first dnf is completed.

Yes, same here for the bodhi project. I'll try to check the cluster and see if this is a resource problem.

I have restarted the docker service on os-node04.stg.phx2.fedoraproject.org and it seems to have fixed the problem. At least for bodhi.

@lucab and @asamalik, could you try to start a new build?

@cverna confirmed, I triggered another build (which ended up on node os-node04.stg.phx2) and it went fine. I can't see on which nodes the previous failing builds were scheduled, though.

I have checked all the nodes, and only os-node04 had stuck builds.

Let's close this for now. Feel free to reopen it if you experience the same issue.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a month ago

Sorry for lagging with my response, I was on vacation the whole week.

Thanks for all the help! As Dusty suggests, I can now kill pods myself, which is super helpful.

However, the builds are still getting stuck. I killed the first one after 15 minutes, and the second one has now been hanging for about 40 minutes. I left it running in case it helps with debugging. Both are on os-node01, if that helps.
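For anyone trying to narrow this down to a node, here is a rough Python sketch of the check described above: pick out builds that are still running past a time limit and group them by node. The record fields (`name`, `phase`, `node`, `started`), the 30-minute limit, and the sample data are all assumptions made up for illustration; they are not the real OpenShift API output.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def stuck_builds_by_node(builds, now, max_age=timedelta(minutes=30)):
    """Group builds that are still Running after max_age by node name."""
    by_node = defaultdict(list)
    for build in builds:
        # Only builds that have not finished and exceed the limit count as stuck.
        if build["phase"] == "Running" and now - build["started"] > max_age:
            by_node[build["node"]].append(build["name"])
    return dict(by_node)


# Illustrative records only: the names, date, and times are hypothetical.
now = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)
builds = [
    {"name": "example-build-1", "phase": "Running", "node": "os-node01",
     "started": now - timedelta(minutes=40)},
    {"name": "example-build-2", "phase": "Running", "node": "os-node01",
     "started": now - timedelta(minutes=5)},
    {"name": "example-build-3", "phase": "Complete", "node": "os-node04",
     "started": now - timedelta(hours=9)},
]

print(stuck_builds_by_node(builds, now))
```

If the stuck builds all land under one node, that points at the node (as with os-node04 earlier) rather than at the Dockerfile.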

Metadata Update from @asamalik:
- Issue status updated to: Open (was: Closed)

16 days ago

OK, it looks like docker was confused. The build finished, but it never exited properly.

I restarted docker on all the staging nodes and fired a new build.

It seems to have completed ok.

Side note: you are using Fedora 29 there, please update to 31 before 29 goes end of life. ;)

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

16 days ago
