Issue #7197: Production Bodhi has a pod stuck at ContainerCreating - fedora-infrastructure

Closed: Fixed 5 years ago Opened 5 years ago by bowlofeggs.

Describe what you need us to do:
A few minutes ago, I started to observe HTTP 503 errors from production Bodhi. At the time, I was unable to request logs from the pods with this error:

$ oc logs -f bodhi-web-4-nhgnw
Error from server: Get https://os-node02.phx2.fedoraproject.org:10250/containerLogs/bodhi/bodhi-web-4-nhgnw/bodhi-web?follow=true: net/http: TLS handshake timeout

After a minute or two of that, the container status for those pods started to show "Unknown" instead of "Running", though still showed 0 restarts. After a while, OpenShift started up two more bodhi-web-4 pods (with new hashes) and the "Unknown" ones disappeared. Bodhi now seems to be served, but one of the new pods has been showing as "ContainerCreating" for 11 minutes now:

$ oc get pods
NAME                READY     STATUS              RESTARTS   AGE
bodhi-web-3-build   0/1       Completed           0          21h
bodhi-web-4-q6dcv   1/1       Running             0          12m
bodhi-web-4-rwmsv   0/1       ContainerCreating   0          11m

Normally Bodhi would be served with two pods. I don't know how to see logs for what is happening right now, but the other pod does seem to be getting by for now.
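For spotting pods in this state quickly, here is a minimal sketch that filters `oc get pods` output down to pods that are neither Running nor Completed. It assumes the default column layout shown above (NAME, READY, STATUS, RESTARTS, AGE); the function name `stuck_pods` is made up for illustration.

```shell
# Sketch: read `oc get pods` output on stdin and print name, status and age
# for every pod whose STATUS is neither Running nor Completed.
# Assumes the default five-column layout; NR > 1 skips the header row.
stuck_pods() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3, $5 }'
}
```

It would be used as `oc get pods | stuck_pods`. For a pod it reports, `oc describe pod <name>` lists the scheduling and image pull events even when `oc logs` has nothing to show yet.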

Are there more logs available than I have access to? If so, do they reveal anything useful about what might have happened here?

When do you need this? (YYYY/MM/DD)
N/A
When is this no longer needed or useful? (YYYY/MM/DD)
If it gets back to two pods somehow, or if you don't want to investigate what might have happened.
If we cannot complete your request, what is the impact?
We may not learn what happened, and Bodhi may continue to only be served by one pod.

bowlofeggs commented 5 years ago

I swear - immediately when this page loaded after I submitted this ticket, the container showed as "Running". Weird timing.

Anyways, I'll leave this ticket open in case anyone wants to investigate the issue further, otherwise feel free to close it.

bowlofeggs commented 5 years ago

I just noticed that I had forgotten to adjust the hostname when I ran oc login, and I did run that on batcave01. I think this explains the oc logs issue I was having.
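A quick way to catch this kind of login mix-up is to ask `oc` which API server it is talking to before debugging further. The sketch below wraps `oc whoami --show-server`; the function name `check_oc_server` and the `OC` override variable are hypothetical, and the expected URL has to be supplied by the caller.

```shell
# Sketch: compare the API server `oc` is logged in to against an expected URL.
# OC can be set to another command name (for example a stub); it defaults to `oc`.
check_oc_server() {
    expected="$1"
    # `oc whoami --show-server` prints the URL of the current API server.
    actual="$("${OC:-oc}" whoami --show-server)" || return 2
    if [ "$actual" != "$expected" ]; then
        echo "oc is logged in to $actual, expected $expected" >&2
        return 1
    fi
    echo "oc is logged in to $actual"
}
```

Calling `check_oc_server <expected-url>` returns non-zero and prints a warning on stderr when the session points at a different cluster.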

bowlofeggs commented 5 years ago

I now know why it took so long for the container to come up - it took 14 minutes to pull from the registry (so just slow performance):

$ oc describe pod bodhi-web-4-rwmsv
<snip>
Events:
  FirstSeen  LastSeen  Count  From                                       SubObjectPath               Type    Reason     Message
  ---------  --------  -----  ----                                       -------------               ----    ------     -------
  29m        29m       1      default-scheduler                                                      Normal  Scheduled  Successfully assigned bodhi-web-4-rwmsv to os-node01.phx2.fedoraproject.org
  28m        28m       1      kubelet, os-node01.phx2.fedoraproject.org  spec.containers{bodhi-web}  Normal  Pulling    pulling image "docker-registry.default.svc:5000/bodhi/bodhi-web@sha256:8386c2b654561b984938373934908c31ed3ce74b880820a2988492881f62799b"
  14m        14m       1      kubelet, os-node01.phx2.fedoraproject.org  spec.containers{bodhi-web}  Normal  Pulled     Successfully pulled image "docker-registry.default.svc:5000/bodhi/bodhi-web@sha256:8386c2b654561b984938373934908c31ed3ce74b880820a2988492881f62799b"

So the only remaining mystery is what happened to the old pods.
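The 14 minute figure comes from subtracting the FirstSeen age of the Pulled event (14m) from that of the Pulling event (28m). A rough sketch of that arithmetic, reading `oc describe pod` output on stdin, is below. It assumes the ages are whole minutes such as "28m" and does not handle other units; `pull_minutes` is a made-up name.

```shell
# Sketch: estimate image pull time from the Events table of `oc describe pod`.
# Finds the rows whose Reason is Pulling and Pulled, strips the trailing "m"
# from the FirstSeen column, and prints the difference in minutes.
# Assumes minute-granularity ages only (e.g. "28m"), not "1h" or "30s".
pull_minutes() {
    awk '
        /[ \t]Pulling[ \t]/ { sub(/m$/, "", $1); start = $1 }
        /[ \t]Pulled[ \t]/  { sub(/m$/, "", $1); end = $1 }
        END { if (start != "" && end != "") print start - end }
    '
}
```

Run as `oc describe pod bodhi-web-4-rwmsv | pull_minutes`; with the events above it prints 14.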

kevin commented 5 years ago

Yeah, I see where they were dropped, it looks like some kind of network hiccup...

Aug 28 14:42:42 os-node02 atomic-openshift-node: W0828 14:42:42.469025 99291 prober.go:103] No ref for container "cri-o://0bb8d8a410d862b2563f918ada95fe92958fb87affb673ec3abd3180ca12d428" (bodhi-web-4-nhgnw_bodhi(3891c932-aa1f-11e8-83f4-52540068650e):bodhi-web)

Not sure we are going to find out much more...

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 years ago

Metadata
- Assignee: None
- Tags: bodhi, OpenShift
- Blocking: None
- Depending on: None
- Priority: Waiting on Assignee