Hi,
It seems our pods have issues cloning data to the shared storage:
sh-5.0# cd /opt/cruncher/artifacts/00d31af1-aea4-405e-acfd-c0df210f4867
sh-5.0# git clone https://github.com/packit/hello-world.git
Cloning into 'hello-world'...
remote: Enumerating objects: 116, done.
remote: Total 116 (delta 0), reused 0 (delta 0), pack-reused 116
fatal: write error: Bad file descriptor
fatal: index-pack failed
Outside the shared storage, the cloning works nicely:
sh-5.0# cd /var/tmp/
sh-5.0# git clone https://github.com/packit/hello-world.git
Cloning into 'hello-world'...
remote: Enumerating objects: 116, done.
remote: Total 116 (delta 0), reused 0 (delta 0), pack-reused 116
Receiving objects: 100% (116/116), 27.04 KiB | 6.76 MiB/s, done.
Resolving deltas: 100% (53/53), done.
Project: testing-farm
PVC: testing-farm-artifacts
PV: pv-5gi-28a74c90-1b7b-5342-b473-66c862abf222
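As a side note, a plain write to that directory should tell whether it is the storage layer rather than git; a minimal check, assuming the same pod and path as above:

cd /opt/cruncher/artifacts/00d31af1-aea4-405e-acfd-c0df210f4867
df -hT .                               # which filesystem backs this path
mount | grep /opt/cruncher/artifacts   # the mount entry and its options
touch write-test && rm write-test      # small write
dd if=/dev/zero of=dd-test bs=1M count=10 && rm dd-test   # streamed write, similar to what index-pack does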
Can we find another example where this happens? I need the pod name to see if we can narrow this down to one of the nodes.
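For what it's worth, the pod-to-node mapping can be pulled directly; a rough example (the grep pattern for the artifact pods is just a guess):

$ oc project testing-farm
$ oc get pods -o wide | grep tft-   # the NODE column shows where each pod runs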
@bstinson I have a debug pod running in the testing-farm project:
$ oc describe pod/tft-00d31af1-aea4-405e-acfd-c0df210f4867-debug | grep -i node
Node:           n22.kempty.ci.centos.org/172.22.6.92
Node-Selectors: oci_kvm_hook=allowed
$ oc project testing-farm
$ oc rsh tft-00d31af1-aea4-405e-acfd-c0df210f4867-debug
$ cd /opt/cruncher/artifacts/00d31af1-aea4-405e-acfd-c0df210f4867
# git clone https://github.com/packit/hello-world.git
Cloning into 'hello-world'...
.... HANGS ....
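To pin the "Bad file descriptor" / hang on writes to the NFS-backed path rather than on git itself, something like this could be run from the debug pod (assuming strace is available or can be installed in the image):

mount | grep /opt/cruncher/artifacts    # confirm what backs the path
strace -f -e trace=write,openat -o /var/tmp/clone.trace \
    git clone https://github.com/packit/hello-world.git
tail -n 20 /var/tmp/clone.trace         # the failing or hanging syscalls end up here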
There were some hung NFS mounts on that node. I went ahead and evacuated it and rebooted. Let's see if we have another node exhibiting symptoms.
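For reference, the usual evacuate-and-reboot sequence looks roughly like this (exact drain flags may differ between oc versions):

$ oc adm cordon n22.kempty.ci.centos.org      # stop new pods landing on the node
$ oc adm drain n22.kempty.ci.centos.org --ignore-daemonsets --delete-local-data
# ... reboot the node ...
$ oc adm uncordon n22.kempty.ci.centos.org    # allow scheduling again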
Metadata Update from @arrfab:
- Issue assigned to bstinson
- Issue priority set to: Waiting on Reporter (was: Needs Review)
- Issue tagged with: centos-ci-infra
Seems to be working fine now. Several recent testing jobs successfully finished.
Thanks, so I believe we can close this
Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)
Thanks!
Metadata Update from @mvadkert:
- Issue status updated to: Open (was: Closed)
The issue is back now @arrfab @bstinson @dkirwan ... so reopening the issue.
Example pull request with failures: https://github.com/psss/tmt/pull/338
I've cordoned off n22.kempty, but I don't have a pod/host to trace this down with. What is left on the blocker list for moving to the new cluster?
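If it helps narrow things down, a rough way to list what is (or was) scheduled on a given node, assuming this oc version supports field selectors:

$ oc get pods --all-namespaces -o wide --field-selector spec.nodeName=n22.kempty.ci.centos.org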
We are not able to move to KubeVirt. It is just too much work for us right now. We still need the old cluster to stay around for a few more months.
All recent packit tests are failing. @bstinson, could you please look into this once more?
A fresh failing job example: https://github.com/psss/tmt/pull/343
Another failing job: https://github.com/oamg/convert2rhel/pull/100
A lot of these are centered around n24.kempty. I took that one out while it's draining/rebooting.
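Before putting n24.kempty back in, it may be worth a node-side check for hung NFS mounts, assuming SSH access to the node:

$ ssh root@n24.kempty.ci.centos.org
# dmesg | grep -i nfs | tail
# mount | grep -i nfs
# nfsstat -m

Repeated "server not responding" messages in dmesg, or a mount listing that hangs, would match the hung-mount symptom seen earlier on n22.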
Two staging packit jobs failed with the "Problem with Testing-Farm cluster" error:
Other jobs for that pull request seem to be fine.
Metadata Update from @arrfab:
- Issue marked as depending on: #53
FYI: linked to #53, so feel free to follow the status there. Normally we'll fix this for the new OpenShift cluster, while the legacy one will have to disappear soon, so you'd rather migrate to the new one anyway and benefit from the new infra.
#53 is closed now, and given that the last update here was 2 months ago, is this issue presumably resolved?
Folks, the older cluster will be unavailable soon (1-2 months). I would immediately look at finding a replacement location to host your workloads if you are unable to move to the CentOS CI OpenShift 4 cluster.
Closing this issue.
Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)
End of January sounds good, we should be finished by then.
@dkirwan one month is insufficient, as the break is near.