#85 old centos cluster: issues with storage
Closed: Fixed 3 years ago by dkirwan. Opened 3 years ago by mvadkert.

Hi,

It seems our pods have issues cloning data to the shared storage:

sh-5.0# cd /opt/cruncher/artifacts/00d31af1-aea4-405e-acfd-c0df210f4867
sh-5.0# git clone https://github.com/packit/hello-world.git 
Cloning into 'hello-world'...
remote: Enumerating objects: 116, done.
remote: Total 116 (delta 0), reused 0 (delta 0), pack-reused 116
fatal: write error: Bad file descriptor
fatal: index-pack failed

Outside the shared storage, the cloning works nicely:

sh-5.0# cd /var/tmp/ 
sh-5.0# git clone https://github.com/packit/hello-world.git 
Cloning into 'hello-world'...
remote: Enumerating objects: 116, done.
remote: Total 116 (delta 0), reused 0 (delta 0), pack-reused 116
Receiving objects: 100% (116/116), 27.04 KiB | 6.76 MiB/s, done.
Resolving deltas: 100% (53/53), done.

Project: testing-farm
pvc: testing-farm-artifacts pv-5gi-28a74c90-1b7b-5342-b473-66c862abf222
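
A plain write test on the same mount would show whether this is git-specific or a general problem writing to that PV (just a sketch reusing the artifacts path above; dd was not run as part of the original report):

sh-5.0# cd /opt/cruncher/artifacts/00d31af1-aea4-405e-acfd-c0df210f4867
sh-5.0# dd if=/dev/zero of=write-test bs=1M count=10 conv=fsync
sh-5.0# rm -f write-test

If dd also errors out or hangs, the NFS-backed volume itself is the problem rather than git.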


Can we find another example where this happens? I need the pod name to see if we can narrow this down to one of the nodes.
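
For reference, listing pods together with the node they are scheduled on should be enough to spot a pattern (the pod name below is a placeholder):

$ oc get pods -n testing-farm -o wide
$ oc get pod <pod-name> -n testing-farm -o jsonpath='{.spec.nodeName}{"\n"}'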

@bstinson I have a debug pod running in testing-farm project:

$ oc describe pod/tft-00d31af1-aea4-405e-acfd-c0df210f4867-debug | grep -i node
Node:         n22.kempty.ci.centos.org/172.22.6.92
Node-Selectors:  oci_kvm_hook=allowed

$ oc project testing-farm
$ oc rsh tft-00d31af1-aea4-405e-acfd-c0df210f4867-debug
$ cd /opt/cruncher/artifacts/00d31af1-aea4-405e-acfd-c0df210f4867
# git clone https://github.com/packit/hello-world.git 
Cloning into 'hello-world'...
.... HANGS ....

There were some hung NFS mounts on that node. I went ahead and evacuated it and rebooted. Let's see if we have another node exhibiting symptoms.
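
Roughly the steps involved (a sketch, not necessarily the exact commands used; drain flags differ between oc client versions, newer ones use --delete-emptydir-data instead of --delete-local-data).

On the node, check for NFS mounts and for processes stuck in uninterruptible sleep (D state):

[root@n22 ~]# grep ' nfs' /proc/mounts
[root@n22 ~]# ps -eo state,pid,comm | awk '$1 == "D"'

Then evacuate the node (drain also cordons it), reboot it, and let it take pods again:

$ oc adm drain n22.kempty.ci.centos.org --ignore-daemonsets --delete-local-data
[root@n22 ~]# reboot
$ oc adm uncordon n22.kempty.ci.centos.org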

Metadata Update from @arrfab:
- Issue assigned to bstinson
- Issue priority set to: Waiting on Reporter (was: Needs Review)
- Issue tagged with: centos-ci-infra

3 years ago

Seems to be working fine now. Several recent testing jobs successfully finished.

Thanks, so I believe we can close this

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

Metadata Update from @mvadkert:
- Issue status updated to: Open (was: Closed)

3 years ago

The issue is back now @arrfab @bstinson @dkirwan ... so reopening the issue.

I've cordoned off n22.kempty, but I don't have a pod/host I can use to trace this down. What is left on the blocker list for moving to the new cluster?
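
Cordoning only marks the node unschedulable, it does not evict anything already running there; roughly, using the full node name from earlier:

$ oc adm cordon n22.kempty.ci.centos.org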

We are not able to move to kubevirt. It is just too much work for us right now. We still need the old cluster to live for a few more months.

All recent packit tests are failing. @bstinson, could you please look into this once more?

A lot of these are centered around n24.kempty. I took that one out while it's draining/rebooting.
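
To confirm the failures really concentrate on that node, something like this is enough (the namespace filter is optional):

$ oc get pods --all-namespaces -o wide | grep n24.kempty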

Two staging packit jobs failed with the "Problem with Testing-Farm cluster" error:

Other jobs for that pull request seem to be fine.

Metadata Update from @arrfab:
- Issue marked as depending on: #53

3 years ago

FYI: linked to #53, so feel free to have a look there to follow the status. Normally we'll fix this for the new OpenShift cluster, while the legacy one will have to disappear soon, so you'd rather migrate to the new one anyway and benefit from the new infra.

#53 is closed now, and given that the last update here was 2 months ago, is this issue presumably resolved?

Folks, the older cluster will be unavailable soon (1-2 months). I would immediately look at finding a replacement location to host your workloads if you are unable to move to the CentOS CI OpenShift 4 cluster.

Closing this issue.

Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

End of January sounds good; we should be finished by then.

@dkirwan one month is insufficient, as the break is near.
