Hi,
It seems our pods have issues cloning data to the shared storage:
sh-5.0# cd /opt/cruncher/artifacts/00d31af1-aea4-405e-acfd-c0df210f4867
sh-5.0# git clone https://github.com/packit/hello-world.git
Cloning into 'hello-world'...
remote: Enumerating objects: 116, done.
remote: Total 116 (delta 0), reused 0 (delta 0), pack-reused 116
fatal: write error: Bad file descriptor
fatal: index-pack failed
Outside the shared storage, the cloning works nicely:
sh-5.0# cd /var/tmp/
sh-5.0# git clone https://github.com/packit/hello-world.git
Cloning into 'hello-world'...
remote: Enumerating objects: 116, done.
remote: Total 116 (delta 0), reused 0 (delta 0), pack-reused 116
Receiving objects: 100% (116/116), 27.04 KiB | 6.76 MiB/s, done.
Resolving deltas: 100% (53/53), done.
Project: testing-farm
PVC: testing-farm-artifacts
PV: pv-5gi-28a74c90-1b7b-5342-b473-66c862abf222
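As a side note, a plain write to that directory should tell whether it is the storage layer rather than git; a minimal check, assuming the same pod and path as above:

cd /opt/cruncher/artifacts/00d31af1-aea4-405e-acfd-c0df210f4867
df -hT .                               # which filesystem backs this path
mount | grep /opt/cruncher/artifacts   # the mount entry and its options
touch write-test && rm write-test      # small write
dd if=/dev/zero of=dd-test bs=1M count=10 && rm dd-test   # streamed write, similar to what index-pack does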
Can we find another example where this happens? I need the pod name to see if we can narrow this down to one of the nodes.
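For what it's worth, the pod-to-node mapping can be pulled directly; a rough example (the grep pattern for the artifact pods is just a guess):

$ oc project testing-farm
$ oc get pods -o wide | grep tft-   # the NODE column shows where each pod runs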
@bstinson I have a debug pod running in the testing-farm project:
$ oc describe pod/tft-00d31af1-aea4-405e-acfd-c0df210f4867-debug | grep -i node
Node:           n22.kempty.ci.centos.org/172.22.6.92
Node-Selectors: oci_kvm_hook=allowed
$ oc project testing-farm
$ oc rsh tft-00d31af1-aea4-405e-acfd-c0df210f4867-debug
$ cd /opt/cruncher/artifacts/00d31af1-aea4-405e-acfd-c0df210f4867
# git clone https://github.com/packit/hello-world.git
Cloning into 'hello-world'...
.... HANGS ....
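To pin the "Bad file descriptor" / hang on writes to the NFS-backed path rather than on git itself, something like this could be run from the debug pod (assuming strace is available or can be installed in the image):

mount | grep /opt/cruncher/artifacts    # confirm what backs the path
strace -f -e trace=write,openat -o /var/tmp/clone.trace \
    git clone https://github.com/packit/hello-world.git
tail -n 20 /var/tmp/clone.trace         # the failing or hanging syscalls end up here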
There were some hung NFS mounts on that node. I went ahead and evacuated it and rebooted. Let's see if we have another node exhibiting symptoms.
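For reference, the usual evacuate-and-reboot sequence looks roughly like this (exact drain flags may differ between oc versions):

$ oc adm cordon n22.kempty.ci.centos.org      # stop new pods landing on the node
$ oc adm drain n22.kempty.ci.centos.org --ignore-daemonsets --delete-local-data
# ... reboot the node ...
$ oc adm uncordon n22.kempty.ci.centos.org    # allow scheduling again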
Metadata Update from @arrfab:
- Issue assigned to bstinson
- Issue priority set to: Waiting on Reporter (was: Needs Review)
- Issue tagged with: centos-ci-infra
Seems to be working fine now. Several recent testing jobs successfully finished.
Thanks, so I believe we can close this
Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)
Thanks!
Metadata Update from @mvadkert:
- Issue status updated to: Open (was: Closed)
The issue is back now @arrfab @bstinson @dkirwan ... so reopening the issue.
Example pull request with failures: https://github.com/psss/tmt/pull/338
I've cordoned off n22.kempty, but I don't have a pod/host to trace this down with. What is left on the blocker list for moving to the new cluster?
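If it helps narrow things down, a rough way to list what is (or was) scheduled on a given node, assuming this oc version supports field selectors:

$ oc get pods --all-namespaces -o wide --field-selector spec.nodeName=n22.kempty.ci.centos.org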
We are not able to move to KubeVirt. It is just too much work for us right now. We still need the old cluster to stay around for a few more months.
All recent packit tests are failing. @bstinson, could you please look into this once more?
A fresh failing job example: https://github.com/psss/tmt/pull/343
Another failing job: https://github.com/oamg/convert2rhel/pull/100
A lot of these are centered around n24.kempty. I took that one out while it's draining/rebooting.
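Before putting n24.kempty back in, it may be worth a node-side check for hung NFS mounts, assuming SSH access to the node:

$ ssh root@n24.kempty.ci.centos.org
# dmesg | grep -i nfs | tail
# mount | grep -i nfs
# nfsstat -m

Repeated "server not responding" messages in dmesg, or a mount listing that hangs, would match the hung-mount symptom seen earlier on n22.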
Two staging packit jobs failed with the "Problem with Testing-Farm cluster" error:
Other jobs for that pull request seem to be fine.
Metadata Update from @arrfab:
- Issue marked as depending on: #53
FYI: linked to #53, so feel free to follow the status there. Normally we'll fix this for the new OpenShift cluster, while the legacy one will have to disappear soon, so you'd rather migrate to the new one anyway and benefit from the new infra.
#53 is closed now, and given that the last update here was 2 months ago, is this issue presumably resolved?
Folks, the older cluster will be unavailable soon (1-2 months). I would immediately look at finding a replacement location to host your workloads if you are unable to move to the CentOS CI OpenShift 4 cluster.
Closing this issue.
Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)
End of January sounds good, we should be finished by then.
@dkirwan one month is insufficient, as the break is near.