#353 CentOS CI OCP4 unscheduled outage: Storage Issue
Closed: Fixed 2 years ago by dkirwan. Opened 2 years ago by dkirwan.

  • We had multiple reports this morning of "Structure needs cleaning" errors when tenants attempted to bring up their workloads.
  • The cluster upgrade to 4.7.13 is stalled, with multiple upgraded services failing to start with the same "Structure needs cleaning" error.

I've taken the cluster offline to allow diagnosis and repair of the storage. So far we think one of the disks is dying; we will remove it from the array, resync, check the filesystem, etc.


This problem seems bigger than we had anticipated. So far, our investigation suggests this is a low-level hardware issue that will need an on-site visit.

We may have to go with a server replacement (from the logs, this is a symptom of a backplane or controller issue), but we are hoping the on-site visit reveals something like "power cable not connected properly or low voltage".

I spoke with @nirik, who mentioned that there is a spare storinator in Fedora infra that we could use temporarily if the need arises. Hopefully we won't need it :pray:

Hi! Thanks for tracking the status in this issue.

The last update was over 12 hours ago. Is there any information, with some ETA, on the actions that are planned? I understand that giving an expected time for a fix is not easy, but at least sharing the next steps and when they will hopefully happen would be much appreciated!

Thanks,
Niels

With all CI Infra admins in the EMEA timezone (and everyone working at less than full capacity), a 12-hour gap is to be expected.
We are trying to move the data to a different NFS node temporarily to get the cluster started.
We hope it will work after pointing things to a different location, but can't say for sure!
Will update on how it goes.

The data from the bad node is now synced to the temporary node, and we have brought the OpenShift cluster back. It was in the middle of the upgrade when we noticed the issue, and now that it is back, it is continuing the upgrade. From a few pods, it seems the PVs are working as expected. We are going to let the upgrade complete and then do a final health check of the Jenkins instances (whether or not they have PVs connected).

Thanks a lot to the whole CentOS CI infra team for this! So far I can confirm that everything seems to be running smoothly.

Yup, looking good here too so far. Thanks a lot all!

The outage for the CentOS CI OCP 4 cluster is now over; service has been fully restored with a temporary workaround.

We had a hardware failure on the storinator node storage02, which provides storage services to our cluster. The logs show an issue with the backplane.

As a temporary workaround, we have migrated this storage to an older node (which is out of warranty and has had issues itself, so :pray: it stays running). We'll have an on-site engineer visit the data center early next week to diagnose the problem affecting the main storinator node.

At a future date, once this storinator node is repaired/replaced, we will schedule an outage to migrate our storage back to that device.

Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

