#10698 Redeploy openQA workers to address 2009585
Closed: Fixed 2 years ago by adamwill. Opened 2 years ago by adamwill.

We seem to be running out of ideas for fixing https://bugzilla.redhat.com/show_bug.cgi?id=2009585, which is a problem, as we can't leave the openQA workers on 5.11 kernels forever. @cmurf suggests this as an alternative:

[Reprovision] us[ing] a kickstart specifying --chunksize 64 for software raid; and avoid resizing XFS, i.e. make sure the / file system, or separate /var, is the size you want at mkfs.xfs time, rather than later growing the LV and the file system, which is what was done with this file system.

This seems like a sensible idea to me. But I can't do it myself, as provisioning is done by the infra team. Previously the worker boxes have always been provisioned with a small / filesystem, which I have resized after initial deployment.
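
To make the suggestion a bit more concrete, here is a rough sketch (not a tested kickstart) of the kind of storage section it implies. The small Python helper just renders the lines. The disk names, RAID level, volume group name and root size are all made-up placeholders, not the real worker layout; only the 64 KiB chunk size and the "size / at install time, no later grow" idea come from the suggestion above. /boot and bootloader partitions are left out.

```python
def storage_section(disks, root_size_mib, chunk_kib=64, raid_level="RAID0"):
    """Render a kickstart storage fragment: software RAID -> LVM -> XFS root.

    All names here (md0, vg_worker, root) are hypothetical placeholders.
    """
    lines = []
    members = []
    for index, disk in enumerate(disks, start=1):
        member = f"raid.{index:02d}"
        members.append(member)
        # One RAID member partition per disk, grown to fill the disk.
        lines.append(f"part {member} --size=1 --grow --ondisk={disk}")

    # The chunk size is set on the RAID array itself (value in KiB).
    lines.append(
        f"raid pv.01 --device=md0 --level={raid_level} "
        f"--chunksize={chunk_kib} " + " ".join(members)
    )
    lines.append("volgroup vg_worker pv.01")
    # Give / its final size here, so the LV and XFS never need growing later.
    lines.append(
        f"logvol / --vgname=vg_worker --name=root --fstype=xfs "
        f"--size={root_size_mib}"
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical two-disk worker with a ~200 GiB root filesystem.
    print(storage_section(["sda", "sdb"], root_size_mib=200000))
```

The point of the sketch is only that the chunk size goes on the `raid` line and the final / size goes on the `logvol` line, so nothing has to be resized after deployment.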

Could someone work with me to do this? Probably starting with the lab workers, then we'll test those out for a few days, then - if all looks good - do the prod workers?

Thanks!


Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 years ago

So, a few questions:

Do you want to stick with XFS? I'm actually moving our other non-virthost Fedora machines to btrfs now...

Is this only the x86 workers, or all of them?

I'm assuming they should be F36?

I'm fine with any fs that doesn't fail on us. We can try btrfs.

This would be all of them. The bug does seem to affect multiple arches, though it's less common on some. And we may as well keep them all consistent.

Yes, F36.

So, I know @chrismurphy was trying to isolate this on a test machine...

Should we keep this ticket open to track that? Or should we just close it out, and reopen it or file a new one once we know what's happening?

Probably close this and just carry on tracking it on chat for now, and file other tickets as needed.

For the record, the status is that we did the redeployment but it didn't solve the problem. We still have clear problems with any multi-disk configuration with kernels after 5.11, and even a single-disk configuration gives occasional test failures that are likely a less-severe form of the same problem.

So now we're trying to bisect the actual cause in the kernel, using worker05 as a sacrificial test box for this work.

Metadata Update from @adamwill:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

