#11013 Redeploy openQA workers with consistent storage configuration
Opened a year ago by adamwill. Modified 2 months ago



Describe what you would like us to do:

Between deploying different workers at different times and trying to work around https://bugzilla.redhat.com/show_bug.cgi?id=2009585 , we've wound up with quite a mess in terms of the storage configuration on the openQA workers. Some are encrypted, some aren't. Some are luks-on-btrfs-on-mdraid (I think that's the right ordering), one (I think) uses btrfs native RAID, and one (I think) just uses a single disk and leaves all the others idle...

Since none of the dodges actually worked around 2009585 (except using a single disk, but that makes things slower, so we probably shouldn't stick with that), we should probably just give them all one consistent storage config. Encrypting them is probably a good idea - they don't have anything really super super secret, but they do have the openQA database password and credentials for resultsdb and maybe a couple other not-really-that-big-of-a-deal things I'm forgetting.

When do you need this to be done by? (YYYY/MM/DD)

Whenever, it's not urgent. Please co-ordinate this with me on chat.fp.o . We'll want to do the staging ones first, then prod once we're sure staging is good.

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: low-trouble, medium-gain, ops

a year ago

Do we know what the actual config we want to use on all of them is?

I am traveling next week, but we could look at trying to do them the week after (which is a week before freeze)?

Well, the native btrfs RAID thing seemed to work fine, so maybe that on LUKS?
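For the record, "btrfs native RAID on LUKS" in kickstart terms would look something like the sketch below. This is a hypothetical fragment, not our actual deployment config: the disk names, sizes, label, and especially the passphrase handling are placeholders that would need to be filled in from the real provisioning setup.

```
# Hypothetical kickstart sketch: btrfs native RAID1 on LUKS across two disks.
# Disk names, sizes, label, and passphrase are placeholders.
zerombr
clearpart --all --initlabel --drives=sda,sdb
part /boot --fstype=ext4 --size=1024 --ondisk=sda
# one LUKS container per disk, then btrfs RAID1 on top of both
part btrfs.01 --size=1 --grow --ondisk=sda --encrypted --passphrase=CHANGEME
part btrfs.02 --size=1 --grow --ondisk=sdb --encrypted --passphrase=CHANGEME
btrfs none --data=raid1 --metadata=raid1 --label=openqa btrfs.01 btrfs.02
btrfs / --subvol --name=root LABEL=openqa
```

The ordering matters: LUKS sits on the partitions, and btrfs handles the mirroring itself, so there is no mdraid layer at all in this variant.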

ok. Remind me week after next and we can try and get them done.

Also, I will try and rearrange some staging resources and possibly free up some more workers for you.

This is low priority right now, but we still want to do it.

We didn't get to this yet... after beta?

Moved to next year.

current state of play:

prod:

  • x86-worker01 - xfs-on-lvm-on-luks-on-mdraid mq-deadline
  • x86-worker02 - xfs-on-lvm-on-mdraid mq-deadline
  • x86-worker06 - native-btrfs-raid mq-deadline

lab:

  • x86-worker03 - btrfs-on-luks-on-mdraid bfq
  • x86-worker04 - btrfs-on-luks-on-mdraid bfq
  • x86-worker05 - btrfs-on-luks-on-mdraid bfq
so, a nice mix :D edit: we recently redeployed 04 and 05, and now the lab workers are consistent. I do not understand why they're on the bfq scheduler, though - they all have the udev dropin cmurf recommended, but it doesn't seem to be applying.
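For reference, the kind of scheduler drop-in being discussed is a udev rule along these lines. The exact path, file name, and match are from memory and may differ from the rule cmurf actually recommended:

```
# /etc/udev/rules.d/64-ioschedulers.rules (hypothetical name/content)
# Force mq-deadline on SATA/SAS disks instead of the bfq default.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
```

To debug why it isn't applying, `udevadm test /sys/block/sda` shows which rule files fire for a given disk, and `cat /sys/block/sda/queue/scheduler` shows the currently active scheduler (the one in square brackets).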

Still, I'll keep an eye on things for a while and see if lab behaves itself consistently now that 04 isn't using just one disk, as it had been for some time. prod seems to be behaving fine, despite the mix of layouts.

if we ever do get to a full redeployment, we should also standardize the network interface naming strategy and make the kickstart match it. right now we have the following mix:

a64-worker01 - enp1s0 (udev)
a64-worker02 - eth1 (old-style)
a64-worker03 - eth1 (old-style)
a64-worker04 - eth2 (old-style)
p09-worker01 - enp1s0f0 (udev)
p09-worker02 - eth0 (old-style)
x86-worker01 - em1 (biosdevname)
x86-worker02 - em1 (biosdevname)
x86-worker03 - eno1 (udev)
x86-worker04 - em1 (biosdevname)
x86-worker05 - eno1 (udev)
x86-worker06 - em1 (biosdevname)

in summary: ugh.
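If we do standardize on a redeploy, one option (a sketch of one possible approach, not a decision) is to force old-style names via kernel arguments in the kickstart, so the `network --device=` lines are predictable on every box regardless of hardware:

```
# Hypothetical kickstart fragment: disable biosdevname and udev
# "predictable" naming so every worker gets old-style ethN names.
bootloader --append="net.ifnames=0 biosdevname=0"
network --bootproto=dhcp --device=eth0 --activate
```

The opposite direction (keeping udev naming everywhere and using `network --device=link` to pick the connected interface) would also work; the point is just to pick one scheme and make the kickstart match it.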

