#6824 Beef up openQA aarch64 worker hardware capacity
Closed: Fixed 8 months ago Opened 10 months ago by adamwill.

We have a couple of systems up and running as aarch64 workers for openQA right now: openqa-aarch64-01.qa and openqa-aarch64-02.qa. They're deployed on the staging openQA instance, but they seem to be a bit marginal resource-wise.

Initially we configured openQA such that each ran four 'workers' (VMs running a test, basically) concurrently, each with 4GiB of RAM. This obviously is a potential issue given that the boxes only have 16GiB each, and there's overhead for the machine itself; we could exhaust available RAM if all four workers get close to max usage. Each VM had 1 vCPU.

So I recently tweaked this config so each runs three 'workers' concurrently with 3GiB of RAM each, and 2 vCPUs (we use 2 vCPUs per worker on x86_64 and ppc64, so it seemed good to be consistent, and it more accurately mirrors real-world usage). However, we're still seeing some flakiness that looks like resource contention - tests stalling during intensive activity, or just timing out due to taking too long, stuff like that. This is quite important as aarch64 is now a primary arch for Server, and Server testing relies heavily on openQA: we really need the tests to execute as reliably as possible.
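
For anyone who wants the arithmetic behind the two configs spelled out, here's a rough sketch. The function name is made up for illustration, and it only counts the RAM handed to the VMs; it assumes nothing about how much the host OS and QEMU themselves need, so treat the headroom numbers as an upper bound.

```python
def ram_headroom_gib(host_ram_gib, workers, ram_per_worker_gib):
    """Host RAM (GiB) left over if every worker VM uses its full allocation.

    Hypothetical helper: ignores host OS and hypervisor overhead.
    """
    return host_ram_gib - workers * ram_per_worker_gib


HOST_RAM_GIB = 16  # each aarch64 worker host has 16GiB

# Original config: four workers, 4GiB and 1 vCPU each
original_headroom = ram_headroom_gib(HOST_RAM_GIB, workers=4, ram_per_worker_gib=4)
original_vcpus = 4 * 1

# Tweaked config: three workers, 3GiB and 2 vCPUs each
tweaked_headroom = ram_headroom_gib(HOST_RAM_GIB, workers=3, ram_per_worker_gib=3)
tweaked_vcpus = 3 * 2

print(f"original: {original_headroom} GiB headroom, {original_vcpus} vCPUs")
print(f"tweaked:  {tweaked_headroom} GiB headroom, {tweaked_vcpus} vCPUs")
```

So the original config could commit all 16GiB to the VMs with nothing left for the host, while the tweaked one leaves 7GiB spare but asks for more vCPUs in total.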

One thing I noticed is that each machine appears to have just a single consumer-grade 500GB 7200RPM SATA hard disk for storage: a Seagate ST500DM002, to be precise. The x86_64 worker hosts have much more substantial storage (RAID-6 sets built out of enterprise-grade 10000RPM / 15000RPM SAS disks). I can certainly believe we're hitting I/O bottleneck issues, particularly if three install tests run and hit package installation simultaneously (which commonly happens).

I think it may help if the hosts had 32GiB of RAM each and faster storage, perhaps SSD storage. They don't need a lot of storage - 128GiB should be fine.

It would also help if we had more worker hosts; 6 or 8 simultaneous jobs isn't really a lot, it's certainly not enough to deploy to stg and prod simultaneously. If we have more Mustangs available it'd be good to get some of them for openQA purposes, or @kevin suggested we may be able to give a couple of Moonshot carts to openQA. Apparently each cart has 64GiB RAM and a 500GB SSD, which should more than suffice.

@pwhalen @pbrobinson @puiterwijk

Looking into this, the Mustangs would need very specific DIMMs to use 32G, and only have two slots so would need two 16G sticks. There have also been issues with some SSDs depending on the hardware revision. I will update the ticket with model details when available.

The m400 might be the best option if we have carts to allocate to openQA and Fedora installs are working with the current firmware.

Just an update here:

I thought it would be easier to just allocate some of the Moonshot carts for this.
However, I cannot seem to get Fedora installed on them. :crying_cat_face:

I am working with arm folks to try and see if we can get an install going.

If not, we may have to fall back on making the Mustangs better (in particular we can at least try SSDs).

Metadata Update from @kevin:
- Issue assigned to kevin
- Issue priority set to: Waiting on Assignee

10 months ago

Thanks a lot for the update.

The Moonshots are in place and being used now.


Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

8 months ago

Yup, and as well as giving us more capacity they definitely seem to be behaving more reliably than the Mustangs. :thumbsup:
