#8744 New openQA aarch64 worker hosts (openqa-aarch64-0*) bottlenecked by storage
Closed: Fixed 4 months ago by smooge. Opened 4 months ago by adamwill.

So, we wondered if this would be the case, and it looks like it is. The new aarch64 openQA worker hosts should be able to handle quite a lot of simultaneous jobs, but it looks like storage is a problem. I'm getting jobs timing out more than I'd expect, which is one indicator, but much more obviously, I just did a re-run of a lot of jobs - which means a lot all start up immediately, exactly simultaneously, on all the hosts - and something like a third of those jobs failed. Most failed to the UEFI shell, a couple failed claiming /etc was a read-only filesystem. They were all transient failures - restarting the test got past the failure.

I can't 100% guarantee it but I'm fairly sure storage bottleneck is the problem here, given we haven't really seen similar behaviour on other hosts. It could also be network contention, but I don't think so (especially since these are supposed to be on 100Gb, I think).

AFAICS each box just has a single SSD. The x86_64 and ppc64le worker hosts have big RAID sets of enterprise HDDs, I think. Is there anything we can do to level up the storage a bit on these boxes? Thanks! For now I think I might cut them down to 5 worker instances each...


hmm, in fact, on closer inspection, it may be just openqa-aarch64-02 wigging out for some reason. It seems like all the failures were on that host. I'm rebooting it and will see if it gets better.

We may be able to scavenge some SSDs from the old Calxeda box...

I am going to see about adding more drives today from scavenged parts. We need to do this anyway as these systems are donations and probably do not have active warranties on them if they are like other donated hardware we have.

So more storage would still be great, but in fact it really is that openqa-aarch64-02 is bad somehow. All tests run on it are failing, all tests run on the other hosts are fine. I can't figure out why it's bad yet, though.
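One way to check whether failures really are concentrated on a single host is to tally job results per worker host. The following is only a minimal sketch: the `(host, result)` pairs are made-up sample data, the host names other than openqa-aarch64-02 are assumed, and `failure_rate_by_host` is a hypothetical helper, not part of openQA.

```python
from collections import Counter


def failure_rate_by_host(jobs):
    """Return {host: (failed, total, rate)} for an iterable of (host, result) pairs.

    Any result other than "passed" is counted as a failure (an assumption
    made for this sketch).
    """
    totals = Counter()
    failures = Counter()
    for host, result in jobs:
        totals[host] += 1
        if result != "passed":
            failures[host] += 1
    return {
        host: (failures[host], totals[host], failures[host] / totals[host])
        for host in totals
    }


# Made-up sample data, for illustration only.
jobs = [
    ("openqa-aarch64-01", "passed"),
    ("openqa-aarch64-01", "passed"),
    ("openqa-aarch64-01", "passed"),
    ("openqa-aarch64-02", "failed"),
    ("openqa-aarch64-02", "failed"),
    ("openqa-aarch64-02", "failed"),
    ("openqa-aarch64-03", "passed"),
    ("openqa-aarch64-03", "passed"),
]

stats = failure_rate_by_host(jobs)
for host in sorted(stats):
    failed, total, rate = stats[host]
    print(f"{host}: {failed}/{total} failed ({rate:.0%})")
```

If one host shows a failure rate near 100% while the others sit near 0%, that points at a per-host fault rather than a storage bottleneck shared by all the boxes.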

Metadata Update from @smooge:
- Issue assigned to smooge
- Issue priority set to: Waiting on Assignee (was: Needs Review)

4 months ago

OK, let's keep this as a 'more storage would be nice' ticket, and I'll file another for the Great 02 Mystery (we really cannot figure out why 02 is weirdly busted).

OK, we have no drives that will fit this system. New drives will need to be budgeted and ordered, which means going through various internal processes. I do not see this happening in the near future.

I can put this as a backlog item unless someone can get it moved up.

Metadata Update from @smooge:
- Issue tagged with: backlog, hardware, high-gain, medium-trouble

4 months ago

that's OK for me, as indicated the big problem turned out to be 02 being wackadoodle.

We are going to order 3 drives and get them added to the systems at a later time.

I am ordering disks for this next week. They will be shipped to PHX2 and we will get them installed next week. I think I will track openqa02 in a different ticket for its crackheadness

Metadata Update from @smooge:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 months ago
