#933 CentOS CI's C8S ppc64le machine pool is empty
Closed: Fixed 2 years ago by arrfab. Opened 2 years ago by mrc0mmand.

Since midnight on Saturday (according to the CI logs), I can't get any machines from the virt-one-medium-centos-8s-ppc64le pool. The pool itself is in the provisioning state but never manages to provision a machine:

$ duffy client show-pool virt-one-medium-centos-8s-ppc64le
{
  "action": "get",
  "pool": {
    "name": "virt-one-medium-centos-8s-ppc64le",
    "fill_level": 5,
    "levels": {
      "provisioning": 5,
      "ready": 0,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}
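
For anyone who wants to script this check, here is a minimal sketch that reads show-pool output like the one above and flags a pool with nothing ready. It assumes the JSON layout shown above; the function name pool_looks_stuck and the heuristic itself (zero ready machines while some are still provisioning) are illustrative assumptions, not part of duffy.

```python
import json


def pool_looks_stuck(show_pool_output: str) -> bool:
    """Return True when no machine is ready but the pool is still provisioning.

    Hypothetical helper: the "stuck" heuristic is an assumption for
    illustration, based only on the fields visible in the output above.
    """
    levels = json.loads(show_pool_output)["pool"]["levels"]
    return levels["ready"] == 0 and levels["provisioning"] > 0


# Sample copied from the output in this ticket.
SAMPLE = """
{
  "action": "get",
  "pool": {
    "name": "virt-one-medium-centos-8s-ppc64le",
    "fill_level": 5,
    "levels": {
      "provisioning": 5,
      "ready": 0,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}
"""

print(pool_looks_stuck(SAMPLE))  # True for the snapshot above
```

A single snapshot cannot tell a stuck pool from one that is merely refilling, so a True result is only a hint to look again a few minutes later.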

The virt-one-medium-centos-9s-ppc64le pool seems to be fine: after draining it (requesting all 5 available machines and returning them, sorry for that), it slowly fills back up. That said, after monitoring the duffy client show-pool virt-one-medium-centos-9s-ppc64le command for ~10 minutes, the pool had provisioned only one machine, and the provisioning field now keeps switching between 4, 3, and 0 without adding any machines to the pool, so there might be some issues in the background.
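
The kind of monitoring described above can be sketched as a small polling loop. This is only a rough sketch: it assumes the duffy client is installed and configured and that show-pool prints JSON as in the output earlier in this ticket. The helper names (show_pool_levels, made_progress, watch) and the default interval and round count are hypothetical choices.

```python
import json
import subprocess
import time


def show_pool_levels(pool_name):
    """Run `duffy client show-pool <pool_name>` and return its "levels" dict."""
    result = subprocess.run(
        ["duffy", "client", "show-pool", pool_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)["pool"]["levels"]


def made_progress(samples):
    """Return True if the "ready" count rose between any two consecutive samples."""
    ready = [sample["ready"] for sample in samples]
    return any(later > earlier for earlier, later in zip(ready, ready[1:]))


def watch(pool_name, interval_s=60, rounds=10):
    """Poll the pool `rounds` times and report whether "ready" ever increased."""
    samples = []
    for _ in range(rounds):
        samples.append(show_pool_levels(pool_name))
        time.sleep(interval_s)
    return made_progress(samples)
```

For example, watch("virt-one-medium-centos-9s-ppc64le") would poll for roughly ten minutes with the defaults and return False if the ready count never went up in that window.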


Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra, high-gain, high-trouble, investigation

2 years ago

Metadata Update from @arrfab:
- Issue assigned to arrfab

2 years ago

Confirmed, and it's affecting both CentOS Linux 7 and Stream 8 guests on that Power8 node, which runs CentOS Stream 8. At first sight, it seems some updates are now blocking the deployment of any VM.
I've implemented a workaround for the CentOS Stream 8 ppc64le guests, so at least these should be available again:

duffy client show-pool virt-one-medium-centos-8s-ppc64le
{
  "action": "get",
  "pool": {
    "name": "virt-one-medium-centos-8s-ppc64le",
    "fill_level": 5,
    "levels": {
      "provisioning": 0,
      "ready": 5,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}

The problem is that they are deployed on the Power9 ppc64le hypervisor, which is already used to deploy the Stream 9 ppc64le guests, so that will slow things down for both (but it's a workaround).

I'll investigate the underlying issue and see whether we can resolve it.

I had a little bit of time yesterday to investigate, and the previously documented solution doesn't seem to work on CentOS Stream 8 (it got several qemu-kvm, libvirt, and related package updates last week).
I didn't downgrade packages to find which one is the culprit; in parallel, I also saw that the underlying raid5 array had some issues and that switching to raid10 would be faster and better suited to such a CI workload.

So the following actions were taken:

  • updated the IBM Power8 firmware to the latest supported version
  • updated the ipr raid controller firmware to the latest supported version
  • reconfigured the array from raid5 to raid10 and added some disks as hot spares
  • reinstalled that IBM Power8 with RHEL 8.6
  • reconfigured it through Ansible as an OpenNebula host in the OpenNebula CI cluster
  • tested that I could provision both CentOS Linux 7 and Stream 8 ppc64le cloud images

As it's all working, I'm closing this for now. The Stream 8 workload should be rebalanced automatically for the next requests (ideally Stream 8 on Power8 and Stream 9 on Power9).

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

