#7868 openstack: copr: ppc64le boxes fail to spawn
Closed: Fixed 2 years ago by praiskup. Opened 2 years ago by praiskup.

... the boxes end up in the "Error" state, but they still consume the RAM quota.

E.g. Copr_builder_263105240, 34ff749d-722a-41f9-a555-191c12820f0f

From the client's perspective, I can't seem to get any more information about the reason.


By luck, a few VMs started overnight (or did someone step in?), but the problem still persists :-(. Note also #7721.

ping, still the same problem, https://bugzilla.redhat.com/show_bug.cgi?id=1718497

I tried to kill some errored VMs manually, and I hope that the stuck queue can get processed again: https://copr.fedorainfracloud.org/status/pending/

But someone with the rights to debug this should step in.

Note that not only ppc64le builders are affected: the errored VMs seem to eat from a quota which is shared with the x86 builders, so this issue in turn causes x86 builders to fail to spawn as well.
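To illustrate the quota effect described above, here is a minimal sketch. It assumes (as observed in this ticket) that instances in the Error state still count against the shared RAM quota. The `Instance` class, the instance names, and all numbers are hypothetical; this is not how OpenStack or copr-backend computes quota.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    status: str   # e.g. "ACTIVE" or "ERROR"
    ram_mb: int


def ram_usage(instances):
    """Return (total_mb, errored_mb) counted against the shared quota."""
    total = sum(i.ram_mb for i in instances)
    errored = sum(i.ram_mb for i in instances if i.status == "ERROR")
    return total, errored


def can_spawn(instances, quota_mb, requested_mb):
    """True if a new VM of requested_mb still fits into the quota."""
    total, _ = ram_usage(instances)
    return total + requested_mb <= quota_mb


# Hypothetical example: two errored ppc64le boxes block an x86 builder.
instances = [
    Instance("x86-builder-1", "ACTIVE", 4096),
    Instance("ppc64le-builder-1", "ERROR", 4096),
    Instance("ppc64le-builder-2", "ERROR", 4096),
]
print(can_spawn(instances, quota_mb=12288, requested_mb=4096))
healthy = [i for i in instances if i.status != "ERROR"]
print(can_spawn(healthy, quota_mb=12288, requested_mb=4096))
```

With the errored boxes included the new VM does not fit; once they are removed it does, which matches the behaviour seen with the x86 builders.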

I have temporarily disabled the ppc64le builders in /etc/copr/copr-be.conf:

group1_max_vm_per_user=0
group1_max_vm_total=0
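A quick way to double-check which builder groups are currently switched off this way could look like the sketch below. It only assumes `key=value` lines of the form shown above; the helper name `disabled_groups` and the `group0_max_vm_total=10` line in the example are hypothetical, and the real file may contain sections and options this sketch ignores.

```python
def disabled_groups(conf_text):
    """Return sorted group numbers whose groupN_max_vm_total is 0."""
    prefix, suffix = "group", "_max_vm_total"
    disabled = []
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith(prefix) and key.endswith(suffix):
            group_id = key[len(prefix):-len(suffix)]
            if group_id.isdigit() and value.isdigit() and int(value) == 0:
                disabled.append(int(group_id))
    return sorted(disabled)


# Hypothetical config excerpt; only the group1 lines come from this ticket.
example = """
group0_max_vm_total=10
group1_max_vm_per_user=0
group1_max_vm_total=0
"""
print(disabled_groups(example))
```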

To debug this, one can reproduce it on the staging instance:

$ ansible-playbook /home/copr/provision/builderpb_nova_ppc64le.yml
...
fatal: [localhost -> localhost]: FAILED! => {"changed": false, "msg": "Error in creating the server, please check logs"}
...

Started instance from my attempt: Copr_builder_519856383

In the dashboard I now found this error:

Message
    No valid host was found. 
Code
    500
Details
Created
    June 12, 2019, 7:41 a.m. 

With @msuchy we just took a look at this...

We first tried to restart the openstack-nova-compute service on the
fed-cloud-ppc02.cloud.fedoraproject.org hypervisor, but it did not help.
There was some problem with locks.

Then we rebooted, and the ppc64le architecture seems to be processing
builds in copr now... we'll wait a bit until the queue gets shorter
before announcing that it is working (to see that it doesn't get stuck
again).

Still, from time to time the "No valid host was found." failure occurs. Such VMs are still not removed automatically, but kept in the Error state. I can delete them manually (though not always on the first attempt).
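Since deleting does not always work on the first attempt, a cleanup job would need to retry. Below is a minimal sketch of such a loop. The `delete` callable is a placeholder for whatever actually removes the VM (it is passed in so that no particular OpenStack client API is assumed), and the function names, the `(id, status)` tuples, and the retry counts are all hypothetical.

```python
import time


def delete_with_retry(server_id, delete, attempts=3, delay=0.0):
    """Call delete(server_id) until it succeeds or attempts run out.

    Returns the number of the attempt that succeeded, or 0 on failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            delete(server_id)
            return attempt
        except Exception:
            # Wait before the next try, but not after the last one.
            if attempt < attempts:
                time.sleep(delay)
    return 0


def cleanup_errored(servers, delete, attempts=3, delay=0.0):
    """Delete every server in ERROR state; return ids that could not be deleted.

    servers is an iterable of (server_id, status) pairs.
    """
    failed = []
    for server_id, status in servers:
        if status != "ERROR":
            continue
        if not delete_with_retry(server_id, delete, attempts, delay):
            failed.append(server_id)
    return failed
```

The returned list of ids that still could not be deleted is what would need manual attention.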

Closing this, as the issue doesn't seem to occur anymore.

Metadata Update from @praiskup:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
