... the boxes end up in the "Error" state, but they still consume the RAM quota.
E.g. Copr_builder_263105240, 34ff749d-722a-41f9-a555-191c12820f0f
From the client's perspective, I can't seem to get any more info about the reason.
By luck, a few VMs started overnight (or did someone step in?), but the problem still persists :-(. See also #7721.
Ping, still the same problem: https://bugzilla.redhat.com/show_bug.cgi?id=1718497
I tried to kill some errored VMs manually, and I hope that the stuck queue can get processed again: https://copr.fedorainfracloud.org/status/pending/
But someone with the rights to debug this should step in.
Note that not only ppc64le builders are affected: the errored VMs seem to eat from the quota that is shared with x86 builders, so this issue in turn causes x86 builders to fail to spawn as well.
Temporarily I disabled ppc64le builders in /etc/copr/copr-be.conf:
group1_max_vm_per_user=0
group1_max_vm_total=0
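For anyone who wants to double-check which builder groups are currently switched off, here is a minimal sketch in Python. It assumes the file is INI-style with the options under a `[backend]` section, and it treats a limit of 0 as "disabled"; the `group_disabled` helper and the `SAMPLE` text are hypothetical and only for illustration, not part of copr itself.

```python
import configparser

# Sample text mirroring the two settings above; the [backend] section
# name is an assumption about the layout of copr-be.conf.
SAMPLE = """
[backend]
group1_max_vm_per_user=0
group1_max_vm_total=0
"""


def group_disabled(conf_text, group, section="backend"):
    """Return True if either VM limit of the given group is set to 0.

    Hypothetical helper: a missing option is treated as "not disabled".
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    total = parser.getint(section, f"group{group}_max_vm_total", fallback=None)
    per_user = parser.getint(section, f"group{group}_max_vm_per_user", fallback=None)
    return total == 0 or per_user == 0


if __name__ == "__main__":
    print("group1 disabled:", group_disabled(SAMPLE, 1))
```

To check the real file, one could pass `open("/etc/copr/copr-be.conf").read()` instead of `SAMPLE`, provided the section assumption holds.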
To debug this, one can reproduce it on the staging instance:
$ ansible-playbook /home/copr/provision/builderpb_nova_ppc64le.yml
...
fatal: [localhost -> localhost]: FAILED! => {"changed": false, "msg": "Error in creating the server, please check logs"}
...
The instance started by my attempt: Copr_builder_519856383
In the dashboard I now found this error:
Message: No valid host was found.
Code: 500
Details:
Created: June 12, 2019, 7:41 a.m.
With @msuchy we just took a look at this...
We first tried restarting the openstack-nova-compute service on the fed-cloud-ppc02.cloud.fedoraproject.org hypervisor, but it did not help. There was some problem with locks.
Then we rebooted it, and the ppc64le architecture seems to be processing builds in copr now. We'll wait until the queue gets a bit shorter before announcing that it is working (to see that it doesn't get stuck again).
Still, from time to time the "No valid host was found." failure occurs. Such VMs are still not removed automatically, but are kept in the Error state. I can delete them manually (not always on the first attempt).
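Since the manual deletion does not always succeed on the first attempt, the cleanup could be scripted with a few retries. Below is a rough sketch in Python, not what is actually deployed: `delete_errored`, the `(name, status)` pairs, and the `delete` callback are all hypothetical stand-ins for whatever OpenStack client is at hand, and the "ERROR" status string is an assumption about how that client reports the state.

```python
import time


def delete_errored(servers, delete, max_attempts=3, delay=0.0):
    """Try to delete every server in the ERROR state, retrying on failure.

    servers: iterable of (name, status) pairs (hypothetical input shape).
    delete:  callable taking a server name and returning True on success
             (hypothetical wrapper around the real delete call).
    Returns the names that could not be deleted after max_attempts tries.
    """
    leftover = []
    for name, status in servers:
        if status != "ERROR":
            continue  # leave healthy or still-building VMs alone
        for _ in range(max_attempts):
            if delete(name):
                break
            time.sleep(delay)  # brief pause before the next attempt
        else:
            # All attempts failed; report the VM instead of looping forever.
            leftover.append(name)
    return leftover


if __name__ == "__main__":
    attempts = {}

    def flaky_delete(name):
        # Simulated backend: the first attempt fails, the second succeeds.
        attempts[name] = attempts.get(name, 0) + 1
        return attempts[name] >= 2

    vms = [("Copr_builder_519856383", "ERROR"), ("some_other_vm", "ACTIVE")]
    print("not deleted:", delete_errored(vms, flaky_delete))
```

Anything still listed in the return value would need someone with access to the hypervisor to look at it.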
Closing this, as the issue doesn't seem to occur anymore.
Metadata Update from @praiskup: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)