#7721 fedorainfracloud.org: remove Deleting instances from `copr` tenant
Closed: Invalid 4 years ago by praiskup. Opened 5 years ago by praiskup.

I'm trying to re-request the delete, but without success:

Request to delete server Copr_builder_387608433 has been accepted.
Request to delete server Copr_builder_555122077 has been accepted.
Request to delete server Copr_builder_855138299 has been accepted.
Request to delete server Copr_builder_565664433 has been accepted.
...

I think that it could help with #7711 as well.


Because of leftovers, all the RAM quota is eaten.

I've tweaked things so it's not counting all those deleting instances. I am not sure I want to try and clean them completely as that could cause further problems.

In any case you should be able to spin up more (and indeed it seems to be doing so).

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 years ago

I believe we are back in the previous state, reopening.

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

4 years ago

Metadata Update from @bowlofeggs:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

4 years ago

I am not sure I want to try and clean them completely as that could cause further problems.

Do you mean additional IO?

I've tweaked things so it's not counting all those deleting instances

It seems the deleting instances are actually counted toward the quota; there are 101 VMs in total (including errored and deleting VMs), which is slightly over the quota (500G RAM). Copr sees only 30 to 50% of them as usable, so it keeps trying to start new ones without success.
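For the record, a quick way to cross-check how the cloud itself counts the builders, assuming the same nova CLI used elsewhere in this ticket (the grep is only a sketch):

# total builders the cloud still accounts for in the copr tenant
$ nova list | grep -c Copr_builder
# builders stuck in ERROR - these still consume quota
$ nova list --status ERROR | grep -c Copr_builder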

I tried to delete them manually with a script, but with no success, even though this output was printed:

Request to delete server a4cb11ac-d658-44a9-92d0-3f7b0a7ec199 has been accepted.
Request to delete server 06ef46e4-62b8-4c56-992f-94e726c285aa has been accepted.
Request to delete server 5573952d-a1e2-4f3c-af9b-80d441d09033 has been accepted.
Request to delete server f0a4926e-1cab-4ced-b8cd-2ea82601569c has been accepted.
Request to delete server a0adb799-2ef9-49a8-8bf2-fc343a0f0326 has been accepted.
Request to delete server 9dbd6789-bbf7-4717-9588-2b2491214651 has been accepted.
Request to delete server 83521103-4400-4ae5-a381-58ac9f256029 has been accepted.
Request to delete server 563f4a88-e947-407c-afe1-a5c4814c2cbe has been accepted.
Request to delete server 895fcbab-2cd6-447e-ab56-53f96fca2d93 has been accepted.
Request to delete server a0bdc530-3f5b-44fe-bf48-697fa12f769c has been accepted.
Request to delete server 71f74a49-1f8d-40c6-83b8-a5e2c790209d has been accepted.
Request to delete server 664068d2-2fb4-45ef-b704-e3d2d98d2e93 has been accepted.
Request to delete server 2bb8b615-46d9-47cf-ba87-f93e3649fdf6 has been accepted.
Request to delete server c987f0e6-c1af-4e12-8db7-3516797965f5 has been accepted.
Request to delete server 1d8f7ee3-f1b6-46c6-abaa-80dfc84d3490 has been accepted.
Request to delete server f6d574d8-9f84-4a74-aa1c-b0833e3080bf has been accepted.
Request to delete server e97836d6-0789-4c8a-ad42-6e966208b7c6 has been accepted.
Request to delete server f4d19339-197f-4078-8eb9-3c75257c1043 has been accepted.
Request to delete server 6b178873-0d7a-4719-a3c9-2042028baf4a has been accepted.
Request to delete server f3c44357-7a0f-4224-9ba9-2004bf3bcd38 has been accepted.
Request to delete server 6485829f-9dd8-4486-9b90-7760086f3029 has been accepted.
Request to delete server d468dd68-5f86-495e-8ca4-dc6270daf5c7 has been accepted.
Request to delete server 4dbae2fb-8280-435d-9281-d33c96fc86ec has been accepted.
Request to delete server 3e2edf2e-5244-40a7-a009-a7137335ce5b has been accepted.
Request to delete server 024970c4-a2c6-4448-836e-8bf63cec16b6 has been accepted.
Request to delete server 12ba08e0-3981-4a0a-8fcd-cc0576cdb458 has been accepted.
Request to delete server 318f500d-37c4-4863-ac0b-51bf7dfec22c has been accepted.
Request to delete server 409b0bd7-a8c8-4df3-ad3f-0e24ada29ad0 has been accepted.
Request to delete server d1558e5c-0438-4cb6-bee5-13ebe3b136fd has been accepted.
Request to delete server b84a2cb4-f25b-47a5-a381-55e71e4de8db has been accepted.
Request to delete server 545b02d8-c27c-487f-b922-0b71908ba599 has been accepted.
Request to delete server 9c4dd3cb-046b-436b-8821-fec1be10cba3 has been accepted.
Request to delete server d70a88ff-aec4-4490-8ff3-9db66b7f0249 has been accepted.
Request to delete server 7157f056-46eb-4628-a272-7ddc8b8fdce1 has been accepted.
Request to delete server eb9caa7e-2231-488a-86e1-daa078ea9f38 has been accepted.
Request to delete server 929b6024-2e58-444a-99cb-abfcf8302bbc has been accepted.
Request to delete server 70b15842-fb82-4fd0-b39d-f7f4f2599b78 has been accepted.
Request to delete server 9d1ae742-1da3-4395-9ca6-878778314303 has been accepted.
Request to delete server ad56ba8c-0fd6-4aac-b0ce-5ae2f7883ab9 has been accepted.
Request to delete server e43a1881-c310-4669-a900-aff3c420610a has been accepted.
Request to delete server efada16f-2336-46bf-9abb-8037304175ff has been accepted.
Request to delete server daeff13b-b2d8-4af6-a2f1-f7f93fe4c4ee has been accepted.
Request to delete server 3045bedb-d00e-459c-ae3c-8efdbced7e9c has been accepted.
Request to delete server 8f9e1ebd-6f98-424d-9976-3e55930cd26c has been accepted.
Request to delete server 6eb758d9-d8ee-468b-95e3-38ffcd46958d has been accepted.
Request to delete server 038c4f20-7b30-4239-82ee-20c3005b9ffb has been accepted.
Request to delete server c527c664-a967-470a-8bc6-3164f36f08a9 has been accepted.
Request to delete server 99589b28-8883-492b-a805-d10bd377325d has been accepted.
Request to delete server b6590cf1-fbf2-4ebf-81a3-dcf81d4bff8d has been accepted.
Request to delete server b6e9ebf6-59c7-4feb-b6e5-5a42ce6cb908 has been accepted.
Request to delete server bf76da45-bf0a-40ea-9adc-c19af11b4e15 has been accepted.
Request to delete server 2c79384a-3760-440a-bfd4-325171245442 has been accepted.
Request to delete server e6edb7af-1afd-4178-a2cc-de10bb976512 has been accepted.
Request to delete server 19722d9d-50f1-480f-8d16-35c03a66119b has been accepted.
Request to delete server c393ac25-148e-4974-b0dd-c087a76be96a has been accepted.
Request to delete server 323a5e31-cb39-464e-813d-2dbe02369096 has been accepted.
Request to delete server 898cb341-5e56-41e4-96d8-664e57cfedb7 has been accepted.
Request to delete server 71f74a49-1f8d-40c6-83b8-a5e2c790209d has been accepted.

I tried

🎩[msuchy@dri/~]$ nova force-delete 71f74a49-1f8d-40c6-83b8-a5e2c790209d
ERROR (Conflict): Cannot 'forceDelete' while instance is in vm_state error (HTTP 409) (Request-ID: req-a1f98602-50f5-4dc0-a367-dd93a61928e0)
🎩[msuchy@dri/~]1$ nova force-delete 898cb341-5e56-41e4-96d8-664e57cfedb7
ERROR (Conflict): Cannot 'forceDelete' while instance is in vm_state error (HTTP 409) (Request-ID: req-77e1bc0e-ad1c-495b-a896-305c759cd3d6)

with no luck :(

I will see what I can do, but the cloud is running on cargo-cult magic at this point. I may fix it by accident or I may not.

$ nova reset-state --active <instance-id>
$ nova force-delete <instance-id>

Could that help, maybe? I'm only going by the docs; I don't have the required permissions to give it a try.
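If someone with admin rights wants to try that across the board, a loop over the stuck builders could look roughly like this (only a sketch based on the two commands above; the awk column assumes the usual nova list table output):

$ nova list --status ERROR --minimal | awk '/Copr_builder/ {print $2}' | \
    while read id; do
      nova reset-state --active "$id"   # move the VM out of vm_state "error"
      nova force-delete "$id"           # force-delete should then be accepted
    done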

I may not either... so I am checking.

So it looks like the instances are ghosts and can't be deleted or removed. I am not sure we can do anything about this without risking that this old OpenStack stops working entirely (it is very EOL).

Thanks for the attempt at least! Just to be sure: I assume you don't have the admin OpenStack rights, so nova reset-state did not work for you. @msuchy, could you please give it a try?

The version of nova on the cloud does not seem to have this command available. [No, that isn't right; it turns out I need to know a lot of details to do this which aren't documented, it would seem. Will keep at it.]

We have historically had a limit of 44 x86 VMs + 8 ppc64le, but because of this bug we are only able to allocate ~35 VMs (and the more failed VMs there are, the lower the number gets), so copr performance suffers...

Ping on this; is there something I could help with?

List of 65 VMs safe to delete now:

Copr_builder_343943925 Copr_builder_757487059 Copr_builder_957841757 Copr_builder_957549044 Copr_builder_341800274 Copr_builder_199316924 Copr_builder_829293924 Copr_builder_341389324 Copr_builder_271539084 Copr_builder_532251784 Copr_builder_498451168 Copr_builder_550009750 Copr_builder_381397710 Copr_builder_93769745 Copr_builder_183848760 Copr_builder_426718452 Copr_builder_782283454 Copr_builder_99133868 Copr_builder_636162777 Copr_builder_669594372 Copr_builder_473125623 Copr_builder_522424400 Copr_builder_290830320 Copr_builder_52287043 Copr_builder_340201262 Copr_builder_971919841 Copr_builder_503643494 Copr_builder_304592442 Copr_builder_346329060 Copr_builder_103017032 Copr_builder_794688770 Copr_builder_711014198 Copr_builder_86988328 Copr_builder_346327526 Copr_builder_403739167 Copr_builder_754305564 Copr_builder_198255119 Copr_builder_578449553 Copr_builder_872665492 Copr_builder_536195719 Copr_builder_447479411 Copr_builder_461194235 Copr_builder_505089636 Copr_builder_334565328 Copr_builder_349759925 Copr_builder_407047249 Copr_builder_521297570 Copr_builder_909399945 Copr_builder_622486869 Copr_builder_109467372 Copr_builder_104583812 Copr_builder_800830552 Copr_builder_497640511 Copr_builder_350844108 Copr_builder_387608433 Copr_builder_555122077 Copr_builder_855138299 Copr_builder_565664433 Copr_builder_620427619 Copr_builder_5746527 Copr_builder_420136155 Copr_builder_343810812 Copr_builder_863613024 Copr_builder_696847353 Copr_builder_726452247

We are all out at a week-long face-to-face with laptops off for the entire day. I am looking at this at the end of 12 hours of meetings and don't really have the energy to try and figure out what can be tried in OpenStack. [I have done everything you and the docs say, and it just says those instances are errored with no way to clean them up. I will talk with my managers first thing in the morning to let them know, and see if I can get out of meetings to focus on this for a bit during lunch.]

OK, we have worked out some of the tasks needed to try and get this fixed:
1. Work out a time together.
2. Shut down copr.
3. Shut down all copr instances.
4. Clean the remaining copr ghost instances out of the database manually (a rough sketch of what this involves is below).
5. Update the usage count in the database as well, because it lives in a different table.
6. Stand up the copr-be, fe, etc. roles again.
7. Fix any issues we find.
8. Get copr back in production.

Sadly the other alternative commands we looked at were not available in the version of the cloud we have.
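For the record, steps 4 and 5 boil down to manual surgery on the nova database. A very rough sketch, assuming the standard nova schema (instances and quota_usages tables); column names differ between releases, so this is not a copy-paste recipe:

$ mysql nova <<'SQL'
-- mark one ghost instance as soft-deleted (nova sets deleted = row id)
UPDATE instances
   SET deleted = id, vm_state = 'deleted', deleted_at = NOW()
 WHERE uuid = '<instance-uuid>' AND deleted = 0;
-- step 5: usage is tracked in a separate table, so recompute it
-- (the ram and cores rows would need the same treatment)
UPDATE quota_usages
   SET in_use = (SELECT COUNT(*) FROM instances i
                 WHERE i.project_id = quota_usages.project_id
                   AND i.deleted = 0)
 WHERE project_id = '<copr-tenant-id>' AND resource = 'instances';
SQL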

3. Shut down all copr instances.

You mean the builders, right? So everything in the copr tenant? In that case, this is pretty much a matter of systemctl stop copr-backend on the backend machine. The system then stops allocating new VMs, and we can kill all the VMs in the copr tenant (tasks from VMs that are actually working on something get rescheduled).
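Roughly, on the backend side that would be something like this (only a sketch; it assumes the OpenStack credentials in the environment point at the production copr tenant):

# stop the backend so it stops allocating new builders
$ systemctl stop copr-backend
# then remove every remaining builder VM in the copr tenant
$ nova list --minimal | awk '/Copr_builder/ {print $2}' | xargs -r -n1 nova delete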

Staging backend has a separate tenant (that one isn't affected actually).

Regarding 1), I'm pretty much OK with any time between 6:00 and 24:00 CEST, but I'd like to schedule an outage window. How much time do you think the OpenStack repair could take?

Btw, I can see that someone stepped in and cleaned something up (thanks!). So this is not that urgent anymore; the ppc64le issue #7868 should have much higher priority.

I hope that whoever stepped in comes into the ticket to say so, and how they did it, so we don't have two different methods going on or two sysadmins doing different things.

I believed that was you, but then... well, not to blame anyone, could OpenStack have become saner by itself, without assistance?

How I noticed: I downloaded the HTML page "Dashboard -> Instances" and ran

$ elinks -dump /tmp/Instances\ -\ OpenStack\ Dashboard.html | grep Copr | grep -e Deleting -e Error | cut -d] -f4 | cut -d' ' -f1 | wc -l

and today there are only 53 VMs to be deleted, while yesterday it was 65.
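The removed/new lists below come from comparing two such dumps, roughly like this (the file names are just placeholders):

$ elinks -dump yesterday.html | grep Copr | grep -e Deleting -e Error | cut -d] -f4 | cut -d' ' -f1 | sort > yesterday.txt
$ elinks -dump today.html | grep Copr | grep -e Deleting -e Error | cut -d] -f4 | cut -d' ' -f1 | sort > today.txt
$ diff -u yesterday.txt today.txt   # '-' = gone since yesterday, '+' = newly Deleting/Errored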

List of removed VMs since yesterday:

-Copr_builder_183848760
-Copr_builder_271539084
-Copr_builder_341800274
-Copr_builder_343943925
-Copr_builder_346327526
-Copr_builder_426718452
-Copr_builder_498451168
-Copr_builder_550009750
-Copr_builder_622486869
-Copr_builder_757487059
-Copr_builder_829293924
-Copr_builder_909399945
-Copr_builder_93769745
-Copr_builder_957549044
-Copr_builder_957841757

list of newly "Deleting/Errored" VMs:

+Copr_builder_182819505
+Copr_builder_457282882
+Copr_builder_566075038

No idea how that happened, as everything I had tried said they were still errored. I don't know if there is anything we can do with this version of OpenStack since it is now over 4 years out the door. I will be on IRC tomorrow morning to see if we can work out a time later in the day to do what can be done.

Looks like there are 23 now.

Shall we just close this and live with those for now until we move to the new openshift cluster? I really prefer to avoid spending any more time than needed on the old openstack cluster.

I'm going to close this out. Please let us know if it becomes a blocker again.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

The problem appeared again; we are receiving quota tracebacks on the backend:

fatal: [127.0.0.1 -> localhost]: FAILED! => {"changed": false, "msg": "Error in creating instance: Quota exceeded for cores: Requested 2, but already used 219 of 220 cores "}
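To see what the cloud thinks is eating those cores, and how many of them are stuck builders, something like this should show it (nova limits --tenant needs admin rights, so this is only a sketch):

$ nova limits --tenant <copr-tenant-id> | grep -i core   # used vs. maximum cores
$ nova list --status ERROR | grep -c Copr_builder        # stuck builders still being counted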

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

4 years ago

Btw, copr is only able to allocate about 50% of the VMs now (compared with the state when there is no problem).

What about a somewhat radical change: create a new tenant for the production builders? Note that the performance of x86_64/i686 builds has really suffered for the last few days :-( (often more than our queue).

I am not sure that we can delete the old block to give you a new block. We can try, but it could also mean copr is off the air completely for an unknown amount of time.

I am raising this with my management as this needs more focus than any of us on staff can currently give it with other Engineering priorities.

Sorry :-( this time it was our fault, mine and @frostyx's (probably), because some of the instances were in Running state. It looks like we forgot to remove the old VMs handled by the previous copr backend on F28 before we upgraded. Those running instances wasted the quota... I'm slowly removing them now, and I believe we'll be fine soon. If not, I'll reopen. Sorry again for the rush this time.

Metadata Update from @praiskup:
- Issue close_status updated to: Invalid
- Issue status updated to: Closed (was: Open)

4 years ago

Let me know, as I was still seeing problems with systems I could not delete, stop, or remove when I was looking at it from the admin side earlier.

Yes, there are dozens of VMs that I cannot delete. Those VMs are in a deleting or error state. But at least there's enough quota now to spawn more resources.
