From 91c793e70eedc62ad0879df05941dbd0fc68daec Mon Sep 17 00:00:00 2001
From: Pavel Raiskup
Date: Jul 01 2019 06:21:48 +0000
Subject: sops: copr: document possible aarch64 and action problems

---

diff --git a/docs/sysadmin-guide/sops/copr.rst b/docs/sysadmin-guide/sops/copr.rst
index e6b9637..b39502d 100644
--- a/docs/sysadmin-guide/sops/copr.rst
+++ b/docs/sysadmin-guide/sops/copr.rst
@@ -48,15 +48,21 @@ https://docs.pagure.org/copr.copr/maintenance_documentation.html
 TROUBLESHOOTING
 ================
 
-Almost every problem with Copr is due problem in OpenStack, in such case,
-try to restart copr-backend service::
+Almost every problem with Copr is due to a problem with spawning builder VMs, or
+with processing the action queue on the backend.
+
+
+VM spawning/termination problems
+--------------------------------
+
+Try to restart the copr-backend service::
 
     $ ssh root@copr-be.cloud.fedoraproject.org
     $ systemctl restart copr-backend
 
 If this doesn't solve the problem, try to follow logs for some clues::
 
-    $ tail -f /var/log/copr-backend/{vmm,spawner}.log
+    $ tail -f /var/log/copr-backend/{vmm,spawner,terminator}.log
 
 As the last resort option, you can terminate all builders and let copr-backend
 to throw all information about them. This action will
@@ -70,7 +76,7 @@ obviously interrupt all running builds and reschedule them::
     $ systemctl start copr-backend
 
 
-Sometimes OpenStack can not handle spawning too much VM at the same time.
+Sometimes OpenStack cannot handle spawning too many VMs at the same time.
 So it is safer to edit on copr-be.cloud.fedoraproject.org::
 
     vi /etc/copr/copr-be.conf
@@ -83,6 +89,45 @@
 to "6". Start copr-backend service and some time later increase it to original
 value. Copr automaticaly detect change in script and increase number of workers.
 
+The set of aarch64 VMs isn't maintained by OpenStack, but by Copr's backend
+itself.  Steps to diagnose::
+
+    $ ssh root@copr-be.cloud.fedoraproject.org
+    [root@copr-be ~][PROD]# systemctl status resalloc
+    ● resalloc.service - Resource allocator server
+    ...
+
+    [root@copr-be ~][PROD]# less /var/log/resallocserver/main.log
+
+    [root@copr-be ~][PROD]# su - resalloc
+
+    [resalloc@copr-be ~][PROD]$ resalloc-maint resource-list
+    13569 - aarch64_01_prod_00013569_20190613_151319 pool=aarch64_01_prod tags=aarch64 status=UP
+    13597 - aarch64_01_prod_00013597_20190614_083418 pool=aarch64_01_prod tags=aarch64 status=UP
+    13594 - aarch64_02_prod_00013594_20190614_082303 pool=aarch64_02_prod tags=aarch64 status=STARTING
+    ...
+
+    [resalloc@copr-be ~][PROD]$ resalloc-maint ticket-list
+    879 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013569_20190613_151319
+    918 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013608_20190614_135536
+    904 - state=OPEN tags=aarch64 resource=aarch64_02_prod_00013594_20190614_082303
+    919 - state=OPEN tags=aarch64
+    ...
+
+Be careful when some resource is in the ``STARTING`` state.  If that's so,
+check ``/usr/bin/tail -F -n +0 /var/log/resallocserver/hooks/013594_alloc``.
+Copr takes tickets from the resalloc server; if the resources fail to spawn,
+the tickets are not assigned an appropriately tagged resource for a
+long time.
+
+If that happens (it shouldn't) and there's some inconsistency between resalloc's
+database and the actual status on the aarch64 hypervisors (``ssh
+copr@virthost-aarch64-os0{1,2}.fedorainfracloud.org``):
+
+- use ``virsh`` there to inspect their statuses
+- use ``resalloc-maint resource-delete``, ``resalloc ticket-close`` or ``psql``
+  commands to fix up resalloc's DB.
+
 Backend Troubleshoting
 ----------------------
 
@@ -117,6 +162,21 @@ to disable it for some projects::
     $ rm -rf ./appdata
 
 
+Backend action queue issues
+---------------------------
+
+First check the `number of not-yet-processed actions`_.  If that number isn't
+equal to zero and is not decreasing relatively fast (say, a single action takes
+longer than 30s), there might be some problem.  Logs for the action dispatcher
+can be found in::
+
+    /var/log/copr-backend/action_dispatcher.log
+
+Check that there is no stuck process under the ``Action dispatch`` parent
+process in the ``pstree -a copr`` output.
+
+
+
 Deploy information
 ==================
 
@@ -302,3 +362,6 @@ Keygen:
 +======+==========+=========+=================================+
 | 22   | TCP      | ssh     | Remote control                  |
 +------+----------+---------+---------------------------------+
+
+
+.. _`number of not-yet-processed actions`: https://copr.fedorainfracloud.org/backend/pending-action-count/
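
Not part of the patch: a minimal Python sketch of how the ``resalloc-maint resource-list`` output quoted in the aarch64 hunk could be filtered for resources that are not ``UP``. The line format is assumed from the three sample lines above only. The helper names (``parse_resource_list``, ``not_up``, ``alloc_hook_log``) are hypothetical, and the zero-padded hook log name is inferred from the single ``013594_alloc`` example, so treat it as a guess.

```python
import re

# Assumed line shape, copied from the sample output in the patch:
#   13594 - aarch64_02_prod_... pool=aarch64_02_prod tags=aarch64 status=STARTING
RESOURCE_LINE = re.compile(r"^\s*(?P<id>\d+)\s+-\s+(?P<name>\S+)\s+(?P<fields>.*)$")


def parse_resource_list(text):
    """Turn `resalloc-maint resource-list` output into a list of dicts."""
    resources = []
    for line in text.splitlines():
        match = RESOURCE_LINE.match(line)
        if not match:
            continue  # skip blank lines, "..." and anything unexpected
        info = {"id": int(match.group("id")), "name": match.group("name")}
        for field in match.group("fields").split():
            key, sep, value = field.partition("=")
            if sep:  # keep only key=value tokens
                info[key] = value
        resources.append(info)
    return resources


def not_up(resources):
    """Return the resources whose status is anything other than UP."""
    return [res for res in resources if res.get("status") != "UP"]


def alloc_hook_log(resource_id):
    """Guess the allocation hook log path (assumption: 6-digit zero padding)."""
    return "/var/log/resallocserver/hooks/{:06d}_alloc".format(resource_id)


SAMPLE = """\
13569 - aarch64_01_prod_00013569_20190613_151319 pool=aarch64_01_prod tags=aarch64 status=UP
13597 - aarch64_01_prod_00013597_20190614_083418 pool=aarch64_01_prod tags=aarch64 status=UP
13594 - aarch64_02_prod_00013594_20190614_082303 pool=aarch64_02_prod tags=aarch64 status=STARTING
...
"""

if __name__ == "__main__":
    for res in not_up(parse_resource_list(SAMPLE)):
        print(res["id"], res.get("status"), alloc_hook_log(res["id"]))
```

On the copr-be host the same functions could be fed the real command output (for example by piping it into the script) instead of ``SAMPLE``. The sketch only reads text and never touches resalloc's database.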