#233 factory-build VMs keep hanging when something breaks
Closed: Fixed 6 years ago Opened 6 years ago by kparal.

If something goes wrong during a VM build in imagefactory (network timing out, VM not rebooting, etc.), the VM hangs forever. Over time, the host runs out of RAM.

We either need to configure imagefactory to kill hanging jobs (if that is possible and not buggy), or we need to create e.g. a cron script that kills all factory-build VMs once a day.


Metadata Update from @kparal:
- Issue tagged with: infra

6 years ago

Today I killed 16 hanging imagefactory VMs on dev and 15 on production. It's a problem.

Metadata Update from @kparal:
- Issue priority set to: High

6 years ago

Metadata Update from @frantisekz:
- Issue assigned to frantisekz

6 years ago

The most "stupid" way to fix this (by killing all ImageFactory VMs once per day):

- name: Create cronjob to kill hanging VMs from ImageFactory
  cron:
    name: "Kill ImageFactory VMs"
    special_time: daily
    job: "pgrep -f '/usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=factory-build' | xargs kill"

I'll think about this a little more to see whether I can find a better solution. Using the virsh command doesn't seem very reliable, as it got stuck even when running virsh list... 🙄

I'd like to see a script that finds the relevant processes, examines how long they've been running, and kills only those that have been running longer than a specified interval (e.g. 3 hours). Sending SIGTERM/SIGKILL to the processes is fine, I think. Using virsh destroy might be better (that's how I did it in the past, I believe). Why does the virsh command get stuck? Try to figure it out; it should work, and I don't remember having any issue with it.

Also, perhaps sending some command to imagefactory could be another way to kill the processes? Or maybe it's possible to configure a timeout in imagefactory that would do all the work for us? That would be the best solution. Please check whether such an option exists.
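For reference, a minimal sketch of the virsh destroy approach mentioned above, with each virsh call wrapped in coreutils timeout so that a stuck libvirt cannot hang the cleanup job itself. It assumes guest names start with "factory-build" (suggested by the "-name guest=factory-build" match used in this ticket); the function name and the 30-second limit are hypothetical.

```shell
#!/bin/bash
# Sketch only: destroy running libvirt guests whose name starts with a prefix.
# ASSUMPTION: guest names begin with "factory-build".
VIRSH_TIMEOUT=30  # seconds; hypothetical value, tune as needed

destroy_factory_domains() {
    prefix=${1:-factory-build}
    # timeout(1) aborts virsh if libvirt stops responding.
    timeout "$VIRSH_TIMEOUT" virsh list --name </dev/null |
        grep "^$prefix" |
        while read -r domain; do
            echo "Destroying $domain"
            # Redirect stdin so virsh cannot consume the list being read.
            timeout "$VIRSH_TIMEOUT" virsh destroy "$domain" </dev/null
        done
}
```

Calling destroy_factory_domains without arguments destroys every matching guest regardless of age, so this mirrors the "once per day" cron job rather than the age-based check requested above.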

FWIW imagefactory auto-timeouts after (IIRC) an hour (maybe a bit more, it's been some time...), so the "hanging jobs" are most probably an issue with some other bit in the stack e.g. virsh command being stuck, like @frantisekz saw for himself.

This one should do the job :)

- name: Create cronjob to kill hanging VMs from ImageFactory
  cron:
    name: "Kill ImageFactory VMs"
    special_time: daily
    job: "if (( $(ps -o etimes= -p $(pgrep -f '/usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=factory-build')) >= 10800)); then pgrep -f '/usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=factory-build' | xargs kill; fi"

Yeah, I've just checked the man page for ImageFactory, and the default timeout is 3600 seconds. So the issue is somewhere else.

Proper version, ready for feedback :)

#!/bin/bash

while (( $(pgrep -o -f '/usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=factory-build') )); do
    if (( $(ps -o etimes= -p $(pgrep -o -f '/usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=factory-build')) >= 10800)); then
        pgrep -o -f '/usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=factory-build' | xargs kill
    else
        exit
    fi
done

But we don't run the build through the command line; we use the API. Have you checked the behavior when triggering through the API? Or @jskladan, can you look at that?

I pushed an infra-ansible branch to the repo, so that you can create a PR against it. Please make the script more readable. Let the script take an argument (maximum factory run time in minutes). Use a variable instead of duplicating the same very long argument three times.

Also, I think the script logic is invalid or perhaps I don't understand it. Let's make it more obvious. Use for instead of while, and simply iterate through all the running VMs and either kill them or skip them. Thanks.
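A rough sketch of what the requested rewrite could look like, not the script that was eventually merged: the match pattern lives in one variable, the age limit is passed in minutes, and a for loop visits every matching process and either kills or skips it. The function names are hypothetical; the pattern and the use of ps -o etimes= come from the commands earlier in this ticket.

```shell
#!/bin/bash
# Sketch only: kill factory-build QEMU processes older than a given age.
# PATTERN is copied from the commands posted earlier in this ticket.
PATTERN='/usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=factory-build'

# Succeeds when an elapsed time in seconds reaches a limit given in minutes.
is_too_old() {
    [ "$1" -ge $(( $2 * 60 )) ]
}

# Visit every matching process; kill the old ones, skip the rest.
kill_old_factory_vms() {
    max_minutes=$1
    for pid in $(pgrep -f "$PATTERN"); do
        # etimes = seconds since the process started (procps-ng ps).
        elapsed=$(ps -o etimes= -p "$pid" | tr -d ' ')
        # The process may have exited between pgrep and ps.
        [ -n "$elapsed" ] || continue
        if is_too_old "$elapsed" "$max_minutes"; then
            echo "Killing PID $pid (running for ${elapsed}s)"
            kill "$pid"
        fi
    done
}
```

To use it as a standalone script, a final line such as kill_old_factory_vms "${1:-180}" would apply the 3-hour limit suggested earlier unless another value is given on the command line. The sketch sends SIGTERM only; whether a follow-up SIGKILL is needed is left open.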

@kparal - it is just the same, triggered via api or from cmdline.

OK. If you're sure our configuration and triggering is ok, then the best way forward is to kill the hanging processes manually.

Metadata Update from @kparal:
- Issue close_status updated to: Fixed

6 years ago
