If the task is run in a disposable VM and the VM dies unexpectedly (e.g. because of a libvirt error; feel free to simulate this by killing the machine from virt-manager), our runner starts consuming 100% CPU and the process does not exit. You need to press Ctrl+C to stop it.
Waiting a few minutes for some timeout to occur might help. I haven't tested that, please do. Also figure out where the CPU loop occurs (I suspect paramiko) and find a way to mitigate it (configure better ssh timeouts? introduce sleep intervals if it's in our code?).
This ticket had some Differential requests assigned: D587, D604
So, this does not occur only if the machine dies unexpectedly. The CPU utilization is at 100% the whole time we wait for some output over ssh! So this definitely seems to be a busy-waiting loop, either in our code or in paramiko.
Easy to test: modify runner.py:RemoteRunner so that prepare_task() is disabled (put a return as its first line) and change run() to:
self.exitcode = self.ssh.cmd('sleep 60')
Now watch the CPU go berserk with the python2 runtask process.
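To illustrate the kind of busy-waiting loop suspected here, and the "sleep intervals" mitigation, here is a minimal sketch. It is not the actual remote_exec or paramiko code: FakeChannel, wait_busy and wait_with_sleep are hypothetical names, and only the exit_status_ready() method is modelled on paramiko's Channel API.

```python
import time


class FakeChannel(object):
    """Hypothetical stand-in for an SSH channel (not paramiko itself).

    It reports the remote command as finished once `duration` seconds pass.
    """

    def __init__(self, duration):
        self._done_at = time.time() + duration
        self.polls = 0

    def exit_status_ready(self):
        # Count how often the caller asks, to make the busy loop visible.
        self.polls += 1
        return time.time() >= self._done_at


def wait_busy(channel):
    # Spins as fast as the interpreter allows: one core stays at 100%.
    while not channel.exit_status_ready():
        pass


def wait_with_sleep(channel, interval=0.05):
    # Gives the CPU back between checks; the price is that completion
    # is noticed up to `interval` seconds late.
    while not channel.exit_status_ready():
        time.sleep(interval)


busy = FakeChannel(0.2)
wait_busy(busy)

polite = FakeChannel(0.2)
wait_with_sleep(polite)

print("busy polls: %d, sleeping polls: %d" % (busy.polls, polite.polls))
```

The busy variant polls many thousands of times in 0.2 seconds, while the sleeping one polls only a handful of times. The interval here is an arbitrary choice; a longer one costs less CPU but delays noticing that the command finished.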
This seems to be a problem in remote_exec. Claiming since I'm already in the code for #597.
There were two problems here: a busy loop, and the libtaskotron process hanging if ssh communication goes down. The first one is resolved, the second is not. Let's reopen this to track the second one (unless you'd rather I report a new ticket, no problem).
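For the second problem, one possible direction is a read timeout, so that a silent peer produces an error instead of an endless wait. The sketch below uses a plain socket pair from the standard library rather than paramiko; RemoteUnresponsive and recv_or_fail are hypothetical names, and whether a similar timeout on a paramiko channel would be enough to detect a dead VM is untested.

```python
import socket


class RemoteUnresponsive(Exception):
    """Hypothetical error raised when the peer stays silent or disconnects."""


def recv_or_fail(sock, nbytes=4096, timeout=0.2):
    # Without a timeout, recv() on a dead-but-not-closed connection
    # can block forever, which is the hang described in this ticket.
    sock.settimeout(timeout)
    try:
        data = sock.recv(nbytes)
    except socket.timeout:
        raise RemoteUnresponsive("no data within %.1f s" % timeout)
    if not data:
        raise RemoteUnresponsive("connection closed by peer")
    return data


# Simulate a remote end that answers once and then goes silent.
local, remote = socket.socketpair()
remote.sendall(b"hello")

first = recv_or_fail(local)

try:
    recv_or_fail(local)
    timed_out = False
except RemoteUnresponsive:
    timed_out = True

local.close()
remote.close()
print("first read: %r, second read timed out: %s" % (first, timed_out))
```

A read timeout alone cannot tell a dead machine from a task that is legitimately quiet (such as sleep 60), so the timeout would have to be longer than the longest expected silence, or be combined with some keepalive mechanism.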
Linking issues from paramiko's github which deal with "dead remote detection":
https://github.com/paramiko/paramiko/issues/319 https://github.com/paramiko/paramiko/issues/503 https://github.com/paramiko/paramiko/pull/470 https://github.com/paramiko/paramiko/pull/197
Asked about the issue status in https://github.com/paramiko/paramiko/issues/503
I filed a new ticket for tracking the paramiko issue - #665. Closing this one.