#271 libtaskotron hangs (indefinitely?) if ssh communication with minion goes down
Closed: Fixed. Opened 8 years ago by kparal.

If the task is run in a disposable VM and the VM dies unexpectedly (e.g. due to a libvirt error; feel free to simulate this by simply killing the machine from virt-manager), our runner starts consuming 100% CPU and the process never exits; you have to press Ctrl+C.

It might help to wait a few minutes for some timeout to occur. I haven't tested that, please do. Also figure out where the CPU loop happens (I suspect paramiko) and find a way to mitigate it (configure better ssh timeouts? introduce sleep intervals if the loop is in our code?).
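On the timeout angle, here is a minimal sketch of what tighter ssh settings could look like, assuming the connection goes through paramiko's SSHClient (the hostname, key path and interval values below are made up for illustration, not taken from libtaskotron):

    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # timeout= bounds the TCP connect, banner_timeout= the ssh banner exchange
    client.connect('minion.example.com', username='root',
                   key_filename='/path/to/key', timeout=30, banner_timeout=30)
    # Send keepalive packets so a silently dead VM takes the transport
    # down instead of leaving us blocked on it forever.
    client.get_transport().set_keepalive(15)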


This ticket had some Differential requests assigned to it:
D587
D604

So, this does not occur only when the machine dies unexpectedly. The CPU sits at 100% the whole time we wait for any output over ssh! This definitely looks like a busy-waiting loop, either in our code or in paramiko.

Easy to test:
In runner.py:RemoteRunner, disable prepare_task() (put a return as its first line) and change run() to:

    self.exitcode = self.ssh.cmd('sleep 60')

Now watch the CPU usage of the python2 runtask process go berserk.
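As a possible mitigation for the busy loop, a sketch of polling a paramiko channel with a short sleep between checks instead of spinning; the helper name, the exec_command call and the 0.1 s interval are illustrative, not the actual remote_exec code:

    import time

    def run_and_wait(client, command, poll_interval=0.1):
        # Run `command` over an existing paramiko SSHClient and wait for
        # it to finish without pegging the CPU.
        stdin, stdout, stderr = client.exec_command(command)
        channel = stdout.channel
        while not channel.exit_status_ready():
            # Drain pending output so the remote process is not blocked
            # on a full pipe.
            while channel.recv_ready():
                print(channel.recv(4096))
            # Sleep briefly between polls instead of spinning at 100% CPU.
            time.sleep(poll_interval)
        return channel.recv_exit_status()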

This seems to be a problem in remote_exec. Claiming this ticket, since I'm already in that code for #597.

There were two problems here: the busy loop, and the libtaskotron process hanging if ssh communication goes down. The first one is resolved, the second is not. Let's reopen this to track the second one (unless you want me to file a new ticket, which is no problem).

I filed a new ticket to track the paramiko issue: #665. Closing this one.
