#259 delay between VM spawning and network bringup on clients is causing task execution to hang
Closed: Fixed None Opened 8 years ago by tflink.

On my demo system, I've hit a problem where testcloud is returning an IP before the VM is actually ready to receive network data.

When libtaskotron attempts to connect to the remote machine, execution hangs because the connection to port 22 is rejected.

I've been able to work around this by adding a short delay between spawning the VM and the first SSH connection attempt, but a better solution would certainly be welcome.


This ticket had some Differential requests assigned:
D515

After talking to some people at Flock (you might have been there too, Tim, I'm not sure about that), it seems like there is virtually no better solution than waiting for a while before connecting.

I'd rather not implement a push-notification mechanism (which IMHO the cloud images have, but hardcoded to some specific IP), because that would IMHO mean weird changes on the other side: spawning a network service from inside the libtaskotron (or more probably testcloud) code, just so it can then be "pinged" back by the virtual machine, seems excessive if a simple wait can solve the issue.

Or we might implement some active-waiting mechanism, e.g. making sure that the firewall configuration is set in such a way that ping is disabled by default and only allowed by a service that starts once SSHD has started successfully...

But honestly (and I know I'm repeating myself), the sleep(10) seems to have the best complexity/gain ratio to me.

I'd rather have for i in range(1,11): if connect(): break; else sleep(1); :) , but yes, in general I think it's better to use the simple approach here rather than implementing our own VM->host notification mechanism.

After a short discussion with @jskladan, we came up with something like this:

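import socket
import time

timeout = 60  # seconds to wait for the port to open
start_time = time.time()
while True:
    # use a fresh socket per attempt; reusing one after a failed connect() is not portable
    s = socket.socket()
    try:
        s.connect((self.ipaddr, port))
    except socket.error:
        # port is not accepting connections yet, retry
        s.close()
    else:
        s.close()
        break
    if (start_time + timeout) < time.time():
        raise ...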

It is probably the simplest solution besides sleep(). Any thoughts?

I'd put a short delay (even 100ms) in that while loop, just to make sure it's not going to suck down too much CPU, but other than that it does sound like the best of the options we have.
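
For reference, the polling approach discussed above, with the short per-iteration delay added, could look roughly like the sketch below. This is only an illustration and not the code that was merged: the function name `wait_for_port` and the `delay` and `connect_timeout` parameters are made up for this example, and raising `TimeoutError` is an assumption, since the snippet above leaves the exception type open.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, delay=0.1, connect_timeout=1.0):
    """Block until a TCP connection to (host, port) succeeds.

    Hypothetical helper: raises TimeoutError if the port is still not
    accepting connections after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection() opens a fresh socket for every attempt,
            # and the with-block closes it again on success.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return
        except OSError:
            # Connection refused or timed out: the service is not up yet.
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    "port %d on %s not reachable after %s seconds"
                    % (port, host, timeout)
                )
            # Short pause so the loop does not spin and burn CPU.
            time.sleep(delay)
```

In the context of this ticket, it would be called as something like `wait_for_port(self.ipaddr, 22)` right before the first SSH connection attempt. Note that a successful TCP connect only shows that something is listening on the port, so the SSH client should probably still tolerate a failed first login.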
