#38 testcloud doesn't handle larger images well
Closed: Fixed None Opened 8 years ago by tflink.

The official Fedora cloud images are built with a 3GB disk. testcloud handles these images without issue, but when I use custom images with a larger disk (10G in this particular case), I hit a few problems.

The first time I try to create an instance with the larger image, I get:

$ testcloud instance create disktest -u file:///home/tflink/taskotron-cloud/f22/20151123-taskotron-f22-10G.qcow2
DEBUG:create instance
DEBUG:Local downloads will be stored in /var/lib/testcloud/cache.
DEBUG:successfully changed SELinux context for image /var/lib/testcloud/cache/20151123-taskotron-f22-10G.qcow2
DEBUG:Creating instance directories
DEBUG:Generated user-data for instance disktest
DEBUG:Generated meta-data for instance disktest
DEBUG:creating seed image /var/lib/testcloud/instances/disktest/disktest-seed.img
libvirt: XML-RPC error : Cannot write data: Transport endpoint is not connected
libguestfs: error: could not connect to libvirt (URI = qemu:///session): Cannot write data: Transport endpoint is not connected [code=38 domain=7]
ERROR:Seed image generation failed. Exiting
Traceback (most recent call last):
  File "/usr/bin/testcloud", line 9, in <module>
    load_entry_point('testcloud==0.1.5', 'console_scripts', 'testcloud')()
  File "/usr/lib/python2.7/site-packages/testcloud/cli.py", line 277, in main
    args.func(args)
  File "/usr/lib/python2.7/site-packages/testcloud/cli.py", line 84, in _create_instance
    tc_instance.prepare()
  File "/usr/lib/python2.7/site-packages/testcloud/instance.py", line 169, in prepare
    self._generate_seed_image()
  File "/usr/lib/python2.7/site-packages/testcloud/instance.py", line 240, in _generate_seed_image
    raise TestcloudInstanceError("Failure during seed image generation")
testcloud.exceptions.TestcloudInstanceError: Failure during seed image generation

If I clean up the instance dir and try again, I've been seeing:

$ testcloud instance create disktest -u file:///home/tflink/taskotron-cloud/f22/20151123-taskotron-f22-10G.qcow2
DEBUG:create instance
DEBUG:Local downloads will be stored in /var/lib/testcloud/cache.
DEBUG:successfully changed SELinux context for image /var/lib/testcloud/cache/20151123-taskotron-f22-10G.qcow2
DEBUG:Creating instance directories
DEBUG:Generated user-data for instance disktest
DEBUG:Generated meta-data for instance disktest
DEBUG:creating seed image /var/lib/testcloud/instances/disktest/disktest-seed.img
INFO:Seed image generated successfully
Formatting '/var/lib/testcloud/instances/disktest/disktest-local.qcow2', fmt=qcow2 size=10737418240 backing_file='/var/lib/testcloud/cache/20151123-taskotron-f22-10G.qcow2' encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
INFO:Successfully booted your local cloud image!
Traceback (most recent call last):
  File "/usr/bin/testcloud", line 9, in <module>
    load_entry_point('testcloud==0.1.5', 'console_scripts', 'testcloud')()
  File "/usr/lib/python2.7/site-packages/testcloud/cli.py", line 277, in main
    args.func(args)
  File "/usr/lib/python2.7/site-packages/testcloud/cli.py", line 93, in _create_instance
    tc_instance.create_ip_file(vm_ip)
  File "/usr/lib/python2.7/site-packages/testcloud/instance.py", line 294, in create_ip_file
    ip_file.write(ip)
TypeError: expected a character buffer object

Unfortunately, this error is intermittent: I don't hit it every time. That said, with a 30 second delay added to instance.vm_spawn() before it returns, I haven't seen the second issue at all.
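For reference, "expected a character buffer object" is the TypeError Python 2 raises when file.write() is handed something other than a string. That would be consistent with the IP lookup returning None because the guest has no address yet, though this is an inference from the traceback and not confirmed against the testcloud source. A minimal sketch of a guard, written for Python 3, where write_ip_file is a hypothetical helper and not testcloud's create_ip_file:

```python
def write_ip_file(path, ip):
    """Write the instance IP to path, refusing to write a missing address.

    Hypothetical helper for illustration; not testcloud's actual API.
    """
    if not isinstance(ip, str) or not ip:
        # Fail with a clear message instead of a TypeError from file.write().
        raise ValueError("no IP address available yet for this instance")
    with open(path, "w") as ip_file:
        ip_file.write(ip)
```

A guard like this only makes the failure easier to read; it does not remove the underlying race between boot and address assignment.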

I suspect both of these issues are related to the large image size - specifically how the operations surrounding that larger image take longer than they would for the official images.

Triage the issue and propose a fix. If the two symptoms listed above do not trace back to the same root cause, file new issues.


After some poking, I have a potential solution.

We can't check for domain state because that turns to 'running' as soon as the domain is created. However, the domain appears to have no interface until after cloud-init has run, so we can poll and return once we can find an interface in the domain.
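The polling idea could look roughly like the sketch below (Python 3). The wait_for name, its parameters and the default timeout are assumptions made for illustration; how the interface lookup is actually done against libvirt is left to the caller-supplied predicate.

```python
import time


def wait_for(predicate, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Call predicate() until it returns a truthy value or timeout expires.

    Returns the truthy value, or raises TimeoutError. All names here are
    hypothetical; this is not testcloud's implementation.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within %s seconds" % timeout
            )
        # Back off before asking again so we don't spin on the hypervisor.
        sleep(interval)
```

A caller would pass a function that returns the domain's interface (or None while cloud-init is still running), for example wait_for(lambda: find_interface(domain), timeout=120), where find_interface is again a placeholder.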

The remaining question is whether we want to delay "boot" completion in testcloud until after the interface exists, or whether we only want to do that for instance creation. I'm leaning towards adding the delay to instance.start() and introducing a flag or config value that skips the polling (mostly for debugging if there are problems).
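One possible shape for the skip option, as a sketch only: start_instance, boot, wait_for_interface and skip_polling are all hypothetical names, and the real instance.start() may be structured differently.

```python
def start_instance(boot, wait_for_interface, skip_polling=False):
    """Boot the domain, then optionally block until an interface appears.

    boot and wait_for_interface are caller-supplied callables (hypothetical).
    With skip_polling=True the function returns right after boot, which is
    the debugging escape hatch described above.
    """
    boot()
    if skip_polling:
        return None
    return wait_for_interface()
```

Keeping the wait behind a single flag means the default path is the safe one, while someone debugging a guest that never brings up an interface can still get control back immediately.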

Thoughts?

> I'm leaning towards adding the delay to instance.start() but introduce a flag or config value which would skip the polling (mostly for debug purposes if there are problems).

That sounds good to me.
