#3 testcloud race condition when populating backingstores
Closed: Fixed None Opened 7 years ago by kparal.

If there's a new taskotron base image which testcloud doesn't know yet (not yet copied to backingstores), there's a race condition when multiple jobs are executed simultaneously. In our environment, this happens when the first event received (after a new base image is built) is a koji-built-completed fedmsg. In that case rpmlint and rpmgrill are scheduled; one of them completes and the other crashes with:

[libtaskotron:minion.py:251] 2016-08-12 01:13:35 ERROR   Was expecting to find instance taskotron-fc81d648-6029-11e6-a0e2-525400d7d6a4 but it does not already exist
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 240, in execute
    task_vm.teardown()
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 174, in teardown
    tc_instance = self._check_existing_instance(should_exist=True)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 118, in _check_existing_instance
    " already exist".format(self.instancename))
TaskotronRemoteError: Was expecting to find instance taskotron-fc81d648-6029-11e6-a0e2-525400d7d6a4 but it does not already exist
[libtaskotron:logger.py:88] 2016-08-12 01:13:35 CRITICAL Traceback (most recent call last):
  File "/usr/bin/runtask", line 9, in <module>
    load_entry_point('libtaskotron==0.4.15', 'console_scripts', 'runtask')()
  File "/usr/lib/python2.7/site-packages/libtaskotron/main.py", line 163, in main
    overlord.start()
  File "/usr/lib/python2.7/site-packages/libtaskotron/overlord.py", line 92, in start
    runner.execute()
  File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 201, in execute
    task_vm.prepare(**env)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 136, in prepare
    tc_image = self._prepare_image(distro, release, flavor, arch)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 87, in _prepare_image
    tc_image.prepare()
  File "/usr/lib/python2.7/site-packages/testcloud/image.py", line 198, in prepare
    self._handle_file_url(self.remote_path, self.local_path)
  File "/usr/lib/python2.7/site-packages/testcloud/image.py", line 162, in _handle_file_url
    shutil.copy(source_path, dest_path)
  File "/usr/lib64/python2.7/shutil.py", line 119, in copy
    copyfile(src, dst)
  File "/usr/lib64/python2.7/shutil.py", line 83, in copyfile
    with open(dst, 'wb') as fdst:
IOError: [Errno 13] Permission denied: '/var/lib/testcloud/backingstores/160812_0000-fedora-23-taskotron_cloud-x86_64.qcow2'

The two tasks:
http://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/208534
http://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/208533

What probably happens:
1. The first task triggered asks testcloud to use image X; testcloud doesn't know it and starts copying it to backingstores.
2. Before that copy finishes, the second task asks testcloud to use image X. testcloud already sees it in backingstores (even though it is incomplete) and runs a VM. The VM fails because the image is not finished (incompletely copied, wrong permissions).
3. The copy operation finishes, and the first task starts and runs OK.
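
One generic way to avoid exposing a half-copied image, sketched here for illustration only: copy to a temporary file in the same directory and rename it into place once the copy is complete. This is not necessarily what testcloud does or what the eventual fix did. The function name copy_image_atomically and the ".part" suffix are hypothetical, and the sketch assumes a POSIX filesystem, where rename within one filesystem is atomic.

```python
import os
import shutil
import tempfile


def copy_image_atomically(source_path, dest_path):
    """Copy source_path to dest_path so that dest_path never exists half-written.

    Hypothetical helper for illustration; not part of testcloud.
    """
    if os.path.exists(dest_path):
        # Someone already finished the copy; nothing to do.
        return dest_path

    # Create the temporary file next to the destination so the final
    # rename stays on one filesystem (required for atomicity on POSIX).
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path), suffix=".part")
    os.close(fd)
    try:
        shutil.copyfile(source_path, tmp_path)
        # mkstemp creates the file with mode 0600; relax it before publishing.
        os.chmod(tmp_path, 0o644)
        # The image becomes visible under its final name in one step.
        os.rename(tmp_path, dest_path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

With this pattern, two jobs that race may both copy the image, but whichever one checks for the final name sees either no file or a complete one.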

What we can do:
Use the same dir for taskotron images dir and testcloud backingstore (if that works).
Make testcloud not copy images for file:// URLs; use them directly.
Use just a symlink from backingstores for file:// URLs.
Something else?
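
To make the symlink option concrete, here is a rough sketch under stated assumptions. The function link_local_image and its parameters are hypothetical and not testcloud API; it is written for Python 3, whereas the traceback above comes from Python 2.7. It relies on symlink creation either succeeding completely or failing, so a concurrent job never observes a partial file.

```python
import os
from urllib.parse import unquote, urlparse


def link_local_image(image_url, backingstore_dir):
    """Expose a file:// image in backingstore_dir via a symlink instead of a copy.

    Hypothetical helper for illustration; not part of testcloud.
    """
    parsed = urlparse(image_url)
    if parsed.scheme != "file":
        raise ValueError("only file:// URLs can be linked: %s" % image_url)

    source_path = unquote(parsed.path)
    dest_path = os.path.join(backingstore_dir, os.path.basename(source_path))
    try:
        os.symlink(source_path, dest_path)
    except FileExistsError:
        # Another job created the link (or a copy) first; reuse it.
        pass
    return dest_path
```

The trade-off is that the backing store then depends on the source image staying in place for as long as any VM uses it.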


This ticket had some Differential requests assigned:
D972
D971
D970

Everything is committed and the patch seems to be working well on taskotron-dev. We'll wait a few more days and then we need to do a new testcloud+libtaskotron build and push it to updates.

The fix seems to be working on dev, and @mkrizek built a new testcloud and libtaskotron and pushed them to updates. Closing.
