If there's a new taskotron base image that testcloud doesn't know about yet (i.e. it hasn't been copied to backingstores), there's a race condition when multiple jobs are executed simultaneously. In our environment, this happens when the first event received after a new base image is built is a koji build-completed fedmsg. In that case rpmlint and rpmgrill are both scheduled; one of them completes and the other crashes with:
```
[libtaskotron:minion.py:251] 2016-08-12 01:13:35 ERROR Was expecting to find instance taskotron-fc81d648-6029-11e6-a0e2-525400d7d6a4 but it does not already exist
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 240, in execute
    task_vm.teardown()
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 174, in teardown
    tc_instance = self._check_existing_instance(should_exist=True)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 118, in _check_existing_instance
    " already exist".format(self.instancename))
TaskotronRemoteError: Was expecting to find instance taskotron-fc81d648-6029-11e6-a0e2-525400d7d6a4 but it does not already exist

[libtaskotron:logger.py:88] 2016-08-12 01:13:35 CRITICAL
Traceback (most recent call last):
  File "/usr/bin/runtask", line 9, in <module>
    load_entry_point('libtaskotron==0.4.15', 'console_scripts', 'runtask')()
  File "/usr/lib/python2.7/site-packages/libtaskotron/main.py", line 163, in main
    overlord.start()
  File "/usr/lib/python2.7/site-packages/libtaskotron/overlord.py", line 92, in start
    runner.execute()
  File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 201, in execute
    task_vm.prepare(**env)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 136, in prepare
    tc_image = self._prepare_image(distro, release, flavor, arch)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 87, in _prepare_image
    tc_image.prepare()
  File "/usr/lib/python2.7/site-packages/testcloud/image.py", line 198, in prepare
    self._handle_file_url(self.remote_path, self.local_path)
  File "/usr/lib/python2.7/site-packages/testcloud/image.py", line 162, in _handle_file_url
    shutil.copy(source_path, dest_path)
  File "/usr/lib64/python2.7/shutil.py", line 119, in copy
    copyfile(src, dst)
  File "/usr/lib64/python2.7/shutil.py", line 83, in copyfile
    with open(dst, 'wb') as fdst:
IOError: [Errno 13] Permission denied: '/var/lib/testcloud/backingstores/160812_0000-fedora-23-taskotron_cloud-x86_64.qcow2'
```
The two tasks:
http://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/208534
http://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/208533
What probably happens:
1. The first triggered task asks testcloud to use image X; testcloud doesn't have it yet and starts copying it into backingstores.
2. Before the copy finishes, the second task asks testcloud to use image X; testcloud already sees it in backingstores (even though it is incomplete) and boots a VM from it. The VM fails because the image is not ready (incompletely copied, wrong permissions); see the sketch below.
3. The copy finishes; the first task starts and runs OK.
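To make the race concrete, here is a minimal sketch of the problem and one possible remedy. This is not testcloud's actual code; `handle_file_url` is a simplified, hypothetical stand-in for the `_handle_file_url` seen in the traceback above:

```python
import os
import shutil

def handle_file_url(source_path, dest_path):
    # Racy check-then-copy: nothing stops a second minion from running
    # between the existence check and the end of the (slow) copy, at
    # which point it sees a partially written image and boots from it.
    if not os.path.exists(dest_path):
        shutil.copy(source_path, dest_path)

def handle_file_url_atomic(source_path, dest_path):
    # One possible fix: copy to a unique temporary name in the same
    # directory, then publish it with os.rename(), which is atomic on
    # POSIX for paths on the same filesystem. Concurrent minions see
    # either no image at all or the fully copied one, never a partial
    # file.
    tmp_path = '{0}.part.{1}'.format(dest_path, os.getpid())
    shutil.copy(source_path, tmp_path)
    os.rename(tmp_path, dest_path)
```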
What we can do:
- Use the same directory for the taskotron images dir and the testcloud backingstore (if that works).
- Make testcloud not copy images for file:// URLs and use them directly.
- Use just a symlink from backingstores for file:// URLs (a sketch of this approach follows below).
- Something else?
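For the symlink option, a rough sketch of what that could look like (the helper name and error handling are my assumptions, not testcloud's API):

```python
import errno
import os

def link_file_url(source_path, dest_path):
    # For file:// URLs the image is already on local disk, so a symlink
    # avoids the long copy entirely. os.symlink() either succeeds or
    # fails with EEXIST; there is no window in which another minion can
    # observe a half-created entry, unlike shutil.copy().
    try:
        os.symlink(source_path, dest_path)
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise
        # Another minion linked the same image first; nothing to do.
```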
This ticket is a duplicate of https://pagure.io/taskotron/base_images/issue/3
This ticket had the following Differential requests assigned: D972, D971, D970.
Everything is committed and the patch seems to be working well on taskotron-dev. We'll wait a few more days, and then we need to do new testcloud and libtaskotron builds and push them to updates.
The fix seems to be working on dev, and @mkrizek built new testcloud and libtaskotron packages and pushed them to updates. Closing.