In D970 we tried to eliminate race conditions in backingstores. But who would have thought that `ln -sf` is still not atomic? :smile:
```
[testcloud.image:image.py:203] 2016-08-31 01:15:49 DEBUG Local downloads will be stored in /var/lib/testcloud/backingstores.
ln: cannot remove ‘/var/lib/testcloud/backingstores/160831_0000-fedora-24-taskotron_cloud-x86_64.qcow2’: No such file or directory
[libtaskotron:minion.py:237] 2016-08-31 01:15:49 WARNING Task execution was interrupted by an error, doing emergency cleanup
[libtaskotron:minion.py:245] 2016-08-31 01:15:49 INFO Shutting down the minion...
[libtaskotron:minion.py:257] 2016-08-31 01:15:49 ERROR Was expecting to find instance taskotron-72799b7c-6f18-11e6-83fd-525400d7d6a4 but it does not already exist
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 246, in execute
    task_vm.teardown()
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 177, in teardown
    tc_instance = self._check_existing_instance(should_exist=True)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 121, in _check_existing_instance
    " already exist".format(self.instancename))
TaskotronRemoteError: Was expecting to find instance taskotron-72799b7c-6f18-11e6-83fd-525400d7d6a4 but it does not already exist
[libtaskotron:logger.py:88] 2016-08-31 01:15:49 CRITICAL Traceback (most recent call last):
  File "/usr/bin/runtask", line 9, in <module>
    load_entry_point('libtaskotron==0.4.16', 'console_scripts', 'runtask')()
  File "/usr/lib/python2.7/site-packages/libtaskotron/main.py", line 163, in main
    overlord.start()
  File "/usr/lib/python2.7/site-packages/libtaskotron/overlord.py", line 92, in start
    runner.execute()
  File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 205, in execute
    task_vm.prepare(**env)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 139, in prepare
    tc_image = self._prepare_image(distro, release, flavor, arch)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 90, in _prepare_image
    tc_image.prepare(copy=False)
  File "/usr/lib/python2.7/site-packages/testcloud/image.py", line 206, in prepare
    self._handle_file_url(self.remote_path, self.local_path, copy=copy)
  File "/usr/lib/python2.7/site-packages/testcloud/image.py", line 165, in _handle_file_url
    subprocess.check_call(['ln', '-s', '-f', source_path, dest_path])
  File "/usr/lib64/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['ln', '-s', '-f', '/var/lib/diskimages/160831_0000-fedora-24-taskotron_cloud-x86_64.qcow2', '/var/lib/testcloud/backingstores/160831_0000-fedora-24-taskotron_cloud-x86_64.qcow2']' returned non-zero exit status 1
```
http://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/218759/steps/runtask/logs/stdio
Looking at the strace output, `ln -sf` makes two separate system calls:
```
$ strace ln -sf a b
...
unlink("b")       = 0
symlink("a", "b") = 0
...
```
If we use `mv` instead, that is just a single `rename()` call, which should hopefully be atomic at last:
```
$ strace mv a b
...
rename("a", "b") = 0
...
```
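The same trick can be sketched in Python: create the symlink under a temporary name in the destination directory, then `os.rename()` it over the final path. On POSIX filesystems `rename()` atomically replaces the destination, so no other process ever observes the link missing. This is only an illustration of the approach, not testcloud's actual code; the helper name and the temp-name scheme are made up here.

```python
import os
import uuid


def atomic_symlink(source_path, dest_path):
    """Atomically create or replace dest_path as a symlink to source_path.

    Unlike `ln -sf` (unlink + symlink, two syscalls with a race window in
    between), this creates the symlink under a temporary name and then
    rename()s it over dest_path in one atomic step.
    """
    # Temp name must live in the same directory/filesystem as dest_path,
    # otherwise rename() would not be atomic.
    tmp_path = '{0}.tmp-{1}'.format(dest_path, uuid.uuid4().hex)
    os.symlink(source_path, tmp_path)
    try:
        os.rename(tmp_path, dest_path)  # atomic replace on POSIX
    except OSError:
        os.unlink(tmp_path)  # don't leave the temp link behind
        raise
```

Calling it repeatedly (as concurrent minions preparing the same backingstore would) always leaves a valid symlink in place.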
Another solution would be to stop using backingstores (at least as an option in the API) and simply use the file from its original location.
This ticket has related Differential requests: D970
Here's some fun stuff to read:
- http://stackoverflow.com/questions/7054844/is-rename-atomic
- http://stackoverflow.com/questions/3222341/how-to-rename-without-race-conditions
- https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/sysdeps/posix/rename.c#L34
Another option is to use filesystem locks:

```python
import sys, fcntl, time

with open("test.lock", "w+") as lock:
    fcntl.flock(lock.fileno(), fcntl.LOCK_EX)
    # begin critical
    sys.stdout.write("hello\n")
    time.sleep(3)
    sys.stdout.write("world\n")
    # end critical
```
Do we have any idea how often this is happening in our deployments?
> In #833#11876, @tflink wrote:
> Do we have any idea how often this is happening in our deployments?
I don't, especially since we've been recently seeing hundreds of other crashes daily.
Assigning to @lbrabec, who wanted to work on this (months ago) and even had a proof of concept implemented.
This is still happening, but it's probably not a huge issue, so removing it from the workboard. Unassigning @lbrabec.