#46 race condition when creating a symlink in backingstores
Opened 7 years ago by kparal. Modified 6 years ago

In D970 we tried to eliminate race conditions in backingstores. But who would have thought that ln -sf is still not atomic? :smile:

[testcloud.image:image.py:203] 2016-08-31 01:15:49 DEBUG   Local downloads will be stored in /var/lib/testcloud/backingstores.
ln: cannot remove ‘/var/lib/testcloud/backingstores/160831_0000-fedora-24-taskotron_cloud-x86_64.qcow2’: No such file or directory
[libtaskotron:minion.py:237] 2016-08-31 01:15:49 WARNING Task execution was interrupted by an error, doing emergency cleanup
[libtaskotron:minion.py:245] 2016-08-31 01:15:49 INFO    Shutting down the minion...
[libtaskotron:minion.py:257] 2016-08-31 01:15:49 ERROR   Was expecting to find instance taskotron-72799b7c-6f18-11e6-83fd-525400d7d6a4 but it does not already exist
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 246, in execute
    task_vm.teardown()
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 177, in teardown
    tc_instance = self._check_existing_instance(should_exist=True)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 121, in _check_existing_instance
    " already exist".format(self.instancename))
TaskotronRemoteError: Was expecting to find instance taskotron-72799b7c-6f18-11e6-83fd-525400d7d6a4 but it does not already exist
[libtaskotron:logger.py:88] 2016-08-31 01:15:49 CRITICAL Traceback (most recent call last):
  File "/usr/bin/runtask", line 9, in <module>
    load_entry_point('libtaskotron==0.4.16', 'console_scripts', 'runtask')()
  File "/usr/lib/python2.7/site-packages/libtaskotron/main.py", line 163, in main
    overlord.start()
  File "/usr/lib/python2.7/site-packages/libtaskotron/overlord.py", line 92, in start
    runner.execute()
  File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 205, in execute
    task_vm.prepare(**env)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 139, in prepare
    tc_image = self._prepare_image(distro, release, flavor, arch)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 90, in _prepare_image
    tc_image.prepare(copy=False)
  File "/usr/lib/python2.7/site-packages/testcloud/image.py", line 206, in prepare
    self._handle_file_url(self.remote_path, self.local_path, copy=copy)
  File "/usr/lib/python2.7/site-packages/testcloud/image.py", line 165, in _handle_file_url
    subprocess.check_call(['ln', '-s', '-f', source_path, dest_path])
  File "/usr/lib64/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['ln', '-s', '-f', '/var/lib/diskimages/160831_0000-fedora-24-taskotron_cloud-x86_64.qcow2', '/var/lib/testcloud/backingstores/160831_0000-fedora-24-taskotron_cloud-x86_64.qcow2']' returned non-zero exit status 1

http://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/218759/steps/runtask/logs/stdio

Looking into strace, ln -sf calls two kernel calls:

$ strace ln -sf a b
...
unlink("b")                             = 0
symlink("a", "b")                       = 0
...

If we use mv instead, that seems to be just a single call (which could hopefully be finally atomic):

$ strace mv a b
...
rename("a", "b")                        = 0
...

Another solution would be to stop using backingstores (at least as an option in API) and simply use the file from its original location.


This ticket had assigned some Differential requests:
D970

Here's some fun stuff to read:
http://stackoverflow.com/questions/7054844/is-rename-atomic
http://stackoverflow.com/questions/3222341/how-to-rename-without-race-conditions
https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/sysdeps/posix/rename.c#L34

Another option is to use filesystem locks:
lang=python import sys, fcntl, time with open("test.lock", "w+") as lock:      fcntl.flock(lock.fileno(), fcntl.LOCK_EX)      # begin critical      sys.stdout.write("hello\n")      time.sleep(3)      sys.stdout.write("world\n")      # end critical

Do we have any idea how often this is happening in our deployments?

! In #833#11876, @tflink wrote:
Do we have any idea how often this is happening in our deployments?

I don't, especially since we've been recently seeing hundreds of other crashes daily.

Assigning to @lbrabec, he wanted to work on this (months ago), even had some proof of concept implemented.

This is still happening, but probably not a huge issue, so removing from workboard. Unassigning @lbrabec.

Login to comment on this ticket.

Metadata