Issue #46: race condition when creating a symlink in backingstores - testcloud

testcloud

#46 race condition when creating a symlink in backingstores

Opened 7 years ago by kparal. Modified 6 years ago

In D970 we tried to eliminate race conditions in backingstores. But who would have thought that ln -sf is still not atomic? :smile:

[testcloud.image:image.py:203] 2016-08-31 01:15:49 DEBUG   Local downloads will be stored in /var/lib/testcloud/backingstores.
ln: cannot remove ‘/var/lib/testcloud/backingstores/160831_0000-fedora-24-taskotron_cloud-x86_64.qcow2’: No such file or directory
[libtaskotron:minion.py:237] 2016-08-31 01:15:49 WARNING Task execution was interrupted by an error, doing emergency cleanup
[libtaskotron:minion.py:245] 2016-08-31 01:15:49 INFO    Shutting down the minion...
[libtaskotron:minion.py:257] 2016-08-31 01:15:49 ERROR   Was expecting to find instance taskotron-72799b7c-6f18-11e6-83fd-525400d7d6a4 but it does not already exist
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 246, in execute
    task_vm.teardown()
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 177, in teardown
    tc_instance = self._check_existing_instance(should_exist=True)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 121, in _check_existing_instance
    " already exist".format(self.instancename))
TaskotronRemoteError: Was expecting to find instance taskotron-72799b7c-6f18-11e6-83fd-525400d7d6a4 but it does not already exist
[libtaskotron:logger.py:88] 2016-08-31 01:15:49 CRITICAL Traceback (most recent call last):
  File "/usr/bin/runtask", line 9, in <module>
    load_entry_point('libtaskotron==0.4.16', 'console_scripts', 'runtask')()
  File "/usr/lib/python2.7/site-packages/libtaskotron/main.py", line 163, in main
    overlord.start()
  File "/usr/lib/python2.7/site-packages/libtaskotron/overlord.py", line 92, in start
    runner.execute()
  File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 205, in execute
    task_vm.prepare(**env)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 139, in prepare
    tc_image = self._prepare_image(distro, release, flavor, arch)
  File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 90, in _prepare_image
    tc_image.prepare(copy=False)
  File "/usr/lib/python2.7/site-packages/testcloud/image.py", line 206, in prepare
    self._handle_file_url(self.remote_path, self.local_path, copy=copy)
  File "/usr/lib/python2.7/site-packages/testcloud/image.py", line 165, in _handle_file_url
    subprocess.check_call(['ln', '-s', '-f', source_path, dest_path])
  File "/usr/lib64/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['ln', '-s', '-f', '/var/lib/diskimages/160831_0000-fedora-24-taskotron_cloud-x86_64.qcow2', '/var/lib/testcloud/backingstores/160831_0000-fedora-24-taskotron_cloud-x86_64.qcow2']' returned non-zero exit status 1

http://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/218759/steps/runtask/logs/stdio

Looking into strace, ln -sf calls two kernel calls:

$ strace ln -sf a b
...
unlink("b")                             = 0
symlink("a", "b")                       = 0
...

If we use mv instead, that seems to be just a single call (which could hopefully be finally atomic):

$ strace mv a b
...
rename("a", "b")                        = 0
...

Another solution would be to stop using backingstores (at least as an option in API) and simply use the file from its original location.

somebody commented 7 years ago

This ticket had assigned some Differential requests:
D970

kparal commented 7 years ago

Here's some fun stuff to read:
http://stackoverflow.com/questions/7054844/is-rename-atomic
http://stackoverflow.com/questions/3222341/how-to-rename-without-race-conditions
https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/sysdeps/posix/rename.c#L34

Another option is to use filesystem locks:
lang=python import sys, fcntl, time with open("test.lock", "w+") as lock: fcntl.flock(lock.fileno(), fcntl.LOCK_EX) # begin critical sys.stdout.write("hello\n") time.sleep(3) sys.stdout.write("world\n") # end critical

tflink commented 7 years ago

Do we have any idea how often this is happening in our deployments?

kparal commented 7 years ago

! In #833#11876, @tflink wrote:
Do we have any idea how often this is happening in our deployments?

I don't, especially since we've been recently seeing hundreds of other crashes daily.

Assigning to @lbrabec, he wanted to work on this (months ago), even had some proof of concept implemented.

kparal commented 7 years ago

This is still happening, but probably not a huge issue, so removing from workboard. Unassigning @lbrabec.

Metadata

Assignee

None

Tags

None

Blocking

None

Depending on

None

Priority

Normal

testcloud

Source Code

#46 race condition when creating a symlink in backingstores Opened 7 years ago by kparal. Modified 6 years ago

Close issue as:

Metadata

#46 race condition when creating a symlink in backingstores

Opened 7 years ago by kparal. Modified 6 years ago