sanlock: transient timeout handling in acquire and release

If a transient i/o error causes an acquire or release
command to fail, the error path does not always clean up
properly.

The lockspace will generally survive transient i/o errors
without going into recovery. This means that a failed
acquire or release can leave a lease held on disk, but not
managed by sanlock. Because the host is still alive in the
lockspace, other hosts cannot acquire the abandoned lease.
(If the host leaves the lockspace and rejoins, any abandoned
lease state is invalidated because the host's lockspace
generation number will be newer than what it left on disk.)

If a release fails due to an i/o timeout, it needs to be
retried until it succeeds or fails with a non-timeout error.
This retrying is done asynchronously by the resource thread,
which already handles async releases for clients that exited
without first releasing their leases.
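
As a rough illustration of the retry rule above (not the actual
sanlock code), the loop below keeps retrying a release while it
times out and stops on success or on any other error. The error
codes, retry_release(), fake_release() and the max_tries bound are
hypothetical names made up for this sketch; in sanlock the retries
are driven asynchronously by the resource thread rather than by a
bounded loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error codes for the sketch; not sanlock's real values. */
enum { REL_OK = 0, REL_ERR_TIMEOUT = -1, REL_ERR_OTHER = -2 };

/* Hypothetical callback standing in for one on-disk release attempt. */
typedef int (*release_fn)(void *arg);

/*
 * Retry a release until it succeeds or fails with a non-timeout error.
 * max_tries only bounds the loop so the sketch always terminates.
 * The number of attempts made is stored in *tries_out if it is non-NULL.
 */
int retry_release(release_fn do_release, void *arg, int max_tries,
                  int *tries_out)
{
    int rv = REL_ERR_TIMEOUT;
    int tries = 0;

    assert(do_release != NULL);

    while (tries < max_tries) {
        rv = do_release(arg);
        tries++;
        if (rv != REL_ERR_TIMEOUT)
            break; /* success, or an error that retrying will not fix */
    }

    if (tries_out)
        *tries_out = tries;
    return rv;
}

/* Simulated disk: times out a fixed number of times, then returns final_rv. */
struct fake_disk {
    int timeouts_left;
    int final_rv;
};

int fake_release(void *arg)
{
    struct fake_disk *d = arg;

    if (d->timeouts_left > 0) {
        d->timeouts_left--;
        return REL_ERR_TIMEOUT;
    }
    return d->final_rv;
}
```

The point of the sketch is only the stop condition: a timeout is
treated as "try again later", while any other result ends the retry.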

If an i/o timeout occurs in the acquire path after on-disk
state may have been written, the error path uses release to
ensure any possible on-disk changes are undone. If the
release cannot be completed immediately within the failing
command, it is passed to the resource thread as above.
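
The acquire error path described above can be sketched roughly as
follows. This is a simplified model under stated assumptions, not
the sanlock implementation: struct res_sketch, its fields, the
error codes and acquire_error_exit() are all hypothetical, and the
immediate release is simulated by a stored return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes for the sketch; not sanlock's real values. */
enum { ACQ_OK = 0, ACQ_ERR_TIMEOUT = -1, ACQ_ERR_OTHER = -2 };

/* Hypothetical, simplified stand-in for per-resource state. */
struct res_sketch {
    bool disk_may_be_written; /* acquire got far enough to touch disk */
    bool queued_for_thread;   /* release handed to the resource thread */
    int release_rv;           /* result of the simulated immediate release */
};

/* Simulated immediate release attempt made inside the failing command. */
static int try_release_now(const struct res_sketch *r)
{
    return r->release_rv;
}

/*
 * Error exit for acquire: if the acquire timed out after on-disk state
 * may have been written, try to undo it with a release. If that release
 * also times out, mark it for asynchronous retry by the resource thread.
 * The original acquire error is always what the caller sees.
 */
int acquire_error_exit(struct res_sketch *r, int acquire_rv)
{
    assert(r != NULL);

    if (acquire_rv != ACQ_ERR_TIMEOUT || !r->disk_may_be_written)
        return acquire_rv; /* nothing on disk to undo in this sketch */

    if (try_release_now(r) == ACQ_ERR_TIMEOUT)
        r->queued_for_thread = true; /* retried later, asynchronously */

    return acquire_rv;
}
```

Returning the original acquire error rather than the release result
is a choice made for the sketch; it keeps the cleanup invisible to
the caller of the failed command.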

Changes include:

- calling release in all the necessary places in the
  acquire or release error exit paths
- setting struct resource values to ensure release clears
  all the necessary disk state (leader, dblock, mblock)
- retrying release from the resource_thread if the
  on-disk release operations time out (either from a
  direct release call or a release called from the acquire
  error path)

Signed-off-by: David Teigland <teigland@redhat.com>