Commit - sanlock - 48d2841cad99084f037ccfb2c8da54df04fe8df5

    sanlock: transient timeout handling in acquire and release
    
    If a transient i/o error causes an acquire or release
    command to fail, the error path does not always clean up
    properly.
    
    The lockspace will generally survive transient i/o errors
    without going into recovery.  This means that a failed
    acquire or release can leave a lease held on disk, but not
    managed by sanlock.  Because the host is still alive in the
    lockspace, other hosts cannot acquire the abandoned lease.
    
    (If the host leaves the lockspace and rejoins, any abandoned
    lease state is invalidated because the host's lockspace
    generation number will be newer than what it left on disk.)
    
    If a release fails due to an i/o timeout, it needs to be
    retried until it succeeds or fails with a non-timeout error.
    The retry is done asynchronously by the resource thread,
    which already handles async releases for clients that exit
    without first releasing their leases.
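
The retry rule described above (keep retrying on timeout, stop on success or on any other error) might look roughly like the following. This is a minimal sketch, not the code in src/resource.c: the error codes, `release_with_retry`, and the `fake_lease` stand-in for the disk are all hypothetical, and the bounded loop stands in for a thread that would retry on later wakeups.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; not sanlock's real values. */
enum { REL_OK = 0, REL_ERR_TIMEOUT = -1, REL_ERR_OTHER = -2 };

/* Hypothetical on-disk release operation supplied by the caller. */
typedef int (*release_fn)(void *lease);

/*
 * Retry a release until it succeeds or fails with a non-timeout error.
 * max_tries bounds the loop for the sketch. Returns the last result and
 * stores the number of attempts in *tries_out when it is not NULL.
 */
int release_with_retry(release_fn do_release, void *lease,
                       int max_tries, int *tries_out)
{
    int rv = REL_ERR_TIMEOUT;
    int tries = 0;

    assert(do_release != NULL && max_tries > 0);

    while (tries < max_tries) {
        rv = do_release(lease);
        tries++;
        if (rv != REL_ERR_TIMEOUT)
            break;              /* success or a non-timeout error: stop */
    }
    if (tries_out)
        *tries_out = tries;
    return rv;
}

/* Stand-in for the disk: times out a few times, then returns final_rv. */
struct fake_lease {
    int timeouts_left;
    int final_rv;
    int held;
};

int fake_release(void *arg)
{
    struct fake_lease *l = arg;

    if (l->timeouts_left > 0) {
        l->timeouts_left--;
        return REL_ERR_TIMEOUT;
    }
    if (l->final_rv == REL_OK)
        l->held = 0;            /* on-disk state cleared */
    return l->final_rv;
}
```

With a `fake_lease` that times out twice before succeeding, `release_with_retry(fake_release, &lease, 10, &tries)` makes three attempts and leaves `held` cleared.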
    
    If an i/o timeout occurs in the acquire path after on-disk
    state may have been written, the error path uses release to
    ensure any possible on-disk changes are undone.  If the
    release cannot be completed immediately within the failing
    command, it is passed to the resource thread as above.
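
The acquire error path described above can be sketched as follows. Everything here is illustrative: `disk_sim`, `res_state`, and `acquire_sketch` are hypothetical names, the disk is simulated by preset return values, and the handling of non-timeout write errors is left out because the commit message does not describe it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; not sanlock's real values. */
enum { ACQ_OK = 0, ACQ_ERR_TIMEOUT = -1, ACQ_ERR_OTHER = -2 };

/* Simulated disk: preset results for the acquire write and the release. */
struct disk_sim {
    int write_rv;
    int release_rv;
    int release_calls;
};

/* Hypothetical per-resource bookkeeping. */
struct res_state {
    int held;              /* lease is held and managed by the daemon */
    int release_deferred;  /* release handed to the resource thread */
};

/*
 * Acquire with cleanup: a write that times out may still have reached
 * the disk, so the error path releases to undo any possible change.
 * If that release also times out, it is deferred for later retry.
 */
int acquire_sketch(struct disk_sim *d, struct res_state *r)
{
    int rv;

    assert(d != NULL && r != NULL);
    r->held = 0;
    r->release_deferred = 0;

    rv = d->write_rv;              /* stands in for the on-disk write */
    if (rv == ACQ_OK) {
        r->held = 1;
        return ACQ_OK;
    }

    if (rv == ACQ_ERR_TIMEOUT) {
        d->release_calls++;        /* try to undo within this command */
        if (d->release_rv == ACQ_ERR_TIMEOUT)
            r->release_deferred = 1;
    }
    /* Non-timeout write errors: not covered by this sketch. */
    return rv;
}
```

The caller still sees the original acquire error in every failing case; the deferred flag only records that cleanup is still outstanding.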
    
    Changes include:
    - calling release in all the necessary places in the
      acquire or release error exit paths
    - setting struct resource values to ensure release clears
      all the necessary disk state (leader, dblock, mblock)
    - retrying release from the resource_thread if the
      on-disk release operations time out (either from a
      direct release call or a release called from the acquire
      error path)
    
    Signed-off-by: David Teigland <teigland@redhat.com>

Files modified:

src/cmd.c               +2 -0
src/main.c              +1 -0
src/paxos_lease.c       +101 -17
src/paxos_lease.h       +5 -0
src/resource.c          +401 -69
src/resource.h          +2 -0
src/sanlock_internal.h  +4 -0

Authored and Committed by teigland 10 years ago