c61e717 gfs_controld: fix plock recovery

2 files Authored by teigland 10 years ago, Committed by chrissie 10 years ago
    gfs_controld: fix plock recovery
    
    When there are two nodes in the cluster, and the
    node in charge of the plock checkpoint fails,
    the remaining node does not unlink the checkpoint
    that had been created by the failed node.  When
    the failed node returns, and the remaining node
    attempts to transfer plock state, it fails to
    create a new checkpoint because it did not unlink
    the previous checkpoint created by the failed
    node.  This leads
    to any existing plock state not being transferred
    to the newly joined node.  The newly joined node
    will then mistakenly grant plocks to itself that
    may conflict with plocks that the other node could
    not transfer.  This leads to:
    
    1. conflicting plocks being held concurrently
    2. dangling plocks that are no longer held but were never removed
    
    In the explanation above, the reason the remaining
    node does not unlink the checkpoint that had been
    created by the other node is that it does not know
    that the other node was in charge of the checkpoint.
    It could only know this if it had been present before
    and after the previous membership change.  Because
    there are only two nodes, this was not possible.
    This, however, is also the point exploited to fix
    the problem.  When there are only two members, a new
    node can assume that the other node is in charge of
    the checkpoint.
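
    The two-member assumption can be sketched roughly as
    follows.  This is only an illustration and not the
    code in this patch: the function name, the
    NODE_UNKNOWN constant and the member array layout are
    all hypothetical.

```c
#include <stddef.h>

/* Hypothetical marker: we do not know who owns the checkpoint. */
#define NODE_UNKNOWN (-1)

/*
 * Return the node id assumed to be in charge of the plock checkpoint.
 *
 * known_master is what this node learned by being present across an
 * earlier membership change, or NODE_UNKNOWN if it has just joined.
 * With exactly two members, a node that has no such knowledge can
 * assume the other member is in charge.
 */
int assumed_ckpt_master(const int *members, size_t count,
                        int our_nodeid, int known_master)
{
    if (known_master != NODE_UNKNOWN)
        return known_master;

    if (count == 2)
        return (members[0] == our_nodeid) ? members[1] : members[0];

    /* More than two members (or only one): nothing can be inferred. */
    return NODE_UNKNOWN;
}
```

    With this assumption in place, the surviving node would
    know which checkpoint it has to unlink once the other
    node fails.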
    
    The following test shows the problem/fix using
    a program "doplock" that requests an exclusive,
    blocking posix lock on the given file.
    
    node1: mount /gfs
    node2: mount /gfs
    node1: touch /gfs/test
    node1: doplock /gfs/test (granted)
    node2: doplock /gfs/test (blocks)
    node1: killed
    node2: recovery for node1
    node2: doplock above granted the lock
    node1: restarts
    node1: mount /gfs
    node1: doplock /gfs/test
    
    In the last step, the node1 doplock should block
    because node2 holds the lock.  Before the fix,
    it was granted.
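
    The source of "doplock" is not part of this commit.  A
    minimal sketch of what such a helper might look like,
    assuming it takes a whole-file write lock with
    fcntl(F_SETLKW), is shown below.  The function names are
    made up for illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Request an exclusive, blocking POSIX lock on the whole file.
 * F_SETLKW sleeps until the lock can be granted.
 * Returns 0 when the lock is granted, -1 on error (errno is set).
 */
int lock_exclusive_blocking(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;     /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;            /* 0 means "through end of file" */

    return fcntl(fd, F_SETLKW, &fl);
}

/*
 * Hypothetical doplock body: open the file, wait for the lock, then
 * hold it until the process is killed.  Returns -1 on failure and
 * does not return on success.
 */
int doplock_hold(const char *path)
{
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    if (lock_exclusive_blocking(fd) < 0) {
        close(fd);
        return -1;
    }
    printf("granted\n");
    fflush(stdout);
    for (;;)
        pause();             /* keep the lock until a signal kills us */
}
```

    A main() that passes argv[1] to doplock_hold() would be
    enough to reproduce the steps above by hand.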
    
    Signed-off-by: David Teigland <teigland@redhat.com>
    
        
file modified
+7 -0
file modified
+12 -2