e4a4744 dlm-kernel: ignore unlock reply of zero

Authored and Committed by teigland 15 years ago
    dlm-kernel: ignore unlock reply of zero
    
    bz 495600
    
    After studying bug 349001 and the patch for it, it appears the patch still
    leaves a small window for the same problem to occur.  The symptoms are
    consistent with that, and it's the only possibility I can come up with.
    
    The problem that commit c92628dcc39e03a4e9eccc4fa76257c871e5ba00 aims to fix
    is a grant message followed by a convert reply message for the same lock.  I
    think the following sequence of events could still cause that to happen, even
    though the patch closes the window most of the way.
    
    dlm_recv thread
    1. receive convert -- process_cluster_request/GDLM_REMCMD_CONVREQUEST
    2.   lkb->lkb_request = freq
    3.   dlm_convert_stage2()
    4.     lkb can't be granted immediately, so it's put on the convert queue
    6.   if (lkb->lkb_request != NULL)
    9.     send convert reply
    
    other thread
    5. unlocks another lkb, finds lkb above can be granted, calls remote_grant()
    7. lkb->lkb_request = NULL
    8. send grant message
    
    remote_grant() is supposed to prevent the convert reply from being sent at all
    by setting lkb_request = NULL.  But given the right race, the NULL check in
    step 6 runs before step 7 clears the pointer, so both the grant and the
    convert reply messages are sent (and sent in the bad order).
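
    The interleaving above can be replayed deterministically in a small
    user-space sketch.  This is only an illustration under stated assumptions:
    race_lkb, race_wire, race_send and race_replay are hypothetical names, not
    dlm-kernel code, and the two threads are collapsed into one fixed ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real dlm-kernel types. */
enum race_msg { RACE_MSG_GRANT, RACE_MSG_CONVERT_REPLY };

struct race_lkb {
    void *request;              /* models lkb->lkb_request */
};

struct race_wire {
    enum race_msg sent[2];      /* messages in the order they were sent */
    int count;
};

static void race_send(struct race_wire *w, enum race_msg m)
{
    assert(w != NULL && w->count < 2);
    w->sent[w->count++] = m;
}

/*
 * Replays both threads as one fixed interleaving.  With
 * clear_before_check == false the order is steps 6, 7, 8, 9 as numbered
 * in the commit message.  With true, the pointer is cleared before the
 * check in step 6, which is the case the sender-side patch handles.
 */
static void race_replay(struct race_wire *w, bool clear_before_check)
{
    int freq = 0;                         /* dummy request object */
    struct race_lkb lkb = { .request = NULL };
    bool send_reply;

    lkb.request = &freq;                  /* step 2 */
    /* steps 3-5: convert is queued, other thread finds it grantable */

    if (clear_before_check)
        lkb.request = NULL;               /* step 7 wins the race */

    send_reply = (lkb.request != NULL);   /* step 6 */
    lkb.request = NULL;                   /* step 7 */
    race_send(w, RACE_MSG_GRANT);         /* step 8 */
    if (send_reply)
        race_send(w, RACE_MSG_CONVERT_REPLY);   /* step 9 */
}
```

    Calling race_replay() with clear_before_check set to false records a grant
    followed by a convert reply; with true, only the grant is recorded.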
    
    This isn't a general fix for the problem, but an additional workaround.
    As soon as we get a reply, before doing anything else, we check whether
    it's an unlock reply with a status of 0, which shouldn't happen.  If it is,
    we ignore the reply (which the previous workaround on the sender side failed
    to prevent).
    
    Currently, we recognize this "condition which shouldn't happen" only after
    we've made changes (removed the lkb from the queue), which is too late to
    avoid disrupting the proper handling of the real unlock reply.  This just
    moves the check earlier, to a point where the reply can be ignored without
    having changed any state.
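
    The idea of checking before touching any state can be sketched as follows.
    This is not the actual patch: reply_lkb, reply_msg, REPLY_WAIT_UNLOCK and
    reply_process are hypothetical names, and the sketch assumes only that a
    genuine unlock reply carries a nonzero status (the concrete value is not
    given here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real dlm-kernel types. */
enum reply_wait { REPLY_WAIT_NONE, REPLY_WAIT_CONVERT, REPLY_WAIT_UNLOCK };

struct reply_lkb {
    enum reply_wait waiting;    /* which kind of reply the lkb expects */
    bool on_queue;              /* still queued, waiting for its reply */
};

struct reply_msg {
    int status;                 /* status carried by the reply */
};

/*
 * Returns true if the reply was consumed, false if it was ignored.
 * The "shouldn't happen" check comes first, so an ignored reply
 * leaves the lkb exactly as it was.
 */
static bool reply_process(struct reply_lkb *lkb, const struct reply_msg *msg)
{
    assert(lkb != NULL && msg != NULL);

    /* Early check: an unlock reply with status 0 is treated as stray. */
    if (lkb->waiting == REPLY_WAIT_UNLOCK && msg->status == 0)
        return false;

    /* Only now is state modified: dequeue and stop waiting. */
    lkb->on_queue = false;
    lkb->waiting = REPLY_WAIT_NONE;
    return true;
}
```

    Because the stray reply returns before the dequeue, the lkb is still
    queued and waiting when the real unlock reply arrives later.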
    
    Also, add more info to the process_lockqueue_reply state 0 message to
    help understand more about the cause.
    
    Signed-off-by: David Teigland <teigland@redhat.com>
    
        
file modified
+18 -4