Commit - linux-cluster/cluster - b32a74b2a1cf695196078baecd41014bd12405ac

    There are a couple of potential problems this should fix related to
    an overlap of dlm_recoverd in ls_nodes_reconfig (changing the ls_nodes
    list) for recovery event N and dlm_recvd in dlm_dir_rebuild_send
    (reading ls_nodes list) for recovery event N-1.
    
    We now wait, in dlm_ls_stop(), for dlm_recoverd to detect and abort
    recovery event N-1.  This ensures that when all nodes receive
    dlm_ls_start() and begin recovery event N, no other nodes are
    still working on recovery event N-1 in any way.
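
The stop-and-wait handshake described above can be sketched in userspace with pthreads. This is only a minimal model under stated assumptions: `struct model_ls`, `model_ls_start()`, `model_ls_stop()` and `model_recoverd()` are hypothetical names invented for illustration, not the dlm-kernel code, and a condition variable stands in for whatever wait mechanism the real patch uses.

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical userspace model of a lockspace; field and function names
 * are illustrative and are not taken from dlm-kernel. */
struct model_ls {
    pthread_mutex_t lock;
    pthread_cond_t  idle;     /* signalled when the worker leaves recovery */
    bool stop_requested;      /* set by model_ls_stop(), polled by the worker */
    bool recovery_active;     /* true while the worker is inside an event */
    int  aborted_events;      /* how many events the worker abandoned */
};

void model_ls_init(struct model_ls *ls)
{
    pthread_mutex_init(&ls->lock, NULL);
    pthread_cond_init(&ls->idle, NULL);
    ls->stop_requested = false;
    ls->recovery_active = false;
    ls->aborted_events = 0;
}

/* Worker: keeps "recovering" until it notices the stop request, then
 * records the abort and wakes anyone waiting in model_ls_stop(). */
static void *model_recoverd(void *arg)
{
    struct model_ls *ls = arg;

    for (;;) {
        pthread_mutex_lock(&ls->lock);
        if (ls->stop_requested) {
            ls->recovery_active = false;
            ls->aborted_events++;
            pthread_cond_broadcast(&ls->idle);
            pthread_mutex_unlock(&ls->lock);
            return NULL;
        }
        pthread_mutex_unlock(&ls->lock);
        sched_yield();        /* stand-in for one step of recovery work */
    }
}

/* Mark the event active before the thread exists, so a stop that races
 * with thread startup still waits for the worker. */
int model_ls_start(struct model_ls *ls, pthread_t *worker)
{
    int rc;

    pthread_mutex_lock(&ls->lock);
    ls->stop_requested = false;
    ls->recovery_active = true;
    pthread_mutex_unlock(&ls->lock);

    rc = pthread_create(worker, NULL, model_recoverd, ls);
    if (rc != 0) {
        pthread_mutex_lock(&ls->lock);
        ls->recovery_active = false;
        pthread_mutex_unlock(&ls->lock);
    }
    return rc;
}

/* Request the abort and block until the worker has acknowledged it. */
void model_ls_stop(struct model_ls *ls)
{
    pthread_mutex_lock(&ls->lock);
    ls->stop_requested = true;
    while (ls->recovery_active)
        pthread_cond_wait(&ls->idle, &ls->lock);
    pthread_mutex_unlock(&ls->lock);
}
```

In this model, once `model_ls_stop()` returns the caller knows the worker is no longer touching state that belongs to the old event, which is the property the message asks of dlm_ls_stop().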
    
    There's still the chance that a stray/delayed message (in particular a
    RECOVERNAMES request) pertaining to event N-1 will be delivered while all
    nodes are working on event N.  There's an added check that should
    prevent this from causing any trouble for all practical purposes.
    
    One other possible problem is dlm_ls_stop clearing the status bits right
    before ls_nodes_reconfig sets them.  Doing this after dlm_recoverd
    is suspended makes it safe.
    
    I don't know which, if any, of these potential problems have actually
    been observed; none of them are very likely.  But I think there's
    a fair chance that the first problem matches bz 145831.

dlm-kernel/src/dlm_internal.h   file modified   +1 -0
dlm-kernel/src/lockspace.c      file modified   +10 -1
dlm-kernel/src/nodes.c          file modified   +1 -1
dlm-kernel/src/reccomms.c       file modified   +12 -0
dlm-kernel/src/recoverd.c       file modified   +17 -0
dlm-kernel/src/recoverd.h       file modified   +2 -0

Authored and Committed by teigland 19 years ago