5843ec7 Bug 257881: Flush/recovery collision leads to deadlock after leg ...

Authored and Committed by Jonathan Brassow 16 years ago
    
    The procedure for coordinating nominal I/O and recovery I/O was to
    either:
    1) delay a flush that contains a mark to a region being recovered, or
    2) skip over regions that are currently marked when assigning recovery
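    As a rough illustration of rule 2, the sketch below picks the next
    region to recover while skipping regions that are currently marked.
    The function name and the list/set data structures are assumptions
    made for illustration; they are not taken from the actual log code.

```python
def next_recovery_region(out_of_sync, marked):
    """Return the first out-of-sync region that no node has marked.

    out_of_sync: iterable of region ids still needing recovery (hypothetical).
    marked:      set of region ids with a write (mark) in progress (hypothetical).
    Returns None when every candidate region is currently marked.
    """
    for region in out_of_sync:
        if region not in marked:
            return region
    return None
```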
    
    This bug has to do with the way #1 was implemented.
    
    The following scenario would trigger it:
    1) node1 is assigned recovery on region X
    2) node1 also does a mark (write) on region Y
    3) node2 attempts to mark region X
    **) any flush issued at this point is delayed until recovery completes on X
    4) node1 must perform a flush before it can finish the recovery, but
    its flush is delayed too, so everyone is delayed *forever*.
    
    The fix was to allow flushes from nodes that are not attempting to mark
    regions that are being recovered.  In the example above, node1 should be
    allowed to complete the flush because it is not trying to write to the
    same region that is being recovered.  node2 would be correctly delayed.
    Since node1 can complete the flush, it can also complete the recovery -
    thus allowing things to proceed.
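
    The difference between the two behaviours can be sketched as below.
    This is only a model of the description above, not the code changed
    by this commit: the function names, the per-node dictionary of
    pending marks, and the string region ids are all assumptions.

```python
def flush_allowed_old(recovering, pending_marks, node):
    """Pre-fix behaviour as described in the scenario: every flush is
    delayed while any node has a pending mark on a recovering region.
    The requesting node is deliberately ignored."""
    return not any(marks & recovering for marks in pending_marks.values())


def flush_allowed_new(recovering, pending_marks, node):
    """Post-fix behaviour: a flush is delayed only if the requesting
    node itself has a pending mark on a recovering region."""
    return not (pending_marks.get(node, set()) & recovering)


# Scenario from the commit message: node1 recovers X and marks Y,
# while node2 attempts to mark X.
recovering = {"X"}
pending_marks = {"node1": {"Y"}, "node2": {"X"}}
```

    Under the old rule node1's flush is refused, so it can never finish
    recovering X. Under the new rule node1's flush goes through and only
    node2 waits.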
    
    This bug only affects mirrors that are not in-sync and are doing I/O.
    This bug can occur whether there are device/machine failures or not.
    This bug is most easily reproduced with a number of mirrors, but would
    be possible with just one.
    
    I've also fixed up some debugging output so that it is more consistent
    and the flow of events is easier to follow.