95d4db6 Fix potential cluster mirror corruption: 456575/471291

Authored and Committed by Jonathan Brassow 13 years ago
    Fix potential cluster mirror corruption: 456575/471291
    
    From my inline comments:
    * If the mirror was successfully recovered, we want to always
    * force every machine to write to all devices - otherwise,
    * corruption will occur.  Here's how:
    *    Node1 suffers a failure and marks a region out-of-sync
    *    Node2 attempts a write, gets past the is_remote_recovering
    *          check, and queries the sync status of the region -
    *          finding it out-of-sync.
    *    Node2 thinks the write should be a nosync write, but it
    *          hasn't suffered the drive failure that Node1 has yet.
    *          It then issues a generic_make_request directly to
    *          the primary image only - which is exactly the device
    *          that has suffered the failure.
    *    Node2 suffers a lost write - which completely bypasses the
    *          mirror layer because it had gone through
    *          generic_make_request.
    *    The file system will likely explode at this point due to
    *    I/O errors.  If it wasn't the primary that failed, writes
    *    could easily be issued to just one of the remaining
    *    images - also leaving the mirror inconsistent.
    *
    * We let in_sync() return 1 in a cluster regardless of what is
    * in the bitmap once recovery has successfully completed on a
    * mirror.  This ensures the mirroring code will continue to
    * attempt to write to all mirror images.  The worst that can
    * happen for reads is that additional read attempts may be
    * made.
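
A minimal sketch of the in_sync() behaviour described above, not the code from this patch. The struct, its fields, and the function names are hypothetical and exist only for illustration; the real log implementation tracks far more state. It assumes a small bitmap in which a set bit means the region is in sync.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a cluster mirror log. */
struct sketch_log {
    bool clustered;          /* log is shared between cluster nodes */
    bool recovery_complete;  /* mirror recovery finished successfully */
    unsigned long bitmap;    /* bit i set => region i is in sync */
};

/* Plain bitmap lookup: what a node would conclude without the override. */
static int sketch_bitmap_in_sync(const struct sketch_log *log, unsigned region)
{
    return (int)((log->bitmap >> region) & 1UL);
}

/*
 * Once recovery has completed on a clustered mirror, report every region
 * as in-sync regardless of the bitmap, so that callers keep writing to
 * all mirror images instead of issuing a nosync write to the primary only.
 */
static int sketch_in_sync(const struct sketch_log *log, unsigned region)
{
    if (log->clustered && log->recovery_complete)
        return 1;
    return sketch_bitmap_in_sync(log, region);
}
```

In this sketch, a region that Node1 has marked out-of-sync still reports as in-sync to Node2 after recovery, so Node2's write goes through the mirror layer, where a device failure would be detected and handled.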
    
        
file modified
+42 -2