bb70c22 sanlock: do not close connection in error handling

Authored and Committed by teigland 2 years ago
    sanlock: do not close connection in error handling
    
    When processing a client connection, a problem with the message
    or with the client handling would cause the sanlock daemon to
    close the client connection (the socket fd) and release any
    leases if the connection was "registered".  If this happened,
    the client may be unaware of it, and may continue running,
    using the leases that have been dropped.  This could lead to
    different clients on different hosts believing that they hold
    the same lease concurrently.
    
    A known cause of this is when a sanlock client program makes
    libsanlock calls on a registered connection concurrently
    from multiple threads without serialization.  These calls are:
    sanlock_acquire, sanlock_release, sanlock_inquire,
    sanlock_convert, sanlock_restrict, sanlock_killpath.
    
    These calls involve:
    - sending a header to the sanlock daemon
    - sending a body to the sanlock daemon
    - receving a reply from the sanlock daemon
    
    If these steps are interleaved from multiple threads, the
    sanlock daemon will read incorrect data when processing
    a request, or the wrong result could be received by the caller.
    
    A specific example that's been seen involves two concurrent
    sanlock_release calls from different threads.  The proper
    sequence would be:
    
    sanlock_release_1
      send header_1
      send body_1
      recv result_1
    sanlock_release_2
      send header_2
      send body_2
      recv result_2
    
    Without locking, data from both requests are interleaved,
    causing a sequence to be received in the daemon:
    
      header_1
      header_2
      body_1
    
    The sanlock daemon expects the correct sequence of data,
    so it misinterprets the mixed data and reports errors when
    it finds unexpected fields in what it thinks are headers
    and body structs.
    
    The sanlock daemon recvs header_1, then recvs header_2,
    and thinks header_2 is body_1.  When it finds an unknown
    resource name in what it thinks is body_1 (really header_2),
    it logs an error to sanlock.log:
    
    cmd_release 19,86,41799 no resource ...
    
    Then the sanlock dameon recvs data from body_1, and thinks
    it is header_2.  When it finds an invalid magic number in
    header_2 (really body_1), it logs an error to sanlock.log:
    
    ci 19 recv 32 magic 0 vs 4282010
    
    The error path after seeing a bad magic number in a message
    header is to call deadfn().  For a registered connection,
    this is client_pid_dead() which closes the socket fd for
    the client and drops leases held by the client.
    
    Closing the connection is the wrong way to handle a bad message
    because the client is still running, and a closed connection
    implies that a registered client has exited.
    
    The fix is to simply ignore the bad data (logging an error).
    By ignoring the messages, the sanlock clients will likely be
    stuck waiting for replies, or possibly receive errors from
    their calls.  So, this fix only prevents leases from being
    dropped incorrectly.  Clients must still serialize access
    to sockets.
    
    Other error conditions in processing a client connection also
    use this same incorrect error handling, and they are also
    changed to simply ignore the issue and log an error.
    
        
file modified
+24 -20
file modified
+3 -0
file modified
+7 -2
file added
+208