qdiskd: Make multipath issues go away
Qdiskd has historically required significant tuning to work around
delays which occur during multipath failover, overloaded I/O, and LUN
trespasses in both device-mapper-multipath and EMC PowerPath
environments.
This patch goes a very long way towards eliminating false evictions
when these conditions occur by making qdiskd whine to the other
cluster members when it detects hung system calls. When a cluster
member whines, it indicates the source of the problem (which system
call is hung), and the act of receiving a whine from a host indicates
that qdiskd is operational, but that I/O is hung. Hung I/O is different
from losing storage entirely (where you get I/O errors).
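The detection idea above can be sketched roughly as follows. This is an
illustration only, not the code in this patch: struct syscall_state,
syscall_enter, syscall_exit, hung_syscall, format_whine, the time limit,
and the message text are all hypothetical names and values chosen for
the example.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical record of the system call currently in flight. */
struct syscall_state {
	const char *name;	/* e.g. "read"; NULL when no call is in flight */
	time_t started;		/* time at which the call was entered */
};

/* Note that a system call is about to be issued. */
static void syscall_enter(struct syscall_state *s, const char *name, time_t now)
{
	assert(s != NULL && name != NULL);
	s->name = name;
	s->started = now;
}

/* Note that the system call has returned. */
static void syscall_exit(struct syscall_state *s)
{
	assert(s != NULL);
	s->name = NULL;
}

/*
 * Return the name of the call that has been in flight for more than
 * 'limit' seconds, or NULL if nothing looks hung.
 */
static const char *hung_syscall(const struct syscall_state *s, time_t now,
				int limit)
{
	assert(s != NULL);
	if (s->name != NULL && (now - s->started) > limit)
		return s->name;
	return NULL;
}

/*
 * Build the text a node might send to its peers: it names the hung call,
 * and the fact that it is sent at all shows the daemon is still running.
 */
static int format_whine(char *buf, size_t len, int nodeid, const char *call)
{
	assert(buf != NULL && call != NULL);
	return snprintf(buf, len, "node %d: qdiskd alive, %s() hung",
			nodeid, call);
}
```

A monitoring thread would call hung_syscall periodically and, on a
non-NULL result, send the formatted message to the other members instead
of staying silent until it is evicted.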
Possible problems:
- The receive queue could become very full, causing messages to be
blocked on a node where I/O is hung. However, 1) that would take a very
long time, and 2) the node should be evicted at that point anyway.
Resolves: rhbz#782900
This version of the patch is a backport of:
e2937eb33f224f86904fead08499a6178868ca6a
34d2872fb7e60be1594158acaaeb8acd74f78d22
There is a minor change vs. the original patch, based on how qdiskd
in RHEL5 handles the cman connection. We add an extra call to cman_alive
in the main qdisk_loop to make sure data is not stalled on the
cman port and that data_callback to qdiskd_whine is executed.
Signed-off-by: Lon Hohberger <lhh@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>