a0d57d8 dlm_controld: ensure to stop kernel dlm if quit

Authored and Committed by aring 2 years ago
    dlm_controld: ensure to stop kernel dlm if quit
    
    This patch ensures that kernel lockspace activity is stopped before
    dlm_controld exits, e.g. because the connection to corosync was lost.
    This can happen if corosync is killed or shut down.
    
    If the lockspace is not stopped first, a kernel warning can be
    triggered.
    
    The call trace of the warning was:
    [14003.162881] Call Trace:
    [14003.162883]  [<000003ff80796d70>] dlm_midcomms_get_mhandle+0x170/0x1f0 [dlm]
    [14003.162892] ([<000003ff80796d6c>] dlm_midcomms_get_mhandle+0x16c/0x1f0 [dlm])
    [14003.162901]  [<000003ff80787366>] create_message+0x56/0x100 [dlm]
    [14003.162909]  [<000003ff8078849c>] send_common+0x7c/0x130 [dlm]
    [14003.162928]  [<000003ff8078b50c>] _convert_lock+0x3c/0x140 [dlm]
    [14003.162936]  [<000003ff8078b698>] convert_lock+0x88/0xd0 [dlm]
    [14003.162944]  [<000003ff80790008>] dlm_lock+0x158/0x1b0 [dlm]
    [14003.162952]  [<000003ff807ff4c6>] gdlm_lock+0x1f6/0x2f0 [gfs2]
    [14003.162997]  [<000003ff807d96c8>] do_xmote+0x1f8/0x440 [gfs2]
    [14003.163008]  [<000003ff807d9d88>] gfs2_glock_nq+0x88/0x130 [gfs2]
    [14003.163020]  [<000003ff807fac92>] gfs2_statfs_sync+0x52/0x180 [gfs2]
    [14003.163031]  [<000003ff807f2b70>] gfs2_quotad+0xc0/0x360 [gfs2]
    [14003.163043]  [<0000000050527cfc>] kthread+0x17c/0x190
    [14003.163061]  [<00000000504af5dc>] __ret_from_fork+0x3c/0x60
    [14003.163064]  [<0000000050d6df4a>] ret_from_fork+0xa/0x30
    
    This indicates that there was still lock activity, and that the dlm
    fence action "closing connection", which is triggered by a configfs
    removal, was not synchronized with lock activity and recovery.
    
    The dlm_controld log contained:
    
    Feb 24 12:12:40 4008 cpg_dispatch error 2
    
    which probably indicates that the corosync daemon left.
    
    Instrumenting the dlm kernel code shows that when
    
    killall corosync
    
    is executed, the "ls->ls_in_recovery" write lock is not held. I
    instrumented the write lock to print "RECOVERY LOCK" when the per-ls
    "ls_in_recovery" write lock is taken and "RECOVERY UNLOCK" when it is
    released. Holding this lock is important because the "closing
    connection" step, i.e. the dlm kernel fence action, requires that no
    lockspace lock activity is going on anymore.
    
    Instrumented printout when corosync gets killed:
    
    [   28.863103] RECOVERY UNLOCK 1
    [   28.868559] dlm: test: dlm_recover 1 generation 11 done: 99 ms
    [   46.776997] dlm: connection 000000004b240e16 got EOF from 1
    [   46.779023] dlm: connection 000000003833c546 got EOF from 1
    [   46.781163] dlm: connection 00000000a48c3263 got EOF from 3
    [   46.782559] dlm: connection 0000000009964aad got EOF from 3
    [   48.657932] dlm: closing connection to node 3
    [   48.660090] dlm: closing connection to node 2
    [   48.661558] dlm: closing connection to node 1
    [   48.691884] dlm: test: no userland control daemon, stopping lockspace
    [   48.693888] RECOVERY LOCK 2
    [   48.695633] dlm: dlm user daemon left 1 lockspaces
    
    To fix this issue, we ensure that dlm lockspace activity is stopped
    before the configfs entries are removed when the dlm_controld main loop
    exits. In the output above it is done afterwards, which is too late.
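
    The intended ordering can be sketched as a small model. This is not the
    patch itself: every name below is hypothetical, and the two helpers only
    record that they ran, standing in for stopping the kernel lockspaces and
    for removing the configfs entries.

```c
#include <stddef.h>

/* Hypothetical step identifiers, used only to record the ordering. */
enum shutdown_step {
    STEP_STOP_LOCKSPACES = 1,
    STEP_REMOVE_CONFIGFS = 2
};

#define MAX_STEPS 8
static enum shutdown_step step_log[MAX_STEPS];
static size_t step_count;

static void record_step(enum shutdown_step step)
{
    if (step_count < MAX_STEPS)
        step_log[step_count++] = step;
}

/* Stand-in: ask the kernel to stop lock activity in every lockspace. */
static void stop_kernel_lockspaces(void)
{
    record_step(STEP_STOP_LOCKSPACES);
}

/* Stand-in: remove configfs entries, which triggers "closing connection". */
static void remove_configfs_entries(void)
{
    record_step(STEP_REMOVE_CONFIGFS);
}

/* Assumed to run once the daemon main loop has exited. */
void daemon_shutdown(void)
{
    step_count = 0;
    stop_kernel_lockspaces();   /* first: no more lock activity */
    remove_configfs_entries();  /* then: safe to close connections */
}
```

    Swapping the two calls in daemon_shutdown() would model the old
    behaviour, where connections are closed while requests may still run.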
    
    After this patch, the ls_in_recovery lock is held before the configfs
    entries are removed (closing connection):
    
    [   36.412544] RECOVERY UNLOCK 1
    [   36.418378] dlm: test: dlm_recover 1 generation 15 done: 233 ms
    [   70.616509] RECOVERY LOCK 2
    [   70.666016] dlm: connection 00000000df3f9abb got EOF from 1
    [   70.671155] dlm: connection 00000000e69b1ae0 got EOF from 3
    [   70.675919] dlm: connection 00000000d18e6d72 got EOF from 3
    [   70.730863] dlm: closing connection to node 3
    [   70.732917] dlm: closing connection to node 2
    [   70.734949] dlm: closing connection to node 1
    [   70.843747] dlm: dlm user daemon left 1 lockspaces
    
    Reported-by: Nate Straz <nstraz@redhat.com>
    
        
file modified
+8 -2
file modified
+2 -0
file modified
+1 -0