dlm_controld: ensure to stop kernel dlm if quit
This patch will ensure that we stop kernel lockspace activity before
dlm_controld exits itself because e.g. the connection to corosync got
lost. This can be if corosync got killed or shutdown.
If we don't ensure the lockspace is stopped it can be that a kernel
warning got triggered:
Call trace of warning was:
[14003.162881] Call Trace:
[14003.162883] [<000003ff80796d70>] dlm_midcomms_get_mhandle+0x170/0x1f0 [dlm]
[14003.162892] ([<000003ff80796d6c>] dlm_midcomms_get_mhandle+0x16c/0x1f0 [dlm])
[14003.162901] [<000003ff80787366>] create_message+0x56/0x100 [dlm]
[14003.162909] [<000003ff8078849c>] send_common+0x7c/0x130 [dlm]
[14003.162928] [<000003ff8078b50c>] _convert_lock+0x3c/0x140 [dlm]
[14003.162936] [<000003ff8078b698>] convert_lock+0x88/0xd0 [dlm]
[14003.162944] [<000003ff80790008>] dlm_lock+0x158/0x1b0 [dlm]
[14003.162952] [<000003ff807ff4c6>] gdlm_lock+0x1f6/0x2f0 [gfs2]
[14003.162997] [<000003ff807d96c8>] do_xmote+0x1f8/0x440 [gfs2]
[14003.163008] [<000003ff807d9d88>] gfs2_glock_nq+0x88/0x130 [gfs2]
[14003.163020] [<000003ff807fac92>] gfs2_statfs_sync+0x52/0x180 [gfs2]
[14003.163031] [<000003ff807f2b70>] gfs2_quotad+0xc0/0x360 [gfs2]
[14003.163043] [<0000000050527cfc>] kthread+0x17c/0x190
[14003.163061] [<00000000504af5dc>] __ret_from_fork+0x3c/0x60
[14003.163064] [<0000000050d6df4a>] ret_from_fork+0xa/0x30
Which indicates that there was still lock activity and a dlm fence action
"closing connection" which gets triggered by a configfs removal was not
synchronized between lock activity and recovery.
On dlm_controld log side there was a:
Feb 24 12:12:40 4008 cpg_dispatch error 2
which probably indicates that the corosync daemon left.
Instrumenting the dlm kernel handling indicates when a:
killall corosync
is executed the "ls->ls_in_recovery" write lock is not held. I did a
write lock instrumentation by printout "RECOVERY LOCK" and
"RECOVERY UNLOCK", when the per ls "ls_in_recovery" write lock is being
held. This lock is important to held, because the "closing connection"
aka dlm kernel fence action requires to have no lockspace lock activity
anymore going on.
Instrumented printout when corosync gets killed:
[ 28.863103] RECOVERY UNLOCK 1
[ 28.868559] dlm: test: dlm_recover 1 generation 11 done: 99 ms
[ 46.776997] dlm: connection 000000004b240e16 got EOF from 1
[ 46.779023] dlm: connection 000000003833c546 got EOF from 1
[ 46.781163] dlm: connection 00000000a48c3263 got EOF from 3
[ 46.782559] dlm: connection 0000000009964aad got EOF from 3
[ 48.657932] dlm: closing connection to node 3
[ 48.660090] dlm: closing connection to node 2
[ 48.661558] dlm: closing connection to node 1
[ 48.691884] dlm: test: no userland control daemon, stopping lockspace
[ 48.693888] RECOVERY LOCK 2
[ 48.695633] dlm: dlm user daemon left 1 lockspaces
To fix this issue we ensure that the dlm lockspace activity gets stopped
before removing configfs entries if dlm_controld main loop exits. On the
above handling you can see that it is done afterwards which is too late.
After this patch the ls_in_recovery is held before removal of configfs
entries (closing connection):
[ 36.412544] RECOVERY UNLOCK 1
[ 36.418378] dlm: test: dlm_recover 1 generation 15 done: 233 ms
[ 70.616509] RECOVERY LOCK 2
[ 70.666016] dlm: connection 00000000df3f9abb got EOF from 1
[ 70.671155] dlm: connection 00000000e69b1ae0 got EOF from 3
[ 70.675919] dlm: connection 00000000d18e6d72 got EOF from 3
[ 70.730863] dlm: closing connection to node 3
[ 70.732917] dlm: closing connection to node 2
[ 70.734949] dlm: closing connection to node 1
[ 70.843747] dlm: dlm user daemon left 1 lockspaces
Reported-by: Nate Straz <nstraz@redhat.com>