This should fix a couple of potential problems related to an overlap
between dlm_recoverd in ls_nodes_reconfig (changing the ls_nodes list)
for recovery event N and dlm_recvd in dlm_dir_rebuild_send (reading the
ls_nodes list) for recovery event N-1.

We now wait, in dlm_ls_stop(), for dlm_recoverd to detect and abort
recovery event N-1. This ensures that when all nodes receive
dlm_ls_start() and begin recovery event N, no other nodes are still
working on recovery event N-1 in any way.

There's still the chance that a stray/delayed message (in particular a
RECOVERNAMES request) pertaining to event N-1 will be delivered while all
nodes are working on event N. There's an added check that should
prevent this from causing any trouble for all practical purposes.
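
As a rough illustration of how a receiver can reject a message left over
from an earlier recovery event, here is a minimal sketch. It is not the
check added by this patch. The struct names, fields, and the idea of
comparing a per-event sequence number are all assumptions made for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the real DLM structures. */
struct ls_model {
	uint64_t recover_seq;	/* recovery event this node is working on */
};

struct rcom_model {
	uint64_t seq;		/* recovery event the sender was working on */
	int type;		/* message type, e.g. a names request */
};

enum { RCOM_MODEL_NAMES = 1 };	/* hypothetical type value */

/*
 * Return true only if the message was sent for the recovery event the
 * receiver is currently processing.  A message tagged with any other
 * event (such as N-1 while the receiver is on N) should be dropped.
 */
bool rcom_model_is_current(const struct ls_model *ls,
			   const struct rcom_model *rc)
{
	return rc->seq == ls->recover_seq;
}
```

In this model, a names request tagged with event 4 arriving at a node
that is working on event 5 would be ignored rather than answered from a
node list that has since changed.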

One other possible problem is dlm_ls_stop clearing the status bits right
before ls_nodes_reconfig sets them. Clearing the bits only after
dlm_recoverd is suspended makes this safe.

I don't know which, if any, of these potential problems have actually
been observed; none of them are very likely. But I think there's a
fair chance that the first problem matches bz 145831.