monitor: Service restart fixes
There are actually two bugs here:
1) When either the kill(SIGTERM) or kill(SIGKILL) commands returned
failure (for any reason), we would talloc_free(svc) which removed it
from being eligible for restart, resulting in the service never
starting again without an SSSD service restart.
2) There is a fairly wide race condition where it's possible for a
SIGKILL timer to "catch up" to the child exit handler between us
noticing the termination and actually restarting it. The race
happens because we re-enter the mainloop and add a restart
timeout to avoid a quick failure if we keep restarting due to a
transitory issue (the mt_svc object, and therefore the SIGKILL
timer, were never freed until we got to the actual service
restart).
We can minimize this race by recording the timer_event for the
SIGKILL timeout in the mt_svc object. This way, if the process
exits via SIGTERM, we will immediately remove the timer for the
SIGKILL. Additionally, we'll catch the special-case of an ESRCH
response from the kill(SIGKILL) and assume that it means that the
process has exited. The only other two possible errors are
* EINVAL: (an invalid signal was specified) - This should be
impossible, obviously.
* EPERM: This process doesn't have permission to send signals to
this PID. If this happens, it's either an SELinux bug or
else the process has terminated and a new process that
SSSD doesn't control has taken the ID over.
So in the incredibly unlikely case that one of those occurs, we'll
just go ahead and try to start a new process.
This patch also removes the incorrect talloc_free(svc) calls on the
kill() failures and replaces them with an attempt to just start up
the service again and hope for the best.
Resolves:
https://fedorahosted.org/sssd/ticket/2525
Reviewed-by: Pavel Březina <pbrezina@redhat.com>
(cherry picked from commit 152251b13a99c88054055d46600e0478c4f7bd05)