#2525 Monitor SIGKILL timer issue and service restart failure
Closed: Invalid None Opened 5 years ago by kieren.

Per IRC conv with sgallagh, sssd (1.9.2) failed to SIGKILL sssd_pam which subsequently prevents the service being restarted.

Log extract:

(Wed Dec 10 15:16:52 2014) [sssd] [mt_svc_sigkill] (0x0010): [mydomain][933] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): [pam][935] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): Sending signal to child (pam:935) failed! Ignore and pretend child is dead.

IRC log:

18:32 < kieren> if sssd kills a process (like sssd_pam), will it try at some point to restart it itself?
19:01 < sgallagh> kieren: Yes, if SSSD detects the death of (or kills) one of its subprocesses, it *should*
                  immediately relaunch it
19:02 < kieren> sgallagh: great - do you know if that appeared in a particular version?
19:02 < kieren> i have rhel6.4 / sssd 1.9.2 and it didn't seem to respawn it
19:02 < sgallagh> kieren: It was supposed to work that way from the very beginning
19:03 < sgallagh> Actually, it will try three times to restart it, then give up
19:03 < kieren> after "[pam][935] is not responding to SIGTERM. Sending SIGKILL." i got the error "Sending
                signal to child (pam:935) failed! Ignore and pretend child is dead."
19:03 < sgallagh> Wait, what?
19:04 < kieren> then nothing else in the sssd.log
19:04 < sgallagh> That... shouldn't be possibel
19:04 < sgallagh> *possible
19:06 < kieren> which bit shouldn't be possible - the 'ignore and pretend child is dead' bit?
19:07 < sgallagh> "Sending signal to child failed!"
19:13 < sgallagh> OK, unfortunately, we're not printing the reason that kill() fails here.
19:15 < sgallagh> That talloc_free() is likely incorrect.
19:21 < sgallagh> kieren: Please file a bug on this. I'll have a patch ready shortly
19:21 < sgallagh> But there's actually two bugs here.
19:22 < sgallagh> 1) When the child exit handler files, it doesn't remove the SIGKILL timer
19:22 < sgallagh> 2) The SIGKILL timer talloc_free()s the service, so it doesn't restart.
19:23 < sgallagh> Interestingly, I think it will only have an effect the *second* time the sssd_pam crashes.
19:23 < sgallagh> Because unless there's a race, the child will be restarted before the SIGKILL tries to hit
                  the old PID and then delete the svc object
19:28 < sgallagh> Actually, there's a third bug here too.. an access-after-free() if the kill(SIGTERM) fails...
19:30 < sgallagh> Looks like some pieces of it are fixed in master, but not all

This ticket gave me a good laugh!

Patch submitted: https://lists.fedorahosted.org/pipermail/sssd-devel/2014-December/022793.html

The actual reasons turned out to be a little more complex and esoteric. It was a combination of two small bugs, a race condition and an improper talloc_free().

The short version is that there's a race where, if the SIGTERM takes a while to process through, it leaves open a several-second gap where the SIGKILL timer could fire, fail (because the process already exited) and then talloc_free() the service, preventing it from being started again. Ugly and next to impossible to reproduce reliably. I think the patch will fix it, though.

owner: somebody => sgallagh
patch: 0 => 1
status: new => assigned

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.12.4

resolution: => fixed
status: assigned => closed

Fields changed

rhbz: => 0

I'm afraid that the bug was not fixed completely:

https://bugzilla.redhat.com/show_bug.cgi?id=1276781

resolution: fixed =>
sensitive: => 0
status: closed => reopened

Reopened bugs belong to triage.

milestone: SSSD 1.12.4 => NEEDS_TRIAGE

We should take a look at the code again but we don't have a reproducer.

milestone: NEEDS_TRIAGE => SSSD 1.13.3
priority: major => minor

This ticket still needs work and we need to release 1.13.3 soon.

milestone: SSSD 1.13.3 => SSSD 1.13.4
owner: sgallagh => somebody
status: reopened => new

This will be (hopefully) mitigated by some changes being worked on

Simo rewrote the watchdog to be in-process
the cache writes should be less frequent in 1.14 as well
Pavel is changing the requests talloc hierarchy

Because of the two above and because we don't have a way to reproduce this problem, I'm marking this bug as minor and moving to a release further away. I would prefer to see if we still have issues after 1.14 changes.

milestone: SSSD 1.13.4 => SSSD 1.13.5

Fields changed

milestone: SSSD 1.13.5 => SSSD 1.15 beta

The watchdog and the DP rewrite make this ticket obsolete in my opinion.

review: 0 => 1
selected: => Not need

Bugs like these shouldn't happen with the new talloc hierarchy of the requests. Please reopen if you can reproduce the issue with 1.14 or newer.

resolution: => worksforme
status: new => closed

Metadata Update from @kieren:
- Issue set to the milestone: SSSD Future releases (no date set yet)

3 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/3567

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Login to comment on this ticket.

Metadata