#2525 Monitor SIGKILL timer issue and service restart failure
Closed: Invalid None Opened 4 years ago by kieren.

Per an IRC conversation with sgallagh: sssd (1.9.2) failed to SIGKILL sssd_pam, which subsequently prevented the service from being restarted.

Log extract:

(Wed Dec 10 15:16:52 2014) [sssd] [mt_svc_sigkill] (0x0010): [mydomain][933] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): [pam][935] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): Sending signal to child (pam:935) failed! Ignore and pretend child is dead.

IRC log:

18:32 < kieren> if sssd kills a process (like sssd_pam), will it try at some point to restart it itself?
19:01 < sgallagh> kieren: Yes, if SSSD detects the death of (or kills) one of its subprocesses, it *should*
                  immediately relaunch it
19:02 < kieren> sgallagh: great - do you know if that appeared in a particular version?
19:02 < kieren> i have rhel6.4 / sssd 1.9.2 and it didn't seem to respawn it
19:02 < sgallagh> kieren: It was supposed to work that way from the very beginning
19:03 < sgallagh> Actually, it will try three times to restart it, then give up
19:03 < kieren> after "[pam][935] is not responding to SIGTERM. Sending SIGKILL." i got the error "Sending
                signal to child (pam:935) failed! Ignore and pretend child is dead."
19:03 < sgallagh> Wait, what?
19:04 < kieren> then nothing else in the sssd.log
19:04 < sgallagh> That... shouldn't be possibel
19:04 < sgallagh> *possible
19:06 < kieren> which bit shouldn't be possible - the 'ignore and pretend child is dead' bit?
19:07 < sgallagh> "Sending signal to child failed!"
19:13 < sgallagh> OK, unfortunately, we're not printing the reason that kill() fails here.
19:15 < sgallagh> That talloc_free() is likely incorrect.
19:21 < sgallagh> kieren: Please file a bug on this. I'll have a patch ready shortly
19:21 < sgallagh> But there's actually two bugs here.
19:22 < sgallagh> 1) When the child exit handler fires, it doesn't remove the SIGKILL timer
19:22 < sgallagh> 2) The SIGKILL timer talloc_free()s the service, so it doesn't restart.
19:23 < sgallagh> Interestingly, I think it will only have an effect the *second* time the sssd_pam crashes.
19:23 < sgallagh> Because unless there's a race, the child will be restarted before the SIGKILL tries to hit
                  the old PID and then delete the svc object
19:28 < sgallagh> Actually, there's a third bug here too.. an access-after-free() if the kill(SIGTERM) fails...
19:30 < sgallagh> Looks like some pieces of it are fixed in master, but not all
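As noted at 19:13 in the log, the reason kill() failed was never printed. The following is a minimal sketch of how that reason could be captured, assuming plain POSIX calls only. The helper names `send_signal_verbose` and `spawn_and_reap_child` are hypothetical and are not taken from the SSSD sources. For a child that has already exited and been reaped, kill() is expected to fail with ESRCH, which is consistent with the "process already exited" explanation given later in this ticket.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not SSSD code): send a signal to one process and
 * report why it failed. Returns 0 on success, otherwise the errno value
 * that kill() set. */
static int send_signal_verbose(pid_t pid, int sig)
{
    /* pid <= 0 would address process groups, which is never wanted here. */
    assert(pid > 0);

    if (kill(pid, sig) == 0) {
        return 0;
    }

    int err = errno; /* save before any other call can overwrite it */
    fprintf(stderr, "kill(%d, %d) failed: [%d] %s\n",
            (int)pid, sig, err, strerror(err));
    return err;
}

/* Hypothetical helper: fork a child that exits at once, reap it, and
 * return its now-stale PID (or -1 if fork() failed). */
static pid_t spawn_and_reap_child(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        _exit(0); /* child: exit immediately */
    }
    if (pid > 0) {
        waitpid(pid, NULL, 0); /* parent: reap, so the PID no longer exists */
    }
    return pid;
}
```

Calling `send_signal_verbose(stale_pid, SIGKILL)` on a PID returned by `spawn_and_reap_child()` should log the ESRCH message and return the error code instead of leaving the caller to guess.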

This ticket gave me a good laugh!

Patch submitted: https://lists.fedorahosted.org/pipermail/sssd-devel/2014-December/022793.html

The actual reasons turned out to be a little more complex and esoteric. It was a combination of two small bugs, a race condition and an improper talloc_free().

The short version is that there's a race where, if the SIGTERM takes a while to process through, it leaves open a several-second gap where the SIGKILL timer could fire, fail (because the process already exited) and then talloc_free() the service, preventing it from being started again. Ugly and next to impossible to reproduce reliably. I think the patch will fix it, though.
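The sequence described above can be illustrated with a toy state model. This is only a sketch under stated assumptions: it does not use talloc or tevent, it is not the submitted patch, and every name in it (`svc_model`, `model_send_sigterm`, and so on) is hypothetical. The `freed` flag stands in for talloc_free() of the service object, and the two boolean parameters switch between the behaviour reported in this ticket and one plausible corrected behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the monitor's per-service state. */
struct svc_model {
    int pid;               /* PID of the current child, 0 if none */
    bool kill_timer_armed; /* true while a SIGKILL escalation is pending */
    bool freed;            /* models talloc_free() of the service object */
    int restarts;          /* number of times the child was relaunched */
};

/* SIGTERM was sent to the child: arm the SIGKILL escalation timer. */
static void model_send_sigterm(struct svc_model *svc)
{
    assert(svc != NULL && !svc->freed);
    svc->kill_timer_armed = true;
}

/* Child exit handler. cancel_timer == false models the reported bug,
 * where the pending SIGKILL timer is left armed. */
static void model_child_exited(struct svc_model *svc, bool cancel_timer)
{
    assert(svc != NULL && !svc->freed);
    if (cancel_timer) {
        svc->kill_timer_armed = false;
    }
    svc->pid = 0;
}

/* SIGKILL timer callback. kill_errno is the result a kill() on the old
 * PID would have produced; free_on_failure == true models the improper
 * free of the service when that kill() fails. */
static void model_sigkill_timer(struct svc_model *svc, int kill_errno,
                                bool free_on_failure)
{
    assert(svc != NULL);
    if (!svc->kill_timer_armed) {
        return; /* timer was cancelled, nothing to do */
    }
    svc->kill_timer_armed = false;
    if (kill_errno != 0 && free_on_failure) {
        svc->freed = true;
    }
}

/* Relaunch the child, unless the service object no longer exists. */
static void model_restart(struct svc_model *svc, int new_pid)
{
    assert(svc != NULL);
    if (svc->freed) {
        return; /* nothing left to restart */
    }
    svc->pid = new_pid;
    svc->restarts++;
}
```

Running the steps SIGTERM, child exit, timer, restart with both flags in their "buggy" setting leaves the model freed and never restarted. Cancelling the timer in the exit handler and not freeing on a failed kill() lets the restart go through.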

owner: somebody => sgallagh
patch: 0 => 1
status: new => assigned

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.12.4

resolution: => fixed
status: assigned => closed

Fields changed

rhbz: => 0

I'm afraid that the bug was not fixed completely:


resolution: fixed =>
sensitive: => 0
status: closed => reopened

Reopened bugs belong to triage.

milestone: SSSD 1.12.4 => NEEDS_TRIAGE

We should take a look at the code again but we don't have a reproducer.

milestone: NEEDS_TRIAGE => SSSD 1.13.3
priority: major => minor

This ticket still needs work and we need to release 1.13.3 soon.

milestone: SSSD 1.13.3 => SSSD 1.13.4
owner: sgallagh => somebody
status: reopened => new

This will (hopefully) be mitigated by some changes being worked on:

- Simo rewrote the watchdog to be in-process
- The cache writes should be less frequent in 1.14 as well
- Pavel is changing the requests talloc hierarchy

Because of the two above and because we don't have a way to reproduce this problem, I'm marking this bug as minor and moving to a release further away. I would prefer to see if we still have issues after 1.14 changes.

milestone: SSSD 1.13.4 => SSSD 1.13.5

Fields changed

milestone: SSSD 1.13.5 => SSSD 1.15 beta

The watchdog and the DP rewrite make this ticket obsolete in my opinion.

review: 0 => 1
selected: => Not need

Bugs like these shouldn't happen with the new talloc hierarchy of the requests. Please reopen if you can reproduce the issue with 1.14 or newer.

resolution: => worksforme
status: new => closed

Metadata Update from @kieren:
- Issue set to the milestone: SSSD Future releases (no date set yet)

2 years ago
