Per an IRC conversation with sgallagh: sssd (1.9.2) failed to SIGKILL sssd_pam, which subsequently prevented the service from being restarted.
(Wed Dec 10 15:16:52 2014) [sssd] [mt_svc_sigkill] (0x0010): [mydomain] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): [pam] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): Sending signal to child (pam:935) failed! Ignore and pretend child is dead.
18:32 < kieren> if sssd kills a process (like sssd_pam), will it try at some point to restart it itself?
19:01 < sgallagh> kieren: Yes, if SSSD detects the death of (or kills) one of its subprocesses, it *should* immediately relaunch it
19:02 < kieren> sgallagh: great - do you know if that appeared in a particular version?
19:02 < kieren> i have rhel6.4 / sssd 1.9.2 and it didn't seem to respawn it
19:02 < sgallagh> kieren: It was supposed to work that way from the very beginning
19:03 < sgallagh> Actually, it will try three times to restart it, then give up
19:03 < kieren> after "[pam] is not responding to SIGTERM. Sending SIGKILL." i got the error "Sending signal to child (pam:935) failed! Ignore and pretend child is dead."
19:03 < sgallagh> Wait, what?
19:04 < kieren> then nothing else in the sssd.log
19:04 < sgallagh> That... shouldn't be possibel
19:04 < sgallagh> *possible
19:06 < kieren> which bit shouldn't be possible - the 'ignore and pretend child is dead' bit?
19:07 < sgallagh> "Sending signal to child failed!"
19:13 < sgallagh> OK, unfortunately, we're not printing the reason that kill() fails here.
19:15 < sgallagh> That talloc_free() is likely incorrect.
19:21 < sgallagh> kieren: Please file a bug on this. I'll have a patch ready shortly
19:21 < sgallagh> But there's actually two bugs here.
19:22 < sgallagh> 1) When the child exit handler fires, it doesn't remove the SIGKILL timer
19:22 < sgallagh> 2) The SIGKILL timer talloc_free()s the service, so it doesn't restart.
19:23 < sgallagh> Interestingly, I think it will only have an effect the *second* time the sssd_pam crashes.
19:23 < sgallagh> Because unless there's a race, the child will be restarted before the SIGKILL tries to hit the old PID and then delete the svc object
19:28 < sgallagh> Actually, there's a third bug here too.. an access-after-free() if the kill(SIGTERM) fails...
19:30 < sgallagh> Looks like some pieces of it are fixed in master, but not all
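The missing diagnostic mentioned in the log (the reason kill() fails is not printed) could be captured roughly as follows. This is a minimal standalone sketch, not SSSD code: kill_and_report() and probe_reaped_child() are hypothetical names, and the message goes to stderr instead of SSSD's debug log. It illustrates that signalling a child which has already exited and been reaped fails with ESRCH.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: send a signal and report why it failed.
 * Returns 0 on success, otherwise the errno value set by kill(). */
static int kill_and_report(pid_t pid, int sig)
{
    /* This sketch only targets a single process; pid <= 0 would
     * address process groups or every process we may signal. */
    assert(pid > 0);

    if (kill(pid, sig) == 0) {
        return 0;
    }

    int err = errno; /* save before another call can overwrite it */
    fprintf(stderr, "Sending signal %d to child (%ld) failed: [%d] %s\n",
            sig, (long)pid, err, strerror(err));
    return err;
}

/* Hypothetical demo: fork a child that exits at once, reap it, then
 * probe the stale PID. Signal 0 performs only the existence and
 * permission checks, so nothing is delivered to any process.
 * Returns the result of the probe, or -1 if the setup itself failed. */
static int probe_reaped_child(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        _exit(0); /* child: exit immediately */
    }
    if (waitpid(pid, NULL, 0) != pid) {
        return -1;
    }
    return kill_and_report(pid, 0);
}
```

Calling probe_reaped_child() is expected to return ESRCH, the same kind of failure the SIGKILL timer would see when its target has already gone away.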
This ticket gave me a good laugh!
Patch submitted: https://lists.fedorahosted.org/pipermail/sssd-devel/2014-December/022793.html
The actual reasons turned out to be a little more complex and esoteric. It was a combination of two small bugs, a race condition and an improper talloc_free().
The short version is that there's a race where, if the SIGTERM takes a while to process through, it leaves open a several-second gap where the SIGKILL timer could fire, fail (because the process already exited) and then talloc_free() the service, preventing it from being started again. Ugly and next to impossible to reproduce reliably. I think the patch will fix it, though.
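The ordering described above can be walked through with a toy state model. This is only a sketch under stated assumptions, not the submitted patch and not SSSD's data structures: struct svc_model, the model_* functions and run_race() are all hypothetical, and the `fixed` flag shows just one plausible remedy (the exit handler cancels the timer, and a failed kill no longer frees the service).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one monitored service; not the SSSD structure. */
struct svc_model {
    int child_pid;  /* PID of the running child, 0 if none */
    int timer_pid;  /* PID targeted by the pending SIGKILL timer, 0 if none */
    int next_pid;   /* next PID to hand out on (re)start */
    bool freed;     /* stands in for talloc_free() of the service */
};

/* Launch (or relaunch) the child with a fresh PID. */
static void model_start(struct svc_model *s)
{
    assert(s != NULL && !s->freed);
    s->child_pid = s->next_pid++;
}

/* The monitor sends SIGTERM and arms a SIGKILL timer for the same PID. */
static void model_send_sigterm(struct svc_model *s)
{
    s->timer_pid = s->child_pid;
}

/* The child exits; the monitor has not reacted yet. */
static void model_child_exits(struct svc_model *s)
{
    s->child_pid = 0;
}

/* The SIGKILL timer fires. The kill fails if its target is gone. */
static void model_timer_fires(struct svc_model *s, bool fixed)
{
    if (s->timer_pid == 0) {
        return; /* timer was cancelled */
    }
    bool kill_failed = (s->timer_pid != s->child_pid);
    s->timer_pid = 0;
    if (!kill_failed) {
        s->child_pid = 0; /* the child was killed */
        return;
    }
    if (!fixed) {
        s->freed = true; /* buggy path: the service object is discarded */
    }
}

/* The exit handler runs: relaunch unless the service no longer exists. */
static void model_exit_handler(struct svc_model *s, bool fixed)
{
    if (s->freed) {
        return; /* nothing left to restart */
    }
    if (fixed) {
        s->timer_pid = 0; /* cancel the stale SIGKILL timer */
    }
    model_start(s);
}

/* Replays the race; returns true if the service is running at the end. */
static bool run_race(bool fixed)
{
    struct svc_model s = { .child_pid = 0, .timer_pid = 0,
                           .next_pid = 100, .freed = false };
    model_start(&s);
    model_send_sigterm(&s);
    model_child_exits(&s);          /* SIGTERM finally takes effect */
    model_timer_fires(&s, fixed);   /* timer wins the race */
    model_exit_handler(&s, fixed);
    return !s.freed && s.child_pid != 0;
}
```

In this model run_race(false) leaves the service freed and never restarted, while run_race(true) ends with a relaunched child.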
owner: somebody => sgallagh
patch: 0 => 1
status: new => assigned
milestone: NEEDS_TRIAGE => SSSD 1.12.4
resolution: => fixed
status: assigned => closed
rhbz: => 0
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1267761 (Red Hat Enterprise Linux 6)
rhbz: 0 => [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761]
I'm afraid that the bug was not fixed completely:
resolution: fixed =>
sensitive: => 0
status: closed => reopened
Reopened bugs belong to triage.
milestone: SSSD 1.12.4 => NEEDS_TRIAGE
We should take a look at the code again but we don't have a reproducer.
milestone: NEEDS_TRIAGE => SSSD 1.13.3
priority: major => minor
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1276781 (Red Hat Enterprise Linux 6)
rhbz: [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761] => [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761], [https://bugzilla.redhat.com/show_bug.cgi?id=1276781 1276781]
This ticket still needs work and we need to release 1.13.3 soon.
milestone: SSSD 1.13.3 => SSSD 1.13.4
owner: sgallagh => somebody
status: reopened => new
This will (hopefully) be mitigated by some changes being worked on:
- Simo rewrote the watchdog to be in-process
- the cache writes should be less frequent in 1.14 as well
- Pavel is changing the requests talloc hierarchy
Because of the two above and because we don't have a way to reproduce this problem, I'm marking this bug as minor and moving to a release further away. I would prefer to see if we still have issues after 1.14 changes.
milestone: SSSD 1.13.4 => SSSD 1.13.5
milestone: SSSD 1.13.5 => SSSD 1.15 beta
The watchdog and the DP rewrite make this ticket obsolete in my opinion.
review: 0 => 1
selected: => Not need
Bugs like these shouldn't happen with the new talloc hierarchy of the requests. Please reopen if you can reproduce the issue with 1.14 or newer.
resolution: => worksforme
status: new => closed
Metadata Update from @kieren:
- Issue set to the milestone: SSSD Future releases (no date set yet)