Per an IRC conversation with sgallagh: sssd (1.9.2) failed to SIGKILL sssd_pam, which subsequently prevented the service from being restarted.
(Wed Dec 10 15:16:52 2014) [sssd] [mt_svc_sigkill] (0x0010): [mydomain] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): [pam] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): Sending signal to child (pam:935) failed! Ignore and pretend child is dead.
18:32 < kieren> if sssd kills a process (like sssd_pam), will it try at some point to restart it itself?
19:01 < sgallagh> kieren: Yes, if SSSD detects the death of (or kills) one of its subprocesses, it *should*
immediately relaunch it
19:02 < kieren> sgallagh: great - do you know if that appeared in a particular version?
19:02 < kieren> i have rhel6.4 / sssd 1.9.2 and it didn't seem to respawn it
19:02 < sgallagh> kieren: It was supposed to work that way from the very beginning
19:03 < sgallagh> Actually, it will try three times to restart it, then give up
19:03 < kieren> after "[pam] is not responding to SIGTERM. Sending SIGKILL." i got the error "Sending
signal to child (pam:935) failed! Ignore and pretend child is dead."
19:03 < sgallagh> Wait, what?
19:04 < kieren> then nothing else in the sssd.log
19:04 < sgallagh> That... shouldn't be possibel
19:04 < sgallagh> *possible
19:06 < kieren> which bit shouldn't be possible - the 'ignore and pretend child is dead' bit?
19:07 < sgallagh> "Sending signal to child failed!"
19:13 < sgallagh> OK, unfortunately, we're not printing the reason that kill() fails here.
19:15 < sgallagh> That talloc_free() is likely incorrect.
19:21 < sgallagh> kieren: Please file a bug on this. I'll have a patch ready shortly
19:21 < sgallagh> But there's actually two bugs here.
19:22 < sgallagh> 1) When the child exit handler fires, it doesn't remove the SIGKILL timer
19:22 < sgallagh> 2) The SIGKILL timer talloc_free()s the service, so it doesn't restart.
19:23 < sgallagh> Interestingly, I think it will only have an effect the *second* time the sssd_pam crashes.
19:23 < sgallagh> Because unless there's a race, the child will be restarted before the SIGKILL tries to hit
the old PID and then delete the svc object
19:28 < sgallagh> Actually, there's a third bug here too.. an access-after-free() if the kill(SIGTERM) fails...
19:30 < sgallagh> Looks like some pieces of it are fixed in master, but not all
This ticket gave me a good laugh!
Patch submitted: https://lists.fedorahosted.org/pipermail/sssd-devel/2014-December/022793.html
The actual reasons turned out to be a little more complex and esoteric. It was a combination of two small bugs, a race condition and an improper talloc_free().
The short version is that there's a race where, if the SIGTERM takes a while to process through, it leaves open a several-second gap where the SIGKILL timer could fire, fail (because the process already exited) and then talloc_free() the service, preventing it from being started again. Ugly and next to impossible to reproduce reliably. I think the patch will fix it, though.
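For anyone trying to picture the race, here is a minimal sketch of it in C. It is not the submitted patch and does not use SSSD's real types: `struct svc_sketch`, `sigkill_timer_fired()` and `demo_race()` are hypothetical names made up for illustration. It assumes the child has already exited and been reaped by the time the SIGKILL timer fires, so `kill()` fails with `ESRCH`. The sketch logs the reason via `strerror()` and flags the service for restart instead of freeing it.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the monitor's per-service record. */
struct svc_sketch {
    const char *name;
    pid_t pid;
    bool needs_restart;
};

enum kill_outcome { KILL_SENT, KILL_ALREADY_GONE, KILL_FAILED };

/* What a SIGKILL timer callback could do: report why kill() failed
 * and keep the service record alive so it can be restarted. */
static enum kill_outcome sigkill_timer_fired(struct svc_sketch *svc)
{
    assert(svc != NULL && svc->pid > 0);

    if (kill(svc->pid, SIGKILL) == 0) {
        return KILL_SENT;
    }

    int err = errno; /* save before any other call can overwrite it */
    fprintf(stderr, "Sending SIGKILL to child (%s:%ld) failed: %s\n",
            svc->name, (long)svc->pid, strerror(err));

    if (err == ESRCH) {
        /* The process is already gone: do not free the record,
         * just mark it so the caller can relaunch the service. */
        svc->needs_restart = true;
        return KILL_ALREADY_GONE;
    }
    return KILL_FAILED;
}

/* Reproduces the ordering deterministically: the child exits and is
 * reaped first, and only then does the "timer" fire on the stale PID.
 * Assumes the PID is not reused in between, which is very unlikely. */
static enum kill_outcome demo_race(struct svc_sketch *svc)
{
    pid_t pid = fork();
    if (pid < 0) {
        return KILL_FAILED;
    }
    if (pid == 0) {
        _exit(0); /* child: exit immediately, as if SIGTERM had worked */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid) {
        return KILL_FAILED;
    }

    svc->pid = pid; /* stale PID, as seen by the late timer */
    return sigkill_timer_fired(svc);
}
```

Calling `demo_race()` on a record such as `{ "pam", 0, false }` should take the `ESRCH` branch and leave `needs_restart` set. Cancelling the timer from the child exit handler would avoid the stale `kill()` altogether; the sketch only shows how to fail safely if the timer fires anyway.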
owner: somebody => sgallagh
patch: 0 => 1
status: new => assigned
milestone: NEEDS_TRIAGE => SSSD 1.12.4
resolution: => fixed
status: assigned => closed
rhbz: => 0
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1267761 (Red Hat Enterprise Linux 6)
rhbz: 0 => [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761]
I'm afraid that the bug was not fixed completely:
resolution: fixed =>
sensitive: => 0
status: closed => reopened
Reopened bugs belong to triage.
milestone: SSSD 1.12.4 => NEEDS_TRIAGE
We should take a look at the code again but we don't have a reproducer.
milestone: NEEDS_TRIAGE => SSSD 1.13.3
priority: major => minor
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1276781 (Red Hat Enterprise Linux 6)
rhbz: [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761] => [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761], [https://bugzilla.redhat.com/show_bug.cgi?id=1276781 1276781]
This ticket still needs work and we need to release 1.13.3 soon.
milestone: SSSD 1.13.3 => SSSD 1.13.4
owner: sgallagh => somebody
status: reopened => new
This will (hopefully) be mitigated by some changes being worked on:
- Simo rewrote the watchdog to be in-process
- the cache writes should be less frequent in 1.14 as well
- Pavel is changing the requests talloc hierarchy
Because of the two above and because we don't have a way to reproduce this problem, I'm marking this bug as minor and moving to a release further away. I would prefer to see if we still have issues after 1.14 changes.
milestone: SSSD 1.13.4 => SSSD 1.13.5
milestone: SSSD 1.13.5 => SSSD 1.15 beta
The watchdog and the DP rewrite make this ticket obsolete in my opinion.
review: 0 => 1
selected: => Not need
Bugs like these shouldn't happen with the new talloc hierarchy of the requests. Please reopen if you can reproduce the issue with 1.14 or newer.
resolution: => worksforme
status: new => closed
Metadata Update from @kieren:
- Issue set to the milestone: SSSD Future releases (no date set yet)
SSSD is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in SSSD's GitHub repository.
This issue has been cloned to GitHub and is available here:
If you want to receive further updates on the issue, please navigate to the GitHub issue
and click the subscribe button.
Thank you for understanding. We apologize for any inconvenience.