Per an IRC conversation with sgallagh, sssd (1.9.2) failed to SIGKILL sssd_pam, which subsequently prevented the service from being restarted.
Log extract:
(Wed Dec 10 15:16:52 2014) [sssd] [mt_svc_sigkill] (0x0010): [mydomain][933] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): [pam][935] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): Sending signal to child (pam:935) failed! Ignore and pretend child is dead.
IRC log:
18:32 < kieren> if sssd kills a process (like sssd_pam), will it try at some point to restart it itself?
19:01 < sgallagh> kieren: Yes, if SSSD detects the death of (or kills) one of its subprocesses, it *should* immediately relaunch it
19:02 < kieren> sgallagh: great - do you know if that appeared in a particular version?
19:02 < kieren> i have rhel6.4 / sssd 1.9.2 and it didn't seem to respawn it
19:02 < sgallagh> kieren: It was supposed to work that way from the very beginning
19:03 < sgallagh> Actually, it will try three times to restart it, then give up
19:03 < kieren> after "[pam][935] is not responding to SIGTERM. Sending SIGKILL." i got the error "Sending signal to child (pam:935) failed! Ignore and pretend child is dead."
19:03 < sgallagh> Wait, what?
19:04 < kieren> then nothing else in the sssd.log
19:04 < sgallagh> That... shouldn't be possibel
19:04 < sgallagh> *possible
19:06 < kieren> which bit shouldn't be possible - the 'ignore and pretend child is dead' bit?
19:07 < sgallagh> "Sending signal to child failed!"
19:13 < sgallagh> OK, unfortunately, we're not printing the reason that kill() fails here.
19:15 < sgallagh> That talloc_free() is likely incorrect.
19:21 < sgallagh> kieren: Please file a bug on this. I'll have a patch ready shortly
19:21 < sgallagh> But there's actually two bugs here.
19:22 < sgallagh> 1) When the child exit handler files, it doesn't remove the SIGKILL timer
19:22 < sgallagh> 2) The SIGKILL timer talloc_free()s the service, so it doesn't restart.
19:23 < sgallagh> Interestingly, I think it will only have an effect the *second* time the sssd_pam crashes.
19:23 < sgallagh> Because unless there's a race, the child will be restarted before the SIGKILL tries to hit the old PID and then delete the svc object
19:28 < sgallagh> Actually, there's a third bug here too.. an access-after-free() if the kill(SIGTERM) fails...
19:30 < sgallagh> Looks like some pieces of it are fixed in master, but not all
This ticket gave me a good laugh!
Patch submitted: https://lists.fedorahosted.org/pipermail/sssd-devel/2014-December/022793.html
The actual reasons turned out to be a little more complex and esoteric. It was a combination of two small bugs: a race condition and an improper talloc_free().
The short version is that there's a race where, if the SIGTERM takes a while to process through, it leaves open a several-second gap where the SIGKILL timer could fire, fail (because the process already exited) and then talloc_free() the service, preventing it from being started again. Ugly and next to impossible to reproduce reliably. I think the patch will fix it, though.
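The race described above can be modelled in a few lines of C. This is only a sketch under stated assumptions, not the submitted patch: `struct svc_model`, `on_child_exit()`, `on_kill_timer()` and `spawn_and_reap()` are hypothetical names invented for illustration, and the sketch assumes that the failing kill() reported ESRCH (the log does not print the actual reason). It shows the two ideas from the discussion: the exit handler disarms the pending SIGKILL, and a SIGKILL that finds no process leaves the service record intact so it can still be restarted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the monitor's per-service record;
 * not the real SSSD structure. */
struct svc_model {
    pid_t pid;             /* PID the pending SIGKILL is aimed at */
    bool kill_timer_armed; /* true while a SIGKILL escalation is pending */
    bool restartable;      /* false once the record has been given up on */
};

/* Child-exit handler: disarm the pending SIGKILL so it cannot fire later
 * against a PID that no longer belongs to this service. */
void on_child_exit(struct svc_model *svc)
{
    svc->kill_timer_armed = false;
    svc->pid = 0;
}

/* SIGKILL timer callback. Returns 0 when the service may still be
 * restarted, or the errno from kill() for an unexpected failure. */
int on_kill_timer(struct svc_model *svc)
{
    int err;

    if (!svc->kill_timer_armed) {
        return 0; /* stale timer: the child already exited */
    }
    assert(svc->pid > 0); /* kill() with pid <= 0 targets process groups */
    svc->kill_timer_armed = false;

    if (kill(svc->pid, SIGKILL) == 0) {
        return 0;
    }
    err = errno;
    if (err == ESRCH) {
        return 0; /* process already gone: keep the record */
    }
    svc->restartable = false; /* e.g. EPERM: a genuine failure */
    return err;
}

/* Demo helper: fork a child that exits at once and reap it, so the
 * returned PID no longer names a live process. Returns -1 on error. */
pid_t spawn_and_reap(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) != pid) {
        return -1;
    }
    return pid;
}
```

Calling `on_kill_timer()` on a record whose PID came from `spawn_and_reap()` mimics the timer firing after the process has already exited: kill() fails, but the record stays restartable. Calling `on_child_exit()` first mimics the exit handler removing the timer, in which case no signal is sent at all.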
owner: somebody => sgallagh
patch: 0 => 1
status: new => assigned
Fields changed
milestone: NEEDS_TRIAGE => SSSD 1.12.4
resolution: => fixed
status: assigned => closed
rhbz: => 0
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1267761 (Red Hat Enterprise Linux 6)
rhbz: 0 => [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761]
I'm afraid that the bug was not fixed completely:
https://bugzilla.redhat.com/show_bug.cgi?id=1276781
resolution: fixed =>
sensitive: => 0
status: closed => reopened
Reopened bugs belong to triage.
milestone: SSSD 1.12.4 => NEEDS_TRIAGE
We should take a look at the code again but we don't have a reproducer.
milestone: NEEDS_TRIAGE => SSSD 1.13.3
priority: major => minor
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1276781 (Red Hat Enterprise Linux 6)
rhbz: [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761] => [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761], [https://bugzilla.redhat.com/show_bug.cgi?id=1276781 1276781]
This ticket still needs work and we need to release 1.13.3 soon.
milestone: SSSD 1.13.3 => SSSD 1.13.4
owner: sgallagh => somebody
status: reopened => new
This will (hopefully) be mitigated by some changes being worked on:
- Simo rewrote the watchdog to be in-process
- the cache writes should be less frequent in 1.14 as well
- Pavel is changing the requests talloc hierarchy
Because of the two above and because we don't have a way to reproduce this problem, I'm marking this bug as minor and moving to a release further away. I would prefer to see if we still have issues after 1.14 changes.
milestone: SSSD 1.13.4 => SSSD 1.13.5
milestone: SSSD 1.13.5 => SSSD 1.15 beta
The watchdog and the DP rewrite make this ticket obsolete in my opinion.
review: 0 => 1
selected: => Not need
Bugs like these shouldn't happen with the new talloc hierarchy of the requests. Please reopen if you can reproduce the issue with 1.14 or newer.
resolution: => worksforme
status: new => closed
Metadata Update from @kieren: - Issue set to the milestone: SSSD Future releases (no date set yet)
SSSD is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in SSSD's github repository.
This issue has been cloned to Github and is available here: - https://github.com/SSSD/sssd/issues/3567
If you want to receive further updates on the issue, please navigate to the github issue and click the subscribe button.
Thank you for understanding. We apologize for any inconvenience.