Learn more about these different git repos.
Other Git URLs
I have a system with a reproducible problem with sssd when under load.
The sssd.log shows a reoccurring number of messages stating: A service PING timed out on [domain.com]. Attempt 
Followed by: Killing service [expertcity.com], not responding to pings!
Following a restart of sssd, the sssd_be process spikes at 99% cpu, and a delay of 30-60secs can be experienced sshing to the device. Subsequent logins seem fine until whichever cache is effected needs to be renewed again, which in turn reproduces the long delay.
The system is a VM with 2 cores assigned. Load can be anywhere from 4-12 to reproduce the issue.
JR, do you happen to have some kind of debug logs from the sssd_be process? We suspect that the load might be due to the memberof plugin processing group memberships while saving nested group structures, but we need the logs to be completely sure.
By the way, incrasing the internal "timeout" might help you avoid the monitor process kill the back end process. The "timeout" parameter is undocumented, but it specifies the value (in seconds) between heartbeat pings between the monitor and the sssd_be. By default, it is set to 10, you might want to increase to, say, 30.
Hi, did you have a chance to test the increased timeout or gather the logs?
Putting to 1.9.5 for investigation. We will clone as appropriate when/if we know the scope of the problem.
milestone: NEEDS_TRIAGE => SSSD 1.9.5
Today I can reproduce this problem on an 8 core physical host. I will attach the logs, but I think it may be necessary for me to host a GoToMeeting session and just give you Keyboard and mouse to perform GDB debugging...
CPU locks out at 100% on sssd_be for at least 25secs during login any time after a sssd restart.
Worked with sgallagh today:
Turned debugging up to 9 and found what appears to be a 20+ second ltdb operation:
(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Added timed event "ltdb_callback": 0x7d52da0
(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Added timed event "ltdb_timeout": 0x7d52ec0
(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Destroying timer event 0x7cfbc50 "ltdb_timeout"
(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Ending timer event 0x7cfbb30 "ltdb_callback"
(Wed Mar 6 18:45:27 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Destroying timer event 0x7d52ec0 "ltdb_timeout"
(Wed Mar 6 18:45:27 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Ending timer event 0x7d52da0 "ltdb_callback"
Signs seem to point to a memberof operation
Via IRC Simo suggested: only cachegrind comes to mind as an alternative
there is a 20 seconds gap between the time when LDAP lookup of HBAC rules is finished and rules processing is started. I expect that the rules are beeing stored in the cache during this time.
(Wed Mar 6 17:51:38 2013) [sssd[be[expertcity.com]]] [sdap_get_generic_ext_done] (0x0400): Search result: Success(0), no errmsg set
(Wed Mar 6 17:51:58 2013) [sssd[be[expertcity.com]]] [hbac_attrs_to_rule] (0x1000): Processing rule [dev-general]
How many HBAC rules do you have? How large are they?
Can you test whether setting access_provider = permit or ipa_hbac_refresh to some greater value (maybe 30) help?
Do you still have the log from comment # 7? Can you attach it please?
I have 62 HBAC rules. I am not sure what you want to qualify for 'large'
I have hosts in development that are members of multiple HBAC rules...
access_provider = ipa
ipa_hbac_refresh = 1800
I've privately emailed you the cachegrind and callgrind logs.
review: => 0
milestone: SSSD 1.9.5 => NEEDS_TRIAGE
Is there a status update on this ticket? I've not heard anything in quite some time.
Andreas (who maintains Samba) was discussing the issue with other Samba developers as the problem is most likely not in the sssd but the underlying Samba libraries. They suggested gathering performance data for the cache libraries using perf, but that's not available on RHEL5, I'm afraid.
What's available on RHEL5 is Google's perftools, I will create you a build instrumented with the perftools and get back to you.
owner: somebody => jhrozek
patch: 0 => 1
priority: major => critical
status: new => assigned
milestone: NEEDS_TRIAGE => SSSD 1.10.0
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=979045
rhbz: => [https://bugzilla.redhat.com/show_bug.cgi?id=979045 979045]
resolution: => fixed
status: assigned => closed
changelog: => N/A, just a bugfix
Metadata Update from @jraquino:
- Issue assigned to jhrozek
- Issue set to the milestone: SSSD 1.10.0
to comment on this ticket.