#1551 sssd_nss process hangs, stuck in loop; "self restart" does recover, but old process hangs around using 100% CPU
Closed: Fixed None Opened 6 years ago by jhrozek.

https://bugzilla.redhat.com/show_bug.cgi?id=814513 (Red Hat Enterprise Linux 5)

Description of problem:

The sssd_nss process, shown in ps as:
/usr/libexec/sssd/sssd_nss -d 0 --debug-to-files

will occasionally hang, using 100% CPU. sssd's "self restart" system
automatically starts a new process, as noted in /var/log/messages:
Apr 18 19:43:10 blog2 sssd[nss]: Starting up

so the system does continue to function normally. However, the old process will
continue to run until killed, using 100% CPU.

Version-Release number of selected component (if applicable):

$ yum list sssd
sssd.i386 1.5.1-49.el5_8.1 installed

$ rpm -qi sssd
Name        : sssd                         Relocations: (not relocatable)
Version     : 1.5.1                             Vendor: CentOS
Release     : 49.el5_8.1                    Build Date: Mon 02 Apr 2012
02:00:34 PM EDT
Install Date: Fri 06 Apr 2012 11:43:52 PM EDT      Build Host:
builder10.centos.org
Group       : Applications/System           Source RPM:
sssd-1.5.1-49.el5_8.1.src.rpm
Size        : 3680458                          License: GPLv3+
Signature   : DSA/SHA1, Mon 02 Apr 2012 02:12:41 PM EDT, Key ID
a8a447dce8562897

How reproducible:

Unknown.

This intermittently, but somewhat regularly, occurs on a particular class of
servers I run. The servers are running several (3 separate) apache httpd
instances and some custom perl scripts. I don't know what exactly the servers
are doing that triggers this bug. The bug is triggered approximately once a
week.

Actual results:

sssd process hangs, stuck in a loop.

Expected results:

sssd doesn't hang.


Additional info:

* nothing in /var/log/sssd log files
* no output from strace
sudo strace -p 19311
Process 19311 attached - interrupt to quit
# nothing more

* attached gdb
sudo yum install gdb
wget
http://debuginfo.centos.org/5/i386/sssd-debuginfo-1.5.1-49.el5_8.1.i386.rpm
wget http://debuginfo.centos.org/5/i386/glibc-debuginfo-2.5-81.el5_8.1.i686.rpm
wget http://debuginfo.centos.org/5/i386/glibc-debuginfo-common-2.5-81.el5_8.1.i
386.rpm
wget
http://debuginfo.centos.org/5/i386/libtevent-debuginfo-0.9.8-10.el5.i386.rpm
sudo rpm -ivh glibc-debuginfo-2.5-81.el5_8.1.i686.rpm
glibc-debuginfo-common-2.5-81.el5_8.1.i386.rpm
libtevent-debuginfo-0.9.8-10.el5.i386.rpm
sssd-debuginfo-1.5.1-49.el5_8.1.i386.rpm

* see attachment for gdb info

Fields changed

blockedby: =>
blocking: =>
coverity: =>
feature_milestone: =>
milestone: NEEDS_TRIAGE => SSSD 1.9.2
testsupdated: => 0

Fields changed

owner: somebody => pbrezina
status: new => assigned

As Daniel found out, the problem here is this line in nss_cmd_retpwent():

n = pdom->res->count - cctx->pwent_cur;

It somehow happens that pwent_cur > res->count, which puts -1 in n when there is no other domain to try. This brings sssd_nss into infinite loop.

We have discussed this with Jakub a lot. We came up with only one possible scenario when this could happen:

  1. setpwent() is called, sssd will fetch all users and store them in memory for enum_cache_timeout seconds
  2. client calls getpwent() several times
  3. client is idle long enough, that users stored in the memory are expired and deleted from the memory
  4. several users are deleted from LDAP
  5. client calls getpwent() again
  6. sssd fetches all users again but the result does not contain deleted users
  7. if enough users where deleted, res->count < pwent_cur

I have managed to reproduce it, but there are so many coincidences that must happen together, that I don't really believe this is the reason why it is happening:

  1. the client must be idle between two getent() calls for at least enum_cache_timeout seconds (default is 2 mins)
  2. enumeration must occur in this time span (default period is 5 mins)
  3. this enumeration must also remove deleted users from cache (default is 12 hours)
  4. at least res->count - pwent_cur - 1 user must be deleted from LDAP

_comment0: As Daniel found out, the problem here is this line in nss_cmd_retpwent():

{{{
n = pdom->res->count - cctx->pwent_cur;
}}}

It somehow happens that ''pwent_cur > res->count'', which puts ''-1'' in ''n'' when there is no other domain to try. This brings sssd_nss into infinite loop.

We have discussed this with Jakub a lot. We came up with only one possible scenario when this could happen:

  1. ''setpwent()'' is called, sssd will fetch all users and store them in memory for enum_cache_timeout seconds
  2. client calls ''getpwent()'' several times
  3. client is idle long enough, that users stored in the memory are expired and deleted from the memory
  4. several users are deleted from LDAP
  5. client calls ''getpwent()'' again
  6. sssd fetches all users again but the result does not contain deleted users
  7. if enough users where deleted, res->count < pwent_cur

I have managed to reproduce it, but there are so many coincidences that must happen together, that I don't really believe this is the reason why it is happening. => 1349783261802014

I would propose the following:

  1. Put additional check into the loop so that we simply abort and error out on the special condition. It's not technically the best solution but it's way better that eating up 100% of CPU in a tight loop.
  2. file a new ticket to track the root cause.
  3. create an instrumented build for the reporter. We are quite lucky the person is technical and willing to test, so he might be able to run a build with extra debugging to give us more information.

Fields changed

patch: 0 => 1

master: 89b93a4

resolution: => fixed
status: assigned => closed

Metadata Update from @jhrozek:
- Issue assigned to pbrezina
- Issue set to the milestone: SSSD 1.9.2

2 years ago

Login to comment on this ticket.

Metadata